taicheng committed on
Commit eb8940d · verified · 1 Parent(s): 8341b40

Model save

README.md ADDED
@@ -0,0 +1,81 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: alignment-handbook/zephyr-7b-sft-full
+ tags:
+ - trl
+ - dpo
+ - generated_from_trainer
+ model-index:
+ - name: zephyr-7b-align-scan-3e-07-0.62-polynomial-3.0
+ results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # zephyr-7b-align-scan-3e-07-0.62-polynomial-3.0
+
+ This model is a fine-tuned version of [alignment-handbook/zephyr-7b-sft-full](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.8766
+ - Rewards/chosen: -0.9523
+ - Rewards/rejected: -2.5666
+ - Rewards/accuracies: 0.3611
+ - Rewards/margins: 1.6143
+ - Logps/rejected: -85.2680
+ - Logps/chosen: -76.0272
+ - Logits/rejected: -2.6367
+ - Logits/chosen: -2.6538
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 3e-07
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 4
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 64
+ - total_eval_batch_size: 32
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: polynomial
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 3
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
+ |:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
+ | 0.6629 | 0.3484 | 100 | 0.6374 | 0.7586 | 0.3997 | 0.3452 | 0.3589 | -80.4837 | -73.2678 | -2.5455 | -2.5615 |
+ | 0.7044 | 0.6969 | 200 | 0.6785 | 0.6115 | 0.1187 | 0.3353 | 0.4927 | -80.9369 | -73.5050 | -2.5325 | -2.5487 |
+ | 0.3945 | 1.0453 | 300 | 0.6975 | 0.7667 | 0.1071 | 0.3552 | 0.6597 | -80.9557 | -73.2546 | -2.5596 | -2.5753 |
+ | 0.3859 | 1.3937 | 400 | 0.7396 | 1.4671 | 0.5658 | 0.3571 | 0.9013 | -80.2158 | -72.1250 | -2.5834 | -2.5995 |
+ | 0.3893 | 1.7422 | 500 | 0.7904 | -0.4771 | -1.4060 | 0.3492 | 0.9290 | -83.3962 | -75.2607 | -2.6499 | -2.6659 |
+ | 0.3749 | 2.0906 | 600 | 0.8125 | 0.5611 | -0.4847 | 0.3631 | 1.0458 | -81.9100 | -73.5862 | -2.6159 | -2.6321 |
+ | 0.3662 | 2.4390 | 700 | 0.8412 | -0.6104 | -2.0869 | 0.3651 | 1.4765 | -84.4944 | -75.4757 | -2.5941 | -2.6112 |
+ | 0.3615 | 2.7875 | 800 | 0.8766 | -0.9523 | -2.5666 | 0.3611 | 1.6143 | -85.2680 | -76.0272 | -2.6367 | -2.6538 |
+
+
+ ### Framework versions
+
+ - Transformers 4.44.2
+ - Pytorch 2.4.0
+ - Datasets 2.21.0
+ - Tokenizers 0.19.1
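Two of the numbers in the model card above are derived from the others and can be sanity-checked directly: the total train batch size follows from per-device batch size × devices × gradient-accumulation steps, and the DPO reward margin is the chosen reward minus the rejected reward. A quick check (plain arithmetic, no assumptions beyond the card's own values):

```python
# Effective train batch size = per-device batch * num devices * grad accumulation.
per_device_batch = 8
num_devices = 4
grad_accum_steps = 2
effective_batch = per_device_batch * num_devices * grad_accum_steps
print(effective_batch)  # 64, matching total_train_batch_size

# DPO reward margin = reward(chosen) - reward(rejected), from the eval metrics.
rewards_chosen = -0.9523
rewards_rejected = -2.5666
margin = round(rewards_chosen - rewards_rejected, 4)
print(margin)  # 1.6143, matching Rewards/margins
```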
all_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "epoch": 3.0,
+ "total_flos": 0.0,
+ "train_loss": 0.4762609164889266,
+ "train_runtime": 9670.9508,
+ "train_samples": 18340,
+ "train_samples_per_second": 5.689,
+ "train_steps_per_second": 0.089
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "transformers_version": "4.44.2"
+ }
train_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "epoch": 3.0,
+ "total_flos": 0.0,
+ "train_loss": 0.4762609164889266,
+ "train_runtime": 9670.9508,
+ "train_samples": 18340,
+ "train_samples_per_second": 5.689,
+ "train_steps_per_second": 0.089
+ }
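The throughput fields in train_results.json are themselves derived: samples per second is train_samples × epochs over the runtime, and steps per second is the global step count (861, per trainer_state.json) over the runtime. A quick reproduction:

```python
# Reproduce the derived throughput fields from train_results.json.
train_samples = 18340
epochs = 3.0
train_runtime = 9670.9508  # seconds
global_step = 861          # from trainer_state.json

samples_per_second = round(train_samples * epochs / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 3)
print(samples_per_second)  # 5.689
print(steps_per_second)    # 0.089
```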
trainer_state.json ADDED
@@ -0,0 +1,1475 @@
+ {
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 3.0,
+ "eval_steps": 100,
+ "global_step": 861,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.003484320557491289,
+ "grad_norm": 358.6503200029449,
+ "learning_rate": 3.4482758620689654e-09,
+ "logits/chosen": -2.5345611572265625,
+ "logits/rejected": -2.581700563430786,
+ "logps/chosen": -60.002105712890625,
+ "logps/rejected": -99.98374938964844,
+ "loss": 0.6931,
+ "rewards/accuracies": 0.0,
+ "rewards/chosen": 0.0,
+ "rewards/margins": 0.0,
+ "rewards/rejected": 0.0,
+ "step": 1
+ },
+ {
+ "epoch": 0.03484320557491289,
+ "grad_norm": 335.50819535600175,
+ "learning_rate": 3.448275862068965e-08,
+ "logits/chosen": -2.5634875297546387,
+ "logits/rejected": -2.562131881713867,
+ "logps/chosen": -59.66706085205078,
+ "logps/rejected": -73.39751434326172,
+ "loss": 0.6938,
+ "rewards/accuracies": 0.2222222238779068,
+ "rewards/chosen": -0.00458882749080658,
+ "rewards/margins": 0.005699412431567907,
+ "rewards/rejected": -0.010288238525390625,
+ "step": 10
+ },
+ {
+ "epoch": 0.06968641114982578,
+ "grad_norm": 431.2142268974989,
+ "learning_rate": 6.89655172413793e-08,
+ "logits/chosen": -2.6049625873565674,
+ "logits/rejected": -2.563797950744629,
+ "logps/chosen": -104.11083984375,
+ "logps/rejected": -94.92049407958984,
+ "loss": 0.6946,
+ "rewards/accuracies": 0.29374998807907104,
+ "rewards/chosen": 0.010060709901154041,
+ "rewards/margins": 0.026405800133943558,
+ "rewards/rejected": -0.016345087438821793,
+ "step": 20
+ },
+ {
+ "epoch": 0.10452961672473868,
+ "grad_norm": 412.06640261613535,
+ "learning_rate": 1.0344827586206897e-07,
+ "logits/chosen": -2.591465711593628,
+ "logits/rejected": -2.5714292526245117,
+ "logps/chosen": -82.47346496582031,
+ "logps/rejected": -91.5473403930664,
+ "loss": 0.6905,
+ "rewards/accuracies": 0.2750000059604645,
+ "rewards/chosen": 0.014319317415356636,
+ "rewards/margins": 0.018754545599222183,
+ "rewards/rejected": -0.0044352286495268345,
+ "step": 30
+ },
+ {
+ "epoch": 0.13937282229965156,
+ "grad_norm": 362.45244079994615,
+ "learning_rate": 1.379310344827586e-07,
+ "logits/chosen": -2.498384475708008,
+ "logits/rejected": -2.496023654937744,
+ "logps/chosen": -77.94225311279297,
+ "logps/rejected": -73.06587219238281,
+ "loss": 0.679,
+ "rewards/accuracies": 0.23125000298023224,
+ "rewards/chosen": -0.01477863360196352,
+ "rewards/margins": 0.02524409256875515,
+ "rewards/rejected": -0.040022727102041245,
+ "step": 40
+ },
+ {
+ "epoch": 0.17421602787456447,
+ "grad_norm": 320.1865662448919,
+ "learning_rate": 1.7241379310344825e-07,
+ "logits/chosen": -2.533860683441162,
+ "logits/rejected": -2.537620782852173,
+ "logps/chosen": -63.8255729675293,
+ "logps/rejected": -76.09891510009766,
+ "loss": 0.6708,
+ "rewards/accuracies": 0.2562499940395355,
+ "rewards/chosen": 0.0704144611954689,
+ "rewards/margins": 0.06760600954294205,
+ "rewards/rejected": 0.0028084414079785347,
+ "step": 50
+ },
+ {
+ "epoch": 0.20905923344947736,
+ "grad_norm": 315.3790948714663,
+ "learning_rate": 2.0689655172413793e-07,
+ "logits/chosen": -2.5066819190979004,
+ "logits/rejected": -2.500108480453491,
+ "logps/chosen": -72.56768798828125,
+ "logps/rejected": -67.95696258544922,
+ "loss": 0.6664,
+ "rewards/accuracies": 0.34375,
+ "rewards/chosen": 0.3960721492767334,
+ "rewards/margins": 0.11608312278985977,
+ "rewards/rejected": 0.2799890339374542,
+ "step": 60
+ },
+ {
+ "epoch": 0.24390243902439024,
+ "grad_norm": 311.35919136752943,
+ "learning_rate": 2.413793103448276e-07,
+ "logits/chosen": -2.529073715209961,
+ "logits/rejected": -2.5245697498321533,
+ "logps/chosen": -63.03422164916992,
+ "logps/rejected": -67.55048370361328,
+ "loss": 0.6571,
+ "rewards/accuracies": 0.30000001192092896,
+ "rewards/chosen": 0.7452259063720703,
+ "rewards/margins": 0.20201978087425232,
+ "rewards/rejected": 0.5432060956954956,
+ "step": 70
+ },
+ {
+ "epoch": 0.2787456445993031,
+ "grad_norm": 380.6379547994258,
+ "learning_rate": 2.758620689655172e-07,
+ "logits/chosen": -2.4771764278411865,
+ "logits/rejected": -2.4673564434051514,
+ "logps/chosen": -74.3353271484375,
+ "logps/rejected": -76.72254943847656,
+ "loss": 0.6584,
+ "rewards/accuracies": 0.3125,
+ "rewards/chosen": 0.8243219256401062,
+ "rewards/margins": 0.3330293893814087,
+ "rewards/rejected": 0.49129247665405273,
+ "step": 80
+ },
+ {
+ "epoch": 0.313588850174216,
+ "grad_norm": 290.2304014635751,
+ "learning_rate": 2.9922480620155034e-07,
+ "logits/chosen": -2.4918248653411865,
+ "logits/rejected": -2.505871295928955,
+ "logps/chosen": -65.13265228271484,
+ "logps/rejected": -69.66169738769531,
+ "loss": 0.6595,
+ "rewards/accuracies": 0.29374998807907104,
+ "rewards/chosen": 0.5267654657363892,
+ "rewards/margins": 0.24630899727344513,
+ "rewards/rejected": 0.28045645356178284,
+ "step": 90
+ },
+ {
+ "epoch": 0.34843205574912894,
+ "grad_norm": 362.13853654937515,
+ "learning_rate": 2.96640826873385e-07,
+ "logits/chosen": -2.475275754928589,
+ "logits/rejected": -2.4764583110809326,
+ "logps/chosen": -74.12995910644531,
+ "logps/rejected": -80.4610366821289,
+ "loss": 0.6629,
+ "rewards/accuracies": 0.3125,
+ "rewards/chosen": 0.7075541019439697,
+ "rewards/margins": 0.3413715958595276,
+ "rewards/rejected": 0.36618250608444214,
+ "step": 100
+ },
+ {
+ "epoch": 0.34843205574912894,
+ "eval_logits/chosen": -2.5614964962005615,
+ "eval_logits/rejected": -2.545496940612793,
+ "eval_logps/chosen": -73.26775360107422,
+ "eval_logps/rejected": -80.48373413085938,
+ "eval_loss": 0.6374358534812927,
+ "eval_rewards/accuracies": 0.3452380895614624,
+ "eval_rewards/chosen": 0.7585535645484924,
+ "eval_rewards/margins": 0.3588891923427582,
+ "eval_rewards/rejected": 0.39966443181037903,
+ "eval_runtime": 113.5456,
+ "eval_samples_per_second": 17.614,
+ "eval_steps_per_second": 0.555,
+ "step": 100
+ },
+ {
+ "epoch": 0.3832752613240418,
+ "grad_norm": 402.8431072761315,
+ "learning_rate": 2.940568475452196e-07,
+ "logits/chosen": -2.5034003257751465,
+ "logits/rejected": -2.465451717376709,
+ "logps/chosen": -72.49577331542969,
+ "logps/rejected": -62.814613342285156,
+ "loss": 0.6492,
+ "rewards/accuracies": 0.28125,
+ "rewards/chosen": 0.41803866624832153,
+ "rewards/margins": 0.29653844237327576,
+ "rewards/rejected": 0.12150021642446518,
+ "step": 110
+ },
+ {
+ "epoch": 0.4181184668989547,
+ "grad_norm": 276.5279613655658,
+ "learning_rate": 2.9147286821705423e-07,
+ "logits/chosen": -2.528160333633423,
+ "logits/rejected": -2.497156858444214,
+ "logps/chosen": -76.75882720947266,
+ "logps/rejected": -66.54231262207031,
+ "loss": 0.6394,
+ "rewards/accuracies": 0.3187499940395355,
+ "rewards/chosen": 0.4357032775878906,
+ "rewards/margins": 0.3555503487586975,
+ "rewards/rejected": 0.08015286922454834,
+ "step": 120
+ },
+ {
+ "epoch": 0.4529616724738676,
+ "grad_norm": 438.9130471992716,
+ "learning_rate": 2.888888888888889e-07,
+ "logits/chosen": -2.5648016929626465,
+ "logits/rejected": -2.5454888343811035,
+ "logps/chosen": -83.2737808227539,
+ "logps/rejected": -87.85270690917969,
+ "loss": 0.6666,
+ "rewards/accuracies": 0.36250001192092896,
+ "rewards/chosen": 0.3763393759727478,
+ "rewards/margins": 0.5897358655929565,
+ "rewards/rejected": -0.21339651942253113,
+ "step": 130
+ },
+ {
+ "epoch": 0.4878048780487805,
+ "grad_norm": 322.57184174769804,
+ "learning_rate": 2.863049095607235e-07,
+ "logits/chosen": -2.4603538513183594,
+ "logits/rejected": -2.4488110542297363,
+ "logps/chosen": -80.25636291503906,
+ "logps/rejected": -70.93235778808594,
+ "loss": 0.6249,
+ "rewards/accuracies": 0.375,
+ "rewards/chosen": 0.5512245893478394,
+ "rewards/margins": 0.6554535627365112,
+ "rewards/rejected": -0.10422901809215546,
+ "step": 140
+ },
+ {
+ "epoch": 0.5226480836236934,
+ "grad_norm": 359.54734158147613,
+ "learning_rate": 2.837209302325581e-07,
+ "logits/chosen": -2.5226199626922607,
+ "logits/rejected": -2.4783759117126465,
+ "logps/chosen": -78.67804718017578,
+ "logps/rejected": -79.57527160644531,
+ "loss": 0.6633,
+ "rewards/accuracies": 0.3187499940395355,
+ "rewards/chosen": 0.4999012351036072,
+ "rewards/margins": 0.5150824785232544,
+ "rewards/rejected": -0.01518118567764759,
+ "step": 150
+ },
+ {
+ "epoch": 0.5574912891986062,
+ "grad_norm": 312.3603092848705,
+ "learning_rate": 2.811369509043928e-07,
+ "logits/chosen": -2.486689805984497,
+ "logits/rejected": -2.507362127304077,
+ "logps/chosen": -64.26698303222656,
+ "logps/rejected": -72.27865600585938,
+ "loss": 0.6722,
+ "rewards/accuracies": 0.26249998807907104,
+ "rewards/chosen": 0.15987662971019745,
+ "rewards/margins": 0.3494376838207245,
+ "rewards/rejected": -0.18956105411052704,
+ "step": 160
+ },
+ {
+ "epoch": 0.5923344947735192,
+ "grad_norm": 391.74254314560164,
+ "learning_rate": 2.7855297157622735e-07,
+ "logits/chosen": -2.4955577850341797,
+ "logits/rejected": -2.48069167137146,
+ "logps/chosen": -69.2925796508789,
+ "logps/rejected": -77.45768737792969,
+ "loss": 0.6399,
+ "rewards/accuracies": 0.3125,
+ "rewards/chosen": -0.20161184668540955,
+ "rewards/margins": 0.5074432492256165,
+ "rewards/rejected": -0.7090551257133484,
+ "step": 170
+ },
+ {
+ "epoch": 0.627177700348432,
+ "grad_norm": 354.9896897534272,
+ "learning_rate": 2.75968992248062e-07,
+ "logits/chosen": -2.545736789703369,
+ "logits/rejected": -2.534653663635254,
+ "logps/chosen": -91.68567657470703,
+ "logps/rejected": -86.991943359375,
+ "loss": 0.6933,
+ "rewards/accuracies": 0.32499998807907104,
+ "rewards/chosen": 0.1063336730003357,
+ "rewards/margins": 0.40485674142837524,
+ "rewards/rejected": -0.29852309823036194,
+ "step": 180
+ },
+ {
+ "epoch": 0.662020905923345,
+ "grad_norm": 251.86613571396995,
+ "learning_rate": 2.733850129198967e-07,
+ "logits/chosen": -2.5427966117858887,
+ "logits/rejected": -2.5337371826171875,
+ "logps/chosen": -70.1039810180664,
+ "logps/rejected": -81.1055908203125,
+ "loss": 0.6537,
+ "rewards/accuracies": 0.3187499940395355,
+ "rewards/chosen": 0.587446928024292,
+ "rewards/margins": 0.3254398703575134,
+ "rewards/rejected": 0.26200705766677856,
+ "step": 190
+ },
+ {
+ "epoch": 0.6968641114982579,
+ "grad_norm": 448.52473154376446,
+ "learning_rate": 2.7080103359173124e-07,
+ "logits/chosen": -2.5728795528411865,
+ "logits/rejected": -2.580472469329834,
+ "logps/chosen": -88.97058868408203,
+ "logps/rejected": -91.36177062988281,
+ "loss": 0.7044,
+ "rewards/accuracies": 0.35624998807907104,
+ "rewards/chosen": 1.0451823472976685,
+ "rewards/margins": 0.6180461645126343,
+ "rewards/rejected": 0.42713627219200134,
+ "step": 200
+ },
+ {
+ "epoch": 0.6968641114982579,
+ "eval_logits/chosen": -2.548659324645996,
+ "eval_logits/rejected": -2.5324509143829346,
+ "eval_logps/chosen": -73.5050048828125,
+ "eval_logps/rejected": -80.9368667602539,
+ "eval_loss": 0.6785250902175903,
+ "eval_rewards/accuracies": 0.335317462682724,
+ "eval_rewards/chosen": 0.6114616394042969,
+ "eval_rewards/margins": 0.4927373230457306,
+ "eval_rewards/rejected": 0.11872433125972748,
+ "eval_runtime": 113.5716,
+ "eval_samples_per_second": 17.61,
+ "eval_steps_per_second": 0.555,
+ "step": 200
+ },
+ {
+ "epoch": 0.7317073170731707,
+ "grad_norm": 539.9357015144649,
+ "learning_rate": 2.682170542635659e-07,
+ "logits/chosen": -2.5559327602386475,
+ "logits/rejected": -2.529869794845581,
+ "logps/chosen": -68.37545013427734,
+ "logps/rejected": -63.614356994628906,
+ "loss": 0.7042,
+ "rewards/accuracies": 0.3687500059604645,
+ "rewards/chosen": 0.58504319190979,
+ "rewards/margins": 0.7152014970779419,
+ "rewards/rejected": -0.13015827536582947,
+ "step": 210
+ },
+ {
+ "epoch": 0.7665505226480837,
+ "grad_norm": 286.2044305000258,
+ "learning_rate": 2.6563307493540046e-07,
+ "logits/chosen": -2.591370105743408,
+ "logits/rejected": -2.5699965953826904,
+ "logps/chosen": -72.25666809082031,
+ "logps/rejected": -70.89804077148438,
+ "loss": 0.6966,
+ "rewards/accuracies": 0.25,
+ "rewards/chosen": 0.74493408203125,
+ "rewards/margins": 0.2705448865890503,
+ "rewards/rejected": 0.4743892252445221,
+ "step": 220
+ },
+ {
+ "epoch": 0.8013937282229965,
+ "grad_norm": 398.39083722167163,
+ "learning_rate": 2.6304909560723513e-07,
+ "logits/chosen": -2.5968384742736816,
+ "logits/rejected": -2.574525833129883,
+ "logps/chosen": -88.1639633178711,
+ "logps/rejected": -87.76924133300781,
+ "loss": 0.685,
+ "rewards/accuracies": 0.38749998807907104,
+ "rewards/chosen": 1.1387383937835693,
+ "rewards/margins": 0.9033845663070679,
+ "rewards/rejected": 0.23535379767417908,
+ "step": 230
+ },
+ {
+ "epoch": 0.8362369337979094,
+ "grad_norm": 339.80066659715754,
+ "learning_rate": 2.6046511627906974e-07,
+ "logits/chosen": -2.589068651199341,
+ "logits/rejected": -2.5533642768859863,
+ "logps/chosen": -85.19193267822266,
+ "logps/rejected": -79.42808532714844,
+ "loss": 0.6884,
+ "rewards/accuracies": 0.3687500059604645,
+ "rewards/chosen": 0.9607963562011719,
+ "rewards/margins": 0.5218255519866943,
+ "rewards/rejected": 0.4389708638191223,
+ "step": 240
+ },
+ {
+ "epoch": 0.8710801393728222,
+ "grad_norm": 341.3674177311404,
+ "learning_rate": 2.5788113695090435e-07,
+ "logits/chosen": -2.6001765727996826,
+ "logits/rejected": -2.56406831741333,
+ "logps/chosen": -94.24848937988281,
+ "logps/rejected": -89.58869934082031,
+ "loss": 0.6149,
+ "rewards/accuracies": 0.38749998807907104,
+ "rewards/chosen": 0.7444799542427063,
+ "rewards/margins": 0.5228375196456909,
+ "rewards/rejected": 0.2216424196958542,
+ "step": 250
+ },
+ {
+ "epoch": 0.9059233449477352,
+ "grad_norm": 283.98484381974555,
+ "learning_rate": 2.55297157622739e-07,
+ "logits/chosen": -2.506063461303711,
+ "logits/rejected": -2.5218896865844727,
+ "logps/chosen": -57.92702102661133,
+ "logps/rejected": -65.19326782226562,
+ "loss": 0.682,
+ "rewards/accuracies": 0.29374998807907104,
+ "rewards/chosen": 0.6958502531051636,
+ "rewards/margins": 0.4261382520198822,
+ "rewards/rejected": 0.26971206068992615,
+ "step": 260
+ },
+ {
+ "epoch": 0.9407665505226481,
+ "grad_norm": 372.62325379603294,
+ "learning_rate": 2.5271317829457363e-07,
+ "logits/chosen": -2.6025278568267822,
+ "logits/rejected": -2.6028153896331787,
+ "logps/chosen": -67.53540802001953,
+ "logps/rejected": -82.33738708496094,
+ "loss": 0.6525,
+ "rewards/accuracies": 0.32499998807907104,
+ "rewards/chosen": 0.8337510228157043,
+ "rewards/margins": 0.713237464427948,
+ "rewards/rejected": 0.12051346153020859,
+ "step": 270
+ },
+ {
+ "epoch": 0.975609756097561,
+ "grad_norm": 314.12151123620725,
+ "learning_rate": 2.5012919896640824e-07,
+ "logits/chosen": -2.5111606121063232,
+ "logits/rejected": -2.487565517425537,
+ "logps/chosen": -66.56291198730469,
+ "logps/rejected": -70.64192199707031,
+ "loss": 0.6164,
+ "rewards/accuracies": 0.3375000059604645,
+ "rewards/chosen": 0.7883111238479614,
+ "rewards/margins": 0.668962836265564,
+ "rewards/rejected": 0.11934838443994522,
+ "step": 280
+ },
+ {
+ "epoch": 1.0104529616724738,
+ "grad_norm": 72.54916768701123,
+ "learning_rate": 2.475452196382429e-07,
+ "logits/chosen": -2.524649143218994,
+ "logits/rejected": -2.4943947792053223,
+ "logps/chosen": -69.41844177246094,
+ "logps/rejected": -65.09283447265625,
+ "loss": 0.5485,
+ "rewards/accuracies": 0.38749998807907104,
+ "rewards/chosen": 1.642530083656311,
+ "rewards/margins": 1.75347900390625,
+ "rewards/rejected": -0.11094935238361359,
+ "step": 290
+ },
+ {
+ "epoch": 1.0452961672473868,
+ "grad_norm": 25.060931202079587,
+ "learning_rate": 2.4496124031007747e-07,
+ "logits/chosen": -2.5531129837036133,
+ "logits/rejected": -2.5395359992980957,
+ "logps/chosen": -62.507972717285156,
+ "logps/rejected": -72.49549865722656,
+ "loss": 0.3945,
+ "rewards/accuracies": 0.42500001192092896,
+ "rewards/chosen": 2.5053844451904297,
+ "rewards/margins": 4.567110061645508,
+ "rewards/rejected": -2.06172513961792,
+ "step": 300
+ },
+ {
+ "epoch": 1.0452961672473868,
+ "eval_logits/chosen": -2.575260877609253,
+ "eval_logits/rejected": -2.559643507003784,
+ "eval_logps/chosen": -73.25456237792969,
+ "eval_logps/rejected": -80.9556655883789,
+ "eval_loss": 0.6974567770957947,
+ "eval_rewards/accuracies": 0.3551587164402008,
+ "eval_rewards/chosen": 0.7667317986488342,
+ "eval_rewards/margins": 0.6596661806106567,
+ "eval_rewards/rejected": 0.10706562548875809,
+ "eval_runtime": 113.4908,
+ "eval_samples_per_second": 17.623,
+ "eval_steps_per_second": 0.555,
+ "step": 300
+ },
+ {
+ "epoch": 1.0801393728222997,
+ "grad_norm": 19.10824587006278,
+ "learning_rate": 2.4237726098191214e-07,
+ "logits/chosen": -2.530690908432007,
+ "logits/rejected": -2.5344767570495605,
+ "logps/chosen": -63.4401969909668,
+ "logps/rejected": -79.88385772705078,
+ "loss": 0.4065,
+ "rewards/accuracies": 0.4375,
+ "rewards/chosen": 2.601799488067627,
+ "rewards/margins": 5.476699352264404,
+ "rewards/rejected": -2.8749003410339355,
+ "step": 310
+ },
+ {
+ "epoch": 1.1149825783972125,
+ "grad_norm": 13.055135863878375,
+ "learning_rate": 2.397932816537468e-07,
+ "logits/chosen": -2.583162546157837,
+ "logits/rejected": -2.5704259872436523,
+ "logps/chosen": -70.44621276855469,
+ "logps/rejected": -82.24694061279297,
+ "loss": 0.3998,
+ "rewards/accuracies": 0.4749999940395355,
+ "rewards/chosen": 2.1784868240356445,
+ "rewards/margins": 4.678074359893799,
+ "rewards/rejected": -2.499587059020996,
+ "step": 320
+ },
+ {
+ "epoch": 1.1498257839721253,
+ "grad_norm": 73.70458680382093,
+ "learning_rate": 2.3720930232558136e-07,
+ "logits/chosen": -2.574796438217163,
+ "logits/rejected": -2.550279140472412,
+ "logps/chosen": -79.08229064941406,
+ "logps/rejected": -82.61952209472656,
+ "loss": 0.3644,
+ "rewards/accuracies": 0.53125,
+ "rewards/chosen": 3.124291181564331,
+ "rewards/margins": 5.5998125076293945,
+ "rewards/rejected": -2.4755213260650635,
+ "step": 330
+ },
+ {
+ "epoch": 1.1846689895470384,
+ "grad_norm": 28.850587508009358,
+ "learning_rate": 2.34625322997416e-07,
+ "logits/chosen": -2.557753562927246,
+ "logits/rejected": -2.5632052421569824,
+ "logps/chosen": -78.73472595214844,
+ "logps/rejected": -99.98568725585938,
+ "loss": 0.389,
+ "rewards/accuracies": 0.5375000238418579,
+ "rewards/chosen": 3.1080851554870605,
+ "rewards/margins": 6.023845672607422,
+ "rewards/rejected": -2.9157605171203613,
+ "step": 340
+ },
+ {
+ "epoch": 1.2195121951219512,
+ "grad_norm": 85.64045808230581,
+ "learning_rate": 2.3204134366925064e-07,
+ "logits/chosen": -2.5850272178649902,
+ "logits/rejected": -2.551978588104248,
+ "logps/chosen": -64.38944244384766,
+ "logps/rejected": -68.43348693847656,
+ "loss": 0.3897,
+ "rewards/accuracies": 0.4749999940395355,
+ "rewards/chosen": 2.608705759048462,
+ "rewards/margins": 4.316413402557373,
+ "rewards/rejected": -1.7077077627182007,
+ "step": 350
+ },
+ {
+ "epoch": 1.254355400696864,
+ "grad_norm": 52.17293262552292,
+ "learning_rate": 2.2945736434108528e-07,
+ "logits/chosen": -2.5734505653381348,
+ "logits/rejected": -2.5436837673187256,
+ "logps/chosen": -67.31871032714844,
+ "logps/rejected": -67.99309539794922,
+ "loss": 0.3729,
+ "rewards/accuracies": 0.44999998807907104,
+ "rewards/chosen": 2.8094611167907715,
+ "rewards/margins": 4.253398895263672,
+ "rewards/rejected": -1.44393789768219,
+ "step": 360
+ },
+ {
+ "epoch": 1.289198606271777,
+ "grad_norm": 73.55818831421092,
+ "learning_rate": 2.268733850129199e-07,
+ "logits/chosen": -2.5357134342193604,
+ "logits/rejected": -2.556293487548828,
+ "logps/chosen": -65.90876770019531,
+ "logps/rejected": -77.08678436279297,
+ "loss": 0.4065,
+ "rewards/accuracies": 0.44999998807907104,
+ "rewards/chosen": 3.1975178718566895,
+ "rewards/margins": 4.904288291931152,
+ "rewards/rejected": -1.7067703008651733,
+ "step": 370
+ },
+ {
+ "epoch": 1.32404181184669,
+ "grad_norm": 63.56754205504431,
+ "learning_rate": 2.2428940568475453e-07,
+ "logits/chosen": -2.543696880340576,
+ "logits/rejected": -2.542275905609131,
+ "logps/chosen": -79.8168716430664,
+ "logps/rejected": -90.68196105957031,
+ "loss": 0.383,
+ "rewards/accuracies": 0.550000011920929,
+ "rewards/chosen": 4.3757524490356445,
+ "rewards/margins": 6.989476680755615,
+ "rewards/rejected": -2.61372447013855,
+ "step": 380
+ },
+ {
+ "epoch": 1.3588850174216027,
+ "grad_norm": 84.5304582436996,
+ "learning_rate": 2.2170542635658914e-07,
+ "logits/chosen": -2.616854190826416,
+ "logits/rejected": -2.6028048992156982,
+ "logps/chosen": -62.09540939331055,
+ "logps/rejected": -74.197265625,
+ "loss": 0.3743,
+ "rewards/accuracies": 0.4437499940395355,
+ "rewards/chosen": 3.0379724502563477,
+ "rewards/margins": 4.858659744262695,
+ "rewards/rejected": -1.8206875324249268,
+ "step": 390
+ },
+ {
+ "epoch": 1.3937282229965158,
+ "grad_norm": 45.47639364698017,
+ "learning_rate": 2.1912144702842375e-07,
+ "logits/chosen": -2.5946788787841797,
+ "logits/rejected": -2.564770221710205,
+ "logps/chosen": -80.40516662597656,
+ "logps/rejected": -101.68209075927734,
+ "loss": 0.3859,
+ "rewards/accuracies": 0.5249999761581421,
+ "rewards/chosen": 3.559967041015625,
+ "rewards/margins": 5.592529296875,
+ "rewards/rejected": -2.032562732696533,
+ "step": 400
+ },
+ {
+ "epoch": 1.3937282229965158,
+ "eval_logits/chosen": -2.5995094776153564,
+ "eval_logits/rejected": -2.5833675861358643,
+ "eval_logps/chosen": -72.12498474121094,
+ "eval_logps/rejected": -80.21580505371094,
+ "eval_loss": 0.7395845651626587,
+ "eval_rewards/accuracies": 0.3571428656578064,
+ "eval_rewards/chosen": 1.4670709371566772,
+ "eval_rewards/margins": 0.9012959599494934,
+ "eval_rewards/rejected": 0.5657750368118286,
+ "eval_runtime": 113.4528,
+ "eval_samples_per_second": 17.628,
+ "eval_steps_per_second": 0.555,
+ "step": 400
+ },
+ {
+ "epoch": 1.4285714285714286,
+ "grad_norm": 9.035779538567592,
+ "learning_rate": 2.1653746770025842e-07,
+ "logits/chosen": -2.587205410003662,
+ "logits/rejected": -2.577908515930176,
+ "logps/chosen": -76.38627624511719,
+ "logps/rejected": -81.7760009765625,
+ "loss": 0.3719,
+ "rewards/accuracies": 0.4749999940395355,
+ "rewards/chosen": 3.8607609272003174,
+ "rewards/margins": 4.7285661697387695,
+ "rewards/rejected": -0.8678053021430969,
+ "step": 410
+ },
+ {
+ "epoch": 1.4634146341463414,
+ "grad_norm": 31.70641333825559,
+ "learning_rate": 2.1395348837209303e-07,
+ "logits/chosen": -2.637145519256592,
+ "logits/rejected": -2.637343168258667,
+ "logps/chosen": -69.94471740722656,
+ "logps/rejected": -86.88435363769531,
+ "loss": 0.3975,
+ "rewards/accuracies": 0.4375,
+ "rewards/chosen": 3.812042236328125,
+ "rewards/margins": 5.059989929199219,
+ "rewards/rejected": -1.2479479312896729,
+ "step": 420
+ },
+ {
+ "epoch": 1.4982578397212545,
+ "grad_norm": 65.50905056120632,
+ "learning_rate": 2.1136950904392762e-07,
+ "logits/chosen": -2.594949960708618,
+ "logits/rejected": -2.582991123199463,
+ "logps/chosen": -62.43513870239258,
+ "logps/rejected": -74.3114013671875,
+ "loss": 0.3952,
+ "rewards/accuracies": 0.46875,
+ "rewards/chosen": 3.8608689308166504,
+ "rewards/margins": 4.990710258483887,
+ "rewards/rejected": -1.1298413276672363,
+ "step": 430
+ },
+ {
+ "epoch": 1.533101045296167,
+ "grad_norm": 103.25920597551605,
+ "learning_rate": 2.0878552971576226e-07,
+ "logits/chosen": -2.587477207183838,
+ "logits/rejected": -2.596736192703247,
+ "logps/chosen": -64.37332916259766,
+ "logps/rejected": -75.4191665649414,
+ "loss": 0.401,
+ "rewards/accuracies": 0.4312500059604645,
+ "rewards/chosen": 3.475306987762451,
+ "rewards/margins": 4.665290832519531,
+ "rewards/rejected": -1.1899840831756592,
+ "step": 440
+ },
+ {
+ "epoch": 1.5679442508710801,
+ "grad_norm": 46.90858224938077,
+ "learning_rate": 2.062015503875969e-07,
754
+ "logits/chosen": -2.641456127166748,
755
+ "logits/rejected": -2.6066718101501465,
756
+ "logps/chosen": -84.70106506347656,
757
+ "logps/rejected": -90.63847351074219,
758
+ "loss": 0.3861,
759
+ "rewards/accuracies": 0.518750011920929,
760
+ "rewards/chosen": 3.8073742389678955,
761
+ "rewards/margins": 6.706219673156738,
762
+ "rewards/rejected": -2.8988451957702637,
763
+ "step": 450
764
+ },
765
+ {
766
+ "epoch": 1.6027874564459932,
767
+ "grad_norm": 73.13862861246201,
768
+ "learning_rate": 2.0361757105943153e-07,
769
+ "logits/chosen": -2.6171023845672607,
770
+ "logits/rejected": -2.5963072776794434,
771
+ "logps/chosen": -70.6004867553711,
772
+ "logps/rejected": -81.99185943603516,
773
+ "loss": 0.4031,
774
+ "rewards/accuracies": 0.5,
775
+ "rewards/chosen": 3.4589905738830566,
776
+ "rewards/margins": 6.551025390625,
777
+ "rewards/rejected": -3.0920345783233643,
778
+ "step": 460
779
+ },
780
+ {
781
+ "epoch": 1.6376306620209058,
782
+ "grad_norm": 55.407817129230025,
783
+ "learning_rate": 2.0103359173126615e-07,
784
+ "logits/chosen": -2.617323637008667,
785
+ "logits/rejected": -2.6086113452911377,
786
+ "logps/chosen": -57.063804626464844,
787
+ "logps/rejected": -75.29525756835938,
788
+ "loss": 0.4097,
789
+ "rewards/accuracies": 0.38749998807907104,
790
+ "rewards/chosen": 2.6279244422912598,
791
+ "rewards/margins": 4.8750505447387695,
792
+ "rewards/rejected": -2.247126817703247,
793
+ "step": 470
794
+ },
795
+ {
796
+ "epoch": 1.6724738675958188,
797
+ "grad_norm": 95.66039903353,
798
+ "learning_rate": 1.9844961240310078e-07,
799
+ "logits/chosen": -2.636000156402588,
800
+ "logits/rejected": -2.619654893875122,
801
+ "logps/chosen": -50.774810791015625,
802
+ "logps/rejected": -55.16005325317383,
803
+ "loss": 0.4027,
804
+ "rewards/accuracies": 0.3499999940395355,
805
+ "rewards/chosen": 2.037757158279419,
806
+ "rewards/margins": 4.008673667907715,
807
+ "rewards/rejected": -1.970916986465454,
808
+ "step": 480
809
+ },
810
+ {
811
+ "epoch": 1.7073170731707317,
812
+ "grad_norm": 194.80847235958515,
813
+ "learning_rate": 1.958656330749354e-07,
814
+ "logits/chosen": -2.6303064823150635,
815
+ "logits/rejected": -2.617170810699463,
816
+ "logps/chosen": -70.10054779052734,
817
+ "logps/rejected": -74.9827880859375,
818
+ "loss": 0.4809,
819
+ "rewards/accuracies": 0.38749998807907104,
820
+ "rewards/chosen": 1.9351723194122314,
821
+ "rewards/margins": 4.985955715179443,
822
+ "rewards/rejected": -3.05078387260437,
823
+ "step": 490
824
+ },
825
+ {
826
+ "epoch": 1.7421602787456445,
827
+ "grad_norm": 134.98085943724894,
828
+ "learning_rate": 1.9328165374677e-07,
829
+ "logits/chosen": -2.5582988262176514,
830
+ "logits/rejected": -2.551264524459839,
831
+ "logps/chosen": -73.02742004394531,
832
+ "logps/rejected": -89.30897521972656,
833
+ "loss": 0.3893,
834
+ "rewards/accuracies": 0.46875,
835
+ "rewards/chosen": 2.0215065479278564,
836
+ "rewards/margins": 5.4846272468566895,
837
+ "rewards/rejected": -3.463120222091675,
838
+ "step": 500
839
+ },
840
+ {
841
+ "epoch": 1.7421602787456445,
842
+ "eval_logits/chosen": -2.6659271717071533,
843
+ "eval_logits/rejected": -2.6499409675598145,
844
+ "eval_logps/chosen": -75.26073455810547,
845
+ "eval_logps/rejected": -83.39617919921875,
846
+ "eval_loss": 0.7903804183006287,
847
+ "eval_rewards/accuracies": 0.3492063581943512,
848
+ "eval_rewards/chosen": -0.47709161043167114,
849
+ "eval_rewards/margins": 0.9289572834968567,
850
+ "eval_rewards/rejected": -1.4060487747192383,
851
+ "eval_runtime": 121.402,
852
+ "eval_samples_per_second": 16.474,
853
+ "eval_steps_per_second": 0.519,
854
+ "step": 500
855
+ },
856
+ {
+ "epoch": 1.7770034843205575,
+ "grad_norm": 28.87847596006846,
+ "learning_rate": 1.9069767441860465e-07,
+ "logits/chosen": -2.623152732849121,
+ "logits/rejected": -2.60357666015625,
+ "logps/chosen": -68.32408142089844,
+ "logps/rejected": -75.5625991821289,
+ "loss": 0.3872,
+ "rewards/accuracies": 0.4625000059604645,
+ "rewards/chosen": 1.5080443620681763,
+ "rewards/margins": 5.214726448059082,
+ "rewards/rejected": -3.706681728363037,
+ "step": 510
+ },
+ {
+ "epoch": 1.8118466898954704,
+ "grad_norm": 24.57794510996642,
+ "learning_rate": 1.8811369509043926e-07,
+ "logits/chosen": -2.635575532913208,
+ "logits/rejected": -2.630188226699829,
+ "logps/chosen": -71.8682861328125,
+ "logps/rejected": -82.45121002197266,
+ "loss": 0.3947,
+ "rewards/accuracies": 0.46875,
+ "rewards/chosen": 1.4898183345794678,
+ "rewards/margins": 4.737631797790527,
+ "rewards/rejected": -3.247814178466797,
+ "step": 520
+ },
+ {
+ "epoch": 1.8466898954703832,
+ "grad_norm": 43.47081299094027,
+ "learning_rate": 1.8552971576227387e-07,
+ "logits/chosen": -2.6130404472351074,
+ "logits/rejected": -2.6072745323181152,
+ "logps/chosen": -67.63585662841797,
+ "logps/rejected": -76.81507873535156,
+ "loss": 0.4098,
+ "rewards/accuracies": 0.45625001192092896,
+ "rewards/chosen": 2.294175386428833,
+ "rewards/margins": 5.674102306365967,
+ "rewards/rejected": -3.379927158355713,
+ "step": 530
+ },
+ {
+ "epoch": 1.8815331010452963,
+ "grad_norm": 379.9980856009288,
+ "learning_rate": 1.829457364341085e-07,
+ "logits/chosen": -2.6233439445495605,
+ "logits/rejected": -2.6407742500305176,
+ "logps/chosen": -59.39795684814453,
+ "logps/rejected": -78.36451721191406,
+ "loss": 0.4218,
+ "rewards/accuracies": 0.45625001192092896,
+ "rewards/chosen": 2.6863512992858887,
+ "rewards/margins": 5.80073881149292,
+ "rewards/rejected": -3.114386796951294,
+ "step": 540
+ },
+ {
+ "epoch": 1.916376306620209,
+ "grad_norm": 36.16567370320362,
+ "learning_rate": 1.8036175710594315e-07,
+ "logits/chosen": -2.6051318645477295,
+ "logits/rejected": -2.5861499309539795,
+ "logps/chosen": -85.02639770507812,
+ "logps/rejected": -92.01643371582031,
+ "loss": 0.3918,
+ "rewards/accuracies": 0.5,
+ "rewards/chosen": 3.6012024879455566,
+ "rewards/margins": 6.781991004943848,
+ "rewards/rejected": -3.180788993835449,
+ "step": 550
+ },
+ {
+ "epoch": 1.951219512195122,
+ "grad_norm": 113.77212344428047,
+ "learning_rate": 1.7777777777777776e-07,
+ "logits/chosen": -2.601555347442627,
+ "logits/rejected": -2.6108827590942383,
+ "logps/chosen": -59.21348190307617,
+ "logps/rejected": -73.62416076660156,
+ "loss": 0.4326,
+ "rewards/accuracies": 0.4375,
+ "rewards/chosen": 2.7940502166748047,
+ "rewards/margins": 4.857416152954102,
+ "rewards/rejected": -2.063365936279297,
+ "step": 560
+ },
+ {
+ "epoch": 1.986062717770035,
+ "grad_norm": 84.29111668630037,
+ "learning_rate": 1.7519379844961235e-07,
+ "logits/chosen": -2.676542282104492,
+ "logits/rejected": -2.6477246284484863,
+ "logps/chosen": -60.66288375854492,
+ "logps/rejected": -65.19253540039062,
+ "loss": 0.4049,
+ "rewards/accuracies": 0.4312500059604645,
+ "rewards/chosen": 2.821211338043213,
+ "rewards/margins": 4.4517645835876465,
+ "rewards/rejected": -1.6305538415908813,
+ "step": 570
+ },
+ {
+ "epoch": 2.0209059233449476,
+ "grad_norm": 1.1269554094711256,
+ "learning_rate": 1.7260981912144704e-07,
+ "logits/chosen": -2.685316801071167,
+ "logits/rejected": -2.6868529319763184,
+ "logps/chosen": -73.15747833251953,
+ "logps/rejected": -84.87666320800781,
+ "loss": 0.3613,
+ "rewards/accuracies": 0.5,
+ "rewards/chosen": 3.4528770446777344,
+ "rewards/margins": 6.801962852478027,
+ "rewards/rejected": -3.349086046218872,
+ "step": 580
+ },
+ {
+ "epoch": 2.0557491289198606,
+ "grad_norm": 25.976934863727717,
+ "learning_rate": 1.7002583979328165e-07,
+ "logits/chosen": -2.589168071746826,
+ "logits/rejected": -2.530686378479004,
+ "logps/chosen": -87.9363021850586,
+ "logps/rejected": -83.9460220336914,
+ "loss": 0.343,
+ "rewards/accuracies": 0.5375000238418579,
+ "rewards/chosen": 4.061544895172119,
+ "rewards/margins": 7.5588788986206055,
+ "rewards/rejected": -3.4973349571228027,
+ "step": 590
+ },
+ {
+ "epoch": 2.0905923344947737,
+ "grad_norm": 4.291125808983435,
+ "learning_rate": 1.6744186046511627e-07,
+ "logits/chosen": -2.6059463024139404,
+ "logits/rejected": -2.5761430263519287,
+ "logps/chosen": -57.32917404174805,
+ "logps/rejected": -64.40930938720703,
+ "loss": 0.3749,
+ "rewards/accuracies": 0.44999998807907104,
+ "rewards/chosen": 2.808424234390259,
+ "rewards/margins": 5.987191200256348,
+ "rewards/rejected": -3.1787662506103516,
+ "step": 600
+ },
+ {
+ "epoch": 2.0905923344947737,
+ "eval_logits/chosen": -2.6320791244506836,
+ "eval_logits/rejected": -2.6158933639526367,
+ "eval_logps/chosen": -73.58616638183594,
+ "eval_logps/rejected": -81.91004180908203,
+ "eval_loss": 0.8125157952308655,
+ "eval_rewards/accuracies": 0.363095223903656,
+ "eval_rewards/chosen": 0.5611402988433838,
+ "eval_rewards/margins": 1.0457921028137207,
+ "eval_rewards/rejected": -0.48465171456336975,
+ "eval_runtime": 113.5505,
+ "eval_samples_per_second": 17.613,
+ "eval_steps_per_second": 0.555,
+ "step": 600
+ },
1022
+ {
+ "epoch": 2.1254355400696863,
+ "grad_norm": 7.908116387433432,
+ "learning_rate": 1.6485788113695088e-07,
+ "logits/chosen": -2.655932664871216,
+ "logits/rejected": -2.618626117706299,
+ "logps/chosen": -77.79930114746094,
+ "logps/rejected": -76.16334533691406,
+ "loss": 0.3651,
+ "rewards/accuracies": 0.4749999940395355,
+ "rewards/chosen": 3.5249366760253906,
+ "rewards/margins": 6.6772894859313965,
+ "rewards/rejected": -3.152352809906006,
+ "step": 610
+ },
+ {
+ "epoch": 2.1602787456445993,
+ "grad_norm": 12.816821249204288,
+ "learning_rate": 1.6227390180878554e-07,
+ "logits/chosen": -2.6253597736358643,
+ "logits/rejected": -2.570455551147461,
+ "logps/chosen": -72.71697998046875,
+ "logps/rejected": -84.75135040283203,
+ "loss": 0.3665,
+ "rewards/accuracies": 0.518750011920929,
+ "rewards/chosen": 3.5354607105255127,
+ "rewards/margins": 7.969748020172119,
+ "rewards/rejected": -4.434287071228027,
+ "step": 620
+ },
+ {
+ "epoch": 2.1951219512195124,
+ "grad_norm": 8.050269883900128,
+ "learning_rate": 1.5968992248062013e-07,
+ "logits/chosen": -2.5422072410583496,
+ "logits/rejected": -2.559743881225586,
+ "logps/chosen": -67.67036437988281,
+ "logps/rejected": -95.14600372314453,
+ "loss": 0.3449,
+ "rewards/accuracies": 0.5249999761581421,
+ "rewards/chosen": 3.445582866668701,
+ "rewards/margins": 8.744329452514648,
+ "rewards/rejected": -5.2987470626831055,
+ "step": 630
+ },
+ {
+ "epoch": 2.229965156794425,
+ "grad_norm": 5.758994331059265,
+ "learning_rate": 1.5710594315245477e-07,
+ "logits/chosen": -2.5750184059143066,
+ "logits/rejected": -2.5732388496398926,
+ "logps/chosen": -66.1601333618164,
+ "logps/rejected": -90.6006851196289,
+ "loss": 0.3713,
+ "rewards/accuracies": 0.45625001192092896,
+ "rewards/chosen": 2.1229605674743652,
+ "rewards/margins": 6.844090938568115,
+ "rewards/rejected": -4.721129894256592,
+ "step": 640
+ },
+ {
+ "epoch": 2.264808362369338,
+ "grad_norm": 4.106714414729963,
+ "learning_rate": 1.5452196382428938e-07,
+ "logits/chosen": -2.571296215057373,
+ "logits/rejected": -2.5443153381347656,
+ "logps/chosen": -76.41832733154297,
+ "logps/rejected": -84.57535552978516,
+ "loss": 0.3914,
+ "rewards/accuracies": 0.42500001192092896,
+ "rewards/chosen": 2.150641918182373,
+ "rewards/margins": 6.811266899108887,
+ "rewards/rejected": -4.660625457763672,
+ "step": 650
+ },
+ {
+ "epoch": 2.2996515679442506,
+ "grad_norm": 26.12193694012493,
+ "learning_rate": 1.5193798449612402e-07,
+ "logits/chosen": -2.601414203643799,
+ "logits/rejected": -2.5880684852600098,
+ "logps/chosen": -76.15614318847656,
+ "logps/rejected": -84.27168273925781,
+ "loss": 0.3595,
+ "rewards/accuracies": 0.4437499940395355,
+ "rewards/chosen": 1.7724649906158447,
+ "rewards/margins": 7.21490478515625,
+ "rewards/rejected": -5.442440986633301,
+ "step": 660
+ },
+ {
+ "epoch": 2.3344947735191637,
+ "grad_norm": 10.794985780975221,
+ "learning_rate": 1.4935400516795863e-07,
+ "logits/chosen": -2.593954563140869,
+ "logits/rejected": -2.5915369987487793,
+ "logps/chosen": -70.91958618164062,
+ "logps/rejected": -83.38150787353516,
+ "loss": 0.3772,
+ "rewards/accuracies": 0.45625001192092896,
+ "rewards/chosen": 1.1422145366668701,
+ "rewards/margins": 7.140495300292969,
+ "rewards/rejected": -5.998281002044678,
+ "step": 670
+ },
+ {
+ "epoch": 2.3693379790940767,
+ "grad_norm": 2.1208474931059706,
+ "learning_rate": 1.4677002583979327e-07,
+ "logits/chosen": -2.635155439376831,
+ "logits/rejected": -2.622910499572754,
+ "logps/chosen": -70.04927062988281,
+ "logps/rejected": -82.35643005371094,
+ "loss": 0.3572,
+ "rewards/accuracies": 0.48750001192092896,
+ "rewards/chosen": 2.0802841186523438,
+ "rewards/margins": 8.326448440551758,
+ "rewards/rejected": -6.246163845062256,
+ "step": 680
+ },
+ {
+ "epoch": 2.40418118466899,
+ "grad_norm": 10.487007281103297,
+ "learning_rate": 1.4418604651162788e-07,
+ "logits/chosen": -2.6031365394592285,
+ "logits/rejected": -2.5714783668518066,
+ "logps/chosen": -77.78074645996094,
+ "logps/rejected": -97.24638366699219,
+ "loss": 0.3613,
+ "rewards/accuracies": 0.4749999940395355,
+ "rewards/chosen": 2.6356489658355713,
+ "rewards/margins": 8.40825080871582,
+ "rewards/rejected": -5.7726030349731445,
+ "step": 690
+ },
+ {
+ "epoch": 2.4390243902439024,
+ "grad_norm": 7.007416212948012,
+ "learning_rate": 1.4160206718346252e-07,
+ "logits/chosen": -2.5922415256500244,
+ "logits/rejected": -2.5560450553894043,
+ "logps/chosen": -83.61808776855469,
+ "logps/rejected": -89.99815368652344,
+ "loss": 0.3662,
+ "rewards/accuracies": 0.53125,
+ "rewards/chosen": 3.2752013206481934,
+ "rewards/margins": 8.457530975341797,
+ "rewards/rejected": -5.182328701019287,
+ "step": 700
+ },
+ {
+ "epoch": 2.4390243902439024,
+ "eval_logits/chosen": -2.6111857891082764,
+ "eval_logits/rejected": -2.594111442565918,
+ "eval_logps/chosen": -75.47573852539062,
+ "eval_logps/rejected": -84.49435424804688,
+ "eval_loss": 0.8411857485771179,
+ "eval_rewards/accuracies": 0.3650793731212616,
+ "eval_rewards/chosen": -0.6103957891464233,
+ "eval_rewards/margins": 1.476527214050293,
+ "eval_rewards/rejected": -2.086923122406006,
+ "eval_runtime": 113.5556,
+ "eval_samples_per_second": 17.613,
+ "eval_steps_per_second": 0.555,
+ "step": 700
+ },
1188
+ {
+ "epoch": 2.4738675958188154,
+ "grad_norm": 0.4711439948410044,
+ "learning_rate": 1.3901808785529716e-07,
+ "logits/chosen": -2.632316827774048,
+ "logits/rejected": -2.612363576889038,
+ "logps/chosen": -91.01361083984375,
+ "logps/rejected": -111.48197937011719,
+ "loss": 0.3629,
+ "rewards/accuracies": 0.5249999761581421,
+ "rewards/chosen": 2.5902059078216553,
+ "rewards/margins": 9.54753303527832,
+ "rewards/rejected": -6.957326412200928,
+ "step": 710
+ },
+ {
+ "epoch": 2.508710801393728,
+ "grad_norm": 0.03547147611253739,
+ "learning_rate": 1.3643410852713177e-07,
+ "logits/chosen": -2.6036946773529053,
+ "logits/rejected": -2.5696444511413574,
+ "logps/chosen": -66.248779296875,
+ "logps/rejected": -71.01225280761719,
+ "loss": 0.4025,
+ "rewards/accuracies": 0.45625001192092896,
+ "rewards/chosen": 2.9773497581481934,
+ "rewards/margins": 7.88436222076416,
+ "rewards/rejected": -4.907011985778809,
+ "step": 720
+ },
+ {
+ "epoch": 2.543554006968641,
+ "grad_norm": 4.675015379292036,
+ "learning_rate": 1.3385012919896641e-07,
+ "logits/chosen": -2.6531801223754883,
+ "logits/rejected": -2.628028392791748,
+ "logps/chosen": -65.84535217285156,
+ "logps/rejected": -67.12870025634766,
+ "loss": 0.3706,
+ "rewards/accuracies": 0.34375,
+ "rewards/chosen": 2.7976653575897217,
+ "rewards/margins": 6.484448432922363,
+ "rewards/rejected": -3.6867833137512207,
+ "step": 730
+ },
+ {
+ "epoch": 2.578397212543554,
+ "grad_norm": 15.399000464963422,
+ "learning_rate": 1.3126614987080103e-07,
+ "logits/chosen": -2.5426218509674072,
+ "logits/rejected": -2.5461742877960205,
+ "logps/chosen": -61.294456481933594,
+ "logps/rejected": -87.54208374023438,
+ "loss": 0.3653,
+ "rewards/accuracies": 0.4375,
+ "rewards/chosen": 2.92600154876709,
+ "rewards/margins": 8.220406532287598,
+ "rewards/rejected": -5.29440450668335,
+ "step": 740
+ },
+ {
+ "epoch": 2.6132404181184667,
+ "grad_norm": 1.423350223185656,
+ "learning_rate": 1.2868217054263566e-07,
+ "logits/chosen": -2.6260664463043213,
+ "logits/rejected": -2.613114595413208,
+ "logps/chosen": -62.37495040893555,
+ "logps/rejected": -72.44084930419922,
+ "loss": 0.35,
+ "rewards/accuracies": 0.4937500059604645,
+ "rewards/chosen": 3.621440887451172,
+ "rewards/margins": 6.8329291343688965,
+ "rewards/rejected": -3.2114882469177246,
+ "step": 750
+ },
+ {
+ "epoch": 2.64808362369338,
+ "grad_norm": 28.902176788755387,
+ "learning_rate": 1.2609819121447028e-07,
+ "logits/chosen": -2.54823637008667,
+ "logits/rejected": -2.536100387573242,
+ "logps/chosen": -48.50947189331055,
+ "logps/rejected": -61.75891876220703,
+ "loss": 0.3804,
+ "rewards/accuracies": 0.38749998807907104,
+ "rewards/chosen": 2.9782872200012207,
+ "rewards/margins": 6.0528998374938965,
+ "rewards/rejected": -3.074612855911255,
+ "step": 760
+ },
+ {
+ "epoch": 2.682926829268293,
+ "grad_norm": 0.05604179114051096,
+ "learning_rate": 1.2351421188630492e-07,
+ "logits/chosen": -2.6780683994293213,
+ "logits/rejected": -2.659198522567749,
+ "logps/chosen": -71.71595764160156,
+ "logps/rejected": -86.19725036621094,
+ "loss": 0.3745,
+ "rewards/accuracies": 0.45625001192092896,
+ "rewards/chosen": 3.0666489601135254,
+ "rewards/margins": 7.4923295974731445,
+ "rewards/rejected": -4.425681114196777,
+ "step": 770
+ },
+ {
+ "epoch": 2.7177700348432055,
+ "grad_norm": 0.283574851628151,
+ "learning_rate": 1.2093023255813953e-07,
+ "logits/chosen": -2.6133499145507812,
+ "logits/rejected": -2.612217426300049,
+ "logps/chosen": -74.23127746582031,
+ "logps/rejected": -91.57493591308594,
+ "loss": 0.3447,
+ "rewards/accuracies": 0.512499988079071,
+ "rewards/chosen": 4.144797325134277,
+ "rewards/margins": 9.933080673217773,
+ "rewards/rejected": -5.788283348083496,
+ "step": 780
+ },
+ {
+ "epoch": 2.7526132404181185,
+ "grad_norm": 0.8997563908630796,
+ "learning_rate": 1.1834625322997414e-07,
+ "logits/chosen": -2.6512093544006348,
+ "logits/rejected": -2.6401140689849854,
+ "logps/chosen": -64.2149429321289,
+ "logps/rejected": -82.6252212524414,
+ "loss": 0.3748,
+ "rewards/accuracies": 0.4625000059604645,
+ "rewards/chosen": 1.9805446863174438,
+ "rewards/margins": 7.318342685699463,
+ "rewards/rejected": -5.337798118591309,
+ "step": 790
+ },
+ {
+ "epoch": 2.7874564459930316,
+ "grad_norm": 20.667512714719013,
+ "learning_rate": 1.1576227390180877e-07,
+ "logits/chosen": -2.621532440185547,
+ "logits/rejected": -2.5869758129119873,
+ "logps/chosen": -75.92918395996094,
+ "logps/rejected": -84.97443389892578,
+ "loss": 0.3615,
+ "rewards/accuracies": 0.46875,
+ "rewards/chosen": 2.5894510746002197,
+ "rewards/margins": 8.263704299926758,
+ "rewards/rejected": -5.674252510070801,
+ "step": 800
+ },
+ {
+ "epoch": 2.7874564459930316,
+ "eval_logits/chosen": -2.6538193225860596,
+ "eval_logits/rejected": -2.636691093444824,
+ "eval_logps/chosen": -76.02718353271484,
+ "eval_logps/rejected": -85.26797485351562,
+ "eval_loss": 0.8766492605209351,
+ "eval_rewards/accuracies": 0.3611111044883728,
+ "eval_rewards/chosen": -0.9522846937179565,
+ "eval_rewards/margins": 1.6142810583114624,
+ "eval_rewards/rejected": -2.566565990447998,
+ "eval_runtime": 113.6286,
+ "eval_samples_per_second": 17.601,
+ "eval_steps_per_second": 0.554,
+ "step": 800
+ },
1354
+ {
+ "epoch": 2.822299651567944,
+ "grad_norm": 2.1853706150186754,
+ "learning_rate": 1.131782945736434e-07,
+ "logits/chosen": -2.6088814735412598,
+ "logits/rejected": -2.5650715827941895,
+ "logps/chosen": -85.81961822509766,
+ "logps/rejected": -88.84381103515625,
+ "loss": 0.3574,
+ "rewards/accuracies": 0.5249999761581421,
+ "rewards/chosen": 2.4687106609344482,
+ "rewards/margins": 8.304216384887695,
+ "rewards/rejected": -5.835506439208984,
+ "step": 810
+ },
+ {
+ "epoch": 2.857142857142857,
+ "grad_norm": 3.1905831999742365,
+ "learning_rate": 1.1059431524547802e-07,
+ "logits/chosen": -2.621854305267334,
+ "logits/rejected": -2.6245625019073486,
+ "logps/chosen": -87.5467529296875,
+ "logps/rejected": -108.12459564208984,
+ "loss": 0.3825,
+ "rewards/accuracies": 0.512499988079071,
+ "rewards/chosen": 2.308243989944458,
+ "rewards/margins": 8.762375831604004,
+ "rewards/rejected": -6.454131126403809,
+ "step": 820
+ },
+ {
+ "epoch": 2.89198606271777,
+ "grad_norm": 0.7951775803731691,
+ "learning_rate": 1.0801033591731266e-07,
+ "logits/chosen": -2.5292835235595703,
+ "logits/rejected": -2.5408291816711426,
+ "logps/chosen": -59.091575622558594,
+ "logps/rejected": -82.16059112548828,
+ "loss": 0.3596,
+ "rewards/accuracies": 0.4437499940395355,
+ "rewards/chosen": 1.360400915145874,
+ "rewards/margins": 7.509343147277832,
+ "rewards/rejected": -6.148941993713379,
+ "step": 830
+ },
+ {
+ "epoch": 2.926829268292683,
+ "grad_norm": 1.1924411554401215,
+ "learning_rate": 1.054263565891473e-07,
+ "logits/chosen": -2.617638111114502,
+ "logits/rejected": -2.5782742500305176,
+ "logps/chosen": -76.57533264160156,
+ "logps/rejected": -92.98815155029297,
+ "loss": 0.3733,
+ "rewards/accuracies": 0.4937500059604645,
+ "rewards/chosen": 3.365124464035034,
+ "rewards/margins": 9.208218574523926,
+ "rewards/rejected": -5.843094825744629,
+ "step": 840
+ },
+ {
+ "epoch": 2.961672473867596,
+ "grad_norm": 0.8910844109249688,
+ "learning_rate": 1.0284237726098191e-07,
+ "logits/chosen": -2.6461727619171143,
+ "logits/rejected": -2.616791009902954,
+ "logps/chosen": -86.35088348388672,
+ "logps/rejected": -93.9124526977539,
+ "loss": 0.3609,
+ "rewards/accuracies": 0.518750011920929,
+ "rewards/chosen": 3.4399936199188232,
+ "rewards/margins": 8.806478500366211,
+ "rewards/rejected": -5.366484642028809,
+ "step": 850
+ },
+ {
+ "epoch": 2.996515679442509,
+ "grad_norm": 43.871948824299515,
+ "learning_rate": 1.0025839793281653e-07,
+ "logits/chosen": -2.576416015625,
+ "logits/rejected": -2.5809175968170166,
+ "logps/chosen": -69.1751708984375,
+ "logps/rejected": -86.57472229003906,
+ "loss": 0.3403,
+ "rewards/accuracies": 0.45625001192092896,
+ "rewards/chosen": 3.5098342895507812,
+ "rewards/margins": 9.559127807617188,
+ "rewards/rejected": -6.049294471740723,
+ "step": 860
+ },
+ {
+ "epoch": 3.0,
+ "step": 861,
+ "total_flos": 0.0,
+ "train_loss": 0.4762609164889266,
+ "train_runtime": 9670.9508,
+ "train_samples_per_second": 5.689,
+ "train_steps_per_second": 0.089
+ }
+ ],
+ "logging_steps": 10,
+ "max_steps": 861,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 3,
+ "save_steps": 100,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 8,
+ "trial_name": null,
+ "trial_params": null
+ }