---
base_model: argsearch/llama-7b-sft-float32
tags:
- alignment-handbook
- trl
- dpo
- generated_from_trainer
datasets:
- Dahoas/full-hh-rlhf
model-index:
- name: llama-7b-sft-DPO
  results: []
---

# llama-7b-sft-DPO

This model is a fine-tuned version of [argsearch/llama-7b-sft-float32](https://huggingface.co/argsearch/llama-7b-sft-float32) on the Dahoas/full-hh-rlhf dataset.
It achieves the following results on the evaluation set:
- Loss: 0.6525
- Rewards/chosen: 0.3315
- Rewards/rejected: 0.1953
- Rewards/accuracies: 0.6080
- Rewards/margins: 0.1362
- Logps/rejected: -633.3815
- Logps/chosen: -690.5654
- Logits/rejected: -1.9212
- Logits/chosen: -1.9766
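
For reference, the `Rewards/*` metrics follow the convention of TRL's `DPOTrainer`: the implicit reward of a response is β times the log-probability ratio between the trained policy and the frozen SFT reference model, `Rewards/margins` is the mean gap between chosen and rejected rewards, and `Rewards/accuracies` is the fraction of pairs in which the chosen response receives the higher reward. The objective minimized during training is the standard DPO loss (Rafailov et al., 2023); the β used for this run is not recorded in this card:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$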

## Model description

This is [argsearch/llama-7b-sft-float32](https://huggingface.co/argsearch/llama-7b-sft-float32), a supervised fine-tune of LLaMA-7B, further aligned with Direct Preference Optimization (DPO) on the [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf) human-preference dataset. As the tags above indicate, training followed the alignment-handbook/TRL DPO recipe.
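
A minimal generation sketch with 🤗 Transformers. The repository id below is a placeholder; point it at wherever this checkpoint is actually hosted:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id: replace with the actual location of the DPO checkpoint.
model_id = "your-namespace/llama-7b-sft-DPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# HH-RLHF dialogues use the "Human:" / "Assistant:" turn format.
prompt = "\n\nHuman: How do I bake bread at home?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```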

## Intended uses & limitations

The model is primarily a research artifact for studying DPO alignment of a 7B dialogue model. Prompts should follow the `\n\nHuman: ...\n\nAssistant:` format used by the HH-RLHF data. As with the base LLaMA-7B model, outputs can be incorrect or harmful; the model has not been evaluated beyond the preference metrics reported here, so additional testing is advisable before any deployment.

## Training and evaluation data

Both training and evaluation used [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf), a preference dataset derived from Anthropic's Helpful & Harmless (HH-RLHF) data, in which each prompt is paired with a chosen and a rejected assistant response. The metrics above come from its held-out evaluation split.
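
A quick way to inspect the data (the column names below are what the dataset exposes at the time of writing; verify against the hub):

```python
from datasets import load_dataset

# Preference pairs: each row has a prompt plus a chosen and a rejected response.
ds = load_dataset("Dahoas/full-hh-rlhf")
print(ds)  # splits and sizes

example = ds["train"][0]
print(example["prompt"][:200])    # dialogue context in Human/Assistant format
print(example["chosen"][:200])    # preferred continuation
print(example["rejected"][:200])  # dispreferred continuation
```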

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
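
A sketch of how these settings map onto TRL's `DPOTrainer`, using the API of the TRL versions contemporary with the Transformers build listed below. This is illustrative only: the exact training script, β, sequence-length limits, and prompt formatting are not recorded in this card.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "argsearch/llama-7b-sft-float32"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Policy and frozen reference both start from the same SFT checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(model_id)

# Per-device batch size 8 on 4 GPUs with gradient accumulation 2
# gives the effective train batch size of 64 reported above.
args = TrainingArguments(
    output_dir="llama-7b-sft-DPO",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    seed=42,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=args,
    beta=0.1,  # assumed: the actual beta is not recorded in this card
    train_dataset=load_dataset("Dahoas/full-hh-rlhf", split="train"),
    tokenizer=tokenizer,
)
trainer.train()
```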

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6884        | 0.06  | 100  | 0.6886          | 0.0879         | 0.0774           | 0.5647             | 0.0105          | -645.1731      | -714.9250    | -2.7786         | -2.8754       |
| 0.6769        | 0.11  | 200  | 0.6809          | 0.2546         | 0.2194           | 0.5747             | 0.0352          | -630.9728      | -698.2556    | -2.6094         | -2.6971       |
| 0.6734        | 0.17  | 300  | 0.6755          | 0.2980         | 0.2471           | 0.5833             | 0.0508          | -628.1946      | -693.9142    | -2.5226         | -2.6062       |
| 0.6684        | 0.23  | 400  | 0.6713          | 0.3480         | 0.2822           | 0.5888             | 0.0658          | -624.6848      | -688.9108    | -2.4007         | -2.4782       |
| 0.6647        | 0.29  | 500  | 0.6671          | 0.3495         | 0.2706           | 0.6048             | 0.0789          | -625.8477      | -688.7593    | -2.3026         | -2.3749       |
| 0.6598        | 0.34  | 600  | 0.6636          | 0.3311         | 0.2429           | 0.6058             | 0.0882          | -628.6143      | -690.6030    | -2.1694         | -2.2345       |
| 0.6598        | 0.4   | 700  | 0.6606          | 0.2824         | 0.1853           | 0.6106             | 0.0971          | -634.3779      | -695.4718    | -1.9252         | -1.9781       |
| 0.6563        | 0.46  | 800  | 0.6585          | 0.3476         | 0.2374           | 0.6071             | 0.1102          | -629.1707      | -688.9521    | -2.0030         | -2.0599       |
| 0.6636        | 0.51  | 900  | 0.6572          | 0.3569         | 0.2427           | 0.6119             | 0.1142          | -628.6379      | -688.0209    | -1.9872         | -2.0440       |
| 0.6436        | 0.57  | 1000 | 0.6558          | 0.2921         | 0.1732           | 0.6096             | 0.1190          | -635.5912      | -694.4999    | -1.9618         | -2.0181       |
| 0.6759        | 0.63  | 1100 | 0.6548          | 0.3436         | 0.2165           | 0.6071             | 0.1272          | -631.2626      | -689.3489    | -1.9627         | -2.0198       |
| 0.6679        | 0.69  | 1200 | 0.6542          | 0.3533         | 0.2212           | 0.6077             | 0.1321          | -630.7878      | -688.3820    | -1.9058         | -1.9598       |
| 0.6358        | 0.74  | 1300 | 0.6533          | 0.3363         | 0.2036           | 0.6074             | 0.1327          | -632.5449      | -690.0779    | -1.9447         | -2.0015       |
| 0.6473        | 0.8   | 1400 | 0.6528          | 0.3378         | 0.2021           | 0.6080             | 0.1357          | -632.6981      | -689.9300    | -1.9072         | -1.9621       |
| 0.6447        | 0.86  | 1500 | 0.6526          | 0.3221         | 0.1869           | 0.6080             | 0.1352          | -634.2156      | -691.5005    | -1.9226         | -1.9781       |
| 0.6546        | 0.91  | 1600 | 0.6525          | 0.3303         | 0.1941           | 0.6074             | 0.1362          | -633.5018      | -690.6824    | -1.9134         | -1.9684       |
| 0.6725        | 0.97  | 1700 | 0.6525          | 0.3312         | 0.1950           | 0.6074             | 0.1363          | -633.4115      | -690.5892    | -1.9098         | -1.9645       |


### Framework versions

- Transformers 4.39.0.dev0
- Pytorch 2.3.0+cu121
- Datasets 2.14.6
- Tokenizers 0.15.2