gonzalo-santamaria-iic committed
Update README.md

README.md CHANGED
@@ -130,15 +130,11 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 
 ### Training Data
 
-A combination of both public and private datasets
+A combination of both public and private datasets designed in the IIC. The dataset consists of 21975 conversations in Spanish, in `chatml` format, with the same structure as the [Anthropic/hh-rlhf dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf). Each conversation has two variants, `chosen` and `rejected`, in which the only difference is the assistant's last answer: the answer in the `chosen` variant is considered better than the one in the `rejected` variant. Several techniques were used to generate the dataset, which we explain in depth in the paper (**coming soon**).
 
 ### Training Procedure
 
-
-
-#### Preprocessing [optional]
-
-[More Information Needed]
+We use the [Transformer Reinforcement Learning](https://huggingface.co/docs/trl/index) (TRL) library. Specifically, we applied [the DPO example script they publish](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py) to the dataset we generated.
 
 
 #### Training Hyperparameters
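The added Training Data text says the preference pairs follow the same `chosen`/`rejected` layout as Anthropic/hh-rlhf. A quick way to see that record structure is to inspect the public hh-rlhf dataset; this is an illustration only, since the IIC dataset itself is `chatml`-formatted Spanish conversations and is not loaded here.

```python
from datasets import load_dataset

# Illustration only: Anthropic/hh-rlhf shares the chosen/rejected column
# layout described in the card (its turns use a "Human:/Assistant:" style
# rather than chatml).
ds = load_dataset("Anthropic/hh-rlhf", split="train")

sample = ds[0]
print(sample["chosen"])    # conversation ending in the preferred assistant answer
print(sample["rejected"])  # same conversation ending in the less-preferred answer
```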
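The added Training Procedure points to TRL's DPO example script. Below is a rough sketch of what such a setup can look like; the model name, the toy dataset, and the hyperparameters are placeholders rather than the values used for this model, and the `DPOTrainer` signature varies across TRL versions, so the linked script remains the reference.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Placeholder base model; a real run starts from the SFT checkpoint being aligned.
base_model = "your-sft-model"
model = AutoModelForCausalLM.from_pretrained(base_model)
ref_model = AutoModelForCausalLM.from_pretrained(base_model)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Tiny stand-in preference set: DPOTrainer expects "prompt", "chosen" and
# "rejected" text columns (the real dataset holds 21975 Spanish conversations).
train_dataset = Dataset.from_dict({
    "prompt": ["¿Qué es el aprendizaje por refuerzo?"],
    "chosen": ["Es un paradigma en el que un agente aprende maximizando una recompensa."],
    "rejected": ["No lo sé."],
})

args = TrainingArguments(
    output_dir="dpo-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    remove_unused_columns=False,  # keep the raw text columns for DPOTrainer
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=args,
    beta=0.1,                 # weight of the penalty that keeps the policy near the reference model
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```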