---
language:
- en
tags:
- deepspeed
- chatgpt
- opt
- sft
- rlhf
license: apache-2.0
datasets:
- Dahoas/full-hh-rlhf
- Dahoas/synthetic-instruct-gptj-pairwise
- yitingxie/rlhf-reward-datasets
- openai/webgpt_comparisons
- stanfordnlp/SHP
---
---
# ChatGPT OPT 1.3B DeepSpeed Reinforcement Learning from Human Feedback Actor Model
*chat-opt-1.3b-rlhf-actor-deepspeed*
This model is the final step of a modified version of the traditional ChatGPT training pipeline, a three-step procedure comprising [supervised fine-tuning](https://huggingface.co/AdamG012/chat-opt-1.3b-sft-deepspeed), [reward model](https://huggingface.co/AdamG012/chat-opt-350m-reward-deepspeed) fine-tuning and **reinforcement learning from human feedback**, which produces the [actor](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-deepspeed), [actor EMA](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-ema-deepspeed) and [critic](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-critic-deepspeed) models.
The main goal of this project was to make proper use of existing frameworks that focus on minimising training costs, and thereby improve both the feasibility and usability of ChatGPT-like models. The framework selected here is DeepSpeed, which was instrumental in the development of this model: it made it possible to train the ChatGPT-like model on much larger data-sets with a reasonable number of GPUs and consequently achieve significantly better performance.
This model follows the ChatGPT blog post, the InstructGPT paper and, in particular, the [Microsoft DeepSpeed Chat Blog](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat).
## Our Training Methodology and Speedup Recipes
The training process involves a single Python run of DeepSpeed-Chat, which initiates the whole three-step pipeline and saves all models in the process:
``` bash
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
```
This pipeline can be broken up into three key steps:
1. **Supervised fine-tuning (SFT):** See [here](https://huggingface.co/AdamG012/chat-opt-1.3b-sft-deepspeed/).
2. **Reward Model (RM) fine-tuning:** See [here](https://huggingface.co/AdamG012/chat-opt-350m-reward-deepspeed).
3. **Reinforcement learning from human feedback (RLHF) fine-tuning:** Once the prior two steps are complete, the final RLHF fine-tuning can be initiated. This takes both the *fine-tuned model* from step 1 and the *reward model* from step 2 and trains them on the data-set with comparisons. This generates both an **actor** and a [critic](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-critic-deepspeed). I also generate an [actor model with an exponential moving average (EMA)](https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-ema-deepspeed), which is known to improve conversational response quality.
To view the details behind each step, follow the respective links and read the model card there.
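The EMA actor mentioned in step 3 keeps a slowly moving average of the actor weights. The snippet below is a minimal sketch of that idea using plain Python floats in place of tensors; the function name `ema_update`, the decay value `beta` and the toy "weights" are hypothetical and are not taken from the DeepSpeed-Chat code.

```python
from typing import Dict


def ema_update(ema_params: Dict[str, float],
               params: Dict[str, float],
               beta: float = 0.99) -> None:
    """Blend the latest actor weights into the EMA copy, in place.

    beta is a hypothetical decay; values close to 1 change the EMA slowly.
    """
    for name, value in params.items():
        ema_params[name] = beta * ema_params[name] + (1.0 - beta) * value


# Toy example: a single scalar "weight" stands in for the actor parameters.
actor = {"w": 1.0}
ema_actor = dict(actor)  # the EMA copy starts from the same weights

for step in range(3):
    actor["w"] += 1.0  # stand-in for one optimiser step on the actor
    ema_update(ema_actor, actor, beta=0.9)

print(actor["w"], ema_actor["w"])
```

The EMA copy trails the actor and smooths out step-to-step changes, which is the property the EMA checkpoint relies on.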
### Reinforcement learning from human feedback
**Model Configurations:**
| Parameter | Value |
|:-----------------------|:------|
| Parameters | 1.3B |
| Model type | OPT |
| FFN Dimensions | 8192 |
| Hidden Size | 2048 |
| Max Position Embedding | 2048 |
| Attention Heads | 16 |
| Hidden layers | 24 |
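As a rough sanity check, the values in this table can be turned into an approximate weight count. This is only a sketch: it counts the attention and feed-forward weight matrices plus the token embedding, ignores biases, layer norms and position embeddings, and assumes a vocabulary size of 50272, which is not listed in the table.

```python
# Values from the model configuration table.
HIDDEN_SIZE = 2048
FFN_DIM = 8192
NUM_LAYERS = 24
NUM_HEADS = 16

# Assumption: vocabulary size is not given in the table above.
VOCAB_SIZE = 50272

# Each attention head works on an equal slice of the hidden size.
head_dim = HIDDEN_SIZE // NUM_HEADS

# Query, key, value and output projections: four square matrices per layer.
attention_weights = 4 * HIDDEN_SIZE * HIDDEN_SIZE

# Feed-forward block: one up-projection and one down-projection per layer.
ffn_weights = 2 * HIDDEN_SIZE * FFN_DIM

per_layer = attention_weights + ffn_weights
total_weights = NUM_LAYERS * per_layer + VOCAB_SIZE * HIDDEN_SIZE

print(f"head_dim={head_dim}, approx. weights={total_weights / 1e9:.2f}B")
```

Under these assumptions the estimate lands close to the 1.3B figure in the table.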
**Training Configurations:**
| Parameter | Value |
|:-----------------------|:------|
| Train Batch size | 32 |
| Train micro batch size | 4 |
| ZeRO stage | 2 |
| FP16 | True |
| Gradient clipping | 1.0 |
| Dropout | 0.1 |
| Attention Dropout | 0.0 |
| Prescale gradients | False |
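Most of these settings correspond to keys in a DeepSpeed configuration. The sketch below shows how the table might be written as a config dictionary and how the train batch size relates to the micro batch size; the GPU counts and the helper `grad_accum_steps` are hypothetical, and the dropout values are left out because they belong to the model configuration rather than the DeepSpeed config.

```python
import json

# Values from the training configuration table.
TRAIN_BATCH_SIZE = 32
MICRO_BATCH_SIZE = 4

ds_config = {
    "train_batch_size": TRAIN_BATCH_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
}


def grad_accum_steps(train_batch: int, micro_batch: int, num_gpus: int) -> int:
    """Gradient accumulation steps implied by the batch sizes.

    DeepSpeed expects:
    train_batch = micro_batch * gradient_accumulation_steps * num_gpus
    """
    per_step = micro_batch * num_gpus
    if train_batch % per_step != 0:
        raise ValueError("train batch size must be a multiple of micro_batch * num_gpus")
    return train_batch // per_step


# Hypothetical GPU counts, for illustration only.
for num_gpus in (1, 8):
    steps = grad_accum_steps(TRAIN_BATCH_SIZE, MICRO_BATCH_SIZE, num_gpus)
    print(f"{num_gpus} GPU(s): {steps} accumulation step(s)")

print(json.dumps(ds_config, indent=2))
```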
## Installation
To use the model through the Hugging Face transformers library:
``` python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("AdamG012/chat-opt-1.3b-rlhf-actor-deepspeed")
model = AutoModelForCausalLM.from_pretrained("AdamG012/chat-opt-1.3b-rlhf-actor-deepspeed")
```
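Once the tokenizer and model are loaded as above, a reply can be generated roughly as follows. This is a sketch rather than the reference usage: the `Human: ... Assistant:` prompt format is an assumption based on the DeepSpeed-Chat examples, and the helper names `build_prompt` and `generate_reply` are hypothetical.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the assumed 'Human: ... Assistant:' format."""
    return f"Human: {question.strip()} Assistant:"


def generate_reply(model, tokenizer, question: str, max_new_tokens: int = 128) -> str:
    """Generate a reply using the model and tokenizer loaded above."""
    prompt = build_prompt(question)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # The decoded text normally repeats the prompt; keep only the continuation.
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()
```

For example, `generate_reply(model, tokenizer, "What is DeepSpeed?")` returns the text of the model's reply; `max_new_tokens` bounds the length of the generated continuation.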
If you would like to clone from source:
```bash
# Make sure you have git-lfs installed (https://git-lfs.github.com)
git lfs install
git clone https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-deepspeed
# If you want to clone without large files (just their pointers),
# prepend the following env var to your git clone command:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/AdamG012/chat-opt-1.3b-rlhf-actor-deepspeed
```
## **Acknowledgements**
We thank the authors of the following papers and open-source repositories, and especially the DeepSpeed team for their framework.
* [1] Schulman, John, et al. "Introducing ChatGPT", https://openai.com/blog/chatgpt (2022).
* [2] Transformers, [Hugging Face (github.com)](https://github.com/huggingface)
* [3] [DeepSpeed Chat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat)