|
--- |
|
base_model: meta-llama/Llama-2-7b-hf |
|
--- |
|
|
|
# Model Details |
|
|
|
- SFT: fine-tuned from [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on merged Alpaca datasets

- DPO: trained on top of the SFT model as a LoRA adapter, on merged [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) data

- PPO: trained on top of the DPO model and a reward model, using multiple adapters, on [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) data for further RLHF

- Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash Attention 2 (a hedged training-setup sketch follows this list)
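
Below is a minimal sketch of the QLoRA SFT stage with TRL, assuming a TRL 0.7/0.8-era `SFTTrainer` API and the `###Question/###Answer` prompt template from the Inference section. The hyperparameters, LoRA target modules, and batch sizes are illustrative assumptions, not the exact values used for this card; the DPO stage follows the same pattern with `DPOTrainer`, training a LoRA adapter on top of the SFT model.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

bnb_config = BitsAndBytesConfig(                 # QLoRA: 4-bit NF4 base weights
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",     # requires flash-attn; drop if unavailable
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(                        # LoRA adapter trained on top of the frozen 4-bit base
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("yahma/alpaca-cleaned", split="train")  # one of the merged SFT datasets

def formatting_func(batch):
    # Build "###Question/###Answer" training strings from Alpaca-style records
    texts = []
    for instr, inp, out in zip(batch["instruction"], batch["input"], batch["output"]):
        question = instr if not inp else f"{instr}\n{inp}"
        texts.append(f"###Question: {question}\n###Answer: {out}{tokenizer.eos_token}")
    return texts

trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, bf16=True, num_train_epochs=1),
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    formatting_func=formatting_func,
    max_seq_length=1024,
    packing=False,
)
trainer.train()
```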
|
|
|
|
|
## Model and Training Details |
|
|
|
- **Finetuned from model:** [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
|
|
|
- **Dataset:** (a dataset-mixing sketch follows this list)
|
- SFT (mixed train): |
|
- [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) |
|
- [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) |
|
- DPO (mixed train): |
|
- [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) |
|
- [Unified-Language-Model-Alignment/Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden) |
|
- PPO: |
|
- [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K) |
|
- [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) |
|
- [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) |
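
A minimal sketch of the "mixed train" dataset preparation with the `datasets` library; the shared-column alignment and the shuffling seed are assumptions, not the exact recipe used for this card.

```python
from datasets import concatenate_datasets, load_dataset

def mix(names, split="train"):
    # Load each dataset, keep only the columns they share, then concatenate and shuffle
    parts = [load_dataset(n, split=split) for n in names]
    common = [c for c in parts[0].column_names if all(c in p.column_names for p in parts)]
    return concatenate_datasets([p.select_columns(common) for p in parts]).shuffle(seed=42)

sft_mix = mix(["yahma/alpaca-cleaned", "vicgalle/alpaca-gpt4"])
dpo_mix = mix(["Anthropic/hh-rlhf",
               "Unified-Language-Model-Alignment/Anthropic_HH_Golden"])
ppo_prompts = load_dataset("PKU-Alignment/PKU-SafeRLHF-10K", split="train")

print(sft_mix)
print(dpo_mix)
print(ppo_prompts)
```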
|
|
|
### Training Results |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/miik5Tb6A8G6sDTlnQA-V.png) |
|
|
|
### Evaluation |
|
|
|
Reward and toxicity scores are computed on [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) prompts and compared across the SFT, DPO, and PPO models (a scoring sketch follows the figure below).
|
|
|
| Model | Toxicity ↓ | Reward ↑ |
|
| ----- |:--------:|:--------:| |
|
| SFT_v0.1 | 0.0698 | -0.2828 | |
|
| DPO_v0.1 | 0.0356 | -0.2633 | |
|
| PPO_v0.1 | 0.0321 | 0.38 | |
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/m-k6kUuIJVTkYM2l3uBPd.png) |
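
A minimal sketch of how the scores above could be reproduced, assuming the `evaluate` toxicity measurement and an off-the-shelf reward model. The specific reward model shown (OpenAssistant/reward-model-deberta-v3-large-v2) is an assumption for illustration, not necessarily the one behind the table, and generation with each SFT/DPO/PPO checkpoint is elided.

```python
import evaluate
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

prompts = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K", split="train")["prompt"][:200]
responses = ["..."] * len(prompts)   # placeholder: generate these with the prompt template below

# Toxicity: mean score from the default classifier behind evaluate's "toxicity" measurement
toxicity = evaluate.load("toxicity", module_type="measurement")
tox = toxicity.compute(predictions=responses)["toxicity"]
print(f"mean toxicity: {sum(tox) / len(tox):.4f}")

# Reward: score (question, answer) pairs with a sequence-classification reward model
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"   # assumption, not necessarily the RM used here
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)
rm_tok = AutoTokenizer.from_pretrained(rm_name)

def reward(question, answer):
    inputs = rm_tok(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

scores = [reward(q, a) for q, a in zip(prompts, responses)]
print(f"mean reward: {sum(scores) / len(scores):.4f}")
```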
|
|
|
### Compute Infrastructure |
|
|
|
The model was trained on 8× NVIDIA RTX 3090 (24 GB) or A100 PCIe (40 GB) GPUs, with DeepSpeed ZeRO-1 sharding optimizer states across the data-parallel ranks (a sample ZeRO-1 configuration sketch follows).
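
A minimal sketch of a DeepSpeed ZeRO-1 configuration passed through the Hugging Face `Trainer`/TRL integration; the exact config used for this card is not published, so the values below (and the `"auto"` placeholders resolved by the Trainer) are assumptions. Requires `deepspeed` and `accelerate` to be installed, and is typically launched with `accelerate launch` or `torchrun`.

```python
from transformers import TrainingArguments

ds_zero1 = {
    "zero_optimization": {"stage": 1},          # ZeRO stage 1: shard optimizer states only
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_zero1,                          # a dict or a path to a JSON file both work
)
```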
|
|
|
### Inference |
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/or/hub-id/of/this/model"  # placeholder: set to the released checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# DEFINE_EOS_TOKEN is a placeholder: set it to the EOS string used at training time.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

def format_prompt(question):
    # Prompt template used during fine-tuning
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
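
If a given stage is published as a PEFT/LoRA adapter rather than merged weights (the DPO and PPO stages are trained as adapters), it can be loaded on top of the Llama-2 base with `peft`. A minimal sketch, where `"adapter-repo-or-path"` is a placeholder:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "adapter-repo-or-path")
model = model.merge_and_unload()   # optional: fold the adapter into the base weights

# Assumes the adapter repo ships the tokenizer; otherwise load it from the base model
tokenizer = AutoTokenizer.from_pretrained("adapter-repo-or-path")
```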
|
## Model Card Authors |
|
|
|
Yiyu (Michael) Ren |
|
|
|
## Model Card Contact |
|
|
|
Email: [email protected] |
|
|
|
### Framework versions |
|
|
|
- PEFT 0.8.2 |