|
--- |
|
base_model: meta-llama/Llama-2-7b-hf |
|
--- |
|
|
|
# Model Details |
|
|
|
- SFT: fine-tuned from [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on merged Alpaca datasets

- DPO: trained on top of the SFT model as a LoRA adapter, on merged [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) data

- PPO: trained on top of the DPO model and a reward model, using multiple adapters, on [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) data for further RLHF

- Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash Attention 2 (a hedged training-setup sketch follows this list)
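
Below is a minimal sketch of the QLoRA SFT stage with TRL, assuming a TRL 0.7/0.8-era `SFTTrainer` API and the `###Question/###Answer` prompt template from the Inference section. The hyperparameters, LoRA target modules, and batch sizes are illustrative assumptions, not the exact values used for this card; the DPO stage follows the same pattern with `DPOTrainer`, training a LoRA adapter on top of the SFT model.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

bnb_config = BitsAndBytesConfig(                 # QLoRA: 4-bit NF4 base weights
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",     # requires flash-attn; drop if unavailable
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(                        # LoRA adapter trained on top of the frozen 4-bit base
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("yahma/alpaca-cleaned", split="train")  # one of the merged SFT datasets

def formatting_func(batch):
    # Build "###Question/###Answer" training strings from Alpaca-style records
    texts = []
    for instr, inp, out in zip(batch["instruction"], batch["input"], batch["output"]):
        question = instr if not inp else f"{instr}\n{inp}"
        texts.append(f"###Question: {question}\n###Answer: {out}{tokenizer.eos_token}")
    return texts

trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, bf16=True, num_train_epochs=1),
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    formatting_func=formatting_func,
    max_seq_length=1024,
    packing=False,
)
trainer.train()
```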
|
|
|
|
|
## Model and Training Details |
|
|
|
- **Finetuned from model:** [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
|
|
|
- **Dataset:** (a dataset-mixing sketch follows this list)
|
- SFT (mixed train): |
|
- [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) |
|
- [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) |
|
- DPO (mixed train): |
|
- [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) |
|
- [Unified-Language-Model-Alignment/Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden) |
|
- PPO: |
|
- [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K) |
|
- [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) |
|
- [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) |
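
A minimal sketch of the "mixed train" dataset preparation with the `datasets` library; the shared-column alignment and the shuffling seed are assumptions, not the exact recipe used for this card.

```python
from datasets import concatenate_datasets, load_dataset

def mix(names, split="train"):
    # Load each dataset, keep only the columns they share, then concatenate and shuffle
    parts = [load_dataset(n, split=split) for n in names]
    common = [c for c in parts[0].column_names if all(c in p.column_names for p in parts)]
    return concatenate_datasets([p.select_columns(common) for p in parts]).shuffle(seed=42)

sft_mix = mix(["yahma/alpaca-cleaned", "vicgalle/alpaca-gpt4"])
dpo_mix = mix(["Anthropic/hh-rlhf",
               "Unified-Language-Model-Alignment/Anthropic_HH_Golden"])
ppo_prompts = load_dataset("PKU-Alignment/PKU-SafeRLHF-10K", split="train")

print(sft_mix)
print(dpo_mix)
print(ppo_prompts)
```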
|
|
|
### Training Results |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/miik5Tb6A8G6sDTlnQA-V.png) |
|
|
|
### Evaluation |
|
|
|
Reward and toxicity scores are computed on [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) prompts and compared across the SFT, DPO, and PPO models (a scoring sketch follows the figure below).
|
|
|
| Model | Toxicity ↓ | Reward ↑ |
|
| ----- |:--------:|:--------:| |
|
| SFT_v0.1 | 0.0698 | -0.2828 | |
|
| DPO_v0.1 | 0.0356 | -0.2633 | |
|
| PPO_v0.1 | 0.0321 | 0.38 | |
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/m-k6kUuIJVTkYM2l3uBPd.png) |
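
A minimal sketch of how the scores above could be reproduced, assuming the `evaluate` toxicity measurement and an off-the-shelf reward model. The specific reward model shown (OpenAssistant/reward-model-deberta-v3-large-v2) is an assumption for illustration, not necessarily the one behind the table, and generation with each SFT/DPO/PPO checkpoint is elided.

```python
import evaluate
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

prompts = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K", split="train")["prompt"][:200]
responses = ["..."] * len(prompts)   # placeholder: generate these with the prompt template below

# Toxicity: mean score from the default classifier behind evaluate's "toxicity" measurement
toxicity = evaluate.load("toxicity", module_type="measurement")
tox = toxicity.compute(predictions=responses)["toxicity"]
print(f"mean toxicity: {sum(tox) / len(tox):.4f}")

# Reward: score (question, answer) pairs with a sequence-classification reward model
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"   # assumption, not necessarily the RM used here
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)
rm_tok = AutoTokenizer.from_pretrained(rm_name)

def reward(question, answer):
    inputs = rm_tok(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

scores = [reward(q, a) for q, a in zip(prompts, responses)]
print(f"mean reward: {sum(scores) / len(scores):.4f}")
```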
|
|
|
### Compute Infrastructure |
|
|
|
The model was trained on 8× NVIDIA RTX 3090 (24 GB) or A100 PCIe (40 GB) GPUs, with DeepSpeed ZeRO-1 sharding optimizer states across the data-parallel ranks (a sample ZeRO-1 configuration sketch follows).
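
A minimal sketch of a DeepSpeed ZeRO-1 configuration passed through the Hugging Face `Trainer`/TRL integration; the exact config used for this card is not published, so the values below (and the `"auto"` placeholders resolved by the Trainer) are assumptions. Requires `deepspeed` and `accelerate` to be installed, and is typically launched with `accelerate launch` or `torchrun`.

```python
from transformers import TrainingArguments

ds_zero1 = {
    "zero_optimization": {"stage": 1},          # ZeRO stage 1: shard optimizer states only
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_zero1,                          # a dict or a path to a JSON file both work
)
```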
|
|
|
### Inference |
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/or/hub-id/of/this/model"  # placeholder: set to the released checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# DEFINE_EOS_TOKEN is a placeholder: set it to the EOS string used at training time.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

def format_prompt(question):
    # Prompt template used during fine-tuning
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
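
If a given stage is published as a PEFT/LoRA adapter rather than merged weights (the DPO and PPO stages are trained as adapters), it can be loaded on top of the Llama-2 base with `peft`. A minimal sketch, where `"adapter-repo-or-path"` is a placeholder:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "adapter-repo-or-path")
model = model.merge_and_unload()   # optional: fold the adapter into the base weights

# Assumes the adapter repo ships the tokenizer; otherwise load it from the base model
tokenizer = AutoTokenizer.from_pretrained("adapter-repo-or-path")
```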
|
## Model Card Authors |
|
|
|
Yiyu (Michael) Ren |
|
|
|
## Model Card Contact |
|
|
|
Email: [email protected] |
|
|
|
### Framework versions |
|
|
|
- PEFT 0.8.2 |