	---
	license: mit
	datasets:
	- openai/webgpt_comparisons
	- openai/summarize_from_feedback
	- Anthropic/hh-rlhf
	language:
	- en
	---

	# Reward model on deberta-v2-xxlarge (1.5B)

	A reward model for RLHF, trained on the WebGPT comparisons, Summarize from Human Feedback, Anthropic HH-RLHF, and Open Assistant user-ranked datasets.

	# Model Details

	## Model Description

	- Developed by: [More Information Needed]
	- Shared by [optional]: [More Information Needed]
	- Model type: Reward model (DeBERTa-v2 sequence classifier with a single scalar output)
	- Language(s) (NLP): English
	- License: MIT
	- Finetuned from model [optional]: microsoft/deberta-v2-xxlarge

	## Model Sources [optional]

	<!-- Provide the basic links for the model. -->

	- Repository: [Open Assistant](https://github.com/LAION-AI/Open-Assistant)
	- Paper: [InstructGPT](https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf). We try to replicate it as closely as we can with our hardware and existing datasets.
	- Demo [optional]: [More Information Needed]

	# Uses

	This model was trained on human feedback comparison examples, so it assigns lower scores to bad or rude responses.

	## Direct Use

	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	model_name = 'theblackcat102/deberta-v2-xxlarge-rm'
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	tokenizer = AutoTokenizer.from_pretrained(model_name)

	prompt = "I just got out of prison, any suggestion?"
	good_helpful = "I am sorry to hear about it, it must be a hard time inside"
	bad_text = "Stay away from me, you scumbag convict"

	# Encode each (prompt, response) pair as a two-segment input.
	pos = tokenizer(prompt, good_helpful, return_tensors='pt')
	neg = tokenizer(prompt, bad_text, return_tensors='pt')

	# The model returns one logit per input; a higher value means a better response.
	pos_score = model(**pos).logits[0]
	neg_score = model(**neg).logits[0]
	print(pos_score, neg_score)
	# Output:
	# tensor([-1.3449], grad_fn=<SelectBackward0>) tensor([-2.0942], grad_fn=<SelectBackward0>)
	```
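The two scores are only meaningful relative to each other. As a minimal sketch, assuming the model was trained with an InstructGPT-style pairwise ranking objective (the card does not state the loss explicitly), the score difference can be turned into a preference probability with a sigmoid. The helper name `preference_probability` is hypothetical, and the scores are copied from the example output above.

```python
import math

def preference_probability(score_a, score_b):
    """Probability that response A is preferred over B (Bradley-Terry form)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Scores copied from the example output above.
pos_score, neg_score = -1.3449, -2.0942
p = preference_probability(pos_score, neg_score)
print(f"P(helpful reply preferred) = {p:.3f}")
```

Under this assumption the helpful reply is preferred with a probability of roughly 0.68, so the absolute values (both negative here) matter less than the gap between them.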



	## Downstream Use [optional]

	<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

	[More Information Needed]

	## Out-of-Scope Use

	<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

	[More Information Needed]

	# Bias, Risks, and Limitations

	<!-- This section is meant to convey both technical and sociotechnical limitations. -->

	[More Information Needed]

	## Recommendations

	How to use the model as a ranking function:

	```python
	import torch

	# rank_model and rank_tokenizer are assumed to be loaded beforehand (see "Direct Use"),
	# with rank_model already moved to the GPU.

	def divide_chunks(l, n):
	    """Yield successive chunks of size n from list l."""
	    for i in range(0, len(l), n):
	        yield l[i:i + n]

	@torch.no_grad()
	def rank_model_fn(samples, **kwargs):
	    """Score samples formatted as '<prompt>[SEP]<answer>'; returns a 1-D tensor."""
	    output_scores = []
	    for chunk_samples in divide_chunks(samples, 16):
	        is_empty = []
	        prefixes, postfixes = [], []
	        for sample in chunk_samples:
	            prefix, postfix = sample.split('[SEP]')
	            postfix = postfix.strip()
	            # Flag empty or degenerate answers (at most 3 distinct characters).
	            is_empty.append(len(postfix) == 0 or len(set(postfix)) <= 3)
	            postfixes.append(postfix)
	            prefixes.append(prefix)
	        inputs = rank_tokenizer(prefixes, postfixes, return_tensors="pt", padding=True)
	        inputs.pop("token_type_ids", None)
	        inputs = {key: tensor.cuda() for key, tensor in inputs.items()}
	        scores = rank_model(**inputs).logits[:, 0].detach().cpu()
	        # Give flagged answers a fixed low score.
	        scores[torch.tensor(is_empty)] = -4
	        output_scores.append(scores)
	    return torch.cat(output_scores) if output_scores else torch.empty(0)
	```
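The ranking function above expects each sample as a single string of the form `prompt[SEP]answer`. The following sketch shows one way to build such samples and sort candidate answers by score. The helpers `build_sample` and `rank_candidates` are hypothetical names, and `toy_score_fn` is a stand-in scorer (it simply uses answer length) so that the example runs without a GPU; in practice you would pass `rank_model_fn` instead.

```python
def build_sample(prompt, answer):
    """Join prompt and answer in the 'prefix[SEP]postfix' form the ranker splits on."""
    return f"{prompt}[SEP]{answer}"

def rank_candidates(prompt, candidates, score_fn):
    """Return (answer, score) pairs sorted from highest to lowest score."""
    samples = [build_sample(prompt, c) for c in candidates]
    scores = [float(s) for s in score_fn(samples)]
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

def toy_score_fn(samples):
    """Hypothetical stand-in scorer (answer length); replace with rank_model_fn."""
    return [len(s.split("[SEP]")[1]) for s in samples]

ranked = rank_candidates(
    "How do I boil an egg?",
    ["No idea.", "Simmer it in water for about ten minutes."],
    toy_score_fn,
)
print(ranked[0][0])
```

Note that the prompt and answers must not themselves contain `[SEP]`, because the ranking function splits each sample into exactly two parts.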

	## How to Get Started with the Model

	Use the code below to get started with the model.

	[More Information Needed]

	# Training Details


	## Training Procedure

	Check out our training code [here](https://github.com/LAION-AI/Open-Assistant/tree/main/model/reward/instructor).


	### Preprocessing [optional]

	[More Information Needed]


	### Training Hyperparameters

	```yaml
	model_name: microsoft/deberta-v2-xxlarge
	learning_rate: 2e-6
	scheduler: cosine
	gradient_checkpointing: false
	gradient_accumulation_steps: 12
	per_device_train_batch_size: 1
	per_device_eval_batch_size: 4
	warmup_steps: 600
	eval_steps: 1000000
	save_steps: 1000
	max_length: 512
	num_train_epochs: 2
	datasets:
	- webgpt
	- hfsummary
	- anthropic_rlhf
	- oa_private
	```
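To make the `scheduler`, `learning_rate`, and `warmup_steps` settings concrete, here is a rough sketch of a cosine schedule with linear warmup. It assumes the common shape of linear warmup followed by a half-cosine decay to zero; the training repo's actual scheduler may differ. `TOTAL_STEPS` is a hypothetical value, since the real number depends on dataset size and the number of GPUs.

```python
import math

BASE_LR = 2e-6        # learning_rate from the config above
WARMUP_STEPS = 600    # warmup_steps from the config above
TOTAL_STEPS = 10_000  # hypothetical; not stated in the card

def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in (0, 300, 600, 5300, 10_000):
    print(step, lr_at(step))
```

With these assumed values the rate climbs to 2e-6 at step 600, falls to about half of that midway through the decay, and reaches zero at the final step.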

	### Speeds, Sizes, Times [optional]

	Trained on 8 A100 80GB GPUs. Because we use the same batching strategy as InstructGPT, a batch_size of 1 is actually equivalent to a batch of (N-1), where N refers to the number of negative examples. This is why I recommend training this model on the GPU with the largest VRAM you can find.
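As a rough illustration of why one prompt expands into several comparisons, the sketch below computes an InstructGPT-style pairwise ranking loss over all ranked responses to a single prompt. This is a simplified version based on the InstructGPT paper, not necessarily the exact implementation in the training repo, and `pairwise_ranking_loss` is a hypothetical name.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores_best_to_worst):
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs for one prompt."""
    # combinations() keeps input order, so the first item of each pair ranks higher.
    pairs = list(combinations(scores_best_to_worst, 2))
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    losses = [math.log1p(math.exp(-(better - worse))) for better, worse in pairs]
    return sum(losses) / len(pairs)

# Reward scores for three responses to one prompt, ordered best to worst.
well_ordered = pairwise_ranking_loss([1.0, 0.0, -1.0])
mis_ordered = pairwise_ranking_loss([-1.0, 0.0, 1.0])
print(well_ordered, mis_ordered)
```

All responses for a prompt have to be scored in the same forward pass, so memory use grows with the number of ranked responses even when `per_device_train_batch_size` is 1.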

	# Evaluation

	<!-- This section describes the evaluation protocols and provides the results. -->

	## Testing Data, Factors & Metrics

	### Testing Data

	<!-- This should link to a Data Card if possible. -->

	[More Information Needed]

	### Factors

	<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

	[More Information Needed]

	### Metrics

	<!-- These are the evaluation metrics being used, ideally with a description of why. -->

	[More Information Needed]

	## Results

	[More Information Needed]

	### Summary



	# Model Examination [optional]

	<!-- Relevant interpretability work for the model goes here -->

	[More Information Needed]


	# Technical Specifications [optional]

	## Model Architecture and Objective

	[More Information Needed]

	## Compute Infrastructure

	[More Information Needed]

	### Hardware

	8 A100 80GB GPUs

	### Software

	[More Information Needed]

	# Citation [optional]

	<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

	BibTeX:

	[More Information Needed]

	APA:

	[More Information Needed]

	# Glossary [optional]

	<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

	[More Information Needed]

	# More Information [optional]

	[More Information Needed]

	# Model Card Authors [optional]

	[More Information Needed]

	# Model Card Contact

	[More Information Needed]