license: mit
datasets:
- openai/webgpt_comparisons
- openai/summarize_from_feedback
- Anthropic/hh-rlhf
language:
- en
Reward model on deberta-v2-xxlarge (1.5B)
Reward model for use in RLHF, trained on the WebGPT comparisons, Summarize from Human Feedback, and Open Assistant user-ranked datasets.
Model Details
Model Description
- Developed by: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): English
- License: MIT
- Finetuned from model [optional]: microsoft/deberta-v2-xxlarge
Model Sources [optional]
- Repository: Open Assistant
- Paper: InstructGPT. We try to replicate it as closely as we can with our hardware and existing datasets.
- Demo [optional]: [More Information Needed]
Uses
This model was trained on human feedback comparison examples, so bad or rude responses receive lower scores than helpful ones.
Direct Use
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'theblackcat102/deberta-v2-xxlarge-rm'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "I just got out of prison, any suggestion?"
good_helpful = "I am sorry to hear about it, it must be a hard time inside"
bad_text = "Stay away from me, you scumbag convict"
# encode each (prompt, response) pair as one two-segment input
pos = tokenizer(prompt, good_helpful, return_tensors='pt')
neg = tokenizer(prompt, bad_text, return_tensors='pt')
# the model returns a single logit per pair; higher means better
pos_score = model(**pos).logits[0]
neg_score = model(**neg).logits[0]
print(pos_score, neg_score)
>> tensor([-1.3449], grad_fn=<SelectBackward0>) tensor([-2.0942], grad_fn=<SelectBackward0>)
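The absolute scores matter less than the gap between them. As a rough sketch, and assuming the model was trained with an InstructGPT-style pairwise ranking loss (an assumption based on the paper reference above, not something this card states explicitly), the difference between two scores can be read as a preference probability through a sigmoid. `preference_probability` is a hypothetical helper name.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that response A is preferred over B, under an assumed
    Bradley-Terry style model: sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Scores copied from the example output above.
pos_score, neg_score = -1.3449, -2.0942
p = preference_probability(pos_score, neg_score)
print(f"{p:.3f}")  # 0.679
```

Read this way, the helpful reply would be preferred over the rude one roughly two times out of three.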
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
[More Information Needed]
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
How to use it as a rank function
import torch

# rank_model and rank_tokenizer are assumed to be loaded beforehand with
# AutoModelForSequenceClassification / AutoTokenizer, as in Direct Use.

def divide_chunks(l, n):
    # yield successive chunks of l, each of length n (the last may be shorter)
    for i in range(0, len(l), n):
        yield l[i:i + n]

@torch.no_grad()
def rank_model_fn(samples, **kwargs):
    output_scores = []
    for chunk_samples in divide_chunks(samples, 16):
        is_empty = []
        prefixes, postfixes = [], []
        for sample in chunk_samples:
            # each sample is "prompt [SEP] response"
            prefix, postfix = sample.split('[SEP]')
            postfix = postfix.strip()
            # flag empty or degenerate responses (at most 3 distinct characters)
            if len(postfix) == 0 or len(set(postfix)) <= 3:
                is_empty.append(True)
            else:
                is_empty.append(False)
            postfixes.append(postfix)
            prefixes.append(prefix)
        is_empty = torch.tensor(is_empty, dtype=torch.bool)
        inputs = rank_tokenizer(prefixes, postfixes, return_tensors="pt", padding=True)
        inputs.pop("token_type_ids", None)
        inputs = {key: tensor.cuda() for key, tensor in inputs.items()}
        scores = rank_model(**inputs).logits[:, 0].detach().cpu()
        # give flagged responses a fixed low score
        scores[is_empty] = -4
        output_scores += scores.tolist()
    return torch.tensor(output_scores)
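A minimal sketch of how the rank function might be used for best-of-n selection. Each sample is a single string of the form `prompt [SEP] response`, which is what `rank_model_fn` splits on. `build_sample`, `pick_best`, and the toy `length_scorer` are hypothetical names introduced for illustration; in practice you would pass `rank_model_fn` as `score_fn`.

```python
from typing import Callable, List, Sequence


def build_sample(prompt: str, response: str) -> str:
    # rank_model_fn expects exactly one '[SEP]' between prompt and response.
    return f"{prompt} [SEP] {response}"


def pick_best(prompt: str, responses: Sequence[str],
              score_fn: Callable[[List[str]], Sequence[float]]) -> str:
    # Score every candidate in one call and return the highest-scoring one.
    samples = [build_sample(prompt, r) for r in responses]
    scores = [float(s) for s in score_fn(samples)]
    best_index = max(range(len(responses)), key=lambda i: scores[i])
    return responses[best_index]


def length_scorer(samples: List[str]) -> List[float]:
    # Toy stand-in for rank_model_fn so the sketch runs without a GPU:
    # it scores each sample by the length of its response part.
    return [float(len(s.split("[SEP]")[1].strip())) for s in samples]


candidates = ["...", "Simmer it in water for about ten minutes."]
best = pick_best("How do I boil an egg?", candidates, length_scorer)
print(best)
```

With the real `rank_model_fn`, a response such as `"..."` would be flagged as degenerate (three or fewer distinct characters) and assigned the fixed score of -4.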
How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
Training Details
Training Procedure
Check out our training repo here
Preprocessing [optional]
[More Information Needed]
Training Hyperparameters
model_name: microsoft/deberta-v2-xxlarge
learning_rate: 2e-6
scheduler: cosine
gradient_checkpointing: false
gradient_accumulation_steps: 12
per_device_train_batch_size: 1
per_device_eval_batch_size: 4
warmup_steps: 600
eval_steps: 1000000
save_steps: 1000
max_length: 512
num_train_epochs: 2
datasets:
- webgpt
- hfsummary
- anthropic_rlhf
- oa_private
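For a sense of scale, the number of prompts per optimizer step follows from the settings above. This is a back-of-the-envelope sketch that assumes plain data parallelism across the 8 GPUs mentioned in the next section; the actual training script may distribute work differently.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 12
num_gpus = 8  # assumption: all 8 A100s act as data-parallel workers

prompts_per_optimizer_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(prompts_per_optimizer_step)  # 96
```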
Speeds, Sizes, Times [optional]
Trained on 8 A100 80GB GPUs. Because we use the same batching strategy as InstructGPT, a batch_size of 1 actually amounts to a batch of (N-1), where N refers to the number of negative examples. This is why I recommend training this model on the GPU with the most VRAM you can find.
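To make the batching remark concrete, here is a sketch of an InstructGPT-style pairwise ranking loss for a single prompt, written in plain Python for readability. It assumes the responses are ordered from best to worst and that every better/worse pair contributes one term; `pairwise_ranking_loss` is a hypothetical name, and the actual code in the training repo may be organised differently.

```python
import math
from itertools import combinations
from typing import Sequence


def pairwise_ranking_loss(scores_best_to_worst: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_better - r_worse)) over all ranked pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the better-ranked response.
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    # -log(sigmoid(x)) is the same as log(1 + exp(-x)).
    losses = [math.log1p(math.exp(-(better - worse))) for better, worse in pairs]
    return sum(losses) / len(pairs)


# One prompt with three ranked responses yields three comparisons.
print(pairwise_ranking_loss([1.0, 0.0, -1.0]))
```

Since all responses for a prompt are scored together, memory use grows with the number of ranked responses even when the nominal batch size is 1.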
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
Summary
Model Examination [optional]
[More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]