# Training

This is an English supervised fine-tuning (SFT) model based on GPT-J, trained for 10k steps on the SODA dataset for the Chai Competition.

- **Language:** English
- **Finetuned from:** [EleutherAI / GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b)
- **Code:** [Open-Assistant/model/model_training](https://github.com/LAION-AI/Open-Assistant/tree/main/model/model_training)
- **Dataset:** 10% of the [SODA dataset](https://huggingface.co/datasets/allenai/soda)

# Why the OpenAssistant framework:

- Training is easy to set up: changing the dataset and model entries in the config is all you need.
- Data processing is already available for most popular conversation datasets: SODA, Vicuna, OpenAssistant, ...

# Configuration:

Add the following sections to the default config file `configs/config.yaml`:

```yaml
# data
soda-only:
  datasets:
    - soda:
        fraction: 0.1
        input_max_length: 1024
```

```yaml
gptj-chai:
  dtype: fp16
  log_dir: gptj-soda
  model_name: EleutherAI/gpt-j-6b
  output_dir: output/gptj-soda-chai
  max_length: 1024
  warmup_steps: 100
  gradient_checkpointing: true
  gradient_accumulation_steps: 1
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8
  eval_steps: 5000
  save_steps: 5000
  num_train_epochs: 1
  save_total_limit: 1
  use_flash_attention: false
```
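
To get a feel for how the batch settings above translate into optimizer steps, the arithmetic can be sketched as follows. The function name `optimizer_steps_per_epoch` and the example numbers (100,000 samples, one GPU) are hypothetical and are not taken from the actual run; only the batch size and accumulation values come from the `gptj-chai` section.

```python
import math


def optimizer_steps_per_epoch(num_samples, per_device_batch_size,
                              gradient_accumulation_steps, num_gpus):
    """Estimate optimizer steps for one pass over `num_samples` examples."""
    # Samples consumed by a single optimizer step across all devices.
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
    return math.ceil(num_samples / effective_batch)


# Hypothetical numbers: 100,000 training samples on a single GPU,
# with per_device_train_batch_size=8 and gradient_accumulation_steps=1.
steps = optimizer_steps_per_epoch(100_000, 8, 1, 1)
print(steps)  # 12500
```

Adding GPUs or raising `gradient_accumulation_steps` increases the effective batch size and reduces the number of steps per epoch proportionally.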

# Command to train:

```bash
deepspeed trainer_sft.py --local_rank=0 --configs defaults gptj-chai soda-only --cache_dir data_cache --deepspeed
```
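
The `--configs defaults gptj-chai soda-only` argument names three sections of `configs/config.yaml`. A plausible reading, sketched here as an assumption rather than the trainer's actual code, is that the sections are merged left to right so that later ones override earlier ones. The `merge_sections` helper and the values in the `defaults` section are hypothetical.

```python
def merge_sections(all_sections, names):
    """Merge the named config sections; later names override earlier ones."""
    merged = {}
    for name in names:
        if name not in all_sections:
            raise KeyError(f"unknown config section: {name}")
        merged.update(all_sections[name])
    return merged


# Hypothetical, heavily trimmed stand-in for configs/config.yaml.
sections = {
    "defaults": {"dtype": "fp32", "warmup_steps": 600, "eval_steps": 200},
    "gptj-chai": {"dtype": "fp16", "warmup_steps": 100,
                  "model_name": "EleutherAI/gpt-j-6b"},
    "soda-only": {"datasets": [{"soda": {"fraction": 0.1,
                                         "input_max_length": 1024}}]},
}

conf = merge_sections(sections, ["defaults", "gptj-chai", "soda-only"])
print(conf["dtype"], conf["warmup_steps"], conf["eval_steps"])  # fp16 100 200
```

Under this reading, any key that `gptj-chai` does not set falls back to whatever the `defaults` section provides.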

# Interactive Demo Code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM


class ChatBot:
    def __init__(self, path="/mnt/hdd/duyphung/gptj-soda-chai/checkpoint-10000/"):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # Load the checkpoint in fp16 on the GPU and switch to inference mode.
        self.model = AutoModelForCausalLM.from_pretrained(path).half().cuda().eval()
        # Reuse EOS as the padding token; generate() reads it from the model config.
        self.model.config.pad_token_id = self.tokenizer.eos_token_id
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def chat(self, message):
        enc_dict = self.tokenizer(message, return_tensors='pt')
        # Move every input tensor to the GPU.
        for x in enc_dict:
            enc_dict[x] = enc_dict[x].cuda()
        chat_history_ids = self.model.generate(
            input_ids=enc_dict['input_ids'],
            attention_mask=enc_dict['attention_mask'],
            max_new_tokens=64,
            temperature=0.7,
            do_sample=True,
            top_k=0,
            top_p=0.95,
        )
        # Keep only the newly generated tokens, dropping the prompt.
        out = chat_history_ids[:, enc_dict['input_ids'].shape[-1]:][0]
        return self.tokenizer.decode(out, skip_special_tokens=True)


if __name__ == "__main__":
    bot_name = 'Bot:'
    prompt = "<|prompter|>"
    chat_history = []

    bot = ChatBot()
    while True:
        message = input("Me: ")
        chat_history.append(f'Me: {message}')
        # Close the user turn and open the assistant turn.
        prompt = prompt + message + "<|endoftext|><|assistant|>"
        response = bot.chat(prompt)
        print(f'{bot_name} {response}')
        # Close the assistant turn and open the next user turn.
        prompt = prompt + response + "<|endoftext|><|prompter|>"
```
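
The demo above builds its prompt by string concatenation inside the loop. The same format can be sketched as a small pure function, which makes the turn layout easier to see and to test without loading the model. `build_prompt` is a hypothetical helper, not part of the original code; the special tokens are the ones used in the demo.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
EOS = "<|endoftext|>"


def build_prompt(turns):
    """Build the model prompt from alternating (user, bot, user, ...) turns.

    `turns` must end with a user message, so its length is odd.
    """
    if len(turns) % 2 == 0:
        raise ValueError("turns must end with a user message")
    parts = []
    for i, text in enumerate(turns):
        # Even positions are user turns, odd positions are bot turns.
        role = PROMPTER if i % 2 == 0 else ASSISTANT
        parts.append(f"{role}{text}{EOS}")
    # Leave the assistant tag open so the model generates the next reply.
    return "".join(parts) + ASSISTANT


print(build_prompt(["Hi!", "Hello, how are you?", "Fine, thanks."]))
```

Because the prompt grows with every turn while the model was trained with `max_length: 1024`, a long conversation may eventually need its oldest turns dropped before calling `build_prompt`.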