---
license: cc-by-nc-nd-4.0
language:
- es
pipeline_tag: text-generation
tags:
- dialogue
- conversational
- gpt
- gpt2
- text-generation
- spanish
- dialogpt
- chitchat
- ITG
inference: false
---

# DialoGPT-medium-spanish-chitchat

## Description

This is a **transformer-decoder** [GPT-2 model](https://huggingface.co/gpt2) adapted for the **single-turn dialogue task in Spanish**. We fine-tuned Microsoft's 345M-parameter [DialoGPT-medium](https://huggingface.co/microsoft/DialoGPT-medium) model, following the CLM (Causal Language Modelling) objective.
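
To make the CLM objective concrete, here is a minimal, hypothetical sketch of how a single question/answer pair can be scored under causal language modelling with `transformers`. The pair format (question and answer joined with EOS tokens) is an assumption for illustration, not the exact preprocessing used for fine-tuning.

```python
# Minimal CLM sketch (illustrative): concatenate a question/answer pair with EOS
# tokens and use the input ids as labels, so the loss is next-token prediction.
# The pair format below is an assumption, not the exact fine-tuning preprocessing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

pair = "¿Qué tal estás?" + tokenizer.eos_token + "Fenomenal, gracias." + tokenizer.eos_token
inputs = tokenizer(pair, return_tensors="pt")

with torch.no_grad():
    # For causal language modelling, the labels are the input ids; the model shifts them internally.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"CLM loss for this pair: {outputs.loss.item():.4f}")
```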

---

## Dataset

We used one of the datasets available in the [Bot Framework Tools repository](https://github.com/microsoft/botframework-cli). We processed [the professional-styled personality chat dataset in Spanish](https://github.com/microsoft/botframework-cli/blob/main/packages/qnamaker/docs/chit-chat-dataset.md); the file is available [for download here](https://qnamakerstore.blob.core.windows.net/qnamakerdata/editorial/spanish/qna_chitchat_professional.tsv).
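
As a rough idea of the kind of preprocessing involved, the sketch below reads the TSV and builds single-turn training strings. The column names (`Question`, `Answer`) and the EOS-joined format are assumptions about the file layout, not the exact pipeline we used.

```python
# Hypothetical preprocessing sketch: build "question <eos> answer <eos>" strings
# from the chit-chat TSV. Column names are assumed to be "Question" and "Answer".
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
eos = tokenizer.eos_token

df = pd.read_csv("qna_chitchat_professional.tsv", sep="\t")
dialogue_pairs = [f"{q}{eos}{a}{eos}" for q, a in zip(df["Question"], df["Answer"])]

print(f"{len(dialogue_pairs)} single-turn pairs, e.g.: {dialogue_pairs[0]!r}")
```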

---

## Example inference script

### Check out this example script to run our model in inference mode

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHAT_TURNS = 5
MAX_LENGTH = 1000

model = AutoModelForCausalLM.from_pretrained('ITG/DialoGPT-medium-spanish-chitchat')
tokenizer = AutoTokenizer.from_pretrained('ITG/DialoGPT-medium-spanish-chitchat')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for i in range(CHAT_TURNS):
    user_input = input(f"Step - {i} >> user prompt ->")
    with torch.no_grad():
        # User turn, where "user_input" is the question (single-turn dialogue task)
        user_inputs_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")
        user_inputs_ids = user_inputs_ids.to(device)
        # The chat history adds the generated tokens for the answer
        chat_history = model.generate(user_inputs_ids, max_length=MAX_LENGTH, pad_token_id=tokenizer.eos_token_id)
        # Decode just the newly generated answer tokens (do not include the user prompt again)
        step_model_answer = tokenizer.decode(chat_history[:, user_inputs_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"Step - {i} >> DialoGPT-spanish model answer -> {step_model_answer}")
```
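
The script above uses the default greedy decoding from `generate`. If you prefer more varied answers, you can pass standard sampling arguments instead; the values below are illustrative, not the settings used to produce the examples in this card.

```python
# Drop-in replacement for the `model.generate` call in the loop above:
# sampling-based decoding for more varied answers (illustrative values).
chat_history = model.generate(
    user_inputs_ids,
    max_length=MAX_LENGTH,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
```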

---

## Examples

### Single-turn generation examples from an interactive environment

| Role | Response |
|------|----------|
| User | Hola, buenos dias |
| Bot  | Buenos días |
| User | ¿Qué tal estás? |
| Bot  | Fenomenal, gracias. |
| User | ¡Tengo hambre! |
| Bot  | Quizá picar algo te ayude. |
| User | Estoy triste |
| Bot  | Siento escuchar eso. |
| User | ¿Tienes personalidad? |
| Bot  | Soy un bot. |
| User | ¡Cuenta un chiste! |
| Bot  | En realidad no tengo tanta gracia. |
| User | Jajajajajajajaja |
| Bot  | ¡Me alegro de que te rías! |

---

## Fine-tuning hyper-parameters

| **Hyper-parameter**                 | **Value**                 |
|:-----------------------------------:|:-------------------------:|
| Validation partition (%)            | 20%                       |
| Training batch size                 | 8                         |
| Learning rate                       | 5e-4                      |
| Max training epochs                 | 20                        |
| Warmup training steps (%)           | 6%                        |
| Weight decay                        | 0.01                      |
| Optimiser (beta1, beta2, epsilon)   | AdamW (0.9, 0.999, 1e-08) |
| Monitoring metric (delta, patience) | Validation loss (0.1, 3)  |
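
For reference, the table above maps roughly onto the `transformers` `TrainingArguments` and `EarlyStoppingCallback` shown below. This is an approximate sketch (argument names, output directory, and dataset handling are assumptions), not the exact script used to produce this checkpoint.

```python
# Approximate sketch of the hyper-parameters above expressed with the Trainer API.
# This is not the original training script; names and values are for illustration.
from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="dialogpt-medium-spanish-chitchat",
    per_device_train_batch_size=8,          # training batch size
    learning_rate=5e-4,
    num_train_epochs=20,                    # max training epochs
    warmup_ratio=0.06,                      # 6% warmup steps
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",      # monitor validation loss
    greater_is_better=False,
)

# Stop when the validation loss has not improved by at least 0.1 for 3 evaluations.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.1)
```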

## Fine-tuning on a different dataset or style

If you want to fine-tune your own dialogue model, we recommend starting from the [DialoGPT model](https://huggingface.co/microsoft/DialoGPT-medium).
You can check the [original GitHub repository](https://github.com/microsoft/DialoGPT).

## Limitations

- This model uses the original English-based tokenizer from the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
  A Spanish-specific tokenizer was not trained, but Spanish and English share enough grammatical structure that this overlap may help the model transfer its knowledge from English to Spanish.
  Moreover, the BPE (Byte Pair Encoding) implementation of the GPT-2 tokenizer **can assign a representation to every Unicode string** (see the short sketch after this list).
  **From the GPT-2 paper**:
  > Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.
- This model is intended to be used **just for single-turn chitchat conversations in Spanish**.
- This model's generation capabilities are limited by the scope of the aforementioned fine-tuning dataset.
- This model generates short answers, providing general-context dialogue in a professional style for the Spanish language.
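
As a quick illustration of the byte-level BPE point in the first limitation, the hypothetical snippet below shows that the tokenizer round-trips Spanish text with accented characters, even though its merges were learned on English.

```python
# Illustrative only: the byte-level BPE tokenizer can represent any Unicode string,
# so Spanish text with accents round-trips losslessly (it may simply use more tokens
# per word than comparable English text).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ITG/DialoGPT-medium-spanish-chitchat")

text = "¿Qué tal estás?"
token_ids = tokenizer.encode(text)

print(token_ids)
print(tokenizer.decode(token_ids) == text)  # True: the round trip is exact
```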