Iñigo López-Riobóo Botana committed on
Commit
8e5fab5
1 Parent(s): 6f72295

Update README.md

Files changed (1)
  1. README.md +5 -1
README.md CHANGED
@@ -98,7 +98,11 @@ You can check the [original GitHub repository](https://github.com/microsoft/Dial
 
  ## Limitations
 
- - This model uses the original English-based tokenizer from [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). Spanish tokenization is not considered but it has similarities in grammatical structure for encoding text. This overlap may help the model transfer its knowledge from English to Spanish.
+ - This model uses the original English-based tokenizer from [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
+ Spanish tokenization is not considered but it has similarities in grammatical structure for encoding text. This overlap may help the model transfer its knowledge from English to Spanish.
+ Moreover, the BPE (Byte Pair Encoding) implementation of the GPT-2 tokenizer **can assign a representation to every Unicode string**.
+ **From the GPT-2 paper**:
+ > Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.
  - This model is intended to be used **just for single-turn chitchat conversations in Spanish**.
  - This model's generation capabilities are limited to the extent of the aforementioned fine-tuning dataset.
  - This model generates short answers, providing general context dialogue in a professional style.
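
The claim in the added lines, that a byte-level BPE tokenizer can represent every Unicode string, can be illustrated with a minimal Python sketch. It reproduces only the byte-to-symbol table from the GPT-2 reference encoder and skips the learned merges entirely, so it shows the 256-symbol fallback alphabet, not the token ids this model actually produces. The helper names `to_base_symbols` and `from_base_symbols` are hypothetical and exist only for this illustration.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character,
    following the table used by the GPT-2 reference encoder."""
    # Bytes that already correspond to printable characters keep their code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, ...) are shifted above 255.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_SYMBOL = bytes_to_unicode()
SYMBOL_TO_BYTE = {s: b for b, s in BYTE_TO_SYMBOL.items()}


def to_base_symbols(text: str) -> str:
    """Hypothetical helper: UTF-8 encode the text, then map every byte to its
    base symbol. BPE merges would later group these symbols into tokens."""
    return "".join(BYTE_TO_SYMBOL[b] for b in text.encode("utf-8"))


def from_base_symbols(symbols: str) -> str:
    """Hypothetical helper: invert to_base_symbols."""
    return bytes(SYMBOL_TO_BYTE[s] for s in symbols).decode("utf-8")


if __name__ == "__main__":
    sample = "¿Cómo estás, Iñigo?"
    symbols = to_base_symbols(sample)
    print(symbols)
    print(from_base_symbols(symbols) == sample)
```

Because every string reduces to UTF-8 bytes and every byte has a base symbol, no Spanish character falls outside the vocabulary. Accented letters simply cost more than one base symbol each unless a learned merge covers them, which is consistent with the limitation described above.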