Iñigo López-Riobóo Botana committed
Commit 8e5fab5 · 1 Parent(s): 6f72295
Update README.md
README.md CHANGED

@@ -98,7 +98,11 @@ You can check the [original GitHub repository](https://github.com/microsoft/DialoGPT)
 
 ## Limitations
 
-- This model uses the original English-based tokenizer from the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
+- This model uses the original English-based tokenizer from the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
+Spanish tokenization was not considered, but Spanish and English have similar grammatical structure for encoding text, and this overlap may help the model transfer its knowledge from English to Spanish.
+Moreover, the BPE (Byte Pair Encoding) implementation of the GPT-2 tokenizer **can assign a representation to every Unicode string**.
+**From the GPT-2 paper**:
+> Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.
 - This model is intended to be used **just for single-turn chitchat conversations in Spanish**.
 - This model's generation capabilities are limited to the extent of the aforementioned fine-tuning dataset.
 - This model generates short answers, providing general context dialogue in a professional style.
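The byte-level BPE point in the updated text is easy to verify: any Spanish string, accents and "ñ" included, round-trips through the English GPT-2 tokenizer with no unknown tokens. A minimal sketch, assuming the Hugging Face `transformers` package and the stock `gpt2` tokenizer (the example string is illustrative):

```python
from transformers import AutoTokenizer

# The original English GPT-2 tokenizer, which DialoGPT reuses unchanged.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "¿Qué tal? Mañana iré a la peluquería."
ids = tokenizer.encode(text)

# Byte-level BPE falls back to byte-sized pieces, so every Unicode string
# gets a representation, and decoding recovers the input exactly.
print(ids)
assert tokenizer.decode(ids) == text
```

Because the BPE merges were learned on English data, Spanish text typically costs more tokens per word than English, but nothing is ever mapped to an unknown token.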
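For the single-turn usage described in the bullets, the sketch below shows how a DialoGPT-style model is commonly queried: one user turn terminated by the EOS token, one generated reply. The model id `your-user/dialogpt-spanish` is a hypothetical placeholder, not this repository's actual name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-user/dialogpt-spanish"  # hypothetical id; substitute the real one
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# DialoGPT-style inputs terminate each turn with the EOS token.
prompt = "Hola, ¿qué tal la semana?" + tokenizer.eos_token
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate one short reply; the model is tuned for single-turn exchanges,
# so no conversation history is accumulated here.
output_ids = model.generate(
    input_ids,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)
reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(reply)
```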