---
language:
- nl
- en
- multilingual
license: apache-2.0
tags:
- dutch
- english
- t5
- t5x
- ul2
- seq2seq
- translation
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/nedd_wiki_news
pipeline_tag: translation
widget:
- text: >-
    Redistricting and West Virginia’s shrinking population forced the state’s
    Republican Legislature to pit Mr. McKinley, a six-term Republican with a
    pragmatic bent, against Mr. Mooney, who has served four terms marked more
    by conservative rhetoric than legislative achievements.
- text: >-
    It is a painful and tragic spectacle that rises before me: I have drawn
    back the curtain from the rottenness of man. This word, in my mouth, is at
    least free from one suspicion: that it involves a moral accusation against
    humanity.
- text: >-
    Young Wehling was hunched in his chair, his head in his hand. He was so
    rumpled, so still and colorless as to be virtually invisible. His
    camouflage was perfect, since the waiting room had a disorderly and
    demoralized air, too. Chairs and ashtrays had been moved away from the
    walls. The floor was paved with spattered dropcloths.
---

# ul2-large-en-nl for English to Dutch translation

A T5 model fine-tuned for English-to-Dutch translation, pretrained on Dutch using a UL2 (Mixture-of-Denoisers) objective.
The T5 model was introduced in [this paper](https://arxiv.org/abs/1910.10683)
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
The UL2 objective was introduced in [this paper](https://arxiv.org/abs/2205.05131)
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).

## Model description

T5 is an encoder-decoder model that treats all NLP problems in a text-to-text format.

`ul2-large-en-nl-v3` is a T5-based transformers model fine-tuned on parallel sentence and paragraph pairs
sampled from books.

This model used the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model during pretraining:

- GEGLU activation in the feed-forward hidden layer, rather than ReLU (see [here](https://arxiv.org/abs/2002.05202) and the sketch after this list)
- Dropout was turned off during pre-training; it should be re-enabled during fine-tuning
- Pre-trained on the self-supervised objective only, without mixing in downstream tasks
- No parameter sharing between the embedding and classifier layers
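
To make the first item concrete, here is a minimal sketch of a GEGLU feed-forward block. It illustrates the idea only and is not the exact T5X/transformers implementation; the class and attribute names are mine, and the dimensions are whatever the checkpoint actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GEGLUFeedForward(nn.Module):
    """Feed-forward block with a GELU-gated linear unit: GELU(x W_0) * (x W_1), then project back."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # gated branch, passed through GELU
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # linear branch
        self.wo = nn.Linear(d_ff, d_model, bias=False)    # projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(F.gelu(self.wi_0(x)) * self.wi_1(x))
```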

### UL2 pretraining objective

This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
paradigms. UL2 frames different objective functions for training language models as denoising tasks, where
the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture of
denoisers that samples from a varied set of such objectives, each with a different configuration. UL2 is trained
using a mixture of three denoising tasks:

1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
2. X-denoising (or extreme span corruption); and
3. S-denoising (or sequential PrefixLM).

During pre-training, the denoising tasks are sampled according to user-specified ratios.
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific
pre-training denoising task. During pre-training, a paradigm token is inserted into the input
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) to indicate the denoising task at hand.
Then, during fine-tuning, the same token should be inserted to get the best performance on the downstream task
(see the sketch below).
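
As a minimal illustration of mode switching: the exact prefix convention (paradigm token followed by a space) is an assumption taken from the UL2 paper, the `with_mode` helper is hypothetical, and note that the translation example under "How to use" below passes the text without any prefix.

```python
# Paradigm tokens defined by the UL2 mixture-of-denoisers.
MODE_TOKENS = {"R": "[NLU]", "X": "[NLG]", "S": "[S2S]"}


def with_mode(text: str, mode: str = "S") -> str:
    """Prepend the paradigm token matching the pre-training denoising task."""
    return f"{MODE_TOKENS[mode]} {text}"


print(with_mode("Young Wehling was hunched in his chair.", mode="S"))
# [S2S] Young Wehling was hunched in his chair.
```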

## Intended uses & limitations

This model was fine-tuned on parallel sentence and paragraph pairs and can be used
for machine translation.

### How to use

Here is how to use this model in PyTorch:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_name = "yhavinga/ul2-large-en-nl-v3"

# Use the first GPU if available, otherwise run on CPU.
device_num = 0 if torch.cuda.is_available() else -1
device = "cpu" if device_num < 0 else f"cuda:{device_num}"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Beam search, capped at the 370-token sequence length used during fine-tuning.
params = {"max_length": 370, "num_beams": 4, "early_stopping": True}
translator = pipeline("translation", tokenizer=tokenizer, model=model, device=device_num)

print(
    translator(
        "Young Wehling was hunched in his chair, his head in his hand. "
        "He was so rumpled, so still and colorless as to be virtually invisible.",
        **params,
    )[0]["translation_text"]
)
```

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
Therefore, the model can produce biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

The `ul2-large-en-nl` T5 model was pre-trained simultaneously on a combination of several datasets,
including the `full` config of the `mc4_nl_cleaned` dataset (a cleaned version of Common Crawl's web
crawl corpus), Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of `mc4_nl_cleaned`
containing only texts from Dutch newspapers.
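
The web-crawl portion can be inspected as follows. This is a sketch assuming the `yhavinga/mc4_nl_cleaned` dataset id and `full` config named above; the `train` split and `text` field are assumptions based on the usual mC4 layout.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading it in full.
mc4_nl = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

# Print the first characters of a couple of documents.
for example in mc4_nl.take(2):
    print(example["text"][:200])
```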

After pre-training, the model was fine-tuned on a translation dataset containing 13 million sentence and paragraph pairs
sampled from books.

## Training procedure

### Preprocessing

The ul2-large-en-nl T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
The tokenizer includes the special tokens `<pad>`, `</s>` and `<unk>` known from the original T5 paper,
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
The tokenizer does not lowercase texts and is therefore case-sensitive: it distinguishes
between `dutch` and `Dutch`.
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total vocabulary of 32,128 tokens.
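
These tokenizer properties can be checked directly. This is a quick sketch; the printed values should match the description above, but verify against the actual checkpoint.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/ul2-large-en-nl-v3", use_fast=False)

# Total vocabulary size: 32,000 SentencePiece pieces plus the extra tokens described above.
print(len(tokenizer))

# Case-sensitive: "dutch" and "Dutch" tokenize differently.
print(tokenizer.tokenize("dutch"), tokenizer.tokenize("Dutch"))

# The UL2 paradigm tokens should map to dedicated ids rather than the unknown token.
print(tokenizer.convert_tokens_to_ids(["[NLU]", "[NLG]", "[S2S]"]))
```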

### Fine-tuning

This model was fine-tuned for three epochs on a dataset containing 13M sentence and paragraph translation pairs sampled
from books.

Wandb run: https://wandb.ai/yepster/ul2-large-de-neddx2-en-nl/runs/30arxggk?workspace=user-yepster

- Pre-trained model used as starting point: yhavinga/ul2-large-dutch-english (3150k checkpoint)

For the concluding ~half epoch, a Hugging Face Flax-based trainer was used with the following settings (an equivalent optimizer setup is sketched after the list):

- **Batch Size**: Total effective batch size of 512, achieved via per-device settings and gradient accumulation.
- **Learning Rate**: Set at 0.0009, with a linear schedule and 500-step warmup.
- **Optimizer**: AdamW with beta1=0.9, beta2=0.997, epsilon=1e-8.
- **Weight Decay**: Configured to 0.001 for regularization.
- **Additional Parameters**: Dropout rate of 0.01, label smoothing factor of 0.11, and sequence length of 370 tokens. Model datatype is bfloat16, z_loss at 0.0001.
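
For reference, the optimizer and schedule above correspond roughly to the following optax setup (a sketch, not the actual training script; `total_steps` is a placeholder, and dropout, label smoothing and z_loss belong to the model/loss configuration rather than the optimizer):

```python
import optax

warmup_steps = 500
total_steps = 100_000  # placeholder: the actual number of fine-tuning steps is not stated above
peak_lr = 0.0009

# Linear warmup for 500 steps, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# AdamW with the betas, epsilon and weight decay listed above.
optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.997, eps=1e-8, weight_decay=0.001)
```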

## Evaluation results

TBD

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/).
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)