---
language:
- nl
- en
- multilingual
license: apache-2.0
tags:
- dutch
- english
- t5
- t5x
- ul2
- seq2seq
- translation
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/nedd_wiki_news
pipeline_tag: translation
widget:
- text: >-
    Redistricting and West Virginia’s shrinking population forced the state’s
    Republican Legislature to pit Mr. McKinley, a six-term Republican with a
    pragmatic bent, against Mr. Mooney, who has served four terms marked more
    by conservative rhetoric than legislative achievements.
- text: >-
    It is a painful and tragic spectacle that rises before me: I have drawn
    back the curtain from the rottenness of man. This word, in my mouth, is at
    least free from one suspicion: that it involves a moral accusation against
    humanity.
- text: >-
    Young Wehling was hunched in his chair, his head in his hand. He was so
    rumpled, so still and colorless as to be virtually invisible. His
    camouflage was perfect, since the waiting room had a disorderly and
    demoralized air, too. Chairs and ashtrays had been moved away from the
    walls. The floor was paved with spattered dropcloths.
---

# ul2-large-en-nl for English to Dutch translation

A T5 model fine-tuned for English to Dutch translation, pretrained on Dutch with a UL2 (Mixture-of-Denoisers) objective.
The T5 model was introduced in
[this paper](https://arxiv.org/abs/1910.10683)
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
The UL2 objective was introduced in
[this paper](https://arxiv.org/abs/2205.05131)
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).

## Model description

T5 is an encoder-decoder model that treats all NLP problems in a text-to-text format.

`ul2-large-en-nl-v3` is a T5 model for the transformers library, fine-tuned on parallel sentence and paragraph pairs
sampled from books.

During pretraining, this model used the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model:
- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
- Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
- Pre-trained on the self-supervised objective only, without mixing in the downstream tasks
- No parameter sharing between embedding and classifier layer
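
As a rough illustration of the first point, the sketch below contrasts a ReLU feed-forward block with a GEGLU one, using plain Python lists. The function names, the toy weights, the omission of bias terms, and the choice of the tanh approximation of GELU are assumptions made for illustration; this is not the model's actual implementation.

```python
import math


def gelu(x: float) -> float:
    """Tanh approximation of the GELU activation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def matvec(x, w):
    """Multiply a row vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def relu_ffn(x, w_in, w_down):
    """Original T5 style block: ReLU(x W_in) W_down."""
    hidden = [max(0.0, v) for v in matvec(x, w_in)]
    return matvec(hidden, w_down)


def geglu_ffn(x, w_gate, w_up, w_down):
    """GEGLU block: (GELU(x W_gate) * (x W_up)) W_down, elementwise product."""
    gate = [gelu(v) for v in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)


# Toy weights (hypothetical values, chosen only to keep the example readable).
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
sum_down = [[1.0], [1.0]]

print(relu_ffn([-1.0, 3.0], identity, sum_down))
print(geglu_ffn([1.0, 0.0], identity, double, sum_down))
```

The gated variant uses two input projections instead of one, so the second projection can scale each hidden unit independently of the activation.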

### UL2 pretraining objective

This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
paradigms. UL2 frames different objective functions for training language models as denoising tasks, where
the model has to recover missing sub-sequences of a given input. During pre-training it uses a mixture-of-denoisers
that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
three denoising tasks:

1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
2. X-denoising (or extreme span corruption); and
3. S-denoising (or sequential PrefixLM).

During pre-training, the denoising tasks are sampled according to user-specified ratios.
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
denoising task. During pre-training, a paradigm token is inserted into the input
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) indicating the denoising task at hand.
During fine-tuning, the same token should then be inserted into the input to get the best performance on each
downstream task.
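
Mode switching can be sketched in a few lines. Only the three paradigm tokens come from the description above; the helper names, the sampling ratios, and the exact formatting (token followed by a space) are assumptions for illustration. Whether the fine-tuned translation checkpoint expects such a prefix is not stated here.

```python
import random

# Paradigm tokens as listed in this model card, keyed by denoiser type.
PARADIGM_TOKENS = {
    "R": "[NLU]",  # regular span corruption
    "X": "[NLG]",  # extreme span corruption
    "S": "[S2S]",  # sequential PrefixLM
}


def sample_denoiser(ratios, rng):
    """Pick a denoiser name according to the given sampling ratios."""
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def add_paradigm_token(text, denoiser):
    """Prepend the paradigm token for the chosen denoiser (hypothetical format)."""
    return f"{PARADIGM_TOKENS[denoiser]} {text}"


# Hypothetical ratios; the actual pre-training mixture is not given here.
ratios = {"R": 0.25, "X": 0.25, "S": 0.5}
rng = random.Random(0)
denoiser = sample_denoiser(ratios, rng)
print(add_paradigm_token("The floor was paved with spattered dropcloths.", denoiser))
```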

## Intended uses & limitations

This model was fine-tuned on parallel sentence and paragraph pairs and can be used
for machine translation.

### How to use

Here is how to use this model in PyTorch:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_name = "yhavinga/ul2-large-en-nl-v3"

# Use the first GPU when available, otherwise fall back to the CPU.
device_num = 0 if torch.cuda.is_available() else -1
device = "cpu" if device_num < 0 else f"cuda:{device_num}"

# Load the slow (SentencePiece) tokenizer and the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
# `use_auth_token` is deprecated in recent transformers releases;
# pass `token=True` here if the repository requires authentication.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Beam search settings; 370 matches the fine-tuning sequence length.
params = {"max_length": 370, "num_beams": 4, "early_stopping": True}
translator = pipeline("translation", tokenizer=tokenizer, model=model, device=device_num)

text = (
    "Young Wehling was hunched in his chair, his head in his hand. "
    "He was so rumpled, so still and colorless as to be virtually invisible."
)
print(translator(text, **params)[0]["translation_text"])
```

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

The `ul2-large-en-nl` T5 model was pre-trained simultaneously on a combination of several datasets,
including the `full` config of the "mc4_nl_cleaned" dataset, which is a cleaned version of Common Crawl's web
crawl corpus, Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of "mc4_nl_cleaned"
containing only texts from Dutch newspapers.

After pre-training, the model was
fine-tuned on a translation dataset containing 13 million sentence and paragraph pairs
sampled from books.

## Training procedure

### Preprocessing

The ul2-large-en-nl T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
The tokenizer includes the special tokens `<pad>`, `</s>`, `<unk>`, known from the original T5 paper,
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
between `dutch` and `Dutch`.
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.

### Fine-tuning

This model was fine-tuned for three epochs on a dataset containing 13M sentence and paragraph translation pairs
sampled from books.

Wandb run: https://wandb.ai/yepster/ul2-large-de-neddx2-en-nl/runs/30arxggk?workspace=user-yepster

* Pre-trained model used as starting point: yhavinga/ul2-large-dutch-english (3150k checkpoint)

For the concluding ~half epoch, a HuggingFace Flax based trainer was used with the following settings:

- **Batch Size**: Total effective batch size of 512, achieved via per-device settings and gradient accumulation.
- **Learning Rate**: 0.0009, with a linear schedule and a 500-step warmup.
- **Optimizer**: AdamW with beta1=0.9, beta2=0.997, epsilon=1e-8.
- **Weight Decay**: 0.001.
- **Additional Parameters**: Dropout rate of 0.01, label smoothing factor of 0.11, and sequence length of 370 tokens. Model datatype is bfloat16, z_loss at 0.0001.
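
To make the learning rate setting concrete, here is a minimal sketch of a linear warmup followed by a linear decay. The peak rate and the warmup length come from the list above; the decay to zero and the `total_steps` value are assumptions, since the total number of training steps is not given here, and the function name is hypothetical.

```python
def linear_warmup_linear_decay(step, peak_lr=0.0009, warmup_steps=500, total_steps=10_000):
    """Learning rate at a given step: linear ramp up, then linear decay to zero.

    total_steps is a placeholder value, not the actual training length.
    """
    if step < warmup_steps:
        # Ramp from 0 to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Decay from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 5_000, 10_000):
    print(step, linear_warmup_linear_decay(step))
```

The rate peaks exactly when the warmup ends, so the largest updates happen early in the run, after the optimizer statistics have had some steps to settle.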

## Evaluation results

TBD

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/).
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)