---
language:
- nl
- en
- multilingual
license: apache-2.0
tags:
- dutch
- english
- t5
- t5x
- ul2
- seq2seq
- translation
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/nedd_wiki_news
pipeline_tag: translation
widget:
- text: >-
    Redistricting and West Virginia’s shrinking population forced the state’s
    Republican Legislature to pit Mr. McKinley, a six-term Republican with a
    pragmatic bent, against Mr. Mooney, who has served four terms marked more
    by conservative rhetoric than legislative achievements.
- text: >-
    It is a painful and tragic spectacle that rises before me: I have drawn
    back the curtain from the rottenness of man. This word, in my mouth, is at
    least free from one suspicion: that it involves a moral accusation against
    humanity.
- text: >-
    Young Wehling was hunched in his chair, his head in his hand. He was so
    rumpled, so still and colorless as to be virtually invisible. His
    camouflage was perfect, since the waiting room had a disorderly and
    demoralized air, too. Chairs and ashtrays had been moved away from the
    walls. The floor was paved with spattered dropcloths.
---

# ul2-large-en-nl for English to Dutch translation

A T5 model fine-tuned for English-to-Dutch translation, pretrained on Dutch using a UL2 (Mixture-of-Denoisers) objective.
The T5 model was introduced in [this paper](https://arxiv.org/abs/1910.10683)
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
The UL2 objective was introduced in [this paper](https://arxiv.org/abs/2205.05131)
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).

## Model description

T5 is an encoder-decoder model that treats all NLP problems in a text-to-text format.

`ul2-large-en-nl-v3` is a T5-based transformers model fine-tuned on parallel sentence and paragraph pairs
sampled from books.

This model used the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model during pretraining:

- GEGLU activation in the feed-forward hidden layer, rather than ReLU (see [here](https://arxiv.org/abs/2002.05202) and the sketch after this list)
- Dropout was turned off during pre-training; it should be re-enabled during fine-tuning
- Pre-trained on the self-supervised objective only, without mixing in downstream tasks
- No parameter sharing between the embedding and classifier layers
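
To make the first item concrete, here is a minimal sketch of a GEGLU feed-forward block. It illustrates the idea only and is not the exact T5X/transformers implementation; the class and attribute names are mine, and the dimensions are whatever the checkpoint actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GEGLUFeedForward(nn.Module):
    """Feed-forward block with a GELU-gated linear unit: GELU(x W_0) * (x W_1), then project back."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # gated branch, passed through GELU
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # linear branch
        self.wo = nn.Linear(d_ff, d_model, bias=False)    # projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(F.gelu(self.wi_0(x)) * self.wi_1(x))
```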

### UL2 pretraining objective

This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
paradigms. UL2 frames different objective functions for training language models as denoising tasks, where
the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture of
denoisers that samples from a varied set of such objectives, each with a different configuration. UL2 is trained
using a mixture of three denoising tasks:

1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
2. X-denoising (or extreme span corruption); and
3. S-denoising (or sequential PrefixLM).

During pre-training, the denoising tasks are sampled according to user-specified ratios.
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific
pre-training denoising task. During pre-training, a paradigm token is inserted into the input
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) to indicate the denoising task at hand.
Then, during fine-tuning, the same token should be inserted to get the best performance on the downstream task
(see the sketch below).
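
As a minimal illustration of mode switching: the exact prefix convention (paradigm token followed by a space) is an assumption taken from the UL2 paper, the `with_mode` helper is hypothetical, and note that the translation example under "How to use" below passes the text without any prefix.

```python
# Paradigm tokens defined by the UL2 mixture-of-denoisers.
MODE_TOKENS = {"R": "[NLU]", "X": "[NLG]", "S": "[S2S]"}


def with_mode(text: str, mode: str = "S") -> str:
    """Prepend the paradigm token matching the pre-training denoising task."""
    return f"{MODE_TOKENS[mode]} {text}"


print(with_mode("Young Wehling was hunched in his chair.", mode="S"))
# [S2S] Young Wehling was hunched in his chair.
```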

## Intended uses & limitations

This model was fine-tuned on parallel sentence and paragraph pairs and can be used
for machine translation.

### How to use

Here is how to use this model in PyTorch:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_name = "yhavinga/ul2-large-en-nl-v3"

# Use the first GPU if available, otherwise run on CPU.
device_num = 0 if torch.cuda.is_available() else -1
device = "cpu" if device_num < 0 else f"cuda:{device_num}"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Beam search, capped at the 370-token sequence length used during fine-tuning.
params = {"max_length": 370, "num_beams": 4, "early_stopping": True}
translator = pipeline("translation", tokenizer=tokenizer, model=model, device=device_num)

print(
    translator(
        "Young Wehling was hunched in his chair, his head in his hand. "
        "He was so rumpled, so still and colorless as to be virtually invisible.",
        **params,
    )[0]["translation_text"]
)
```

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
Therefore, the model can produce biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

The `ul2-large-en-nl` T5 model was pre-trained simultaneously on a combination of several datasets,
including the `full` config of the `mc4_nl_cleaned` dataset (a cleaned version of Common Crawl's web
crawl corpus), Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of `mc4_nl_cleaned`
containing only texts from Dutch newspapers.
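
The web-crawl portion can be inspected as follows. This is a sketch assuming the `yhavinga/mc4_nl_cleaned` dataset id and `full` config named above; the `train` split and `text` field are assumptions based on the usual mC4 layout.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading it in full.
mc4_nl = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

# Print the first characters of a couple of documents.
for example in mc4_nl.take(2):
    print(example["text"][:200])
```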

After pre-training, the model was fine-tuned on a translation dataset containing 13 million sentence and paragraph pairs
sampled from books.

## Training procedure

### Preprocessing

The ul2-large-en-nl T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
The tokenizer includes the special tokens `<pad>`, `</s>` and `<unk>` known from the original T5 paper,
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
The tokenizer does not lowercase texts and is therefore case-sensitive: it distinguishes
between `dutch` and `Dutch`.
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total vocabulary of 32,128 tokens.
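
These tokenizer properties can be checked directly. This is a quick sketch; the printed values should match the description above, but verify against the actual checkpoint.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/ul2-large-en-nl-v3", use_fast=False)

# Total vocabulary size: 32,000 SentencePiece pieces plus the extra tokens described above.
print(len(tokenizer))

# Case-sensitive: "dutch" and "Dutch" tokenize differently.
print(tokenizer.tokenize("dutch"), tokenizer.tokenize("Dutch"))

# The UL2 paradigm tokens should map to dedicated ids rather than the unknown token.
print(tokenizer.convert_tokens_to_ids(["[NLU]", "[NLG]", "[S2S]"]))
```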

### Fine-tuning

This model was fine-tuned for three epochs on a dataset containing 13M sentence and paragraph translation pairs sampled
from books.

Wandb run: https://wandb.ai/yepster/ul2-large-de-neddx2-en-nl/runs/30arxggk?workspace=user-yepster

- Pre-trained model used as starting point: yhavinga/ul2-large-dutch-english (3150k checkpoint)

For the concluding ~half epoch, a Hugging Face Flax-based trainer was used with the following settings (an equivalent optimizer setup is sketched after the list):

- **Batch Size**: Total effective batch size of 512, achieved via per-device settings and gradient accumulation.
- **Learning Rate**: Set at 0.0009, with a linear schedule and 500-step warmup.
- **Optimizer**: AdamW with beta1=0.9, beta2=0.997, epsilon=1e-8.
- **Weight Decay**: Configured to 0.001 for regularization.
- **Additional Parameters**: Dropout rate of 0.01, label smoothing factor of 0.11, and sequence length of 370 tokens. Model datatype is bfloat16, z_loss at 0.0001.
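
For reference, the optimizer and schedule above correspond roughly to the following optax setup (a sketch, not the actual training script; `total_steps` is a placeholder, and dropout, label smoothing and z_loss belong to the model/loss configuration rather than the optimizer):

```python
import optax

warmup_steps = 500
total_steps = 100_000  # placeholder: the actual number of fine-tuning steps is not stated above
peak_lr = 0.0009

# Linear warmup for 500 steps, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# AdamW with the betas, epsilon and weight decay listed above.
optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.997, eps=1e-8, weight_decay=0.001)
```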

## Evaluation results

TBD

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/).
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)