yhavinga committed on
Commit
454bda3
·
1 Parent(s): 9d3393d
README.md ADDED
@@ -0,0 +1,203 @@
1
+
2
+ ---
3
+ language:
4
+ - nl
5
+ - en
6
+ - multilingual
7
+ license: apache-2.0
8
+ tags:
9
+ - dutch
10
+ - english
11
+ - t5
12
+ - t5x
13
+ - ul2
14
+ - seq2seq
15
+ - translation
16
+ datasets:
17
+ - yhavinga/mc4_nl_cleaned
18
+ - yhavinga/nedd_wiki_news
19
+ pipeline_tag: translation
20
+ widget:
21
+ - text: >-
22
+ Redistricting and West Virginia’s shrinking population forced the state’s
23
+ Republican Legislature to pit Mr. McKinley, a six-term Republican with a
24
+ pragmatic bent, against Mr. Mooney, who has served four terms marked more
25
+ by conservative rhetoric than legislative achievements.
26
+ - text: >-
27
+ It is a painful and tragic spectacle that rises before me: I have drawn
28
+ back the curtain from the rottenness of man. This word, in my mouth, is at
29
+ least free from one suspicion: that it involves a moral accusation against
30
+ humanity.
31
+ - text: >-
32
+ Young Wehling was hunched in his chair, his head in his hand. He was so
33
+ rumpled, so still and colorless as to be virtually invisible. His
34
+ camouflage was perfect, since the waiting room had a disorderly and
35
+ demoralized air, too. Chairs and ashtrays had been moved away from the
36
+ walls. The floor was paved with spattered dropcloths.
37
+ ---
38
+
39
+ # ul2-base-nl36-en-nl for English to Dutch translation
40
+
41
+ A T5 model for English-to-Dutch translation, pretrained on Dutch with a UL2 (Mixture-of-Denoisers) objective and then fine-tuned on translation pairs.
42
+ The T5 model was introduced in
43
+ [this paper](https://arxiv.org/abs/1910.10683)
44
+ and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
45
+ The UL2 objective was introduced in
46
+ [this paper](https://arxiv.org/abs/2205.05131)
47
+ and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).
48
+
49
+
50
+
51
+ ## Model description
52
+
53
+ T5 is an encoder-decoder model and treats all NLP problems in a text-to-text format.
54
+
55
+ `ul2-base-nl36-en-nl` T5 is a transformers model fine-tuned on parallel sentence and paragraph pairs
56
+ sampled from books.
57
+
58
+ This model used the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements compared to the original T5 model during the pretraining:
59
+ - GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
60
+ - Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
61
+ - Pre-trained on a self-supervised objective only, without mixing in the downstream tasks
62
+ - No parameter sharing between embedding and classifier layer
63
+
64
+ The "efficient" T5 architecture findings presented in [this paper](https://arxiv.org/abs/2109.10686) were also applied,
65
+ which suggest that a Deep-Narrow model architecture is favorable for downstream performance compared to other model
66
+ architectures of similar parameter count. Specifically, the model depth is defined as the number of transformer blocks
67
+ that are stacked sequentially.
68
+ This model uses the [t5-efficient-base-nl36](https://huggingface.co/google/t5-efficient-base-nl36) architecture's
69
+ layer depth, which means both the encoder and the decoder have 36 transformer layers compared to the original T5 "base"
70
+ model's architecture of 12 transformer layers.
71
+
72
+ ### UL2 pretraining objective
73
+
74
+ This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
75
+ paradigms together. UL2 frames different objective functions for training language models as denoising tasks, where
76
+ the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers
77
+ that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
78
+ three denoising tasks:
79
+
80
+ 1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
81
+ 2. X-denoising (or extreme span corruption); and
82
+ 3. S-denoising (or sequential PrefixLM).
83
+
84
+ During pre-training, we sample from the available denoising tasks based on user-specified ratios.
85
+ UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
86
+ denoising task. During pre-training, a paradigm token is inserted into the input
87
+ (`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) indicating the denoising task at hand.
88
+ Then, during fine-tuning the same input token should be inserted to get the best performance for different downstream
89
+ fine-tuning tasks.
90
+
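The mode-switching step described above can be sketched as a small helper that prefixes an input with the paradigm token of a denoising task. This is only an illustration: the helper name `add_paradigm_token`, the task keys, and the single space between token and text are assumptions, and this card does not state whether the fine-tuned translation model expects such a token at inference time.

```python
# Paradigm tokens per denoising task, as listed in the model card.
PARADIGM_TOKENS = {
    "R": "[NLU]",  # regular span corruption
    "X": "[NLG]",  # extreme span corruption
    "S": "[S2S]",  # sequential PrefixLM
}


def add_paradigm_token(text: str, denoiser: str) -> str:
    """Prefix `text` with the paradigm token for `denoiser` (hypothetical helper)."""
    if denoiser not in PARADIGM_TOKENS:
        raise ValueError(f"unknown denoiser {denoiser!r}, expected one of R, X, S")
    # Assumption: token and text are separated by a single space.
    return f"{PARADIGM_TOKENS[denoiser]} {text}"


example = add_paradigm_token("Het weer is vandaag mooi.", "S")
```

Keeping the mapping in one dictionary makes it easy to reuse the same token for pre-training and for fine-tuning, which is the point of mode switching.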
91
+ ## Intended uses & limitations
92
+
93
+ This model was fine-tuned on parallel sentence and paragraph pairs and can be used
94
+ for machine translation.
95
+
96
+ ### How to use
97
+
98
+ Here is how to use this model in PyTorch:
99
+
100
+ ```python
+ import torch
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
+
+ model_name = "yhavinga/ul2-base-nl36-en-nl"
+ # Pipeline device index: 0 for the first GPU, -1 for CPU.
+ device_num = 0 if torch.cuda.is_available() else -1
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+ # Beam search settings; these match the generation defaults in config.json.
+ params = {"max_length": 370, "num_beams": 4, "early_stopping": True}
+ translator = pipeline("translation", tokenizer=tokenizer, model=model, device=device_num)
+ text = (
+     "Young Wehling was hunched in his chair, his head in his hand. "
+     "He was so rumpled, so still and colorless as to be virtually invisible."
+ )
+ print(translator(text, **params)[0]["translation_text"])
+ ```
118
+
119
+
120
+ ### Limitations and bias
121
+
122
+ The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
123
+ Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
124
+
125
+ ## Training data
126
+
127
+ The `ul2-base-nl36-en-nl` T5 model was pre-trained simultaneously on a combination of several datasets,
128
+ including the `full` config of the "mc4_nl_cleaned" dataset, which is a cleaned version of Common Crawl's web
129
+ crawl corpus, Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of "mc4_nl_cleaned"
130
+ containing only texts from Dutch and Belgian newspapers. This last dataset is oversampled to bias the model
131
+ towards descriptions of events in the Netherlands and Belgium.
132
+
133
+ After pre-training, the model was
134
+ fine-tuned on a translation dataset containing 13 million sentence and paragraph pairs
135
+ sampled from books.
136
+
137
+
138
+
139
+ ## Training procedure
140
+
141
+ ### Preprocessing
142
+
143
+ The ul2-base-nl36-en-nl T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
144
+ The tokenizer includes the special tokens `<pad>`, `</s>`, `<unk>`, known from the original T5 paper,
145
+ `[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
146
+ During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
147
+ The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
148
+ between `dutch` and `Dutch`.
149
+ Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.
150
+
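The vocabulary arithmetic above can be checked with a few lines of Python. The constant names are illustrative; the counts come from the text, and the `[new_id_N]` ids agree with the `added_tokens.json` file in this commit. Where the 100 `<extra_id_N>` tokens sit inside the range is not shown here, so the sketch only counts them.

```python
BASE_VOCAB = 32_000  # SentencePiece unigram vocabulary
EXTRA_IDS = 100      # <extra_id_0> ... <extra_id_99>
NEW_IDS = 28         # [new_id_0] ... [new_id_27]

total_vocab = BASE_VOCAB + EXTRA_IDS + NEW_IDS

# The [new_id_N] tokens occupy the last 28 ids of the vocabulary.
new_id_map = {f"[new_id_{i}]": BASE_VOCAB + EXTRA_IDS + i for i in range(NEW_IDS)}

print(total_vocab)
print(new_id_map["[new_id_0]"], new_id_map["[new_id_27]"])
```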
151
+ ### Fine-tuning
152
+
153
+ This model was fine-tuned on a dataset containing 13M sentence and paragraph translation pairs sampled from books.
154
+
155
+ * Pre-trained model used as starting point: yhavinga/ul2-base-nl36-dutch
156
+ * Number of fine-tuning steps: 43415
157
+ * Batch size: 512 (gradient accumulation steps: 16)
158
+ * Sequence length: 370 tokens
159
+ * Model dtype: bfloat16
160
+ * z_loss: 0.0001
161
+ * Optimizer: adamw_hf (beta1: 0.9, beta2: 0.9969, eps: 1e-08)
162
+ * Dropout rate: 0.01
163
+ * Learning rate: 0.0009 with linear decay to 0 and warmup for 500 steps
164
+ * Label smoothing factor: 0.11
165
+ * BLEU score: 44.2
166
+
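The learning-rate schedule in the list above (peak 0.0009, 500 warmup steps, linear decay to 0) can be sketched as a plain function. This is a minimal sketch, assuming the warmup rises linearly from 0 and the decay reaches 0 exactly at the last fine-tuning step (43415); the function name is hypothetical and the actual training script may differ in details.

```python
PEAK_LR = 0.0009
WARMUP_STEPS = 500
TOTAL_STEPS = 43415


def finetune_learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 (illustrative only)."""
    if step < WARMUP_STEPS:
        # Assumption: warmup starts at 0 and grows linearly.
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 20_000, 43_415):
    print(step, finetune_learning_rate(step))
```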
167
+ ### Model list
168
+
169
+ Models in this series:
170
+
171
+
172
+ | | ul2-base-en-nl | ul2-base-nl36-en-nl | ul2-large-en-nl |
173
+ |:---------------------|:-----------------|:----------------------|:------------------|
174
+ | model_type | t5 | t5 | t5 |
175
+ | _pipeline_tag | translation | translation | translation |
176
+ | d_model | 768 | 768 | 1024 |
177
+ | d_ff | 2048 | 3072 | 2816 |
178
+ | num_heads | 12 | 12 | 16 |
179
+ | d_kv | 64 | 64 | 64 |
180
+ | num_layers | 12 | 36 | 24 |
181
+ | num_decoder_layers | 12 | 36 | 24 |
182
+ | feed_forward_proj | gated-silu | gated-silu | gated-silu |
183
+ | dense_act_fn | silu | silu | silu |
184
+ | vocab_size | 32128 | 32128 | 32128 |
185
+ | tie_word_embeddings | 0 | 0 | 0 |
186
+ | torch_dtype | float32 | float32 | float32 |
187
+ | _gin_batch_size | 128 | 64 | 64 |
188
+ | _gin_z_loss | 0.0001 | 0.0001 | 0.0001 |
189
+ | _gin_t5_config_dtype | 'bfloat16' | 'bfloat16' | 'bfloat16' |
190
+
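The table values can be turned into a rough parameter estimate. The sketch below is not taken from the training code; it assumes the usual T5 v1.1 layout: no bias terms, scale-only layer norms, a gated feed-forward block with three weight matrices, one relative-attention bias table per stack with 32 buckets, and untied input and output embeddings. The function name is hypothetical.

```python
def estimate_t5_params(d_model, d_ff, num_heads, d_kv, num_layers,
                       num_decoder_layers, vocab_size, rel_buckets=32):
    """Rough T5 v1.1 parameter count under the assumptions stated above."""
    inner = num_heads * d_kv
    attention = 4 * d_model * inner           # q, k, v and output projections
    feed_forward = 3 * d_model * d_ff         # two gated input matrices plus output
    encoder_layer = attention + feed_forward + 2 * d_model      # 2 layer norms
    decoder_layer = 2 * attention + feed_forward + 3 * d_model  # self + cross attention
    rel_bias = 2 * rel_buckets * num_heads    # one table for each stack
    final_norms = 2 * d_model
    embeddings = 2 * vocab_size * d_model     # untied input embedding and LM head
    return (num_layers * encoder_layer
            + num_decoder_layers * decoder_layer
            + rel_bias + final_norms + embeddings)


# Column `ul2-base-nl36-en-nl` of the table above.
nl36_params = estimate_t5_params(d_model=768, d_ff=3072, num_heads=12, d_kv=64,
                                 num_layers=36, num_decoder_layers=36,
                                 vocab_size=32128)
print(nl36_params)
```

Under these assumptions the estimate is roughly 0.81 billion parameters, or about 3.26 GB at 4 bytes per float32 weight, which is close to the size recorded for `pytorch_model.bin` in this commit.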
191
+ ## Evaluation results
192
+
193
+ See the evaluation section in the interactive [Pre-training Dutch T5 Models](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models) blog.
194
+
195
+ ## Acknowledgements
196
+
197
+ This project would not have been possible without compute generously provided by Google through the
198
+ [TPU Research Cloud](https://sites.research.google/trc/).
199
+ Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
200
+ Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.
201
+
202
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
203
+
added_tokens.json ADDED
@@ -0,0 +1 @@
1
+ {"[new_id_17]": 32117, "[new_id_20]": 32120, "[new_id_13]": 32113, "[new_id_2]": 32102, "[new_id_16]": 32116, "[new_id_7]": 32107, "[new_id_5]": 32105, "[new_id_1]": 32101, "[new_id_15]": 32115, "[new_id_12]": 32112, "[new_id_0]": 32100, "[new_id_11]": 32111, "[new_id_25]": 32125, "[new_id_24]": 32124, "[new_id_10]": 32110, "[new_id_27]": 32127, "[new_id_23]": 32123, "[new_id_14]": 32114, "[new_id_22]": 32122, "[new_id_21]": 32121, "[new_id_19]": 32119, "[new_id_3]": 32103, "[new_id_4]": 32104, "[new_id_18]": 32118, "[new_id_9]": 32109, "[new_id_8]": 32108, "[new_id_26]": 32126, "[new_id_6]": 32106}
config.gin ADDED
@@ -0,0 +1,150 @@
1
+ from __gin__ import dynamic_registration
2
+ import __main__ as train_script
3
+ import seqio
4
+ import t5.data.mixtures
5
+ from t5x import adafactor
6
+ from t5x.examples.t5 import network
7
+ from t5x import gin_utils
8
+ from t5x import models
9
+ from t5x import partitioning
10
+ from t5x import trainer
11
+ from t5x import utils
12
+ import tasks.nedd_tasks
13
+ import tasks.ul2_tasks as tasks2
14
+
15
+ # Macros:
16
+ # ==============================================================================
17
+ BATCH_SIZE = 64
18
+ DROPOUT_RATE = 0.0
19
+ LABEL_SMOOTHING = 0.0
20
+ LOSS_NORMALIZING_FACTOR = None
21
+ MIXTURE_OR_TASK_MODULE = None
22
+ MIXTURE_OR_TASK_NAME = 'ul2_mc4_nedd_wiki_news_mix_1'
23
+ MODEL = @models.EncoderDecoderModel()
24
+ MODEL_DIR = 'ul2_base_nl36_mc4_nedd_wiki_news_nl'
25
+ OPTIMIZER = @adafactor.Adafactor()
26
+ RANDOM_SEED = None
27
+ SHUFFLE_TRAIN_EXAMPLES = True
28
+ TASK_FEATURE_LENGTHS = {'inputs': 512, 'targets': 512}
29
+ TRAIN_STEPS = 2000000
30
+ USE_CACHED_TASKS = False
31
+ USE_HARDWARE_RNG = False
32
+ VOCABULARY = @seqio.SentencePieceVocabulary()
33
+ Z_LOSS = 0.0001
34
+
35
+ # Parameters for adafactor.Adafactor:
36
+ # ==============================================================================
37
+ adafactor.Adafactor.decay_rate = 0.8
38
+ adafactor.Adafactor.logical_factor_rules = \
39
+ @adafactor.standard_logical_factor_rules()
40
+ adafactor.Adafactor.step_offset = 0
41
+
42
+ # Parameters for utils.CheckpointConfig:
43
+ # ==============================================================================
44
+ utils.CheckpointConfig.restore = @utils.RestoreCheckpointConfig()
45
+ utils.CheckpointConfig.save = @utils.SaveCheckpointConfig()
46
+
47
+ # Parameters for utils.create_learning_rate_scheduler:
48
+ # ==============================================================================
49
+ utils.create_learning_rate_scheduler.base_learning_rate = 1.0
50
+ utils.create_learning_rate_scheduler.factors = 'constant * rsqrt_decay'
51
+ utils.create_learning_rate_scheduler.warmup_steps = 10000
52
+
53
+ # Parameters for train/utils.DatasetConfig:
54
+ # ==============================================================================
55
+ train/utils.DatasetConfig.batch_size = %BATCH_SIZE
56
+ train/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
57
+ train/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
58
+ train/utils.DatasetConfig.pack = True
59
+ train/utils.DatasetConfig.seed = None
60
+ train/utils.DatasetConfig.shuffle = %SHUFFLE_TRAIN_EXAMPLES
61
+ train/utils.DatasetConfig.split = 'train'
62
+ train/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
63
+ train/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
64
+
65
+ # Parameters for train_eval/utils.DatasetConfig:
66
+ # ==============================================================================
67
+ train_eval/utils.DatasetConfig.batch_size = %BATCH_SIZE
68
+ train_eval/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
69
+ train_eval/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
70
+ train_eval/utils.DatasetConfig.pack = True
71
+ train_eval/utils.DatasetConfig.seed = 42
72
+ train_eval/utils.DatasetConfig.shuffle = False
73
+ train_eval/utils.DatasetConfig.split = 'validation'
74
+ train_eval/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
75
+ train_eval/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
76
+
77
+ # Parameters for models.EncoderDecoderModel:
78
+ # ==============================================================================
79
+ models.EncoderDecoderModel.input_vocabulary = %VOCABULARY
80
+ models.EncoderDecoderModel.label_smoothing = %LABEL_SMOOTHING
81
+ models.EncoderDecoderModel.loss_normalizing_factor = %LOSS_NORMALIZING_FACTOR
82
+ models.EncoderDecoderModel.module = @network.Transformer()
83
+ models.EncoderDecoderModel.optimizer_def = %OPTIMIZER
84
+ models.EncoderDecoderModel.output_vocabulary = %VOCABULARY
85
+ models.EncoderDecoderModel.z_loss = %Z_LOSS
86
+
87
+ # Parameters for partitioning.PjitPartitioner:
88
+ # ==============================================================================
89
+ partitioning.PjitPartitioner.logical_axis_rules = \
90
+ @partitioning.standard_logical_axis_rules()
91
+ partitioning.PjitPartitioner.model_parallel_submesh = None
92
+ partitioning.PjitPartitioner.num_partitions = 1
93
+
94
+ # Parameters for utils.RestoreCheckpointConfig:
95
+ # ==============================================================================
96
+ utils.RestoreCheckpointConfig.path = []
97
+
98
+ # Parameters for utils.SaveCheckpointConfig:
99
+ # ==============================================================================
100
+ utils.SaveCheckpointConfig.dtype = 'float32'
101
+ utils.SaveCheckpointConfig.keep = 4
102
+ utils.SaveCheckpointConfig.period = 50000
103
+ utils.SaveCheckpointConfig.save_dataset = False
104
+ utils.SaveCheckpointConfig.use_gda = False
105
+
106
+ # Parameters for seqio.SentencePieceVocabulary:
107
+ # ==============================================================================
108
+ seqio.SentencePieceVocabulary.sentencepiece_model_file = \
109
+ 'gs://t5-dutch-english/vocabs/nedd.32000.128extra/spiece.model'
110
+
111
+ # Parameters for network.T5Config:
112
+ # ==============================================================================
113
+ network.T5Config.dropout_rate = %DROPOUT_RATE
114
+ network.T5Config.dtype = 'bfloat16'
115
+ network.T5Config.emb_dim = 768
116
+ network.T5Config.head_dim = 64
117
+ network.T5Config.logits_via_embedding = False
118
+ network.T5Config.mlp_activations = ('gelu', 'linear')
119
+ network.T5Config.mlp_dim = 3072
120
+ network.T5Config.num_decoder_layers = 36
121
+ network.T5Config.num_encoder_layers = 36
122
+ network.T5Config.num_heads = 12
123
+ network.T5Config.vocab_size = 32128
124
+
125
+ # Parameters for train_script.train:
126
+ # ==============================================================================
127
+ train_script.train.checkpoint_cfg = @utils.CheckpointConfig()
128
+ train_script.train.eval_period = 2000
129
+ train_script.train.eval_steps = 20
130
+ train_script.train.infer_eval_dataset_cfg = None
131
+ train_script.train.model = %MODEL
132
+ train_script.train.model_dir = %MODEL_DIR
133
+ train_script.train.partitioner = @partitioning.PjitPartitioner()
134
+ train_script.train.random_seed = %RANDOM_SEED
135
+ train_script.train.stats_period = 100
136
+ train_script.train.summarize_config_fn = @gin_utils.summarize_gin_config
137
+ train_script.train.total_steps = %TRAIN_STEPS
138
+ train_script.train.train_dataset_cfg = @train/utils.DatasetConfig()
139
+ train_script.train.train_eval_dataset_cfg = @train_eval/utils.DatasetConfig()
140
+ train_script.train.trainer_cls = @trainer.Trainer
141
+ train_script.train.use_hardware_rng = %USE_HARDWARE_RNG
142
+
143
+ # Parameters for trainer.Trainer:
144
+ # ==============================================================================
145
+ trainer.Trainer.learning_rate_fn = @utils.create_learning_rate_scheduler()
146
+ trainer.Trainer.num_microbatches = None
147
+
148
+ # Parameters for network.Transformer:
149
+ # ==============================================================================
150
+ network.Transformer.config = @network.T5Config()
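The pre-training learning rate configured in `config.gin` above (`base_learning_rate = 1.0`, `factors = 'constant * rsqrt_decay'`, `warmup_steps = 10000`) can be approximated in plain Python. This is a sketch under an assumption: that the `rsqrt_decay` factor in t5x divides by the square root of `max(step, warmup_steps)`. The function name is hypothetical and the t5x source remains the authority on the exact behavior.

```python
import math

BASE_LEARNING_RATE = 1.0  # utils.create_learning_rate_scheduler.base_learning_rate
WARMUP_STEPS = 10_000     # utils.create_learning_rate_scheduler.warmup_steps


def pretraining_learning_rate(step: int) -> float:
    """'constant * rsqrt_decay': the base rate divided by sqrt(max(step, warmup))."""
    # Assumption: the rate stays flat during warmup, then decays as 1/sqrt(step).
    return BASE_LEARNING_RATE / math.sqrt(max(step, WARMUP_STEPS))


for step in (0, 10_000, 1_000_000, 2_000_000):
    print(step, pretraining_learning_rate(step))
```

With these settings the rate would stay at 0.01 for the first 10,000 steps and fall to 0.001 by step 1,000,000.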
config.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "_name_or_path": "./",
3
+ "architectures": [
4
+ "T5ForConditionalGeneration"
5
+ ],
6
+ "d_ff": 3072,
7
+ "d_kv": 64,
8
+ "d_model": 768,
9
+ "decoder_start_token_id": 0,
10
+ "dense_act_fn": "silu",
11
+ "dropout_rate": 0.01,
12
+ "early_stopping": true,
13
+ "eos_token_id": 1,
14
+ "feed_forward_proj": "gated-silu",
15
+ "initializer_factor": 1.0,
16
+ "is_encoder_decoder": true,
17
+ "is_gated_act": true,
18
+ "layer_norm_epsilon": 1e-06,
19
+ "max_length": 370,
20
+ "model_type": "t5",
21
+ "num_beams": 4,
22
+ "num_decoder_layers": 36,
23
+ "num_heads": 12,
24
+ "num_layers": 36,
25
+ "output_past": true,
26
+ "pad_token_id": 0,
27
+ "relative_attention_max_distance": 128,
28
+ "relative_attention_num_buckets": 32,
29
+ "tie_word_embeddings": false,
30
+ "torch_dtype": "float32",
31
+ "transformers_version": "4.24.0",
32
+ "use_cache": true,
33
+ "vocab_size": 32128
34
+ }
events.out.tfevents.1673453219.t1v-n-c82e3785-w-0.4133.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6b6e252bece32e07b67707a9eb56c2bd1599dfda1084432a03d0f9f0d746f74b
3
+ size 1941504
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0db5c8d7d9b492a2d2fe68a2197442fe1f12709055f0da7b85d8fec2cb08a34e
3
+ size 1677466902
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:952bd79ab8ec1a7c8fc294ecb8cd05851a4fd570b000166ed5bc36331afe44ce
3
+ size 3255881749
run_s2s_ul2-base-nl36-neddx2-en-nl.sh ADDED
@@ -0,0 +1,75 @@
1
+ export CORES=`grep -c ^processor /proc/cpuinfo`
2
+ export CORES=`echo "scale=0; ${CORES} * 0.8 / 1" | bc`
3
+
4
+ #export XLA_PYTHON_CLIENT_PREALLOCATE=false
5
+ export SOURCE_LANG="en"
6
+ export TARGET_LANG="nl"
7
+ export HF_PROJECT="ul2-base-nl36-neddx2-en-nl"
8
+ #
9
+ export DATASET="/home/yeb/data/nedd_x_dataset/nedd_x_dataset.py"
10
+ #export DATASET_CONFIG="dict"
11
+ export DATASET_CONFIG="voc8k_beta_3buf"
12
+ export MODEL_NAME_OR_PATH="yhavinga/ul2-base-nl36-dutch"
13
+ export TOKENIZER_NAME="yhavinga/ul2-base-nl36-dutch"
14
+ export MODEL_PATH="${HOME}/data/${HF_PROJECT}" # Path to the model
15
+ export HF_DATASETS_CACHE=/mnt/ramdisk
16
+
17
+ # 52k 8k 32ksp
18
+ #l 472 500
19
+ #b0 328 352
20
+ #b1 472 480 370
21
+ #b2 1920 1984
22
+
23
+ mkdir -p ${MODEL_PATH}
24
+
25
+ python ../run_s2s_flax_pmap_multiseq.py \
26
+ --output_dir="${MODEL_PATH}" \
27
+ --model_name_or_path ${MODEL_NAME_OR_PATH} \
28
+ --tokenizer_name ${TOKENIZER_NAME} \
29
+ --use_fast_tokenizer="False" \
30
+ --use_auth_token="True" \
31
+ --dataset_name_list ${DATASET}\
32
+ --dataset_config_name_list "${DATASET_CONFIG}"\
33
+ --id_filter_list "<not>-b2-" \
34
+ --max_train_samples_list "0" \
35
+ --max_eval_samples_list "2000" \
36
+ --max_predict_samples_list "128" \
37
+ --preprocessing_num_workers="${CORES}" \
38
+ --source_lang="${SOURCE_LANG}" \
39
+ --target_lang="${TARGET_LANG}" \
40
+ --metric_name="sacrebleu" \
41
+ --do_train --do_eval --do_predict \
42
+ --predict_with_generate \
43
+ --learning_rate="0.0009" \
44
+ --adam_beta1="0.9" \
45
+ --adam_beta2="0.9969" \
46
+ --adam_epsilon="1e-8" \
47
+ --weight_decay="0.001" \
48
+ --label_smoothing_factor="0.11" \
49
+ --length_penalty="1.3" \
50
+ --warmup_steps 500 \
51
+ --dropout_rate="0.01" \
52
+ --dtype "bfloat16" \
53
+ --z_loss "1e-4" \
54
+ --dynamic_loss_scaling="False" \
55
+ --per_device_train_batch_size 4 \
56
+ --per_device_eval_batch_size 4 \
57
+ --gradient_accumulation_steps 16 \
58
+ --overwrite_output_dir \
59
+ --max_source_length_list 370 \
60
+ --max_target_length_list 370 \
61
+ --num_beams 5 \
62
+ --overwrite_output_dir \
63
+ --logging_steps 5 \
64
+ --save_steps 800 \
65
+ --eval_steps 800 \
66
+ --num_train_epochs 2 \
67
+ --max_eval_samples 512 \
68
+ --validation_split_count 2000 \
69
+ --wandb_project="${HF_PROJECT}" \
70
+ --wandb_job_type="pmap"
71
+
72
+ # --resume_from_checkpoint="${MODEL_PATH}" \
73
+ # --max_train_samples="1_064_886" \
74
+ # --max_eval_samples 256 \
75
+ # --max_predict_samples 256 \
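The script above sets `--per_device_train_batch_size 4` and `--gradient_accumulation_steps 16`, while the README reports a batch size of 512. The sketch below shows how these numbers could fit together. The device count is not stated anywhere in this commit; the value 8 is only what the arithmetic implies, and the function name is hypothetical.

```python
PER_DEVICE_TRAIN_BATCH_SIZE = 4   # --per_device_train_batch_size
GRADIENT_ACCUMULATION_STEPS = 16  # --gradient_accumulation_steps
REPORTED_BATCH_SIZE = 512         # "Batch size" in the README


def effective_batch_size(per_device: int, num_devices: int, accumulation_steps: int) -> int:
    """Examples contributing to one optimizer update."""
    return per_device * num_devices * accumulation_steps


# Assumption: the reported batch size is the effective one, so the device
# count follows from dividing it by the other two factors.
implied_devices = REPORTED_BATCH_SIZE // (
    PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
)
print(implied_devices)
print(effective_batch_size(PER_DEVICE_TRAIN_BATCH_SIZE, implied_devices,
                           GRADIENT_ACCUMULATION_STEPS))
```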
special_tokens_map.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "pad_token": "<pad>",
106
+ "unk_token": "<unk>"
107
+ }
spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:caa6e2f21aeec181276ab80273e3f869ce303ccb8602d68e0524783c3581092d
3
+ size 800223
spiece.vocab ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "extra_ids": 100,
106
+ "name_or_path": "yhavinga/ul2-base-nl36-dutch",
107
+ "pad_token": "<pad>",
108
+ "sp_model_kwargs": {},
109
+ "special_tokens_map_file": null,
110
+ "tokenizer_class": "T5Tokenizer",
111
+ "unk_token": "<unk>",
112
+ "use_fast_tokenizer": false
113
+ }
training_state.json ADDED
@@ -0,0 +1 @@
1
+ {"step": 691215}