---
language: ro
inference: false
license: apache-2.0
---

This is a **T5v1.1 base** model (**247M** parameters), pretrained from scratch on Romanian text with the [t5x](https://github.com/google-research/t5x) platform.

Training was performed on a clean 80GB Romanian text corpus for 4M steps with these [scripts](https://github.com/dumitrescustefan/t5x_models). The model was trained with an encoder sequence length of 512 and a decoder sequence length of 256. 

**!! IMPORTANT !!** This model was pretrained on the span corruption MLM task, meaning this model is **not usable** in any downstream task **without finetuning** first!
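For illustration only, below is a minimal fine-tuning sketch with `T5ForConditionalGeneration`, assuming a hypothetical Romanian summarization pair; the prefix, texts, learning rate and single optimization step are placeholders, and the 512/256 lengths simply mirror the pretraining sequence lengths noted above.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/t5-v1_1-base-romanian')
model = T5ForConditionalGeneration.from_pretrained('dumitrescustefan/t5-v1_1-base-romanian')

# Hypothetical (source, target) pair; replace with your own dataset.
source = "Rezumă: Acesta este un text lung care ar trebui rezumat într-o singură propoziție."
target = "Un rezumat scurt."

inputs = tokenizer(source, max_length=512, truncation=True, return_tensors="pt")
labels = tokenizer(target, max_length=256, truncation=True, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
loss = model(**inputs, labels=labels).loss  # decoder inputs are shifted internally from labels
loss.backward()
optimizer.step()
print(float(loss))
```

In practice you would loop this over batches of a real dataset (for example with the `Trainer` API or a standard PyTorch training loop); the point here is only the input/label layout the model expects.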

### How to load a t5x model

```python
from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/t5-v1_1-base-romanian')
model = T5Model.from_pretrained('dumitrescustefan/t5-v1_1-base-romanian')

input_ids = tokenizer("Acesta este un test", return_tensors="pt").input_ids  # Batch size 1
decoder_input_ids = tokenizer("Acesta este", return_tensors="pt").input_ids  # Batch size 1

# preprocess: prepend the decoder start token (the pad token for T5) to decoder_input_ids.
# This is not needed for T5ForConditionalGeneration, which does the shift internally from the labels argument.
decoder_input_ids = model._shift_right(decoder_input_ids)

# forward pass
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
last_hidden_states = outputs.last_hidden_state

print(last_hidden_states.shape)  # torch.Size([1, 3, 768])
```

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

The model was **not** trained on the cedilla variants ``ş`` and ``ţ``. If you skip this step, performance will drop due to ``<UNK>`` tokens and an increased number of tokens per word.
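If convenient, you can wrap these replacements in a small helper (a sketch, not part of the released code) and apply it to every string before tokenization, using the `tokenizer` loaded above:

```python
def sanitize(text: str) -> str:
    """Map the cedilla letters ş/ţ to the comma-below letters the model was trained on."""
    return text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

input_ids = tokenizer(sanitize("Aşa arată un text cu ş şi ţ."), return_tensors="pt").input_ids
```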

### Acknowledgements

We'd like to thank [TPU Research Cloud](https://sites.research.google/trc/about/) for providing the TPUv4 cores we used to train these models!

### Authors

Yours truly,  

_[Stefan Dumitrescu](https://github.com/dumitrescustefan), [Mihai Ilie](https://github.com/iliemihai) and [Per Egil Kummervold](https://huggingface.co/north)_