piotr-rybak committed on
Commit
03fea7b
·
1 Parent(s): c8a2092

Update README.md

  1. README.md +41 -11
README.md CHANGED
@@ -2,21 +2,34 @@
2
  language: pl
3
  tags:
4
  - herbert
5
- license: cc-by-sa-4.0
6
  ---
 
7
  # HerBERT
8
- **[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish Corpora
9
- using MLM and SSO objectives with dynamic masking of whole words.
 
10
  Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.9.
11
 
12
  ## Tokenizer
13
- The training dataset was tokenized into subwords using ``CharBPETokenizer`` a character level byte-pair encoding with
14
  a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.
15
- We kindly encourage you to use the **Fast** version of tokenizer, namely ``HerbertTokenizerFast``.
16
-
17
- ## HerBERT usage
18
 
 
19
 
 
20
  Example code:
21
  ```python
22
  from transformers import AutoTokenizer, AutoModel
@@ -39,12 +52,29 @@ output = model(
39
  )
40
  ```
41
 
42
-
43
  ## License
44
- CC BY-SA 4.0
45
 
46
 
47
  ## Authors
48
- Model was trained by **Machine Learning Research Team at Allegro** and **Linguistic Engineering Group at Institute of Computer Science, Polish Academy of Sciences**.
49
 
50
- You can contact us at: <a href="mailto:[email protected]">[email protected]</a>
 
2
  language: pl
3
  tags:
4
  - herbert
5
+ license: cc-by-4.0
6
  ---
7
+
8
  # HerBERT
9
+ **[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish corpora
10
+ using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: [HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish](https://www.aclweb.org/anthology/2021.bsnlp-1.1/).
11
+
12
  Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.9.
13
 
14
+ ## Corpus
15
+ HerBERT was trained on six different corpora available for the Polish language:
16
+
17
+ | Corpus | Tokens | Documents |
18
+ | :------ | ------: | ------: |
19
+ | [CCNet Middle](https://github.com/facebookresearch/cc_net) | 3243M | 7.9M |
20
+ | [CCNet Head](https://github.com/facebookresearch/cc_net) | 2641M | 7.0M |
21
+ | [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=1)| 1357M | 3.9M |
22
+ | [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1056M | 1.1M |
23
+ | [Wikipedia](https://dumps.wikimedia.org/) | 260M | 1.4M |
24
+ | [Wolne Lektury](https://wolnelektury.pl/) | 41M | 5.5k |
25
+
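As a quick sanity check on the Corpus table added in this commit, the per-corpus token counts can be summed to get the overall training-set size. This is a minimal sketch: the figures are copied from the table, while the variable names are illustrative and not part of the README.

```python
# Token counts in millions, copied from the Corpus table in the updated README.
corpus_tokens_m = {
    "CCNet Middle": 3243,
    "CCNet Head": 2641,
    "National Corpus of Polish": 1357,
    "Open Subtitles": 1056,
    "Wikipedia": 260,
    "Wolne Lektury": 41,
}

# Overall size and each corpus's share of the total.
total_m = sum(corpus_tokens_m.values())
shares = {name: count / total_m for name, count in corpus_tokens_m.items()}

for name, count in corpus_tokens_m.items():
    print(f"{name:<28}{count:>6}M {shares[name]:7.1%}")
print(f"{'Total':<28}{total_m:>6}M")
```

The six corpora add up to 8598M tokens, and the two CCNet splits together account for roughly 68% of them.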
26
  ## Tokenizer
27
+ The training dataset was tokenized into subwords using a character level byte-pair encoding (``CharBPETokenizer``) with
28
  a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.
 
 
 
29
 
30
+ We kindly encourage you to use the ``Fast`` version of the tokenizer, namely ``HerbertTokenizerFast``.
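To illustrate what character-level byte-pair encoding means in general, the toy sketch below learns a few merges from a tiny text. It is not the ``CharBPETokenizer`` implementation and says nothing about HerBERT's actual vocabulary: the function names are hypothetical, and the `</w>` end-of-word marker is an assumption of this sketch.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merges from whitespace-split text."""
    # Start from single characters plus an end-of-word marker
    # (the marker is an assumption of this sketch).
    words = Counter(tuple(word) + ("</w>",) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


merges, words = learn_bpe("kot kot kot koty", 3)
print(merges)
print(dict(words))
```

On this toy input, three merges are enough to turn the frequent word `kot` into the single symbol `kot</w>`, while the rarer `koty` stays split into `kot`, `y` and `</w>`.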
31
 
32
+ ## Usage
33
  Example code:
34
  ```python
35
  from transformers import AutoTokenizer, AutoModel
 
52
  )
53
  ```
54
 
 
55
  ## License
56
+ CC BY 4.0
57
 
58
+ ## Citation
59
+ If you use this model, please cite the following paper:
60
+ ```
61
+ @inproceedings{mroczkowski-etal-2021-herbert,
62
+ title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
63
+ author = "Mroczkowski, Robert and
64
+ Rybak, Piotr and
65
+ Wr{\'o}blewska, Alina and
66
+ Gawlik, Ireneusz",
67
+ booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
68
+ month = apr,
69
+ year = "2021",
70
+ address = "Kyiv, Ukraine",
71
+ publisher = "Association for Computational Linguistics",
72
+ url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
73
+ pages = "1--10",
74
+ }
75
+ ```
76
 
77
  ## Authors
78
+ The model was trained by **Machine Learning Research Team at Allegro** and [**Linguistic Engineering Group at Institute of Computer Science, Polish Academy of Sciences**](http://zil.ipipan.waw.pl/).
79
 
80
+ You can contact us at: <a href="mailto:[email protected]">[email protected]</a>