julien-c (HF staff) committed · Commit d7616be · 1 Parent(s): 6a6b627

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/HooshvareLab/bert-base-parsbert-uncased/README.md

Files changed (1)
  1. README.md +124 -0
README.md ADDED
@@ -0,0 +1,124 @@
1
+ ## ParsBERT: Transformer-based Model for Persian Language Understanding
2
+
3
+ ParsBERT is a monolingual language model based on Google’s BERT architecture, with the same configuration as BERT-Base.
4
+
5
+ Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
6
+
7
+ All the models (downstream tasks) are uncased and trained with whole word masking. (Coming soon; stay tuned.)
8
+
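To make "whole word masking" concrete, here is a minimal sketch, not the authors' pre-training code. It assumes WordPiece-style tokens in which continuation pieces start with `##`, and it simplifies BERT's procedure by masking each word independently with probability `mask_prob` and always substituting `[MASK]`. The helper name `whole_word_mask` and its parameters are hypothetical.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words: a word is its first piece plus any '##' continuations."""
    # Group token indices so that continuation pieces stay with their word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked = list(tokens)
    for word in words:
        # Decide once per word, then mask every piece of that word together.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


# Tokens taken from the tokenizer output shown in "How to use".
tokens = ["ما", "در", "هوش", "##واره", "معتقدیم"]
print(whole_word_mask(tokens, mask_prob=0.5))
```

The point of the grouping step is that `هوش` and `##واره` are either both masked or both kept, so the model never sees half of a word as a hint.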
9
+
10
+ ---
11
+
12
+ ## Introduction
13
+
14
+ This model is pre-trained on a large Persian corpus of more than 2M documents, covering various writing styles and numerous subjects (e.g., scientific, novels, news). A large subset of this corpus was crawled manually.
15
+
16
+ As part of the ParsBERT methodology, extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpus into a proper format. This process produced more than 40M true sentences.
17
+
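As an illustration of WordPiece segmentation, the sketch below shows the greedy longest-match-first rule that BERT-style tokenizers apply to a single word at inference time. It is not the authors' pre-processing pipeline: the function name `wordpiece_tokenize` and the three-entry `toy_vocab` are hypothetical, and the real ParsBERT vocabulary is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is in the vocabulary.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example only).
toy_vocab = {"هوش", "##واره", "مصنوعی"}
print(wordpiece_tokenize("هوشواره", toy_vocab))
```

With this toy vocabulary the word splits into the same two pieces that appear in the tokenizer output under "How to use".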
18
+
19
+ ## Evaluation
20
+
21
+ ParsBERT is evaluated on three downstream NLP tasks: Sentiment Analysis (SA), Text Classification (TC), and Named Entity Recognition (NER). Because existing resources were insufficient, two large datasets for SA and two for TC were manually composed; they are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models, on all tasks, improving the state-of-the-art performance in Persian language modeling.
22
+
23
+ ## Results
24
+
25
+ The following tables summarize the F1 scores obtained by ParsBERT compared to other models and architectures.
26
+
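For readers who want to compute a comparable score on their own predictions, here is a minimal sketch of an F1 calculation in plain Python. This card does not state which averaging was used, so the macro average below is an assumption, and the labels and predictions are made-up toy data rather than anything from the benchmarks.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy data, for illustration only.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(round(100 * macro_f1(y_true, y_pred), 2))
```

Multiplying by 100 puts the result on the same scale as the scores reported in the tables.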
27
+
28
+
29
+ ### Sentiment Analysis (SA) task
30
+
31
+ | Dataset | ParsBERT | mBERT | DeepSentiPers |
32
+ |:--------------------------:|:---------:|:-----:|:-------------:|
33
+ | Digikala User Comments | 81.74* | 80.74 | - |
34
+ | SnappFood User Comments | 88.12* | 87.87 | - |
35
+ | SentiPers (Multi Class) | 71.11* | - | 69.33 |
36
+ | SentiPers (Binary Class) | 92.13* | - | 91.98 |
37
+
38
+
39
+
40
+ ### Text Classification (TC) task
41
+
42
+ | Dataset | ParsBERT | mBERT |
43
+ |:-----------------:|:--------:|:-----:|
44
+ | Digikala Magazine | 93.59* | 90.72 |
45
+ | Persian News | 97.19* | 95.79 |
46
+
47
+
48
+ ### Named Entity Recognition (NER) task
49
+
50
+ | Dataset | ParsBERT | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
51
+ |:-------:|:--------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
52
+ | PEYMA | 93.10* | 86.64 | - | 90.59 | - | 84.00 | - |
53
+ | ARMAN | 98.79* | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
54
+
55
+
56
+ **If you have tested ParsBERT on a public dataset and want to add your results to the tables above, open a pull request or contact us. Please also make your code available online so we can add it as a reference.**
57
+
58
+ ## How to use
59
+
60
+ ### TensorFlow 2.0
61
+
62
+ ```python
63
+ from transformers import AutoConfig, AutoTokenizer, TFAutoModel
64
+
65
+ config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
66
+ tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
67
+ model = TFAutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
68
+
69
+ text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
70
+ tokenizer.tokenize(text)
71
+
72
+ # Output: ['ما', 'در', 'هوش', '##واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'اگاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']
73
+
74
+ ```
75
+
76
+ ### PyTorch
77
+
78
+ ```python
79
+ from transformers import AutoConfig, AutoTokenizer, AutoModel
80
+
81
+ config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
82
+ tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
83
+ model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
84
+ ```
85
+
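Once the model is loaded, a typical next step is to turn a sentence into a vector. The sketch below is one possible way to do that with PyTorch, not an official recipe: the helper `embed_text` is hypothetical, and using the `[CLS]` position as a sentence representation is a common but crude choice that this card does not prescribe. The libraries are imported inside the function, so defining it does not require them; calling it downloads the model on first use.

```python
MODEL_NAME = "HooshvareLab/bert-base-parsbert-uncased"


def embed_text(text, model_name=MODEL_NAME):
    """Return the [CLS] vector of `text` as a list of floats (hypothetical helper)."""
    # Lazy imports: torch and transformers are only needed when this is called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic inference

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs[0] is the last hidden state: (batch, sequence_length, hidden_size).
    cls_vector = outputs[0][0, 0]
    return cls_vector.tolist()


# Example call (requires network access and the libraries above):
# vector = embed_text("هوش مصنوعی برای همه است.")
# len(vector) should equal the model's hidden size.
```

Loading the tokenizer and model on every call keeps the sketch short; in real use you would load them once and reuse them.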
86
+
87
+ ## NLP Tasks Tutorial
88
+
89
+ Coming soon; stay tuned.
90
+
91
+
92
+ ## Cite
93
+
94
+ Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
95
+
96
+ ```bibtex
97
+ @article{ParsBERT,
98
+ title={ParsBERT: Transformer-based Model for Persian Language Understanding},
99
+ author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
100
+ journal={ArXiv},
101
+ year={2020},
102
+ volume={abs/2005.12515}
103
+ }
104
+ ```
105
+
106
+
107
+ ## Acknowledgments
108
+
109
+ We hereby express our gratitude to the [TensorFlow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computational resources. We also thank the [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
110
+
111
+
112
+ ## Contributors
113
+
114
+ - Mehrdad Farahani: [LinkedIn](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [GitHub](https://github.com/m3hrdadfi)
115
+ - Mohammad Gharachorloo: [LinkedIn](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [GitHub](https://github.com/baarsaam)
116
+ - Marzieh Farahani: [LinkedIn](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [GitHub](https://github.com/marziehphi)
117
+ - Mohammad Manthouri: [LinkedIn](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [GitHub](https://github.com/mmanthouri)
118
+ - Hooshvare Team: [Official Website](https://hooshvare.com/), [LinkedIn](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [GitHub](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
119
+
120
+
121
+ ## Releases
122
+
123
+ ### Release v0.1 (May 27, 2019)
124
+ This is the first version of ParsBERT, based on BERT<sub>BASE</sub>.