Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/monilouise/ner_pt_br/README.md
README.md
ADDED
@@ -0,0 +1,103 @@
---
language:
- pt
tags:
- ner
metrics:
- f1
- accuracy
- precision
- recall
---

# RiskData Brazilian Portuguese NER

## Model description

This is a fine-tuned version of [Neuralmind BERTimbau](https://github.com/neuralmind-ai/portuguese-bert/blob/master/README.md) for the Portuguese language.

For more details, please see <https://github.com/SecexSaudeTCU/noticias_ner>.

## Intended uses & limitations

#### How to use

```python
from transformers import BertForTokenClassification, DistilBertTokenizerFast, pipeline

# Load the fine-tuned NER model.
model = BertForTokenClassification.from_pretrained('monilouise/ner_pt_br')
# The tokenizer is loaded from the base BERTimbau checkpoint; it is cased, so lowercasing is disabled.
tokenizer = DistilBertTokenizerFast.from_pretrained(
    'neuralmind/bert-base-portuguese-cased',
    model_max_length=512,
    do_lower_case=False,
)
# grouped_entities=True merges word pieces that belong to the same entity into one span.
nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
result = nlp("O Tribunal de Contas da União é localizado em Brasília e foi fundado por Rui Barbosa.")
```
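
With grouping enabled, the pipeline returns a list of dictionaries that include the keys `entity_group`, `word` and `score`. The following is a minimal sketch of how such a result could be organized by label. The helper name `group_by_label` is hypothetical, and the sample list is hand-written for illustration: its label strings and scores are assumptions, not actual model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict]) -> Dict[str, List[str]]:
    """Collect the recognized text spans under each entity label."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hypothetical pipeline output, trimmed to the keys used here.
# Labels and scores are invented for illustration only.
sample_result = [
    {"entity_group": "INSTITUIÇÃO PÚBLICA", "score": 0.99, "word": "Tribunal de Contas da União"},
    {"entity_group": "LOCAL", "score": 0.99, "word": "Brasília"},
    {"entity_group": "PESSOA", "score": 0.99, "word": "Rui Barbosa"},
]

for label, spans in group_by_label(sample_result).items():
    print(label, spans)
```

In practice, `sample_result` would be replaced by the `result` variable from the snippet above.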

#### Limitations and bias

- The fine-tuned model was trained on a corpus of around 180 news articles crawled from Google News. The original project's purpose was to recognize named entities in news related to fraud and corruption, classifying these entities into four classes: PERSON, ORGANIZATION, PUBLIC INSTITUTION and LOCATION (PESSOA, ORGANIZAÇÃO, INSTITUIÇÃO PÚBLICA and LOCAL).

## Training data

The training data can be found at <https://github.com/SecexSaudeTCU/noticias_ner/blob/master/dados/labeled_4_labels.jsonl>.
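
The `.jsonl` extension suggests a JSON Lines file, with one JSON object per line. Below is a minimal sketch for reading a local copy of such a file. The helper name `read_jsonl` and the local file name are hypothetical, and nothing is assumed about the fields inside each record.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Usage, assuming the file was downloaded next to this script:
# records = list(read_jsonl("labeled_4_labels.jsonl"))
# print(len(records))
```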

## Training procedure

## Eval results

- accuracy: 0.98
- precision: 0.86
- recall: 0.91
- f1: 0.88
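
As a quick consistency check, assuming the reported F1 is the harmonic mean of the reported precision $P$ and recall $R$ (both already rounded to two decimals):

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.86 \times 0.91}{0.86 + 0.91} = \frac{1.5652}{1.77} \approx 0.88
$$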

The score was calculated using this code:

```python
from typing import Dict, List, Tuple

import numpy as np
# Assumption: the metric functions come from seqeval, which scores lists of tag sequences.
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score
from torch import nn
from transformers import EvalPrediction

# id2tag (label id -> tag string) is expected to be defined by the surrounding training script.


def align_predictions(predictions: np.ndarray, label_ids: np.ndarray) -> Tuple[List[List[str]], List[List[str]]]:
    # Pick the highest-scoring label for every token.
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    out_label_list = [[] for _ in range(batch_size)]
    preds_list = [[] for _ in range(batch_size)]

    for i in range(batch_size):
        for j in range(seq_len):
            # Skip positions marked with the loss ignore index (padding and special tokens).
            if label_ids[i, j] != nn.CrossEntropyLoss().ignore_index:
                out_label_list[i].append(id2tag[label_ids[i][j]])
                preds_list[i].append(id2tag[preds[i][j]])

    return preds_list, out_label_list


def compute_metrics(p: EvalPrediction) -> Dict:
    preds_list, out_label_list = align_predictions(p.predictions, p.label_ids)
    return {
        "accuracy_score": accuracy_score(out_label_list, preds_list),
        "precision": precision_score(out_label_list, preds_list),
        "recall": recall_score(out_label_list, preds_list),
        "f1": f1_score(out_label_list, preds_list),
    }
```
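
To make the masking step concrete, here is a small self-contained sketch of the same alignment idea on a toy batch. It is a vectorized variant written for illustration, not the project's code. The `id2tag` mapping and the logits are hypothetical, and `IGNORE_INDEX` is set to -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
import numpy as np

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Hypothetical label map, for illustration only.
id2tag = {0: "O", 1: "B-PESSOA", 2: "I-PESSOA", 3: "B-LOCAL"}

# Toy logits with shape (batch=1, seq_len=4, num_labels=4).
predictions = np.array([[[0.1, 2.0, 0.3, 0.0],
                         [0.2, 0.1, 1.5, 0.0],
                         [3.0, 0.1, 0.2, 0.0],
                         [0.0, 0.0, 0.0, 9.0]]])
# The last position is masked, as a padding or special token would be.
label_ids = np.array([[1, 2, 0, IGNORE_INDEX]])

preds = np.argmax(predictions, axis=2)
keep = label_ids != IGNORE_INDEX  # True where a real label exists

gold_tags = [[id2tag[int(t)] for t in row[mask]] for row, mask in zip(label_ids, keep)]
pred_tags = [[id2tag[int(t)] for t in row[mask]] for row, mask in zip(preds, keep)]

print(gold_tags)
print(pred_tags)
```

The prediction at the masked position is dropped along with its label, so only tokens with a real label contribute to the metrics.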

### BibTeX entry and citation info

For further information about the BERTimbau language model:

```bibtex
@inproceedings{souza2020bertimbau,
  author = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
  title = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year = {2020}
}

@article{souza2019portuguese,
  title={Portuguese Named Entity Recognition using BERT-CRF},
  author={Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:1909.10649},
  url={http://arxiv.org/abs/1909.10649},
  year={2019}
}
```