Committed by bond005
Commit ae58ac5 · 1 Parent(s): 3f25fae

Update README.md

Files changed (1)
  1. README.md +70 -0
README.md CHANGED
@@ -1,3 +1,73 @@
  ---
+ language: ru
+ tags:
+ - russian
+ - text-to-text
+ - PyTorch
+ - Transformers
  license: apache-2.0
+ widget:
+ - text: <LM>Водка ""Русская валюта"" премиум люкс 38% 0,25л, Россия
+ pipeline_tag: text2text-generation
  ---
+
+ This is a named entity recognizer that extracts goods and brands from Russian-language receipts issued by fiscal data operators.
+
+ It was developed for a multi-stage competition on receipt structuring, organized by the [Open Data Science community](https://ods.ai) and [Alfa-Bank](https://alfabank.ru). The competition consisted of [the first](https://ods.ai/competitions/nlp-receipts), [the second](https://ods.ai/competitions/alfabank-nlp-receipts-2) and [the final](https://ods.ai/competitions/alfabank-nlp-receipts-final) stages. However, the model can be used for any receipt parsing and structuring task in Russian. The repository with code for fine-tuning and inference is available on [gitflic.ru](https://gitflic.ru/project/bond005/ods-ner-2023).
+
+ Usage example:
+
+ ```
+ from typing import Tuple
+ import torch
+ from transformers import T5ForConditionalGeneration, GPT2Tokenizer
+
+
+ MODEL_NAME = 'bond005/FRED-T5-large-ods-ner-2023'
+ START_TAG = '<LM>'
+ END_TAG = '</s>'
+
+
+ def initialize_recognizer(model_path: str) -> Tuple[GPT2Tokenizer, T5ForConditionalGeneration]:
+     model = T5ForConditionalGeneration.from_pretrained(model_path)
+     # This example runs on GPU only.
+     if not torch.cuda.is_available():
+         raise ValueError('CUDA is not available!')
+     model = model.cuda()
+     model.eval()
+     tokenizer = GPT2Tokenizer.from_pretrained(model_path)
+     return tokenizer, model
+
+
+ def recognize(text: str, tokenizer: GPT2Tokenizer, model: T5ForConditionalGeneration) -> Tuple[str, str]:
+     # The input must start with START_TAG; add it if the caller did not.
+     if text.startswith(START_TAG):
+         x = tokenizer(text, return_tensors='pt', padding=True).to(model.device)
+     else:
+         x = tokenizer(START_TAG + text, return_tensors='pt', padding=True).to(model.device)
+     out = model.generate(**x)
+     predictions = tokenizer.decode(out[0], skip_special_tokens=True).strip()
+     # Remove any END_TAG left at the end of the decoded text.
+     while predictions.endswith(END_TAG):
+         predictions = predictions[:-len(END_TAG)].strip()
+     # The text before the first ';' is the goods, the text after it is the brands.
+     prediction_pair = predictions.split(';')
+     if len(prediction_pair) == 0:
+         goods = ''
+         brands = ''
+     elif len(prediction_pair) == 1:
+         goods = prediction_pair[0].strip()
+         brands = ''
+     else:
+         goods = prediction_pair[0].strip()
+         brands = prediction_pair[1].strip()
+     return goods, brands
+
+
+ recognizer = initialize_recognizer(MODEL_NAME)
+
+ goods_and_brands = recognize(text='Водка "Русская валюта" премиум люкс 38% 0,25л, Россия',
+                              tokenizer=recognizer[0], model=recognizer[1])
+
+ print(f'GOODS: {goods_and_brands[0]}')
+ # водка
+
+ print(f'BRANDS: {goods_and_brands[1]}')
+ # русская валюта
+ ```
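
The post-processing at the end of `recognize` can be tried without loading the model. The sketch below isolates it as a pure function; `parse_prediction` is a hypothetical helper name, not part of the README, and the sample string assumes the decoded output has the form `goods; brands`, which is inferred from the `split(';')` call in the example rather than from any documented format.

```python
from typing import Tuple

END_TAG = '</s>'


def parse_prediction(prediction: str) -> Tuple[str, str]:
    """Split a decoded 'goods; brands' string into a (goods, brands) pair."""
    prediction = prediction.strip()
    # Drop any end-of-sequence tags left at the end of the decoded text.
    while prediction.endswith(END_TAG):
        prediction = prediction[:-len(END_TAG)].strip()
    # str.split always returns at least one element, so parts[0] is safe.
    parts = prediction.split(';')
    goods = parts[0].strip()
    # As in the README example, only the second field is kept as brands.
    brands = parts[1].strip() if len(parts) > 1 else ''
    return goods, brands


# Hypothetical decoded output, used only to illustrate the parsing step.
print(parse_prediction('водка; русская валюта</s>'))
# ('водка', 'русская валюта')
```

Keeping this step separate from tokenization and generation makes it possible to check the parsing on a machine without CUDA.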