---
license: mit
language:
- sk
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
base_model: daviddrzik/SK_BPE_BLM
tags:
- pos-tagging
datasets:
- universal-dependencies/universal_dependencies
---

# Fine-Tuned POS Tagging Model - SK_BPE_BLM (POS Tags)

## Model Overview
This model is a fine-tuned version of the [SK_BPE_BLM model](https://huggingface.co/daviddrzik/SK_BPE_BLM) for tokenization and POS tagging. For this task, we used the [UD Slovak SNK dataset](https://github.com/UniversalDependencies/UD_Slovak-SNK), which is part of the Universal Dependencies project. This dataset contains annotated Slovak texts with various linguistic information, including UPOS tags, morphological features, syntactic relations, and lemmatization. We focused on UPOS tags, which provide basic categories of parts of speech.

## POS Tags
Each token in the dataset is annotated with one of the following POS tags:
- **NOUN (0):** Nouns
- **PUNCT (1):** Punctuation marks
- **VERB (2):** Verbs
- **ADJ (3):** Adjectives
- **ADP (4):** Adpositions (Prepositions)
- **PRON (5):** Pronouns
- **PROPN (6):** Proper nouns
- **ADV (7):** Adverbs
- **DET (8):** Determiners
- **AUX (9):** Auxiliary verbs
- **CCONJ (10):** Coordinating conjunctions
- **PART (11):** Particles
- **SCONJ (12):** Subordinating conjunctions
- **NUM (13):** Numerals

UPOS tags not used by this model:
- **X**
- **INTJ**
- **SYM**

## Dataset Details
We adapted the UD Slovak SNK dataset for this task. It provides a UPOS tag for each token, which we used to fine-tune the model to recognize and categorize parts of speech in Slovak. The total number of sequences in the dataset we used is **9,847**.

## Fine-Tuning Hyperparameters

The following hyperparameters were used during the fine-tuning process:

- **Learning Rate:** 3e-05
- **Training Batch Size:** 64
- **Evaluation Batch Size:** 64
- **Seed:** 42
- **Optimizer:** Adam (default)
- **Number of Epochs:** 10
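
To get a feel for what these settings mean in practice, the sketch below estimates the number of optimizer steps in one training run. It rests on assumptions not stated in this card: that each run trains on nine of the ten cross-validation folds and that the last partial batch is kept. The function name `steps_per_run` is hypothetical.

```python
import math

TOTAL_SEQUENCES = 9847  # from the dataset details above
BATCH_SIZE = 64         # training batch size listed above
EPOCHS = 10             # number of epochs listed above
N_FOLDS = 10            # assumption: one fold is held out per run


def steps_per_run(total, folds, batch_size, epochs):
    """Estimate training-set size and optimizer steps for one fold."""
    held_out = math.ceil(total / folds)   # size of the largest held-out fold
    train_size = total - held_out
    steps_per_epoch = math.ceil(train_size / batch_size)  # keep last partial batch
    return train_size, steps_per_epoch, steps_per_epoch * epochs


train_size, per_epoch, total_steps = steps_per_run(
    TOTAL_SEQUENCES, N_FOLDS, BATCH_SIZE, EPOCHS
)
print(train_size, per_epoch, total_steps)
```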

## Model Performance

The model was evaluated using stratified 10-fold cross-validation, achieving a median weighted F1-score of <span style="font-size: 24px;">**0.979**</span> across the folds.
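
For reference, a weighted F1-score averages the per-tag F1 values, weighting each tag by how often it occurs in the gold labels. The following is a minimal plain-Python sketch of that metric. The function name `weighted_f1` and the toy labels are illustrative only; this is not the evaluation code used for the reported score.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-label F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[gold] += 1   # gold label was missed

    support = Counter(y_true)  # how often each label occurs in the gold data
    total = 0.0
    for label, count in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


# Toy example: one NOUN is mistagged as VERB.
gold = ["NOUN", "NOUN", "VERB", "ADJ"]
pred = ["NOUN", "VERB", "VERB", "ADJ"]
print(round(weighted_f1(gold, pred), 4))  # 0.75
```

Frequent tags such as NOUN therefore influence the score more than rare ones such as NUM.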

## Model Usage

This model is suitable for tokenization and POS tagging of Slovak text. It is specifically designed for applications requiring accurate categorization of parts of speech in various texts.

### Example Usage

Below is an example of how to use the fine-tuned `SK_BPE_BLM-pos` model in a Python script:

```python
import torch
from transformers import RobertaForTokenClassification, RobertaTokenizerFast
from huggingface_hub import hf_hub_download
import json

class TokenClassifier:
    def __init__(self, model, tokenizer):
        self.model = RobertaForTokenClassification.from_pretrained(model, num_labels=14)
        self.tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer, max_length=256)
        byte_utf8_mapping_path = hf_hub_download(repo_id=tokenizer, filename="byte_utf8_mapping.json")
        with open(byte_utf8_mapping_path, "r", encoding="utf-8") as f:
            self.byte_utf8_mapping = json.load(f)
            
    def decode(self, tokens):
        decoded_tokens = []
        for token in tokens:
            for k, v in self.byte_utf8_mapping.items():
                if k in token:
                    token = token.replace(k, v)
                token = token.replace("Ġ"," ")
            decoded_tokens.append(token)
        return decoded_tokens

    def tokenize_text(self, text):
        encoded_text = self.tokenizer(text.lower(), max_length=256, padding='max_length', truncation=True, return_tensors='pt')
        return encoded_text

    def classify_tokens(self, text):
        encoded_text = self.tokenize_text(text)
        tokens = self.tokenizer.convert_ids_to_tokens(encoded_text['input_ids'].squeeze().tolist())

        with torch.no_grad():
            output = self.model(**encoded_text)
            logits = output.logits
            predictions = torch.argmax(logits, dim=-1)

            active_loss = encoded_text['attention_mask'].view(-1) == 1
            active_logits = logits.view(-1, self.model.config.num_labels)[active_loss]
            active_predictions = predictions.view(-1)[active_loss]

            probabilities = torch.softmax(active_logits, dim=-1)

            results = []
            for token, pred, prob in zip(self.decode(tokens), active_predictions.tolist(), probabilities.tolist()):
                if token not in ['<s>', '</s>', '<pad>']:
                    result = f"Token: {token: <10}  POS tag: ({self.model.config.id2label[pred]} = {max(prob):.4f})"
                    results.append(result)

        return results

# Instantiate the POS token classifier with the specified tokenizer and model
classifier = TokenClassifier(tokenizer="daviddrzik/SK_BPE_BLM", model="daviddrzik/SK_BPE_BLM-pos")

# Input text to tag
text_to_classify = "Od učenia ešte nikto nezomrel, ale načo riskovať."

# Classify the tokens of the tokenized text
classification_results = classifier.classify_tokens(text_to_classify)
print("============= POS Token Classification =============")
print("Text to classify:", text_to_classify)
for classification_result in classification_results:
    print(classification_result)
```

### Example Output

Here is the output when running the above example:

```text
============= POS Token Classification =============
Text to classify: Od učenia ešte nikto nezomrel, ale načo riskovať.
Token: od          POS tag: (ADP = 0.9984)
Token:  učenia     POS tag: (NOUN = 0.9952)
Token:  ešte       POS tag: (PART = 0.9720)
Token:  nikto      POS tag: (PRON = 0.9947)
Token:  nezom      POS tag: (VERB = 0.9973)
Token: rel         POS tag: (VERB = 0.9950)
Token: ,           POS tag: (PUNCT = 0.9992)
Token:  ale        POS tag: (CCONJ = 0.9981)
Token:  načo       POS tag: (ADV = 0.9804)
Token:  riskovať   POS tag: (VERB = 0.9948)
Token: .           POS tag: (PUNCT = 0.9994)
```
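
The output above is per subword token, so a word such as "nezomrel" appears as two pieces. The sketch below shows one possible way to merge the pieces back into words. It is a heuristic and not part of the model: it assumes that a token starting with a space opens a new word, that punctuation always stands alone, and that a word takes the tag of its first piece. The function name `merge_subwords` is hypothetical.

```python
def merge_subwords(token_tags):
    """Group (token, tag) pairs into (word, tag) pairs.

    Heuristic: a leading space or a PUNCT tag starts a new word;
    all other tokens are appended to the previous word.
    """
    words = []
    for token, tag in token_tags:
        starts_word = token.startswith(" ") or not words
        prev_is_punct = bool(words) and words[-1][1] == "PUNCT"
        if starts_word or tag == "PUNCT" or prev_is_punct:
            words.append([token.strip(), tag])
        else:
            words[-1][0] += token  # continuation piece keeps the first tag
    return [(word, tag) for word, tag in words]


# Tokens and tags taken from the example output above.
tagged = [
    ("od", "ADP"), (" učenia", "NOUN"), (" ešte", "PART"), (" nikto", "PRON"),
    (" nezom", "VERB"), ("rel", "VERB"), (",", "PUNCT"), (" ale", "CCONJ"),
    (" načo", "ADV"), (" riskovať", "VERB"), (".", "PUNCT"),
]
for word, tag in merge_subwords(tagged):
    print(f"{word: <10} {tag}")
```

Taking the first piece's tag is only one option; choosing the tag with the highest probability across the pieces would be another.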