yeshpanovrustem committed · Commit 12b7fef · 1 Parent(s): def38cf
Update README.md
README.md CHANGED
@@ -24,15 +24,13 @@ widget:
 ## KazNERD (cleaned)
 While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens and duplicates were removed.
 As a result, the number of sentences, tokens, and named entities (NEs) in the cleaned dataset changed.
-
-**Statistics for training (Train), validation (Valid), and test (Test) sets**
+### Statistics for training (Train), validation (Valid), and test (Test) sets
 | Unit | Train | Valid | Test | Total |
 | :---: | :---: | :---: | :---: | :---: |
 | Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
 | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
 | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
-
-**80 / 10 / 10 split**
+### 80 / 10 / 10 split
 |Representation| Train | Valid | Test | Total |
 | :---: | :---: | :---: | :---: | :---: |
 | **AID** | 67,582 (79.99%) | 8,439 (9.99%) | 8,467 (10.02%)| 84,488 (100%) |
@@ -42,8 +40,7 @@ As a result, the number of sentences, tokens, and named entities (NEs) in the cl
 | **EID** | 260 (81.00%) | 27 (8.41%) | 34 (10.59%)| 321 (100%) |
 | **FID** | 9 (75.00%) | 1 (8.33%)| 2 (16.67%)| 12 (100%) |
 |**Total**| **88,540 (80.00%)** | **11,067 (10.00%)** | **11,068 (10.00%)** | **110,675 (100%)** |
-
-**Distribution of representations across sets**
+### Distribution of representations across sets
 |Representation| Train | Valid | Test | Total |
 | :---: | :---: | :---: | :---: | :---: |
 | **AID** | 67,582 (76.33%) | 8,439 (76.25%) | 8,467 (76.50%)| 84,488 (76.34%) |
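The percentages in the tables touched by this diff can be recomputed from the raw counts. The following is a minimal sketch in Python, not part of the repository: the names `unit_counts`, `row_shares`, and `column_share` are illustrative assumptions, and the counts are copied by hand from the rows visible in the diff. Row shares divide each split by the row total (as in the statistics and 80 / 10 / 10 tables), while column shares divide a representation's count by the sentence count of the same split (as in the distribution table).

```python
# Counts copied by hand from the tables shown in this diff (Train, Valid, Test).
SPLITS = ("Train", "Valid", "Test")

unit_counts = {
    "Sentence": (88_540, 11_067, 11_068),
    "Token": (1_088_461, 136_021, 135_426),
    "NE": (106_148, 13_189, 13_072),
}


def row_shares(counts):
    """Share of each split in the row total, in percent, rounded to 2 places."""
    total = sum(counts)
    return tuple(round(100 * c / total, 2) for c in counts)


def column_share(part, whole):
    """Share of one representation within a single split, in percent."""
    return round(100 * part / whole, 2)


# Row-wise check: each unit's split should be close to 80 / 10 / 10.
for unit, counts in unit_counts.items():
    print(unit, sum(counts), dict(zip(SPLITS, row_shares(counts))))

# Column-wise check for the AID row of the distribution table:
# AID sentences divided by all sentences in the same split.
aid = (67_582, 8_439, 8_467)
aid_by_split = tuple(
    column_share(a, s) for a, s in zip(aid, unit_counts["Sentence"])
)
print("AID share per split:", dict(zip(SPLITS, aid_by_split)))
```

Running the sketch and comparing its output against the table cells is a quick way to catch transcription errors when the model card is edited again.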