---
license: mit
language:
- ru
pipeline_tag: text-classification
tags:
- safetensors
- text-classification
- tensorflow
- russian
library_name: tf-keras
widget:
- text: Мне нравится этот фильм!
  output:
  - label: POSITIVE
    score: 0.98
  - label: NEGATIVE
    score: 0.02
- text: Какой же ты идиот..
  output:
  - label: POSITIVE
    score: 0.01
  - label: NEGATIVE
    score: 0.99
- text: Паша, купи уже TURMS
  output:
  - label: POSITIVE
    score: 0.82
  - label: NEGATIVE
    score: 0.18
- text: Дp пошtл ты, идиот
  output:
  - label: POSITIVE
    score: 0.01
  - label: NEGATIVE
    score: 0.99
---

#### 1D-CNN-MC-toxicity-classifier-ru
(One-dimensional convolutional neural network with multi-channel input)

Architecture visualization:

![Model architecture diagram](https://i.imgur.com/skbLM6w.png)

Total parameters: 503249

##### Test Accuracy: 94.44%
##### Training Accuracy: 97.46%

This model performs binary toxicity classification (toxic vs. normal) of Russian text written in Cyrillic.

##### The training dataset contains 75093 negative rows and 75093 positive rows.

##### Recommended input sequence length: 25-400 Cyrillic characters.
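
One way to act on this recommendation is to count the Cyrillic letters in an input before classifying it. The sketch below is an illustration, not part of the model's code: the helper names `count_cyrillic` and `within_recommended_length` are hypothetical, and it assumes that "Cyrillic characters" means the Russian letters а-я, А-Я, ё and Ё.

```python
import re

# Assumption: "Cyrillic characters" means the Russian letters а-я, А-Я, ё, Ё.
CYRILLIC_LETTER = re.compile(r"[а-яА-ЯёЁ]")

MIN_CHARS = 25   # lower bound recommended in the model card
MAX_CHARS = 400  # upper bound recommended in the model card


def count_cyrillic(text: str) -> int:
    """Count the Cyrillic letters in text, ignoring everything else."""
    return len(CYRILLIC_LETTER.findall(text))


def within_recommended_length(text: str) -> bool:
    """Return True if text contains between 25 and 400 Cyrillic letters."""
    return MIN_CHARS <= count_cyrillic(text) <= MAX_CHARS


print(within_recommended_length("Привет"))  # 6 letters, below the lower bound
print(within_recommended_length("Мне очень понравился этот фильм, советую посмотреть!"))
```

Inputs outside the range can still be passed to the model; the check only flags text for which the reported accuracy may not hold.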

##### Preprocessing applied to the dataset strings:
- Extra spaces are removed.
- Uppercase letters are converted to lowercase (Я -> я).
- All non-Cyrillic characters, including prefixes, are removed (for example: z, !, ., #, 4, &).
- ё is replaced with е.
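
The simplification steps listed above can be sketched as a small normalization function. This is an assumption-laden illustration rather than the author's preprocessing code: the name `normalize_text` is hypothetical, the order of the steps is a guess, and only the Russian letters а-я are treated as Cyrillic. The card does not say whether inputs should be normalized the same way at inference time.

```python
import re


def normalize_text(text: str) -> str:
    """Apply the four dataset simplification steps to one string."""
    text = text.lower()                       # Я -> я
    text = text.replace("ё", "е")             # Cyrillic ё -> Cyrillic е
    text = re.sub(r"[^а-я\s]", "", text)      # delete everything except а-я and whitespace
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs, trim the ends
    return text


print(normalize_text("  Привет,   Ёжик!  Как dela в 2024? "))
# prints: привет ежик как в
```

Replacing ё before filtering matters here, because ё lies outside the а-я range and would otherwise be deleted instead of converted.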

### Example of use:

```python
import os
import re

import numpy as np
from safetensors.numpy import load_file
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json

# Name of the directory where the model files are stored
model_dir = 'model'
# Length that every input sequence is padded or truncated to
max_len = 400

# Load the model architecture
with open(os.path.join(model_dir, 'model_architecture.json'), 'r', encoding='utf-8') as json_file:
    model_json = json_file.read()
model = keras.models.model_from_json(model_json)

# Load the weights from safetensors
state_dict = load_file(os.path.join(model_dir, 'tf_model.safetensors'))
weights = [state_dict[f'weight_{i}'] for i in range(len(state_dict))]
model.set_weights(weights)

# Load the tokenizer
with open(os.path.join(model_dir, 'tokenizer.json'), 'r', encoding='utf-8') as f:
    tokenizer_json = f.read()
tokenizer = tokenizer_from_json(tokenizer_json)

def predict_toxicity(text):
    # Convert the text to a padded sequence and classify it
    sequences = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
    probability = model.predict(padded)[0][0]
    class_label = "toxic" if probability >= 0.5 else "normal"
    return class_label, probability

# Usage example (the input stays in Russian because the model expects Russian text)
text = "Да какой идиот сделал эту НС?"
class_label, probability = predict_toxicity(text)
print(f"Text: {text}")
print(f"Class: {class_label} ({probability:.2%})")
```

###### Output:
```
Text: Да какой идиот сделал эту НС?
Class: toxic (99.35%)
```
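
The widget examples in the card's metadata report two scores, POSITIVE and NEGATIVE, while `predict_toxicity` returns a single probability. A minimal sketch of converting one into the other follows. It assumes the returned probability is that of the toxic class and that NEGATIVE corresponds to "toxic", which is consistent with the widget examples but is not stated explicitly in the card. The helper name `to_widget_scores` is hypothetical.

```python
def to_widget_scores(toxic_probability: float) -> list[dict]:
    """Convert a toxic-class probability into POSITIVE/NEGATIVE label scores.

    Assumption: NEGATIVE means "toxic" and POSITIVE means "normal".
    """
    p = float(toxic_probability)  # also accepts NumPy scalars
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return [
        {"label": "POSITIVE", "score": round(1.0 - p, 2)},
        {"label": "NEGATIVE", "score": round(p, 2)},
    ]


print(to_widget_scores(0.99))
```

Passing the second value returned by `predict_toxicity` to this helper would give a result in the same shape as the widget output.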