CrabInHoney / 1D-CNN-MC-toxicity-classifier-ru


	---
	license: mit
	language:
	- ru
	pipeline_tag: text-classification
	tags:
	- safetensors
	- text-classification
	- tensorflow
	- russian
	library_name: tf-keras
	widget:
	- text: Мне нравится этот фильм!
	  output:
	  - label: POSITIVE
	    score: 0.98
	  - label: NEGATIVE
	    score: 0.02
	- text: Какой же ты идиот..
	  output:
	  - label: POSITIVE
	    score: 0.01
	  - label: NEGATIVE
	    score: 0.99
	- text: Паша, купи уже TURMS
	  output:
	  - label: POSITIVE
	    score: 0.82
	  - label: NEGATIVE
	    score: 0.18
	- text: Дp пошtл ты, идиот
	  output:
	  - label: POSITIVE
	    score: 0.01
	  - label: NEGATIVE
	    score: 0.99
	---
	#### 1D-CNN-MC-toxicity-classifier-ru
	(One-Dimensional Convolutional Neural Network with Multi-Channel input)

	Architecture visualization:

	![](https://i.imgur.com/skbLM6w.png)

	Total parameters: 503249

	##### Test Accuracy: 94.44%
	##### Training Accuracy: 97.46%

	This model performs binary classification of Russian text written in Cyrillic: each input is labeled either toxic or normal.

	##### A dataset of 75093 negative rows and 75093 positive rows was used for training.

	##### Recommended length of the input sequence: 25 - 400 Cyrillic characters.
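A minimal sketch of how the recommended length could be checked before calling the model. The helper names `cyrillic_length` and `in_recommended_range` are hypothetical, and counting only Cyrillic letters (not spaces) is an assumption; the model card does not say how the length is measured.

```python
import re

MIN_CHARS = 25   # lower bound recommended in this model card
MAX_CHARS = 400  # upper bound recommended in this model card


def cyrillic_length(text: str) -> int:
    """Count Cyrillic letters (а-я and ё, case-insensitive) in the text.

    Assumption: spaces and all other characters are not counted.
    """
    return len(re.findall(r"[а-яё]", text.lower()))


def in_recommended_range(text: str) -> bool:
    """Return True if the text has between 25 and 400 Cyrillic letters."""
    return MIN_CHARS <= cyrillic_length(text) <= MAX_CHARS
```

Texts outside this range can still be passed to the model; the check only flags inputs for which the stated accuracy may not hold.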

	##### Simplifications of the dataset strings:
	Removing extra spaces.

	Replacing uppercase letters with lowercase ones (Я -> я).

	Removing all non-Cyrillic characters, including prefixes (for example: z, !, ., #, 4, &).

	Replacing ё with е.
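The simplification steps listed above can be sketched as a single function, so that inference input resembles the training strings. This is an illustration and not the author's preprocessing code: the name `simplify_text`, the order of the steps, and the choice to delete non-Cyrillic characters outright (rather than replace them with spaces) are all assumptions.

```python
import re


def simplify_text(text: str) -> str:
    """Apply the dataset simplifications described in this model card."""
    # Replace uppercase letters with lowercase ones (Я -> я).
    text = text.lower()
    # Replace ё with е.
    text = text.replace("ё", "е")
    # Remove every character that is neither a Cyrillic letter nor whitespace
    # (assumption: removed characters are deleted, not replaced by spaces).
    text = re.sub(r"[^а-я\s]", "", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `simplify_text("Привет,  Мир! 123 ёж")` returns `"привет мир еж"`.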

	### Example of use:

	import os

	from safetensors.numpy import load_file
	from tensorflow import keras
	from tensorflow.keras.preprocessing.sequence import pad_sequences
	from tensorflow.keras.preprocessing.text import tokenizer_from_json

	# Name of the folder where the model is stored
	model_dir = 'model'
	max_len = 400

	# Load the model architecture
	with open(os.path.join(model_dir, 'model_architecture.json'), 'r', encoding='utf-8') as json_file:
	    model_json = json_file.read()
	model = keras.models.model_from_json(model_json)

	# Load the weights from safetensors (stored under the keys weight_0, weight_1, ...)
	state_dict = load_file(os.path.join(model_dir, 'tf_model.safetensors'))
	weights = [state_dict[f'weight_{i}'] for i in range(len(state_dict))]
	model.set_weights(weights)

	# Load the tokenizer
	with open(os.path.join(model_dir, 'tokenizer.json'), 'r', encoding='utf-8') as f:
	    tokenizer_json = f.read()
	tokenizer = tokenizer_from_json(tokenizer_json)

	def predict_toxicity(text):
	    # Convert the text to a token sequence and pad or truncate it to max_len
	    sequences = tokenizer.texts_to_sequences([text])
	    padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
	    # The model outputs a single probability that the text is toxic
	    probability = float(model.predict(padded)[0][0])
	    class_label = "toxic" if probability >= 0.5 else "normal"
	    return class_label, probability

	# Usage example (the text means "What idiot made this neural network?")
	text = "Да какой идиот сделал эту НС?"
	class_label, probability = predict_toxicity(text)
	print(f"Text: {text}")
	print(f"Class: {class_label} ({probability:.2%})")

	###### Output:
	Text: Да какой идиот сделал эту НС?
	Class: toxic (99.35%)
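The widget examples in the metadata report two labels, POSITIVE and NEGATIVE, while the usage example returns a single probability. A possible mapping between the two is sketched below. It assumes that NEGATIVE corresponds to the toxic class, which matches the widget examples but is not stated explicitly; the function name `to_label_scores` is hypothetical.

```python
def to_label_scores(toxic_probability: float) -> list:
    """Convert a toxic-class probability into widget-style label scores.

    Assumption: NEGATIVE is the toxic class and POSITIVE is the normal class.
    """
    # Clamp to [0, 1] in case of small numerical overshoot.
    p = min(max(float(toxic_probability), 0.0), 1.0)
    return [
        {"label": "POSITIVE", "score": round(1.0 - p, 2)},
        {"label": "NEGATIVE", "score": round(p, 2)},
    ]
```

With this mapping, `to_label_scores(0.99)` gives a POSITIVE score of 0.01 and a NEGATIVE score of 0.99, the same shape as the widget entries.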