---
language:
- es
metrics:
- f1
pipeline_tag: text-classification
datasets:
- dariolopez/suicide-comments-es
license: apache-2.0
---


# Model Description

This model is a fine-tuned version of [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) for detecting suicidal ideation/behavior in public Spanish-language comments (Reddit, forums, Twitter, etc.).

# How to use

```python
>>> from transformers import pipeline


>>> model_name = 'dariolopez/roberta-base-bne-finetuned-suicide-es'
>>> pipe = pipeline("text-classification", model=model_name)

>>> pipe("Quiero acabar con todo. No merece la pena vivir.")
[{'label': 'Suicide', 'score': 0.9999703168869019}]

>>> pipe("El partido de fútbol fue igualado, disfrutamos mucho jugando juntos.")
[{'label': 'Non-Suicide', 'score': 0.999990701675415}]
```


# Training

## Training data

The dataset consists of comments from Reddit and Twitter, together with inputs/outputs from the Alpaca dataset, translated into Spanish and labeled as suicidal ideation/behavior or non-suicidal.

The dataset has 10,050 rows (777 labeled as Suicidal Ideation/Behavior and 9,273 labeled as Non-Suicidal).

More info: https://huggingface.co/datasets/dariolopez/suicide-comments-es
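
To inspect the data locally, here is a minimal sketch using the `datasets` library (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Load the dataset described above from the Hugging Face Hub
ds = load_dataset("dariolopez/suicide-comments-es", split="train")  # split name assumed

print(ds)     # row count and column names
print(ds[0])  # a single labeled comment
```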

## Training procedure

The training data was tokenized using the `PlanTL-GOB-ES/roberta-base-bne` tokenizer (vocabulary size of 50,262 tokens; model maximum length of 512 tokens).
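
For reference, a minimal sketch of this tokenization step (the truncation settings below are an assumption, not the authors' exact configuration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")
print(tokenizer.vocab_size)        # 50262
print(tokenizer.model_max_length)  # 512

# Encode a comment, truncating to the model maximum length (assumed setting)
encoded = tokenizer("Quiero acabar con todo.", truncation=True, max_length=512)
print(encoded["input_ids"])
```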

Training took a total of 10 minutes on an NVIDIA GeForce RTX 3090 GPU:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:68:00.0 Off |                  N/A |
| 31%   50C    P8    25W / 250W |      1MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```


# Considerations for Using the Model

The model is designed for Spanish-language text and is intended specifically for detecting suicidal ideation/behavior.

## Intended uses & limitations

In progress.

## Limitations and bias

In progress.


# Evaluation


## Metric

F1 = 2 * (precision * recall) / (precision + recall)
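
For illustration, the same metric computed with scikit-learn (the labels below are made up for the example):

```python
from sklearn.metrics import f1_score

# Illustrative labels only (1 = Suicide, 0 = Non-Suicide)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Equivalent to 2 * (precision * recall) / (precision + recall)
print(f1_score(y_true, y_pred))  # 0.666...
```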

## 5-fold cross-validation

We use [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with `n_splits=5` to evaluate the model.
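
A sketch of the split protocol is shown below; the fine-tuning and F1 computation inside the loop are omitted, and the shuffle and seed settings are assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10050)  # one entry per dataset row

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle and seed assumed
for fold, (train_idx, eval_idx) in enumerate(kf.split(indices)):
    # Fine-tune on train_idx and compute F1 on eval_idx (training code omitted)
    print(f"fold {fold}: {len(train_idx)} train rows, {len(eval_idx)} eval rows")
```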

Results:

```python
>>> import numpy as np
>>> best_f1_model_by_fold = [0.9163879598662207, 0.9380530973451328, 0.9333333333333333, 0.8943661971830986, 0.9226190476190477]
>>> np.mean(best_f1_model_by_fold)
0.9209519270693666
```


# Additional Information

## Team

* [dariolopez](https://huggingface.co/dariolopez)
* [diegogd](https://huggingface.co/diegogd)

## Licensing

This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).