---
license: apache-2.0
pipeline_tag: text-classification
---

## **Sentiment inference model for stock-related comments**

#### *A project by NUS ISS students Frank Cao, Gerong Zhang, Jiaqi Yao, Sikai Ni, Yunduo Zhang*

<br /> 

### Description

This model was fine-tuned from roberta-base on 3,200,000 comments from StockTwits, using the user-labeled tags 'Bullish' and 'Bearish' as labels.

In the Inference API, try phrases that individual investors might post on an investment forum, for example 'red' or 'green'.

[Code on GitHub](https://github.com/Gitrexx/PLPPM_Sentiment_Analysis_via_Stocktwits/tree/main/SentimentEngine)

<br />

### Training information
- batch size 32
- learning rate 2e-5

|              | Train loss  | Validation loss  | Validation accuracy |
| ----------- | ----------- | ---------------- | ------------------- |
| epoch1      | 0.3495      | 0.2956           | 0.8679              |
| epoch2      | 0.2717      | 0.2235           | 0.9021              |
| epoch3      | 0.2360      | 0.1875           | 0.9210              |
| epoch4      | 0.2106      | 0.1603           | 0.9343              |

<br />

### How to use
```python
import re  # used by process_text below

from transformers import RobertaForSequenceClassification, RobertaTokenizer
from transformers import pipeline
import pandas as pd
import emoji

# the model was trained on text preprocessed as follows
def process_text(texts):

  # remove URLs
  texts = re.sub(r'https?://\S+', "", texts)
  texts = re.sub(r'www.\S+', "", texts)
  # decode the HTML-escaped apostrophe (&#39;)
  texts = texts.replace('&#39;', "'")
  # replace the '#' and '$' prefixes with 'hashtag_' and 'cashtag_'
  texts = re.sub(r'(\#)(\S+)', r'hashtag_\2', texts)
  texts = re.sub(r'(\$)([A-Za-z]+)', r'cashtag_\2', texts)
  # replace the '@' prefix of usernames with 'mention_'
  texts = re.sub(r'(\@)(\S+)', r'mention_\2', texts)
  # convert emoji to their text names
  texts = emoji.demojize(texts, delimiters=("", " "))

  return texts.strip()
  
tokenizer_loaded = RobertaTokenizer.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')
model_loaded = RobertaForSequenceClassification.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')

nlp = pipeline("text-classification", model=model_loaded, tokenizer=tokenizer_loaded)

sentences = pd.Series(['just buy','just sell it',
                      'entity rocket to the sky!',
                      'go down','even though it is going up, I still think it will not keep this trend in the near future'])
# recommended when the input contains URLs or the @, # or $ symbols, so that it matches the training preprocessing:
# sentences = list(sentences.apply(process_text))
sentences = list(sentences)
results = nlp(sentences)
print(results) # 2 labels, label 0 is bearish, label 1 is bullish

```
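
To see what the preprocessing does to a typical post, the sketch below reproduces only the regex steps of `process_text` using the standard library. It is an illustration, not the full training preprocessing: the emoji step (`emoji.demojize`) is left out so the snippet runs without extra packages, and the function name `preprocess_regex_only` and the sample post are made up for this example.

```python
import re


def preprocess_regex_only(text: str) -> str:
    """Regex-only subset of process_text (hypothetical helper, no emoji step)."""
    # remove URLs (patterns copied from process_text above)
    text = re.sub(r'https?://\S+', "", text)
    text = re.sub(r'www.\S+', "", text)
    # decode the HTML-escaped apostrophe before handling '#',
    # otherwise '#39;' would be picked up as a hashtag
    text = text.replace('&#39;', "'")
    # rewrite hashtags, cashtags and mentions as plain-word tokens
    text = re.sub(r'(\#)(\S+)', r'hashtag_\2', text)
    text = re.sub(r'(\$)([A-Za-z]+)', r'cashtag_\2', text)
    text = re.sub(r'(\@)(\S+)', r'mention_\2', text)
    return text.strip()


# made-up sample post, for illustration only
sample = "@trader1 $AAPL can&#39;t stop #bullrun https://example.com"
print(preprocess_regex_only(sample))
# mention_trader1 cashtag_AAPL can't stop hashtag_bullrun
```

Text without URLs or the `@`, `#`, `$` symbols passes through unchanged, which is why the short sentences in the example above can be sent to the pipeline directly.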