xinyangz committed on
Commit
e91570f
·
1 Parent(s): 22ebe2e

Add model card.

  1. README.md +151 -1
README.md CHANGED
@@ -1,3 +1,153 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ language:
3
+ - en
4
+ - ja
5
+ - pt
6
+ - es
7
+ - ko
8
+ - ar
9
+ - tr
10
+ - th
11
+ - fr
12
+ - id
13
+ - ru
14
+ - de
15
+ - fa
16
+ - it
17
+ - zh
18
+ - pl
19
+ - hi
20
+ - ur
21
+ - nl
22
+ - el
23
+ - ms
24
+ - ca
25
+ - sr
26
+ - sv
27
+ - uk
28
+ - he
29
+ - fi
30
+ - cs
31
+ - ta
32
+ - ne
33
+ - vi
34
+ - hu
35
+ - eo
36
+ - bn
37
+ - mr
38
+ - ml
39
+ - hr
40
+ - no
41
+ - sw
42
+ - sl
43
+ - te
44
+ - az
45
+ - da
46
+ - ro
47
+ - gl
48
+ - gu
49
+ - ps
50
+ - mk
51
+ - kn
52
+ - bg
53
+ - lv
54
+ - eu
55
+ - pa
56
+ - et
57
+ - mn
58
+ - sq
59
+ - si
60
+ - sd
61
+ - la
62
+ - is
63
+ - jv
64
+ - lt
65
+ - ku
66
+ - am
67
+ - bs
68
+ - hy
69
+ - or
70
+ - sk
71
+ - uz
72
+ - cy
73
+ - my
74
+ - su
75
+ - br
76
+ - as
77
+ - af
78
+ - be
79
+ - fy
80
+ - kk
81
+ - ga
82
+ - lo
83
+ - ka
84
+ - km
85
+ - sa
86
+ - mg
87
+ - so
88
+ - ug
89
+ - ky
90
+ - gd
91
+ - yi
92
+ tags:
93
+ - Twitter
94
+ - Multilingual
95
+ license: "apache-2.0"
96
  ---
97
+
98
+ # TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations
99
+ [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-green.svg?style=flat-square)](http://makeapullrequest.com)
100
+ [![arXiv](https://img.shields.io/badge/arXiv-2209.07562-b31b1b.svg)](https://arxiv.org/abs/2209.07562)
101
+
102
+
103
+ This repo contains models, code and pointers to datasets from our paper: [TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations](https://arxiv.org/abs/2209.07562).
104
+ [[PDF]](https://arxiv.org/pdf/2209.07562.pdf)
105
+ [[HuggingFace Models]](https://huggingface.co/Twitter)
106
+
107
+ ## Overview
108
+ TwHIN-BERT is a new multilingual Tweet language model trained on 7 billion Tweets from over 100 distinct languages. It differs from prior pre-trained language models in that it is trained not only with text-based self-supervision (e.g., MLM), but also with a social objective based on the rich social engagements within a Twitter Heterogeneous Information Network (TwHIN).
109
+
110
+ TwHIN-BERT can be used as a drop-in replacement for BERT in a variety of NLP and recommendation tasks. It outperforms similar models not only on semantic understanding tasks (such as text classification), but also on **social recommendation** tasks such as predicting user-to-Tweet engagement.
111
+
112
+ ## 1. Pretrained Models
113
+
114
+ We initially release two pretrained TwHIN-BERT models (base and large) that are compatible with the [HuggingFace BERT models](https://github.com/huggingface/transformers).
115
+
116
+
117
+ | Model | Size | Download Link (🤗 HuggingFace) |
118
+ | ------------- | ------------- | --------- |
119
+ | TwHIN-BERT-base | 280M parameters | [Twitter/TwHIN-BERT-base](https://huggingface.co/Twitter/twhin-bert-base) |
120
+ | TwHIN-BERT-large | 550M parameters | [Twitter/TwHIN-BERT-large](https://huggingface.co/Twitter/twhin-bert-large) |
121
+
122
+
123
+ To use these models in 🤗 Transformers:
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ # Load the tokenizer and encoder weights from the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained('Twitter/twhin-bert-large')
+ model = AutoModel.from_pretrained('Twitter/twhin-bert-large')
+
+ # Tokenize one Tweet into PyTorch tensors and run a forward pass.
+ inputs = tokenizer("I'm using TwHIN-BERT! #TwHIN-BERT #NLP", return_tensors="pt")
+ outputs = model(**inputs)
+ ```
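
The `outputs` object returned above holds one hidden vector per token in `outputs.last_hidden_state`. To compare Tweets, a common next step is to collapse those token vectors into a single Tweet-level vector. The sketch below shows one way to do this: mean pooling over non-padding tokens, followed by cosine similarity. Mean pooling is an assumption made here for illustration and is not stated in this model card as the authors' method. The helper names `mean_pool` and `cosine_similarity` are hypothetical, and the toy lists stand in for `outputs.last_hidden_state[0].tolist()` and `inputs["attention_mask"][0].tolist()` so the example runs without downloading the model.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy stand-ins (assumed shapes): 3 tokens, hidden size 2, last token is padding.
toy_hidden = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

embedding = mean_pool(toy_hidden, toy_mask)
print(embedding)  # [2.0, 1.0]
print(cosine_similarity(embedding, embedding))
```

In practice the same pooling would usually be done with tensor operations on `outputs.last_hidden_state` directly; the pure-Python version only makes the arithmetic explicit.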
131
+
132
+
133
+
134
+ <!-- ## 2. Set up environment and data
135
+ ### Environment
136
+ TBD
137
+
138
+
139
+ ## 3. Fine-tune TwHIN-BERT
140
+
141
+ TBD -->
142
+
143
+
144
+ ## Citation
145
+ If you use TwHIN-BERT or our datasets in your work, please cite the following:
146
+ ```bib
147
+ @article{zhang2022twhin,
148
+ title={TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations},
149
+ author={Zhang, Xinyang and Malkov, Yury and Florez, Omar and Park, Serim and McWilliams, Brian and Han, Jiawei and El-Kishky, Ahmed},
150
+ journal={arXiv preprint arXiv:2209.07562},
151
+ year={2022}
152
+ }
153
+ ```