update files and readme
- README.md +31 -23
- config.json +1 -1
- modeling_lsg_camembert.py +9 -16
README.md
CHANGED
@@ -2,10 +2,10 @@
 language: fr
 tags:
 - long context
+pipeline_tag: fill-mask
 ---

 # LSG model
-
 **Transformers >= 4.18.0**\
 **This model relies on a custom modeling file, you need to add trust_remote_code=True**\
 **See [\#13467](https://github.com/huggingface/transformers/pull/13467)**

@@ -16,16 +16,14 @@ tags:
 * [Tasks](#tasks)
 * [Training global tokens](#training-global-tokens)

-This model
-
-The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in config). It is however recommended, thanks to the tokenizer, to truncate the inputs (truncation=True) and optionally to pad to a multiple of the block size (pad_to_multiple_of=...). \
-
-The model
-
-Support encoder-decoder
+This model is adapted from [CamemBERT-base](https://huggingface.co/camembert-base) without additional pretraining yet. It uses the same number of parameters/layers and the same tokenizer.
+
+This model can handle long sequences, faster and more efficiently than Longformer or BigBird (from Transformers), and relies on Local + Sparse + Global attention (LSG).
+
+The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in config). It is however recommended, thanks to the tokenizer, to truncate the inputs (truncation=True) and optionally to pad to a multiple of the block size (pad_to_multiple_of=...). \
+
+Supports encoder-decoder, but I didn't test it extensively.\
 Implemented in PyTorch.

 ![attn](attn.png)

@@ -36,8 +34,8 @@ The model relies on a custom modeling file, you need to add trust_remote_code=True
 ```python:
 from transformers import AutoModel, AutoTokenizer

-model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr", trust_remote_code=True)
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr")
+model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 ```

 ## Parameters

@@ -54,7 +52,7 @@ Default parameters work well in practice. If you are short on memory, reduce blo
 ```python:
 from transformers import AutoModel

-model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096",
     trust_remote_code=True,
     num_global_tokens=16,
     block_size=64,

@@ -66,7 +64,6 @@ model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr",
 )
 ```

-
 ## Sparse selection type

 There are 5 different sparse selection patterns. The best type is task dependent. \

@@ -92,21 +89,19 @@ Note that for sequences with length < 2*block_size, the type has no effect.
 * Each head will use blocks of tokens strided by sparsify_factor
 * Not recommended if sparsify_factor > num_heads

-
 ## Tasks
 Fill mask example:
 ```python:
 from transformers import FillMaskPipeline, AutoModelForMaskedLM, AutoTokenizer

-model = AutoModelForMaskedLM.from_pretrained("ccdv/lsg-base-4096-fr", trust_remote_code=True)
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr")
+model = AutoModelForMaskedLM.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")

-SENTENCES =
+SENTENCES = "Paris est la <mask> de la France."
 pipeline = FillMaskPipeline(model, tokenizer)
-output = pipeline(SENTENCES
+output = pipeline(SENTENCES)
-
-
-> ['Paris est la capitale de la france.', 'Le sens de la vie est simple.']
+
+> 'Paris est la capitale de la France.'
 ```


@@ -114,11 +109,11 @@ Classification example:
 ```python:
 from transformers import AutoModelForSequenceClassification, AutoTokenizer

-model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-camembert-base-4096",
     trust_remote_code=True,
     pool_with_global=True, # pool with a global token instead of first token
 )
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr")
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")

 SENTENCE = "This is a test for sequence classification. " * 300
 token_ids = tokenizer(

@@ -137,16 +132,29 @@ To train global tokens and the classification head only:
 ```python:
 from transformers import AutoModelForSequenceClassification, AutoTokenizer

-model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-camembert-base-4096",
     trust_remote_code=True,
     pool_with_global=True, # pool with a global token instead of first token
     num_global_tokens=16
 )
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr")
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")

 for name, param in model.named_parameters():
     if "global_embeddings" not in name:
         param.requires_grad = False
     else:
         param.requires_grad = True
+```
+
+**CamemBERT**
+```
+@inproceedings{Martin_2020,
+  doi = {10.18653/v1/2020.acl-main.645},
+  url = {https://doi.org/10.18653%2Fv1%2F2020.acl-main.645},
+  year = 2020,
+  publisher = {Association for Computational Linguistics},
+  author = {Louis Martin and Benjamin Muller and Pedro Javier Ortiz Su{\'{a}}rez and Yoann Dupont and Laurent Romary and {\'{E}}ric de la Clergerie and Djam{\'{e}} Seddah and Beno{\^{\i}}t Sagot},
+  title = {{CamemBERT}: a Tasty French Language Model},
+  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}
+}
 ```
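The updated card still recommends truncating inputs and padding to a multiple of the block size. As a complement, here is a minimal sketch of what that tokenizer call could look like; the repository name comes from the diff above, while max_length=4096 and block_size=64 are assumed defaults taken from the model name and the Parameters example, not values stated elsewhere in this commit:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")

# Truncate to an assumed 4096-token window and pad the batch up to a
# multiple of the (assumed) default block size of 64.
inputs = tokenizer(
    "Paris est la capitale de la France. " * 300,
    truncation=True,
    max_length=4096,
    padding=True,            # padding must be enabled for pad_to_multiple_of to apply
    pad_to_multiple_of=64,
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # sequence length is a multiple of 64
```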
config.json
CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "ccdv/lsg-base-4096-fr",
+  "_name_or_path": "ccdv/lsg-camembert-base-4096",
   "adaptive": true,
   "architectures": [
     "LSGCamembertForMaskedLM"
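Since only `_name_or_path` changes here, a quick sanity check is to load the configuration under the new name; this is a sketch, and the printed fields are simply the ones visible in the diff above:

```python
from transformers import AutoConfig

# trust_remote_code is required because the config class ships with the repository.
config = AutoConfig.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
print(config.adaptive)       # True, per config.json
print(config.architectures)  # ['LSGCamembertForMaskedLM']
```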
modeling_lsg_camembert.py
CHANGED
@@ -1032,33 +1032,26 @@ class LSGCamembertModel(LSGCamembertPreTrainedModel, RobertaModel):
             return_dict=return_dict
         )

-
+        sequence_output = encoder_outputs[0]
         if self.pool_with_global:
-
+            sequence_output[:, self.num_global_tokens] = sequence_output[:, 0]

         diff = t - t_
-        n, _, d =
-
+        n, _, d = sequence_output.size()
+        sequence_output = sequence_output[..., self.num_global_tokens:, :]

         # Adapt sequence to initial shape
         if diff < 0:
-
+            sequence_output = sequence_output[:, :t]

-        encoder_outputs.last_hidden_state = context
-        sequence_output = encoder_outputs[0]
         pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

         if not return_dict:
             return (sequence_output, pooled_output) + encoder_outputs[1:]
-
-
-
-
-            past_key_values=encoder_outputs.past_key_values,
-            hidden_states=encoder_outputs.hidden_states,
-            attentions=encoder_outputs.attentions,
-            cross_attentions=encoder_outputs.cross_attentions,
-        )
+
+        encoder_outputs.last_hidden_state = sequence_output
+        encoder_outputs.pooler_output = pooled_output
+        return encoder_outputs

     def get_extended_attention_mask(self, attention_mask, input_shape, device=None):
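For readers skimming the diff, the rewritten block boils down to the reshaping below; this is a toy illustration with made-up shapes, not code taken from the repository:

```python
import torch

num_global_tokens = 8
t, t_ = 100, 128      # original length vs. padded length (multiple of the block size)
hidden = 768

# Encoder output: [batch, global tokens + padded sequence, hidden]
sequence_output = torch.randn(2, num_global_tokens + t_, hidden)

# pool_with_global: copy the first global token into the first regular position
sequence_output[:, num_global_tokens] = sequence_output[:, 0]

# Drop the global tokens, then trim the padding back to the original length
diff = t - t_
sequence_output = sequence_output[..., num_global_tokens:, :]
if diff < 0:
    sequence_output = sequence_output[:, :t]

print(sequence_output.shape)  # torch.Size([2, 100, 768])
```

The resulting tensor is then written back onto `encoder_outputs.last_hidden_state`, which avoids rebuilding the output object field by field as the removed lines did.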