hails committed on
Commit 23919f4 · 1 Parent(s): f60b41d

Add better infilling documentation

  1. README.md +42 -10
README.md CHANGED
@@ -40,7 +40,7 @@ CarperAI will be releasing larger LMs better tuned for code in the near future,
  | \\(n_{heads}\\) | 16 |
  | \\(d_{head}\\) | 128 |
  | \\(n_{ctx}\\) | 2048 |
- | \\(n_{vocab}\\) | 50254 |
  | Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)
@@ -105,27 +105,59 @@ language model output is generated after \<MID\> token!

  As a concrete example, here is a code snippet that should allow a model to perform infilling:

- ```python


- from transformers import AutoTokenizer, AutoModelForCausalLM


- tokenizer = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")
- model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")

- prelude = "this is some text preceding the cursor,"
- suffix = "and this is some text after it."


- model_tokenized_input = [50253, *tokenizer(suffix), 50254, *tokenizer(prelude), 50255]

- infilled = model.generate(model_tokenized_input)


  ```

- We are working on making a better interface for this in future model releases or updates to the tokenizer.


  ## Intended Uses and Limitations
 
  | \\(n_{heads}\\) | 16 |
  | \\(d_{head}\\) | 128 |
  | \\(n_{ctx}\\) | 2048 |
+ | \\(n_{vocab}\\) | 50280 |
  | Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)
 

  As a concrete example, here is a code snippet that should allow a model to perform infilling:

+ There was an issue where the sentinel `<|SUF|>`, `<|PRE|>`, and `<|MID|>` tokens were not the correct ids in the uploaded tokenizer and model card! Please try clearing the Huggingface cache and redownloading the model :))

+ Here is a minimal example of performing open-ended generation with this model, on a simple function `score(x, y)`:
+ ```
+ def score(x,y) -> int:
+ """
+
+ ```

+ and also infilling with the function and end of docstring already placed:

+ ```
+ def score(x,y) -> int:
+ """
+ <|MID|> (infill here)
+ """

+ score = x + y
+ return score
+ ```
+ ```
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
+ tok = AutoTokenizer.from_pretrained("CarperAI/

+ # infilling demo
+ prefix = 'def score(x, y) -> int:\n"""\n'
+ suffix = '"""\n\n score = x + y\n return score'

+ model_input = [50277, *tok(suffix)["input_ids"], 50278, *tok(prefix)["input_ids"], 50279]
+ output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=40)[0])

+ print(output)
+ ```
+ outputs: `'<|SUF|>"""\n\n score = x + y\n return score<|PRE|>def score(x, y) -> int:\n"""\n<|MID|> score(x, y) -> int\n<|endoftext|>'`
+ ```
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch

+ # non-infilling demo
+ prefix = 'def score(x, y) -> int:\n"""\n'
+ model_input = [*tok(prefix)["input_ids"]]
+ output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=100)[0])
+ print(output)
  ```
+ outputs: `'def score(x, y) -> int:\n"""\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y_list))\n\ndef get_point_score(x, y) -> int:\n """\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y'`
+

+ The sentinel tokens are now accessible via `tokenizer.decode(50277) = "<|SUF|>"`, `tokenizer.decode(50278) = "<|PRE|>"`, `tokenizer.decode(50279) = "<|MID|>"`.


  ## Intended Uses and Limitations
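The infilling layout described in the updated README (suffix first, then prefix, then the `<|MID|>` sentinel, with generation continuing after it) can be sketched without loading the model. The following is a minimal illustration, not part of the commit: the helper names `build_fim_input` and `extract_infill` are hypothetical, the sentinel ids are taken from the README text above and should be checked against your own downloaded tokenizer, and the toy id lists stand in for `tok(prefix)["input_ids"]` and `tok(suffix)["input_ids"]`.

```python
# Sentinel ids as stated in the updated README (assumption: verify them
# with tokenizer.decode on your own copy of the tokenizer).
SUF_ID, PRE_ID, MID_ID = 50277, 50278, 50279


def build_fim_input(prefix_ids, suffix_ids):
    """Arrange token ids as <|SUF|> suffix <|PRE|> prefix <|MID|>."""
    return [SUF_ID, *suffix_ids, PRE_ID, *prefix_ids, MID_ID]


def extract_infill(decoded, mid="<|MID|>", eos="<|endoftext|>"):
    """Return the text after the last <|MID|>, cut at the end-of-text marker.

    If no <|MID|> is present, the whole string (up to eos) is returned.
    """
    _, _, tail = decoded.rpartition(mid)
    return tail.split(eos, 1)[0]


# Toy ids stand in for real tokenizer output.
ids = build_fim_input(prefix_ids=[11, 12], suffix_ids=[21, 22, 23])
print(ids)  # [50277, 21, 22, 23, 50278, 11, 12, 50279]

# A shortened string in the shape of the decoded model output shown above.
decoded = (
    '<|SUF|>"""\n<|PRE|>def score(x, y) -> int:\n"""\n'
    "<|MID|> score(x, y) -> int\n<|endoftext|>"
)
print(repr(extract_infill(decoded)))  # ' score(x, y) -> int\n'
```

With the real model, the list returned by `build_fim_input` would be wrapped in a tensor and passed to `model.generate`, as in the infilling demo in the diff, and `extract_infill` would be applied to the decoded result.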