---
license: cc-by-sa-4.0
datasets:
- bigcode/the-stack-dedup
tags:
- code
language:
- code
programming_language: 
- Markdown
- Java
- JavaScript
- Python
- TypeScript
- PHP
- SQL
- JSX
- reStructuredText
- Rust
- C
- CSS
- Go
- C++
- HTML
- Vue
- Ruby
- Jupyter Notebook
- R
- Shell
model-index:
- name: replit-code-v1-3b
  results:
  - task: 
      name: Code Generation
      type: code-generation
    dataset:
      name: "HumanEval" 
      type: openai_humaneval
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.219
      verified: false
---


# replit-code-v1-3b
Developed by: Replit, Inc.

[**🧑‍💻 Test it on our Demo Space! 🧑‍💻**](https://huggingface.co/spaces/replit/replit-code-v1-3b-demo)

[**⚙️ Fine-tuning and Instruct-tuning guides ⚙️**](https://github.com/replit/replitLM)

## Model Description
`replit-code-v1-3b` is a 2.7B Causal Language Model focused on **Code Completion**. The model has been trained on a subset of the [Stack Dedup v1.2 dataset](https://arxiv.org/abs/2211.15533).

The training mixture includes **20 different languages**, listed here in descending order of number of tokens: 
<br/>
`Markdown`, `Java`, `JavaScript`, `Python`, `TypeScript`, `PHP`, `SQL`, `JSX`, `reStructuredText`, `Rust`, `C`, `CSS`, `Go`, `C++`, `HTML`, `Vue`, `Ruby`, `Jupyter Notebook`, `R`, `Shell`
<br/>
The training dataset contains 175B tokens, repeated over 3 epochs, so `replit-code-v1-3b` has been trained on **525B** tokens in total (~195 tokens per parameter).
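
The token budget above can be checked with a few lines of arithmetic. This is only a sketch that takes the parameter count as exactly 2.7B (the rounded figure from the description), so the ratio comes out near 194, which the text rounds to ~195.

```python
# Figures taken from the model description; the parameter count is the rounded 2.7B.
dataset_tokens = 175_000_000_000
epochs = 3
parameters = 2_700_000_000

total_tokens = dataset_tokens * epochs          # tokens seen during training
tokens_per_parameter = total_tokens / parameters

print(f"total tokens: {total_tokens / 1e9:.0f}B")
print(f"tokens per parameter: {tokens_per_parameter:.1f}")
```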

The model has been trained on the [MosaicML](https://www.mosaicml.com/) platform with 256 x A100-40GB GPUs, leveraging their latest [LLM examples repo](https://github.com/mosaicml/examples/tree/release/v0.0.4/examples/llm).
<br/>
`replit-code-v1-3b` is powered by state-of-the-art LLM techniques, such as: 
[Flash Attention](https://arxiv.org/abs/2205.14135) for fast training and inference,
[ALiBi positional embeddings](https://arxiv.org/abs/2108.12409) to support variable context length at inference time, 
[LionW optimizer](https://arxiv.org/abs/2302.06675), 
etc.

## Intended Use
Replit intends this model be used by anyone as a foundational model for application-specific fine-tuning without strict limitations on commercial use.

## Limitations
The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, and such content may be reflected in model-generated text. We recommend that users exercise reasonable caution when using the model in production systems. Do not use it for any applications that may cause harm or distress to individuals or groups.

## License
The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA 4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.

The source code files (`*.py`) are licensed under the Apache 2.0 license.

## Contact
For questions and comments about the model, please post in the community section. 

## How to Use
First, install the latest versions of the following dependencies:
```
einops
sentencepiece
torch
transformers
```

You can then load the model as follows:
```python
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
```

To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision, first install the following dependencies: 
```
flash-attn==0.2.8
triton==2.0.0.dev20221202
```

Then, move the model to `bfloat16` and use it as follows:
```python
import torch
from transformers import AutoModelForCausalLM, AutoConfig

config = AutoConfig.from_pretrained(
    "replit/replit-code-v1-3b",
    trust_remote_code=True
)
config.attn_config['attn_impl'] = 'triton'

# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', config=config, trust_remote_code=True)
model.to(device='cuda:0', dtype=torch.bfloat16)

# forward pass
x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
x = x.to(device='cuda:0')
y = model(x)

```

Note that `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the
[Transformers](https://huggingface.co/docs/transformers/index) library. 

### Tokenizer

We trained a custom SentencePiece Unigram tokenizer with a vocabulary of 32768 tokens, optimized specifically for code.

Note that using this requires the `sentencepiece` library to be installed. 

The tokenizer can be used as follows:

```python
from transformers import AutoTokenizer

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

# single input encoding + generation (assumes `model` was loaded as shown above)
x = tokenizer.encode('def hello():\n  print("hello world")\n', return_tensors='pt')
y = model.generate(x)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
```

Note that: 
- `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library. 
- `clean_up_tokenization_spaces=False` is meant to avoid removing spaces in the output, because that would affect the syntactical correctness of the generated code. 
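
To see why space cleanup can hurt generated code, here is a small self-contained sketch. The `naive_cleanup` function is a hypothetical, simplified stand-in for prose-oriented cleanup (dropping the space before punctuation); it is not the Transformers implementation, only an illustration of the kind of rewrite the flag is meant to avoid.

```python
def naive_cleanup(text: str) -> str:
    """Hypothetical prose-style cleanup: remove the space before punctuation marks."""
    for mark in (".", ",", "?", "!"):
        text = text.replace(" " + mark, mark)
    return text


# A line of generated code whose string literal contains " , " on purpose.
snippet = 'parts = line.split(" , ")'
cleaned = naive_cleanup(snippet)

print(snippet)
print(cleaned)  # the separator inside the string literal has changed
```

The cleaned line still parses, but it now splits on a different separator, so the program behaves differently. Leaving the decoded text untouched avoids this class of silent change.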


### Generation

You can generate code using the `transformers` library as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
```

Experiment with different decoding methods and parameters to get the best results for your use case.
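
One way to organize such experiments is to keep a few named sets of keyword arguments for `model.generate`. The sketch below is only a starting point: the helper name `generation_kwargs` and the preset names are hypothetical, and the values other than those in the example above are illustrative, not tuned recommendations.

```python
def generation_kwargs(strategy: str, max_new_tokens: int = 64) -> dict:
    """Return keyword arguments for `model.generate` for a named decoding strategy."""
    presets = {
        # deterministic: always pick the most likely next token
        "greedy": {"do_sample": False},
        # the sampling settings used in the example above
        "low_temperature": {"do_sample": True, "top_p": 0.95, "top_k": 4, "temperature": 0.2},
        # illustrative values for more varied completions
        "high_temperature": {"do_sample": True, "top_p": 0.95, "temperature": 0.8},
    }
    if strategy not in presets:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {"max_new_tokens": max_new_tokens, **presets[strategy]}


for name in ("greedy", "low_temperature", "high_temperature"):
    print(name, generation_kwargs(name))
```

With `model`, `tokenizer`, and `x` defined as in the example above, a preset could be applied with `model.generate(x, eos_token_id=tokenizer.eos_token_id, **generation_kwargs("greedy"))`.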


### Loading with 8-bit and 4-bit quantization

#### Loading in 8-bit
You can also load the model in 8-bit with the `load_in_8bit=True` kwarg that uses `bitsandbytes` under the hood.

First you need to install the following additional dependencies: 
```
accelerate
bitsandbytes
```

Then you can load the model in 8-bit as follows:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", 
                                             trust_remote_code=True, 
                                             device_map="auto",
                                             load_in_8bit=True)
```
The additional kwargs that make this possible are `device_map='auto'` and `load_in_8bit=True`. 

#### Loading in 4-bit

At the time of writing, support for `load_in_4bit` has not been merged into the latest releases of 
`transformers` and `accelerate`. However, you can use it if you install the dependencies from the `main` branches of the published repos:

```bash
pip install git+https://github.com/huggingface/accelerate.git
pip install git+https://github.com/huggingface/transformers.git
```

Then load in 4-bit with:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", 
                                             trust_remote_code=True, 
                                             device_map="auto",
                                             load_in_4bit=True)
```
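
As a rough guide to what quantized loading saves, the sketch below estimates the memory needed for the weights alone at each precision. It assumes exactly 2.7B parameters and ignores activations, the KV cache, and any quantization overhead, so real usage will be higher.

```python
# Approximate parameter count from the model description.
PARAMETERS = 2_700_000_000

# Storage cost per weight at each precision, in bytes.
BYTES_PER_PARAMETER = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(precision: str) -> float:
    """Estimate the memory taken by the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return PARAMETERS * BYTES_PER_PARAMETER[precision] / 1e9


for precision in BYTES_PER_PARAMETER:
    print(f"{precision:>8}: {weight_memory_gb(precision):.2f} GB")
```

Under these assumptions the weights take roughly 10.8 GB in `float32`, 5.4 GB in `bfloat16`, 2.7 GB in 8-bit, and 1.35 GB in 4-bit.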

#### References
- [Hugging Face's Quantization Doc](https://huggingface.co/docs/transformers/main/main_classes/quantization)
- [Original Blogpost introducing 8-bit](https://huggingface.co/blog/hf-bitsandbytes-integration)
- [New Blogpost introducing 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes)


### Post Processing

As with all code generation models, post-processing of the generated code is important. In particular, the following post-processing steps are recommended:
- stop generation when the EOS token is encountered
- remove trailing whitespace
- set `max_tokens` (`max_length` or `max_new_tokens` in `transformers`) to a reasonable value based on your completion use case
- truncate generation at stop words such as `return`, `def`, a code fence (three backticks), and `\n\n\n`, to avoid generating incomplete code when `max_tokens` is larger than the length of the expected generated code.
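
The truncation and whitespace steps above can be sketched as a small helper applied to the newly generated text (without the prompt). The function name `postprocess_completion` and the stop sequences are hypothetical choices for Python completions; adjust them to your language and use case.

```python
# Illustrative stop sequences: a new top-level definition, a code fence, or a long blank gap.
STOP_SEQUENCES = ("\ndef ", "\nclass ", "\n```", "\n\n\n")


def postprocess_completion(completion: str, stop_sequences=STOP_SEQUENCES) -> str:
    """Cut the completion at the earliest stop sequence, then strip trailing whitespace."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    truncated = completion[:cut]
    # Remove trailing whitespace on every line and at the end of the text.
    lines = [line.rstrip() for line in truncated.splitlines()]
    return "\n".join(lines).rstrip()


raw = "    return a + b   \n\n\ndef unrelated():\n    pass\n"
print(repr(postprocess_completion(raw)))
```

Here the completion is cut at the blank gap before `def unrelated()`, leaving only the body line that finishes the function being completed.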