Commit 23ddca4 by PunchPunch22 · 1 Parent(s): ea5dbd8

Upload 7 files

Files changed (7)
  1. README.md +79 -10
  2. config.json +54 -0
  3. gitattributes.txt +34 -0
  4. merges.txt +0 -0
  5. special_tokens_map.json +1 -0
  6. tokenizer.json +0 -0
  7. vocab.json +0 -0
README.md CHANGED
@@ -1,13 +1,82 @@
  ---
- title: Salesforce Codegen 350M Multi
- emoji: 🐠
- colorFrom: indigo
- colorTo: green
- sdk: gradio
- sdk_version: 3.34.0
- app_file: app.py
- pinned: false
- license: mit
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ license: afl-3.0
  ---
 
+ # PyCodeGPT
+ A pre-trained GPT model for Python code completion and generation
+
+ ## What is it?
+
+ PyCodeGPT is an efficient and effective GPT-Neo-based model for Python code generation, similar to [OpenAI Codex](https://openai.com/blog/openai-codex/), [GitHub Copilot](https://copilot.github.com/), [CodeParrot](https://huggingface.co/blog/codeparrot), and [AlphaCode](https://deepmind.com/blog/article/Competitive-programming-with-AlphaCode).
+
+ ## Training Data
+ Because publicly released datasets are small, we collected data from GitHub from scratch. We first crawled 1.2M Python-related repositories hosted on GitHub, then used the repository URLs to download the full contents of each repository. This gave us 60M raw Python files, each under 1MB, with a total size of 330GB. Finally, we applied a set of carefully designed data-cleaning strategies to obtain about 96GB of training data. The following table gives the details.
+
+ |Model|Repositories|Size and number of files after filtering|
+ |:------:|:---:|:---:|
+ | CodeParrot | 0.56M | 12GB (compressed), 5.4M |
+ | Codex | 54M | 159GB |
+ | PyCodeGPT | 1.2M | 96GB, 13M |
+
+
+ ## Pretrained models
+
+ We aim to train medium-to-large pre-trained models (model size of 110M) based on GPT-Neo:
+ - PyCodeGPT-110M: derived from GPT-Neo 125M with a vocabulary size of 32K.
+
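As a sanity check on the "110M" figure, the parameter count can be estimated from the hyperparameters in the config.json added by this commit (`hidden_size` 768, `num_layers` 12, `vocab_size` 32000, `max_position_embeddings` 2048). The sketch below is not from the README. The function name `gpt_neo_param_count` is hypothetical, and the layer layout it assumes (query/key/value projections without bias, output projection with bias, a 4x feed-forward width when `intermediate_size` is null, tied input/output embeddings) is an assumption about the GPT-Neo architecture rather than something this commit states.

```python
def gpt_neo_param_count(vocab_size, hidden_size, num_layers, max_positions,
                        intermediate_size=None):
    """Estimate the parameter count of a GPT-Neo style decoder.

    Assumes tied input/output embeddings, so the LM head adds nothing.
    """
    # A null intermediate_size is assumed to mean 4 * hidden_size.
    inner = intermediate_size if intermediate_size is not None else 4 * hidden_size

    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * hidden_size + max_positions * hidden_size

    # Each layer norm has a weight and a bias vector.
    layer_norm = 2 * hidden_size

    # Assumed layout: q/k/v projections without bias, output projection with bias.
    attention = 3 * hidden_size * hidden_size + (hidden_size * hidden_size + hidden_size)

    # Two-layer feed-forward block, both layers with bias.
    mlp = (hidden_size * inner + inner) + (inner * hidden_size + hidden_size)

    per_layer = 2 * layer_norm + attention + mlp
    # The trailing layer_norm term is the final layer norm after the last block.
    return embeddings + num_layers * per_layer + layer_norm


pycodegpt = gpt_neo_param_count(32000, 768, 12, 2048)
gpt_neo_125m = gpt_neo_param_count(50257, 768, 12, 2048)
print(f"PyCodeGPT (32K vocab):   {pycodegpt:,}")     # 111,177,216
print(f"GPT-Neo 125M (50257):    {gpt_neo_125m:,}")  # 125,198,592
```

Under these assumptions the only difference between the two models is the size of the token embedding matrix, which accounts for the drop from roughly 125M to roughly 111M parameters.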
+ ## GitHub
+
+ [https://github.com/microsoft/PyCodeGPT](https://github.com/microsoft/PyCodeGPT)
+
+ ## Evaluation Results
+
+ Here are our evaluation results on the HumanEval dataset:
+
+ Note: our model achieves accuracy comparable to Codex models of similar size.
+
+ |Model|Pass@1|Pass@10|Pass@100|
+ |:------:|:---:|:---:|:---:|
+ |PyCodeGPT-110M |**8.32%** |**13.53%** |**18.3%** |
+ |||||
+ |GPT-Neo 125M |0.75% |1.88% |2.97% |
+ |GPT-Neo 1.3B |4.97% |7.47% |16.3% |
+ |GPT-Neo 2.7B |6.41% |11.27% |21.37% |
+ |GPT-J 6B |11.62% |15.74% |27.74% |
+ |||||
+ |TabNine |2.58% |4.35% |7.59% |
+ |||||
+ |CodeParrot 110M |3.80% |6.57% |12.78% |
+ |CodeParrot 1.5B |3.58% |8.03% |14.96% |
+ |||||
+ |Codex 12M |2.00% |3.62% |8.58% |
+ |Codex 25M |3.21% |7.1% |12.89% |
+ |Codex 42M |5.06% |8.8% |15.55% |
+ |Codex 85M |8.22% |12.81% |22.4% |
+ |Codex 300M |13.17% |20.37% |36.27% |
+ |Codex 679M |16.22% |25.7% |40.95% |
+ |Codex 2.5B |21.36% |35.42% |59.5% |
+ |Codex 12B |28.81% |46.81% |72.31% |
+ |||||
+ |Pretrained Decoder-only 13M (AlphaCode) |1.5% |3.6% |8.6% |
+ |Pretrained Decoder-only 29M (AlphaCode) |3.4% |5.8% |11.2% |
+ |Pretrained Decoder-only 55M (AlphaCode) |4.2% |8.2% |16.9% |
+ |Pretrained Decoder-only 89M (AlphaCode) |4.3% |12.2% |20.0% |
+ |Pretrained Decoder-only 302M (AlphaCode) |11.6% |18.8% |31.8% |
+ |Pretrained Decoder-only 685M (AlphaCode) |14.2% |24.4% |38.8% |
+ |Pretrained Decoder-only 1.1B (AlphaCode) |17.1% |28.2% |45.3% |
+ |||||
+ |PolyCoder 160M |2.13% |3.35% |4.88% |
+ |PolyCoder 400M |2.96% |5.29% |11.59% |
+ |PolyCoder 2.7B |5.59% |9.84% |17.68% |
+
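The README does not say how the Pass@k columns were computed. HumanEval results are commonly reported with the unbiased pass@k estimator introduced in the Codex paper, so the sketch below shows that estimator for reference. Treat it as an assumption about the evaluation, not a description of it: the function name `pass_at_k` and the sample counts in the usage lines are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: sampling budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 200 samples drawn, 10 of them correct.
p1 = pass_at_k(200, 10, 1)
p10 = pass_at_k(200, 10, 10)
print(f"pass@1  = {p1:.4f}")
print(f"pass@10 = {p10:.4f}")
```

The benchmark score is the mean of this per-problem estimate over all HumanEval problems. For k = 1 it reduces to the plain fraction of correct samples, c / n.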
+ ## Reference
+ If you use the models, please cite the following paper:
+
+ ```
+ @inproceedings{CERT,
+ title={{CERT}: Continual Pre-training on Sketches for Library-oriented Code Generation},
+ author={Zan, Daoguang and Chen, Bei and Yang, Dejian and Lin, Zeqi and Kim, Minsu and Guan, Bei and Wang, Yongji and Chen, Weizhu and Lou, Jian-Guang},
+ booktitle={The 2022 International Joint Conference on Artificial Intelligence},
+ year={2022}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "_name_or_path": "//amlt0a3b8c9fa72c7a7e36e6cd517fb7abe6/data/pycode_func_0214_17M_codepy-110M/model",
+   "activation_function": "gelu_new",
+   "architectures": [
+     "GPTNeoForCausalLM"
+   ],
+   "attention_dropout": 0,
+   "attention_layers": [
+     "global",
+     "local",
+     "global",
+     "local",
+     "global",
+     "local",
+     "global",
+     "local",
+     "global",
+     "local",
+     "global",
+     "local"
+   ],
+   "attention_types": [
+     [
+       [
+         "global",
+         "local"
+       ],
+       6
+     ]
+   ],
+   "bos_token_id": 1,
+   "embed_dropout": 0,
+   "eos_token_id": 0,
+   "gradient_checkpointing": false,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "layer_norm_epsilon": 1e-05,
+   "max_position_embeddings": 2048,
+   "model_type": "gpt_neo",
+   "num_heads": 12,
+   "num_layers": 12,
+   "resid_dropout": 0,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.12.5",
+   "use_cache": true,
+   "vocab_size": 32000,
+   "window_size": 256
+ }
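The config stores the attention pattern twice: compactly in `attention_types` and fully expanded in `attention_layers`. The following sketch shows how the compact form appears to map onto the expanded one. The helper name `expand_attention_types` is hypothetical and the expansion rule (repeat each pattern the given number of times, in order) is inferred from the two fields above, not taken from the commit.

```python
def expand_attention_types(attention_types):
    """Expand [[pattern, repeats], ...] into one attention kind per layer."""
    layers = []
    for pattern, repeats in attention_types:
        for _ in range(repeats):
            # Each repeat appends the whole pattern, e.g. "global" then "local".
            layers.extend(pattern)
    return layers


# Values copied from the config.json in this commit.
attention_types = [[["global", "local"], 6]]
num_layers = 12

attention_layers = expand_attention_types(attention_types)
print(attention_layers)

# The expanded list should provide exactly one entry per transformer layer.
assert len(attention_layers) == num_layers
```

With these values, even-indexed layers use global attention and odd-indexed layers use local attention, which is presumably where the `window_size` of 256 applies.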
gitattributes.txt ADDED
@@ -0,0 +1,34 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
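Each rule above routes matching paths through Git LFS. As a rough illustration of which files the rules would catch, the sketch below checks the seven uploaded filenames against a subset of the patterns. It uses Python's `fnmatch` as an approximation of Git's attribute pattern matching, which is an assumption (Git's own rules differ in some details), and the helper name `is_lfs_tracked` is hypothetical. Note also that Git reads these rules only from a file named `.gitattributes`, so under the uploaded name `gitattributes.txt` they would presumably have no effect.

```python
from fnmatch import fnmatch

# A subset of the patterns from the gitattributes.txt diff above.
LFS_PATTERNS = [
    "*.bin", "*.h5", "*.msgpack", "*.onnx", "*.pt", "*.pth",
    "*.safetensors", "saved_model/**/*", "*tfevents*",
]

# The seven files listed under "Files changed" in this commit.
UPLOADED = [
    "README.md", "config.json", "gitattributes.txt", "merges.txt",
    "special_tokens_map.json", "tokenizer.json", "vocab.json",
]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate check: does the path, or its basename, match any pattern?"""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(path, p) or fnmatch(name, p) for p in patterns)


tracked = [f for f in UPLOADED if is_lfs_tracked(f)]
print("LFS-tracked uploads:", tracked)
print("pytorch_model.bin tracked:", is_lfs_tracked("pytorch_model.bin"))
```

Under this approximation none of the uploaded text files match, while a weights file such as `pytorch_model.bin` (a hypothetical filename, not part of this commit) would.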
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<|beginoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|unkoftext|>", "pad_token": "<|padoftext|>"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff