Commit 2a25271 by fullstack (verified) · 1 Parent(s): 5e16211

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,200 @@
1
+ ## Credits and Acknowledgments
2
+
3
+ TURBOPASTA is built upon the excellent work of [Fast Apply](https://github.com/kortix-ai/fast-apply) by Kortix AI. Our model leverages their dataset and builds on their pioneering approach to code merging and transformation. Key inspirations include:
4
+
5
+ - Dataset structure and generation methodology
6
+ - XML-based prompt engineering approach
7
+ - Evaluation metrics and benchmarking approaches
8
+
9
+ Special thanks to:
10
+ - The Kortix AI team for open-sourcing Fast Apply
11
+ - Their foundational work on high-speed code transformation models
12
+ - The comprehensive dataset they've made available to the community
13
+
14
+ While TURBOPASTA introduces its own innovations, the groundwork laid by Fast Apply was instrumental in making this project possible. We encourage users interested in code transformation models to also check out the original Fast Apply models:
15
+
16
+ - [FastApply-7B-v1.0](https://huggingface.co/Kortix/FastApply-7B-v1.0)
17
+ - [FastApply-1.5B-v1.0](https://huggingface.co/Kortix/FastApply-1.5B-v1.0)
18
+ - [FastApply-dataset-v1.0](https://huggingface.co/datasets/Kortix/FastApply-dataset-v1.0)
19
+
20
+ This project is licensed under Apache-2.0, consistent with Fast Apply's open-source ethos.
21
+
22
+ ---------------------------------------------------------------------------
23
+
24
+ Based on a dataset inspired by https://www.kortix.ai/
25
+
26
+
27
+ # TURBOPASTA LoRA Adapter for Qwen2.5-3B
28
+
29
+ A LoRA adapter for unsloth/Qwen2.5-3B that merges code updates using chain-of-thought reasoning while strictly preserving the structure and formatting of the original code.
30
+
31
+ ## Technical Specifications
32
+
33
+ ### Base Model
34
+ - Model: unsloth/Qwen2.5-3B
35
+ - LoRA Rank: 64
36
+ - Target Modules: v_proj, o_proj, down_proj, up_proj, q_proj, k_proj, gate_proj
37
+ - Task: CAUSAL_LM
38
+ - Dropout: 0
39
+ - Alpha: 32
40
+
41
+ ### Input/Output Format
42
+
43
+ Input XML structure:
44
+ ```xml
45
+ <instruction>You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated. Merge all changes from the snippet into the code. Preserve the code's structure, order, comments, and indentation exactly.</instruction>
46
+
47
+ <fastapply>
48
+ <code>
49
+ {original_code}
50
+ </code>
51
+ <update>
52
+ {update_snippet}
53
+ </update>
54
+ <finalcode>
55
+ {merged_result}
56
+ </finalcode>
57
+ </fastapply>
58
+ ```
59
+
60
+ The model supports multiple `<fastapply>` blocks for few-shot in-context learning. Set `</fastapply>` as the stop token so that generation ends after the merged result.
61
+
62
+ ## Deployment
63
+
64
+ ### vLLM Server Setup
65
+ ```bash
66
+ export VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
67
+ export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
68
+
69
+ vllm serve unsloth/Qwen2.5-3B \
70
+ --gpu-memory-utilization=1 \
71
+ --port 6002 \
72
+ --served-model-name="turbopasta" \
73
+ --trust-remote-code \
74
+ --max-model-len 8192 \
75
+ --disable-log-requests \
76
+ --enable-lora \
77
+ --lora-modules lora=./dataset/output/turbopasta/lora_model \
78
+ --max-lora-rank 64
79
+ ```
80
+
81
+ ### Client Implementation
82
+
83
+ ```python
84
+ import requests
85
+
86
+ def merge_code(original_code: str, update_snippet: str, vllm_url: str = "http://localhost:6002/v1/completions") -> dict:
87
+     xml_content = (
88
+         '<instruction>You are a coding assistant that helps merge code updates, ensuring every modification is fully '
89
+         'integrated. Merge all changes from the snippet into the code. Preserve the code\'s structure, order, comments, '
90
+         'and indentation exactly.</instruction>\n'
91
+         '<fastapply>\n'
92
+         ' <code>\n'
93
+         f'{original_code}\n'
94
+         ' </code>\n'
95
+         ' <update>\n'
96
+         f'{update_snippet}\n'
97
+         ' </update>'
98
+     )
99
+
100
+     response = requests.post(
101
+         vllm_url,
102
+         json={
103
+             "prompt": xml_content,
104
+             "max_tokens": 6000,
105
+             "temperature": 0.1,
106
+             "model": "lora",  # adapter name registered via --lora-modules; "turbopasta" is the base model
107
+             "stop": ["</fastapply>"]
108
+         },
109
+         timeout=300  # requests timeouts are in seconds
110
+     )
111
+     response.raise_for_status()  # fail fast on HTTP errors
112
+     completion = response.json()["choices"][0]["text"]
113
+
114
+     # Parse XML tags
115
+     import re
116
+     def extract_tag(tag: str) -> str:
117
+         match = re.search(f'<{tag}>(.*?)</{tag}>', completion, re.DOTALL)
118
+         return match.group(1).strip() if match else ""
119
+
120
+     return {
121
+         "merged_code": extract_tag("finalcode")
122
+     }
123
+ ```
124
+
125
+ ### Batch Processing
126
+
127
+ The model works with the included data processor for parallel processing of code updates:
128
+
129
+ ```python
130
+ from request_processor import RequestProcessor
131
+
132
+ processor = RequestProcessor(
133
+ input_file="updates.jsonl",
134
+ output_file="merged.jsonl",
135
+ num_threads=24
136
+ )
137
+ processor.process_file()
138
+ ```
139
+
140
+ Input JSONL format:
141
+ ```json
142
+ {
143
+ "id": "update_id",
144
+ "original_code": "...",
145
+ "update_snippet": "...",
146
+ "file_path": "path/to/file"
147
+ }
148
+ ```
149
+
150
+ Output JSONL format:
151
+ ```json
152
+ {
153
+ "id": "update_id",
154
+ "original_code": "...",
155
+ "update_snippet": "...",
156
+ "merged_code": "...",
157
+ "file_path": "path/to/file",
158
+ "processed_at": "2024-10-24 02:52:33"
159
+ }
160
+ ```
161
+
162
+ ## Implementation and Performance Considerations
163
+
164
+ - Uses thread pooling for parallel processing
165
+ - Atomic writes with file locking
166
+ - Progress tracking with tqdm
167
+ - Automatic error handling and logging
168
+ - Configurable thread count for optimization
169
+ - Temperature set to 0.1 for consistent merges
170
+
171
+ ## Error Handling
172
+
173
+ Errors are captured in the output JSONL:
174
+ ```json
175
+ {
176
+ "error": "error message",
177
+ "processed_at": "timestamp"
178
+ }
179
+ ```
180
+
181
+ Monitor errors in real time:
182
+ ```bash
183
+ tail -f merged.jsonl | grep error
184
+ ```
185
+
186
+ ## Model Training Details
187
+
188
+ This model was trained using Force Multiplier's autotuning pipeline with the following key characteristics:
189
+
190
+ - Base Model: unsloth/Qwen2.5-3B
191
+ - Training Type: Few-shot learning with chain-of-thought reasoning
192
+ - Special Focus: Code structure preservation and merge accuracy
193
+ - LoRA Configuration: Optimized for code understanding and generation
194
+
195
+ ## Limitations
196
+
197
+ - Maximum context length of 8192 tokens
198
+ - Best suited for single-file code changes
199
+ - May require multiple passes for complex refactoring
200
+ - Not recommended for binary file merges
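
The README above says the model accepts multiple `<fastapply>` blocks for few-shot context. The following is a minimal sketch of how such a prompt could be assembled, assuming the tag layout from the "Input/Output Format" section. The helper names `fastapply_block` and `build_few_shot_prompt` are hypothetical and do not appear in the repository; the exact whitespace the model was trained on is also an assumption.

```python
from typing import List, Optional, Tuple

# Instruction text copied from the README's input format.
INSTRUCTION = (
    "You are a coding assistant that helps merge code updates, ensuring every "
    "modification is fully integrated. Merge all changes from the snippet into "
    "the code. Preserve the code's structure, order, comments, and indentation exactly."
)


def fastapply_block(code: str, update: str, final: Optional[str] = None) -> str:
    """Render one <fastapply> block; leave it open when no merged result is given."""
    parts = ["<fastapply>", "<code>", code, "</code>", "<update>", update, "</update>"]
    if final is not None:
        # A solved example: include the merged result and close the block.
        parts += ["<finalcode>", final, "</finalcode>", "</fastapply>"]
    return "\n".join(parts)


def build_few_shot_prompt(
    examples: List[Tuple[str, str, str]],
    original_code: str,
    update_snippet: str,
) -> str:
    """Solved examples first, then the open block the model should complete."""
    blocks = [fastapply_block(code, update, final) for code, update, final in examples]
    blocks.append(fastapply_block(original_code, update_snippet))
    return f"<instruction>{INSTRUCTION}</instruction>\n\n" + "\n".join(blocks)


prompt = build_few_shot_prompt(
    examples=[("x = 1", "x = 2", "x = 2")],
    original_code="y = 1",
    update_snippet="y = 3",
)
```

Because the final block is left open after `</update>`, the model is expected to continue with `<finalcode>` and stop at the `</fastapply>` stop token, as in the single-shot client above.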
adapter_config.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "unsloth/Qwen2.5-3B",
5
+ "bias": "none",
6
+ "fan_in_fan_out": false,
7
+ "inference_mode": true,
8
+ "init_lora_weights": true,
9
+ "layer_replication": null,
10
+ "layers_pattern": null,
11
+ "layers_to_transform": null,
12
+ "loftq_config": {},
13
+ "lora_alpha": 32,
14
+ "lora_dropout": 0,
15
+ "megatron_config": null,
16
+ "megatron_core": "megatron.core",
17
+ "modules_to_save": null,
18
+ "peft_type": "LORA",
19
+ "r": 64,
20
+ "rank_pattern": {},
21
+ "revision": null,
22
+ "target_modules": [
23
+ "v_proj",
24
+ "o_proj",
25
+ "down_proj",
26
+ "up_proj",
27
+ "q_proj",
28
+ "k_proj",
29
+ "gate_proj"
30
+ ],
31
+ "task_type": "CAUSAL_LM",
32
+ "use_dora": false,
33
+ "use_rslora": false
34
+ }
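
As a reading aid for the config above: in standard LoRA the low-rank update is multiplied by `lora_alpha / r` before being added to the frozen weight, and by `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled. The sketch below computes that factor from the three relevant fields; the function name `lora_scaling` is hypothetical, and the JSON string is a hand-copied subset of the file rather than a read from disk.

```python
import json
import math

# Subset of the adapter_config.json fields shown above (copied by hand).
config = json.loads('{"lora_alpha": 32, "r": 64, "use_rslora": false}')


def lora_scaling(cfg: dict) -> float:
    """Scale applied to the low-rank update before it is added to the base weight."""
    if cfg.get("use_rslora", False):
        # Rank-stabilized LoRA divides by sqrt(r) instead of r.
        return cfg["lora_alpha"] / math.sqrt(cfg["r"])
    return cfg["lora_alpha"] / cfg["r"]


scaling = lora_scaling(config)
print(f"LoRA scaling factor: {scaling}")
```

With `lora_alpha` at half the rank, the adapter's contribution is scaled down relative to the common `alpha = r` setting.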
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b69d737bfcf70429b6aa88aed71703f4beb07daf8a51a6088706c17b14f2afc3
3
+ size 479005064
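
The file size above can be sanity-checked against the adapter config. The sketch below counts the LoRA parameters for rank 64 on the seven target modules. It assumes the commonly published Qwen2.5-3B dimensions (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128); none of these values appear in this commit, so treat them as assumptions.

```python
# Assumed Qwen2.5-3B dimensions (from the base model's published config, not this commit).
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
KV_DIM = 2 * 128  # 2 key/value heads x head_dim 128
RANK = 64         # "r" in adapter_config.json

# (in_features, out_features) for each target module in one transformer layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total_params = per_layer * NUM_LAYERS
fp32_bytes = total_params * 4                 # 4 bytes per float32 weight
header_overhead = 479005064 - fp32_bytes      # LFS pointer size minus raw weights

print(f"{total_params:,} LoRA parameters, {fp32_bytes:,} bytes in float32")
print(f"{header_overhead:,} bytes left over for the safetensors header")
```

If the assumed dimensions are right, the reported size is consistent with the adapter weights being stored in float32, with the small remainder plausibly being file metadata.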
added_tokens.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<|PAD_TOKEN|>": 151665,
5
+ "<|box_end|>": 151649,
6
+ "<|box_start|>": 151648,
7
+ "<|endoftext|>": 151643,
8
+ "<|file_sep|>": 151664,
9
+ "<|fim_middle|>": 151660,
10
+ "<|fim_pad|>": 151662,
11
+ "<|fim_prefix|>": 151659,
12
+ "<|fim_suffix|>": 151661,
13
+ "<|im_end|>": 151645,
14
+ "<|im_start|>": 151644,
15
+ "<|image_pad|>": 151655,
16
+ "<|object_ref_end|>": 151647,
17
+ "<|object_ref_start|>": 151646,
18
+ "<|quad_end|>": 151651,
19
+ "<|quad_start|>": 151650,
20
+ "<|repo_name|>": 151663,
21
+ "<|video_pad|>": 151656,
22
+ "<|vision_end|>": 151653,
23
+ "<|vision_pad|>": 151654,
24
+ "<|vision_start|>": 151652
25
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|PAD_TOKEN|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fab42efe8d17406525a9154b728cf9e957629a8ed7ce997770efdd71128c6a1a
3
+ size 11422086
tokenizer_config.json ADDED
@@ -0,0 +1,215 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<|PAD_TOKEN|>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ }
189
+ },
190
+ "additional_special_tokens": [
191
+ "<|im_start|>",
192
+ "<|im_end|>",
193
+ "<|object_ref_start|>",
194
+ "<|object_ref_end|>",
195
+ "<|box_start|>",
196
+ "<|box_end|>",
197
+ "<|quad_start|>",
198
+ "<|quad_end|>",
199
+ "<|vision_start|>",
200
+ "<|vision_end|>",
201
+ "<|vision_pad|>",
202
+ "<|image_pad|>",
203
+ "<|video_pad|>"
204
+ ],
205
+ "bos_token": null,
206
+ "clean_up_tokenization_spaces": false,
207
+ "eos_token": "<|endoftext|>",
208
+ "errors": "replace",
209
+ "model_max_length": 32768,
210
+ "pad_token": "<|PAD_TOKEN|>",
211
+ "padding_side": "right",
212
+ "split_special_tokens": false,
213
+ "tokenizer_class": "Qwen2Tokenizer",
214
+ "unk_token": null
215
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff