<div align="center">
  <img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
</div>
<hr>

# Kwaipilot KwaiCoder-DS-V2-Lite-Base

## 1.Model Details

**Introduction**

KwaiCoder-DS-V2-Lite-Base is built on DeepSeek-V2-Lite-Base, which has 16B total parameters and 2.4B activated parameters. It supports both English and Chinese and underwent continued pretraining on 800B tokens of high-quality code, math, and Chinese-English text data. The training data consists of 70% code, 20% math, and 10% text (including a large amount of code-related text). The resulting base model reaches state-of-the-art (SOTA) results on multiple benchmarks.
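As a rough illustration of the data mix described above, the split can be turned into approximate token counts. This is a minimal sketch that assumes the percentages are measured in tokens (the text does not say whether they refer to tokens, documents, or bytes); the variable names are illustrative only.

```python
# Continued-pretraining budget from the introduction, in billions of tokens.
TOTAL_TOKENS_B = 800

# Stated data mix; assumed here to be proportions of tokens.
MIX_PERCENT = {"code": 70, "math": 20, "text": 10}

# Integer arithmetic avoids floating-point rounding in the split.
tokens_b = {name: TOTAL_TOKENS_B * pct // 100 for name, pct in MIX_PERCENT.items()}

for name, count in tokens_b.items():
    print(f"{name}: ~{count}B tokens")
```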

**Performance**

| Model | Size | HumanEval | HumanEval+ | MBPP | MBPP+ | BigCodeBench (Full) | BigCodeBench (Hard) | MATH | GSM8K |
|-------|------|-----------|------------|------|-------|---------------------|---------------------|------|-------|
| Qwen2.5-Coder | 1.5B | 43.9 | 36.6 | 69.2 | 58.6 | 34.6 | 9.5 | 30.9 | 65.8 |
| CodeGemma | 2B | 31.1 | 16.5 | 51.1 | 43.1 | 23.9 | 7.4 | - | - |
| CodeLlama | 7B | 33.5 | 26.2 | 55.3 | 46.8 | 28.7 | 5.4 | 12.1 | 31.2 |
| Qwen2.5-Coder | 7B | 46.3 | 37.8 | 66.2 | 53.1 | 38.4 | 12.2 | **46.6** | **83.9** |
| OpenCoder | 8B | 66.5 | 63.4 | 79.9 | **70.4** | 40.5 | 9.5 | - | - |
| Yi-Coder | 9B | 53.7 | 46.3 | 48.4 | 40.7 | 42.9 | 14.2 | - | - |
| StarCoder2 | 15B | 46.3 | 37.8 | 66.2 | 53.1 | 38.4 | 12.2 | 10.3 | 23.4 |
| DeepSeek-Coder-V2-Lite | 16B | 40.9 | 34.1 | 71.9 | 59.4 | 30.6 | 8.1 | 39.0 | 67.1 |
| **KwaiCoder-DS-V2-Lite** | 16B | **75.0** | **68.9** | **81.2** | 67.7 | **49.4** | **18.2** | 40.48 | 81.5 |
| CodeLlama | 34B | 51.8 | 43.9 | 69.3 | 56.3 | 45.3 | 16.2 | 21.2 | 58.2 |

KwaiCoder-DS-V2-Lite-Base achieves Pass@1 scores of 75.0% and 68.9% on HumanEval and HumanEval+, respectively. Compared with DeepSeek-V2-Lite-Base at the same parameter scale, these are relative improvements of 83.37% and 102.05%. It also surpasses the previous best base model (OpenCoder-8B), reaching state-of-the-art (SOTA) results.
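The improvement figures quoted above appear to be relative gains over the baseline score rather than percentage-point differences. The sketch below reproduces them from the 16B rows of the table; `relative_improvement` is an illustrative helper, not part of any released tooling.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the gain of `new` over `baseline` as a percentage of `baseline`."""
    return (new - baseline) / baseline * 100.0

# (KwaiCoder score, 16B DeepSeek baseline score) taken from the table above.
scores = {
    "HumanEval": (75.0, 40.9),
    "HumanEval+": (68.9, 34.1),
}

for name, (kwai, baseline) in scores.items():
    print(f"{name}: {relative_improvement(kwai, baseline):.2f}%")
```

The same helper applies to the MATH and GSM8K comparisons reported later in this section.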

On MBPP and MBPP+, KwaiCoder-DS-V2-Lite-Base outperforms DeepSeek-V2-Lite-Base at the same parameter scale. With only 2.4B activated parameters, it also achieves an average improvement of nearly 5 percentage points over the 7B Qwen2.5-Coder.

On the full BigCodeBench-Complete set (Full), KwaiCoder-DS-V2-Lite-Base achieves a 6% improvement over DeepSeek-Coder-33B, reaching state-of-the-art (SOTA) results. On the Hard subset, it also significantly outperforms the 70B CodeLlama model and the 7B Qwen2.5-Coder model.

In terms of mathematical capability, KwaiCoder-DS-V2-Lite-Base, with only 2.4B activated parameters, surpasses DeepSeek-V2-Lite-Base at the same parameter scale on MATH and GSM8K (relative improvements of 3.79% and 21.46%, respectively) and outperforms the larger CodeLlama-34B (relative improvements of 90.95% and 40.03%, respectively). Although it does not yet exceed Qwen2.5-Coder-7B, it already surpasses Qwen2.5-Coder-3B, which has more activated parameters, reaching state-of-the-art (SOTA) results for its parameter scale.

## 2.Usage

**Code Completion**
```python
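from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Kwaipilot/KwaiCoder-DS-V2-Lite-Base"
# trust_remote_code=True allows loading custom model code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# A base model continues the prompt, so the task is phrased as a code comment.
text = "#write a quick sort algorithm"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
# Decode only the newly generated tokens, skipping the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))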
```
**Code Insertion**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "Kwaipilot/KwaiCoder-DS-V2-Lite-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id,trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16,trust_remote_code=True)
text = """<｜fim▁begin｜>def find_longest_substring(s):
    seen = {}
    max_length = 0
    start = 0
<｜fim▁hole｜>
        if char in seen and seen[char] >= start:
            start = seen[char] + 1
        seen[char] = end
        max_length = max(max_length, end - start + 1)
    return max_length<｜fim▁end｜>"""
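inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
# Decode only the generated tokens; slicing the decoded string by len(text) can
# misalign if the FIM markers are dropped as special tokens during decoding.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))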
```
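The insertion prompt above follows a prefix/hole/suffix layout using the three marker tokens shown in the example. A minimal sketch of a helper that assembles such a prompt from the code before and after the gap is given below; `build_fim_prompt` and the constant names are hypothetical and not part of the model's API.

```python
# Marker tokens copied from the Code Insertion example above.
FIM_BEGIN = "<｜fim▁begin｜>"
FIM_HOLE = "<｜fim▁hole｜>"
FIM_END = "<｜fim▁end｜>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the gap in the fill-in-the-middle markers."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"


prefix = "def add(a, b):\n"
suffix = "\n    return result"
prompt = build_fim_prompt(prefix, suffix)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `text` in the example above; the model is then expected to generate the code that belongs at the hole.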

## 3.License
This code repository is licensed under the MIT License. The use of KwaiCoder-DS-V2-Lite-Base models is subject to the Model License. 

## 4.BibTeX
```BibTex
@misc{kwaicoder,
  title = {KwaiCoder: Code mathematical abilities comprehensive improvement.},
  author = {Kwaipilot team},
  year = {2024},
}
```