Update README.md
README.md CHANGED
@@ -1,11 +1,4 @@
-
-- [Model download](#download)
-- [Run the model](#inference)
-- [Fine-tuning the model](#finetuning)
-- [Limitations](#limitations)
-- [License](https://github.com/VinAIResearch/PhoGPT/blob/main/LICENSE)
-
-# PhoGPT: Generative Pre-training for Vietnamese <a name="introduction"></a>
+# PhoGPT: Generative Pre-training for Vietnamese

We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192 context length and a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We demonstrate its strong performance compared to previous closed-source and open-source 7B-parameter models. More details about the general architecture and experimental results of PhoGPT can be found in our [technical report](https://arxiv.org/abs/2311.02945):
@@ -22,84 +15,4 @@ year = {2023}
**Please CITE** our technical report when PhoGPT is used to help produce published results or is incorporated into other software.
-
-## Model download <a name="download"></a>
-
-Model | Type | Model Size | Context length | Vocab size | Training data size | Note
----|---|---|---|---|---|---
-[`vinai/PhoGPT-4B`](https://huggingface.co/vinai/PhoGPT-4B) | Base | 3.7B | 8192 | 20480 | 484GB |
-[`vinai/PhoGPT-4B-Chat`](https://huggingface.co/vinai/PhoGPT-4B-Chat) | Instruction following & Chat | 3.7B | 8192 | 20480 | 70K instructional prompt and response pairs & 290K conversations | `PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"`
-`vinai/PhoGPT-7B5` | Base | 7.5B | 2048 | 250K | 41GB |
-`vinai/PhoGPT-7B5-Instruct` | Instruction following | 7.5B | 2048 | 250K | 150K instructional prompt and response pairs | `PROMPT_TEMPLATE = "### Câu hỏi:\n{instruction}\n\n### Trả lời:"`
-
-## Run the model <a name="inference"></a>
-
-### with pure `transformers`
-
-```python
-import torch
-from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
-
-model_path = "vinai/PhoGPT-4B-Chat"
-
-config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
-config.init_device = "cuda"
-
-model = AutoModelForCausalLM.from_pretrained(
-    model_path, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True
-)
-# If your GPU does not support bfloat16:
-# model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
-model.eval()
-
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-
-PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"
-
-# Some instruction examples
-# instruction = "Viết bài văn nghị luận xã hội về {topic}"
-# instruction = "Viết bản mô tả công việc cho vị trí {job_title}"
-# instruction = "Sửa lỗi chính tả:\n{sentence_or_paragraph}"
-# instruction = "Dựa vào văn bản sau đây:\n{text}\nHãy trả lời câu hỏi: {question}"
-# instruction = "Tóm tắt văn bản:\n{text}"
-
-instruction = "Viết bài văn nghị luận xã hội về an toàn giao thông"
-# instruction = "Sửa lỗi chính tả:\nTriệt phá băng nhóm kướp ô tô, sử dụng \"vũ khí nóng\""
-
-input_prompt = PROMPT_TEMPLATE.format_map(
-    {"instruction": instruction}
-)
-
-input_ids = tokenizer(input_prompt, return_tensors="pt")
-
-outputs = model.generate(
-    inputs=input_ids["input_ids"].to("cuda"),
-    attention_mask=input_ids["attention_mask"].to("cuda"),
-    do_sample=True,
-    temperature=1.0,
-    top_k=50,
-    top_p=0.9,
-    max_new_tokens=1024,
-    eos_token_id=tokenizer.eos_token_id,
-    pad_token_id=tokenizer.pad_token_id
-)
-
-response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
-response = response.split("### Trả lời:")[1]
-```
-
-### with vLLM, Text Generation Inference & llama.cpp
-
-PhoGPT can be run with inference engines such as [vLLM](https://github.com/vllm-project/vllm) and [Text Generation Inference](https://github.com/huggingface/text-generation-inference). Users can also employ [llama.cpp](https://github.com/ggerganov/llama.cpp) to run PhoGPT, as it belongs to the MPT model family, which llama.cpp supports.
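For instance, a minimal offline-generation sketch with vLLM could look like the following. This is only an illustrative assumption: it presumes vLLM's MPT-family support can load this checkpoint with `trust_remote_code=True`, and it simply reuses the prompt template and sampling settings from the `transformers` example above.

```python
# Illustrative sketch: offline generation with vLLM.
# Assumption: vLLM's MPT-family support can load vinai/PhoGPT-4B-Chat with trust_remote_code=True.
from vllm import LLM, SamplingParams

# Same prompt template as listed for the chat model above.
PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"

llm = LLM(model="vinai/PhoGPT-4B-Chat", trust_remote_code=True)
sampling_params = SamplingParams(temperature=1.0, top_k=50, top_p=0.9, max_tokens=1024)

prompt = PROMPT_TEMPLATE.format(instruction="Viết bài văn nghị luận xã hội về an toàn giao thông")
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

Whichever engine is used, the prompt template listed in the model table above should still be applied to the input.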
-
-## Fine-tuning the model <a name="finetuning"></a>
-
-See the [llm-foundry docs](https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/README.md#llmfinetuning) for more details. To fully fine-tune `vinai/PhoGPT-7B5` or `vinai/PhoGPT-7B5-Instruct` on a single A100 GPU with 40GB of memory, it is advisable to use the `decoupled_lionw` optimizer with `device_train_microbatch_size` set to 1. An example fine-tuning YAML configuration is provided in `fine-tuning-phogpt-7b5.yaml`.
-
-## Limitations <a name="limitations"></a>
-
-PhoGPT has certain limitations. For example, it is not good at tasks involving reasoning, coding or mathematics. PhoGPT may generate harmful or biased responses, produce hate speech, or answer unsafe questions. Users should be cautious when interacting with PhoGPT, as it can produce factually incorrect output.
-
-## [License](https://github.com/VinAIResearch/PhoGPT/blob/main/LICENSE)
+
+For further information or requests, please go to [PhoGPT's homepage](https://github.com/VinAIResearch/PhoGPT)!