---
language:
- vi
pipeline_tag: text-generation
Trained: Pre-train
Config file: 2.7B
---

# Model Card for GPT-NeoX-2.7B-Vietnamese-pretrained

This model is pretrained on Vietnamese text and is based on GPT-NeoX, a large language model developed by EleutherAI.

## Model Details

### Training Data

- **Pre-train:**
  CulturaX Vietnamese dataset (450 GB) + AI-Hub Vietnamese dataset (1.3 GB) + crawled Vietnamese Wikipedia dataset (630 MB) + viwik18 dataset (1.27 GB)

### Training Hardware

Trained on an A100 40GB GPU with a 48-core CPU. Training took about 17 hours to reach 80,000 steps.

### Hyperparameters

<figure style="width:30em">

| Hyperparameter         | Value       |
| ---------------------- | ----------- |
| n<sub>parameters</sub> | 2670182400  |
| n<sub>layers</sub>     | 32          |
| d<sub>model</sub>      | 2560        |
| n<sub>heads</sub>      | 32          |
| d<sub>head</sub>       | 128         |
| n<sub>vocab</sub>      | 60000       |
| Sequence Length        | 2048        |
| Learning Rate          | 0.00016     |
| Positional Encoding    | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |

</figure>
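
For reference, the hyperparameters above map roughly onto the fields of the Hugging Face `GPTNeoXConfig` class, as sketched below. This is an illustration derived only from the table, not the model's released config file, which may set additional fields:

```python
from transformers import GPTNeoXConfig

# Illustrative sketch only: values are copied from the hyperparameter
# table above; the released config may differ in fields not listed there.
config = GPTNeoXConfig(
    vocab_size=60000,              # n_vocab
    hidden_size=2560,              # d_model
    num_hidden_layers=32,          # n_layers
    num_attention_heads=32,        # n_heads
    max_position_embeddings=2048,  # sequence length
)
```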

### How to use

The model can be loaded with the `AutoModelForCausalLM` class:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("eunyounglee/GPT-NeoX-2.7B-Vietnamese-pretrained")
model = AutoModelForCausalLM.from_pretrained("eunyounglee/GPT-NeoX-2.7B-Vietnamese-pretrained")
```
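
Once loaded, text can be generated with the standard `generate` API. The prompt and sampling settings below are arbitrary illustrations, not recommendations from the model authors:

```python
# Continues from the loading snippet above; prompt and sampling
# parameters are example values, not tuned settings.
prompt = "Việt Nam là"  # example Vietnamese prompt ("Vietnam is")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```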