---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
license: llama3.1
language:
- gl
metrics:
- bleu
- rouge
model-index:
- name: Llama-3.1-8B-Instruct-Galician
  results:
  - task:
      type: text-generation
    dataset:
      name: alpaca_data_galician
      type: alpaca_data_galician
    metrics:
    - name: bleu
      type: bleu-4
      value: 23.13
    - name: rouge
      type: rouge-l
      value: 21.84
pipeline_tag: text-generation
library_name: transformers
widget:
  - text: "Onde está o concello de Frades?"
    output:
      text: Frades é un concello da provincia da Coruña, pertencente á comarca de Ordes. Está situado a 15 quilómetros de Santiago de Compostela.
---
<div align="center">
    <p align="center"><img width=20% src="https://gitlab.irlab.org/eliseo.bao/xovetic-llms-underrepresented-languages/-/raw/main/img/logo.png" /></p>
</div>

# Llama-3.1-8B-Instruct-Galician a.k.a. Cabuxa 2.0

This model is the result of continued pretraining of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on the [CorpusNós](https://zenodo.org/records/11655219) dataset.

## Model Description

- **Developed by:** [UDC Information Retrieval Lab (IRLab)](https://huggingface.co/irlab-udc)
- **Language(s) (NLP):** Multilingual, adapted to Galician
- **License:** llama3.1
- **Finetuned from model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Repository:** [Adapting Large Language Models for Underrepresented Languages](https://gitlab.irlab.org/eliseo.bao/xovetic-llms-underrepresented-languages)
- **Paper:** _Coming soon_

## How to Get Started with the Model

```python
import transformers
import torch

model_id = "irlab-udc/Llama-3.1-8B-Instruct-Galician"

# Load the model in bfloat16 and shard it across the available GPUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a conversational AI that always responds in Galician."},
    {"role": "user", "content": "Cal é a principal vantaxe de usar Scrum?"},
]

outputs = pipeline(messages, max_new_tokens=512)

# The pipeline returns the whole conversation; print the assistant's last turn.
print(outputs[0]["generated_text"][-1]["content"])
```
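
Equivalently, you can manage the tokenizer and model yourself with the `transformers` auto classes. The snippet below is a minimal sketch, assuming the model ships with the base model's chat template; it is not taken from the project repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "irlab-udc/Llama-3.1-8B-Instruct-Galician"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a conversational AI that always responds in Galician."},
    {"role": "user", "content": "Cal é a principal vantaxe de usar Scrum?"},
]

# Render the chat with the model's template, then generate a reply.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```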

## Training Details

#### Training Hyperparameters

| Parameter                     | Value                                |
|--------------------------------|--------------------------------------|
| learning_rate                  | 0.0001                               |
| train_batch_size               | 32                                   |
| eval_batch_size                | 1                                    |
| seed                           | 42                                   |
| distributed_type               | multi-GPU                            |
| num_devices                    | 4                                    |
| gradient_accumulation_steps     | 2                                    |
| total_train_batch_size         | 256                                  |
| total_eval_batch_size          | 4                                    |
| optimizer                      | Adam with betas=(0.9, 0.999), epsilon=1e-08 |
| lr_scheduler_type              | cosine                               |
| lr_scheduler_warmup_ratio      | 0.1                                  |
| num_epochs                     | 1.0                                  |
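
These settings map roughly onto a standard Hugging Face `TrainingArguments` configuration, as in the sketch below. Treat it as an illustration under stated assumptions (the output path is a placeholder and `bf16` is inferred from the bfloat16 inference setup), not the exact script, which lives in the linked repository.

```python
from transformers import TrainingArguments

# Sketch of the hyperparameter table as TrainingArguments.
# With 4 GPUs, the effective train batch size is 32 * 4 * 2 = 256.
training_args = TrainingArguments(
    output_dir="llama-3.1-8b-instruct-galician",  # hypothetical path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    num_train_epochs=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,            # assumption, matching the bfloat16 inference setup
    optim="adamw_torch",  # Adam with betas=(0.9, 0.999), epsilon=1e-8 (the defaults)
)
```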


#### Training Results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 2.0606        | 0.1682 | 900  | 2.0613          |
| 1.9898        | 0.3363 | 1800 | 1.9929          |
| 1.9847        | 0.5045 | 2700 | 1.9613          |
| 1.9577        | 0.6726 | 3600 | 1.9445          |
| 1.9287        | 0.8408 | 4500 | 1.9368          |

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 4x NVIDIA A100 SXM4 80 GB (TDP of 400W)
- **Hours used:** 60
- **Cloud Provider:** Private infrastructure
- **Carbon Emitted:** 10.37 kg CO₂ eq.
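
The reported figure is consistent with the calculator's power × time × carbon-intensity method. The quick check below reproduces the arithmetic; note that the ~0.108 kg CO₂ eq./kWh grid intensity is inferred from the card's total, not stated anywhere.

```python
# Back-of-the-envelope check of the emissions estimate.
num_gpus = 4
tdp_kw = 0.4        # 400 W TDP per NVIDIA A100 SXM4 80 GB
hours = 60
intensity = 0.108   # kg CO2 eq. per kWh, inferred from 10.37 kg / 96 kWh

energy_kwh = num_gpus * tdp_kw * hours    # 96 kWh
emissions = energy_kwh * intensity        # ~10.37 kg CO2 eq.
print(f"{energy_kwh:.0f} kWh -> {emissions:.2f} kg CO2 eq.")
```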

## Citation

```bibtex
@inproceedings{bao-perez-parapar-xovetic-2024,
  title={Adapting Large Language Models for Underrepresented Languages},
  author={Eliseo Bao and Anxo Pérez and Javier Parapar},
  booktitle={VII Congreso XoveTIC: impulsando el talento cient{\'\i}fico},
  year={2024},
  organization={Universidade da Coru{\~n}a, Servizo de Publicaci{\'o}ns},
  abstract={The popularization of Large Language Models (LLMs), especially with the development of conversational systems, makes it mandatory to think about facilitating the use of artificial intelligence (AI) for everyone. Most models neglect minority languages, prioritizing widely spoken ones. This exacerbates their underrepresentation in the digital world and negatively affects their speakers. We present two resources aimed at improving natural language processing (NLP) for Galician: (i) a Llama 3.1 instruct model adapted through continuous pre-training on the CorpusNós dataset; and (ii) a Galician version of the Alpaca dataset, used to assess the improvement over the base model. In this evaluation, our model outperformed both the base model and another Galician model in quantitative and qualitative terms.}
}
```