---
library_name: transformers
tags: []
---


# Lugha-Llama/Lugha-Llama-8B-wura_math

<!-- Provide a quick summary of what the model is/does. -->

Lugha-Llama is an Africa-centric language model developed through continual pretraining on the [WURA dataset](https://huggingface.co/datasets/castorini/wura), a large corpus of African-language text comprising sixteen low-resource African languages and four high-resource languages commonly spoken on the African continent.

To train the model, we sample as uniformly as possible across languages while limiting how often data is repeated, upsampling rare languages by at most four epochs (a sketch of this sampling scheme is given below).
We combine [WURA data](https://huggingface.co/datasets/castorini/wura) with high-quality English documents from [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math), which results in the improved Lugha-Llama-Edu and Lugha-Llama-Maths models respectively.
Our models consistently achieve the best performance among similarly sized baselines on the AfriMMLU, AfriMGSM, and AfriXNLI tasks in IrokoBench.
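
As a rough illustration of the sampling scheme described above, the sketch below computes per-language mixture weights that target a uniform share of the token budget while capping any language's repetition at four epochs. The function, token counts, and budget are illustrative assumptions, not the actual WURA statistics or training recipe.

```python
# Illustrative sketch only: sample languages as uniformly as possible while
# capping repetition (upsampling) of any language's data at `max_epochs` passes.
def mixture_weights(tokens_per_lang: dict[str, int],
                    total_budget: int,
                    max_epochs: int = 4) -> dict[str, float]:
    """Return per-language sampling proportions for the training mixture."""
    uniform_share = total_budget / len(tokens_per_lang)
    # Each language contributes at most `max_epochs` passes over its own data.
    allocation = {
        lang: min(uniform_share, max_epochs * n_tokens)
        for lang, n_tokens in tokens_per_lang.items()
    }
    allocated = sum(allocation.values())
    return {lang: alloc / allocated for lang, alloc in allocation.items()}


if __name__ == "__main__":
    # Placeholder token counts, not the real WURA statistics.
    counts = {"swa": 2_000_000_000, "hau": 500_000_000, "lin": 50_000_000}
    print(mixture_weights(counts, total_budget=3_000_000_000))
```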

In a separate ablation experiment, we translate English education documents into Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or to its English source language. The translated data is available as [FineWeb_Edu-swahili-translated](https://huggingface.co/datasets/princeton-nlp/fineweb_edu-swahili-translated).


We present these findings in our paper (coming soon).
<!--[Lugha-Llama:Adapting Large Language Models for African Languages]()-->

Authors: [Happy Buzaaba](https://buzaabah.github.io/)\*, [Alexander Wettig](https://www.cs.princeton.edu/~awettig/)\*, [David Ifeoluwa Adelani](https://dadelani.github.io/), [Christiane Fellbaum](https://www.cs.princeton.edu/people/profile/fellbaum) (* equal contribution)

Contact `{happy.buzaaba@, awettig@cs}princeton.edu`




## Lugha-Llama models

* [Lugha-Llama/Lugha-Llama-8B-wura](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura)
* [Lugha-Llama/Lugha-Llama-8B-wura_edu](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura_edu)
* [Lugha-Llama/Lugha-Llama-8B-wura_math](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura_math)
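
The snippet below is a minimal usage sketch, assuming the checkpoints load as standard causal language models with `transformers`; the prompt and generation settings are illustrative only, not the authors' recommended setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Lugha-Llama/Lugha-Llama-8B-wura_math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # place weights on available GPU(s), else CPU
)

# Illustrative Swahili prompt; the models are base (non-instruct) LMs,
# so plain continuation-style prompting is assumed here.
prompt = "Swali: 12 + 35 = "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```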

Our main results:

![main_result.png](https://cdn-uploads.huggingface.co/production/uploads/62a8501e1c396da5716cfca2/MZw0c4TAnRPYNVdhru7Uo.png)

<p align="center">
<em>Performance of Lugha-Llama models and baselines on <a href="https://arxiv.org/abs/2406.03368">IrokoBench</a>. Languages in italics
are not present in the continual pre-training data. †: We exclude the high-resource languages English (eng) and French (fra) from the average, 
as they would otherwise skew the results due to strong English base models.</em>
</p>