Update README.md
README.md
CHANGED
@@ -23,7 +23,7 @@ This model is a continually pretrained version of the [meta-llama/Llama-3.2-3B](
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
 | Llama 3.2 (text only) | Hishab curated Bangla text corpus | 3B(3.21B) | Monolingual Text(Bangla) | Monolingual Text(Bangla) | 4096 | Yes | Yes | 37B tokens | |
 
-**Supported Languages:** Bengali(primary) and English(secondary)
+**Supported Languages:** Bengali (primary) and English (secondary)
 
 **Llama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
 
@@ -31,7 +31,7 @@ This model is a continually pretrained version of the [meta-llama/Llama-3.2-3B](
 
 **Status:** This is a static model trained on an offline dataset. Future versions may be released to improve model capabilities.
 
-**License:** We are using a similar license
+**License:** We are using a similar license to Llama 3.2. Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).
 
 
 ## How to use
@@ -64,7 +64,7 @@ pipe("আমাদের দেশের নাম")
 
 ## Training Data
 
-**Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribe text, code-mixed text, conversations, and open sources raw data. The dataset is cleaned and filtered by different filtering criteria to ensure the quality of the data. Our collected data size roughly around 268 GB.
+**Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribe text, code-mixed text, conversations, and open sources raw data. The dataset is cleaned and filtered by different filtering criteria to ensure the quality of the data. Our collected data size is roughly around 268 GB. The total trained tokens are 37B tokens.
 
 Data sources summary:
 - Web documents: Extracted, clean, and filtered common crawl data
@@ -77,7 +77,7 @@ Data sources summary:
 - Others: We scraped some selected website data, used open-source data, and used some other data sources
 
 ## Token Extending
-We trained a separate Bangla tokenizer using [Tiktoken](https://github.com/openai/tiktoken) library on 48 GB Bangla datasets (sampled from main pretraining data) with a vocabulary size 48k and separated 42k tokens for adding with the pretrained model. We extended the model's vocabulary with these tokens and continued the pretraining process on Bangla data. The token-extending process was done to enhance the model's ability to generate high-quality Bangla text. Our updated vocab size is 170K whereas the original llama-3.2 vocab size is 128k.
+We trained a separate Bangla tokenizer using [Tiktoken](https://github.com/openai/tiktoken) library on 48 GB Bangla datasets (sampled from main pretraining data) with a vocabulary size of 48k and separated 42k tokens for adding with the pretrained model. We extended the model's vocabulary with these tokens and continued the pretraining process on Bangla data. The token-extending process was done to enhance the model's ability to generate high-quality Bangla text. Our updated vocab size is 170K whereas the original llama-3.2 vocab size is 128k.
 
 
 ## Benchmarks \- Bangla Text
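
Note on the Token Extending paragraph touched by this diff: the README's rounded figures (128k original tokens plus 42k added tokens giving 170K) describe appending new tokens to an existing vocabulary. The following is a minimal sketch of that idea, not the authors' actual procedure. The function name `extend_vocab`, the toy tokens, and the assumption that base ids are contiguous from 0 are all illustrative and do not come from the README.

```python
def extend_vocab(base_vocab, candidate_tokens):
    """Append candidate tokens that the base vocabulary lacks.

    Assumes base ids are contiguous (0 .. len(base_vocab) - 1), so new
    tokens receive ids starting at len(base_vocab) and every existing
    token keeps its original id.
    """
    extended = dict(base_vocab)
    next_id = len(extended)
    for token in candidate_tokens:
        if token not in extended:  # skip overlaps and repeated candidates
            extended[token] = next_id
            next_id += 1
    return extended


# Toy stand-ins (hypothetical), not the real Llama 3.2 or Bangla vocabularies.
base_vocab = {"the": 0, "model": 1, "দেশ": 2}
bangla_tokens = ["দেশ", "আমাদের", "নাম", "আমাদের"]

extended_vocab = extend_vocab(base_vocab, bangla_tokens)
added = len(extended_vocab) - len(base_vocab)
print(added, len(extended_vocab))  # prints "2 5"
```

Keeping existing ids fixed means the pretrained embedding rows stay aligned with their tokens, and only the rows for the appended ids need to be learned during continued pretraining.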