birgermoell committed · Commit 7db48e1
1 Parent(s): 41cd5ea
Update README.md
README.md CHANGED
@@ -6,10 +6,13 @@ widget:
 
 # GPT2-svenska-wikipedia
 A Swedish GPT2-style model trained using the Flax CLM pipeline on the Swedish
-part of the wiki40b dataset.
-
+part of the wiki40b dataset and the Oscar dataset.
 https://huggingface.co/datasets/wiki40b
 
+The model was trained for around 22600 steps (42 hours) as part of the Hugging Face Jax/Flax challenge, reaching the following loss and learning rate:
+Loss: 3.1715331077575684, Learning Rate: 0.0024816440418362617
+
+The model could likely be trained for longer.
 
 ## Data cleaning and preprocessing
 The data was cleaned and preprocessed using the following script. Make sure to install the dependencies for beam_runner to make the dataset work.
@@ -26,10 +29,18 @@ def load_and_clean_wiki():
     return filtered_dataset
 
 def filter_wikipedia(batch):
-    batch["text"] = " ".join(batch["text"].split("\
-
-
-    batch["text"] = " ".join(batch["text"].split("\
+    batch["text"] = " ".join(batch["text"].split("\
+_START_SECTION_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_ARTICLE_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_ARTICLE_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_PARAGRAPH_\
+"))
     batch["text"] = " ".join(batch["text"].split("_NEWLINE_"))
     batch["text"] = " ".join(batch["text"].split("\xa0"))
     return batch
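
For reference, a minimal sketch of how the cleaned dataset might be loaded end to end. Everything outside `filter_wikipedia` is an assumption, since the diff does not show the body of `load_and_clean_wiki`: the `sv` config of wiki40b, the Beam `DirectRunner`, the train split, and an older release of the datasets library that still accepts the `beam_runner` argument. Note that the backslash line continuations inside the committed string literals resolve to the bare marker names, so the sketch simply splits on the markers directly.

```python
# Sketch only, not the committed script: the 'sv' config, runner and split are assumptions.
# pip install datasets apache-beam

from datasets import load_dataset


def filter_wikipedia(batch):
    # Single-line equivalent of the committed function: strip the wiki40b
    # structure markers, the _NEWLINE_ token and non-breaking spaces.
    for marker in ("_START_SECTION_", "_START_ARTICLE_", "_START_PARAGRAPH_",
                   "_NEWLINE_", "\xa0"):
        batch["text"] = " ".join(batch["text"].split(marker))
    return batch


def load_and_clean_wiki():
    # wiki40b is prepared with Apache Beam, hence the beam_runner dependency
    # the README mentions (accepted by older releases of the datasets library).
    dataset = load_dataset("wiki40b", "sv", beam_runner="DirectRunner", split="train")
    return dataset.map(filter_wikipedia)


if __name__ == "__main__":
    wiki = load_and_clean_wiki()
    print(wiki[0]["text"][:200])
```

Joining the split pieces on a single space flattens each article onto one line, which matches what the committed splits produce before the text is fed to the Flax CLM training script.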
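
A hedged sketch of loading the resulting model for generation. The repository id below is a placeholder for wherever this README lives, and the prompt is arbitrary; only the `AutoTokenizer` and `FlaxGPT2LMHeadModel` usage is standard transformers API.

```python
# Sketch only: the model id is a placeholder, replace it with this repository's id.
import jax
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

model_id = "birgermoell/gpt2-svenska-wikipedia"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = FlaxGPT2LMHeadModel.from_pretrained(model_id)

# Encode a Swedish prompt ("Once upon a time") and sample a continuation.
inputs = tokenizer("Det var en gång", return_tensors="np")
outputs = model.generate(
    inputs["input_ids"],
    max_length=50,
    do_sample=True,
    top_p=0.95,
    prng_key=jax.random.PRNGKey(0),
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```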