birgermoell committed · Commit 7db48e1
1 Parent(s): 41cd5ea
Update README.md
README.md CHANGED
@@ -6,10 +6,13 @@ widget:
 
 # GPT2-svenska-wikipedia
 A Swedish GPT2-style model trained using the Flax CLM pipeline on the Swedish
-part of the wiki40b dataset.
-
+part of the wiki40b dataset and the Oscar dataset.
 https://huggingface.co/datasets/wiki40b
 
+The model was trained for around 22600 steps (42 hours) as part of the Hugging Face Jax/Flax challenge, reaching the following loss and learning rate:
+Loss: 3.1715331077575684, Learning Rate: 0.0024816440418362617
+
+The model could likely be trained for longer.
 
 ## Data cleaning and preprocessing
 The data was cleaned and preprocessed using the following script. Make sure to install the dependencies for beam_runner to make the dataset work.
@@ -26,10 +29,18 @@ def load_and_clean_wiki():
     return filtered_dataset
 
 def filter_wikipedia(batch):
-    batch["text"] = " ".join(batch["text"].split("\
-
-
-    batch["text"] = " ".join(batch["text"].split("\
+    batch["text"] = " ".join(batch["text"].split("\
+_START_SECTION_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_ARTICLE_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_ARTICLE_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_PARAGRAPH_\
+"))
     batch["text"] = " ".join(batch["text"].split("_NEWLINE_"))
     batch["text"] = " ".join(batch["text"].split("\xa0"))
     return batch
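
For reference, a minimal sketch of how the cleaned dataset might be loaded end to end. Everything outside `filter_wikipedia` is an assumption, since the diff does not show the body of `load_and_clean_wiki`: the `sv` config of wiki40b, the Beam `DirectRunner`, the train split, and an older release of the datasets library that still accepts the `beam_runner` argument. Note that the backslash line continuations inside the committed string literals resolve to the bare marker names, so the sketch simply splits on the markers directly.

```python
# Sketch only, not the committed script: the 'sv' config, runner and split are assumptions.
# pip install datasets apache-beam

from datasets import load_dataset


def filter_wikipedia(batch):
    # Single-line equivalent of the committed function: strip the wiki40b
    # structure markers, the _NEWLINE_ token and non-breaking spaces.
    for marker in ("_START_SECTION_", "_START_ARTICLE_", "_START_PARAGRAPH_",
                   "_NEWLINE_", "\xa0"):
        batch["text"] = " ".join(batch["text"].split(marker))
    return batch


def load_and_clean_wiki():
    # wiki40b is prepared with Apache Beam, hence the beam_runner dependency
    # the README mentions (accepted by older releases of the datasets library).
    dataset = load_dataset("wiki40b", "sv", beam_runner="DirectRunner", split="train")
    return dataset.map(filter_wikipedia)


if __name__ == "__main__":
    wiki = load_and_clean_wiki()
    print(wiki[0]["text"][:200])
```

Joining the split pieces on a single space flattens each article onto one line, which matches what the committed splits produce before the text is fed to the Flax CLM training script.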
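
A hedged sketch of loading the resulting model for generation. The repository id below is a placeholder for wherever this README lives, and the prompt is arbitrary; only the `AutoTokenizer` and `FlaxGPT2LMHeadModel` usage is standard transformers API.

```python
# Sketch only: the model id is a placeholder, replace it with this repository's id.
import jax
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

model_id = "birgermoell/gpt2-svenska-wikipedia"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = FlaxGPT2LMHeadModel.from_pretrained(model_id)

# Encode a Swedish prompt ("Once upon a time") and sample a continuation.
inputs = tokenizer("Det var en gång", return_tensors="np")
outputs = model.generate(
    inputs["input_ids"],
    max_length=50,
    do_sample=True,
    top_p=0.95,
    prng_key=jax.random.PRNGKey(0),
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```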