update to new dataset version
- README.md +39 -19
- all_results.json +6 -6
- config.json +1 -1
- extract_sents.py +45 -0
- pytorch_model.bin +2 -2
- train_results.json +6 -6
- trainer_state.json +93 -147
- training_args.bin +2 -2
README.md
CHANGED
@@ -40,6 +40,11 @@ The training dataset consists of:
 
 These are obtained from the [OPUS](https://opus.nlpl.eu/) base (Tiedemann, 2012) and filtered using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter) (Aulamo et al., 2020); see [`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic due to the retraining of a statistical alignment model, but in my experience, different runs tend to give extremely similar results. Do not hesitate to reach out if you experience difficulties in using this to collect data.
 
+In addition to these, the training dataset also includes parallel br/fr sentences, provided as
+glosses in the [Arbres](https://arbres.iker.cnrs.fr) wiki (Jouitteau, 2022), obtained from their
+[ongoing port](https://github.com/Autogramm/Breton/commit/45ac2c444a979b7ee41e5f24a3bfd1ec39f09d7d)
+to Universal Dependencies in the Autogramm project.
+
 ## Training procedure
 
 The training hyperparameters are those suggested by Adelani et al. (2022) in their [code release](https://github.com/masakhane-io/lafand-mt), which gave their best results for machine translation of several African languages.

@@ -48,42 +53,57 @@ More specifically, we use the [example training script](https://github.com/huggi
 
 ```bash
 python run_translation.py \
+    --model_name_or_path facebook/m2m100_418M \
+    --do_train \
+    --train_file {path_to_training_data} \
+    --source_lang br \
+    --target_lang fr \
+    --output_dir {path_to_model} \
+    --per_device_train_batch_size=8 \
+    --overwrite_output_dir \
+    --forced_bos_token fr \
+    --save_steps 4096 \
+    --fp16 \
+    --num_train_epochs 4
+
 ```
 
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
+
 - learning_rate: 5e-05
 - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
-- num_epochs: 3.0
+- num_epochs: 4.0
 
 ### Framework versions
 
-- Transformers 4.
+- Transformers 4.24.0
 - Pytorch 1.12.1+cu116
 - Datasets 2.6.1
 - Tokenizers 0.13.1
 
 ## References
 
-- Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter,
+- Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, et al.
+  2022. "A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News
+  Translation". In Proceedings of the 2022 Conference of the North American Chapter of the Association for
+  Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States.
+  Association for Computational Linguistics. <https://doi.org/10.18653/v1/2022.naacl-main.223>
+- Aulamo, Mikko, Sami Virpioja, and Jörg Tiedemann. 2020. "OpusFilter: A Configurable Parallel Corpus
+  Filtering Toolbox". In Proceedings of the 58th Annual Meeting of the Association for Computational
+  Linguistics: System Demonstrations, pages 150–156, Online. Association for Computational Linguistics.
+- Tiedemann, Jörg. 2012. "Parallel Data, Tools and Interfaces in OPUS". In Proceedings of the 8th
+  International Conference on Language Resources and Evaluation (LREC 2012).
+- Jouitteau, Mélanie (ed.). 2009–2022. ARBRES, a wiki grammar of Breton dialects and a resource centre for
+  their formal linguistic study. IKER, CNRS, <http://arbres.iker.cnrs.fr>. Creative Commons BY-NC-SA
+  licence.
+- Tyers, Francis M. 2009. "Rule-based Augmentation of Training Data in Breton-French Statistical Machine
+  Translation". In Proceedings of the 13th Annual Conference of the European Association for Machine
+  Translation (EAMT09), pages 213–218, Barcelona, Spain.
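Since the command above fine-tunes [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) with French as the forced target language, the resulting checkpoint can be used through the standard M2M100 generation API. A minimal inference sketch, assuming the fine-tuned model was saved to `./m2m100-br-fr` (a placeholder for the `{path_to_model}` above):

```python
# Minimal inference sketch; "./m2m100-br-fr" is a placeholder for the
# --output_dir used in the training command above.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("./m2m100-br-fr")
tokenizer = M2M100Tokenizer.from_pretrained("./m2m100-br-fr")

tokenizer.src_lang = "br"  # Breton input, matching --source_lang
inputs = tokenizer("Demat d'an holl !", return_tensors="pt")
# Force French as the output language, matching --forced_bos_token fr:
# M2M100 selects the target language through the first generated token.
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```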
all_results.json
CHANGED
@@ -1,8 +1,8 @@
 {
-    "epoch": 3.0,
-    "train_loss": 1.4955534830489579,
-    "train_runtime": 15709.331,
-    "train_samples":
-    "train_samples_per_second": 9.34,
-    "train_steps_per_second": 1.168
+    "epoch": 4.0,
+    "train_loss": 1.4005291703168083,
+    "train_runtime": 11994.4751,
+    "train_samples": 54393,
+    "train_samples_per_second": 18.139,
+    "train_steps_per_second": 1.134
 }
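The metrics above are internally consistent, assuming (as is my reading of how the Trainer computes them) that `train_samples_per_second` divides the total number of processed samples, i.e. `train_samples` times the number of epochs, by `train_runtime`:

```python
# Sanity check of the reported training throughput. The formulas are an
# assumption about how the Trainer computes these fields; the numbers line up.
samples, epochs, runtime, steps = 54393, 4.0, 11994.4751, 13600
print(samples * epochs / runtime)  # ~18.139 (train_samples_per_second)
print(steps / runtime)             # ~1.134  (train_steps_per_second)
```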
config.json
CHANGED
@@ -32,7 +32,7 @@
   "pad_token_id": 1,
   "scale_embedding": true,
   "torch_dtype": "float32",
-  "transformers_version": "4.
+  "transformers_version": "4.24.0",
   "use_cache": true,
   "vocab_size": 128112
 }
extract_sents.py
ADDED
@@ -0,0 +1,45 @@
+from typing import TextIO
+import re
+
+import click
+import conllu
+import jsonlines
+
+
+@click.command(help="Extract a parallel corpus from a CoNLL-U file with translations")
+@click.argument("conllu_path", type=click.File("r"))
+@click.argument("output_path", type=click.File("w"), default="-")
+@click.option("--main-langcode", default="br", show_default=True)
+@click.option("--require-langcode", multiple=True, show_default=True)
+def main(
+    conllu_path: TextIO,
+    main_langcode: str,
+    output_path: TextIO,
+    require_langcode: list[str],
+):
+    with jsonlines.Writer(output_path) as out_stream:
+        for tokenlist in conllu.parse_incr(conllu_path):
+            if m := re.match(r"'?(?P<content>[^/]+?)'?$", tokenlist.metadata["text"]):
+                main_text = m.group("content")
+            else:
+                continue
+            translations = {
+                km.group("langcode"): kv.group("content")
+                for k, v in tokenlist.metadata.items()
+                if (km := re.match(r"text_(?P<langcode>.*)", k))
+                and (kv := re.match(r"'?(?P<content>[^/]+?)'?$", v))
+            }
+            if not all(l in translations for l in require_langcode):
+                continue
+            out_stream.write(
+                {
+                    "translation": {
+                        main_langcode: main_text,
+                        **translations,
+                    }
+                }
+            )
+
+
+if __name__ == "__main__":
+    main()
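The new script converts the sentence-level `text` / `text_{langcode}` metadata comments of a CoNLL-U file into the JSON-lines `{"translation": {...}}` records that `run_translation.py` accepts as `--train_file`, stripping optional surrounding single quotes and skipping texts that contain a slash. A toy illustration of that input contract (the sample sentence is invented here, not taken from the Arbres data):

```python
# Toy demonstration of the metadata convention that extract_sents.py reads.
# The sentence below is invented for illustration; real input comes from the
# Autogramm port of the Arbres glosses.
import conllu

SAMPLE = """\
# text = 'Setu an ti.'
# text_fr = 'Voici la maison.'
1\tSetu\tsetu\tADV\t_\t_\t0\troot\t_\t_
2\tan\tan\tDET\t_\t_\t3\tdet\t_\t_
3\tti\tti\tNOUN\t_\t_\t1\tobj\t_\t_
"""

sent = conllu.parse(SAMPLE)[0]
print(sent.metadata["text"])     # 'Setu an ti.' (quotes stripped by the script)
print(sent.metadata["text_fr"])  # 'Voici la maison.'
# extract_sents.py would emit:
# {"translation": {"br": "Setu an ti.", "fr": "Voici la maison."}}
```

On the real file, an invocation like `python extract_sents.py arbres.conllu arbres.jsonl --require-langcode fr` (file names hypothetical) would keep only the sentences that do have a French translation.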
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:c1c8ae0171f992187869f7f6979a8762112705b6caa0404548a13cf039f8a5f1
+size 1935795713
train_results.json
CHANGED
@@ -1,8 +1,8 @@
 {
-    "epoch": 3.0,
-    "train_loss": 1.4955534830489579,
-    "train_runtime": 15709.331,
-    "train_samples":
-    "train_samples_per_second": 9.34,
-    "train_steps_per_second": 1.168
+    "epoch": 4.0,
+    "train_loss": 1.4005291703168083,
+    "train_runtime": 11994.4751,
+    "train_samples": 54393,
+    "train_samples_per_second": 18.139,
+    "train_steps_per_second": 1.134
 }
trainer_state.json
CHANGED
@@ -1,241 +1,187 @@
 {
   "best_metric": null,
   "best_model_checkpoint": null,
-  "epoch": 3.0,
-  "global_step": 18342,
+  "epoch": 4.0,
+  "global_step": 13600,
   "is_hyper_param_search": false,
   "is_local_process_zero": true,
   "is_world_process_zero": true,
   "log_history": [
     {
-      "epoch": 0.
-      "learning_rate": 4.
-      "loss": 2.
+      "epoch": 0.15,
+      "learning_rate": 4.816176470588236e-05,
+      "loss": 2.6313,
       "step": 500
     },
     {
-      "epoch": 0.
-      "learning_rate": 4.
-      "loss": 2.
+      "epoch": 0.29,
+      "learning_rate": 4.632352941176471e-05,
+      "loss": 2.2069,
       "step": 1000
     },
     {
-      "epoch": 0.
-      "learning_rate": 4.
-      "loss": 2.
+      "epoch": 0.44,
+      "learning_rate": 4.448529411764706e-05,
+      "loss": 2.035,
       "step": 1500
     },
     {
-      "epoch": 0.
-      "learning_rate": 4.
-      "loss":
+      "epoch": 0.59,
+      "learning_rate": 4.2647058823529415e-05,
+      "loss": 1.9491,
       "step": 2000
     },
     {
-      "epoch": 0.
-      "learning_rate": 4.
-      "loss":
+      "epoch": 0.74,
+      "learning_rate": 4.08125e-05,
+      "loss": 1.8742,
       "step": 2500
     },
     {
-      "epoch": 0.
-      "learning_rate":
-      "loss": 1.
+      "epoch": 0.88,
+      "learning_rate": 3.897426470588236e-05,
+      "loss": 1.8387,
       "step": 3000
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 1.03,
+      "learning_rate": 3.713602941176471e-05,
+      "loss": 1.6941,
       "step": 3500
     },
     {
-      "epoch":
-      "learning_rate": 3.
-      "loss": 1.
+      "epoch": 1.18,
+      "learning_rate": 3.529779411764706e-05,
+      "loss": 1.5224,
       "step": 4000
     },
     {
-      "epoch":
-      "learning_rate": 3.
-      "loss": 1.
+      "epoch": 1.32,
+      "learning_rate": 3.3459558823529415e-05,
+      "loss": 1.4897,
       "step": 4500
     },
     {
-      "epoch":
-      "learning_rate": 3.
-      "loss": 1.
+      "epoch": 1.47,
+      "learning_rate": 3.1621323529411765e-05,
+      "loss": 1.4445,
       "step": 5000
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 1.62,
+      "learning_rate": 2.978308823529412e-05,
+      "loss": 1.4593,
       "step": 5500
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 1.76,
+      "learning_rate": 2.7944852941176468e-05,
+      "loss": 1.4251,
       "step": 6000
     },
     {
-      "epoch": 1.
-      "learning_rate":
-      "loss": 1.
+      "epoch": 1.91,
+      "learning_rate": 2.6113970588235297e-05,
+      "loss": 1.39,
       "step": 6500
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 2.06,
+      "learning_rate": 2.427573529411765e-05,
+      "loss": 1.2959,
       "step": 7000
     },
     {
-      "epoch":
-      "learning_rate": 2.
-      "loss": 1.
+      "epoch": 2.21,
+      "learning_rate": 2.24375e-05,
+      "loss": 1.1621,
       "step": 7500
     },
     {
-      "epoch":
-      "learning_rate": 2.
-      "loss": 1.
+      "epoch": 2.35,
+      "learning_rate": 2.0599264705882353e-05,
+      "loss": 1.1374,
       "step": 8000
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 2.5,
+      "learning_rate": 1.876102941176471e-05,
+      "loss": 1.1649,
       "step": 8500
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 2.65,
+      "learning_rate": 1.6926470588235294e-05,
+      "loss": 1.1513,
       "step": 9000
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 2.79,
+      "learning_rate": 1.5088235294117647e-05,
+      "loss": 1.1463,
       "step": 9500
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 2.94,
+      "learning_rate": 1.3250000000000002e-05,
+      "loss": 1.1466,
       "step": 10000
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss": 1.
+      "epoch": 3.09,
+      "learning_rate": 1.1411764705882353e-05,
+      "loss": 1.0411,
       "step": 10500
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss":
+      "epoch": 3.24,
+      "learning_rate": 9.573529411764706e-06,
+      "loss": 0.9581,
       "step": 11000
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss":
+      "epoch": 3.38,
+      "learning_rate": 7.735294117647058e-06,
+      "loss": 0.9514,
       "step": 11500
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss":
+      "epoch": 3.53,
+      "learning_rate": 5.897058823529412e-06,
+      "loss": 0.9429,
       "step": 12000
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss":
+      "epoch": 3.68,
+      "learning_rate": 4.058823529411765e-06,
+      "loss": 0.9676,
       "step": 12500
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss":
+      "epoch": 3.82,
+      "learning_rate": 2.2205882352941175e-06,
+      "loss": 0.9324,
       "step": 13000
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss":
+      "epoch": 3.97,
+      "learning_rate": 3.8235294117647064e-07,
+      "loss": 0.9555,
       "step": 13500
     },
     {
-      "epoch":
-      "learning_rate":
-      "loss":
-      "step": 14000
-    },
-    {
-      "epoch":
-      "learning_rate": 1.0473230836331916e-05,
-      "loss": 1.0918,
-      "step": 14500
-    },
-    {
-      "epoch": 2.45,
-      "learning_rate": 9.110238796205431e-06,
-      "loss": 1.0878,
-      "step": 15000
-    },
-    {
-      "epoch": 2.54,
-      "learning_rate": 7.747246756078944e-06,
-      "loss": 1.0506,
-      "step": 15500
-    },
-    {
-      "epoch": 2.62,
-      "learning_rate": 6.384254715952459e-06,
-      "loss": 1.0557,
-      "step": 16000
-    },
-    {
-      "epoch": 2.7,
-      "learning_rate": 5.021262675825973e-06,
-      "loss": 1.0325,
-      "step": 16500
-    },
-    {
-      "epoch": 2.78,
-      "learning_rate": 3.658270635699488e-06,
-      "loss": 1.0784,
-      "step": 17000
-    },
-    {
-      "epoch": 2.86,
-      "learning_rate": 2.295278595573002e-06,
-      "loss": 1.0239,
-      "step": 17500
-    },
-    {
-      "epoch": 2.94,
-      "learning_rate": 9.322865554465163e-07,
-      "loss": 1.0211,
-      "step": 18000
-    },
-    {
-      "epoch": 3.0,
-      "step": 18342,
-      "total_flos": 2.148457555862323e+16,
-      "train_loss": 1.4955534830489579,
-      "train_runtime": 15709.331,
-      "train_samples_per_second": 9.34,
-      "train_steps_per_second": 1.168
+      "epoch": 4.0,
+      "step": 13600,
+      "total_flos": 3.918346910230118e+16,
+      "train_loss": 1.4005291703168083,
+      "train_runtime": 11994.4751,
+      "train_samples_per_second": 18.139,
+      "train_steps_per_second": 1.134
     }
   ],
-  "max_steps": 18342,
-  "num_train_epochs": 3,
-  "total_flos": 2.148457555862323e+16,
+  "max_steps": 13600,
+  "num_train_epochs": 4,
+  "total_flos": 3.918346910230118e+16,
   "trial_name": null,
   "trial_params": null
 }
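The logged learning rates match the linear schedule from the README (5e-05 decayed to zero over `max_steps` 13600); for instance, the first entry:

```python
# Reproduce the first logged learning rate under a no-warmup linear schedule
# (the absence of warmup is inferred from the numbers, not from the args).
base_lr, max_steps, step = 5e-05, 13600, 500
print(base_lr * (max_steps - step) / max_steps)  # ~4.8162e-05, as logged
```

Likewise, the logged `epoch` 0.15 at step 500 is consistent with 13600 / 4 = 3400 optimizer steps per epoch (500 / 3400 ≈ 0.147).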
training_args.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:d7e19c4b52c1665d4e24c8332861794cb0354d00704d62e085c5f3112b7d82d7
+size 3579