---
model-index:
- name: robinlee99/Pythia-2.8B-TLDR-Iterative-SamPO
  results: []
datasets:
- webis/tldr-17
language:
- en
base_model: EleutherAI/pythia-2.8b
license: apache-2.0
---

# Model Card for Pythia-2.8B-TLDR-Iterative-SamPO

This repository provides a fine-tuned version of Pythia-2.8B, trained with our proposed [SamPO](https://github.com/LuJunru/SamPO) algorithm: Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence.

## Performance

| Pairwise Comparison | GPT-4 win rate |
| ----- | ------ |
| Pythia-2.8B-TLDR-Iterative-SamPO vs. DPO | 78.66% |

## Evaluation Details

We evaluate our model with the same GPT-4 win-rate prompt template proposed in the [DPO paper](https://arxiv.org/pdf/2305.18290). The [sampled test set](https://huggingface.co/robinlee99/Pythia-2.8B-TLDR-Iterative-SamPO/blob/main/test_tldr.jsonl) is included in this repo.

## Training hyperparameters

The following hyperparameters were used during DPO/SamPO training:
- DPO beta: 0.5
- learning_rate: 1e-6
- total_train_batch_size: 128
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- weight_decay: 0.0
- num_epochs: 1.0
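
The training itself was run with the code in the SamPO repository linked above. Purely as an illustration, the sketch below shows how the same hyperparameters would map onto a `DPOConfig` from the `trl` library; note that stock `trl` implements standard DPO rather than SamPO's down-sampled KL, and the per-device batch size / gradient accumulation split and mixed-precision setting are assumptions.

```python
# Illustrative only: the listed hyperparameters expressed as a trl DPOConfig.
# Stock trl implements vanilla DPO; the actual run used the SamPO codebase.
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="pythia-2.8b-tldr-sampo",
    beta=0.5,                       # DPO beta
    learning_rate=1e-6,
    per_device_train_batch_size=8,  # assumption: 8 x 16 accumulation = 128 total
    gradient_accumulation_steps=16,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.0,
    num_train_epochs=1.0,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,                      # assumption: mixed-precision training
)
```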
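
The key difference from vanilla DPO is the down-sampled KL divergence mentioned in the model description. As a loose, unofficial sketch of that idea (the authoritative implementation is in the SamPO repository): instead of summing per-token log-probability ratios over the full, unequal lengths of the chosen and rejected responses, an equal number of tokens is sampled from each response before summing, which removes the length bias from the implicit reward.

```python
# Loose sketch of down-sampled per-token log-ratios (not the official SamPO code).
# chosen_logratios / rejected_logratios: per-token (log pi_theta - log pi_ref)
# for the chosen and rejected responses of a single preference pair.
import torch
import torch.nn.functional as F

def sampo_style_loss(chosen_logratios: torch.Tensor,
                     rejected_logratios: torch.Tensor,
                     beta: float = 0.5) -> torch.Tensor:
    # Down-sample both responses to the same number of tokens, so neither
    # side's implicit reward is inflated simply by being longer.
    k = min(chosen_logratios.numel(), rejected_logratios.numel())
    chosen_idx = torch.randperm(chosen_logratios.numel())[:k]
    rejected_idx = torch.randperm(rejected_logratios.numel())[:k]
    chosen_reward = beta * chosen_logratios[chosen_idx].sum()
    rejected_reward = beta * rejected_logratios[rejected_idx].sum()
    # Standard DPO logistic loss on the length-balanced rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward)
```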
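
## How to use

A minimal inference sketch with the Hugging Face `transformers` library. The Reddit-style prompt ending in `TL;DR:` is an assumption here; adjust it to match the TL;DR data format you use.

```python
# Minimal inference sketch; the prompt format below is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "robinlee99/Pythia-2.8B-TLDR-Iterative-SamPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

post = ("I adopted a second cat last month and the two of them finally "
        "started sharing a bed this week.")
prompt = f"POST: {post}\n\nTL;DR:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,                       # greedy decoding for a deterministic summary
    pad_token_id=tokenizer.eos_token_id,
)
summary = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(summary.strip())
```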