---
model-index:
  - name: robinlee99/Pythia-2.8B-TLDR-Iterative-SamPO
    results: []
datasets:
  - webis/tldr-17
language:
  - en
base_model: EleutherAI/pythia-2.8b
license: apache-2.0
---

# Model Card for Pythia-2.8B-TLDR-Iterative-SamPO

This repository provides a fine-tuned version of Pythia-2.8B, trained with our proposed SamPO algorithm, introduced in *Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence*.
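
As a rough sketch of the idea: DPO's implicit reward sums per-token log-ratios over the whole response, which favors longer outputs, so SamPO down-samples both responses' token-level log-ratios to a shared length before computing the preference loss. The snippet below is an illustrative reimplementation based on the paper's description, not the training code behind this checkpoint; tensor shapes and argument names are our assumptions.

```python
import torch
import torch.nn.functional as F

def sampo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               chosen_mask, rejected_mask, beta=0.5):
    """Illustrative SamPO-style loss. All *_logps are per-token log-probs
    of shape (batch, seq_len); masks are 1 on response tokens, 0 on padding."""
    chosen_ratio = (policy_chosen_logps - ref_chosen_logps) * chosen_mask
    rejected_ratio = (policy_rejected_logps - ref_rejected_logps) * rejected_mask

    chosen_rewards, rejected_rewards = [], []
    for c, r, cm, rm in zip(chosen_ratio, rejected_ratio, chosen_mask, rejected_mask):
        c, r = c[cm.bool()], r[rm.bool()]
        k = min(c.numel(), r.numel())  # shared, length-matched token budget
        # Randomly down-sample the longer response's per-token log-ratios.
        c = c[torch.randperm(c.numel(), device=c.device)[:k]]
        r = r[torch.randperm(r.numel(), device=r.device)[:k]]
        chosen_rewards.append(c.sum())
        rejected_rewards.append(r.sum())

    margin = torch.stack(chosen_rewards) - torch.stack(rejected_rewards)
    # Standard DPO logistic loss, now on length-matched implicit rewards.
    return -F.logsigmoid(beta * margin).mean()
```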

## Performance

| Pairwise Comparison | GPT-4 win rate |
| --- | --- |
| Pythia-2.8B-TLDR-Iterative-SamPO vs. DPO | 78.66% |

## Evaluation Details

We evaluate the model with the same GPT-4 win-rate prompt template proposed in the DPO paper. The sampled test set is included in this repository.
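
For illustration, a single pairwise judgment can be collected as follows. The exact template comes from the DPO paper; the wording below is a paraphrase, and the judge model name and answer parsing are assumptions:

```python
from openai import OpenAI

client = OpenAI()

# Paraphrase of the DPO paper's TL;DR comparison prompt (not verbatim).
JUDGE_TEMPLATE = (
    "Which of the following summaries does a better job of summarizing the "
    "most important points in the given forum post?\n\n"
    "Post:\n{post}\n\n"
    "Summary A:\n{summary_a}\n\n"
    "Summary B:\n{summary_b}\n\n"
    "FIRST provide a one-sentence comparison of the two summaries. "
    "SECOND, on a new line, answer only \"A\" or \"B\" to indicate which is better."
)

def judge(post: str, summary_a: str, summary_b: str) -> str:
    """Return GPT-4's verdict ('A' or 'B') for one test post."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0.0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            post=post, summary_a=summary_a, summary_b=summary_b)}],
    )
    # The verdict letter is expected on the final non-empty line.
    return resp.choices[0].message.content.strip().splitlines()[-1].strip()
```

The A/B order should be randomized per example to control for position bias; the win rate is then the fraction of test posts where the judge prefers this model's summary over the DPO baseline's.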

## Training hyperparameters

The following hyperparameters were used during DPO/SamPO training (a sketch of how they map onto a standard trainer configuration follows the list):

- DPO beta: 0.5
- learning_rate: 1e-6
- total_train_batch_size: 128
- optimizer: AdamW with beta1=0.9, beta2=0.999, epsilon=1e-8
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- weight_decay: 0.0
- num_epochs: 1.0
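
For readers who want to set up a comparable run, the list above maps onto a standard TRL `DPOConfig` as sketched below. This is only a hypothetical mapping: TRL implements vanilla DPO, so SamPO's down-sampled KL term would need a custom loss (see the sketch earlier in this card), and the batch-size split is illustrative.

```python
from trl import DPOConfig

# Hypothetical mapping of the hyperparameters above onto TRL's DPOConfig.
# TRL ships vanilla DPO; SamPO's down-sampling would replace the loss term.
config = DPOConfig(
    output_dir="pythia-2.8b-tldr-sampo",
    beta=0.5,                        # DPO beta
    learning_rate=1e-6,
    per_device_train_batch_size=4,   # 4 x 4 accumulation x 8 GPUs = 128 total
    gradient_accumulation_steps=4,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.0,
    num_train_epochs=1.0,
)
```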