zephyr-7b-dpo-qlora

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5018
  • Rewards/chosen: -2.1482
  • Rewards/rejected: -3.1540
  • Rewards/accuracies: 0.7590
  • Rewards/margins: 1.0058
  • Logps/rejected: -556.6644
  • Logps/chosen: -480.1277
  • Logits/rejected: -1.2931
  • Logits/chosen: -1.3827
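
The reward figures above are related by simple arithmetic. Assuming they follow the usual DPO logging convention, where the margin is the mean reward of the chosen responses minus that of the rejected ones, the reported margin can be checked directly. This is a minimal sketch; the variable names are illustrative and not taken from the training code.

```python
# Evaluation metrics copied from the list above.
rewards_chosen = -2.1482
rewards_rejected = -3.1540
reported_margin = 1.0058

# Assumed convention: margin = reward(chosen) - reward(rejected).
margin = rewards_chosen - rewards_rejected

# The recomputed margin should agree with the reported one up to rounding.
assert abs(margin - reported_margin) < 1e-4
print(f"recomputed margin: {margin:.4f}")
```

Both rewards are negative, so the positive margin reflects the rejected responses being penalized more than the chosen ones, not the chosen responses being rewarded in absolute terms.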

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
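
The totals in this list follow from the per-device values, and the scheduler settings describe a warmup followed by a cosine decay. The sketch below reproduces the batch-size arithmetic and illustrates one common form of such a schedule (linear warmup to the peak rate, then cosine decay to zero). The exact schedule shape, the `lr_at` helper, and the `total_steps` value are assumptions for illustration and are not taken from the training code.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 4             # per device
eval_batch_size = 8              # per device
num_devices = 2
gradient_accumulation_steps = 4
peak_lr = 5e-6
warmup_ratio = 0.1

# Effective batch sizes across devices (and accumulation steps for training).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)


def lr_at(step: int, total_steps: int) -> float:
    """Hypothetical helper: linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps is a made-up value for illustration, not the real step count.
total_steps = 1000
for step in (0, 50, 100, 500, 1000):
    print(step, lr_at(step, total_steps))
```

With a warmup ratio of 0.1, roughly the first tenth of the optimizer steps ramp the learning rate up, and the remainder decay it smoothly.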

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6635 | 0.0523 | 100 | 0.6640 | 0.0129 | -0.0631 | 0.6830 | 0.0761 | -247.5831 | -264.0175 | -2.0469 | -2.1424 |
| 0.6119 | 0.1047 | 200 | 0.6207 | -0.5556 | -0.8212 | 0.6790 | 0.2657 | -323.3911 | -320.8676 | -1.9524 | -2.0449 |
| 0.5874 | 0.1570 | 300 | 0.5849 | -0.4240 | -0.8044 | 0.7000 | 0.3804 | -321.7128 | -307.7115 | -1.9609 | -2.0494 |
| 0.5608 | 0.2094 | 400 | 0.5607 | -1.1817 | -1.7752 | 0.7290 | 0.5935 | -418.7894 | -383.4811 | -1.6969 | -1.7823 |
| 0.5287 | 0.2617 | 500 | 0.5434 | -1.7248 | -2.4550 | 0.7250 | 0.7303 | -486.7726 | -437.7878 | -1.5394 | -1.6284 |
| 0.5504 | 0.3141 | 600 | 0.5278 | -1.3541 | -2.1302 | 0.7370 | 0.7761 | -454.2872 | -400.7156 | -1.4439 | -1.5287 |
| 0.5243 | 0.3664 | 700 | 0.5278 | -0.9934 | -1.7415 | 0.7420 | 0.7481 | -415.4179 | -364.6462 | -1.4888 | -1.5754 |
| 0.5346 | 0.4187 | 800 | 0.5285 | -1.0509 | -1.8191 | 0.7360 | 0.7681 | -423.1764 | -370.4044 | -1.4861 | -1.5718 |
| 0.5072 | 0.4711 | 900 | 0.5197 | -1.6324 | -2.5736 | 0.7300 | 0.9412 | -498.6239 | -428.5474 | -1.3651 | -1.4531 |
| 0.5023 | 0.5234 | 1000 | 0.5158 | -1.6927 | -2.6755 | 0.7460 | 0.9828 | -508.8179 | -434.5808 | -1.2853 | -1.3779 |
| 0.4954 | 0.5758 | 1100 | 0.5126 | -1.4605 | -2.3370 | 0.7480 | 0.8765 | -474.9688 | -411.3603 | -1.3921 | -1.4843 |
| 0.4983 | 0.6281 | 1200 | 0.5105 | -2.0566 | -3.0678 | 0.7450 | 1.0112 | -548.0505 | -470.9687 | -1.1942 | -1.2848 |
| 0.4774 | 0.6805 | 1300 | 0.5093 | -1.9802 | -3.0112 | 0.7510 | 1.0311 | -542.3931 | -463.3254 | -1.2574 | -1.3491 |
| 0.4516 | 0.7328 | 1400 | 0.5058 | -2.1539 | -3.2003 | 0.7530 | 1.0464 | -561.2969 | -480.7002 | -1.2592 | -1.3500 |
| 0.4758 | 0.7851 | 1500 | 0.5018 | -2.2342 | -3.2427 | 0.7550 | 1.0085 | -565.5339 | -488.7257 | -1.2803 | -1.3710 |
| 0.4967 | 0.8375 | 1600 | 0.5019 | -2.1690 | -3.1744 | 0.7590 | 1.0054 | -558.7111 | -482.2090 | -1.2939 | -1.3837 |
| 0.4769 | 0.8898 | 1700 | 0.5018 | -2.1431 | -3.1460 | 0.7600 | 1.0029 | -555.8691 | -479.6245 | -1.2936 | -1.3834 |
| 0.4843 | 0.9422 | 1800 | 0.5019 | -2.1475 | -3.1534 | 0.7580 | 1.0059 | -556.6094 | -480.0620 | -1.2932 | -1.3829 |
| 0.5048 | 0.9945 | 1900 | 0.5019 | -2.1484 | -3.1540 | 0.7590 | 1.0056 | -556.6639 | -480.1491 | -1.2933 | -1.3829 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.40.1
  • Pytorch 2.1.2
  • Datasets 2.19.0
  • Tokenizers 0.19.1
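
Because the weights were trained with PEFT, they are most likely distributed as an adapter that is applied on top of its base model at load time. The following is a minimal loading sketch, not a tested recipe: it uses the repository id `chrlu/zephyr-7b-dpo-qlora`, and it assumes that the repository ships tokenizer files, that the hardware supports bfloat16, and that the `accelerate` package is installed for `device_map="auto"`. The `load_model` helper is hypothetical.

```python
def load_model(repo_id: str = "chrlu/zephyr-7b-dpo-qlora"):
    """Hypothetical helper: load the adapter on top of its base model.

    Calling this downloads several gigabytes of weights.
    """
    # Imports are deferred so the heavy libraries are only needed when called.
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Reads the adapter config, loads the referenced base model,
    # and attaches the adapter weights.
    model = AutoPeftModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16-capable hardware
        device_map="auto",           # assumption: accelerate is installed
    )
    # Assumption: the adapter repository includes tokenizer files.
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return model, tokenizer
```

Calling `model, tokenizer = load_model()` would then return objects usable with the standard `generate` API. If the tokenizer is not present in the adapter repository, loading it from the base model is the usual fallback.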