phi-2-gpo-renew2-b0.001-0.5ultrafeedback-i1

This model is a fine-tuned version of DUAL-GPO/phi-2-gpo-renew2-b0.001-i0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0493
  • Rewards/chosen: 0.0665
  • Rewards/rejected: 0.0507
  • Rewards/accuracies: 0.5690
  • Rewards/margins: 0.0158
  • Logps/rejected: -1825.6942
  • Logps/chosen: -2149.9026
  • Logits/rejected: -0.2409
  • Logits/chosen: -0.2329
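
As a quick sanity check on the numbers above, the reported reward margin can be reproduced from the two reward values. This is a minimal sketch, assuming the metrics follow the convention common to DPO-style preference trainers, where the margin is the mean chosen reward minus the mean rejected reward; the card itself does not define these metrics.

```python
# Final evaluation metrics copied from the list above.
rewards_chosen = 0.0665
rewards_rejected = 0.0507
reported_margin = 0.0158

# Assumption: margin = mean(chosen reward) - mean(rejected reward).
computed_margin = rewards_chosen - rewards_rejected

# Reported values are rounded to four decimals, so compare with a tolerance.
assert abs(computed_margin - reported_margin) < 1e-4
print(f"computed margin: {computed_margin:.4f}")
```

Under the same assumption, Rewards/accuracies would be the fraction of evaluation pairs in which the chosen response receives the higher reward, so 0.5690 is only modestly above the 0.5 chance level.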

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
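
To illustrate how the learning rate, cosine scheduler, and warmup ratio listed above interact, the sketch below computes a linear warmup followed by a half-cosine decay to zero. It is a plain-Python approximation of that common schedule shape, not the training code; `cosine_lr` and `TOTAL_STEPS` are hypothetical names, and the total step count is an assumed round number because the card does not state it.

```python
import math

PEAK_LR = 5e-06      # learning_rate from the list above
WARMUP_RATIO = 0.1   # lr_scheduler_warmup_ratio from the list above


def cosine_lr(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Hypothetical total step count, chosen only for illustration.
TOTAL_STEPS = 2000
for step in (0, 100, 200, 1100, 2000):
    print(step, f"{cosine_lr(step, TOTAL_STEPS):.2e}")
```

With these assumed values the rate climbs linearly over the first 10% of steps, peaks at 5e-06, and then decays smoothly to zero by the final step.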

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0516 | 0.05 | 100 | 0.0521 | 0.0302 | 0.0252 | 0.5160 | 0.0050 | -1851.1947 | -2186.1650 | -0.2268 | -0.2310 |
| 0.0382 | 0.1 | 200 | 0.0514 | 0.0484 | 0.0391 | 0.5265 | 0.0092 | -1837.2778 | -2168.0449 | -0.2575 | -0.2563 |
| 0.0425 | 0.16 | 300 | 0.0515 | 0.0312 | 0.0225 | 0.5610 | 0.0088 | -1853.9449 | -2185.1636 | -0.2744 | -0.2743 |
| 0.052 | 0.21 | 400 | 0.0521 | 0.0749 | 0.0598 | 0.5335 | 0.0151 | -1816.5804 | -2141.4990 | -0.2811 | -0.2714 |
| 0.056 | 0.26 | 500 | 0.0503 | 0.0578 | 0.0446 | 0.5590 | 0.0132 | -1831.7897 | -2158.6121 | -0.3082 | -0.2984 |
| 0.0544 | 0.31 | 600 | 0.0504 | 0.0516 | 0.0383 | 0.5560 | 0.0134 | -1838.1166 | -2164.7563 | -0.4014 | -0.3857 |
| 0.0445 | 0.37 | 700 | 0.0502 | 0.0513 | 0.0391 | 0.5595 | 0.0122 | -1837.3597 | -2165.1204 | -0.3294 | -0.3191 |
| 0.0584 | 0.42 | 800 | 0.0502 | 0.0562 | 0.0432 | 0.5575 | 0.0130 | -1833.1853 | -2160.2231 | -0.3252 | -0.3142 |
| 0.0435 | 0.47 | 900 | 0.0500 | 0.0832 | 0.0666 | 0.5470 | 0.0166 | -1809.8208 | -2133.2534 | -0.2741 | -0.2653 |
| 0.0538 | 0.52 | 1000 | 0.0497 | 0.0603 | 0.0471 | 0.5585 | 0.0132 | -1829.3304 | -2156.1384 | -0.2713 | -0.2671 |
| 0.0542 | 0.58 | 1100 | 0.0496 | 0.0876 | 0.0698 | 0.5535 | 0.0178 | -1806.5677 | -2128.8037 | -0.2533 | -0.2442 |
| 0.0482 | 0.63 | 1200 | 0.0496 | 0.0614 | 0.0474 | 0.5630 | 0.0140 | -1829.0079 | -2155.0408 | -0.2336 | -0.2285 |
| 0.0441 | 0.68 | 1300 | 0.0496 | 0.0563 | 0.0427 | 0.5680 | 0.0136 | -1833.6627 | -2160.0811 | -0.2370 | -0.2324 |
| 0.0524 | 0.73 | 1400 | 0.0497 | 0.0535 | 0.0398 | 0.5700 | 0.0137 | -1836.6145 | -2162.8931 | -0.2605 | -0.2534 |
| 0.0426 | 0.79 | 1500 | 0.0495 | 0.0606 | 0.0456 | 0.5675 | 0.0150 | -1830.8245 | -2155.8127 | -0.2496 | -0.2420 |
| 0.0389 | 0.84 | 1600 | 0.0493 | 0.0691 | 0.0529 | 0.5655 | 0.0162 | -1823.5212 | -2147.2993 | -0.2432 | -0.2348 |
| 0.0557 | 0.89 | 1700 | 0.0493 | 0.0663 | 0.0505 | 0.5670 | 0.0159 | -1825.9503 | -2150.0764 | -0.2429 | -0.2348 |
| 0.0513 | 0.94 | 1800 | 0.0493 | 0.0669 | 0.0510 | 0.5680 | 0.0158 | -1825.3712 | -2149.5503 | -0.2432 | -0.2349 |
| 0.0501 | 0.99 | 1900 | 0.0493 | 0.0665 | 0.0507 | 0.5675 | 0.0158 | -1825.7052 | -2149.9072 | -0.2409 | -0.2329 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2
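
Because PEFT is listed among the framework versions, the weights are presumably published as an adapter rather than a full model. The sketch below shows one way such an adapter might be loaded. `load_adapter` is a hypothetical helper, and the sketch assumes the repository contains a standard PEFT adapter config whose recorded base model can be downloaded; depending on the installed Transformers version, the base model may also need `trust_remote_code=True`.

```python
def load_adapter(repo_id: str = "DUAL-GPO/phi-2-gpo-renew2-b0.001-0.5ultrafeedback-i1"):
    """Load the adapter on top of whichever base model its config names."""
    # Imports are local so the sketch can be read without the libraries installed.
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # AutoPeftModelForCausalLM reads the adapter config, loads the base model
    # recorded there, and attaches the adapter weights to it.
    model = AutoPeftModelForCausalLM.from_pretrained(repo_id)

    # Assumption: the tokenizer is taken from the same base model that the
    # adapter config points to.
    base_id = model.peft_config["default"].base_model_name_or_path
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    return model, tokenizer


# Example usage (downloads weights, so it is left commented out):
# model, tokenizer = load_adapter()
# inputs = tokenizer("Explain gradient accumulation briefly.", return_tensors="pt")
# print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```

Reading the base model from the adapter config avoids hard-coding it, which matters here because the card names both an intermediate checkpoint and an upstream base model.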

Model tree for DUAL-GPO/phi-2-gpo-renew2-b0.001-0.5ultrafeedback-i1

  • Base model: microsoft/phi-2
  • This model: adapter