Llama-2-7b-hf-DPO-LookAhead5_FullEval_TTree1.4_TLoop0.7_TEval0.2_V1.0

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on the None dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6899	0.3002	71	0.7199	0.1044	0.1468	0.5	-0.0424	-98.9657	-131.7005	-0.6723	-0.6398
0.5545	0.6004	142	0.6834	-0.1295	-0.1555	0.6000	0.0260	-101.9890	-134.0396	-0.6949	-0.6636
0.6881	0.9006	213	0.7185	-0.1471	-0.1685	0.6000	0.0214	-102.1191	-134.2157	-0.7134	-0.6805
0.6234	1.2008	284	0.8098	-0.9930	-1.0067	0.6000	0.0137	-110.5010	-142.6748	-0.7955	-0.7622
0.2756	1.5011	355	0.7770	-1.2850	-1.3168	0.6000	0.0318	-113.6021	-145.5950	-0.8659	-0.8358
0.4006	1.8013	426	0.7082	-0.8266	-0.9994	0.7000	0.1728	-110.4281	-141.0111	-0.8156	-0.7870
0.0745	2.1015	497	0.8545	-1.9092	-2.0160	0.5	0.1068	-120.5937	-151.8366	-1.0343	-1.0061
0.1066	2.4017	568	0.9854	-2.7276	-2.7740	0.5	0.0463	-128.1734	-160.0211	-1.2086	-1.1809
0.0845	2.7019	639	0.9790	-2.8897	-2.9459	0.5	0.0562	-129.8927	-161.6417	-1.2043	-1.1773