This model was trained as part of a series of experiments comparing pure DPO, SFT, and ORPO, all of which are supported by Unsloth and Hugging Face TRL.

Note: This model is extremely buggy and not recommended for use. However, it did not massively overfit the way #3 did, so it may still be usable.

Training was somewhat unstable, so the best learning rate (LR) probably lies somewhere in the range [1e-5, 1e-4].

Benchmarks

For some reason the Open LLM Leaderboard refuses to benchmark this model, so I guess we will never know how well it performs.

Training Details

Duration: ~10-12 hours on one Kaggle T4 with Unsloth

Model: https://huggingface.co/unsloth/mistral-7b-v0.2-bnb-4bit

Dataset: https://huggingface.co/datasets/argilla/dpo-mix-7k

Rank: 8

Alpha: 16

Learning rate: 1e-4

Beta: 0.1

Batch size: 8

Epochs: 1

Learning rate scheduler: Linear
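For readers unfamiliar with what Beta controls, the following is a rough sketch of the standard per-example DPO loss in plain Python. It is not the TRL implementation used for this run; the function names and the log-probability values in the example are made up for illustration. Beta scales how strongly the policy is pushed to prefer the chosen response over the rejected one, relative to the reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Hypothetical log-probabilities, not taken from the real run.
    print(dpo_loss(-10.0, -14.0, -11.0, -12.0, beta=0.1))
```

When the policy and reference agree exactly, the loss is log(2); it falls as the policy raises the chosen response relative to the rejected one.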

Prompt Format: ChatML

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Why is the sky blue?<|im_end|>
<|im_start|>assistant
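The prompt above can be built programmatically. The helper below is a small sketch; `to_chatml` and its `add_generation_prompt` argument are hypothetical names, not part of this model's tooling. In practice the tokenizer's own chat template, if one is configured, may be the better choice.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = to_chatml([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"},
    ])
    print(prompt)
```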

WandB Reports

[Two WandB report images]


Collection including G-reen/EXPERIMENT-DPO-m7b2-4-merged