Simplifying Alignment: From RLHF to Direct Preference Optimization (DPO)

Community Article Published January 19, 2025

Large language models (LLMs) are getting smarter every day, but teaching them to do what we want—aligning them with human preferences—is still a tough nut to crack. As deep learning folks, we know that if you want a model to learn something, you give it data, right? So why not gather some examples of what we like and have the model learn those preferences?

That’s where Reinforcement Learning with Human Feedback (RLHF) comes in. It’s a clever way to teach LLMs to follow human preferences by using feedback data. But RLHF can be a bit of a headache—it brings reinforcement learning into the mix, and optimization gets tricky fast.

Enter Direct Preference Optimization (DPO). DPO skips the RL part while still teaching models to follow preferences. It’s simpler, cleaner, and honestly, who doesn’t love simplicity?

In this blog, we’ll go on a journey from RLHF to DPO, break down the math (don’t worry, we’ll keep it chill), and see why DPO might just be the smarter, easier way forward.

Reinforcement Learning with Human Feedback (RLHF)

RLHF is a framework for aligning language models with human preferences through a structured, three-phase process. Each phase builds on the previous one, refining the model to better understand and produce responses that align with human expectations. Let’s break it down:

1. Supervised Fine-Tuning (SFT)

We start by taking a pre-trained language model and fine-tuning it on high-quality, task-specific data. This process produces a base policy \( \pi_{\text{SFT}}(y \mid x) \), which represents the model's probability of generating an output \( y \) given an input \( x \).

This base policy serves as a strong starting point, capturing general task-related behavior but still needing refinement to align with human preferences.
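
As a rough sketch (not the exact recipe used by any particular model), SFT is plain next-token cross-entropy on curated demonstrations. The model name and the single demonstration below are placeholders.

```python
# Minimal SFT sketch: fine-tune a small causal LM on task-specific demonstrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [
    "Question: What is the capital of France?\nAnswer: Paris.",
]  # stand-in for a real, high-quality task dataset

model.train()
for text in demonstrations:
    batch = tokenizer(text, return_tensors="pt")
    # For causal LMs, passing labels=input_ids yields the next-token cross-entropy loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```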

2. Preference Sampling & Reward Learning

This phase focuses on collecting data about human preferences and building a reward model to represent those preferences numerically.

Preference Sampling

Here’s how it works:

  1. The supervised fine-tuned model generates pairs of responses \( (y_1, y_2) \) for a given input or prompt \( x \).
  2. Human annotators compare these responses and select their preferred response, \( y_w \) (the "winner"), over the less preferred response, \( y_l \) (the "loser").

These human preferences are then used as training data for the next step.
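
A common (though not universal) way to store these comparisons is as prompt/chosen/rejected triples; the field names and example texts below are purely illustrative.

```python
# One record of a pairwise preference dataset: the prompt x, the preferred
# response y_w ("chosen") and the rejected response y_l ("rejected").
preference_dataset = [
    {
        "prompt": "Explain KL divergence in one sentence.",
        "chosen": "KL divergence measures how one probability distribution differs from another.",
        "rejected": "KL divergence is a kind of neural network layer.",
    },
    # ... many more human-labeled comparisons
]
```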

Reward Modeling

We want to create a reward model \( r_\phi(x, y) \), which assigns a numerical score (reward) to each response \( y \) given a prompt \( x \). This score reflects how well the response aligns with human preferences.

Modeling Pairwise Preferences

To train this reward model, we rely on pairwise comparisons of responses \( y_w \) (winner) and \( y_l \) (loser). The preferences are modeled using the Bradley-Terry framework, which assigns a probability to each preference:

  1. The probability of \( y_w \) being preferred over \( y_l \) is:

     \[ p_\phi(y_w > y_l \mid x) = \frac{\exp r_\phi(x, y_w)}{\exp r_\phi(x, y_w) + \exp r_\phi(x, y_l)}. \]

    • \( r_\phi(x, y_w) \) and \( r_\phi(x, y_l) \): rewards (scores) assigned to the winner and the loser, respectively.
    • The numerator \( \exp r_\phi(x, y_w) \) represents the likelihood of the winner being the preferred choice.
    • The denominator \( \exp r_\phi(x, y_w) + \exp r_\phi(x, y_l) \) includes both options, ensuring the probabilities sum to 1.
  2. Rearranging, we can write the probability in terms of a difference between rewards:

     \[ p_\phi(y_w > y_l \mid x) = \frac{1}{1 + \exp \left[ r_\phi(x, y_l) - r_\phi(x, y_w) \right]}. \]

    • \( r_\phi(x, y_l) - r_\phi(x, y_w) \): the difference between the loser's and the winner's scores.
    • \( \exp(\cdot) \): converts this difference into a scaling factor for the probability.
  3. Using the sigmoid function \( \sigma(z) = \frac{1}{1 + e^{-z}} \), the equation becomes:

     \[ p_\phi(y_w > y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big). \]

    • If the winner's score is much higher than the loser's, this probability approaches 1 (the winner is strongly preferred).
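
As a quick numeric sanity check (with made-up reward values), the softmax form in step 1 and the sigmoid form in step 3 give exactly the same probability:

```python
# Verify that exp(r_w) / (exp(r_w) + exp(r_l)) equals sigmoid(r_w - r_l) for toy rewards.
import math

r_w, r_l = 2.0, 0.5  # made-up rewards for the winner and the loser

softmax_form = math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))
sigmoid_form = 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(softmax_form, sigmoid_form)  # both ≈ 0.8176: the winner is preferred ~82% of the time
```
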
Training the Reward Model

To train \( r_\phi \), we optimize it to match human preferences as closely as possible. This is done using maximum likelihood estimation (MLE):

  • The loss function for the reward model is:

    \[ \mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \big[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \big]. \]

    • \( \mathcal{D} \): the dataset of human preferences (pairs \( y_w, y_l \) for each prompt \( x \)).
    • \( \log \sigma(\cdot) \): penalizes predictions that assign low probability to the actual human preference.
    • The goal is to minimize this negative log-likelihood, aligning the reward model's predictions with the collected human feedback.
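
Here is a minimal PyTorch sketch of this loss. The reward values are stand-ins for whatever \( r_\phi \) produces; in a real setup they would come from a reward head on top of a language model.

```python
# Minimal sketch of the pairwise reward-model loss: -log sigmoid(r_w - r_l).
import torch
import torch.nn.functional as F

def reward_model_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the human preferences under the Bradley-Terry model."""
    return -F.logsigmoid(r_w - r_l).mean()

# Toy usage: pretend the reward model already scored a batch of (winner, loser) pairs.
r_w = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)   # r_phi(x, y_w)
r_l = torch.tensor([0.4, 0.9, -1.0], requires_grad=True)  # r_phi(x, y_l)

loss = reward_model_loss(r_w, r_l)
loss.backward()  # in practice, gradients flow back into the reward model's parameters
print(loss.item())
```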

3. Reinforcement Learning (RL) Optimization

The final step involves fine-tuning the policy \( \pi_\theta(y \mid x) \) with reinforcement learning to maximize the learned reward. However, directly maximizing the reward can lead to excessive deviations from the base policy \( \pi_{\text{SFT}} \), causing unnatural or over-optimized behavior. To address this, we add a penalty that constrains the policy:

The RL Objective

The optimization objective is:

\[ \max_{\pi_\theta} \ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)} \big[ r_\phi(x, y) \big] - \beta\, \text{D}_{\text{KL}} \big[ \pi_\theta(y \mid x)\, \Vert\, \pi_{\text{ref}}(y \mid x) \big]. \]

  • First Term: \( \mathbb{E}[r_\phi(x, y)] \)

    • Encourages the policy to generate responses with higher rewards.
  • Second Term: \( \text{D}_{\text{KL}} \)

    • \( \text{D}_{\text{KL}}(P \Vert Q) \): the Kullback-Leibler (KL) divergence, a measure of how much \( P \) differs from \( Q \).
    • Penalizes the policy \( \pi_\theta \) for straying too far from the reference policy \( \pi_{\text{ref}} \) (usually \( \pi_{\text{SFT}} \)).
  • \( \beta \): A weighting factor controlling the balance between maximizing rewards and staying close to the reference policy.
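
The snippet below is only meant to make the objective concrete: it computes a Monte Carlo estimate of reward minus the \( \beta \)-weighted KL penalty for a batch of sampled responses, with every quantity replaced by a made-up tensor. A real RLHF run would maximize this with PPO or a similar algorithm.

```python
# Toy estimate of the RL objective: E[r(x, y)] - beta * KL[pi_theta || pi_ref],
# using log pi_theta(y|x) - log pi_ref(y|x) as a per-sample KL estimator.
import torch

beta = 0.1  # illustrative value; in practice this is tuned

rewards = torch.tensor([1.5, 0.2, 0.9])           # r_phi(x, y) for sampled responses
logp_theta = torch.tensor([-12.0, -20.0, -15.0])  # log pi_theta(y | x)
logp_ref = torch.tensor([-13.0, -18.0, -15.5])    # log pi_ref(y | x)

kl_per_sample = logp_theta - logp_ref
objective = (rewards - beta * kl_per_sample).mean()
print(objective)  # the policy is updated to push this number up
```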

This process completes the RLHF pipeline, where reinforcement learning ensures the model generates responses that maximize alignment with human preferences while maintaining a natural and task-relevant behavior.

Challenges of RLHF

  1. Non-Differentiability of Language Outputs:
    Language generation involves sampling discrete tokens, which breaks the flow of gradients during optimization. This makes it tricky to directly use gradient-based methods (the backbone of deep learning) to adjust the model.

  2. Reward Model Struggles to Generalize:
    The reward model learns to predict human preferences, but it’s hard to capture the subtlety and variability of what humans truly prefer. If the reward model fails to generalize, it can lead to misaligned or biased optimization.

  3. Computational and Implementation Overhead:
    RL adds significant complexity to the pipeline. From designing the reward function to tuning hyperparameters like the KL penalty, it requires specialized expertise and significantly more compute power compared to simpler fine-tuning methods.

From RLHF to Direct Preference Optimization (DPO)

In this section, we’ll reframe the RLHF objective and introduce a key variable transformation. This reformulation will pave the way to understanding how Direct Preference Optimization (DPO) works and why it’s a simpler, more efficient alternative to RLHF.

Reformulating the RLHF Objective

The RLHF objective begins as:

\[ \max_{\pi} \ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(y \mid x)} \big[ r(x, y) \big] - \beta\, \text{D}_{\text{KL}} \big[ \pi(y \mid x)\, \Vert\, \pi_{\text{ref}}(y \mid x) \big]. \]

This objective balances two goals:

  1. Maximizing reward: Encourage the model \( \pi(y \mid x) \) to generate outputs \( y \) that align with human preferences, as captured by the reward \( r(x, y) \).
  2. Constraining deviation: Prevent the model from diverging too far from a reference policy \( \pi_{\text{ref}}(y \mid x) \) (usually the supervised fine-tuned model), which ensures stability and avoids overly aggressive changes.

Expanding the KL Divergence Term

The KL divergence measures the "distance" between the current policy \( \pi(y \mid x) \) and the reference policy \( \pi_{\text{ref}}(y \mid x) \). Expanding it gives:

\[ \text{D}_{\text{KL}}\big[\pi(y \mid x)\, \Vert\, \pi_{\text{ref}}(y \mid x)\big] = \mathbb{E}_{y \sim \pi(y \mid x)} \big[ \log \pi(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \big]. \]

This term penalizes \( \pi \) for deviating from \( \pi_{\text{ref}} \). Substituting it back into the original objective gives:

\[ \max_{\pi} \ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(y \mid x)} \big[ r(x, y) - \beta \log \pi(y \mid x) + \beta \log \pi_{\text{ref}}(y \mid x) \big]. \]

Switching to a Minimization Form

For simplicity, we divide the objective by \( \beta \) and flip the sign, which turns the maximization into an equivalent minimization problem:

\[ \min_{\pi} \ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(y \mid x)} \Big[ \log \pi(y \mid x) - \log \pi_{\text{ref}}(y \mid x) - \frac{r(x, y)}{\beta} \Big]. \]

This highlights the trade-offs:

  • The \( \log \pi(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \) terms are the KL penalty in disguise: they keep the policy's outputs close to the reference model.
  • The \( -r(x, y) / \beta \) term biases the policy toward high-reward responses, with \( \beta \) controlling how strongly the reward can pull the policy away from the reference.

Introducing the Partition Function

Let's introduce a normalization term \( Z(x) \), known as the partition function:

\[ Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\Big[ \frac{r(x, y)}{\beta} \Big]. \]

Using \( Z(x) \), the optimal policy for the minimization problem above can be written in closed form:

\[ \pi(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\Big[ \frac{r(x, y)}{\beta} \Big]. \]

Here’s the intuition:

  • \( \pi_{\text{ref}}(y \mid x) \): acts as a baseline distribution (our starting point).
  • \( \exp\big[\frac{r(x, y)}{\beta}\big] \): up-weights outputs \( y \) according to their reward, making high-reward outputs more likely.
  • \( Z(x) \): normalizes the distribution so the probabilities sum to 1.
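
To see the reweighting in action, here is a toy example over a prompt with just three possible completions (all numbers are made up):

```python
# Reweight a toy reference distribution by exp(r / beta) and normalize with Z(x).
import torch

beta = 0.5
ref_probs = torch.tensor([0.5, 0.3, 0.2])  # pi_ref(y | x) over three completions
rewards = torch.tensor([0.1, 1.0, -0.5])   # r(x, y) for the same completions

unnormalized = ref_probs * torch.exp(rewards / beta)
Z = unnormalized.sum()                     # the partition function Z(x)
pi = unnormalized / Z                      # the reweighted (optimal) policy

print(pi, pi.sum())  # high-reward completions gain probability mass; sums to 1
```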

This formulation enables us to reweight the reference policy by incorporating preferences encoded in the reward model, without requiring direct reinforcement learning.
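
For the curious (this step is also spelled out in the mathematical derivations appendix of the DPO paper), here is why this reweighted distribution is exactly the optimizer. Writing \( \pi^*(y \mid x) \) for the distribution above, the minimization objective can be regrouped as:

\[ \min_{\pi}\ \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi(y \mid x)} \Big[ \log \frac{\pi(y \mid x)}{\pi^*(y \mid x)} - \log Z(x) \Big] = \min_{\pi}\ \mathbb{E}_{x \sim \mathcal{D}} \Big[ \text{D}_{\text{KL}}\big(\pi(y \mid x) \,\Vert\, \pi^*(y \mid x)\big) - \log Z(x) \Big]. \]

Since \( Z(x) \) does not depend on \( \pi \), the objective is minimized exactly when the KL divergence is zero, i.e., when \( \pi(y \mid x) = \pi^*(y \mid x) \).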

Derivation of DPO Loss

The key trick of Direct Preference Optimization (DPO) is focusing on pairwise preferences, which simplifies optimization. Let’s break it down:

Pairwise Preferences

For two completions \( y_1 \) (winner) and \( y_2 \) (loser), we care about the probability that humans prefer \( y_1 \) over \( y_2 \). Under the Bradley-Terry model (the same one used for the reward model), this probability is:

\[ p(y_1 > y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big), \]

where \( \sigma(z) = \frac{1}{1 + e^{-z}} \) is the sigmoid function.

Expressing the Reward Through \( \pi(y \mid x) \)

Recall the optimal policy from the earlier reformulation:

\[ \pi(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\Big[ \frac{r(x, y)}{\beta} \Big]. \]

Taking logarithms and solving for the reward gives:

\[ r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x). \]

Substituting this into the Bradley-Terry probability, the \( \beta \log Z(x) \) term cancels out (it is the same for both completions), leaving:

\[ p(y_1 > y_2 \mid x) = \sigma\Big(\beta \log \frac{\pi(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} - \beta \log \frac{\pi(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\Big). \]

This simplifies things dramatically: the preference probability is now written entirely in terms of the policy and the reference model, so we never need to compute \( Z(x) \) explicitly or train a separate reward model.

DPO Loss

To train \( \pi_\theta \) (the parameterized policy), we use maximum likelihood estimation (MLE) over the human preference dataset. The DPO loss becomes:

\[ \mathcal{L}_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big) \Big]. \]
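
Here is a minimal PyTorch sketch of the DPO loss, assuming you have already computed the summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model (in practice, libraries like Hugging Face TRL's `DPOTrainer` handle this bookkeeping for you):

```python
# Minimal sketch of the DPO loss: -log sigmoid(beta * (log-ratio_w - log-ratio_l)).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of preference pairs, given per-sequence log-probs."""
    logratio_w = policy_logp_w - ref_logp_w  # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    logratio_l = policy_logp_l - ref_logp_l  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (logratio_w - logratio_l)).mean()

# Toy usage with made-up log-probabilities of shape (batch,).
policy_logp_w = torch.tensor([-14.0, -20.0], requires_grad=True)
policy_logp_l = torch.tensor([-16.0, -19.0], requires_grad=True)
ref_logp_w = torch.tensor([-15.0, -21.0])  # reference model is frozen: no gradients
ref_logp_l = torch.tensor([-15.5, -18.5])

loss = dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l)
loss.backward()  # gradients only flow through the policy's log-probabilities
print(loss.item())
```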

Why Does This Trick Work?

  1. No Reinforcement Learning Required:
    By reframing the problem in terms of pairwise preferences and reweighting the reference policy, DPO eliminates the need for complex reinforcement learning.

  2. Simpler Optimization:
    The partition function \( Z(x) \) cancels out in pairwise comparisons, reducing computational overhead. Training focuses directly on aligning with human preferences.

  3. Improved Stability:
    The implicit KL constraint from \( \pi_{\text{ref}}(y \mid x) \) ensures that \( \pi_\theta \) remains grounded, avoiding the extreme behavior changes often seen in RL.

  4. Focus on Human Preferences:
    By directly optimizing for pairwise preference probabilities, DPO centers the learning process around human-labeled data, aligning outputs more naturally with human expectations.

Conclusion

Direct Preference Optimization simplifies the alignment of large language models by replacing the RL phase of RLHF with a direct optimization framework. By working with pairwise preferences and avoiding reinforcement learning, DPO achieves alignment with reduced computational and implementation overhead, making it a compelling alternative for aligning large-scale models.

As alignment techniques evolve, DPO exemplifies how simplifying assumptions can lead to practical and effective solutions for real-world challenges in AI.

Community

Great explanation! How they were able to convert an optimization problem into a differentiable equation is just amazing!
I was recently trying to understand what DPO does under the hood and I watched this video by @hkproj . Great work!

Also, just filling in for newbies like me:

  1. The maximization equation in the 3rd step of "Reformulating the RLHF Objective":
    We divide the maximization objective by −β, and because of the minus sign it becomes a minimization problem.
  2. In "Introducing the Partition Function", Z(x) is a normalization constant. I wasn't able to understand how this term Z(x) came into the picture and how it is substituted, so I asked ChatGPT and got this:
    (screenshot of ChatGPT's explanation)
    This makes a little bit of sense, but I have not verified whether it is correct.
  3. There are some helpful steps in the "Mathematical Derivations" section of the DPO paper: https://arxiv.org/pdf/2305.18290

Sadly this interface doesn't allow me to directly select text and comment on it like the ChatGPT interface does. Maybe a UI feature? :)

Article author

I am glad you are interested in the topic and have taken the time to read the entire post.

What I understand from your comment is that a more spelled-out derivation (with reasons) would make the math even more digestible. I will try to fine-tune this blog post to align with your comments (see what I did there? 😌)

Article author

@NotShrirang I have updated a lot of the sections. If you find some time, could you go through the post and let me know how it reads now?


@ariG23498 Hey, I read the article again, and it feels a lot easier to read. Kudos for your quick response! I know changing something you have put effort into is not easy.

Thanks for directly aligning with my preference!
Loss: 📉
