Simplifying Alignment: From RLHF to Direct Preference Optimization (DPO)
Large language models (LLMs) are getting smarter every day, but teaching them to do what we want—aligning them with human preferences—is still a tough nut to crack. As deep learning folks, we know that if you want a model to learn something, you give it data, right? So why not gather some examples of what we like and have the model learn those preferences?
That’s where Reinforcement Learning from Human Feedback (RLHF) comes in. It’s a clever way to teach LLMs to follow human preferences by learning from comparison data. But RLHF can be a headache: it brings reinforcement learning into the mix, and the optimization gets tricky fast.
Enter Direct Preference Optimization (DPO). DPO skips the RL part while still teaching models to follow preferences. It’s simpler, cleaner, and honestly, who doesn’t love simplicity?
In this blog, we’ll go on a journey from RLHF to DPO, break down the math (don’t worry, we’ll keep it chill), and see why DPO might just be the smarter, easier way forward.
Reinforcement Learning from Human Feedback (RLHF)
RLHF is a framework for aligning language models with human preferences through a structured, three-phase process. Each phase builds on the previous one, refining the model to better understand and produce responses that align with human expectations. Let’s break it down:
1. Supervised Fine-Tuning (SFT)
We start by taking a pre-trained language model and fine-tuning it on high-quality, task-specific data. This process produces a base policy \( \pi^{\text{SFT}}(y \mid x) \), which represents the model's probability of generating an output \( y \) given an input \( x \).
This base policy serves as a strong starting point, capturing general task-related behavior but still needing refinement to align with human preferences.
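As a rough illustration (not from the original post), here is what one SFT update can look like in PyTorch. It assumes a causal LM whose forward pass returns next-token logits of shape (batch, seq_len, vocab); the function name is ours, and data loading, batching, and prompt masking are omitted.

```python
import torch.nn.functional as F

def sft_step(model, optimizer, input_ids):
    """One supervised fine-tuning step with the standard next-token loss.

    Assumes model(input_ids) returns logits of shape (batch, seq_len, vocab).
    """
    logits = model(input_ids)                          # (B, T, V)
    # Position t predicts token t + 1, so shift logits and targets by one.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # (B*(T-1), V)
        input_ids[:, 1:].reshape(-1),                  # (B*(T-1),)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```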
2. Preference Sampling & Reward Learning
This phase focuses on collecting data about human preferences and building a reward model to represent those preferences numerically.
Preference Sampling
Here’s how it works:
- The supervised fine-tuned model \( \pi^{\text{SFT}} \) generates pairs of responses \( (y_1, y_2) \) for a given input or prompt \( x \).
- Human annotators compare these responses and select their preferred response \( y_w \) (the "winner") over the less preferred response \( y_l \) (the "loser").
These human preferences are then used as training data for the next step.
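Concretely, each labeled comparison can be stored as a simple record like the one below (a schematic sketch; the field names and example text are illustrative, not a fixed format):

```python
# One pairwise preference record: a prompt x, the preferred response y_w,
# and the less-preferred response y_l.
preference_example = {
    "prompt": "Explain the KL divergence in one sentence.",
    "chosen": "KL divergence measures how one probability distribution "
              "diverges from a reference distribution.",   # y_w (winner)
    "rejected": "KL divergence is just the Euclidean distance "
                "between two distributions.",              # y_l (loser)
}
```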
Reward Modeling
We want to create a reward model \( r(x, y) \), which assigns a numerical score (reward) to each response \( y \) given a prompt \( x \). This score reflects how well the response aligns with human preferences.
Modeling Pairwise Preferences
To train this reward model, we rely on pairwise comparisons of responses \( y_w \) (winner) and \( y_l \) (loser). The preferences are modeled using the Bradley-Terry framework, which assigns a probability to each preference.
The probability of \( y_w \) being preferred over \( y_l \) is:

\[
P(y_w \succ y_l \mid x) = \frac{\exp\left(r(x, y_w)\right)}{\exp\left(r(x, y_w)\right) + \exp\left(r(x, y_l)\right)}
\]

- \( r(x, y_w) \) and \( r(x, y_l) \): rewards (scores) assigned to the winner and loser, respectively.
- The numerator \( \exp\left(r(x, y_w)\right) \) represents the (unnormalized) likelihood of the winner being the preferred choice.
- The denominator \( \exp\left(r(x, y_w)\right) + \exp\left(r(x, y_l)\right) \) ensures the two probabilities sum to 1.
Rearranging, we write the probability in terms of a difference between rewards:

\[
P(y_w \succ y_l \mid x) = \frac{1}{1 + \exp\left(r(x, y_l) - r(x, y_w)\right)}
\]

- \( r(x, y_l) - r(x, y_w) \): the difference between the scores of the loser and the winner.
- \( \exp(\cdot) \): converts this difference into a scaling factor for the probability.
Using the sigmoid function \( \sigma(z) = \frac{1}{1 + e^{-z}} \), the equation becomes:

\[
P(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)
\]

- \( \sigma\left(r(x, y_w) - r(x, y_l)\right) \): if the winner's score is much higher than the loser's, the probability approaches 1 (the winner is strongly preferred), as the short numeric example below illustrates.
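For instance, with made-up scores \( r(x, y_w) = 2.0 \) and \( r(x, y_l) = 0.5 \), the model predicts the winner is preferred with probability \( \sigma(1.5) \approx 0.82 \):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical reward scores for the winner and the loser.
r_w, r_l = 2.0, 0.5
print(sigmoid(r_w - r_l))  # ~0.82: the winner is clearly, but not certainly, preferred
```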
Training the Reward Model
To train \( r(x, y) \), we optimize it to match human preferences as closely as possible. This is done using maximum likelihood estimation (MLE).

The loss function for the reward model is:

\[
\mathcal{L}_R(r; \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\left(r(x, y_w) - r(x, y_l)\right) \right]
\]

- \( \mathcal{D} \): dataset of human preferences (pairs \( (y_w, y_l) \) for each prompt \( x \)).
- \( -\log \sigma(\cdot) \): penalizes predictions that assign low probabilities to the actual human preferences.
- The goal is to minimize the negative log-likelihood, ensuring the reward model aligns its predictions with the collected human feedback.
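In code, this loss is nearly a one-liner once the reward model has scored both responses. Below is a minimal PyTorch sketch (the naming is ours); it assumes you already have per-example scalar rewards for the winner and loser.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood under the Bradley-Terry model.

    r_w, r_l: reward scores for the preferred (winner) and less-preferred
    (loser) responses, each of shape (batch_size,).
    """
    # -log sigma(r_w - r_l), averaged over the batch.
    return -F.logsigmoid(r_w - r_l).mean()

# Dummy scores from a hypothetical reward model.
loss = reward_model_loss(torch.tensor([1.2, 0.8, 2.0]),
                         torch.tensor([0.3, 1.0, -0.5]))
```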
3. Reinforcement Learning (RL) Optimization
The final step involves fine-tuning the policy \( \pi_\theta \) with reinforcement learning to maximize the learned reward. However, directly maximizing the reward can push the policy far away from the base policy \( \pi^{\text{SFT}} \), causing unnatural or over-optimized behavior. To address this, we add a KL penalty to constrain the policy.
The RL Objective
The optimization objective is:

\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[ r(x, y) \right] - \beta \, \mathbb{D}_{\text{KL}}\left[ \pi_\theta(y \mid x) \,\Vert\, \pi_{\text{ref}}(y \mid x) \right]
\]

First Term: \( \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\left[ r(x, y) \right] \)
- Encourages the policy to generate responses with higher rewards.

Second Term: \( \beta \, \mathbb{D}_{\text{KL}}\left[ \pi_\theta(y \mid x) \,\Vert\, \pi_{\text{ref}}(y \mid x) \right] \)
- \( \mathbb{D}_{\text{KL}} \): the Kullback-Leibler (KL) divergence, a measure of how much \( \pi_\theta \) differs from \( \pi_{\text{ref}} \).
- Penalizes the policy for straying too far from the reference policy \( \pi_{\text{ref}} \) (usually \( \pi^{\text{SFT}} \)).

\( \beta \): A weighting factor controlling the balance between maximizing rewards and staying close to the reference policy.
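To make this concrete, here is a minimal sketch (our own simplification, not from the post) of the "shaped" reward that PPO-style RLHF implementations typically hand to the RL optimizer; real implementations usually apply the KL penalty per token rather than per sequence.

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        logp_theta: torch.Tensor,
                        logp_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Sequence-level reward signal handed to the RL optimizer.

    reward:     r(x, y) from the reward model, shape (batch,)
    logp_theta: log pi_theta(y | x) under the current policy, shape (batch,)
    logp_ref:   log pi_ref(y | x) under the frozen reference policy, shape (batch,)
    """
    # r(x, y) - beta * (log pi_theta - log pi_ref): the second term is a
    # per-sample estimate of the KL divergence between policy and reference.
    return reward - beta * (logp_theta - logp_ref)
```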
This process completes the RLHF pipeline, where reinforcement learning ensures the model generates responses that maximize alignment with human preferences while maintaining a natural and task-relevant behavior.
Challenges of RLHF
- Non-Differentiability of Language Outputs: Language generation involves sampling discrete tokens, which breaks the flow of gradients during optimization. This makes it tricky to directly apply the gradient-based methods that are the backbone of deep learning.
- Reward Model Struggles to Generalize: The reward model learns to predict human preferences, but it is hard to capture the subtlety and variability of what humans truly prefer. If the reward model fails to generalize, optimization can become misaligned or biased.
- Computational and Implementation Overhead: RL adds significant complexity to the pipeline. From training the reward model to tuning hyperparameters like the KL penalty, it requires specialized expertise and substantially more compute than simpler fine-tuning methods.
From RLHF to Direct Preference Optimization (DPO)
In this section, we’ll reframe the RLHF objective and introduce a key variable transformation. This reformulation will pave the way to understanding how Direct Preference Optimization (DPO) works and why it’s a simpler, more efficient alternative to RLHF.
Reformulating the RLHF Objective
The RLHF objective begins as:

\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[ r(x, y) \right] - \beta \, \mathbb{D}_{\text{KL}}\left[ \pi_\theta(y \mid x) \,\Vert\, \pi_{\text{ref}}(y \mid x) \right]
\]
This objective balances two goals:
- Maximizing reward: Encourage the model to generate outputs that align with human preferences, as captured by the reward \( r(x, y) \).
- Constraining deviation: Prevent the model from diverging too far from a reference policy \( \pi_{\text{ref}} \) (usually the supervised fine-tuned model \( \pi^{\text{SFT}} \)), which ensures stability and avoids overly aggressive changes.
Expanding the KL Divergence Term
The KL divergence measures the "distance" between the current policy \( \pi_\theta \) and the reference policy \( \pi_{\text{ref}} \). Expanding it gives:

\[
\mathbb{D}_{\text{KL}}\left[ \pi_\theta(y \mid x) \,\Vert\, \pi_{\text{ref}}(y \mid x) \right] = \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[ \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]
\]

This term penalizes \( \pi_\theta \) for deviating from \( \pi_{\text{ref}} \). Substituting it back into the original objective gives:

\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[ r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]
\]
Switching to a Minimization Form
For simplicity, we rewrite the objective as a minimization problem (minimizing the negative of the maximization objective, scaled by \( 1/\beta \)):

\[
\min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[ \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{1}{\beta} r(x, y) \right]
\]
This highlights the trade-offs:
- The \( \log \pi_\theta(y \mid x) \) term is the policy's negative entropy; minimizing it discourages the policy from collapsing onto a handful of outputs.
- The \( -\log \pi_{\text{ref}}(y \mid x) \) term pulls the policy toward outputs the reference model considers likely; together with the first term, it forms the KL penalty.
- The \( -r(x, y) / \beta \) term biases the policy toward high-reward responses.
Introducing the Partition Function
Let's introduce the partition function

\[
Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)
\]

Using \( Z(x) \), we can express the optimal policy \( \pi^{*} \) as:

\[
\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)
\]
Here’s the intuition:
- \( \pi_{\text{ref}}(y \mid x) \): acts as a baseline distribution (our starting point).
- \( \exp\left( \frac{1}{\beta} r(x, y) \right) \): scales outputs based on their reward, making high-reward outputs more likely.
- \( Z(x) \): normalizes the distribution so probabilities sum to 1.
This formulation enables us to reweight the reference policy by incorporating preferences encoded in the reward model, without requiring direct reinforcement learning.
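For completeness, here is the standard one-line argument (following the DPO paper's derivation) for why \( \pi^{*} \) is optimal: plugging \( Z(x) \) back into the minimization objective turns it into a KL divergence.

\[
\mathbb{E}_{y \sim \pi_\theta}\left[ \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{1}{\beta} r(x, y) \right]
= \mathbb{E}_{y \sim \pi_\theta}\left[ \log \frac{\pi_\theta(y \mid x)}{\pi^{*}(y \mid x)} \right] - \log Z(x)
= \mathbb{D}_{\text{KL}}\left[ \pi_\theta \,\Vert\, \pi^{*} \right] - \log Z(x)
\]

Since \( \log Z(x) \) does not depend on \( \pi_\theta \), the objective is minimized exactly when \( \pi_\theta = \pi^{*} \).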
Derivation of DPO Loss
The key trick of Direct Preference Optimization (DPO) is focusing on pairwise preferences, which simplifies optimization. Let’s break it down:
Pairwise Preferences
For two completions \( y_w \) (winner) and \( y_l \) (loser), we care about the probability that humans prefer \( y_w \) over \( y_l \). Using the Bradley-Terry model, this probability is:

\[
P(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right)
\]

where \( \sigma \) is the sigmoid function.
Simplifying
Solving the optimal-policy expression from the earlier reformulation for the reward and substituting:

\[
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
\]

When comparing two outputs \( y_w \) and \( y_l \), the partition function \( Z(x) \) cancels out (since it is the same for both), leaving:

\[
P(y_w \succ y_l \mid x) = \sigma\left( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)
\]

This simplifies the computation, as we no longer need to compute \( Z(x) \) explicitly. The sigmoid function ensures that a larger implicit reward margin corresponds to a higher preference probability.
DPO Loss
To train \( \pi_\theta \) (the parameterized policy), we replace the optimal policy \( \pi^{*} \) with \( \pi_\theta \) and use maximum likelihood estimation (MLE) over human preferences. The DPO loss becomes:

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]
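In practice, this loss is easy to compute from the summed log-probabilities of each response under the trainable policy and the frozen reference model. Here is a minimal PyTorch sketch (the function name and argument layout are ours):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_w: torch.Tensor, logp_theta_l: torch.Tensor,
             logp_ref_w: torch.Tensor, logp_ref_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from sequence-level log-probabilities.

    Each tensor has shape (batch,): `_w` for the preferred response,
    `_l` for the rejected one; `theta` is the trainable policy,
    `ref` the frozen reference model.
    """
    # beta * (winner log-ratio - loser log-ratio)
    logits = beta * ((logp_theta_w - logp_ref_w) - (logp_theta_l - logp_ref_l))
    # -log sigma(.), averaged over the batch.
    return -F.logsigmoid(logits).mean()
```

Only \( \pi_\theta \) receives gradients here; the reference log-probabilities are typically computed under `torch.no_grad()` (or precomputed once) and treated as constants.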
Why Does This Trick Work?
- No Reinforcement Learning Required: By reframing the problem in terms of pairwise preferences and reweighting the reference policy, DPO eliminates the need for complex reinforcement learning.
- Simpler Optimization: The partition function \( Z(x) \) cancels out in pairwise comparisons, reducing computational overhead. Training focuses directly on aligning with human preferences.
- Improved Stability: The implicit KL constraint from \( \pi_{\text{ref}} \) ensures that \( \pi_\theta \) remains grounded, avoiding the extreme behavior changes often seen in RL.
- Focus on Human Preferences: By directly optimizing pairwise preference probabilities, DPO centers the learning process on human-labeled data, aligning outputs more naturally with human expectations.
Conclusion
Direct Preference Optimization simplifies the alignment of large language models by replacing the RL phase of RLHF with a direct optimization framework. By working with pairwise preferences and avoiding reinforcement learning, DPO achieves alignment with reduced computational and implementation overhead, making it a compelling alternative for aligning large-scale models.
As alignment techniques evolve, DPO exemplifies how simplifying assumptions can lead to practical and effective solutions for real-world challenges in AI.