
RLHF / DPO

Generative AI (GenAI)

What is RLHF / DPO?

RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are techniques used to make large language models more helpful, honest, and safe after their initial training. A freshly trained language model can generate fluent text but might produce harmful content, make up facts confidently, or give unhelpful answers. RLHF addresses this by showing humans pairs of model responses and asking which answer is better, more accurate, or safer, then using reinforcement learning to teach the model to produce responses more like the preferred ones. Think of it as having a human tutor review the AI's homework and give feedback until it consistently produces high-quality answers. DPO is a newer, simpler alternative that achieves similar results without the complexity of reinforcement learning: it optimizes the model directly on the preference data. These techniques are why modern chatbots like ChatGPT and Claude feel so much more helpful and less erratic than earlier language models.
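To make the "humans rate pairs of responses" idea concrete, here is a minimal sketch of what a single preference record might look like. The field names and example text are purely illustrative, not drawn from any particular dataset:

```python
# One human preference record: an annotator saw two candidate answers
# to the same prompt and marked which one they preferred.
# Field names and contents are illustrative, not from a real dataset.
preference_record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and blue light "
              "scatters the most, so the sky looks blue.",
    "rejected": "The sky is blue because it reflects the ocean.",
}

# A preference dataset is simply a large collection of such records.
preference_dataset = [preference_record]
```

Both RLHF and DPO consume data in essentially this pair form; they differ only in how they turn the pairs into a training signal.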

Technical Deep Dive

Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are post-training alignment techniques that optimize language models against human preferences. The RLHF pipeline (Christiano et al., 2017; InstructGPT, 2022) involves three stages: supervised fine-tuning on demonstration data, training a reward model from pairwise human preference comparisons, and optimizing the language model policy via PPO (Proximal Policy Optimization) against the learned reward while constraining KL divergence from the base model. DPO (Rafailov et al., 2023) eliminates the separate reward model by reparameterizing the RLHF objective as a classification loss directly on preference pairs, exploiting the closed-form relationship between the optimal policy and the reward so that the reward is expressed implicitly through the policy itself. Variants include KTO (Kahneman-Tversky Optimization, which learns from binary desirable/undesirable feedback rather than pairs), IPO (Identity Preference Optimization), ORPO (Odds Ratio Preference Optimization), and iterative DPO with online data generation. Constitutional AI (Anthropic) extends RLHF with AI-generated feedback based on a set of written principles. These methods are critical for transforming pretrained language models into helpful, harmless, and honest assistants.
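As a concrete illustration of the DPO objective described above: the loss is a binary classification term on the log-probabilities that the policy and a frozen reference model assign to each preference pair. This is a minimal per-example scalar sketch in plain Python; real implementations operate on batched tensors of summed token log-probabilities:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023).

    Each argument is the total log-probability a model assigns to a
    completion; beta scales the implicit KL penalty that keeps the
    policy close to the reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    r_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    r_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid(margin), written stably as log(1 + exp(-margin)).
    return math.log1p(math.exp(-(r_chosen - r_rejected)))
```

When the policy exactly matches the reference, the reward margin is zero and the loss is log 2; gradient descent then pushes the policy to raise the probability of chosen completions relative to rejected ones, driving the loss toward zero.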

Why It Matters

RLHF and DPO are the reason ChatGPT, Claude, and Gemini feel helpful and safe to use rather than producing random or harmful text. They are the 'training wheels' that turned raw language models into reliable AI assistants.
