Question 8 of 10
Explain RLHF and how it aligns LLMs with human preferences. How does DPO differ from RLHF, and what are the trade-offs between these alignment approaches?
Sample answer preview
Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with human preferences by training them on feedback from people rather than optimizing predictive accuracy alone. RLHF has been critical to making models like ChatGPT helpful, harmless, and honest.
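To make the contrast with DPO concrete, here is a minimal sketch of the DPO loss, assuming PyTorch and hypothetical tensors of summed per-response log-probabilities (names like `policy_chosen_logps` are illustrative, not from any particular library). DPO folds RLHF's explicit reward model and PPO loop into a single classification-style loss on preference pairs, with beta playing the role of RLHF's KL penalty against a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective on a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of a response
    under the trainable policy or the frozen reference model; beta
    controls how strongly the policy stays near the reference (an
    implicit KL constraint, analogous to RLHF's KL penalty).
    """
    # Log-ratio of policy to reference for preferred and rejected responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Increase the margin between the two log-ratios, scaled by beta,
    # through a logistic (Bradley-Terry) link.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()


# Toy usage with made-up log-probabilities for two preference pairs
if __name__ == "__main__":
    policy_chosen = torch.tensor([-12.3, -8.7])
    policy_rejected = torch.tensor([-13.1, -9.9])
    ref_chosen = torch.tensor([-12.5, -9.0])
    ref_rejected = torch.tensor([-12.9, -9.5])
    print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The trade-off in an interview answer: RLHF's reward model plus PPO is more flexible but costly and unstable to train, while a DPO-style loss like the one above is simpler and more stable but limited to the offline preference data it is given.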
Tags: RLHF, reward model, PPO, DPO, preference learning, KL divergence