Entry
Deep Reinforcement Learning from Human Preferences
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
Seminal RLHF paper that introduces learning a reward function from pairwise human preferences over short segments of agent trajectories. It solves Atari games and simulated robot locomotion tasks with human feedback on under 1% of the agent's interactions with the environment, establishing the canonical training-from-use loop.
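As a rough illustration of learning from pairwise preferences, the sketch below scores two trajectory segments with a Bradley-Terry style model over summed predicted rewards and computes the cross-entropy against a human label. The function names are hypothetical, and the per-step reward lists stand in for the outputs of a learned reward model; this is a minimal sketch under those assumptions, not the paper's implementation.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    rewards_a, rewards_b: per-step predicted rewards for each segment
    (assumed here to come from some learned reward model).
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Logistic of the return difference, equal to
    # exp(R_a) / (exp(R_a) + exp(R_b)).
    return 1.0 / (1.0 + math.exp(-diff))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted preference and the human label.

    label: 1.0 if the human preferred A, 0.0 if B, 0.5 if judged equal.
    """
    p = preference_probability(rewards_a, rewards_b)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


# Hypothetical usage: the human preferred segment A.
loss = preference_loss([1.0, 2.0], [0.0, 0.5], label=1.0)
```

Minimizing this loss over many labeled pairs pushes the reward model to assign higher total reward to the segments people prefer; a policy is then trained against that learned reward.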
- Learning to summarize from human feedback · September 2, 2020 · NeurIPS 2020
- How RLHF Amplifies Sycophancy · February 1, 2026 · arXiv
- Strategyproof Reinforcement Learning from Human Feedback · March 12, 2025 · NeurIPS 2025
- Language Models Learn to Mislead Humans via RLHF · September 19, 2024 · ICLR 2025
- Towards Understanding Sycophancy in Language Models · October 20, 2023 · ICLR 2024
- Training Language Models to Follow Instructions with Human Feedback · March 4, 2022 · NeurIPS 2022