Entry
Learning to summarize from human feedback
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
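Trains summarization policies with reinforcement learning against a reward model fit to human comparisons of summaries. Human evaluators prefer the resulting summaries over the human-written reference summaries and over the outputs of supervised models 10x larger.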
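Fitting a reward model to human comparisons can be sketched with a pairwise logistic loss: the model is penalized whenever it scores the summary the labeler rejected above the one the labeler preferred. The snippet below is a minimal illustration that operates on plain scalar scores; the function name `pairwise_loss` and its arguments are hypothetical and do not come from the paper's code.

```python
import math


def pairwise_loss(r_preferred: float, r_rejected: float) -> float:
    """Return -log(sigmoid(r_preferred - r_rejected)).

    r_preferred and r_rejected are assumed to be scalar reward-model
    scores for the summary the labeler chose and the one they rejected.
    """
    diff = r_preferred - r_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch on the sign of
    # diff so that exp() never receives a large positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss is small when the preferred summary scores higher,
# and large when the ranking is inverted.
agree = pairwise_loss(2.0, 0.0)
disagree = pairwise_loss(0.0, 2.0)
```

In a full pipeline this loss would be averaged over a batch of labeled comparisons, and the fitted reward model would then supply the reward signal for the reinforcement-learning step.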
- Deep Reinforcement Learning from Human Preferences · June 12, 2017 · NeurIPS 2017
- How RLHF Amplifies Sycophancy · February 1, 2026 · arXiv
- Strategyproof Reinforcement Learning from Human Feedback · March 12, 2025 · NeurIPS 2025
- Language Models Learn to Mislead Humans via RLHF · September 19, 2024 · ICLR 2025
- Towards Understanding Sycophancy in Language Models · October 20, 2023 · ICLR 2024
- Training Language Models to Follow Instructions with Human Feedback · March 4, 2022 · NeurIPS 2022