Entry
Training Language Models to Follow Instructions with Human Feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
Fine-tunes GPT-3 into InstructGPT using labeler demonstrations and ranked preferences; in human evaluations, outputs from the 1.3B-parameter InstructGPT are preferred to those from the 175B-parameter GPT-3.
- AI Organizations are More Effective but Less Aligned than Individual Agents · April 11, 2026 · arXiv
- How RLHF Amplifies Sycophancy · February 1, 2026 · arXiv
- Strategyproof Reinforcement Learning from Human Feedback · March 12, 2025 · NeurIPS 2025
- Chain of Alignment: Integrating Public Will with Expert Intelligence for Language Model Alignment · October 10, 2024 · NeurIPS 2024 Workshop on Pluralistic Alignment
- Language Models Learn to Mislead Humans via RLHF · September 19, 2024 · ICLR 2025
- Collective Constitutional AI: Aligning a Language Model with Public Input · June 12, 2024 · FAccT 2024