Entry

Training Language Models to Follow Instructions with Human Feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

Synopsis

Fine-tunes GPT-3 into InstructGPT using labeler-written demonstrations and human rankings of model outputs (RLHF); in human evaluations, outputs from the 1.3B-parameter InstructGPT are preferred to those of the 175B-parameter GPT-3.

Keywords

RLHF · instruction tuning · alignment · InstructGPT

Related entries

AI Organizations are More Effective but Less Aligned than Individual Agents
April 11, 2026 · arXiv

How RLHF Amplifies Sycophancy
February 1, 2026 · arXiv

Strategyproof Reinforcement Learning from Human Feedback
March 12, 2025 · NeurIPS 2025

Chain of Alignment: Integrating Public Will with Expert Intelligence for Language Model Alignment
October 10, 2024 · NeurIPS 2024 Workshop on Pluralistic Alignment

Language Models Learn to Mislead Humans via RLHF
September 19, 2024 · ICLR 2025

Collective Constitutional AI: Aligning a Language Model with Public Input
June 12, 2024 · FAccT 2024