Entry
Strategyproof Reinforcement Learning from Human Feedback
Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Kwiatkowska
Shows that existing RLHF methods are not strategyproof: even a single strategic labeler can cause major misalignment. Proposes "Pessimistic Median of MLEs", which achieves approximate strategyproofness.
- How RLHF Amplifies Sycophancy · February 1, 2026 · arXiv
- Language Models Learn to Mislead Humans via RLHF · September 19, 2024 · ICLR 2025
- Towards Understanding Sycophancy in Language Models · October 20, 2023 · ICLR 2024
- Training Language Models to Follow Instructions with Human Feedback · March 4, 2022 · NeurIPS 2022
- Learning to summarize from human feedback · September 2, 2020 · NeurIPS 2020
- Deep Reinforcement Learning from Human Preferences · June 12, 2017 · NeurIPS 2017