Entry
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Demonstrates that five RLHF-trained assistants exhibit sycophancy and traces it, at least in part, to human preference data favoring agreeable responses.
- How RLHF Amplifies Sycophancy · February 1, 2026 · arXiv
- Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models · April 13, 2026 · arXiv
- Ask Don't Tell: Reducing Sycophancy in Large Language Models · February 27, 2026 · arXiv
- A Rational Analysis of the Effects of Sycophantic AI · February 15, 2026 · arXiv
- Belief Offloading in Human-AI Interaction · February 9, 2026 · arXiv
- Strategyproof Reinforcement Learning from Human Feedback · March 12, 2025 · NeurIPS 2025