Entry

Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Synopsis

Demonstrates that five RLHF-trained AI assistants consistently exhibit sycophancy and traces it, at least in part, to human preference data that favors agreeable responses.

Keywords

sycophancy · RLHF · preference data bias


Related entries

How RLHF Amplifies Sycophancy
February 1, 2026 · arXiv

Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models
April 13, 2026 · arXiv

Ask Don't Tell: Reducing Sycophancy in Large Language Models
February 27, 2026 · arXiv

A Rational Analysis of the Effects of Sycophantic AI
February 15, 2026 · arXiv

Belief Offloading in Human-AI Interaction
February 9, 2026 · arXiv

Strategyproof Reinforcement Learning from Human Feedback
March 12, 2025 · NeurIPS 2025