Human–AI Coevolution

Entry

Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Synopsis

Demonstrates that five RLHF-trained AI assistants consistently exhibit sycophancy and traces the behavior, at least in part, to human preference data that favors responses matching the user's views.

Keywords

sycophancy · RLHF · preference data bias
