Entry

How RLHF Amplifies Sycophancy

Itai Shapira, Gerdus Benade, Ariel D. Procaccia

Synopsis

A formal analysis of how preference-based post-training amplifies sycophancy, characterized by a covariance condition between belief endorsement and the learned reward. The paper also proposes a closed-form correction.

Keywords

sycophancy · RLHF · covariance condition · preference training


Related entries

Towards Understanding Sycophancy in Language Models

October 20, 2023 · ICLR 2024

Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models

April 13, 2026 · arXiv

Ask Don't Tell: Reducing Sycophancy in Large Language Models

February 27, 2026 · arXiv

A Rational Analysis of the Effects of Sycophantic AI

February 15, 2026 · arXiv

Belief Offloading in Human-AI Interaction

February 9, 2026 · arXiv

Strategyproof Reinforcement Learning from Human Feedback

March 12, 2025 · NeurIPS 2025