Human–AI Coevolution

Entry

How RLHF Amplifies Sycophancy

Itai Shapira, Gerdus Benade, Ariel D. Procaccia

Synopsis

Formal analysis showing how preference-based post-training amplifies sycophancy via a covariance condition between belief endorsement and learned reward; proposes a closed-form correction.

Keywords

sycophancy · RLHF · covariance condition · preference training
