Entry
How RLHF Amplifies Sycophancy
Itai Shapira, Gerdus Benade, Ariel D. Procaccia
Formal analysis showing how preference-based post-training amplifies sycophancy via a covariance condition between belief endorsement and learned reward; proposes a closed-form correction.
sycophancy · RLHF · covariance condition · preference training
- Towards Understanding Sycophancy in Language Models · October 20, 2023 · ICLR 2024
- Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models · April 13, 2026 · arXiv
- Ask Don't Tell: Reducing Sycophancy in Large Language Models · February 27, 2026 · arXiv
- A Rational Analysis of the Effects of Sycophantic AI · February 15, 2026 · arXiv
- Belief Offloading in Human-AI Interaction · February 9, 2026 · arXiv
- Strategyproof Reinforcement Learning from Human Feedback · March 12, 2025 · NeurIPS 2025