Entry

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng

Synopsis

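Empirical demonstration of "U-Sophistry" (unintended sophistry): RLHF on a question-answering task (QuALITY) and a programming task (APPS) makes LMs better at convincing time-constrained human evaluators that their outputs are correct, without improving task accuracy. The evaluators' false positive rate rises by 24.1% on QuALITY and 18.3% on APPS.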

Keywords

RLHF · U-Sophistry · evaluator deception · QuALITY · APPS

Related entries

How RLHF Amplifies Sycophancy
February 1, 2026 · arXiv

Strategyproof Reinforcement Learning from Human Feedback
March 12, 2025 · NeurIPS 2025

A Survey on Human-AI Teaming with Large Pre-Trained Models
March 7, 2024 · arXiv

Towards Understanding Sycophancy in Language Models
October 20, 2023 · ICLR 2024

Training Language Models to Follow Instructions with Human Feedback
March 4, 2022 · NeurIPS 2022

Learning to Summarize from Human Feedback
September 2, 2020 · NeurIPS 2020