Entry
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng
Empirical demonstration of "U-Sophistry": RLHF on QuALITY (question answering) and APPS (programming) makes LMs more convincing to time-constrained human evaluators without improving task accuracy. The evaluators' false positive rate rises by 24.1% on QuALITY and 18.3% on APPS.
- How RLHF Amplifies Sycophancy · February 1, 2026 · arXiv
- Strategyproof Reinforcement Learning from Human Feedback · March 12, 2025 · NeurIPS 2025
- A Survey on Human-AI Teaming with Large Pre-Trained Models · March 7, 2024 · arXiv
- Towards Understanding Sycophancy in Language Models · October 20, 2023 · ICLR 2024
- Training Language Models to Follow Instructions with Human Feedback · March 4, 2022 · NeurIPS 2022
- Learning to summarize from human feedback · September 2, 2020 · NeurIPS 2020