Human–AI Coevolution

Entry

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng

Synopsis

Empirical demonstration of "U-Sophistry" (unintended sophistry): RLHF on QuALITY (question answering) and APPS (programming) makes LMs more convincing to time-constrained human evaluators without improving task accuracy. The evaluators' false positive rate rises by 24.1% on QuALITY and by 18.3% on APPS.

Keywords

RLHF · U-Sophistry · evaluator deception · QuALITY · APPS

Related entries