Human–AI Coevolution

Entry

Strategyproof Reinforcement Learning from Human Feedback

Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Kwiatkowska

Synopsis

Shows that existing RLHF methods are not strategyproof: even a single strategic labeler can cause substantial misalignment. Proposes an algorithm, "Pessimistic Median of MLEs", that achieves approximate strategyproofness.
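
This entry does not describe the algorithm itself, so the following is only a minimal sketch of the intuition suggested by its name, not the authors' method. It assumes each labeler's feedback has already been turned into a separate estimate vector, and it compares a coordinate-wise mean with a coordinate-wise median when one labeler reports strategically. The pessimism step is omitted entirely, and the function name `aggregate` and all numbers are hypothetical.

```python
from statistics import mean, median


def aggregate(estimates, reducer):
    """Apply `reducer` coordinate-wise across per-labeler estimate vectors."""
    return [reducer(coords) for coords in zip(*estimates)]


# Hypothetical per-labeler reward-parameter estimates (illustrative numbers only;
# in practice each would come from fitting that labeler's preference data).
honest = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [1.0, 0.05]]

# One strategic labeler whose estimate is pushed to an extreme (assumption).
strategic = [-50.0, 50.0]

estimates = honest + [strategic]
mean_agg = aggregate(estimates, mean)      # dragged far toward the outlier
median_agg = aggregate(estimates, median)  # stays within the honest range

print("mean:  ", mean_agg)
print("median:", median_agg)
```

In this toy setting the single extreme report moves the mean well outside the range of the honest estimates, while the median remains inside it. That robustness to one manipulated input is the property a median-based aggregation is meant to illustrate here.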

Keywords

strategyproof · RLHF · manipulation · mechanism design

Related entries