Show HN: RewardHackBench: Using sandboxes to stop agents from cheating

9 points by rotemtam 13 days ago · 3 comments · 1 min read

hey all,

happy to share research i've been working on for islo.dev in recent months.

ever since the cheating agents paper (https://debugml.github.io/cheating-agents/) came out, revealing that reward hacking was 4x more prevalent than previously estimated, i've been looking into how we can deal with the issue.

the common approach (taken by the tbench team) is post hoc trajectory analysis.

i've been interested in reframing the problem as an endpoint security problem and tackling it with sandboxing.

i hope you find it interesting, and thanks to the islo.dev team for sponsoring this

happy to answer any Qs

yonSpektor 13 days ago

Curious what the distribution of hacking strategies looked like across different models — would expect RL-heavy vs RLHF models to cheat very differently.

adamgold7 13 days ago

love this. we are actually looking at reward hacking from a cyber security perspective - refreshing (unless you're from Israel).

Any collaborators that want to join us?

matankleyman1 13 days ago

that's one of the biggest long-term issues with agents, and no one has any real interest in talking about it.
