Show HN: SycoFact 4B: Open model detecting sycophancy and delusion confirmation

2 points by iwalton3 3 months ago · 0 comments · 1 min read

Reader

I published a model you can use now to help detect sycophantic AI responses before they harm users. It rejects 100% of the sycophantic delusion affirming responses from psychosis-bench. It also does well on the AISI Harmful Advice, PKU-SafeRLHF, and safety subsets of RewardBench.

It's small enough it can run on a gaming GPU locally. It's got a GGUF checkpoint on hugging face and is available on ollama. You can pull it and run scenarios against it in minutes: https://ollama.com/izzie/sycofact

The synthetic training data is also public, you can train other models over the data or reproduce my results. The labels were all generated by Gemma 3 27B with activation steering based on generated contrastive data. A write-up is planned at a later date, feel free to get in touch if curious.

No comments yet.

Settings

Show HN: SycoFact 4B: Open model detecting sycophancy and delusion confirmation

Keyboard Shortcuts