From Golden Gate Bridge to JSON: Why Anthropic's SAE Failed on JSON Output

2 points by maziyar 4 months ago · 1 comment

maziyar (OP) 4 months ago

After six experiments and dozens of failed attempts, I learned something I did not expect. Activation steering, the technique Anthropic uses for AI safety, completely fails at one of the most common tasks in production LLM deployments: generating valid JSON.

And I don't mean "fails to help." My steering-only approach produced valid JSON only 24.4% of the time, compared with 86.8% for the base model with no additional training. Steering made the model worse than doing nothing at all.
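For anyone who wants to reproduce a number like this, here is a minimal sketch of how a "valid JSON" rate could be measured. It assumes "valid" means the whole output parses with Python's `json.loads`; the exact criterion behind the figures above may be stricter (for example, schema checks). The function names and the sample strings are hypothetical, made up purely for illustration.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if the entire string parses as JSON (assumed criterion)."""
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def valid_json_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as JSON; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(is_valid_json(o) for o in outputs) / len(outputs)


# Hypothetical model outputs, not real experiment data.
samples = [
    '{"name": "Ada", "age": 36}',        # parses
    '{"name": "Ada", "age": }',          # missing value, does not parse
    'Sure! Here is the JSON: {"a": 1}',  # leading prose, does not parse
    '[1, 2, 3]',                         # parses
]

print(f"valid JSON rate: {valid_json_rate(samples):.1%}")
```

Under this metric, going from 86.8% to 24.4% is a drop of 62.4 percentage points. Note that this check rejects outputs that wrap correct JSON in extra prose, which is usually what you want if a downstream parser consumes the raw output.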

Here's what I learned, why it matters, and what actually works when you need guaranteed structured outputs from decoder-only language models.
