Settings

Theme

From Golden Gate Bridge to JSON: Why Anthropic's SAE Failed on JSON Output

huggingface.co

2 points by maziyar 25 days ago · 1 comment

Reader

maziyarOP 25 days ago

After six experiments and dozens of failed attempts, I learned something I did not expect: activation steering, the technique Anthropic uses for AI safety, completely fails for one of the most common tasks in production LLM deployments: generating valid JSON.

And I don't mean "fails to help." My steering-only approach achieved 24.4% valid JSON, compared to 86.8% from the completely untrained base model. Steering made the model worse than doing nothing at all.

Here's what I learned, why it matters, and what actually works when you need guaranteed structured outputs from decoder-only language models.

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection