Author: Giang Nguyen, Research Scientist
Published: February 25, 2026
We show that Steerling-8B enables concept algebra: you can add, remove, and compose human-understandable concepts at
inference time to directly control what the model generates, without retraining or prompt engineering.
Concept Algebra with Steerling-8B
What if you could directly edit the internal representations of a model towards any concept you care about, without changing the prompt? Steerling-8B’s architecture natively supports injecting and suppressing any concept the model has learned, directly at inference time.
In multi-turn dialog settings, steering one concept at a time is insufficient. You need compositional control, not just on a neutral prompt, but on a conversation already shaped by prior context. Consider a content moderation system that must suppress toxicity yet preserve fluency, or a health assistant that needs to provide medical guidance while navigating the legal ramifications of its advice.
The demonstration below shows how Steerling-8B enables exactly this capability with concept algebra.
Current LLMs are not built to be reliably steered
Current methods for controlling language model behavior are blunt instruments.
Prompting is accessible but often unreliable. System prompts can be overridden through adversarial inputs. Few-shot examples consume context and don’t reliably generalize. More critically, prompting doesn’t reveal which internal mechanisms drove the result, so if your goal changes, nothing from one session transfers to the next.
Fine-tuning methods offer more control but at high cost. Fine-tuning modifies weights globally: suppressing one behavior can silently degrade others. Standard reinforcement learning based post-training reshapes the entire output distribution to satisfy a scalar reward signal. Even modest behavioral changes can require thousands of labeled examples, and both approaches demand full retraining for every new steering objective.
Post-hoc interpretability methods steer through fragile artifacts. SAEs, linear probes, and activation patching attempt to discover controllable concepts in a model that may never have represented them cleanly to begin with. Probes can detect information in representations without confirming the model uses that information for generation. Activation patching offers no compositionality guarantees: patching directions A and B simultaneously may not produce the sum of their effects.
At Guide Labs, we believe that if you want reliable, composable, fine-grained control, the model has to be designed for it.
From Explanation to Control
In our previous post, we introduced the concept module: an architectural bottleneck that forces every prediction through human-interpretable concepts.
The concept module gives us something that black-box models lack: a clean, algebraic handle on the internal variables that drive generation. Every output logit is a linear function of concept activations and concept embeddings. This means we can not only explain what the model is doing, but control it natively by modifying concept activations at inference time.
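To make the linearity claim concrete, here is a minimal NumPy sketch of such a linear readout. All names, shapes, and matrices below are illustrative, not Steerling-8B's actual parameterization: `a` stands in for concept activations, `E` for concept embeddings, and `W_U` for an unembedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, d_model, vocab = 512, 64, 1000

a = rng.random(n_concepts)                       # concept activations (one position)
E = rng.standard_normal((n_concepts, d_model))   # one embedding per concept
W_U = rng.standard_normal((d_model, vocab))      # hidden state -> logits

# Every logit is linear in the concept activations.
logits = a @ E @ W_U

# Linearity makes interventions predictable: bumping one concept's
# activation shifts the logits by exactly that concept's contribution.
a2 = a.copy()
a2[7] += 1.0
delta = (a2 @ E @ W_U) - logits
assert np.allclose(delta, E[7] @ W_U)
```

Because the readout is linear, the effect of any activation edit can be computed in closed form before generation, which is what makes the interventions in the rest of this post predictable.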
To make this control reliable for diffusion decoding, we use mask-aligned injection: injecting concept embeddings only into currently masked (undecided) positions, matching the training distribution and naturally annealing as positions become unmasked to preserve text quality.
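A toy sketch of mask-aligned injection under these assumptions (function name, shapes, and the unmasking order are all hypothetical, and the denoiser call is elided):

```python
import numpy as np

def mask_aligned_inject(hidden, mask, concept_emb, alpha=1.0):
    """Add a concept embedding only at still-masked (undecided) positions.

    hidden: (seq, d) hidden states; mask: (seq,) bool, True = still masked;
    concept_emb: (d,) embedding of the concept to inject.
    """
    out = hidden.copy()
    out[mask] += alpha * concept_emb
    return out

# Toy decoding loop: as positions are unmasked step by step, fewer
# positions receive the injection, so steering anneals naturally.
rng = np.random.default_rng(0)
seq, d = 6, 4
hidden = rng.standard_normal((seq, d))
concept = rng.standard_normal(d)
mask = np.ones(seq, dtype=bool)

for step in range(seq):
    steered = mask_aligned_inject(hidden, mask, concept)
    # ... run the denoiser on `steered`, finalize one position ...
    mask[step] = False  # placeholder unmasking order

assert not mask.any()  # everything decoded; injection has fully annealed
```

The key property is that decoded positions are never perturbed, so already-committed text keeps its fluency while the remaining positions are still steerable.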
This post demonstrates that control in practice. We show three capabilities:
- Concept injection: steering a generic prompt toward any target domain
- Concept suppression: unlearning a concept the model would otherwise express
- Multi-concept steering: performing concept algebra on multiple concepts simultaneously
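Under a linear readout, these three operations compose predictably. A toy sketch of the compositionality guarantee, with purely illustrative concept names and random matrices standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab = 8, 50
W_U = rng.standard_normal((d, vocab))            # hidden state -> logits
legal, medical, toxic = rng.standard_normal((3, d))  # illustrative concept embeddings

def logit_effect(v):
    # Change in the logits caused by adding vector v to the hidden state.
    return v @ W_U

# Concept algebra: inject "legal" and "medical", suppress "toxic",
# all as a single composed steering vector.
steer = legal + medical - toxic

# Linearity means the composed effect is exactly the sum of the
# individual effects -- no cross-terms, no interference.
assert np.allclose(
    logit_effect(steer),
    logit_effect(legal) + logit_effect(medical) - logit_effect(toxic),
)
```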
All examples are generated by Steerling-8B, our 8B-parameter inherently interpretable diffusion language model. Note that Steerling-8B is a base model, not an instruction-tuned model.
Concept Injection: One Prompt, Five Destinations
The most common demonstration of steering is taking a single, domain-neutral prompt and showing how different concept injections redirect the output into entirely different domains, with no changes to the prompt itself.
This prompt contains no domain keywords. It could continue about anything. Below, we show the unsteered baseline followed by the same prompt steered toward five different concepts.
Input Prompt
Things to know before you start:
1.
Injected concept: none (unsteered baseline)
Generated Output
You will need to create a new user account on your Mac. 2. You may need to update your computer's software in order for the game to work properly. 3. Make sure that you have enough storage space available on your hard drive before installing the game. 4. The installation process can take some time, so be patient! 5. Once the game is installed, you should be able to play it right away without any issues How do I install pc games on my mac? There are a few different ways to install PC games on your mac. One way is to use an app like Steam or Origin. Another
Analysis
Without steering, the model continues into a generic software installation guide.
Concept Suppression: Unlearning at Inference Time
Steering is not just about adding concepts; it can also remove them. The concept module enables a distinct mechanism for this: bottleneck intervention, which goes directly to the concept-activation layer and zeroes out a specific concept’s contribution before it can influence generation.
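A minimal sketch of bottleneck intervention under the same illustrative linear setup as before (names, shapes, and the concept index are hypothetical, not Steerling-8B's actual API):

```python
import numpy as np

def suppress_concept(activations, concept_id):
    """Zero one concept's activation at the bottleneck so it contributes
    nothing to the downstream logits."""
    out = activations.copy()
    out[..., concept_id] = 0.0
    return out

rng = np.random.default_rng(0)
n_concepts, d, vocab = 512, 16, 100
a = rng.random(n_concepts)                 # concept activations
E = rng.standard_normal((n_concepts, d))   # concept embeddings
W_U = rng.standard_normal((d, vocab))      # unembedding

cid = 42                                   # illustrative concept to suppress
a_sup = suppress_concept(a, cid)

# Because the logits are linear in the activations, suppression removes
# exactly that concept's contribution and nothing else.
diff = (a_sup @ E @ W_U) - (a @ E @ W_U)
assert np.allclose(diff, -a[cid] * (E[cid] @ W_U))
```

This is why the intervention is surgical: the logit change is, in closed form, minus the suppressed concept's contribution, leaving every other concept untouched.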
The goal here is not to make the model respond to this prompt; it already can. The goal is to make it stop mentioning this specific concept entirely.
Input Prompt
My landlord is refusing to return my security deposit after I moved out. The lease clearly states that the deposit should be returned within 30 days. What are my options? I think
Target Concept 12348 – Tenant-landlord Legal Relations
Generated Output
Concept Activation: Normal (no suppression)
he's trying to keep it. A: If your landlord does not give you back the money in a timely manner, then you can take legal action against him. You will need to contact an attorney who specializes in landlord-tenant law and have them review all of the documents related to this situation before taking any further steps towards getting your money back from him! If there were no written terms regarding how long someone could stay at their rental property (i.e., they didn't sign anything), then it may also be possible for both parties involved with each other through mediation or arbitration instead of going into court where judges decide whether or not
Analysis
Without any intervention, the model responds naturally and relevantly: legal action, attorney referral, mediation. The Tenant-landlord legal concept is clearly active.
Quantitative Evaluation
To move beyond a few examples, we evaluate steering systematically across 100 concepts and 20 prompts per concept: 2,000 samples in total. A Mistral-24B LLM judge scores each generation on two dimensions:
- Concept score (0–2): does the output express the target concept?
- Quality score (0–2): is the text coherent, fluent, and easy to read?
We report the mean score on each axis along with their harmonic mean; the harmonic mean penalizes methods that score well on one axis but poorly on the other.
| Method | Concept Score | Quality Score | Harmonic Mean |
|---|---|---|---|
| Unsteered | 0.015 | 1.633 | 0.029 |
| Steering | 0.783 | 1.371 | 0.997 |
Starting from near-zero concept adherence (0.015), steering raises concept score to 0.783 while retaining 84% of baseline generation quality. The harmonic mean of 0.997 confirms that steering does not seriously trade one for the other: both concept adherence and text quality remain high simultaneously.
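The table's harmonic-mean column can be reproduced directly from the per-axis scores; a quick check (the unsteered value differs in the last digit, presumably because the published per-axis scores are themselves rounded):

```python
def harmonic_mean(a, b):
    # Standard two-value harmonic mean: 2ab / (a + b).
    return 2 * a * b / (a + b)

h_steered = harmonic_mean(0.783, 1.371)    # ~0.997, matching the table
h_unsteered = harmonic_mean(0.015, 1.633)  # ~0.0297, vs. 0.029 reported

assert round(h_steered, 3) == 0.997
assert abs(h_unsteered - 0.029) < 1e-3
```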
Conclusion
The steering capabilities demonstrated here are a direct consequence of the concept module’s linear architecture. Because every output logit is an explicit function of concept activations and concept embeddings, we can intervene on these variables with predictable effects. This is fundamentally different from prompt engineering, RLHF, or post-hoc methods.
To explore Steerling-8B yourself: