Mechanistic interpretability pipeline for prompts. Annotate regions in any prompt, run attention capture and logit lens on any HuggingFace model, get back heatmaps, cooking curves, animated layer sweeps, and comparative analysis showing what changed and where.
Existing MI tools (TransformerLens, NNsight, pyvene) are libraries: they give you hooks and activation access to build your own analysis. TeaLeaves is a pipeline that takes a structured prompt and produces attention heatmaps, cooking curves, and variant comparisons.
## What it does
- Region annotation: Define named spans in your prompt via JSON config (markers, regex, or char ranges). The pipeline maps these to token positions, handling BPE boundary effects and chat template artifacts automatically.
- Attention capture: Hooks every attention layer to extract head-averaged attention weights at configurable query positions
- Logit lens: Projects the residual stream through the final norm + LM head at each layer to track token rank trajectories
- Visualization: Four renderers: per-token heatmaps, per-region cooking curves, animated layer sweeps, and multi-sample aggregates with confidence bands
- Variant comparison: Automated N-variant comparison with delta tables, multi-seed stability analysis, and markdown reports
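In outline, the logit lens step is a per-layer projection of the residual stream. A minimal numpy sketch, assuming RMSNorm as the final norm (as in Llama/Qwen-style decoders); the engine's actual implementation may differ:

```python
import numpy as np

def logit_lens(hidden_states, norm_weight, lm_head, eps=1e-6):
    """Project each layer's residual stream at one query position through
    the final RMSNorm and the LM head, yielding per-layer logits.

    hidden_states: (num_layers, d_model) residual stream snapshots
    norm_weight:   (d_model,)            final-norm scale
    lm_head:       (d_model, vocab_size) unembedding matrix
    """
    rms = np.sqrt((hidden_states ** 2).mean(axis=-1, keepdims=True) + eps)
    normed = hidden_states / rms * norm_weight
    return normed @ lm_head  # (num_layers, vocab_size)
```

Ranking each layer's logits then gives the per-layer token rank trajectories the pipeline plots.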
## Before and after
The cooking curves below show the same prompt before and after iterative tuning (both per-region normalized to 0–1 for direct comparison).
The "before" curve is from a first-draft prompt. Several regions show artifact peaks at L0 from near-zero values. current_message has a narrow, isolated spike around L42–45 but no sustained dominance. task_entity unexpectedly takes over the output formatting layers (L56+), and the mid-layers are noisy with no clear phase differentiation:
After several rounds of restructuring guided by the pipeline's output (adjusting region boundaries, reordering task sections, adding structural markers), the curves show clean phase progression. directive and conversation_turns peak first (L3), rules regions differentiate through the early-mid layers, current_message sustains dominance through the focus layers (L40–50), and output_format cleanly takes over the final layers. The mid-layer oscillations are structured rather than noisy:
Each change was validated by re-running the pipeline and comparing curves against the baseline.
## Quick start
```bash
# Install (local, rendering and analysis only)
pip install -e .

# Install with GPU dependencies (for running the engine)
pip install -e ".[gpu]"
```
### 1. Define regions

Create a `regions.json` describing named spans in your prompt:
```json
{
  "system_prompt": {
    "regions": [
      {"name": "rules", "start_marker": "## Rules", "end_marker": "## Examples"},
      {"name": "examples", "start_marker": "## Examples", "end_marker": null}
    ]
  },
  "user_message": {
    "regions": [
      {"name": "context", "start_marker": "Previous:", "end_marker": "Current:"},
      {"name": "current", "start_marker": "Current:", "end_marker": null}
    ]
  }
}
```

### 2. Prepare inputs
```bash
python -m tealeaves.prep.inputs \
    --prompt system_prompt.txt \
    --regions regions.json \
    --conversations conversations.json \
    --output test_cases.json
```

### 3. Run analysis on GPU
```bash
# scp the self-contained engine, setup script, and test cases to your GPU box
scp src/engine/run_analysis.py gpu:/workspace/
scp infra/vastai_setup.sh gpu:/workspace/
scp test_cases.json gpu:/workspace/

# Bootstrap the GPU box (MODEL_ID is required)
ssh gpu 'MODEL_ID=meta-llama/Llama-3-8B bash /workspace/vastai_setup.sh'

# Run analysis
ssh gpu 'python /workspace/run_analysis.py \
    --input /workspace/test_cases.json \
    --output /workspace/results/ \
    --model-path /workspace/models/Llama-3-8B \
    --tracked-tokens "<" "keyword"'
```
### 4. Render results
```bash
# Per-token attention heatmap
python -m tealeaves.render.heatmap --result results/case_0.json --mask-chatml

# Per-region attention trajectories across layers
python -m tealeaves.render.cooking_curves --result results/case_0.json --normalize per-region

# Animated layer sweep
python -m tealeaves.render.layer_gif --result results/case_0.json --mask-chatml

# Multi-sample aggregate with confidence bands
python -m tealeaves.render.aggregate --base-dir results/ --variants baseline:Baseline
```
### 5. Compare variants
```bash
python -m tealeaves.analysis.compare \
    --base-dir results/ \
    --variants baseline:Baseline modified:Modified \
    --ratio context:current_message

python -m tealeaves.analysis.report \
    --base-dir results/ \
    --experiments baseline:Baseline:results_baseline modified:Modified:results_modified \
    --output-dir reports/
```

## Input formats
The pipeline takes three input files. All content is model-agnostic: any system prompt, any conversation structure, any region definitions.
### `system_prompt.txt`
Plain text file containing the full system prompt. This is the exact text that will be inserted into the chat template's system role.
### `conversations.json`
Array of conversation objects. Each object represents one test case for MI analysis.
```json
[
  {
    "id": "case_0",
    "user_message": "What's the weather like in Tokyo?",
    "response": "Tokyo is currently experiencing mild temperatures around 18°C."
  },
  {
    "id": "case_1",
    "user_message": "Tell me more about the forecast.",
    "response": "The week ahead shows increasing cloud cover with rain expected Thursday."
  }
]
```

| Field | Required | Description |
|---|---|---|
| `id` | yes | Unique case identifier (used for output filenames) |
| `user_message` | yes | The user turn to analyze |
| `response` | yes | The assistant response (can be an empty string `""` when analyzing pre-response attention) |
| `user_regions` | no | Per-case region defs for the user message (override the global `user_message.regions`) |
| `response_regions` | no | Per-case region defs for the response (override the global `response.regions`) |
Conversations can come from any source: exported from a chat database, hand-written, pulled from logs, or generated synthetically.
### `regions.json`
Defines named text spans to track attention over. Regions are matched against the assembled prompt text at character level, then resolved to token positions.
```json
{
  "system_prompt": {
    "regions": [
      {"name": "rules", "start_marker": "## Rules", "end_marker": "## Examples"},
      {"name": "examples", "start_marker": "## Examples", "end_marker": null}
    ]
  },
  "user_message": {
    "regions": [
      {"name": "context", "start_marker": "Previous:", "end_marker": "Current:"},
      {"name": "current", "start_marker": "Current:", "end_marker": null}
    ]
  },
  "response": {
    "regions": [
      {"name": "answer", "start_pattern": "^\\w", "end_pattern": null}
    ]
  },
  "query_positions": {
    "terminal": "last_token",
    "decision": {"after_text": "Answer:"}
  },
  "tracked_tokens": ["<", "yes", "no"]
}
```

Region detection strategies: each region def needs a `name` and one of these boundary strategies:
| Strategy | Fields | Use when |
|---|---|---|
| Marker | `start_marker`, `end_marker` | Boundaries are literal text strings in the prompt |
| Regex | `start_pattern`, `end_pattern` | Boundaries need pattern matching |
| Character range | `start_char`, `end_char` | You know exact character offsets |
Set `end_marker`, `end_pattern`, or `end_char` to `null` to extend the region to the end of the text. Regions can also be nested by including a `regions` array inside a region def.
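All three strategies reduce to the same character-span resolution. A simplified sketch, ignoring the nesting, overlap handling, and BPE boundary correction the real pipeline performs:

```python
import re

def resolve_region(text, start_marker=None, end_marker=None,
                   start_pattern=None, end_pattern=None,
                   start_char=None, end_char=None):
    """Resolve one region definition to a (start, end) character span.
    A null/None end extends the region to the end of the text."""
    if start_marker is not None:
        start = text.index(start_marker)
    elif start_pattern is not None:
        start = re.search(start_pattern, text, re.MULTILINE).start()
    else:
        start = start_char
    if end_marker is not None:
        # search past the start so a marker can't terminate itself
        end = text.index(end_marker, start + 1)
    elif end_pattern is not None:
        end = start + 1 + re.search(end_pattern, text[start + 1:], re.MULTILINE).start()
    elif end_char is not None:
        end = end_char
    else:
        end = len(text)
    return start, end
```

The resulting character spans are then mapped onto token positions via the tokenizer's offset mapping.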
Query positions define where in the token sequence to probe attention and logit lens:
| Value | Meaning |
|---|---|
| `"last_token"` | Last token of the user message |
| `{"after_text": "..."}` | First non-whitespace token after the specified text in the response |
| `{"at_text": "..."}` | Token at the specified text in the response |
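Resolving a spec to a token index can be done with the tokenizer's per-token character offsets. An illustrative sketch with toy offsets; the engine's actual resolution may differ:

```python
def query_token_index(offsets, text, spec):
    """Map a query-position spec to a token index, given (start, end)
    character offsets per token (as from return_offsets_mapping)."""
    if spec == "last_token":
        return len(offsets) - 1
    if isinstance(spec, dict) and "after_text" in spec:
        anchor_end = text.index(spec["after_text"]) + len(spec["after_text"])
        for i, (s, e) in enumerate(offsets):
            # first token at/after the anchor with visible content
            if s >= anchor_end and text[s:e].strip():
                return i
        raise ValueError("no token after anchor text")
    if isinstance(spec, dict) and "at_text" in spec:
        anchor = text.index(spec["at_text"])
        for i, (s, e) in enumerate(offsets):
            if s <= anchor < e:
                return i
    raise ValueError(f"unsupported spec: {spec!r}")
```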
Tracked tokens are specific tokens whose rank and probability are monitored across all layers in the logit lens output.
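Per-layer rank and probability for a tracked token follow directly from the per-layer logit-lens logits; a minimal sketch:

```python
import math

def track_token(layer_logits, token_id):
    """For each layer's logits, return (rank, probability) of token_id.
    Rank 0 means the token is the layer's top prediction."""
    out = []
    for logits in layer_logits:
        rank = sum(1 for x in logits if x > logits[token_id])
        m = max(logits)  # subtract max for a numerically stable softmax
        exps = [math.exp(x - m) for x in logits]
        out.append((rank, exps[token_id] / sum(exps)))
    return out
```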
## Why this exists
Prompt engineering typically relies on eyeballing model outputs after each change. A tweak that seems to improve one case might silently degrade others.
TeaLeaves captures how the model distributes attention across every region of your prompt at every layer, so you can measure whether a change helped or hurt, and where in the forward pass the effect occurs.
The pipeline was originally built to tune the subcortical prompt for Mira, a persistent digital entity with self-directed memory and context window management. Because subcortical runs on a small model in a single forward pass with no reasoning, every word in the prompt either drives attention or wastes it. The techniques generalize to any prompt and any model.
*Layer-by-layer attention heatmap*
## Model support
The engine auto-discovers model architecture from any HuggingFace decoder-only transformer:
- Reads layer count, head counts, hidden size, and vocab size from `model.config`
- Walks the module tree to find attention submodules, the LM head, and the final norm
- Phase annotations and layer-dependent rendering scale automatically to any layer count
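A simplified sketch of the module-tree walk (the suffix names are illustrative assumptions; the engine's matching is presumably more robust):

```python
def find_attention_modules(named_modules, suffixes=("self_attn", "attn", "attention")):
    """Collect attention modules from a (name, module) walk, e.g.
    model.named_modules() in PyTorch. Hook registration would then
    target each hit."""
    return [(name, mod) for name, mod in named_modules
            if name.split(".")[-1] in suffixes]
```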
### Verified model families
| Family | Example model | Chat template | Layers | Notes |
|---|---|---|---|---|
| Qwen | Qwen3-32B | ChatML | 64 | Full support |
| Llama 3 | Llama-3.1-8B-Instruct | Llama 3 format | 32 | Full support |
| Mistral | Mistral-7B-Instruct-v0.1 | `[INST]` format | 32 | Full support |
| Gemma | Gemma-2-9B-IT | Gemma format | 42 | System role auto-merged into user |
| GPT (OpenAI) | gpt-oss-20b | OpenAI format | 24 | Full support (MoE) |
Models without a system role (Gemma) are handled automatically: the engine merges system content into the first user message. Any HuggingFace model with a chat template and eager attention support should work.
Requirements: `attn_implementation="eager"` (flash attention doesn't materialize the attention matrix).
## GPU requirements
Rule of thumb: `model_params * 2 bytes + 5GB headroom` (fp16 weights + attention capture overhead).
| Model | VRAM needed | Recommended GPU |
|---|---|---|
| 8B params | ~21GB | A100 40GB |
| 32B params | ~69GB | H100 80GB |
| 70B params | ~145GB | Won't fit single GPU; use quantization |
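The rule of thumb above, as a one-liner:

```python
def vram_estimate_gb(params_billions, bytes_per_param=2, headroom_gb=5):
    """Rough VRAM estimate: weight bytes (fp16 by default) plus fixed
    headroom for attention capture. A heuristic, not a guarantee."""
    return params_billions * bytes_per_param + headroom_gb
```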
See Empirical Notes for memory estimation details and OOM prevention.
## Documentation
- SKILL.md: Operational reference for running the full pipeline with all flags
- Pipeline Explained: How region annotation, attention hooks, logit lens, and per-token capture work mechanically
- Empirical Notes: What broke, why, and what works — failure modes and validated patterns


