taylorsatula/TeaLeaves: End-to-end pipeline for seeing how LLMs actually process your prompts. Capture attention across every layer, render heatmaps and cooking curves, compare variants with evidence — not vibes.


Mechanistic interpretability pipeline for prompts. Annotate regions in any prompt, run attention capture and logit lens on any HuggingFace model, get back heatmaps, cooking curves, animated layer sweeps, and comparative analysis showing what changed and where.

Existing MI tools (TransformerLens, NNsight, pyvene) are libraries: they give you hooks and activation access to build your own analysis. TeaLeaves is a pipeline that takes a structured prompt and produces attention heatmaps, cooking curves, and variant comparisons.

What it does

  • Region annotation: Define named spans in your prompt via JSON config (markers, regex, or char ranges). The pipeline maps these to token positions, handling BPE boundary effects and chat template artifacts automatically.
  • Attention capture: Hooks every attention layer to extract head-averaged attention weights at configurable query positions.
  • Logit lens: Projects the residual stream through the final norm + LM head at each layer to track token rank trajectories.
  • Visualization: Four renderers (per-token heatmaps, per-region cooking curves, animated layer sweeps, and multi-sample aggregates with confidence bands).
  • Variant comparison: Automated N-variant comparison with delta tables, multi-seed stability analysis, and markdown reports.
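
The region-to-token mapping step can be sketched independently of any model. The helper below is illustrative, not TeaLeaves' actual code: given a character span and a tokenizer offset mapping (HuggingFace fast tokenizers return one via return_offsets_mapping=True), it collects every token that overlaps the span, which is how BPE tokens straddling a region boundary stay attached to the region:

```python
def span_to_token_positions(char_start, char_end, offsets):
    """Map a character span to the token indices that overlap it.

    `offsets` is a list of (start, end) character offsets per token,
    e.g. from tokenizer(text, return_offsets_mapping=True).
    Tokens straddling a span boundary (a BPE artifact) are included,
    so a region never silently loses its first or last word-piece.
    """
    positions = []
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:  # special tokens have empty offsets
            continue
        if tok_start < char_end and tok_end > char_start:  # any overlap
            positions.append(i)
    return positions

# Region covers chars 0-8; token 1 straddles the boundary and is kept
offsets = [(0, 2), (2, 9), (9, 14)]
print(span_to_token_positions(0, 8, offsets))  # → [0, 1]
```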

Before and after

The cooking curves below show the same prompt before and after iterative tuning (both per-region normalized to 0–1 for direct comparison).

The "before" curve is from a first-draft prompt. Several regions show artifact peaks at L0 from near-zero values. current_message has a narrow, isolated spike around L42–45 but no sustained dominance. task_entity unexpectedly takes over the output formatting layers (L56+), and the mid-layers are noisy with no clear phase differentiation:

Before tuning, attention is unfocused

After several rounds of restructuring guided by the pipeline's output (adjusting region boundaries, reordering task sections, adding structural markers), the curves show clean phase progression. directive and conversation_turns peak first (L3), rules regions differentiate through the early-mid layers, current_message sustains dominance through the focus layers (L40–50), and output_format cleanly takes over the final layers. The mid-layer oscillations are structured rather than noisy:

After tuning, attention is focused and phase-separated

Each change was validated by re-running the pipeline and comparing curves against the baseline.

Quick start

# Install (local, rendering and analysis only)
pip install -e .

# Install with GPU dependencies (for running the engine)
pip install -e ".[gpu]"

1. Define regions

Create a regions.json describing named spans in your prompt:

{
  "system_prompt": {
    "regions": [
      {"name": "rules", "start_marker": "## Rules", "end_marker": "## Examples"},
      {"name": "examples", "start_marker": "## Examples", "end_marker": null}
    ]
  },
  "user_message": {
    "regions": [
      {"name": "context", "start_marker": "Previous:", "end_marker": "Current:"},
      {"name": "current", "start_marker": "Current:", "end_marker": null}
    ]
  }
}
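
Marker resolution itself is plain substring search over the assembled prompt. A minimal illustrative sketch (not the pipeline's actual implementation) of how a start_marker/end_marker pair, with null extending to end of text, could resolve to a character range:

```python
def resolve_marker_region(text, start_marker, end_marker=None):
    """Resolve a marker-delimited region to a (start, end) char range.

    The region starts at the first occurrence of start_marker and runs
    up to (not including) end_marker; a None end_marker (null in JSON)
    extends the region to the end of the text.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start_marker {start_marker!r} not found")
    if end_marker is None:
        return start, len(text)
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        raise ValueError(f"end_marker {end_marker!r} not found")
    return start, end

prompt = "## Rules\nBe terse.\n## Examples\nQ: hi\nA: hello"
start, end = resolve_marker_region(prompt, "## Rules", "## Examples")
print(prompt[start:end])  # the rules section, up to "## Examples"
```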

2. Prepare inputs

python -m tealeaves.prep.inputs \
    --prompt system_prompt.txt \
    --regions regions.json \
    --conversations conversations.json \
    --output test_cases.json

3. Run analysis on GPU

# scp the self-contained engine, setup script, and test cases to your GPU box
scp src/engine/run_analysis.py gpu:/workspace/
scp infra/vastai_setup.sh gpu:/workspace/
scp test_cases.json gpu:/workspace/

# Bootstrap the GPU box (MODEL_ID is required)
ssh gpu 'MODEL_ID=meta-llama/Llama-3-8B bash /workspace/vastai_setup.sh'

# Run analysis
ssh gpu 'python /workspace/run_analysis.py \
    --input /workspace/test_cases.json \
    --output /workspace/results/ \
    --model-path /workspace/models/Llama-3-8B \
    --tracked-tokens "<" "keyword"'

4. Render results

# Per-token attention heatmap
python -m tealeaves.render.heatmap --result results/case_0.json --mask-chatml

# Per-region attention trajectories across layers
python -m tealeaves.render.cooking_curves --result results/case_0.json --normalize per-region

# Animated layer sweep
python -m tealeaves.render.layer_gif --result results/case_0.json --mask-chatml

# Multi-sample aggregate with confidence bands
python -m tealeaves.render.aggregate --base-dir results/ --variants baseline:Baseline

5. Compare variants

python -m tealeaves.analysis.compare \
    --base-dir results/ \
    --variants baseline:Baseline modified:Modified \
    --ratio context:current_message

python -m tealeaves.analysis.report \
    --base-dir results/ \
    --experiments baseline:Baseline:results_baseline modified:Modified:results_modified \
    --output-dir reports/

Input formats

The pipeline takes three input files. All content is model-agnostic: any system prompt, any conversation structure, any region definitions.

system_prompt.txt

Plain text file containing the full system prompt. This is the exact text that will be inserted into the chat template's system role.

conversations.json

Array of conversation objects. Each object represents one test case for MI analysis.

[
  {
    "id": "case_0",
    "user_message": "What's the weather like in Tokyo?",
    "response": "Tokyo is currently experiencing mild temperatures around 18°C."
  },
  {
    "id": "case_1",
    "user_message": "Tell me more about the forecast.",
    "response": "The week ahead shows increasing cloud cover with rain expected Thursday."
  }
]
| Field | Required | Description |
|---|---|---|
| id | yes | Unique case identifier (used for output filenames) |
| user_message | yes | The user turn to analyze |
| response | yes | The assistant response (can be empty string "" if analyzing pre-response attention) |
| user_regions | no | Per-case region defs for user message (overrides global user_message.regions) |
| response_regions | no | Per-case region defs for response (overrides global response.regions) |

Conversations can come from any source: exported from a chat database, hand-written, pulled from logs, or generated synthetically.

regions.json

Defines named text spans to track attention over. Regions are matched against the assembled prompt text at character level, then resolved to token positions.

{
  "system_prompt": {
    "regions": [
      {"name": "rules", "start_marker": "## Rules", "end_marker": "## Examples"},
      {"name": "examples", "start_marker": "## Examples", "end_marker": null}
    ]
  },
  "user_message": {
    "regions": [
      {"name": "context", "start_marker": "Previous:", "end_marker": "Current:"},
      {"name": "current", "start_marker": "Current:", "end_marker": null}
    ]
  },
  "response": {
    "regions": [
      {"name": "answer", "start_pattern": "^\\w", "end_pattern": null}
    ]
  },
  "query_positions": {
    "terminal": "last_token",
    "decision": {"after_text": "Answer:"}
  },
  "tracked_tokens": ["<", "yes", "no"]
}

Region detection strategies: each region def needs a name and one of these boundary strategies:

| Strategy | Fields | Use when |
|---|---|---|
| Marker | start_marker, end_marker | Boundaries are literal text strings in the prompt |
| Regex | start_pattern, end_pattern | Boundaries need pattern matching |
| Character range | start_char, end_char | You know exact character offsets |

Set end_marker, end_pattern, or end_char to null to extend to end of text. Regions can also be nested by including a regions array inside a region def.

Query positions define where in the token sequence to probe attention and logit lens:

| Value | Meaning |
|---|---|
| "last_token" | Last token of the user message |
| {"after_text": "..."} | First non-whitespace token after the specified text in the response |
| {"at_text": "..."} | Token at the specified text in the response |
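
Resolving an {"after_text": ...} query position is again character search plus the token offset mapping. An illustrative sketch under those assumptions (the helper name is hypothetical, not part of TeaLeaves):

```python
def query_position_after_text(text, anchor, offsets):
    """Resolve an {"after_text": ...} query position (illustrative).

    Returns the index of the first token whose characters start at or
    after the end of `anchor` in `text`, skipping pure-whitespace tokens.
    """
    anchor_start = text.find(anchor)
    if anchor_start == -1:
        raise ValueError(f"anchor {anchor!r} not found")
    anchor_end = anchor_start + len(anchor)
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start >= anchor_end and text[tok_start:tok_end].strip():
            return i
    raise ValueError("no non-whitespace token after anchor")

text = "Answer: yes"
offsets = [(0, 6), (6, 7), (7, 11)]  # "Answer", ":", " yes"
print(query_position_after_text(text, "Answer:", offsets))  # → 2
```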

Tracked tokens are specific tokens to monitor rank and probability for across all layers in the logit lens output.
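
The rank bookkeeping at the core of logit lens is model-independent once you have the per-layer logits from projecting the residual stream through the final norm and LM head. A toy sketch of just the rank computation (the real pipeline batches this across layers and query positions):

```python
def token_rank(logits, token_id):
    """Rank of token_id in a logits vector (rank 1 = most likely).

    Equivalent to counting how many tokens score strictly higher.
    Tracked across layers, a falling rank shows the token "cooking".
    """
    target = logits[token_id]
    return 1 + sum(1 for score in logits if score > target)

# Toy 5-token vocab: token 3 has the 2nd-highest logit
logits = [0.1, 2.5, -1.0, 1.7, 0.3]
print(token_rank(logits, 3))  # → 2
```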

Why this exists

Prompt engineering typically relies on eyeballing model outputs after each change. A tweak that seems to improve one case might silently degrade others.

TeaLeaves captures how the model distributes attention across every region of your prompt at every layer, so you can measure whether a change helped or hurt, and where in the forward pass the effect occurs.

The pipeline was originally built to tune the subcortical prompt for Mira, a persistent digital entity with self-directed memory and context window management. Because subcortical runs on a small model in a single forward pass with no reasoning, every word in the prompt either drives attention or wastes it. The techniques generalize to any prompt and any model.

Layer-by-layer attention heatmap

Per-token attention heatmap animated across layers

Model support

The engine auto-discovers model architecture from any HuggingFace decoder-only transformer:

  • Reads layer count, head counts, hidden size, vocab size from model.config
  • Walks the module tree to find attention submodules, LM head, and final norm
  • Phase annotations and layer-dependent rendering scale automatically to any layer count

Verified model families

| Family | Example model | Chat template | Layers | Notes |
|---|---|---|---|---|
| Qwen | Qwen3-32B | ChatML | 64 | Full support |
| Llama 3 | Llama-3.1-8B-Instruct | Llama 3 format | 32 | Full support |
| Mistral | Mistral-7B-Instruct-v0.1 | [INST] format | 32 | Full support |
| Gemma | Gemma-2-9B-IT | Gemma format | 42 | System role auto-merged into user |
| GPT (OpenAI) | gpt-oss-20b | OpenAI format | 24 | Full support (MoE) |

Models without a system role (Gemma) are handled automatically: the engine merges system content into the first user message. Any HuggingFace model with a chat template and eager attention support should work.
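
The system-role merge can be as simple as prepending the system text to the first user turn. A hypothetical sketch of that idea (the engine's actual merge logic may differ):

```python
def merge_system_into_user(messages, separator="\n\n"):
    """Fold a leading system message into the first user message.

    Needed for chat templates (e.g. Gemma's) that reject the system
    role. Returns a new message list; the original is left untouched.
    """
    if not messages or messages[0]["role"] != "system":
        return list(messages)
    system, rest = messages[0], list(messages[1:])
    for msg in rest:
        if msg["role"] == "user":
            merged = dict(msg)
            merged["content"] = system["content"] + separator + msg["content"]
            return [merged if m is msg else m for m in rest]
    # No user turn to merge into: demote system content to a user turn
    return [{"role": "user", "content": system["content"]}] + rest

msgs = [{"role": "system", "content": "Be terse."},
        {"role": "user", "content": "Hi"}]
print(merge_system_into_user(msgs))  # system text now prefixes the user turn
```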

Requirements: attn_implementation="eager" (flash attention doesn't materialize the attention matrix).
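
If you load the model yourself with transformers, the flag goes to from_pretrained. The helper below is just an illustrative way to package that requirement; the commented-out load call assumes local weights and is not runnable without them:

```python
def attention_capture_load_kwargs(dtype="float16"):
    """Kwargs for loading a model when capturing attention.

    attn_implementation="eager" is required: flash/SDPA kernels never
    materialize the full attention matrix, so there is nothing to hook.
    """
    return {"attn_implementation": "eager", "torch_dtype": dtype}

print(attention_capture_load_kwargs())

# Usage (requires GPU + downloaded weights, shown for illustration only):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "/workspace/models/Llama-3-8B", **attention_capture_load_kwargs())
```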

GPU requirements

Rule of thumb: model_params * 2 bytes + 5GB headroom (fp16 weights + attention capture overhead).
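
Since the rule is plain arithmetic, it's easy to sanity-check before renting a GPU. A throwaway helper reproducing the rule of thumb:

```python
def estimate_vram_gb(params_billion, bytes_per_param=2, headroom_gb=5):
    """Estimate VRAM for attention capture: fp16 weights + fixed headroom.

    params_billion * 2 bytes covers the weights; the ~5GB headroom
    absorbs activations and the captured attention tensors.
    """
    return params_billion * bytes_per_param + headroom_gb

for size in (8, 32, 70):
    print(f"{size}B params -> ~{estimate_vram_gb(size)}GB")
# 8B -> ~21GB, 32B -> ~69GB, 70B -> ~145GB
```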

| Model | VRAM needed | Recommended GPU |
|---|---|---|
| 8B params | ~21GB | A100 40GB |
| 32B params | ~69GB | H100 80GB |
| 70B params | ~145GB | Won't fit single GPU; use quantization |

See Empirical Notes for memory estimation details and OOM prevention.

Documentation

  • SKILL.md: Operational reference for running the full pipeline with all flags
  • Pipeline Explained: How region annotation, attention hooks, logit lens, and per-token capture work mechanically
  • Empirical Notes: What broke, why, and what works — failure modes and validated patterns