GitHub - base76-research-lab/token-compressor: Token compressor

Reduce LLM prompt tokens by 30–70% while preserving semantic meaning.

mcp-name: io.github.base76-research-lab/token-compressor

Semantic prompt compression for LLM workflows. Reduce token usage by 40–60% without losing meaning.

Built by Base76 Research Lab — research into epistemic AI architecture.

Live demo

Intent Compiler MVP is now live and uses this project as part of the idea -> spec -> compressed output flow:

Live: https://intent-compiler-mvp.pages.dev
Product repo: https://github.com/base76-research-lab/token-compressor

What it does

token-compressor is a two-stage pipeline that compresses prompts before they reach an LLM:

LLM compression — a local model (llama3.2:1b via Ollama) rewrites the prompt to its semantic minimum, preserving all conditionals and negations
Embedding validation — cosine similarity between original and compressed embeddings must exceed a threshold (default: 0.85) — if not, the original is sent unchanged

The result: shorter prompts, lower costs, same intent.

Input prompt (300 tokens)
        ↓
  LLM compresses
        ↓
  Embedding validates (cosine ≥ 0.85?)
        ↓
  Pass → compressed (120 tokens)   Fail → original (300 tokens)

Key design principle: conditionality is never sacrificed. If your prompt says "only do X if Y", that constraint survives compression.

Requirements

Python 3.10+
Ollama running locally
Two models pulled:

ollama pull llama3.2:1b
ollama pull nomic-embed-text

Python dependencies:

Quick start

from compressor import LLMCompressEmbedValidate

pipeline = LLMCompressEmbedValidate()
result = pipeline.process("Your prompt text here...")

print(result.output_text)   # compressed (or original if validation failed)
print(result.report())      # MODE / COVERAGE / TOKENS saved

Result object:

Field	Description
`output_text`	Text to send to your LLM
`mode`	`compressed` / `raw_fallback` / `skipped`
`coverage`	Cosine similarity (0.0–1.0)
`tokens_in`	Estimated input tokens
`tokens_out`	Estimated output tokens
`tokens_saved`	Difference

CLI usage

echo "Your long prompt here..." | python3 cli.py

Output: compressed text on stdout, stats on stderr.

Claude Code hook (recommended setup)

Add to your ~/.claude/settings.json under hooks → UserPromptSubmit:

{
  "type": "command",
  "command": "echo \"${CLAUDE_USER_PROMPT:-}\" | python3 /path/to/token-compressor/cli.py > /tmp/compressed_prompt.txt 2>/tmp/compress.log || true"
}

This runs on every prompt submission and writes the compressed version to a temp file, which can be injected back into context via a second hook or MCP server.

MCP server

The MCP server exposes compression as a tool callable from Claude Code and any MCP-compatible client.

Install:

pip install token-compressor-mcp

Tool: compress_prompt

Input: text (string)
Output: compressed text + stats footer

Claude Code MCP config (~/.claude/settings.json):

{
  "mcpServers": {
    "token-compressor": {
      "command": "uvx",
      "args": ["token-compressor-mcp"]
    }
  }
}

Or from source:

{
  "mcpServers": {
    "token-compressor": {
      "command": "python3",
      "args": ["-m", "token_compressor_mcp"],
      "cwd": "/path/to/token-compressor"
    }
  }
}

Configuration

pipeline = LLMCompressEmbedValidate(
    threshold=0.85,          # cosine similarity floor (lower = more aggressive)
    min_tokens=80,           # skip pipeline below this (not worth compressing)
    compress_model="llama3.2:1b",
    embed_model="nomic-embed-text",
)

How it works

Stage 1 — LLM compression

The compression prompt instructs the model to:

Preserve all conditionals (if, only if, unless, when, but only)
Preserve all negations
Remove filler, hedging, redundancy
Target 40–60% of original length

Stage 2 — Embedding validation

Computes cosine similarity between the original and compressed text using nomic-embed-text. If similarity falls below threshold, the original is returned unchanged. This prevents silent meaning loss.

Results

Tested across Swedish and English prompts, technical and natural language:

Input	Tokens in	Tokens out	Saved
Research abstract (EN)	89	38	57%
Session intent (SV)	32	18	44%
Technical instruction	47	22	53%
Short command (<80t)	—	—	skipped

Research background

This tool implements the architecture from:

Wikström, B. (2026). When Alignment Reduces Uncertainty: Epistemic Variance Collapse and Its Implications for Metacognitive AI. DOI: 10.5281/zenodo.18731535

Part of the Base76 Research Lab toolchain for epistemic AI infrastructure.

License

MIT — Base76 Research Lab, Sweden