GitHub - zahraarmantech/ZATRON: Privacy-preserving semantic search via modular arithmetic. 98% retrieval quality, 8x faster than FHE. Patent pending.

Stop choosing between smart search and private data.

ZATRON gives you both — 98% retrieval quality on 626K documents, and the server can't read a thing.

A neural network trained on 80,000 labeled pairs couldn't recover document similarity from ZATRON's barcodes (AUC 0.505 — chance level), while the same attack broke unprotected embeddings almost perfectly (AUC 0.999).


ZATRON (Zero-Access Transformed Retrieval Over Noise) turns embeddings into modular barcodes. Search still works; structure disappears. Private by math, not by promise.

The Problem

Standard semantic search stores embeddings as plain vectors. Anyone with database access can cluster documents by topic and infer content — without reading a single word.

The Solution

ZATRON transforms embeddings into modular barcodes. Search still works. Structure disappears.

Figure (t-SNE visualization, zatron_comparison.png). Left: raw embeddings, where same-topic documents cluster together and an attacker immediately sees the structure.

Right: ZATRON-protected barcodes, which appear as a random scatter with no visible topic structure.

Attack Analysis

Can an observer recover document similarity from ZATRON barcodes?

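Figure (attack analysis, zatron_attack.png). Left: raw embedding distances correlate perfectly with true similarity (ρ = 1.00), so the attacker recovers the similarity structure.

Right: ZATRON barcode distances show near-zero correlation (ρ = 0.09), so the attacker learns almost nothing.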
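As a rough illustration of how such a correlation can be measured, the sketch below computes a rank correlation between pairwise distances and true similarities. It assumes ρ denotes Spearman's rank correlation, which this README does not state, and it runs on synthetic numbers rather than ZATRON output. The helper names `rank` and `spearman_rho` are hypothetical and not part of the zatron package.

```python
import numpy as np

def rank(values):
    """Return 0-based ranks (ties broken by position; fine for continuous data)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

rng = np.random.default_rng(0)
true_similarity = rng.uniform(size=500)
leaky_distance = 1.0 - true_similarity    # distance fully determined by similarity
masked_distance = rng.uniform(size=500)   # distance unrelated to similarity

# Negate distances so that "closer" lines up with "more similar".
rho_leaky = spearman_rho(-leaky_distance, true_similarity)    # 1.0 up to rounding
rho_masked = spearman_rho(-masked_distance, true_similarity)  # close to zero
```

A value near 1 means an observer can read similarity off the stored distances; a value near 0 means the distances carry little usable ranking information.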

Results

All numbers on real data. All reproducible.

Retrieval Quality

Benchmark	Corpus	Quality (% of cosine)
MSMARCO	626,906 passages	98.2%
Natural Questions	5,000 passages, 100 real queries	101.6%
SciFact	5,183 scientific docs	95.7%
NFCorpus	3,633 medical docs	89.9%
STS-B	1,379 sentence pairs	100.1%

Real-World Challenge Tests

Dataset	What it is	Security (ρ)	Retrieval (Top-10 accuracy unless noted)
Enron Emails	350,652 real corporate emails	0.034 ✓	98.0%
Natural Questions	5,000 passages, real Google queries	0.031 ✓	100%
MSMARCO	626,906 web passages	0.10 ✓	98.2% MRR

Enron Corporate Email Archive — Full Results

Privacy-preserving retrieval on 350,652 real corporate emails from the Enron leak. This is a stress test: corporate emails have extremely high content similarity (forwarded threads, replies, meeting updates), making retrieval harder than typical document search.

Metric	Cosine Baseline	ZATRON	Retention
Top-1	100%	55.5%	55.5%
Top-5	100%	95.0%	95.0%
Top-10	100%	98.0%	98.0%
MRR@10	100%	72.5%	72.5%
Security (ρ)	—	0.034	✓ SAFE

The correct result appears in the top 10 results 98% of the time. Top-1 is lower due to near-duplicate emails swapping ranks — not information loss.
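For readers unfamiliar with these metrics, here is a minimal sketch of how top-k accuracy and MRR@10 are conventionally computed from the rank at which the correct result appears. The function names and the toy ranks are illustrative assumptions, not taken from the ZATRON benchmark scripts.

```python
def top_k_accuracy(ranks, k):
    """Fraction of queries whose correct result appears at rank <= k (ranks are 1-based)."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank; a correct result ranked past k contributes zero."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

# Toy example: near-duplicates push the correct result from rank 1 to rank 2 or 3.
ranks = [1, 2, 1, 3, 2, 1, 2, 1]
print(top_k_accuracy(ranks, 1))   # 0.5
print(top_k_accuracy(ranks, 10))  # 1.0
print(round(mrr_at_k(ranks), 3))  # 0.729
```

The toy numbers show the pattern described above: swapping ranks among near-duplicates lowers Top-1 and MRR sharply while leaving Top-10 untouched.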

Comparison with Existing Methods

Method	MRR@10	Encrypted?
Cosine (float32)	.530	No
Binary quantization	.514	No
Product quantization	.520	No
ZATRON (ours)	.528	Yes

About 8× faster than CKKS FHE on identical hardware (5 ms vs. 39 ms per comparison).

Three embedding models. Five languages. Eight security tests passed.

Multilingual

Language	Quality
Arabic	93.4%
Spanish	93.8%
Korean	95.1%
Chinese	95.4%
English	93.5%

Speed Optimization: Coarse-to-Fine Search

Brute-force comparison against all documents is slow at scale. We developed a two-stage approach inspired by radio tuning — scan coarsely first, then fine-tune on candidates:

Stage 1 (Coarse): Compute distance using only 2 of 6 primes across all documents. This is one-third the computation and filters out 95% of irrelevant documents.

Stage 2 (Fine): Compute full 6-prime distance on the remaining 5% of candidates.
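The two stages can be sketched as follows. This is a toy under stated assumptions, not the ZATRON implementation: the six primes are hypothetical (the README does not list them), the distance is assumed to be a sum of circular differences per prime, and the key-dependent masking is left out so the example stays self-contained. The names `circular_distance` and `coarse_to_fine` are made up for illustration.

```python
import numpy as np

PRIMES = np.array([7, 11, 13, 17, 19, 23])  # hypothetical values; six primes as in the text

def circular_distance(docs, query, primes):
    """Sum of circular differences.

    docs: (n_docs, n_primes, n_channels), query: (n_primes, n_channels).
    """
    diff = (docs - query) % primes[:, None]
    return np.minimum(diff, primes[:, None] - diff).sum(axis=(1, 2))

def coarse_to_fine(docs, query, top_k=10, coarse_primes=2, keep_fraction=0.05):
    """Rank all documents on a few primes, then re-rank the survivors on all primes."""
    coarse = circular_distance(
        docs[:, :coarse_primes], query[:coarse_primes], PRIMES[:coarse_primes]
    )
    n_keep = max(top_k, int(len(docs) * keep_fraction))
    candidates = np.argpartition(coarse, n_keep - 1)[:n_keep]  # best 5% by coarse distance
    fine = circular_distance(docs[candidates], query, PRIMES)  # full 6-prime distance
    return candidates[np.argsort(fine)[:top_k]]

# Toy corpus: quantized levels 0..49 stored as residues modulo each prime.
rng = np.random.default_rng(0)
levels = rng.integers(0, 50, size=(2000, 16))
docs = levels[:, None, :] % PRIMES[:, None]
query = docs[123].copy()  # a query identical to document 123

top10 = coarse_to_fine(docs, query)
```

Because nothing is precomputed or stored, the candidate set exists only in the key holder's memory for the duration of the query.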

Dataset	Brute Force (s)	Coarse-to-Fine / Radio Tuning (s)	Speedup	Quality Loss
MSMARCO 626K	2,399	957	2.5×	Zero
Enron 5K	2.1	0.8	2.6×	Zero

Top-10 accuracy: 100% on both datasets. In these runs the coarse filter never dropped the correct result.
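A back-of-the-envelope operation count lands close to the measured speedups. The sketch assumes cost is proportional to the number of per-prime comparisons and ignores sorting, memory traffic, and key-derivation overhead; the function name is hypothetical.

```python
def expected_speedup(coarse_primes, total_primes, keep_fraction):
    """Brute-force cost divided by two-stage cost, counting per-prime comparisons only."""
    coarse_cost = coarse_primes / total_primes  # stage 1 touches every document
    fine_cost = keep_fraction                   # stage 2 recomputes all primes on survivors
    return 1.0 / (coarse_cost + fine_cost)

print(round(expected_speedup(2, 6, 0.05), 2))  # 2.61
```

With 2 of 6 primes in the coarse pass and 5% of documents surviving, the two stages cost about 38% of brute force, in line with the 2.5× and 2.6× in the table.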

No stored index or bucket structure. Zero additional leakage — the optimization happens at query time by the key holder.

At 1M+ documents, the speedup is projected to reach 5–10×, on the reasoning that the fine stage (5% of the corpus) dominates. This figure is an estimate, not a measurement.

Try It

Install:

pip install zatron

Live demo (no install): https://huggingface.co/spaces/zahraarman/ZATRON

Run the demo locally:

pip install zatron sentence-transformers matplotlib
python demo.py

Quick Start

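from zatron import ModularBarcodeSystem

# corpus_embeddings, doc_ids, and query_embedding are your own data
# (for example, vectors from a sentence-transformers model); they are not defined here.
system = ModularBarcodeSystem(key="your-secret-key", n_channels=200)
system.fit(corpus_embeddings)

# Encode the corpus once, then encode each query with the same system.
barcodes = system.encode(corpus_embeddings, doc_ids)
query_bc = system.encode_query(query_embedding)

# Distance between the query and the first document's barcode.
distance = system.compare(query_bc, barcodes[0])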

How It Works

Decompose: Project the embedding onto 200 PCA channels
Quantize: Convert each channel to an integer (0–49)
Mask: Apply a rejection-sampled salt plus wave interference per document
Store: Keep only the modular residues (mod each prime)
Search: Compare in modular space; the raw embedding is never reconstructed
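The five steps above can be sketched in a few dozen lines. This is a simplified toy under loud assumptions, not the code in zatron_search.py: the primes are hypothetical, the salt is derived from HMAC-SHA256 keyed by a document id (an assumption suggested by the PRF mentioned in the Security section), the "wave interference" component is omitted because the README does not specify it, and the distance is assumed to be a sum of circular differences. All function names are made up for illustration.

```python
import hashlib
import hmac
import numpy as np

PRIMES = (7, 11, 13, 17, 19, 23)  # hypothetical; the text does not list the primes
N_LEVELS = 50                     # integer levels 0..49, as in the Quantize step

def fit(embeddings, n_channels):
    """Decompose: PCA via SVD, plus per-channel ranges used for quantization."""
    mean = embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    components = vt[:n_channels]
    proj = (embeddings - mean) @ components.T
    return mean, components, proj.min(axis=0), proj.max(axis=0)

def quantize(embeddings, mean, components, lo, hi):
    """Quantize: map each PCA channel to an integer in 0..49."""
    scaled = ((embeddings - mean) @ components.T - lo) / (hi - lo)
    return np.clip(np.floor(scaled * N_LEVELS), 0, N_LEVELS - 1).astype(int)

def salt(key, doc_id, prime, n):
    """Mask: n unbiased values in [0, prime) from HMAC-SHA256 bytes (rejection sampling)."""
    limit = 256 - 256 % prime  # bytes >= limit would bias the modulo, so reject them
    values, counter = [], 0
    while len(values) < n:
        msg = f"{doc_id}|{prime}|{counter}".encode()
        block = hmac.new(key, msg, hashlib.sha256).digest()
        values.extend(b % prime for b in block if b < limit)
        counter += 1
    return np.array(values[:n])

def encode(levels, key, doc_id):
    """Store: only salted residues, one row per prime."""
    return np.stack([(levels + salt(key, doc_id, p, levels.size)) % p for p in PRIMES])

def compare(query_levels, barcode, key, doc_id):
    """Search: the key holder strips the salt and sums circular differences per prime."""
    total = 0
    for row, p in zip(barcode, PRIMES):
        diff = (row - salt(key, doc_id, p, query_levels.size) - query_levels) % p
        total += int(np.minimum(diff, p - diff).sum())
    return total

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 16))
params = fit(embeddings, n_channels=8)
levels = quantize(embeddings, *params)
key = b"your-secret-key"
barcode = encode(levels[0], key, "doc-0")
```

In this toy, `compare` only ever sees differences of quantized levels modulo each prime, which is one way to read "raw embedding never reconstructed". Without the key, the stored rows are indistinguishable from the salt stream itself.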

Security

Learned Inversion Attack (Neural Attacker)

Statistical tests only rule out simple attackers. The stronger question: can a neural network, trained on labeled examples, learn to recover similarity from ZATRON barcodes?

Threat model. Known-plaintext observer: the attacker sees stored barcodes and obtains ~80,000 document pairs with true cosine similarities (a partial plaintext leak). A linear probe and a 3-layer MLP are trained on per-prime circular-difference features to predict similarity of unseen pairs. Train/test pairs share no anchor documents.
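The evaluation protocol for the linear probe can be sketched on synthetic data. Everything here is a stand-in: the primes are hypothetical, "similar" pairs are simulated as quantized vectors that differ by at most one level, masking is simulated by adding an independent uniform salt to each barcode, and the probe is plain least squares. Since the simulated salt makes the masked residues uniform by construction, this illustrates the procedure only; it does not reproduce benchmarks/zatron_ML_attack.py or the MLP.

```python
import numpy as np

PRIMES = np.array([7, 11, 13, 17, 19, 23])  # hypothetical primes
N_CHANNELS = 32

def residues(levels, rng, masked):
    """Residues per prime; if masked, add an independent uniform salt per prime and channel."""
    r = levels % PRIMES[:, None]
    if masked:
        salt = np.floor(rng.random(r.shape) * PRIMES[:, None]).astype(int)
        r = (r + salt) % PRIMES[:, None]
    return r

def features(a, b):
    """Per-prime mean circular difference between two barcodes."""
    diff = (a - b) % PRIMES[:, None]
    return np.minimum(diff, PRIMES[:, None] - diff).mean(axis=1)

def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; tied scores are ordered arbitrarily."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def make_pairs(rng, n_pairs, masked):
    """Synthetic labeled pairs: 1 = similar, 0 = unrelated."""
    labels = rng.integers(0, 2, size=n_pairs)
    X = np.empty((n_pairs, len(PRIMES)))
    for i, similar in enumerate(labels):
        x = rng.integers(0, 50, size=N_CHANNELS)
        if similar:  # similar pair: levels differ by at most one step
            y = np.clip(x + rng.integers(-1, 2, size=N_CHANNELS), 0, 49)
        else:        # unrelated pair
            y = rng.integers(0, 50, size=N_CHANNELS)
        X[i] = features(residues(x, rng, masked), residues(y, rng, masked))
    return X, labels

def linear_probe_auc(rng, masked, n_train=2000, n_test=2000):
    """Fit a least-squares linear probe on training pairs, report AUC on held-out pairs."""
    X_tr, y_tr = make_pairs(rng, n_train, masked)
    X_te, y_te = make_pairs(rng, n_test, masked)
    add_bias = lambda X: np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(add_bias(X_tr), y_tr, rcond=None)
    return auc(add_bias(X_te) @ w, y_te)

rng = np.random.default_rng(0)
control_auc = linear_probe_auc(rng, masked=False)  # unprotected signals
masked_auc = linear_probe_auc(rng, masked=True)    # salted stand-in barcodes
```

The control is what makes the comparison meaningful: the same features and the same probe must succeed on unprotected signals before a chance-level score on protected barcodes says anything.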

Result (MSMARCO, 50,000 passages, 100,000 pairs):

Input the attacker sees	Linear probe	MLP (3-layer)
Unprotected quantized signals (control)	ρ = 0.79, AUC = 0.985	ρ = 0.90, AUC = 0.999
ZATRON masked barcodes	ρ = 0.00, AUC = 0.498	ρ = 0.00, AUC = 0.505

The identical attack that almost perfectly recovers similarity from unprotected signals performs at chance level against ZATRON barcodes, even with 80,000 labeled training pairs at its disposal.

Reproduce: benchmarks/zatron_ML_attack.py (Colab, ~15 min on T4).

Head-to-Head: ZATRON vs ASPE

ASPE (Wong et al., SIGMOD 2009) is the classic encrypted-kNN baseline. It preserves scalar products exactly — perfect retrieval — but that same property means any observer can compute document similarities directly from the ciphertexts, no key and no training required.
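The scalar-product-preserving identity at the heart of ASPE is easy to check numerically. The sketch below shows only that core identity with a random invertible matrix as the key; it omits the further refinements of the published scheme, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
M = rng.normal(size=(dim, dim))        # secret invertible matrix (the key)

doc = rng.normal(size=dim)
query = rng.normal(size=dim)

enc_doc = M.T @ doc                    # stored ciphertext
enc_query = np.linalg.inv(M) @ query   # encrypted query

# (M^T p) . (M^-1 q) = p^T M M^-1 q = p . q, so no key is needed to evaluate it.
leaked = enc_doc @ enc_query
assert np.isclose(leaked, doc @ query)
```

Anyone who holds an encrypted document and an encrypted query can therefore evaluate their scalar product exactly, which is the property that gives ASPE perfect recall and also what an observer exploits.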

Metric	ASPE (SIGMOD '09)	ZATRON
Retrieval recall@10 (strict set-overlap)	100%	81.4%
Observer computes similarity directly	ρ = +0.87 (leaks)	ρ = −0.06 (nothing)
Learned attack (MLP, 80K labeled pairs)	ρ = +0.91, AUC = 0.993	ρ = +0.01, AUC = 0.518

ASPE's perfect recall is bought with total similarity leakage — by design. ZATRON trades a margin on the strictest retrieval metric for chance-level leakage under both direct and learned attacks. (Strict set-overlap is harder than the top-1-in-top-10 metric used in the headline table; same system, stricter ruler. MSMARCO 50,000 passages, identical conditions for both schemes.)
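The difference between the two rulers can be made concrete with a small sketch. The function names and document ids are illustrative assumptions, not the benchmark code.

```python
def recall_at_k(baseline_top_k, system_top_k):
    """Strict set overlap: fraction of the baseline's top-k that the system also returns."""
    return len(set(baseline_top_k) & set(system_top_k)) / len(baseline_top_k)

def top1_in_top_k(baseline_top_k, system_top_k):
    """Lenient check: is the baseline's best result anywhere in the system's top-k?"""
    return baseline_top_k[0] in system_top_k

baseline = [7, 3, 9, 1, 4]  # cosine top-5 (toy document ids)
system = [3, 7, 9, 8, 2]    # protected-search top-5
print(recall_at_k(baseline, system))    # 0.6
print(top1_in_top_k(baseline, system))  # True
```

The same result list passes the lenient check while scoring only 60% on set overlap, which is why the two tables report different numbers for one system.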

Reproduce: benchmarks/zatron_vs_ASPE.py (Colab, ~10 min on T4).

Classical Attack Battery

Eight independent attack vectors tested:

Attack	Result	Status
IND-CPA indistinguishability	p = 0.48	Pass
Statistical correlation	ρ = 0.10	Pass
Entropy analysis	100%	Pass
Per-channel leakage	|r| = 0.30	Pass
Key recovery	1.0% vs 1.9% baseline	Pass
Chosen-plaintext	ρ = 0.00	Pass
Timing side-channel	p = 1.00	Pass
CRT reconstruction	|r| = 0.01	Pass

Threat model: Protected against unauthorized database observers — including learned (neural) attackers with known-plaintext training data, per the table above. The key holder computes distances but never reconstructs raw embeddings; a key holder computing many pairwise distances can still partially recover embedding geometry via MDS (ρ ≈ 0.35 after the log transform), which is inherent to any distance-preserving scheme, FHE included. This is a randomized privacy-preserving encoding, distinct from reversible block cipher encryption. Independent cryptographic review remains the appropriate bar for production use.
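To see why pairwise distances alone expose geometry, consider classical multidimensional scaling (MDS). The sketch below uses exact Euclidean distances on random points, where recovery is exact up to rotation and reflection. It is a generic illustration of MDS, not the experiment behind the ρ ≈ 0.35 figure; with ZATRON's quantized modular distances the recovery is only partial, as stated above.

```python
import numpy as np

def classical_mds(distances, n_dims):
    """Recover coordinates (up to rotation, reflection, translation) from a distance matrix."""
    n = len(distances)
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distances ** 2) @ centering  # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)                  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_dims]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))

def pairwise(points):
    """Matrix of Euclidean distances between all rows of points."""
    return np.linalg.norm(points[:, None] - points[None, :], axis=-1)

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 5))
recovered = classical_mds(pairwise(points), n_dims=5)

# The recovered layout reproduces every pairwise distance of the original points.
assert np.allclose(pairwise(recovered), pairwise(points), atol=1e-6)
```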

Formal proofs under PRF assumption (HMAC-SHA256) in paper/Formal_Security_Proof.pdf.

Project Structure

ZATRON/
├── README.md
├── zatron_search.py              # Core system (self-testing)
├── demo.py                       # One-command demo
├── generate_visuals.py           # Generate comparison images
├── zatron_comparison.png         # t-SNE visualization
├── zatron_attack.png             # Attack analysis visualization
├── benchmarks/
│   ├── zatron_ML_attack.py       # Learned inversion attack (reproduce security table)
│   └── zatron_vs_ASPE.py         # Head-to-head vs ASPE (SIGMOD'09)
├── demo/
│   └── encrypted_search_demo.jsx # Interactive web demo
├── paper/
│   ├── Lightweight_Encrypted_Semantic_Search.pdf
│   └── Formal_Security_Proof.pdf
└── LICENSE

Cite

@misc{arman2026zatron,
  title={Lightweight Encrypted Semantic Search via Multi-Channel Modular Signaling},
  author={Zahra Arman},
  year={2026},
  note={US Provisional Patent Filed. github.com/zahraarmantech/ZATRON}
}

License

MIT License. The method is covered by a pending US provisional patent.

Author

Zahra Arman — Independent Researcher — zahra.arman.tech@gmail.com