A local-first, reversible PII scrubber for AI workflows using ONNX and Regex


Tom Jordi Ruesch

The Privacy-Translation Paradox

Every engineering team eventually faces the same dilemma: You need to translate user content (support tickets, documents, chat logs) using high-quality engines like DeepL or LLMs like GPT-5, but you strictly cannot send Personally Identifiable Information (PII) to third-party APIs (yes, I’m European).

The solution is seemingly simple: Redact the data. The problem? Redaction destroys translation quality.

If you scrub “John bought a generic gift for Mary” into “PERSON bought a generic gift for PERSON,” the translation engine loses the context needed for grammatical gender agreement, case endings, and prepositions in target languages like French or German. Furthermore, most open-source PII scrubbers are “one-way” — they clean data for analytics, not for a round-trip translation workflow.

I built a solution for this. Today, we are open-sourcing rehydra: a TypeScript library for reversible, context-aware PII masking designed specifically for translation pipelines.

How rehydra Works

Unlike general-purpose scrubbers, rehydra is designed around a lifecycle:

Detect -> Mask -> Translate -> Rehydrate

The Mask and Rehydrate steps (not Translate) run entirely on-device (Node.js or Bun) using a hybrid engine of regex and quantized ONNX models.

1. Hybrid Detection Strategy

A single detection method isn’t enough. Regex is fast but limited; NER (Named Entity Recognition) is smart but heavy. rehydra uses both:

  • Structured PII (Regex): We use strict patterns for things that follow rules — IBANs (with Mod-97 checksum validation), Credit Cards (Luhn algorithm), and Emails.
  • Soft PII (NER): For Names, Organizations, and Locations, we wrap a powerful NER model via ONNX Runtime.
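
The two checksum validations named above (Luhn for cards, Mod-97 for IBANs) are standard algorithms and can be sketched in a few lines. These are illustrative standalone helpers, not rehydra's internal API:

```typescript
// Luhn check for credit card numbers: double every second digit from the
// right, subtract 9 from results above 9, and require the sum % 10 === 0.
function passesLuhn(num: string): boolean {
  const digits = num.replace(/\D/g, '');
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = Number(digits[i]);
    if (double && (d *= 2) > 9) d -= 9;
    sum += d;
    double = !double;
  }
  return digits.length > 1 && sum % 10 === 0;
}

// IBAN Mod-97 check: move the first four characters to the end, map
// letters to numbers (A=10 ... Z=35), and require the value % 97 === 1.
function isValidIban(iban: string): boolean {
  const s = iban.replace(/\s+/g, '').toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]+$/.test(s)) return false;
  const digits = (s.slice(4) + s.slice(0, 4))
    .replace(/[A-Z]/g, (c) => String(c.charCodeAt(0) - 55));
  let rem = 0;
  for (const ch of digits) rem = (rem * 10 + Number(ch)) % 97;
  return rem === 1;
}
```

Because these checks validate structure rather than guess from context, they produce essentially zero false positives on well-formed input, which is why they can safely run on the fast regex-only path.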

This allows developers to choose their trade-off. You can run anonymizeRegexOnly() for sub-millisecond performance on streams, or the full anonymize() pipeline for high-precision document scrubbing.

import { createAnonymizer } from 'rehydra';

// Auto-downloads the ~280MB quantized model on first run
const anonymizer = createAnonymizer({
  ner: { mode: 'quantized' }
});

await anonymizer.initialize();

2. The Semantic Masking Challenge

While preventing leaks is a solved problem, preserving context remains the frontier.

In our current roadmap, we are tackling Semantic Masking. The goal is to enrich our PII tags with metadata (Gender, Location Scope) so the machine translation engine can generate grammatically correct output.

The Problem

  • Gender Agreement: Replacing “Mary” with a generic XML tag leads to “Il a vu <PII type="PERSON"/>” (masculine default) instead of “Elle a vu…” (feminine).
  • Prepositions: Replacing “Berlin” vs. “Germany” with a generic <PII type="LOCATION"/> confuses engines that need to know whether the location is a city, a country, a landmark, or something else.

Our V1 Approach: Lookup Tables

For our first iteration, we are implementing a ‘lightweight’ semantic enricher that runs post-NER detection.

// Before
<PII type="PERSON" id="1"/>

// After (Enriched)
<PII type="PERSON" gender="female" id="1"/>

The Implementation:

Data Sources: We aggregate open data from gender-guesser (approx. 40k Western names) and GeoNames (cities >15k population).

Persons: We check the first name against our database. If ambiguous (e.g., “Andrea” is male in Italian but female in German), we can use the locale hint provided to the anonymizer to disambiguate.

Locations: We classify entities into city, country, or region based on the GeoNames export.
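
Put together, the enricher is essentially a dictionary lookup with a locale fallback. A rough sketch, where the tiny inline tables and helper names are illustrative stand-ins for the real gender-guesser and GeoNames data:

```typescript
type Gender = 'male' | 'female' | 'ambiguous' | 'unknown';

// Tiny illustrative slices of the lookup tables (hypothetical data).
const NAME_GENDER: Record<string, Gender | Record<string, Gender>> = {
  mary: 'female',
  john: 'male',
  andrea: { it: 'male', de: 'female' }, // locale-dependent
};
const LOCATION_KIND: Record<string, 'city' | 'country'> = {
  berlin: 'city',
  germany: 'country',
};

function inferGender(firstName: string, locale?: string): Gender {
  const entry = NAME_GENDER[firstName.toLowerCase()];
  if (!entry) return 'unknown';
  if (typeof entry === 'string') return entry;
  // Ambiguous name: use the caller's locale hint to disambiguate.
  return (locale && entry[locale]) || 'ambiguous';
}

function enrichTag(
  type: 'PERSON' | 'LOCATION',
  value: string,
  id: number,
  locale?: string
): string {
  if (type === 'PERSON') {
    const g = inferGender(value.split(' ')[0], locale);
    return g === 'unknown' || g === 'ambiguous'
      ? `<PII type="PERSON" id="${id}"/>`
      : `<PII type="PERSON" gender="${g}" id="${id}"/>`;
  }
  const kind = LOCATION_KIND[value.toLowerCase()];
  return kind
    ? `<PII type="LOCATION" kind="${kind}" id="${id}"/>`
    : `<PII type="LOCATION" id="${id}"/>`;
}
```

When the lookup comes back empty or ambiguous with no locale hint, the enricher emits the plain tag rather than guessing, so a miss degrades to today's behavior instead of injecting wrong grammar.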

But there is a trade-off: We explicitly chose lookup tables over ML for Version 1 to keep the library as lightweight as possible. While a model would handle edge cases better, shipping static JSON/TXT files is significantly cheaper than loading another 100MB ONNX model just for gender inference. This covers ~90% of common Western names and major cities with near-zero runtime overhead.

In the future, we’ll put additional research into custom ML solutions to enable broader, context-aware enrichment strategies.

3. Fuzzy Rehydration

When you send a string like Hello <PII id="1"/> to a generic LLM or MT engine, the output often comes back "mangled." The engine might change the quotes to smart quotes, add spaces inside the tags, or reorder attributes.

If your library relies on strict string replacement, your pipeline breaks.

We implemented a Fuzzy Tag Matcher that is resilient to these hallucinations. It detects variations in spacing, quoting, and attribute order to ensure we can always map the token back to the original value.

// The mapping table is encrypted using AES-256-GCM
const { anonymizedText, piiMap } = await anonymizer.anonymize("Call John at +49...");

// Translate via any external API...
const translated = await externalTranslate(anonymizedText);
// Even if the API returns: "Rufen Sie < PII id = «1» type='PERSON' > an..."

// Rehydrate seamlessly
const final = rehydrate(translated, piiMap);
// Result: "Rufen Sie John an..."
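
The core idea behind such a matcher can be sketched as one tolerant regex per placeholder ID: accept any quote style (straight, curly, guillemet, or none), flexible whitespace, and attributes in any order. The helpers below are an illustration of the technique, not rehydra's actual implementation:

```typescript
// Build a regex matching a <PII .../> tag for a given id, tolerating
// extra whitespace, smart/guillemet quotes, and any attribute order.
function fuzzyPiiTagRegex(id: number): RegExp {
  // Straight, curly, or guillemet quotes -- or none at all.
  const q = `["'\u201C\u201D\u00AB\u00BB]?`;
  // One attribute: name = "value" with flexible spacing and quoting.
  const attr = `[\\w-]+\\s*=\\s*${q}[^"'\u201C\u201D\u00AB\u00BB<>\\s]*${q}`;
  return new RegExp(
    `<\\s*PII\\b(?:\\s+${attr})*?\\s+id\\s*=\\s*${q}${id}${q}(?:\\s+${attr})*\\s*/?\\s*>`,
    'g'
  );
}

// Restore each original value wherever its (possibly mangled) tag appears.
function rehydrateFuzzy(text: string, piiMap: Record<number, string>): string {
  let out = text;
  for (const [id, original] of Object.entries(piiMap)) {
    out = out.replace(fuzzyPiiTagRegex(Number(id)), original);
  }
  return out;
}
```

Anchoring the match on the `id` attribute is what makes this robust: everything else about the tag can drift, but as long as the engine echoes the ID back in some recognizable form, the token maps to exactly one original value.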

4. Security First

Because the “PII Map” (the link between ID:1 and John Smith) is effectively the PII itself, we treat it as sensitive material.

The library includes a crypto module that forces AES-256-GCM encryption for the mapping table. The raw PII never leaves the local memory space, and the state object that persists between the masking and rehydration steps is encrypted at rest.
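
As a minimal sketch, here is what AES-256-GCM encryption of the mapping table looks like with Node's built-in crypto module. The helper names are hypothetical; rehydra's crypto module handles this internally:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

interface EncryptedMap {
  iv: Buffer;
  ciphertext: Buffer;
  authTag: Buffer;
}

function encryptPiiMap(map: Record<number, string>, key: Buffer): EncryptedMap {
  const iv = randomBytes(12); // 96-bit nonce, the standard size for GCM
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(map), 'utf8'),
    cipher.final(),
  ]);
  return { iv, ciphertext, authTag: cipher.getAuthTag() };
}

function decryptPiiMap(enc: EncryptedMap, key: Buffer): Record<number, string> {
  const decipher = createDecipheriv('aes-256-gcm', key, enc.iv);
  decipher.setAuthTag(enc.authTag); // GCM is authenticated: tampering throws
  const plain = Buffer.concat([
    decipher.update(enc.ciphertext),
    decipher.final(),
  ]);
  return JSON.parse(plain.toString('utf8'));
}
```

GCM gives you integrity for free: if the persisted state is modified between masking and rehydration, decryption fails loudly instead of silently restoring corrupted PII.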

Performance & Architecture

We built this for Node.js environments, for easy use in web-based applications (Electron, Tauri) and in contexts where Python isn’t an option.

  • Runtime: We abstract the ONNX runtime to support both onnxruntime-node and onnxruntime-web (for Bun/Edge support).
  • Quantization: By default, we pull a quantized (INT8) version of the XLM-RoBERTa model (~280MB) which provides 95%+ of the accuracy of the full model at 1/4 the size (custom models are supported, too).

Try it out

The project is MIT licensed and available on npm.

We’d love feedback on the NER implementation and edge cases in the rehydration logic!