flexorch-audit: Zero-dependency PII + quality + noise audit for LLM datasets (TR/EU/US)

Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.

Why

Before feeding documents into an LLM pipeline you need to answer three questions:

Does this text contain personal data? Sending PII to a language model is a compliance risk.
Is the text quality high enough? Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
How bad is the noise? Garbled encodings and symbol clutter degrade model output silently.

Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. flexorch-audit answers all three with one call — using only regex and Python's standard library. No model weights, no network calls, no external packages.

Features

Quality grade — A/B/C/D composite score: is this text LLM-ready at a glance?
Noise ratio — line-level symbol clutter detection (noise_ratio); values above 0.20 indicate likely extraction artifacts
PII detection — 30+ types across 8 countries (TR/DE/FR/IT/NL/ES/UK/US) + universal types; all regex-based, with checksum validation where the identifier defines one
Batch audit — audit_batch() aggregates duplicate ratio and PII counts across an entire dataset in one call
Masking — four strategies: redact, replace (synthetic), token, hash
Zero runtime dependencies — pure Python stdlib, Python 3.10+

Install

pip install flexorch-audit

Quick start

from pathlib import Path

from flexorch_audit import audit, mask

# The library audits plain text: extract from PDF/DOCX first
text = Path("contract.txt").read_text(encoding="utf-8")

result = audit(text)               # "und" by default — all detectors active
# result = audit(text, locale="tr")  # restrict to TR-only detectors

result.quality_grade      # "B"
result.quality_score      # 0.73  (0.0–1.0 composite)
result.noise_ratio        # 0.04  (fraction of blank/garbage lines; >0.20 = low quality)
result.detected_language  # "und" (echoes the locale you passed in; no language detection is performed)
result.pii_summary        # [{"type": "email", "count": 2}, {"type": "national_id_tr", "count": 1}]

# Full findings and raw metrics — dict access also works:
result["pii"]    # [{"type": "email", "value": "ali@example.com", "start": 8, "end": 23}]
result["quality"]  # {"completeness": 1.0, "avg_length": 342, "duplicate_ratio": None}
result["noise"]    # {"garbage_ratio": 0.0, "encoding_ok": True}

clean = mask(text, result["pii"], strategy="redact")
# "Contact: [REDACTED_EMAIL]"

One-shot redaction

from flexorch_audit import redact_for_llm

clean = redact_for_llm("TCKN: 12345678950, email: ali@example.com", locale="tr")
# "TCKN: [REDACTED_NATIONAL_ID_TR], email: [REDACTED_EMAIL]"

# Different masking strategies
redact_for_llm(text, locale="tr", strategy="token")   # <PII_NATIONAL_ID_TR_1>
redact_for_llm(text, locale="tr", strategy="hash")    # [3d4f9a1b2c8e7f0a]
redact_for_llm(text, locale="tr", strategy="replace") # static synthetic value

No PII found → original text returned unchanged.

Token estimation

from flexorch_audit import estimate_tokens

estimate_tokens("The quick brown fox jumps over the lazy dog.")  # → 16
estimate_tokens("")  # → 0

Heuristic: words × 4/3, with no tiktoken required. The estimate is typically within ~15% of a real tokenizer's count for English and most European languages; treat it as a planning figure for context-window sizing and cost forecasting, not an exact count.
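
The heuristic above is simple enough to sketch in a few lines. The version below is an illustration, not the library's code: `estimate_tokens_sketch` and `fits_context` are hypothetical names, and both the whitespace word split and the rounding up are assumptions, so its results need not match `estimate_tokens` exactly.

```python
def estimate_tokens_sketch(text: str) -> int:
    """Approximate a token count as words * 4/3, rounded up (rounding is an assumption)."""
    words = len(text.split())    # whitespace-delimited words
    return (4 * words + 2) // 3  # integer ceiling of words * 4 / 3


def fits_context(text: str, context_window: int, reserve_for_output: int = 512) -> bool:
    """Check whether the estimated prompt size plus an output budget fits the window."""
    return estimate_tokens_sketch(text) + reserve_for_output <= context_window
```

A check like `fits_context` is the kind of planning use the estimate is suited to: leave headroom, since the real count can differ by the margin quoted above.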

Batch audit

from flexorch_audit import audit_batch

# dataset: any iterable of records (dicts) that carry a "text" field
texts = [record["text"] for record in dataset]
batch = audit_batch(texts)           # locale="und" by default

batch["duplicate_ratio"]    # 0.12 — fraction of exact-duplicate records
batch["avg_quality_score"]  # 0.78
batch["pii_summary"]        # [{"type": "email", "count": 47}, ...]
batch["results"]            # list of AuditResult, one per text
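
To make the two aggregates concrete, here is a minimal stdlib sketch of how a duplicate ratio and a merged PII summary could be computed. It is not the library's implementation: the function names are hypothetical, and "fraction of exact-duplicate records" is read here as the share of records that repeat an earlier record.

```python
from collections import Counter


def duplicate_ratio(texts: list[str]) -> float:
    """Share of records that exactly repeat an earlier record (assumed definition)."""
    if not texts:
        return 0.0
    return (len(texts) - len(set(texts))) / len(texts)


def merge_pii_summaries(summaries: list[list[dict]]) -> list[dict]:
    """Sum per-type counts across summaries shaped like [{"type": ..., "count": ...}]."""
    totals: Counter[str] = Counter()
    for summary in summaries:
        for entry in summary:
            totals[entry["type"]] += entry["count"]
    # Most frequent PII types first
    return [{"type": kind, "count": count} for kind, count in totals.most_common()]
```

Under this reading, four records of which three are identical give a ratio of 0.5, because two of the four add nothing new.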

Country coverage

`locale`	Detectors activated
`"und"` (default)	All locales combined — use when document language is unknown
`"all"`	Alias for `"und"`
`"tr"`	TCKN · VKN · phone_tr · name · IBAN_TR · company_name_tr · MERSIS · postal_code_tr · province_tr
`"de"`	Steueridentifikationsnummer · Sozialversicherungsnummer
`"fr"`	SIREN · SIRET · INSEE/NIR
`"it"`	Codice Fiscale · Partita IVA
`"nl"`	BSN · KvK
`"es"`	DNI/NIE · CIF
`"uk"`	NI number · UTR
`"us"`	SSN · EIN · ITIN
`"eu"`	E.164 phone · IBAN (EU+GB+CH+NO) · company name

Universal detectors (always active regardless of locale): email · iban · credit_card · ip · ip_v6

Language detection: flexorch-audit is zero-dependency — no language detection library is included. Pass the correct locale yourself, or use "und" (default) to activate all detectors.

PII types

Universal

Type	Description
`email`	RFC-5321 email address
`iban`	ISO 13616 IBAN — mod-97 validated; suppressed when `iban_tr` or `iban_intl` fires on same span
`credit_card`	16-digit groups, Luhn-validated
`ip`	IPv4 address
`ip_v6`	IPv6 — full, compressed `::`, loopback forms
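
The `credit_card` row relies on Luhn validation to reject digit runs that merely look like card numbers. The sketch below shows the standard Luhn check; `luhn_valid` is a hypothetical helper, and the 16-digit restriction and separator handling are assumptions taken from the table's wording rather than from the library's source.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if a 16-digit candidate passes the Luhn checksum."""
    compact = candidate.replace(" ", "").replace("-", "")
    if len(compact) != 16 or not (compact.isascii() and compact.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit
    for index, ch in enumerate(reversed(compact)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

Pairing a regex with a checksum like this is what keeps false positives low without any model: a random 16-digit string passes Luhn only about one time in ten.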

Turkey (`locale="tr"`)

Type	Description
`national_id_tr`	TCKN — 11-digit, modular arithmetic checksum
`tax_id_tr`	VKN — 10-digit, Luhn-variant checksum
`phone_tr`	Turkish mobile: `+90`/`0` prefix + 10 digits
`name`	Label-prefixed name: `Adı:`, `Full Name:`, `Customer Name:`, etc.
`iban_tr`	Turkish IBAN (`TR` + 24 chars), mod-97 validated
`company_name_tr`	Company with TR legal suffix: A.Ş. · Ltd.Şti. · Koll.Şti. · Koop. · T.A.Ş.
`mersis_no`	MERSIS — 16-digit company registry number
`postal_code_tr`	Turkish postal code (province plate 01–81)
`province_tr`	All 81 Turkish provinces
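
The TCKN checksum mentioned in the table follows a publicly documented rule: the tenth digit is derived from the first nine, and the eleventh from the first ten. The sketch below implements that rule; `tckn_valid` is a hypothetical name, and the library's own validator may apply additional checks.

```python
def tckn_valid(candidate: str) -> bool:
    """Validate an 11-digit Turkish national ID (TCKN) by its two check digits."""
    if len(candidate) != 11 or not (candidate.isascii() and candidate.isdigit()):
        return False
    if candidate[0] == "0":  # a TCKN never starts with zero
        return False
    d = [int(ch) for ch in candidate]
    odd_sum = sum(d[0:9:2])   # 1st, 3rd, 5th, 7th, 9th digits
    even_sum = sum(d[1:8:2])  # 2nd, 4th, 6th, 8th digits
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh
```

The sample value `12345678950` used in the one-shot redaction example passes this check, which is why it is detected as `national_id_tr` rather than ignored as an arbitrary number.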

Germany (`locale="de"`)

Type	Description
`tax_id_de`	Steueridentifikationsnummer — 11 digits, ISO 7064 MOD 11,10 checksum
`social_id_de`	Sozialversicherungsnummer — area + DOB + letter + serial

France (`locale="fr"`)

Type	Description
`siret_fr`	SIRET — 14 digits, label-prefix gated
`company_id_fr`	SIREN — 9 digits, label-prefix gated
`social_id_fr`	INSEE/NIR — 15 digits, starts with `1` or `2`

Italy (`locale="it"`)

Type	Description
`national_id_it`	Codice Fiscale — 16 chars alphanumeric, uppercase normalized
`tax_id_it`	Partita IVA — 11 digits, Agenzia delle Entrate checksum

Netherlands (`locale="nl"`)

Type	Description
`national_id_nl`	BSN — 9 digits, 11-check (weighted sum mod 11)
`company_id_nl`	KvK — 8 digits, label-prefix gated
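
The BSN "11-check" is a weighted sum over the nine digits that must be divisible by 11. A minimal sketch follows; `bsn_valid` is a hypothetical name, and rejecting the all-zero value is an added assumption, not something the table states.

```python
def bsn_valid(candidate: str) -> bool:
    """Validate a 9-digit Dutch BSN with the 11-check (weights 9..2, then -1)."""
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return False
    if candidate == "000000000":  # trivially passes the sum; excluded by assumption
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(weight * int(ch) for weight, ch in zip(weights, candidate))
    return total % 11 == 0
```

KvK numbers have no comparable checksum in the table, which is presumably why that detector is gated on a label prefix instead.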

Spain (`locale="es"`)

Type	Description
`national_id_es`	DNI (8 digits + letter, mod-23) and NIE (X/Y/Z prefix, same check)
`tax_id_es`	CIF — letter prefix + 7 digits + control character
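
The mod-23 rule for DNI and NIE maps the numeric part to a control letter through a fixed 23-letter table; for an NIE, the leading X, Y, or Z is first replaced by 0, 1, or 2. The sketch below illustrates that rule. `dni_nie_valid` is a hypothetical helper, and the upper-casing of input is an assumption.

```python
import re

_CONTROL_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
_DNI = re.compile(r"[0-9]{8}[A-Z]")
_NIE = re.compile(r"[XYZ][0-9]{7}[A-Z]")


def dni_nie_valid(candidate: str) -> bool:
    """Check the mod-23 control letter of a Spanish DNI or NIE."""
    value = candidate.strip().upper()
    if _DNI.fullmatch(value):
        number = int(value[:8])
    elif _NIE.fullmatch(value):
        # X, Y, Z stand for the leading digit 0, 1, 2
        number = int(str("XYZ".index(value[0])) + value[1:8])
    else:
        return False
    return _CONTROL_LETTERS[number % 23] == value[-1]
```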

United Kingdom (`locale="uk"`)

Type	Description
`social_id_uk`	NI number — 2 letters + 6 digits + A/B/C/D; HMRC forbidden prefixes excluded
`tax_id_uk`	UTR — 10 digits, label-prefix gated

United States (`locale="us"`)

Type	Description
`ssn`	SSN — `###-##-####`, invalid prefixes (000/666/9xx) excluded
`tax_id_us`	EIN — `XX-XXXXXXX`, IRS invalid area prefixes excluded
`national_id_us`	ITIN — `9XX-7X/8X/9X-XXXX` middle group validated
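
For identifiers without a checksum, such as the SSN, the table describes pattern matching plus exclusion of prefixes that are never issued. One way to express that is a regex with a negative lookahead, as sketched below. The pattern and the `find_ssn_candidates` helper are illustrative assumptions, not the library's actual expression; the finding shape mirrors the `result["pii"]` entries shown in the quick start.

```python
import re

# ###-##-#### with area numbers 000, 666, and 900-999 rejected up front
SSN_PATTERN = re.compile(r"\b(?!000|666|9[0-9]{2})[0-9]{3}-[0-9]{2}-[0-9]{4}\b")


def find_ssn_candidates(text: str) -> list[dict]:
    """Return findings with type, matched value, and character offsets."""
    return [
        {"type": "ssn", "value": m.group(), "start": m.start(), "end": m.end()}
        for m in SSN_PATTERN.finditer(text)
    ]
```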

EU / International (`locale="eu"`)

Type	Description
`phone_intl`	E.164 international phone — 7–15 digits, TR (+90) excluded
`iban_intl`	IBAN for EU+GB+CH+NO — ISO 13616 country+length table + mod-97
`company_name_intl`	Company with international suffix: GmbH · LLC · S.r.l. · B.V. · SAS · Inc. · Ltd. etc.
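
The mod-97 validation referenced for `iban`, `iban_tr`, and `iban_intl` is the ISO 13616 check: move the first four characters to the end, convert letters to numbers (A=10 through Z=35), and require the result to leave remainder 1 when divided by 97. The sketch below covers only that step. `iban_mod97_valid` is a hypothetical name, and the per-country length table the `iban_intl` row mentions is deliberately left out.

```python
def iban_mod97_valid(candidate: str) -> bool:
    """Check the ISO 13616 mod-97 condition; country-specific lengths are not verified."""
    compact = candidate.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34) or not (compact.isascii() and compact.isalnum()):
        return False
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False  # expect country code followed by two check digits
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to themselves and A-Z to 10-35
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

Because 97 is prime, any single mistyped digit changes the remainder, so this check alone already filters most accidental matches.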

Noise detection

noise_ratio measures the fraction of lines that are blank or contain symbol clutter:

result = audit("clean line\n@@@garbage\n\nclean")
result.noise_ratio   # 0.5  (2 noisy lines out of 4)

A line is "noisy" when it is blank (after strip) or contains 3+ consecutive characters from @ # ! ~ * =.

`noise_ratio`	Signal
`< 0.05`	Clean — likely well-extracted text
`0.05–0.20`	Acceptable — minor formatting artifacts
`> 0.20`	Low quality — likely OCR noise or extraction failure
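
The rule and thresholds above fit in a short stdlib sketch. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, "3+ consecutive characters" is read as a run of any characters from the listed set, and empty input is assumed to score 0.0.

```python
import re

# A run of three or more characters drawn from @ # ! ~ * =
_CLUTTER = re.compile(r"[@#!~*=]{3,}")


def noise_ratio_sketch(text: str) -> float:
    """Fraction of lines that are blank or contain symbol clutter."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    noisy = sum(1 for line in lines if not line.strip() or _CLUTTER.search(line))
    return noisy / len(lines)


def noise_signal(ratio: float) -> str:
    """Map a ratio onto the three bands from the table."""
    if ratio < 0.05:
        return "clean"
    if ratio <= 0.20:
        return "acceptable"
    return "low quality"
```

Run on the four-line sample above, the sketch counts the `@@@garbage` line and the blank line as noisy, which reproduces the 0.5 shown there.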

Masking strategies

clean = mask(text, result["pii"], strategy="redact")   # default
clean = mask(text, result["pii"], strategy="token")
clean = mask(text, result["pii"], strategy="hash")
clean = mask(text, result["pii"], strategy="replace")

Strategy	Example output
`redact` (default)	`[REDACTED_EMAIL]`
`replace`	`user@example.com` (static synthetic)
`token`	`<PII_EMAIL_1>` (unique per type per call)
`hash`	`[3d4f9a1b2c8e7f0a]` (SHA-256 first 16 hex chars)
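
To show how the four strategies differ mechanically, here is a minimal sketch that applies them to findings shaped like the `result["pii"]` entries. It is not the library's `mask`: `mask_sketch` and the `_SYNTHETIC` table are hypothetical, the hash is assumed to be taken over the UTF-8 bytes of the matched value, and tokens are assumed to be numbered per type in order of appearance.

```python
import hashlib
from collections import defaultdict

# Hypothetical synthetic values; only the email example appears in the table above
_SYNTHETIC = {"email": "user@example.com"}


def mask_sketch(text: str, findings: list[dict], strategy: str = "redact") -> str:
    """Replace each finding's [start, end) span according to the chosen strategy."""
    counters: dict[str, int] = defaultdict(int)
    planned = []
    for finding in sorted(findings, key=lambda f: f["start"]):
        label = finding["type"].upper()
        if strategy == "redact":
            new = f"[REDACTED_{label}]"
        elif strategy == "token":
            counters[label] += 1
            new = f"<PII_{label}_{counters[label]}>"
        elif strategy == "hash":
            digest = hashlib.sha256(finding["value"].encode("utf-8")).hexdigest()
            new = f"[{digest[:16]}]"
        elif strategy == "replace":
            new = _SYNTHETIC.get(finding["type"], f"[REDACTED_{label}]")
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        planned.append((finding["start"], finding["end"], new))
    # Apply right to left so earlier offsets stay valid
    for start, end, new in reversed(planned):
        text = text[:start] + new + text[end:]
    return text
```

As a rule of thumb, `hash` keeps equal values linkable across documents without exposing them, while `token` keeps them distinguishable only within a single call.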

Quality grade

quality_grade (A–D) and quality_score (0.0–1.0) are composite signals:

Grade	Score	Signal
A	≥ 0.85	Ready for LLM training or RAG
B	≥ 0.65	Usable with minor cleanup
C	≥ 0.40	Review before use
D	< 0.40	Not suitable — empty, too short, or high noise

Score formula:
quality_score = completeness × (0.4 × noise_score + 0.4 × length_score + 0.2)
noise_score = max(0, 1 − garbage_ratio × 10)
length_score = min(char_count / 500, 1.0)
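
The formula and grade thresholds above can be worked through in code. The sketch below transcribes them directly; the function names and the default `completeness` of 1.0 are assumptions made for illustration.

```python
def quality_score_sketch(char_count: int, garbage_ratio: float, completeness: float = 1.0) -> float:
    """Composite score: completeness * (0.4*noise_score + 0.4*length_score + 0.2)."""
    noise_score = max(0.0, 1.0 - garbage_ratio * 10)
    length_score = min(char_count / 500, 1.0)
    return completeness * (0.4 * noise_score + 0.4 * length_score + 0.2)


def quality_grade_sketch(score: float) -> str:
    """Map a 0.0-1.0 score onto the A-D bands from the grade table."""
    if score >= 0.85:
        return "A"
    if score >= 0.65:
        return "B"
    if score >= 0.40:
        return "C"
    return "D"
```

For example, a complete 342-character text with no garbage scores 0.4 × 1.0 + 0.4 × 0.684 + 0.2 = 0.8736, which is grade A. A garbage ratio of 0.10 or more drives `noise_score` to zero, so such a text can reach at most 0.6 (grade C) no matter how long it is.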

Limitations

No automatic language detection — flexorch-audit has zero dependencies. Pass locale explicitly, or use the default "und" to activate all detectors. See LIMITATIONS.md.
Free-standing name detection (without a label prefix) requires NLP/NER — not included.
replace masking uses static synthetic values; locale-aware realistic synthesis is not implemented.
The library audits plain text. PDF/DOCX parsing, e-invoice extraction, and pipeline orchestration are out of scope.

Integrations

flexorch-audit slots into any LangChain or LlamaIndex pipeline as a pre-load filter — audit quality, detect PII, and optionally mask before your documents reach the LLM.

LangChain — examples/langchain_loader.py

from examples.langchain_loader import AuditedLoader  # copy to your project

loader = AuditedLoader(
    texts=my_texts,
    locale="tr",       # or "de", "fr", "us", "und" (all)
    mask_pii=True,     # redact PII before loading
    min_grade="B",     # skip low-quality documents
)
docs = loader.load()
# doc.metadata → {"quality_grade": "A", "quality_score": 0.91, "pii_summary": [...], ...}

LlamaIndex — examples/llamaindex_reader.py

from examples.llamaindex_reader import AuditedReader  # copy to your project

reader = AuditedReader(locale="tr", mask_pii=True)
docs = reader.load_data(my_texts, min_grade="B")
# doc.extra_info → {"quality_grade": "A", "quality_score": 0.91, "pii_summary": [...], ...}

Both loaders are thin wrappers (~60 lines) with no new dependencies beyond langchain-core or llama-index-core. Copy them into your project — no framework lock-in.

Also available for JavaScript / TypeScript

npm install @flexorch/audit

Contributing

See CONTRIBUTING.md.

License

MIT
