Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.
Why
Before feeding documents into an LLM pipeline you need to answer three questions:
- Does this text contain personal data? Sending PII to a language model is a compliance risk.
- Is the text quality high enough? Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
- How bad is the noise? Garbled encodings and symbol clutter degrade model output silently.
Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. flexorch-audit answers all three with one call — using only regex and Python's standard library. No model weights, no network calls, no external packages.
Features
- Quality grade — A/B/C/D composite score: is this text LLM-ready at a glance?
- Noise ratio — line-level symbol clutter detection (
noise_ratio); values above 0.20 indicate likely extraction artifacts - PII detection — 30+ types across 8 countries (TR/DE/FR/IT/NL/ES/UK/US) + universal types; all regex-based with checksum validation
- Batch audit —
audit_batch()aggregates duplicate ratio and PII counts across an entire dataset in one call - Masking — four strategies: redact, replace (synthetic), token, hash
- Zero runtime dependencies — pure Python stdlib, Python 3.10+
Install
pip install flexorch-audit
Quick start
from flexorch_audit import audit, mask text = open("contract.txt").read() # extract from PDF/DOCX first result = audit(text) # "und" by default — all detectors active # result = audit(text, locale="tr") # restrict to TR-only detectors result.quality_grade # "B" result.quality_score # 0.73 (0.0–1.0 composite) result.noise_ratio # 0.04 (fraction of blank/garbage lines; >0.20 = low quality) result.detected_language # "und" (locale you passed in; caller controls language) result.pii_summary # [{"type": "email", "count": 2}, {"type": "national_id_tr", "count": 1}] # Full findings and raw metrics — dict access also works: result["pii"] # [{"type": "email", "value": "ali@example.com", "start": 8, "end": 23}] result["quality"] # {"completeness": 1.0, "avg_length": 342, "duplicate_ratio": None} result["noise"] # {"garbage_ratio": 0.0, "encoding_ok": True} clean = mask(text, result["pii"], strategy="redact") # "Contact: [REDACTED_EMAIL]"
One-shot redaction
from flexorch_audit import redact_for_llm clean = redact_for_llm("TCKN: 12345678950, email: ali@example.com", locale="tr") # "TCKN: [REDACTED_NATIONAL_ID_TR], email: [REDACTED_EMAIL]" # Different masking strategies redact_for_llm(text, locale="tr", strategy="token") # <PII_NATIONAL_ID_TR_1> redact_for_llm(text, locale="tr", strategy="hash") # [3d4f9a1b2c8e7f0a] redact_for_llm(text, locale="tr", strategy="replace") # static synthetic value
No PII found → original text returned unchanged.
Token estimation
from flexorch_audit import estimate_tokens estimate_tokens("The quick brown fox jumps over the lazy dog.") # → 16 estimate_tokens("") # → 0
Heuristic: words × 4/3 — no tiktoken required. Accuracy within ~15% of the real tokenizer for English and most European languages; treat as a planning estimate for context window sizing and cost forecasting.
Batch audit
from flexorch_audit import audit_batch texts = [record["text"] for record in dataset] batch = audit_batch(texts) # locale="und" by default batch["duplicate_ratio"] # 0.12 — fraction of exact-duplicate records batch["avg_quality_score"] # 0.78 batch["pii_summary"] # [{"type": "email", "count": 47}, ...] batch["results"] # list of AuditResult, one per text
Country coverage
locale |
Detectors activated |
|---|---|
"und" (default) |
All locales combined — use when document language is unknown |
"all" |
Alias for "und" |
"tr" |
TCKN · VKN · phone_tr · name · IBAN_TR · company_name_tr · MERSIS · postal_code_tr · province_tr |
"de" |
Steueridentifikationsnummer · Sozialversicherungsnummer |
"fr" |
SIREN · SIRET · INSEE/NIR |
"it" |
Codice Fiscale · Partita IVA |
"nl" |
BSN · KvK |
"es" |
DNI/NIE · CIF |
"uk" |
NI number · UTR |
"us" |
SSN · EIN · ITIN |
"eu" |
E.164 phone · IBAN (EU+GB+CH+NO) · company name |
Universal detectors (always active regardless of locale): email · iban · credit_card · ip · ip_v6
Language detection:
flexorch-auditis zero-dependency — no language detection library is included. Pass the correctlocaleyourself, or use"und"(default) to activate all detectors.
PII types
Universal
| Type | Description |
|---|---|
email |
RFC-5321 email address |
iban |
ISO 13616 IBAN — mod-97 validated; suppressed when iban_tr or iban_intl fires on same span |
credit_card |
16-digit groups, Luhn-validated |
ip |
IPv4 address |
ip_v6 |
IPv6 — full, compressed ::, loopback forms |
Turkey (locale="tr")
| Type | Description |
|---|---|
national_id_tr |
TCKN — 11-digit, modular arithmetic checksum |
tax_id_tr |
VKN — 10-digit, Luhn-variant checksum |
phone_tr |
Turkish mobile: +90/0 prefix + 10 digits |
name |
Label-prefixed name: Adı:, Full Name:, Customer Name:, etc. |
iban_tr |
Turkish IBAN (TR + 24 chars), mod-97 validated |
company_name_tr |
Company with TR legal suffix: A.Ş. · Ltd.Şti. · Koll.Şti. · Koop. · T.A.Ş. |
mersis_no |
MERSIS — 16-digit company registry number |
postal_code_tr |
Turkish postal code (province plate 01–81) |
province_tr |
All 81 Turkish provinces |
Germany (locale="de")
| Type | Description |
|---|---|
tax_id_de |
Steueridentifikationsnummer — 11 digits, ISO 7064 MOD 11,2 checksum |
social_id_de |
Sozialversicherungsnummer — area + DOB + letter + serial |
France (locale="fr")
| Type | Description |
|---|---|
siret_fr |
SIRET — 14 digits, label-prefix gated |
company_id_fr |
SIREN — 9 digits, label-prefix gated |
social_id_fr |
INSEE/NIR — 15 digits, starts with 1 or 2 |
Italy (locale="it")
| Type | Description |
|---|---|
national_id_it |
Codice Fiscale — 16 chars alphanumeric, uppercase normalized |
tax_id_it |
Partita IVA — 11 digits, Agenzia delle Entrate checksum |
Netherlands (locale="nl")
| Type | Description |
|---|---|
national_id_nl |
BSN — 9 digits, 11-check (weighted sum mod 11) |
company_id_nl |
KvK — 8 digits, label-prefix gated |
Spain (locale="es")
| Type | Description |
|---|---|
national_id_es |
DNI (8 digits + letter, mod-23) and NIE (X/Y/Z prefix, same check) |
tax_id_es |
CIF — letter prefix + 7 digits + control character |
United Kingdom (locale="uk")
| Type | Description |
|---|---|
social_id_uk |
NI number — 2 letters + 6 digits + A/B/C/D; HMRC forbidden prefixes excluded |
tax_id_uk |
UTR — 10 digits, label-prefix gated |
United States (locale="us")
| Type | Description |
|---|---|
ssn |
SSN — ###-##-####, invalid prefixes (000/666/9xx) excluded |
tax_id_us |
EIN — XX-XXXXXXX, IRS invalid area prefixes excluded |
national_id_us |
ITIN — 9XX-7X/8X/9X-XXXX middle group validated |
EU / International (locale="eu")
| Type | Description |
|---|---|
phone_intl |
E.164 international phone — 7–15 digits, TR (+90) excluded |
iban_intl |
IBAN for EU+GB+CH+NO — ISO 13616 country+length table + mod-97 |
company_name_intl |
Company with international suffix: GmbH · LLC · S.r.l. · B.V. · SAS · Inc. · Ltd. etc. |
Noise detection
noise_ratio measures the fraction of lines that are blank or contain symbol clutter:
result = audit("clean line\n@@@garbage\n\nclean") result.noise_ratio # 0.5 (2 noisy lines out of 4)
A line is "noisy" when it is blank (after strip) or contains 3+ consecutive characters from @ # ! ~ * =.
noise_ratio |
Signal |
|---|---|
< 0.05 |
Clean — likely well-extracted text |
0.05–0.20 |
Acceptable — minor formatting artifacts |
> 0.20 |
Low quality — likely OCR noise or extraction failure |
Masking strategies
clean = mask(text, result["pii"], strategy="redact") # default clean = mask(text, result["pii"], strategy="token") clean = mask(text, result["pii"], strategy="hash") clean = mask(text, result["pii"], strategy="replace")
| Strategy | Example output |
|---|---|
redact (default) |
[REDACTED_EMAIL] |
replace |
user@example.com (static synthetic) |
token |
<PII_EMAIL_1> (unique per type per call) |
hash |
[3d4f9a1b2c8e7f0a] (SHA-256 first 16 hex chars) |
Quality grade
quality_grade (A–D) and quality_score (0.0–1.0) are composite signals:
| Grade | Score | Signal |
|---|---|---|
| A | ≥ 0.85 | Ready for LLM training or RAG |
| B | ≥ 0.65 | Usable with minor cleanup |
| C | ≥ 0.40 | Review before use |
| D | < 0.40 | Not suitable — empty, too short, or high noise |
Score formula: completeness × (0.4 × noise_score + 0.4 × length_score + 0.2)
length_score = min(char_count / 500, 1.0) · noise_score = max(0, 1 − garbage_ratio × 10)
Limitations
- No automatic language detection —
flexorch-audithas zero dependencies. Passlocaleexplicitly, or use the default"und"to activate all detectors. See LIMITATIONS.md. - Free-standing name detection (without a label prefix) requires NLP/NER — not included.
replacemasking uses static synthetic values; locale-aware realistic synthesis is not implemented.- The library audits plain text. PDF/DOCX parsing, e-invoice extraction, and pipeline orchestration are out of scope.
Integrations
flexorch-audit slots into any LangChain or LlamaIndex pipeline as a pre-load filter — audit quality, detect PII, and optionally mask before your documents reach the LLM.
LangChain — examples/langchain_loader.py
from examples.langchain_loader import AuditedLoader # copy to your project loader = AuditedLoader( texts=my_texts, locale="tr", # or "de", "fr", "us", "und" (all) mask_pii=True, # redact PII before loading min_grade="B", # skip low-quality documents ) docs = loader.load() # doc.metadata → {"quality_grade": "A", "quality_score": 0.91, "pii_summary": [...], ...}
LlamaIndex — examples/llamaindex_reader.py
from examples.llamaindex_reader import AuditedReader # copy to your project reader = AuditedReader(locale="tr", mask_pii=True) docs = reader.load_data(my_texts, min_grade="B") # doc.extra_info → {"quality_grade": "A", "quality_score": 0.91, "pii_summary": [...], ...}
Both loaders are thin wrappers (~60 lines) with no new dependencies beyond langchain-core or llama-index-core. Copy them into your project — no framework lock-in.
Also available for JavaScript / TypeScript
npm install @flexorch/audit
Contributing
See CONTRIBUTING.md.
License
MIT