A high-performance name classifier that infers probabilistic attributes about a person from their name alone. Input may be a first name, full name, or any combination — e.g. "Steven" or "Raúl Guillermo Rodríguez Castro".
Core premise
A name is a short sequence of characters drawn from a cultural naming tradition. That tradition encodes a surprising amount of information: gender norms, geographic origin, and linguistic family. The classifier's job is to decode that encoding.
Two things follow from this:
- Character-level morphology is the fundamental signal. The suffix `-ova` marks Slavic female names, `-son` marks Nordic/Germanic patronymics, `-inho` marks Portuguese diminutives. A classifier that cannot generalize from these patterns will have a hard ceiling — no dictionary is large enough to cover all names in all transliterations.
- Uncertainty is a first-class output. "Alex" is genuinely ambiguous by gender. "Kim" is ambiguous by both gender and origin. The output should represent that uncertainty faithfully, not paper over it with a majority-class label.
Attributes
| Attribute | Notes |
|---|---|
| Gender | Probability distribution over {male, female, neutral/unisex}. Not a binary. |
| Cultural origin | The naming tradition the family name belongs to. 11 traditions: English, French, Germanic, Nordic, Iberian, Italian, Eastern European, Arabic, East Asian, South Asian, Southeast Asian. Detected from the family name token and used to refine gender predictions for culturally ambiguous given names (e.g. "Jean" → French → male, "Priya" → South Asian → female). |
| Script family | Inferred from input encoding: Latin, Cyrillic, Arabic, CJK, Devanagari (covers Bengali/Gurmukhi/Gujarati/Tibetan), Hangul, Hebrew, Myanmar (covers Thai/Lao/Khmer/Burmese). Drives tokenisation order and provides deterministic origin for non-Latin input. |
| Token roles | Each whitespace-delimited token is classified as given, family, middle, or particle. Family tokens carry origin signal; given tokens carry gender signal. |
Attributes deliberately excluded from v1: language phonotactics, generational era, socioeconomic class, religion, ethnicity.
Prediction architecture: ensemble, not cascade
An earlier design used a tiered cascade — dictionary first, then statistical NLP, then AI/ML — escalating when confidence fell below a threshold. This design replaces that with a parallel ensemble.
Why not a cascade?
A cascade assumes tier 1 knows when to pass. In practice:
- A dictionary lookup can return a spuriously high confidence for an ambiguous name (e.g. "Kim" appears heavily in both Korean and English female-name dictionaries).
- If tier 1 fires, tier 2 and tier 3 never run, so their evidence never contributes.
- The escalation threshold is a hyperparameter that is hard to tune and varies by name type.
The ensemble
Every input runs through all classifiers simultaneously. Each classifier produces a probability distribution and a reliability estimate based on how well-represented this input is in its training domain. A fusion layer combines them into a final output.
```
Input name
    │
    ├──► Surname lookup      → sur_prob (given↔family score)
    │
    ├──► Lexicon lookup      → P(gender | name), coverage flag
    │
    ├──► N-gram LM ensemble  → P(gender | name), perplexity
    │
    └──► Neural (stub)       → uniform distribution placeholder
              │
              ▼
        Fusion layer
        (learned weights + Platt
         scaling per class)
              │
              ▼
        Origin detection
        (script prior → hash table
         → n-gram fallback)
              │
              ▼
        Origin correction
        (blend given-name probs
         toward origin-specific prior)
              │
              ▼
        Calibrated output +
        provenance metadata
```
The fusion layer is not neural — it is a simple learned linear combination over the classifier outputs, fit on held-out calibration data. This is fast, interpretable, and avoids the cascade's blind spots.
For a well-known, unambiguous name like "Mary", all classifiers agree strongly and the ensemble collapses to near-certainty. For an ambiguous or novel name, disagreement between classifiers is itself a signal that gets encoded in the output confidence.
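The fusion step can be sketched in Python as a weighted linear combination followed by per-class Platt scaling. The classifier outputs, weights, and Platt parameters below are illustrative stand-ins, not the trained values from `models/fusion_gender.bin`:

```python
import math

def platt(p, a=1.0, b=0.0):
    """Platt scaling: logistic recalibration of a probability's log-odds.
    a=1, b=0 is the identity; real parameters are fit on held-out data."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    z = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-(a * z + b)))

def fuse(distributions, weights, platt_params=None):
    """Weighted linear combination of per-classifier distributions,
    per-class Platt scaling, then renormalization."""
    classes = list(distributions[0])
    combined = {c: sum(w * d[c] for d, w in zip(distributions, weights))
                for c in classes}
    params = platt_params or {c: (1.0, 0.0) for c in classes}
    scaled = {c: platt(combined[c], *params[c]) for c in classes}
    z = sum(scaled.values())
    return {c: v / z for c, v in scaled.items()}

# Illustrative classifier outputs for "Mary" (weights made up):
lexicon = {"male": 0.02, "female": 0.97, "neutral": 0.01}
ngram   = {"male": 0.08, "female": 0.88, "neutral": 0.04}
neural  = {"male": 1/3,  "female": 1/3,  "neutral": 1/3}   # stub: uniform
fused = fuse([lexicon, ngram, neural], weights=[0.48, 0.39, 0.13])
```

Because the combination is linear, the contribution of each classifier is directly readable from the weights — which is what makes the provenance output possible.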
The n-gram model is the workhorse
The lexicon covers known names; character n-gram language models do the most useful work on everything else.
Train one character-level n-gram LM per class (male, female, neutral) on large corpora. At query time, compute the likelihood of the input under each class model and use Bayes' rule to infer class posteriors.
```
P(class | name) ∝ P(name | class) · P(class)
                  └─ n-gram LM      └─ base rate from corpus
```
Why n-grams for names specifically:
- Names are short (typically 3–20 characters). The entire morphological structure fits in a few n-gram lookups. A 4-gram model captures essentially all the phonotactic signal.
- A 4-gram model with Kneser-Ney smoothing over a large name corpus fits in a few MB of flat tables. Inference is a handful of array lookups and multiplications — microseconds per name.
- N-gram models generalize gracefully to unseen names through smoothing. A name the model has never seen still gets a meaningful score based on its character sequences.
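The per-class LM scheme can be sketched as follows. Add-alpha smoothing stands in for Kneser-Ney to keep the sketch short, and the toy corpora are invented for illustration — the real models are trained on the merged SSA + INSEE data:

```python
import math
from collections import defaultdict

def train(names, n=4):
    """Count character n-grams over names padded with '^'/'$' sentinels."""
    counts, ctx_counts = defaultdict(int), defaultdict(int)
    for name in names:
        s = "^" * (n - 1) + name + "$"
        for i in range(n - 1, len(s)):
            counts[(s[i - n + 1:i], s[i])] += 1
            ctx_counts[s[i - n + 1:i]] += 1
    return counts, ctx_counts

def logprob(name, counts, ctx_counts, n=4, alpha=0.1, vocab=28):
    """Add-alpha smoothed log-likelihood of the name under one class LM."""
    s = "^" * (n - 1) + name + "$"
    lp = 0.0
    for i in range(n - 1, len(s)):
        ctx, ch = s[i - n + 1:i], s[i]
        lp += math.log((counts[(ctx, ch)] + alpha) /
                       (ctx_counts[ctx] + alpha * vocab))
    return lp

# Toy corpora for illustration only:
female = train(["maria", "anna", "olga", "irina", "petrova", "ivanova"])
male   = train(["ivan", "peter", "olaf", "marco", "petrov", "ivanov"])

def p_female(name, prior=0.5):
    """Bayes' rule: P(class | name) ∝ P(name | class) · P(class)."""
    zf = math.exp(logprob(name, *female)) * prior
    zm = math.exp(logprob(name, *male)) * (1 - prior)
    return zf / (zf + zm)

p = p_female("smirnova")   # unseen, but the -ova suffix pushes it female
```

Even with this toy corpus, "smirnova" scores strongly female purely through the `-ova` character sequences — the generalization property the bullets above describe.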
The lexicon short-circuits n-gram scoring for high-frequency names — an internal optimization, not a separate tier, and invisible to callers.
Full name decomposition
A full name is not just a longer string — its components carry different types of signal and should be modeled separately.
```
        "Jean Carteaux"
               │
               ▼
     Tokenizer + surname model
               │
      ┌────────┴────────────┐
      │                     │
  given name            family name
  "Jean"                "Carteaux"
  role=given            role=family
  role_weight=1.0       role_weight≈0.04
                        (sur_prob=0.91)
      │                     │
      ▼                     ▼
  gender model          origin model
  male=0.05 (English)   → "french"
  female=0.95 (English)
      │
      ▼
  origin correction
  (French context:
   Jean → male=0.97)
      │
      ▼
  Final: male=0.770
```
Two-pass inference for full names:
1. Normalise all tokens → assign roles (given/family) + detect context origin from the family token
2. Classify given-name tokens with origin correction blended in (α = 0.85)
The surname model (models/surname.bin) provides a sur_prob ∈ [0,1] for each token. Family tokens with high sur_prob receive a much lower weight in the gender ensemble (weight = 0.4 × (1 − sur_prob)), so common surnames like "Johnson" or "Bell" don't dilute the gender signal from the given name.
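The role weighting can be sketched as below. The per-token distributions and sur_prob values are illustrative, and the 0.5 threshold for deciding a token's role is an assumption of this sketch:

```python
def role_weight(sur_prob, threshold=0.5):
    """Given-name tokens keep full weight; tokens treated as family
    (sur_prob >= threshold, an assumption of this sketch) are scaled
    by 0.4 * (1 - sur_prob) as described above."""
    if sur_prob < threshold:
        return 1.0
    return 0.4 * (1.0 - sur_prob)

def combine_gender(tokens):
    """Weighted average of per-token gender distributions."""
    total = {"male": 0.0, "female": 0.0, "neutral": 0.0}
    wsum = 0.0
    for dist, sur_prob in tokens:
        w = role_weight(sur_prob)
        wsum += w
        for k in total:
            total[k] += w * dist[k]
    return {k: v / wsum for k, v in total.items()}

# "Mary Johnson": the surname's near-even gender stats barely register.
mary    = ({"male": 0.01, "female": 0.98, "neutral": 0.01}, 0.03)
johnson = ({"male": 0.45, "female": 0.35, "neutral": 0.20}, 0.95)
out = combine_gender([mary, johnson])
```

With sur_prob = 0.95, "Johnson" contributes weight 0.4 × 0.05 = 0.02 against "Mary"'s 1.0, so the given name dominates the final distribution.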
Origin detection
Origin is detected from the family (surname) token using a three-layer pipeline:
Layer 1: Script prior (deterministic)
Non-Latin scripts give near-certain origin at zero model cost, applied before any dictionary or n-gram lookup:
| Script detected | Origin assigned |
|---|---|
| CJK (Chinese, Japanese) | east_asian |
| Hangul (Korean) | east_asian |
| Devanagari (Hindi, Bengali, Gurmukhi, Gujarati, Tibetan) | south_asian |
| Myanmar (Burmese, Thai, Lao, Khmer) | southeast_asian |
| Arabic | arabic (model lookup continues for confirmation) |
| Cyrillic | eastern_european (model lookup continues) |
| Latin | falls through to hash-table lookup |
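The layer-1 script prior can be sketched with a few Unicode block ranges. The ranges here are illustrative subsets; the real implementation lives in `src/unicode/` and covers more blocks per family:

```python
# Illustrative Unicode block ranges (not exhaustive):
RANGES = [
    ((0x0400, 0x04FF), "cyrillic"),
    ((0x0590, 0x05FF), "hebrew"),
    ((0x0600, 0x06FF), "arabic"),
    ((0x0900, 0x097F), "devanagari"),
    ((0x0E00, 0x0E7F), "myanmar"),   # Thai block, grouped under Myanmar per the table
    ((0x1000, 0x109F), "myanmar"),
    ((0x4E00, 0x9FFF), "cjk"),
    ((0xAC00, 0xD7AF), "hangul"),
]

def detect_script(name):
    """Return the script family of the first recognized non-Latin code
    point, defaulting to "latin" (which falls through to the hash table)."""
    for ch in name:
        cp = ord(ch)
        for (lo, hi), script in RANGES:
            if lo <= cp <= hi:
                return script
    return "latin"

# Deterministic origin assignment for non-Latin scripts:
SCRIPT_ORIGIN = {
    "cjk": "east_asian", "hangul": "east_asian",
    "devanagari": "south_asian", "myanmar": "southeast_asian",
    "arabic": "arabic", "cyrillic": "eastern_european",
}
```

For example, `detect_script("山田")` returns `"cjk"`, which maps to `east_asian` with no model lookup at all.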
Layer 2: Hash-table lookup (exact, high precision)
The surname origin model (models/surname_origin.bin) is a FNV-1a-64 hash table mapping normalized surname → origin ID + confidence (0–100). Entries with confidence ≥ 60 are accepted. The lookup iterates all family tokens; the first high-confidence hit wins. Compound surnames (e.g. "Bianchi-Ferraro") are split on hyphens and each part is looked up independently.
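The hash function itself is standard FNV-1a-64. A sketch of the layer-2 lookup, with a toy in-memory dict standing in for the on-disk slot table (and plain lowercasing standing in for the full normalization pipeline):

```python
def fnv1a_64(data: bytes) -> int:
    """FNV-1a 64-bit: XOR each byte into the hash, then multiply by
    the FNV prime, keeping 64 bits."""
    h = 0xCBF29CE484222325            # FNV-1a 64-bit offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

def lookup_origin(surname, table, min_conf=60):
    """Layer-2 sketch: split compound surnames on hyphens and accept
    the first part whose entry has confidence >= 60."""
    for part in surname.lower().split("-"):
        hit = table.get(fnv1a_64(part.encode("utf-8")))
        if hit is not None and hit[1] >= min_conf:
            return hit[0]
    return None

# Toy table standing in for models/surname_origin.bin:
table = {fnv1a_64(b"rossi"):   ("italian", 92),
         fnv1a_64(b"ferraro"): ("italian", 88)}
```

So `lookup_origin("Bianchi-Ferraro", table)` misses on "bianchi" but hits on "ferraro", returning `"italian"`.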
Layer 3: N-gram fallback (generalisation)
If no hash-table hit is found, the n-gram origin classifier (models/surname_origin_ngram.bin) provides softmax scores over all 11 origin classes. Predictions above a 0.45 probability threshold are accepted. This handles unseen surnames whose character morphology is distinctive (e.g. -son for Nordic, -enko for Ukrainian, -nen for Finnish).
Origin × gender correction
Once origin is detected, gender probabilities for the given-name token are blended toward an origin-specific prior. This corrects culturally ambiguous names:
| Name | Without origin | With origin |
|---|---|---|
| Jean Carteaux | female=0.95 (English prior) | male=0.77 (French prior) |
| Andrea Rossi | female=0.92 | male=0.80 (Italian prior) |
| Priya Singh | neutral ≈ 0.5 | female=0.997 (South Asian prior) |
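A sketch of the blend, assuming α = 0.85 weights the origin-specific prior. The priors shown are illustrative, so the exact output numbers differ from the table above:

```python
def origin_correct(given_probs, origin_prior, alpha=0.85):
    """Blend the given-name gender distribution toward the
    origin-specific prior, then renormalize."""
    blended = {k: alpha * origin_prior.get(k, 0.0) + (1 - alpha) * p
               for k, p in given_probs.items()}
    z = sum(blended.values())
    return {k: v / z for k, v in blended.items()}

# "Jean": strongly female under the English prior, male under the French one.
jean_english = {"male": 0.05, "female": 0.95, "neutral": 0.00}
french_prior = {"male": 0.97, "female": 0.03, "neutral": 0.00}
corrected = origin_correct(jean_english, french_prior)
```

The high α reflects a design choice: once the surname has pinned down a naming tradition, that tradition's convention for the given name should dominate the English-skewed base rates.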
Design goals
**Calibration over accuracy.** If the model says 70% confidence, it should be right 70% of the time — not 85% and not 55%. Raw accuracy optimizes for the majority class and hides uncertainty. Calibration is measured explicitly and model outputs are post-processed with Platt scaling.
**Batch throughput as a primary metric.** Individual-name latency matters, but the most common use case is classifying millions of names in a data pipeline.
**Provenance in the output.** Every response includes which classifiers contributed and their individual weights. This is essential for debugging and understanding why a prediction was made.
**Immutable, versioned model artifacts.** Models are data. They are stored alongside the code that generated them and never mutated in place.
**Embeddable.** The core classifier is a C library with a stable public header. It links into any process and carries no runtime dependencies beyond utf8proc.
Language and implementation
Core library: C + utf8proc
The core library is written in C for performance and minimal footprint. Unicode is handled via utf8proc, a small MIT-licensed C library (same as used by PostgreSQL, Julia, libgit2).
Normalization pipeline: utf8proc CASEFOLD+STRIPMARK → ASCII fold table → keep a-z only.
Vocabulary: a-z = 0–25, ^ = 26 (start sentinel), $ = 27 (end sentinel). V = 28.
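A sketch of this pipeline in Python, using `unicodedata`'s NFKD decomposition plus combining-mark removal as a stand-in for utf8proc's CASEFOLD+STRIPMARK and the ASCII fold table:

```python
import unicodedata

def normalize(name):
    """Casefold, decompose (NFKD) and drop combining marks (the
    STRIPMARK stand-in), then keep only a-z."""
    s = unicodedata.normalize("NFKD", name.casefold())
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    return "".join(ch for ch in s if "a" <= ch <= "z")

def encode(name):
    """Map into the 28-symbol vocabulary: a-z -> 0..25,
    '^' (start) = 26, '$' (end) = 27."""
    return [26] + [ord(ch) - ord("a") for ch in normalize(name)] + [27]

ids = encode("Raúl")   # "Raúl" -> "raul" -> [26, 17, 0, 20, 11, 27]
```

Note the difference in edge cases: utf8proc's fold tables handle characters (e.g. ø) that NFKD alone cannot decompose, so this sketch is an approximation, not a drop-in replacement.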
Model training: Python
Standard Python tooling: pandas, numpy, scipy, scikit-learn. No deep learning framework — all models are fast analytical fits (MLE, KN smoothing, SGD n-gram classifiers, logistic regression). Training is fully reproducible given content-addressed input data.
Python bindings: cffi
The Python package wraps the C library using cffi in ABI mode. The binding reads the public header and provides a NameClassifier class matching the C API exactly.
CLI
A thin CLI written in C links the core library directly (see CLI usage below).
Build
```
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/nc --help
```
The default build produces build/nc — a fully self-contained binary with all model files embedded (~38 MB). Requires the training pipeline to have been run first (models must exist in models/).
For a lightweight binary that loads models from disk at runtime (~60 KB):
```
cmake -B build -DCMAKE_BUILD_TYPE=Release -DNC_EMBED_MODELS=OFF
cmake --build build -j$(nproc)
./build/nc-slim -m models "Jean Carteaux"
```
Run unit tests:
```
for t in build/tests/test_*; do NC_MODEL_DIR=models $t; done
```
Data & training pipeline
Run these scripts in order. Each step caches its output; re-running is safe.
| # | Script | Output | Notes |
|---|---|---|---|
| 1 | py/data/fetch_ssa.py | data/processed/ssa_names.csv | US SSA baby names 1880–present. Columns: name, male_count, female_count, total_count. |
| 2 | py/data/fetch_insee.py | data/processed/insee_names.csv | French INSEE census (34K names). Requires data/raw/external/insee_nat2021.zip to be manually placed. |
| 3 | py/data/merge_sources.py | data/processed/names_combined.csv | Merges SSA + INSEE with per-source median scaling so each source contributes equally by unique-name count. |
| 4 | py/training/train_ngram.py | models/ngram_gender.bin (~7 MB) | Interpolated KN-smoothed character 4-gram LMs (one per class). |
| 5 | py/training/train_lexicon.py | models/lexicon_gender.bin | Laplace-smoothed MLE whole-name lexicon. LAPLACE_K=0.1. |
| 6 | py/training/train_fusion.py | models/fusion_gender.bin (192 B) | Logistic-regression fusion weights + Platt scaling over lexicon/ngram/neural outputs. |
| 7 | py/data/fetch_surnames.py | data/processed/surnames.csv | US Census 2010 zip (place at data/raw/surnames/names.zip) + Wikidata Q101352 + embedded fallback. |
| 8 | py/training/train_surname.py | models/surname.bin (~12 MB) | Maps name → sur_prob ∈ [0,1] (0=given, 1=surname). FNV-1a-64 hash table. |
| 9 | py/data/fetch_surname_origins.py | data/processed/surname_origins.csv | Knowledge-seeded dict (~150–200 surnames per origin, 11 origins) + optional Wikidata P495 enrichment. |
| 10 | py/data/fetch_given_name_origins.py | data/processed/given_name_origins.csv | Given names whose gender differs from English priors by cultural context + optional Wikidata Q12308941/Q11879590. MIN_DELTA=0.10. |
| 11 | py/training/train_surname_origin.py | models/surname_origin.bin (~24 KB) | Maps surname → origin_id (uint8) + confidence (0–100). "SORG" magic, 12-byte slots. |
| 12 | py/training/train_given_origin.py | models/given_origin_gender.bin (~16 KB) | Maps (given_name, origin_id) → (prob_male, prob_female). Only stores entries where origin gender differs from English by >0.25. "GOGG" magic. |
| 13 | py/training/train_surname_origin_ngram.py | models/surname_origin_ngram.bin (~2 MB) | Character n-gram softmax classifier for surname origin (11 classes). Fallback for surnames not in the hash table. SGD-trained, temperature-calibrated. "SNGM" magic. |
Steps 1–6 produce the core gender classification models. Steps 7–13 produce the surname-aware role assignment and origin detection models. All 13 model files are required for full accuracy; the system degrades gracefully if any are absent.
Manual prerequisites:
- Step 2: Download `insee_nat2021.zip` from the INSEE website and place at `data/raw/external/insee_nat2021.zip`
- Step 7: Download the Census 2010 surname file from the Census Bureau and place at `data/raw/surnames/names.zip`
CLI usage
```
# Classify a name (human-readable output)
./build/nc Mary

# Classify with JSON output
./build/nc -j "Raúl Guillermo Rodríguez Castro"

# Parse mode: show token roles and surname scores only
./build/nc -p "Jean Carteaux"
./build/nc -p -j "Jean Carteaux"

# Custom model directory
./build/nc -m /path/to/models "Andrea Rossi"

# Batch mode: read names from file, one per line
./build/nc -b names.txt

# Force n-gram origin model (skip exact hash-table lookup)
./build/nc --ngram-origin "Jean Dupont"

# Print version
./build/nc -v
```
Output format
```json
{
  "input": "Jean Carteaux",
  "script": "latin",
  "components": [
    { "token": "Jean", "role": "given", "index": 0, "surname_score": 0.002 },
    { "token": "Carteaux", "role": "family", "index": 1, "surname_score": 0.386 }
  ],
  "attributes": {
    "gender": {
      "male": 0.770,
      "female": 0.217,
      "neutral": 0.013
    },
    "origin": {
      "english": 0.0,
      "french": 1.0,
      "germanic": 0.0,
      "nordic": 0.0,
      "iberian": 0.0,
      "italian": 0.0,
      "eastern_european": 0.0,
      "arabic": 0.0,
      "east_asian": 0.0,
      "south_asian": 0.0,
      "southeast_asian": 0.0
    }
  },
  "calibrated": true,
  "model_version": "dev",
  "provenance": {
    "gender": { "lexicon": 0.482, "ngram": 0.389, "neural": 0.130 }
  }
}
```
Parse mode (`-p`) omits attributes and provenance, showing only token roles, surname scores, and detected origin — useful for debugging name decomposition:
```
Jean       given   sur=0.002
Carteaux   family  sur=0.386
Origin: french
```
Repository layout
```
name-classifier/
├── include/
│   └── nameclass.h        # stable public C API
├── src/
│   ├── core/              # classify.c — ensemble pipeline, two-pass inference
│   ├── lexicon/           # FNV hash table, MLE log-prob lookup
│   ├── ngram/             # KN-smoothed 4-gram LM engine
│   ├── neural/            # stub returning uniform distribution
│   ├── fusion/            # learned fusion weights + Platt scaling
│   ├── surname/           # surname.bin loader; sur_prob lookup
│   ├── origin/            # origin.bin + origin_ngram.bin loader; given-origin correction
│   ├── unicode/           # utf8proc integration, normalization, script detection
│   └── cli/               # command-line interface (main.c, render.c)
├── py/
│   ├── name_classifier/   # cffi ABI-mode Python binding + NameClassifier class
│   ├── training/          # 7 training scripts (steps 4–6, 8, 11–13)
│   ├── data/              # 6 data fetch + merge scripts (steps 1–3, 7, 9–10)
│   └── eval/              # run_benchmark.py, run_origin_eval.py, benchmark JSON files
├── models/                # trained model artifacts (gitignored except .gitkeep)
├── data/
│   ├── raw/               # unprocessed source data (gitignored)
│   ├── processed/         # normalized training CSVs (gitignored)
│   └── SOURCES.md
├── tests/                 # 7 unit test binaries (CMake)
│   └── unit/
│       ├── test_normalize.c   # UTF-8 normalization + ASCII fold
│       ├── test_ngram.c       # n-gram LM scoring
│       ├── test_lexicon.c     # lexicon hash-table lookup
│       ├── test_fusion.c      # fusion softmax + weight arithmetic
│       ├── test_classify.c    # end-to-end gender + script classification
│       ├── test_script.c      # nc_detect_script() for all 10 script types
│       └── test_origin.c      # origin detection (hash table, script priors, n-gram, edge cases)
└── third_party/
    └── utf8proc/          # vendored header
```
Data sources
| Source | Used for | Steps |
|---|---|---|
| US SSA baby names (1880–present) | Gender training data | 1 |
| INSEE France census (insee_nat2021.zip) | French given name gender data | 2 |
| Wikidata Q101352 (family name items) | Surname vocabulary | 7 |
| Wikidata Q12308941/Q11879590 (given name gender) | Per-origin given name gender | 10 |
| Wikidata P495 (country of origin) | Surname origin labels | 9 |
| US Census 2010 surnames (names.zip) | Surname probability | 7 |
| Knowledge-seeded dicts | Surname origins + given-name/origin gender (11 origins) | 9, 10 |
All sources are documented in data/SOURCES.md.
Benchmarks
```
# Gender accuracy
python3 py/eval/run_benchmark.py --benchmark py/eval/benchmark_names.json
python3 py/eval/run_benchmark.py --benchmark py/eval/benchmark_extended.json

# Origin accuracy (requires data/raw/world_names.jsonl)
python3 py/eval/run_origin_eval.py
```
Current accuracy (with all 13 models trained):
| Benchmark | Score |
|---|---|
| benchmark_names.json (common names) | 98.4% |
| benchmark_extended.json (edge cases) | 95.4% |
| Cross-cultural (Jean/Andrea/Priya with surnames) | 100% (6/6) |
| Origin detection (world_names.jsonl, 106 names) | 70.8% |
Per-origin accuracy:
| Origin | Accuracy | Notes |
|---|---|---|
| nordic | 100% | Finnish -nen pattern covered |
| east_asian | 100% | Script prior (CJK/Hangul) + romanized surnames |
| italian | 100% | |
| eastern_european | 82% | Ukrainian -enko pattern covered |
| iberian | 79% | |
| arabic | 71% | |
| french | 60% | |
| germanic | 60% | |
| south_asian | 50% | Romanized Indian names are heterogeneous |
| southeast_asian | 50% | Vietnamese/Indonesian romanizations overlap with other origins |
| english | 40% | Irish/Scottish Gaelic names skew results |
Key improvements from surname-aware classification:
| Name | Without surname model | With |
|---|---|---|
| Michael Bell | male=0.727 | male=0.980 |
| Mary Johnson | female=0.711 | female=0.991 |
| Jean Carteaux | female=0.950 | male=0.770 |
| Jan van Dyck | female=0.850 | male=0.857 |
| Andrea Rossi | female=0.920 | male=0.804 |
| Priya Singh | neutral ≈ 0.5 | female=0.997 |
Trade-offs
**The ensemble requires calibration data.** The fusion weights must be learned from labeled data, which requires a disciplined training and evaluation pipeline from the start.
**Parallel classifiers use more memory than a cascade that runs only one model at a time.** Total model footprint is ~22 MB on disk; at runtime models are memory-mapped.
**C requires explicit safety discipline.** Memory safety is the programmer's responsibility. This is managed through AddressSanitizer builds and unit tests — but remains a cost that a memory-safe language would eliminate.
**The neural tier is a stub.** The fusion layer currently assigns it weight ≈ 0; it returns a uniform distribution. The architecture reserves a slot for a future character-level model.
**Origin coverage is uneven.** The 11-origin system covers the major Eurasian naming traditions well. Sub-Saharan African, Greek, Persian/Farsi, and Polynesian names fall through to NC_ORIGIN_UNKNOWN by design — naming patterns are too heterogeneous for reliable n-gram classification at this data scale.