douglas-larocca/name-classifier


A high-performance name classifier that infers probabilistic attributes about a person from their name alone. Input may be a first name, full name, or any combination — e.g. "Steven" or "Raúl Guillermo Rodríguez Castro".

Core premise

A name is a short sequence of characters drawn from a cultural naming tradition. That tradition encodes a surprising amount of information: gender norms, geographic origin, and linguistic family. The classifier's job is to decode that encoding.

Two things follow from this:

  1. Character-level morphology is the fundamental signal. The suffix -ova marks Slavic female names. -son marks Nordic/Germanic patronymics. -inho marks Portuguese diminutives. A classifier that cannot generalize from these patterns will have a hard ceiling — no dictionary is large enough to cover all names in all transliterations.

  2. Uncertainty is a first-class output. "Alex" is genuinely ambiguous by gender. "Kim" is ambiguous by both gender and origin. The output should represent that uncertainty faithfully, not paper over it with a majority-class label.

Attributes

| Attribute | Notes |
|---|---|
| Gender | Probability distribution over {male, female, neutral/unisex}. Not a binary. |
| Cultural origin | The naming tradition the family name belongs to. 11 traditions: English, French, Germanic, Nordic, Iberian, Italian, Eastern European, Arabic, East Asian, South Asian, Southeast Asian. Detected from the family name token and used to refine gender predictions for culturally ambiguous given names (e.g. "Jean" → French → male, "Priya" → South Asian → female). |
| Script family | Inferred from the input's Unicode code points: Latin, Cyrillic, Arabic, CJK, Devanagari (covers Bengali/Gurmukhi/Gujarati/Tibetan), Hangul, Hebrew, Myanmar (covers Thai/Lao/Khmer/Burmese). Drives tokenisation order and provides deterministic origin for non-Latin input. |
| Token roles | Each whitespace-delimited token is classified as given, family, middle, or particle. Family tokens carry origin signal; given tokens carry gender signal. |

Attributes deliberately excluded from v1: language phonotactics, generational era, socioeconomic class, religion, ethnicity.

Prediction architecture: ensemble, not cascade

An earlier design used a tiered cascade — dictionary first, then statistical NLP, then AI/ML — escalating when confidence fell below a threshold. This design replaces that with a parallel ensemble.

Why not a cascade?

A cascade assumes tier 1 knows when to pass. In practice:

  • A dictionary lookup can return a spuriously high confidence for an ambiguous name (e.g. "Kim" appears heavily in both Korean and English female-name dictionaries).
  • If tier 1 fires, tier 2 and tier 3 never run, so their evidence never contributes.
  • The escalation threshold is a hyperparameter that is hard to tune and varies by name type.

The ensemble

Every input runs through all classifiers simultaneously. Each classifier produces a probability distribution and a reliability estimate based on how well-represented this input is in its training domain. A fusion layer combines them into a final output.

Input name
    │
    ├──► Surname lookup       → sur_prob (given↔family score)
    │
    ├──► Lexicon lookup       → P(gender | name), coverage flag
    │
    ├──► N-gram LM ensemble   → P(gender | name), perplexity
    │
    └──► Neural (stub)        → uniform distribution placeholder
              │
              ▼
        Fusion layer
    (learned weights + Platt
    scaling per class)
              │
              ▼
     Origin detection
    (script prior → hash table
     → n-gram fallback)
              │
              ▼
     Origin correction
    (blend given-name probs
    toward origin-specific prior)
              │
              ▼
     Calibrated output +
     provenance metadata

The fusion layer is not neural — it is a simple learned linear combination over the classifier outputs, fit on held-out calibration data. This is fast, interpretable, and avoids the cascade's blind spots.

For a well-known, unambiguous name like "Mary", all classifiers agree strongly and the ensemble collapses to near-certainty. For an ambiguous or novel name, disagreement between classifiers is itself a signal that gets encoded in the output confidence.
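As a concrete sketch of the fusion step — the per-classifier weights below are the provenance values from the example output later in this document, but the Platt parameters are hypothetical placeholders, not the trained values from models/fusion_gender.bin:

```python
import math

def platt(score, a, b):
    # Platt scaling: map a raw fused score to a calibrated probability
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def fuse(outputs, weights, platt_params):
    """outputs: {classifier: {class: prob}}; weights: learned linear
    combination; platt_params: per-class (a, b) — illustrative values."""
    classes = next(iter(outputs.values())).keys()
    fused = {}
    for c in classes:
        # learned linear combination over classifier probabilities
        s = sum(weights[m] * outputs[m][c] for m in outputs)
        a, b = platt_params[c]
        fused[c] = platt(s, a, b)
    total = sum(fused.values())
    return {c: p / total for c, p in fused.items()}  # renormalise

outputs = {
    "lexicon": {"male": 0.05, "female": 0.93, "neutral": 0.02},
    "ngram":   {"male": 0.10, "female": 0.85, "neutral": 0.05},
    "neural":  {"male": 1 / 3, "female": 1 / 3, "neutral": 1 / 3},  # stub
}
weights = {"lexicon": 0.482, "ngram": 0.389, "neural": 0.130}
platt_params = {c: (4.0, -2.0) for c in ("male", "female", "neutral")}
result = fuse(outputs, weights, platt_params)
```

The real fusion weights are fit on held-out calibration data; the point of the sketch is only that the combination is a cheap linear pass, not a neural network.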

The n-gram model is the workhorse

The lexicon covers known names. Character n-gram language models do the most useful work.

Train one character-level n-gram LM per class (male, female, neutral) on large corpora. At query time, compute the likelihood of the input under each class model and use Bayes' rule to infer class posteriors.

P(class | name) ∝ P(name | class) · P(class)
                   └─ n-gram LM    └─ base rate from corpus

Why n-grams for names specifically:

  • Names are short (typically 3–20 characters). The entire morphological structure fits in a few n-gram lookups. A 4-gram model captures essentially all the phonotactic signal.
  • A 4-gram model with Kneser-Ney smoothing over a large name corpus fits in a few MB of flat tables. Inference is a handful of array lookups and multiplications — microseconds per name.
  • N-gram models generalize gracefully to unseen names through smoothing. A name the model has never seen still gets a meaningful score based on its character sequences.
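The scoring loop can be sketched in a few lines. This toy version substitutes add-k smoothing for the interpolated Kneser-Ney smoothing the real models use, and trains on a six-name corpus per class:

```python
import math
from collections import defaultdict

def train_lm(names, n=4, k=0.5):
    # count n-grams and their (n-1)-gram contexts; smoothing applied at query time
    grams, ctx = defaultdict(int), defaultdict(int)
    for name in names:
        s = "^" * (n - 1) + name + "$"
        for i in range(len(s) - n + 1):
            grams[s[i:i + n]] += 1
            ctx[s[i:i + n - 1]] += 1
    return grams, ctx

def log_likelihood(name, lm, n=4, k=0.5, vocab=28):
    grams, ctx = lm
    s = "^" * (n - 1) + name + "$"
    ll = 0.0
    for i in range(len(s) - n + 1):
        g = s[i:i + n]
        ll += math.log((grams[g] + k) / (ctx[g[:-1]] + k * vocab))
    return ll

def posterior(name, lms, priors):
    # Bayes' rule: P(class | name) ∝ P(name | class) · P(class)
    logs = {c: log_likelihood(name, lm) + math.log(priors[c])
            for c, lm in lms.items()}
    m = max(logs.values())
    exps = {c: math.exp(v - m) for c, v in logs.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

lms = {
    "female": train_lm(["anna", "maria", "eva", "olga", "elena", "nina"]),
    "male":   train_lm(["ivan", "peter", "karl", "oleg", "boris", "milan"]),
}
post = posterior("elena", lms, {"female": 0.5, "male": 0.5})
```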

The lexicon short-circuits n-gram scoring for high-frequency names — an internal optimization, not a separate tier, and invisible to callers.

Full name decomposition

A full name is not just a longer string — its components carry different types of signal and should be modeled separately.

"Jean Carteaux"
        │
        ▼
   Tokenizer + surname model
        │
   ┌────┴──────────────────┐
   │                       │
given name            family name
"Jean"                "Carteaux"
 role=given            role=family
 role_weight=1.0       role_weight≈0.04
                       (sur_prob=0.91)
        │                       │
        ▼                       ▼
  gender model            origin model
  male=0.05 (English)     → "french"
  female=0.95 (English)
        │
        ▼
  origin correction
  (French context:
   Jean → male=0.97)
        │
        ▼
  Final: male=0.770

Two-pass inference for full names:

  1. Normalise all tokens → assign roles (given/family) + detect context origin from the family token
  2. Classify given-name tokens with origin correction blended in (α = 0.85)

The surname model (models/surname.bin) provides a sur_prob ∈ [0,1] for each token. Family tokens with high sur_prob receive a much lower weight in the gender ensemble (weight = 0.4 × (1 − sur_prob)), so common surnames like "Johnson" or "Bell" don't dilute the gender signal from the given name.
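The weighting rule can be sketched directly. The role-assignment heuristic here — treat the token with the highest sur_prob as the family name — is a simplification of what the real surname model does per token:

```python
def role_weights(tokens):
    """tokens: list of (token, sur_prob) pairs. Returns the weight each
    token carries in the gender ensemble."""
    family_idx = max(range(len(tokens)), key=lambda i: tokens[i][1])
    weights = {}
    for i, (tok, sp) in enumerate(tokens):
        if i == family_idx:
            weights[tok] = 0.4 * (1.0 - sp)   # family: down-weighted by sur_prob
        else:
            weights[tok] = 1.0                # given: full weight
    return weights

# sur_prob values from the decomposition diagram above
w = role_weights([("Jean", 0.002), ("Carteaux", 0.91)])
```

This reproduces the diagram's role_weight ≈ 0.04 for "Carteaux" (0.4 × (1 − 0.91) = 0.036).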

Origin detection

Origin is detected from the family (surname) token using a three-layer pipeline:

Layer 1: Script prior (deterministic)

Non-Latin scripts give near-certain origin at zero model cost, applied before any dictionary or n-gram lookup:

| Script detected | Origin assigned |
|---|---|
| CJK (Chinese, Japanese) | east_asian |
| Hangul (Korean) | east_asian |
| Devanagari (Hindi, Bengali, Gurmukhi, Gujarati, Tibetan) | south_asian |
| Myanmar (Burmese, Thai, Lao, Khmer) | southeast_asian |
| Arabic | arabic (model lookup continues for confirmation) |
| Cyrillic | eastern_european (model lookup continues) |
| Latin | falls through to hash-table lookup |
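A minimal version of the script prior, with a few illustrative (and deliberately incomplete) Unicode block ranges standing in for the utf8proc-based detector:

```python
def detect_script(name):
    """Map the first recognised code point to a script family.
    Ranges are simplified for illustration; the real detector covers
    more blocks per script."""
    for ch in name:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF:
            return "cjk"
        if 0xAC00 <= cp <= 0xD7AF:
            return "hangul"
        if 0x0900 <= cp <= 0x097F:
            return "devanagari"
        if 0x0600 <= cp <= 0x06FF:
            return "arabic"
        if 0x0400 <= cp <= 0x04FF:
            return "cyrillic"
    return "latin"

SCRIPT_ORIGIN = {  # deterministic priors from the table above
    "cjk": "east_asian", "hangul": "east_asian",
    "devanagari": "south_asian",
    "arabic": "arabic", "cyrillic": "eastern_european",
}

def origin_prior(name):
    # None means Latin input: fall through to the hash-table lookup
    return SCRIPT_ORIGIN.get(detect_script(name))
```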

Layer 2: Hash-table lookup (exact, high precision)

The surname origin model (models/surname_origin.bin) is a FNV-1a-64 hash table mapping normalized surname → origin ID + confidence (0–100). Entries with confidence ≥ 60 are accepted. The lookup iterates all family tokens; the first high-confidence hit wins. Compound surnames (e.g. "Bianchi-Ferraro") are split on hyphens and each part is looked up independently.
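FNV-1a-64 is a standard, well-specified hash. A sketch of the lookup, with an in-memory dict standing in for the on-disk slot table:

```python
FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3

def fnv1a_64(s):
    # FNV-1a over the UTF-8 bytes of a normalized surname
    h = FNV_OFFSET
    for b in s.encode("utf-8"):
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def lookup_origin(surname, table, min_conf=60):
    """table: {hash: (origin, confidence)}. Compound surnames are split
    on hyphens and each part tried independently; the first hit with
    confidence >= min_conf wins."""
    for part in surname.lower().split("-"):
        hit = table.get(fnv1a_64(part))
        if hit and hit[1] >= min_conf:
            return hit[0]
    return None

table = {fnv1a_64("bianchi"): ("italian", 95)}  # toy single-entry table
origin = lookup_origin("Bianchi-Ferraro", table)
```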

Layer 3: N-gram fallback (generalisation)

If no hash-table hit is found, the n-gram origin classifier (models/surname_origin_ngram.bin) provides softmax scores over all 11 origin classes. Predictions above a 0.45 probability threshold are accepted. This handles unseen surnames whose character morphology is distinctive (e.g. -son for Nordic, -enko for Ukrainian, -nen for Finnish).
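The threshold rule itself is one line; the softmax scores below are made up for illustration:

```python
def ngram_origin_fallback(scores, threshold=0.45):
    """scores: softmax probabilities from the n-gram origin classifier.
    Accept the top class only if it clears the probability threshold;
    otherwise the origin stays unknown."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# e.g. a "-nen" suffix pushing mass toward nordic (illustrative values)
confident = {"nordic": 0.62, "germanic": 0.14, "eastern_european": 0.09, "english": 0.15}
unsure = {"english": 0.30, "french": 0.28, "germanic": 0.42}
```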

Origin × gender correction

Once origin is detected, gender probabilities for the given-name token are blended toward an origin-specific prior. This corrects culturally ambiguous names:

| Name | Without origin | With origin |
|---|---|---|
| Jean Carteaux | female=0.95 (English prior) | male=0.77 (French prior) |
| Andrea Rossi | female=0.92 | male=0.80 (Italian prior) |
| Priya Singh | neutral ≈ 0.5 | female=0.997 (South Asian prior) |
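A sketch of the correction, assuming the blend is a straight α-weighted mix toward the origin prior followed by renormalisation. The prior values for "Jean" are illustrative, and the published final numbers also reflect fusion calibration, so they won't match this toy arithmetic exactly:

```python
ALPHA = 0.85  # blend strength from the two-pass inference step

def origin_correct(base, origin_prior, alpha=ALPHA):
    """Blend per-class gender probabilities toward an origin-specific
    prior when one exists; renormalise afterward."""
    if origin_prior is None:
        return dict(base)
    blended = {c: alpha * origin_prior.get(c, 0.0) + (1 - alpha) * p
               for c, p in base.items()}
    z = sum(blended.values())
    return {c: p / z for c, p in blended.items()}

# "Jean" under an English prior vs. after a French surname is detected
base = {"male": 0.05, "female": 0.93, "neutral": 0.02}
french_jean = {"male": 0.97, "female": 0.02, "neutral": 0.01}
corrected = origin_correct(base, french_jean)
```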

Design goals

Calibration over accuracy. If the model says 70% confidence, it should be right 70% of the time — not 85% and not 55%. Raw accuracy optimizes for the majority class and hides uncertainty. Calibration is measured explicitly and model outputs are post-processed with Platt scaling.
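Calibration can be checked with expected calibration error (ECE) — a standard metric, though not necessarily the exact one the benchmark scripts compute:

```python
def expected_calibration_error(preds, labels, bins=10):
    """preds: predicted confidence of the chosen class per example;
    labels: 1 if the chosen class was correct. Equal-width binning."""
    buckets = [[] for _ in range(bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * bins), bins - 1)
        buckets[idx].append((p, y))
    n, ece = len(preds), 0.0
    for b in buckets:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean confidence in bin
        acc = sum(y for _, y in b) / len(b)    # empirical accuracy in bin
        ece += (len(b) / n) * abs(conf - acc)
    return ece

# perfectly calibrated toy data: 75% confidence, right 75% of the time
ece_good = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
# overconfident: 90% confidence but only 50% correct
ece_bad = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```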

Batch throughput as a primary metric. Individual-name latency matters, but the most common use case is classifying millions of names in a data pipeline.

Provenance in the output. Every response includes which classifiers contributed and their individual weights. This is essential for debugging and understanding why a prediction was made.

Immutable, versioned model artifacts. Models are data. They are stored alongside the code that generated them and never mutated in place.

Embeddable. The core classifier is a C library with a stable public header. It links into any process and carries no runtime dependencies beyond utf8proc.

Language and implementation

Core library: C + utf8proc

The core library is written in C for performance and minimal footprint. Unicode is handled via utf8proc, a small MIT-licensed C library (same as used by PostgreSQL, Julia, libgit2).

Normalization pipeline: utf8proc CASEFOLD+STRIPMARK → ASCII fold table → keep a-z only. Vocabulary: a-z = 0–25, ^ = 26 (start sentinel), $ = 27 (end sentinel). V = 28.
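The same pipeline can be approximated in Python, with the standard-library unicodedata module standing in for utf8proc's CASEFOLD+STRIPMARK:

```python
import unicodedata

START, END = 26, 27  # '^' and '$' sentinels; vocabulary size V = 28

def normalize(name):
    """Casefold, decompose (NFKD), drop combining marks, keep a-z only —
    an approximation of the C pipeline's fold behaviour."""
    s = unicodedata.normalize("NFKD", name.casefold())
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    return "".join(ch for ch in s if "a" <= ch <= "z")

def encode(name):
    # map to the 28-symbol vocabulary with start/end sentinels
    return [START] + [ord(c) - ord("a") for c in normalize(name)] + [END]

norm = normalize("Raúl")
ids = encode("Ève")
```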

Model training: Python

Standard Python tooling: pandas, numpy, scipy, scikit-learn. No deep learning framework — all models are fast analytical fits (MLE, KN smoothing, SGD n-gram classifiers, logistic regression). Training is fully reproducible given content-addressed input data.

Python bindings: cffi

The Python package wraps the C library using cffi in ABI mode. The binding reads the public header and provides a NameClassifier class matching the C API exactly.

CLI

A thin CLI written in C links the core library directly (see CLI usage below).

Build

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/nc --help

The default build produces build/nc — a fully self-contained binary with all model files embedded (~38 MB). Requires the training pipeline to have been run first (models must exist in models/).

For a lightweight binary that loads models from disk at runtime (~60 KB):

cmake -B build -DCMAKE_BUILD_TYPE=Release -DNC_EMBED_MODELS=OFF
cmake --build build -j$(nproc)
./build/nc-slim -m models "Jean Carteaux"

Run unit tests:

for t in build/tests/test_*; do NC_MODEL_DIR=models $t; done

Data & training pipeline

Run these scripts in order. Each step caches its output; re-running is safe.

| # | Script | Output | Notes |
|---|---|---|---|
| 1 | py/data/fetch_ssa.py | data/processed/ssa_names.csv | US SSA baby names 1880–present. Columns: name, male_count, female_count, total_count. |
| 2 | py/data/fetch_insee.py | data/processed/insee_names.csv | French INSEE census (34K names). Requires data/raw/external/insee_nat2021.zip to be manually placed. |
| 3 | py/data/merge_sources.py | data/processed/names_combined.csv | Merges SSA + INSEE with per-source median scaling so each source contributes equally by unique-name count. |
| 4 | py/training/train_ngram.py | models/ngram_gender.bin (~7 MB) | Interpolated KN-smoothed character 4-gram LMs (one per class). |
| 5 | py/training/train_lexicon.py | models/lexicon_gender.bin | Laplace-smoothed MLE whole-name lexicon. LAPLACE_K=0.1. |
| 6 | py/training/train_fusion.py | models/fusion_gender.bin (192 B) | Logistic-regression fusion weights + Platt scaling over lexicon/ngram/neural outputs. |
| 7 | py/data/fetch_surnames.py | data/processed/surnames.csv | US Census 2010 zip (place at data/raw/surnames/names.zip) + Wikidata Q101352 + embedded fallback. |
| 8 | py/training/train_surname.py | models/surname.bin (~12 MB) | Maps name → sur_prob ∈ [0,1] (0=given, 1=surname). FNV-1a-64 hash table. |
| 9 | py/data/fetch_surname_origins.py | data/processed/surname_origins.csv | Knowledge-seeded dict (~150–200 surnames per origin, 11 origins) + optional Wikidata P495 enrichment. |
| 10 | py/data/fetch_given_name_origins.py | data/processed/given_name_origins.csv | Given names whose gender differs from English priors by cultural context + optional Wikidata Q12308941/Q11879590. MIN_DELTA=0.10. |
| 11 | py/training/train_surname_origin.py | models/surname_origin.bin (~24 KB) | Maps surname → origin_id (uint8) + confidence (0–100). "SORG" magic, 12-byte slots. |
| 12 | py/training/train_given_origin.py | models/given_origin_gender.bin (~16 KB) | Maps (given_name, origin_id) → (prob_male, prob_female). Only stores entries where origin gender differs from English by >0.25. "GOGG" magic. |
| 13 | py/training/train_surname_origin_ngram.py | models/surname_origin_ngram.bin (~2 MB) | Character n-gram softmax classifier for surname origin (11 classes). Fallback for surnames not in the hash table. SGD-trained, temperature-calibrated. "SNGM" magic. |

Steps 1–6 produce the core gender classification models. Steps 7–13 produce the surname-aware role assignment and origin detection models. All seven model files are required for full accuracy; the system degrades gracefully if any are absent.

Manual prerequisites:

  • Step 2: Download insee_nat2021.zip from the INSEE website and place at data/raw/external/insee_nat2021.zip
  • Step 7: Download the Census 2010 surname file from the Census Bureau and place at data/raw/surnames/names.zip

CLI usage

# Classify a name (human-readable output)
./build/nc Mary

# Classify with JSON output
./build/nc -j "Raúl Guillermo Rodríguez Castro"

# Parse mode: show token roles and surname scores only
./build/nc -p "Jean Carteaux"
./build/nc -p -j "Jean Carteaux"

# Custom model directory
./build/nc -m /path/to/models "Andrea Rossi"

# Batch mode: read names from file, one per line
./build/nc -b names.txt

# Force n-gram origin model (skip exact hash-table lookup)
./build/nc --ngram-origin "Jean Dupont"

# Print version
./build/nc -v

Output format

{
  "input": "Jean Carteaux",
  "script": "latin",
  "components": [
    { "token": "Jean",     "role": "given",  "index": 0, "surname_score": 0.002 },
    { "token": "Carteaux", "role": "family", "index": 1, "surname_score": 0.386 }
  ],
  "attributes": {
    "gender": {
      "male":    0.770,
      "female":  0.217,
      "neutral": 0.013
    },
    "origin": {
      "english":          0.0,
      "french":           1.0,
      "germanic":         0.0,
      "nordic":           0.0,
      "iberian":          0.0,
      "italian":          0.0,
      "eastern_european": 0.0,
      "arabic":           0.0,
      "east_asian":       0.0,
      "south_asian":      0.0,
      "southeast_asian":  0.0
    }
  },
  "calibrated": true,
  "model_version": "dev",
  "provenance": {
    "gender": { "lexicon": 0.482, "ngram": 0.389, "neural": 0.130 }
  }
}

Parse mode (-p) omits attributes and provenance, showing only token roles, surname scores, and detected origin — useful for debugging name decomposition:

Jean       given   sur=0.002
Carteaux   family  sur=0.386
Origin:    french

Repository layout

name-classifier/
├── include/
│   └── nameclass.h          # stable public C API
├── src/
│   ├── core/                # classify.c — ensemble pipeline, two-pass inference
│   ├── lexicon/             # FNV hash table, MLE log-prob lookup
│   ├── ngram/               # KN-smoothed 4-gram LM engine
│   ├── neural/              # stub returning uniform distribution
│   ├── fusion/              # learned fusion weights + Platt scaling
│   ├── surname/             # surname.bin loader; sur_prob lookup
│   ├── origin/              # origin.bin + origin_ngram.bin loader; given-origin correction
│   ├── unicode/             # utf8proc integration, normalization, script detection
│   └── cli/                 # command-line interface (main.c, render.c)
├── py/
│   ├── name_classifier/     # cffi ABI-mode Python binding + NameClassifier class
│   ├── training/            # 7 training scripts (steps 4–6, 8, 11–13)
│   ├── data/                # 6 data fetch + merge scripts (steps 1–3, 7, 9–10)
│   └── eval/                # run_benchmark.py, run_origin_eval.py, benchmark JSON files
├── models/                  # trained model artifacts (gitignored except .gitkeep)
├── data/
│   ├── raw/                 # unprocessed source data (gitignored)
│   ├── processed/           # normalized training CSVs (gitignored)
│   └── SOURCES.md
├── tests/                   # 7 unit test binaries (CMake)
│   └── unit/
│       ├── test_normalize.c # UTF-8 normalization + ASCII fold
│       ├── test_ngram.c     # n-gram LM scoring
│       ├── test_lexicon.c   # lexicon hash-table lookup
│       ├── test_fusion.c    # fusion softmax + weight arithmetic
│       ├── test_classify.c  # end-to-end gender + script classification
│       ├── test_script.c    # nc_detect_script() for all 10 script types
│       └── test_origin.c    # origin detection (hash table, script priors, n-gram, edge cases)
└── third_party/
    └── utf8proc/            # vendored header

Data sources

| Source | Used for | Steps |
|---|---|---|
| US SSA baby names (1880–present) | Gender training data | 1 |
| INSEE France census (insee_nat2021.zip) | French given name gender data | 2 |
| Wikidata Q101352 (family name items) | Surname vocabulary | 7 |
| Wikidata Q12308941/Q11879590 (given name gender) | Per-origin given name gender | 10 |
| Wikidata P495 (country of origin) | Surname origin labels | 9 |
| US Census 2010 surnames (names.zip) | Surname probability | 7 |
| Knowledge-seeded dicts | Surname origins + given-name/origin gender (11 origins) | 9, 10 |

All sources are documented in data/SOURCES.md.

Benchmarks

# Gender accuracy
python3 py/eval/run_benchmark.py --benchmark py/eval/benchmark_names.json
python3 py/eval/run_benchmark.py --benchmark py/eval/benchmark_extended.json

# Origin accuracy (requires data/raw/world_names.jsonl)
python3 py/eval/run_origin_eval.py

Current accuracy (with the full 13-step training pipeline run):

| Benchmark | Score |
|---|---|
| benchmark_names.json (common names) | 98.4% |
| benchmark_extended.json (edge cases) | 95.4% |
| Cross-cultural (Jean/Andrea/Priya with surnames) | 100% (6/6) |
| Origin detection (world_names.jsonl, 106 names) | 70.8% |

Per-origin accuracy:

| Origin | Accuracy | Notes |
|---|---|---|
| nordic | 100% | Finnish -nen pattern covered |
| east_asian | 100% | Script prior (CJK/Hangul) + romanized surnames |
| italian | 100% | |
| eastern_european | 82% | Ukrainian -enko pattern covered |
| iberian | 79% | |
| arabic | 71% | |
| french | 60% | |
| germanic | 60% | |
| south_asian | 50% | Romanized Indian names are heterogeneous |
| southeast_asian | 50% | Vietnamese/Indonesian romanizations overlap with other origins |
| english | 40% | Irish/Scottish Gaelic names skew results |

Key improvements from surname-aware classification:

| Name | Without surname model | With surname model |
|---|---|---|
| Michael Bell | male=0.727 | male=0.980 |
| Mary Johnson | female=0.711 | female=0.991 |
| Jean Carteaux | female=0.950 | male=0.770 |
| Jan van Dyck | female=0.850 | male=0.857 |
| Andrea Rossi | female=0.920 | male=0.804 |
| Priya Singh | neutral ≈ 0.5 | female=0.997 |

Trade-offs

The ensemble requires calibration data. The fusion weights must be learned from labeled data, which requires a disciplined training and evaluation pipeline from the start.

Parallel classifiers use more memory than a cascade that runs only one model at a time. Total model footprint is ~22 MB on disk; at runtime models are memory-mapped.

C requires explicit safety discipline. Memory safety is the programmer's responsibility. This is managed through AddressSanitizer builds and unit tests — but remains a cost that a memory-safe language would eliminate.

The neural tier is a stub. The fusion layer currently assigns it weight ≈ 0; it returns a uniform distribution. The architecture reserves a slot for a future character-level model.

Origin coverage is uneven. The 11-origin system covers the major Eurasian naming traditions well. Sub-Saharan African, Greek, Persian/Farsi, and Polynesian names fall through to NC_ORIGIN_UNKNOWN by design — naming patterns are too heterogeneous for reliable n-gram classification at this data scale.