pg_kazsearch
License: LGPL v3 · PostgreSQL: 16–18
The first full-text search stemmer for the Kazakh language — for PostgreSQL and Elasticsearch.
Kazakh is heavily agglutinative: a single word like мектептерімізде carries plural, possessive, and locative suffixes that must all be stripped to reach the root мектеп. No existing PostgreSQL or Elasticsearch analyzer handles this. pg_kazsearch fills that gap with a Rust stemmer that plugs into both PostgreSQL (via pgrx) and Elasticsearch (via a native JNI plugin).
-- PostgreSQL
CREATE EXTENSION pg_kazsearch;
SELECT to_tsvector('kazakh_cfg', 'президенттің жарлығы');
-- 'жарлық':2 'президент':1
// Elasticsearch
{ "filter": { "kaz_stem": { "type": "kazsearch_stem" } } }
// алмаларымыздағы → алма
// мектептеріміздегі → мектеп
Install
Pre-built package (Debian/Ubuntu)
Download the .deb for your PostgreSQL version from GitHub Releases:
# Example: PostgreSQL 18 on amd64
curl -LO https://github.com/darkhanakh/pg-kazsearch/releases/latest/download/postgresql-18-pg-kazsearch_2.0.0_amd64.deb
sudo dpkg -i postgresql-18-pg-kazsearch_2.0.0_amd64.deb
Then in psql:
CREATE EXTENSION pg_kazsearch;
Docker
Use the pre-built image as a drop-in replacement for postgres:
# docker-compose.yml
services:
  db:
    image: ghcr.io/darkhanakh/pg-kazsearch:18
Or add to your existing Dockerfile:
FROM ghcr.io/darkhanakh/pg-kazsearch:18 AS kazsearch
FROM postgres:18
COPY --from=kazsearch /usr/share/postgresql/18/extension/pg_kazsearch* /usr/share/postgresql/18/extension/
COPY --from=kazsearch /usr/lib/postgresql/18/lib/pg_kazsearch* /usr/lib/postgresql/18/lib/
COPY --from=kazsearch /usr/share/postgresql/18/tsearch_data/kaz_* /usr/share/postgresql/18/tsearch_data/
From source
# Requires: Rust toolchain, cargo-pgrx, postgresql-server-dev
cargo install --locked cargo-pgrx --version "=0.17.0"
cargo pgrx init --pg18 $(which pg_config)
git clone https://github.com/darkhanakh/pg-kazsearch.git
cd pg-kazsearch
cargo pgrx install --release -p pg_kazsearch

# Install lexicon and stopwords
cp data/tsearch_data/kaz_stems.dict $(pg_config --sharedir)/tsearch_data/
cp data/tsearch_data/kaz_stopwords.stop $(pg_config --sharedir)/tsearch_data/
Elasticsearch
The same Kazakh stemmer is available as an Elasticsearch analysis plugin (kazsearch_stem token filter). All stemmer logic stays in Rust — the Java side is a thin JNI bridge.
Install from GitHub Releases
Download the plugin ZIP from GitHub Releases and install:
bin/elasticsearch-plugin install https://github.com/darkhanakh/pg-kazsearch/releases/latest/download/analysis-kazsearch-0.1.0.zip
The pre-built ZIP includes native libraries for linux/amd64 and linux/aarch64.
Configuration
{
  "settings": {
    "analysis": {
      "filter": {
        "kaz_stem": { "type": "kazsearch_stem" }
      },
      "analyzer": {
        "kazakh": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "kaz_stem"]
        }
      }
    }
  }
}
Verify
curl -X POST 'localhost:9200/my_index/_analyze' \
  -H 'Content-Type: application/json' \
  -d '{"analyzer": "kazakh", "text": "алмаларымыздағы мектептеріміздегі"}'
# → tokens: ["алма", "мектеп"]
Build from source
Requires: Rust toolchain, JDK 21, Gradle 8+, and cargo-zigbuild for cross-compilation.
# Build Rust cdylib (native stemmer library)
just es-native

# Build ES plugin ZIP (includes Java bridge + native lib)
just es-build
# → elastic/java/build/distributions/analysis-kazsearch-0.1.0.zip

# Run tests
just es-up
just es-load-corpus   # index 3000 articles
just es-eval          # run search quality evaluation
Usage (PostgreSQL)
The extension creates everything automatically — a text search template, dictionaries, and a ready-to-use configuration called kazakh_cfg:
CREATE EXTENSION pg_kazsearch;

-- Stem individual words
SELECT ts_lexize('pg_kazsearch_dict', 'алмаларымыздағы');
-- {алма}

-- Build tsvectors
SELECT to_tsvector('kazakh_cfg', 'мектептеріміздегі оқушылардың');
-- 'мектеп':1 'оқушы':2

-- Add FTS to a table
ALTER TABLE articles ADD COLUMN fts tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('kazakh_cfg', title), 'A') ||
    setweight(to_tsvector('kazakh_cfg', body), 'B')
  ) STORED;
CREATE INDEX idx_fts ON articles USING GIN (fts);

-- Search
SELECT title
FROM articles
WHERE fts @@ websearch_to_tsquery('kazakh_cfg', 'президенттің жарлығы')
ORDER BY ts_rank_cd(fts, websearch_to_tsquery('kazakh_cfg', 'президенттің жарлығы')) DESC
LIMIT 10;
Tuning weights
Penalty weights are tunable at runtime without restarting PostgreSQL:
ALTER TEXT SEARCH DICTIONARY pg_kazsearch_dict (w_deriv = 3.5, w_short_char = 100.0);
Benchmarks
Tested on 2,999 Kazakh news articles from kaz.tengrinews.kz with 9,048 evaluation queries.
PostgreSQL: pg_kazsearch vs pg_trgm
[Charts: Retrieval Quality · Relative Improvement · Query Latency]
| Metric | pg_kazsearch | pg_trgm | Improvement |
|---|---|---|---|
| Recall@10 | 0.784 | 0.635 | +23% |
| MRR@10 | 0.712 | 0.566 | +26% |
| nDCG@10 | 0.729 | 0.582 | +25% |
| Query latency | 0.5 ms | 1.4 ms | 2.8x faster |
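For readers unfamiliar with the metrics in the table, here is a minimal sketch of their textbook per-query definitions with binary relevance. The README does not show the project's evaluation code, so the function names and the binary-relevance assumption are illustrative only; reported scores would be averages of such per-query values (MRR being the mean of the reciprocal ranks).

```rust
use std::collections::HashSet;

/// Fraction of the relevant documents that appear in the top-k results.
fn recall_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked
        .iter()
        .take(k)
        .filter(|d| relevant.contains(*d))
        .count();
    hits as f64 / relevant.len() as f64
}

/// Reciprocal rank of the first relevant document in the top-k (0 if none).
fn reciprocal_rank_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    ranked
        .iter()
        .take(k)
        .position(|d| relevant.contains(d))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Binary-relevance nDCG@k: DCG of the ranking divided by the DCG of an
/// ideal ranking that places all relevant documents first.
fn ndcg_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    // Discounted gain of a relevant document at zero-based position i.
    let gain = |i: usize| 1.0 / (i as f64 + 2.0).log2();
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, d)| relevant.contains(*d))
        .map(|(i, _)| gain(i))
        .sum();
    let ideal: f64 = (0..relevant.len().min(k)).map(|i| gain(i)).sum();
    if ideal == 0.0 {
        0.0
    } else {
        dcg / ideal
    }
}
```

For a ranking `[7, 3, 9, 1]` with relevant documents `{3, 1, 42}`, Recall@10 is 2/3 (two of three relevant documents retrieved) and the reciprocal rank is 1/2 (first hit at position 2).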
Elasticsearch: kazsearch_stem vs standard analyzer
On human-written gold queries, the stemmer finds more relevant articles and ranks them higher:
| Metric | kazsearch_stem | standard | Improvement |
|---|---|---|---|
| Recall@10 | 0.358 | 0.309 | +16% |
| MRR@10 | 0.671 | 0.591 | +13% |
vs Tengrinews.kz native search
Searching the same articles on tengrinews.kz vs ES with kazsearch_stem:
| Search query (Kazakh with suffixes) | tengrinews.kz | ES + kazsearch_stem |
|---|---|---|
| мектептердегі оқушылар | 2 | 159 |
| балалардың денсаулығы | 13 | 391 |
| мұғалімдердің наразылығы | 0 | 28 |
| спортшылардың жетістіктері | 0 | 87 |
| бензиннің бағасын көтеру | 0 | 69 |
| мектептеріміздегі мәселелер | 0 | 609 |
Stemmer examples
| Input | Output | Stripped |
|---|---|---|
| мектептерімізде | мектеп | plural + possessive + locative |
| президенттерінің | президент | plural + possessive + genitive |
| өзгеруі | өзгеру | verbal noun possessive |
| берді | бер | past tense |
| экономикалық | экономика | derivational adjective |
Architecture
┌──────────────────────────────────────────────────────────┐
│ Cargo Workspace │
│ │
│ core/ Pure Rust stemmer (no PG/ES deps) │
│ pg_ext/ pgrx PostgreSQL extension │
│ cli/ CLI tool (kazsearch stem/analyze/bench) │
│ elastic/ Elasticsearch plugin (Rust cdylib + JNI) │
│ src/ C ABI + JNI exports calling core::stem() │
│ java/ Java bridge + Lucene TokenFilter (~50 LoC)│
│ docker/ ES with plugin pre-installed │
└──────────────────────────────────────────────────────────┘
One stemmer, multiple consumers. The core/ crate is the single source of truth for all stemming logic — PostgreSQL, Elasticsearch, and CLI all call into it.
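To illustrate how a non-Rust consumer can call into the shared stemmer, here is a minimal sketch of a C ABI wrapper using only the standard library. The names `kaz_stem_c` and `kaz_free`, and the placeholder `stem` function, are hypothetical; the actual exports in elastic/src/ may look different.

```rust
use std::ffi::{c_char, CStr, CString};

/// Placeholder standing in for the core stemmer (hypothetical). The real
/// logic lives in the core/ crate; this only strips one fixed suffix.
fn stem(word: &str) -> String {
    word.strip_suffix("лар").unwrap_or(word).to_string()
}

/// C ABI entry point (hypothetical name). Takes a NUL-terminated UTF-8
/// string and returns a newly allocated one, or null on invalid input.
///
/// # Safety
/// `word` must be null or point to a valid NUL-terminated string.
#[no_mangle]
pub unsafe extern "C" fn kaz_stem_c(word: *const c_char) -> *mut c_char {
    if word.is_null() {
        return std::ptr::null_mut();
    }
    let input = match unsafe { CStr::from_ptr(word) }.to_str() {
        Ok(s) => s,
        Err(_) => return std::ptr::null_mut(), // not valid UTF-8
    };
    match CString::new(stem(input)) {
        Ok(out) => out.into_raw(), // ownership passes to the caller
        Err(_) => std::ptr::null_mut(),
    }
}

/// Frees a string previously returned by `kaz_stem_c`.
///
/// # Safety
/// `ptr` must be null or a pointer obtained from `kaz_stem_c` that has
/// not been freed yet.
#[no_mangle]
pub unsafe extern "C" fn kaz_free(ptr: *mut c_char) {
    if !ptr.is_null() {
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

Returning an owned pointer paired with an explicit free function keeps allocation and deallocation on the Rust side, which is a common pattern when the caller is a JNI or C bridge.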
The stemmer algorithm:
- BFS suffix stripper — breadth-first search over layered morphological rules (predicate, case, possessive, plural, derivational for nouns; person, tense, negation, voice for verbs), with vowel harmony validation
- Penalty scoring — candidates scored by syllable count, suffix weakness, derivational depth, and lexicon hits
- Lexicon — 21,863 POS-tagged stems from Apertium-kaz for overstemming protection
- Stem repair — consonant mutation reversal (б→п, г→к, ғ→қ), vowel elision restoration, lexicon-based vowel append
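The steps above can be sketched as a toy breadth-first stripper. This is not the code in core/: the three suffix layers, the penalty formula, and the four-word lexicon are made-up stand-ins, and vowel harmony validation is omitted for brevity. It only shows how layered stripping, consonant repair, and lexicon-aware scoring can fit together.

```rust
use std::collections::{HashSet, VecDeque};

// Toy suffix layers in stripping order (outermost first). Illustrative only.
const LAYERS: [&[&str]; 3] = [
    &["нің", "ның", "тің", "де", "да"], // case
    &["іміз", "ымыз", "і", "ы"],        // possessive
    &["тер", "тар", "лер", "лар"],      // plural
];

/// Approximates the syllable count by counting vowel letters.
fn syllables(s: &str) -> usize {
    s.chars().filter(|c| "аәеэиоөұүуыіюяё".contains(*c)).count()
}

/// Reverses final-consonant voicing (б→п, г→к, ғ→қ), if applicable.
fn repair(stem: &str) -> Option<String> {
    let mut chars: Vec<char> = stem.chars().collect();
    let fixed = match *chars.last()? {
        'б' => 'п',
        'г' => 'к',
        'ғ' => 'қ',
        _ => return None,
    };
    *chars.last_mut()? = fixed;
    Some(chars.into_iter().collect())
}

/// Toy penalty: shorter stems are preferred, one-syllable stems are
/// heavily penalized, and lexicon hits are rewarded. Weights are invented.
fn penalty(stem: &str, lexicon: &HashSet<&str>) -> f64 {
    let mut p = stem.chars().count() as f64;
    if syllables(stem) < 2 {
        p += 100.0;
    }
    if lexicon.contains(stem) {
        p -= 50.0;
    }
    p
}

/// Explores all layered strippings breadth-first and returns the
/// lowest-penalty candidate.
fn stem(word: &str, lexicon: &HashSet<&str>) -> String {
    let mut queue = VecDeque::from([(word.to_string(), 0usize)]);
    let mut seen = HashSet::new();
    let mut best = (f64::INFINITY, word.to_string());
    while let Some((cur, layer)) = queue.pop_front() {
        if !seen.insert((cur.clone(), layer)) {
            continue;
        }
        // Score the candidate itself and its repaired variant, if any.
        for cand in std::iter::once(cur.clone()).chain(repair(&cur)) {
            let p = penalty(&cand, lexicon);
            if p < best.0 {
                best = (p, cand);
            }
        }
        // Only layers at or after the current one may still be stripped.
        for (i, suffixes) in LAYERS.iter().enumerate().skip(layer) {
            for suf in suffixes.iter() {
                if let Some(rest) = cur.strip_suffix(*suf) {
                    if !rest.is_empty() {
                        queue.push_back((rest.to_string(), i + 1));
                    }
                }
            }
        }
    }
    best.1
}
```

With a lexicon containing мектеп and жарлық, this sketch reduces мектептерімізде to мектеп (case, then possessive, then plural) and жарлығы to жарлық (possessive stripped, then ғ→қ repaired).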
CLI
The kazsearch CLI works standalone without PostgreSQL:
cargo build -p kazsearch-cli --release

# Stem a word
kazsearch stem алмаларымыздағы
# алмаларымыздағы алма

# Morphological analysis
kazsearch analyze мектептеріміздегі

# Benchmark
kazsearch bench wordlist.txt

# Validate lexicon
kazsearch lexicon validate data/tsearch_data/kaz_stems.dict
Development
PostgreSQL
just up         # Start PG container
just build      # Build + install extension
just reload     # DROP + CREATE extension
just test-core  # Core Rust unit tests
just test-ext   # Smoke test via SQL
just cli        # Build CLI
Elasticsearch
just es-native       # Build Rust cdylib for ES plugin
just es-build        # Build plugin ZIP (Gradle)
just es-up           # Start ES container with plugin
just es-load-corpus  # Index 3000 articles
just es-eval         # Run search quality evaluation
just es-down         # Stop ES container
Contributing
- Fork the repo and create a feature branch
- Make your changes: stemmer logic lives in core/src/, extension glue in pg_ext/src/lib.rs
- Run cargo test -p kazsearch-core --test stem_tests to verify stemmer correctness
- Run just up && just reload && just test-ext to verify the extension works end-to-end
- Open a PR
Key things to know:
- Penalty weights in core/src/explore.rs are empirically tuned via CMA-ES, so changing one can affect many test cases
- Layer guards encode real morphotactic constraints, not heuristics
- Vowel harmony (back/front) is mandatory for suffix validation
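As a rough illustration of the vowel harmony constraint, the sketch below accepts a suffix only when its first classified vowel matches the back/front class of the stem's last classified vowel. The function names are hypothetical, the letters outside the two classes are simply treated as neutral here, and the real validation in core/ may handle more cases.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Harmony {
    Back,
    Front,
}

/// Classifies a Kazakh vowel as back or front; returns None for
/// consonants and for letters this sketch treats as neutral.
fn vowel_class(c: char) -> Option<Harmony> {
    match c {
        'а' | 'о' | 'ұ' | 'ы' => Some(Harmony::Back),
        'ә' | 'е' | 'ө' | 'ү' | 'і' => Some(Harmony::Front),
        _ => None,
    }
}

/// A suffix agrees with a stem if both share the same harmony class,
/// or if either side has no classified vowel at all.
fn suffix_agrees(stem: &str, suffix: &str) -> bool {
    let stem_class = stem.chars().rev().find_map(vowel_class);
    let suffix_class = suffix.chars().find_map(vowel_class);
    match (stem_class, suffix_class) {
        (Some(a), Some(b)) => a == b,
        _ => true,
    }
}
```

Under this check, мектеп accepts the front plural тер but rejects the back variant тар, while алма accepts лар and rejects лер. Taking the stem's last vowel also covers loanwords whose vowels disagree internally.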
References
- Krippes, K.A. (1993). Kazakh (Qazaq-) Grammatical Sketch with Affix List. ERIC.
- Washington, J., Salimzyanov, I., Tyers, F. (2014). Finite-state morphological transducers for three Kypchak languages. LREC.
- Makhambetov, O. et al. (2015). Data-driven morphological analysis and disambiguation for Kazakh. CICLing.
License
- Code: LGPL-3.0
- Lexicon data derived from Apertium-kaz (GPL-3.0).