GitHub - akgitrepos/regex-engine-rust: A small readable regex engine in Rust with Thompson NFA core

3 min read Original article ↗

A small, readable regex engine in Rust with a Thompson NFA core.

The project follows the contract in regex-engine-mini-challenge and prioritizes:

  • correctness for a clearly scoped feature set
  • predictable runtime behavior (no recursive backtracking engine)
  • explainable internals (AST -> NFA -> simulation)

Current Status (v1 core)

Implemented:

  • literals
  • dot (.), excluding newline
  • concatenation
  • alternation (|)
  • quantifiers (*, +, ?) with greedy behavior
  • grouping (())
  • character classes ([]) and ranges ([a-z])
  • escapes: \d, \w, \s, \t, \n, \\, \.
  • anchors: ^, $ (input boundaries only)
  • APIs: compile, is_match, find_first, find_all
  • bounded lazy DFA-style transition cache for repeated scans

Deferred / out of scope:

  • captures
  • backreferences
  • lookarounds
  • possessive quantifiers and atomic groups
  • full PCRE compatibility

Semantics

  • find_first uses deterministic leftmost-first semantics:
    1. earliest start index
    2. longest match at that same start index
  • find_all is non-overlapping
  • zero-length match progression rule in find_all:
    • if match length > 0, continue at end
    • if match length == 0, advance by one code unit
  • class shortcuts (\d, \w, \s) are ASCII semantics in v1
  • input is UTF-8; matching offsets are byte offsets

Quick Start

use regex_engine_rust::{compile, Match};

fn main() {
    let re = compile("a|ab").expect("valid pattern");

    assert!(re.is_match("zab"));
    assert_eq!(re.find_first("zab"), Some(Match { start: 1, end: 3 }));

    let all: Vec<Match> = compile("a*")
        .expect("valid")
        .find_all("bbb")
        .collect();
    assert_eq!(
        all,
        vec![
            Match { start: 0, end: 0 },
            Match { start: 1, end: 1 },
            Match { start: 2, end: 2 },
            Match { start: 3, end: 3 },
        ]
    );
}

API

pub fn compile(pattern: &str) -> Result<Regex, CompileError>;

impl Regex {
    pub fn is_match(&self, text: &str) -> bool;
    pub fn find_first(&self, text: &str) -> Option<Match>;
    pub fn find_all<'r, 't>(&'r self, text: &'t str) -> FindAll<'r, 't>;
    pub fn cache_metrics(&self) -> CacheMetrics;
    pub fn clear_cache(&self);
}

Cache Notes

  • The engine always remains correct via NFA simulation.
  • A bounded transition cache speeds up repeated scans for non-anchored patterns.
  • Cache metrics expose hit/miss/insert/eviction counts for profiling.

Development

Run tests:

Run benchmarks:

cargo bench --bench basic

Benchmark Snapshot

Command:

cargo bench --bench basic

Environment:

  • rustc 1.92.0
  • Darwin arm64 (Apple M1 class) from uname -a

Recent run (Criterion time estimates):

Benchmark Time (approx)
is_match literal 0.74 us
is_match class plus 0.88 us
is_match anchored 1.80 us
find_first mixed pattern 12.6 us
find_all repeated 5.78 us

Note: these are microbenchmarks intended for reproducible relative tracking, not cross-machine absolute comparisons.

Project Layout

  • src/syntax: lexer + parser + AST validation
  • src/automata: NFA model, Thompson compiler, simulation
  • src/engine: public matching APIs
  • tests: conformance and regression tests

License

MIT (LICENSE)