GitHub - akgitrepos/regex-engine-rust: A small readable regex engine in Rust with Thompson NFA core

A small, readable regex engine in Rust with a Thompson NFA core.

The project follows the contract in regex-engine-mini-challenge and prioritizes:

correctness for a clearly scoped feature set
predictable runtime behavior (no recursive backtracking engine)
explainable internals (AST -> NFA -> simulation)

Current Status (v1 core)

Implemented:

literals
dot (.), excluding newline
concatenation
alternation (|)
quantifiers (*, +, ?) with greedy behavior
grouping (())
character classes ([]) and ranges ([a-z])
escapes: \d, \w, \s, \t, \n, \\, \.
anchors: ^, $ (input boundaries only)
APIs: compile, is_match, find_first, find_all
bounded lazy DFA-style transition cache for repeated scans

Deferred / out of scope:

captures
backreferences
lookarounds
possessive quantifiers and atomic groups
full PCRE compatibility

Semantics

find_first uses deterministic leftmost-first semantics:
1. earliest start index
2. longest match at that same start index
find_all is non-overlapping
zero-length match progression rule in find_all:
- if match length > 0, continue at end
- if match length == 0, advance by one code unit
class shortcuts (\d, \w, \s) are ASCII semantics in v1
input is UTF-8; matching offsets are byte offsets

Quick Start

use regex_engine_rust::{compile, Match};

fn main() {
    let re = compile("a|ab").expect("valid pattern");

    assert!(re.is_match("zab"));
    assert_eq!(re.find_first("zab"), Some(Match { start: 1, end: 3 }));

    let all: Vec<Match> = compile("a*")
        .expect("valid")
        .find_all("bbb")
        .collect();
    assert_eq!(
        all,
        vec![
            Match { start: 0, end: 0 },
            Match { start: 1, end: 1 },
            Match { start: 2, end: 2 },
            Match { start: 3, end: 3 },
        ]
    );
}

API

pub fn compile(pattern: &str) -> Result<Regex, CompileError>;

impl Regex {
    pub fn is_match(&self, text: &str) -> bool;
    pub fn find_first(&self, text: &str) -> Option<Match>;
    pub fn find_all<'r, 't>(&'r self, text: &'t str) -> FindAll<'r, 't>;
    pub fn cache_metrics(&self) -> CacheMetrics;
    pub fn clear_cache(&self);
}

Cache Notes

The engine always remains correct via NFA simulation.
A bounded transition cache speeds up repeated scans for non-anchored patterns.
Cache metrics expose hit/miss/insert/eviction counts for profiling.

Development

Run tests:

Run benchmarks:

cargo bench --bench basic

Benchmark Snapshot

Command:

cargo bench --bench basic

Environment:

rustc 1.92.0
Darwin arm64 (Apple M1 class) from uname -a

Recent run (Criterion time estimates):

Benchmark	Time (approx)
`is_match literal`	0.74 us
`is_match class plus`	0.88 us
`is_match anchored`	1.80 us
`find_first mixed pattern`	12.6 us
`find_all repeated`	5.78 us

Note: these are microbenchmarks intended for reproducible relative tracking, not cross-machine absolute comparisons.

Project Layout

src/syntax: lexer + parser + AST validation
src/automata: NFA model, Thompson compiler, simulation
src/engine: public matching APIs
tests: conformance and regression tests

License

MIT (LICENSE)