A small, readable regex engine in Rust with a Thompson NFA core.
The project follows the contract in regex-engine-mini-challenge and prioritizes:
- correctness for a clearly scoped feature set
- predictable runtime behavior (no recursive backtracking engine)
- explainable internals (AST -> NFA -> simulation)
Current Status (v1 core)
Implemented:
- literals
- dot (
.), excluding newline - concatenation
- alternation (
|) - quantifiers (
*,+,?) with greedy behavior - grouping (
()) - character classes (
[]) and ranges ([a-z]) - escapes:
\d,\w,\s,\t,\n,\\,\. - anchors:
^,$(input boundaries only) - APIs:
compile,is_match,find_first,find_all - bounded lazy DFA-style transition cache for repeated scans
Deferred / out of scope:
- captures
- backreferences
- lookarounds
- possessive quantifiers and atomic groups
- full PCRE compatibility
Semantics
find_firstuses deterministic leftmost-first semantics:- earliest start index
- longest match at that same start index
find_allis non-overlapping- zero-length match progression rule in
find_all:- if match length > 0, continue at
end - if match length == 0, advance by one code unit
- if match length > 0, continue at
- class shortcuts (
\d,\w,\s) are ASCII semantics in v1 - input is UTF-8; matching offsets are byte offsets
Quick Start
use regex_engine_rust::{compile, Match}; fn main() { let re = compile("a|ab").expect("valid pattern"); assert!(re.is_match("zab")); assert_eq!(re.find_first("zab"), Some(Match { start: 1, end: 3 })); let all: Vec<Match> = compile("a*") .expect("valid") .find_all("bbb") .collect(); assert_eq!( all, vec![ Match { start: 0, end: 0 }, Match { start: 1, end: 1 }, Match { start: 2, end: 2 }, Match { start: 3, end: 3 }, ] ); }
API
pub fn compile(pattern: &str) -> Result<Regex, CompileError>; impl Regex { pub fn is_match(&self, text: &str) -> bool; pub fn find_first(&self, text: &str) -> Option<Match>; pub fn find_all<'r, 't>(&'r self, text: &'t str) -> FindAll<'r, 't>; pub fn cache_metrics(&self) -> CacheMetrics; pub fn clear_cache(&self); }
Cache Notes
- The engine always remains correct via NFA simulation.
- A bounded transition cache speeds up repeated scans for non-anchored patterns.
- Cache metrics expose hit/miss/insert/eviction counts for profiling.
Development
Run tests:
Run benchmarks:
cargo bench --bench basic
Benchmark Snapshot
Command:
cargo bench --bench basic
Environment:
rustc 1.92.0- Darwin arm64 (
Apple M1 class) fromuname -a
Recent run (Criterion time estimates):
| Benchmark | Time (approx) |
|---|---|
is_match literal |
0.74 us |
is_match class plus |
0.88 us |
is_match anchored |
1.80 us |
find_first mixed pattern |
12.6 us |
find_all repeated |
5.78 us |
Note: these are microbenchmarks intended for reproducible relative tracking, not cross-machine absolute comparisons.
Project Layout
src/syntax: lexer + parser + AST validationsrc/automata: NFA model, Thompson compiler, simulationsrc/engine: public matching APIstests: conformance and regression tests
License
MIT (LICENSE)