Developed a suffix-array-based search engine for the site today. While a simple regex search was enough, couldn’t resist the technical elegance of a proper index.
Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text, and encode it into UTF-8 bytes. Used a null byte sentinel to mark document boundaries and stored the lexicographically sorted 32-bit unsigned integer offsets to sa.bin:
my @sa = 0 .. (length($corpus) - 1);
{
use bytes; # Force compare raw bytes
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
# Full string fallback (required for correctness)
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but comforting to have.
Linear SA-IS is not in the base system. Implemented O(L⋅N log N) sort with a fast path that caps L at 64 instead. Without the fast path indexing 100 8 KB articles takes 6.58 minutes. Fast path reduces that to 2.69 seconds. Tested with 16 (2.68s), 32 (2.69s), 128 (2.75s), and 256 (2.77s) bytes. 64 bytes balances fast path coverage and speed.
Search: Implemented the search in a FastCGI script as a textbook range query with two binary searches. Leveraged the fixed-width offsets for fast random access to the index:
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
Chose seek + read over mmap because it outperformed mmap for <1k files. At 10k, mmap was occasionally faster (~200 µs), but used more memory—possibly due to OpenBSD’s VM security trade-offs. Results may vary by OS.
Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear regex search:
============================================================= SEARCH BENCHMARK: Suffix array vs. Linear regex ARTICLE SIZE: 8 KB ============================================================= 500 files (Targeting: keyword_-1): ----------------+----------------------+--------------------- METRIC | SA | REGEX ----------------+----------------------+--------------------- Search time | 0.0014s | 0.0451s Peak RAM | 8124 KB | 9612 KB Indexing time | 18.1865s | N/A Index size | 19610.39 KB | N/A ----------------+----------------------+--------------------- 1000 files (Targeting: keyword_-1): ----------------+----------------------+--------------------- METRIC | SA | REGEX ----------------+----------------------+--------------------- Search time | 0.0021s | 0.0918s Peak RAM | 8280 KB | 9960 KB Indexing time | 43.1748s | N/A Index size | 39225.06 KB | N/A ----------------+----------------------+--------------------- 10000 files (Targeting: keyword_-1): ----------------+----------------------+--------------------- METRIC | SA | REGEX ----------------+----------------------+--------------------- Search time | 0.0173s | 1.1275s Peak RAM | 11848 KB | 13392 KB Indexing time | 663.3909s | N/A Index size | 392263.01 KB | N/A ----------------+----------------------+---------------------
Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system permissions to govern access. Hardened the system by running it in chroot.
Resource exhaustion and XSS attacks are inherent. Limited concurrent searches using lock-file semaphores, and capped the query length (64 B) and the result set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
At six articles a year, this should work for the next 1600 years. Penciled in the next release for Anno Domini 3626.