Paraglob 2
A fairly quick data structure for matching a string against a large list of patterns.
For example, given a list of patterns
and an input string dog, paraglob will return
How it works
For any pattern, there exist a set of sub-strings that a string must contain in order for it to have any hope of matching against that pattern. We call these meta-words. Here are some examples:
*og -> |og|
dog*fish -> |dog| |fish|
When a pattern is added to a Paraglob the pattern is stored and is split into
its meta-words. Those meta words are then added to an Aho-Corasick data
structure that can be found in multifast-ac.
When Paraglob is given a query, it first gets the meta-words contained in the
query using multifast-ac. Then, it builds a set of all patterns associated with
those meta-words and runs fnmatch on the query and those patterns. It finally
returns a vector of all the patterns that match.
Installation
# ./configure && make && make test && make install
How to use it
paraglob-test is a small
benchmarking script that takes three parameters: the number of patterns to
generate, the number of queries to perform, and the percentage generated of
patterns that will match.
As an example, running paraglob-test 10000 50 50 will add 10,000 patterns,
perform 50 queries on them (of which 50% should match), and then return the
results.
Inside Zeek
Paraglob is integrated with Zeek & provides a simple api inside of its
scripting language. In Zeek, paraglob is implemented as an
OpaqueType and its syntax closely follows other similar constructs
inside Zeek. A paraglob can only be instantiated once from a vector of
patterns and then only supports get operations which return a vector
of all patterns matching an input string. These patterns are different than
the pattern type in Zeek in that they are just strings. The syntax is as
follows:
local v = vector("*", "d?g", "*og", "d?", "d[!wl]g");
local p = paraglob_init(v);
print paraglob_match(p1, "dog");
out:
Notes
Paraglob can make queries very quickly, but does not build instantly. It takes about 1.5 seconds to build for 10,000 items, 3 seconds for 20,000, and so on. This is because of the time required to build the Aho-Corasick structure.