Minimal Perfect Hashing (stevehanov.ca)
The article takes a very weird turn:
> If we need extra information about the words, we can use an additional data structure along with the MA-FSA. We can store it in a hash table.
This is not necessary at all, since an acyclic finite state automaton can be used as a perfect hash function. If you sum the right language cardinalities of the transitions followed, you get a unique hash code (1..n, where n is the number of strings in the automaton). For efficiency, you can precompute the cardinalities in the transitions or states. You can then use a normal array for whatever data you want to associate with a word.
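To make that concrete, here is a minimal sketch of the numbering in Java (my own toy code; the class and method names are made up, and it uses a plain trie rather than a minimized automaton, but the counting idea is the same). Each state stores its right-language cardinality, and the hash of a word is the number of lexicographically smaller words plus one:

    import java.util.TreeMap;

    // Toy perfect hashing over an acyclic automaton (here: an unminimized trie).
    // Each state stores how many words are in its right language; the hash of a
    // word is the number of words that are lexicographically smaller, plus one.
    final class PerfectHashTrie {
        private static final class State {
            final TreeMap<Character, State> edges = new TreeMap<>();
            boolean isFinal;
            int rightLanguageCount; // number of words accepted from this state
        }

        private final State root = new State();

        void add(String word) {
            State s = root;
            for (char c : word.toCharArray()) {
                s = s.edges.computeIfAbsent(c, k -> new State());
            }
            s.isFinal = true;
        }

        // Must be called once after all words have been added; returns n.
        int freeze() {
            return count(root);
        }

        private static int count(State s) {
            int n = s.isFinal ? 1 : 0;
            for (State t : s.edges.values()) {
                n += count(t);
            }
            s.rightLanguageCount = n;
            return n;
        }

        // Returns the 1-based lexicographic rank of the word, or -1 if absent.
        int hash(String word) {
            int index = 0;
            State s = root;
            for (char c : word.toCharArray()) {
                if (s.isFinal) {
                    index++; // the word that ends at this state is smaller
                }
                // words reachable via lexicographically smaller transitions
                for (State t : s.edges.headMap(c).values()) {
                    index += t.rightLanguageCount;
                }
                s = s.edges.get(c);
                if (s == null) {
                    return -1;
                }
            }
            return s.isFinal ? index + 1 : -1;
        }

        public static void main(String[] args) {
            PerfectHashTrie t = new PerfectHashTrie();
            for (String w : new String[] {"mop", "moth", "pop", "star", "stop", "top"}) {
                t.add(w);
            }
            int n = t.freeze();
            // Prints a dense 1..n numbering in lexicographic order, so a plain
            // array of size n can hold whatever payload you want per word.
            for (String w : new String[] {"mop", "moth", "pop", "star", "stop", "top", "mo"}) {
                System.out.println(w + " -> " + t.hash(w) + " (of " + n + ")");
            }
        }
    }

On a minimized automaton you would memoize the counts (states are shared), and, as noted above, the offsets can just as well be precomputed per transition so that hashing sums one number per edge followed.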
With respect to parts 1 & 2 of the series, one can construct a Levenshtein automaton for a given word in linear time[1]. Such an automaton matches any string within the given edit distance of the word. You can then compute the intersection (language) of the Levenshtein automaton and the dictionary automaton to get all words within the given edit distance.
(Another approach would be to make the dictionary a transducer that returns the words within the given edit distance. I tried this once, but the resulting automaton turned out to be too large for realistic dictionaries.)
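For a rough picture of that traversal, here is a sketch in the same toy-Java style, using the DP-row walk from parts 1 & 2 of the series instead of a precompiled Levenshtein automaton; the effect is similar in that only the part of the dictionary whose prefixes can still match is visited. The names and structure are my own illustration, not taken from any library mentioned here:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Toy fuzzy lookup: walk the dictionary trie while maintaining one row of the
    // Levenshtein DP table per visited state, pruning branches whose minimum cost
    // already exceeds the allowed edit distance.
    final class FuzzyTrie {
        private static final class Node {
            final TreeMap<Character, Node> edges = new TreeMap<>();
            boolean isWord;
        }

        private final Node root = new Node();

        void add(String word) {
            Node n = root;
            for (char c : word.toCharArray()) {
                n = n.edges.computeIfAbsent(c, k -> new Node());
            }
            n.isWord = true;
        }

        List<String> withinDistance(String query, int maxDist) {
            List<String> results = new ArrayList<>();
            int[] firstRow = new int[query.length() + 1];
            for (int i = 0; i <= query.length(); i++) {
                firstRow[i] = i; // cost of deleting the first i query characters
            }
            for (Map.Entry<Character, Node> e : root.edges.entrySet()) {
                search(e.getValue(), e.getKey(), query, firstRow, "" + e.getKey(), maxDist, results);
            }
            return results;
        }

        private static void search(Node node, char c, String query, int[] prevRow,
                                   String prefix, int maxDist, List<String> results) {
            int cols = query.length() + 1;
            int[] row = new int[cols];
            row[0] = prevRow[0] + 1;
            int rowMin = row[0];
            for (int i = 1; i < cols; i++) {
                int replace = prevRow[i - 1] + (query.charAt(i - 1) == c ? 0 : 1);
                int insert = row[i - 1] + 1;
                int delete = prevRow[i] + 1;
                row[i] = Math.min(replace, Math.min(insert, delete));
                rowMin = Math.min(rowMin, row[i]);
            }
            if (node.isWord && row[cols - 1] <= maxDist) {
                results.add(prefix);
            }
            if (rowMin <= maxDist) { // some extension of this prefix can still match
                for (Map.Entry<Character, Node> e : node.edges.entrySet()) {
                    search(e.getValue(), e.getKey(), query, row, prefix + e.getKey(), maxDist, results);
                }
            }
        }

        public static void main(String[] args) {
            FuzzyTrie t = new FuzzyTrie();
            for (String w : new String[] {"cat", "cart", "card", "dog", "cast"}) {
                t.add(w);
            }
            // Prints [cat]: only "cat" is within edit distance 1 of "cats".
            System.out.println(t.withinDistance("cats", 1));
        }
    }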
I wrote a Java library for dictionary automata, perfect hash automata, and Levenshtein automata. It also provides immutable Map containers for String keys that store the keys in an automaton and use a flat array for the values: Dictomaton.
This is a great solution!
What would be the reason to do it this way if you don't already have the FSA?
If you just need to hash a small-ish set of known items and keep memory usage low, I'd think starting with the table would make more sense. Wouldn't the FSA approach require more memory than the table, since it has to maintain all the intermediate nodes?
> Wouldn't the FSA approach require more memory than the table, since it has to maintain all the intermediate nodes?
What do you mean by 'a table'? If you want a set without false positives/negatives, or a perfect hash function, you have to represent the possible elements in some way. FSAs are compact when you store a set of sequences with a certain amount of redundancy, in the form of subsequences that occur in multiple strings. In such cases, FSAs are far more compact than a flat table of strings.
Of course, in an actual compact implementation you would not implement states and transitions as classes. For example, Dictomaton stores the states as an array of unboxed transitions, which is far more compact (see the numbers on the GitHub page for comparisons with similar data structures).
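To illustrate what an unboxed, array-based representation can look like in general (this is just the generic flat layout, not necessarily Dictomaton's actual encoding; all names here are my own):

    import java.util.BitSet;

    // Generic flat (struct-of-arrays) encoding of an acyclic automaton: states are
    // ints, and all transitions live in parallel primitive arrays, so there is no
    // per-state or per-transition object overhead.
    final class FlatAutomaton {
        private final int[] stateOffset;  // stateOffset[s]..stateOffset[s+1] indexes s's transitions
        private final char[] label;       // label of each transition, sorted within a state
        private final int[] target;       // target state of each transition
        private final BitSet finalStates;

        FlatAutomaton(int[] stateOffset, char[] label, int[] target, BitSet finalStates) {
            this.stateOffset = stateOffset;
            this.label = label;
            this.target = target;
            this.finalStates = finalStates;
        }

        /** Returns true iff the automaton accepts the word, starting from state 0. */
        boolean contains(String word) {
            int state = 0;
            for (int i = 0; i < word.length(); i++) {
                state = step(state, word.charAt(i));
                if (state < 0) {
                    return false;
                }
            }
            return finalStates.get(state);
        }

        /** Binary search over the (sorted) transition slice of one state. */
        private int step(int state, char c) {
            int lo = stateOffset[state];
            int hi = stateOffset[state + 1] - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (label[mid] < c) {
                    lo = mid + 1;
                } else if (label[mid] > c) {
                    hi = mid - 1;
                } else {
                    return target[mid];
                }
            }
            return -1;
        }

        public static void main(String[] args) {
            // Hand-built automaton for {"as", "at"}: state 0 -a-> 1, 1 -s-> 2, 1 -t-> 2.
            BitSet fin = new BitSet();
            fin.set(2);
            FlatAutomaton a = new FlatAutomaton(
                    new int[] {0, 1, 3, 3},   // state 0 has 1 transition, state 1 has 2, state 2 has 0
                    new char[] {'a', 's', 't'},
                    new int[] {1, 2, 2},
                    fin);
            System.out.println(a.contains("as") + " " + a.contains("at") + " " + a.contains("ax"));
            // prints: true true false
        }
    }

Transitions here cost one char plus one int each; packed encodings often squeeze the label, target, and a final flag into a single int or long per transition.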
For what it's worth, I have a quite fast MPHF implementation that uses an average of 3.2 bits per key.
https://github.com/ahmetaa/zemberek-nlp/blob/master/core/src...
It is used for language model compression. There is also a Dart implementation as part of a small library.
I was lucky enough to have Fabiano Botelho, the person behind the minimal perfect hashing thesis, as an Algorithms and Data Structures professor back when I was studying Computer Engineering at CEFET-MG. He is absolutely brilliant and has incredible knowledge of both the theoretical and practical aspects of hash tables.
IIRC, he had just finished publishing his thesis, and had each of us write our own implementations of minimal perfect hashing for extra credit. The data structure is really useful for indexing large, immutable data sets.
For what it's worth, if you have on the order of N things to store and you hash them into a space of size N^3, the probability of a collision is less than N^(-1). Basically nonexistent. Hence, if you want to save space, just use a normal hash table implementation, but use a wider, more precise hash in place of the keys themselves. This is very useful when your keys are document-sized.
I don't understand what you mean?
10 things to store, 1000 buckets to store them in, and a collision probability below 0.1? That seems neither very impressive nor obviously correct.
It's just a rough (but valid for all N) upper bound. The precise probability for your example is about 0.044. The point is that you can replace a precise but expensive equality test on the keys with a comparison of a second hash with a larger range.
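Spelled out with the standard union/birthday bound (my arithmetic, for concreteness), for n items hashed uniformly into m = n^3 slots:

    \Pr[\text{collision}] \;\le\; \binom{n}{2}\,\frac{1}{m}
        \;=\; \frac{n(n-1)}{2\,n^{3}} \;<\; \frac{1}{2n} \;<\; \frac{1}{n},
    \qquad
    1 - \prod_{i=1}^{9}\left(1 - \frac{i}{1000}\right) \;\approx\; 0.044
    \quad (n = 10,\ m = 1000).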
Change that dang font!
Feel free to.