Fuzzify: A tiny library for fuzzy search
github.comiirc the go-to data structure for this kind of problem is called a “vantage point tree” (vp tree).
Another approach that I haven’t explored is to prep the array of words to search into a tree. normal words in Latin languages have lots of common prefixes and suffixes so you can dramatically reduce the amount of nodes (see my own old blog which compresses a scrabble dictionary https://williame.github.io/post/87682811573.html - same prefix suffix sharing our work on non-scrabble rotations too). Now walk the tree doing Levenshtein but checking multiple words at once?
Not for strings, no. The default algorithm would be Levenshtein edit distance and friends (i.e. what this library uses). If you want to get fancy, you could even go for something like Levenshtein automata (which are used by Lucene to implement fuzzy search across terms).
In my case I needed to match multiple names (consisting of 2-5 words), I had better matches and performance using the bag of words model (https://en.wikipedia.org/wiki/Bag-of-words_model).
I’m a bit confused. “Unit” brings up Turkmenistan before United States.
Turkmenistan is 8 edits, United States is 9. Levenshtein Distance is not actually a good search algorithm :)
That's correct, yes. Turkmenistan: 8 deletions (Trkmesan). United States: 9 deletions (ed States), including the space character.
You'll have to add more weights to substring matches, fuzzy search by itself is usually not enough for intuitive search
There are tons of js fuzzy search libraries with basically the same functionality. Is there any reason I should use this over others?
Regarding the name, doesn’t it implify fuzzification of data?