Optimizing trie-based spelling correction algorithms at Constructor.io
blog.constructor.ioI'm curious, how do you guys estimate "fitness" of the search results you found, i.e. whether it was what user searched for or not.
Just looking whether user later clicked on a result, or continued to search?
That’s definitely one signal, but it can be refined by tracking whether the search led to click-throughs, conversions, a different search, a bounce, etc... The conversion is the most important event, but there are a lot of important features aside from that which we'll use to improve results over time. If you're interested, our behavioral api (https://constructor.io/docs/#behavioral-data) goes through some of this as well. We've done a lot of work to figure out which signals are the most important, and are continuing to develop this feature all the time as we acquire more data.
Does this mean all prefixes have to be regenerated for every addition of a new item? Or can you invalidate the relevant prefixes?
This was a problem with the very first working implementation! However, with some intermediate data structures we were able to take advantage of the fact that, for an index with millions of items, adding or deleting a few thousand probably does not change the set of prefixes or their relative rankings very much. So although the initial precomputation of the prefix mappings from scratch takes ~10 minutes for our largest customers, we were able to get subsequent index updates down to mere seconds.
Nice post. How do you customize the Damerau-Levenshtein algorithm? Did you write a new from scratch?
Hi, and thanks! Well, there's no need to re-invent the wheel, and there already exist fast implementations of DL distance. However, Damerau-Levenshtein distance is just one piece of our string comparison evaluation system that is built in-house and is constantly under development and improvement.
For example, another important aspect of the comparison metrics is our in-house phonetics library that we've built to be sensitive to vowel context, syllabification, diphthongs, stemming and lemmatization, and other language phenomena, and we are fleshing it out to handle other languages including some Eastern European and CJK.
Cool) sounds as interesting approach. It is good to compare it in real life.