
SymSpell: 1M times faster spelling correction

github.com

203 points by mci 4 years ago · 32 comments

LordGrey 4 years ago

What they are calling "Symmetric Delete" seems to be the same as an older concept called "deletion neighborhoods". It is a term coined in an academic paper written by Thomas Bocek, Ela Hunt, and Burkhard Stiller from the University of Zurich, titled "Fast Similarity Search in Large Dictionaries"[1]. The work described there was expanded in a paper written by Daniel Karch, Dennis Luxen, and Peter Sanders from the Karlsruhe Institute of Technology, titled "Improved Fast Similarity Search in Dictionaries"[2]. Both of these papers deal with efficient searching for similar string values, given a query string.

LookupCompound and WordSegmentation, algorithms built on Symmetric Delete/Deletion Neighborhoods, are pretty interesting.

[1] https://fastss.csg.uzh.ch/ifi-2007.02.pdf
[2] https://arxiv.org/abs/1008.1191v2
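
For a feel of the trick, here is a minimal Python sketch of the symmetric-delete idea (my own illustration, not SymSpell's actual code): index every dictionary word under all variants reachable by deleting up to max_ed characters, then look a query up under its own deletion variants.

  from itertools import combinations

  def deletes(word, max_ed):
      # All strings reachable from `word` by deleting up to max_ed chars.
      out = {word}
      for d in range(1, max_ed + 1):
          for idx in combinations(range(len(word)), d):
              out.add(''.join(c for i, c in enumerate(word) if i not in idx))
      return out

  def build_index(words, max_ed=2):
      index = {}
      for w in words:
          for v in deletes(w, max_ed):
              index.setdefault(v, set()).add(w)
      return index

  def candidates(index, query, max_ed=2):
      # Words sharing a deletion variant with the query; survivors
      # still need a real edit-distance check and frequency ranking.
      found = set()
      for v in deletes(query, max_ed):
          found |= index.get(v, set())
      return found

  idx = build_index(["hand", "and", "band"])
  print(candidates(idx, "hnd"))  # {'hand', 'and', 'band'}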

kevincox 4 years ago

This seems very focused on spelling. What I find is key for a good spell-correction system is how the words sound. For example, I find that Firefox often can't find the word I meant in its suggestion list, but pasting the misspelt word into Google gets the right result 99% of the time as the one provided option.

I wonder how difficult it would be to adapt this to work on sounds or other frequent sources of typos and misspellings instead of just characters. It seems it should be possible if you can define a decent "normalization" function.
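
One crude way to test that: normalize both the dictionary and the query with a rough sound-flattening function before the edit-distance lookup. A toy sketch (the rewrite rules here are invented purely for illustration):

  import re

  def normalize(word):
      # Toy phonetic flattening, not a real algorithm.
      w = word.lower().replace('ph', 'f').replace('c', 'k')
      w = re.sub(r'(.)\1+', r'\1', w)              # collapse doubled letters
      return w[0] + re.sub(r'[aeiou]', '', w[1:])  # drop non-leading vowels

  print(normalize('fonetik') == normalize('phonetic'))  # True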

  • ChrisMarshallNY 4 years ago

    I used an old algorithm, called Metaphone[0], in a text search. It works relatively well for English (there's a Spanish version, too). I used Double Metaphone.

    [0] https://en.wikipedia.org/wiki/Metaphone
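
    For anyone wanting to try it, the jellyfish Python library ships a Metaphone implementation (assuming jellyfish is installed):

      import jellyfish  # pip install jellyfish

      # Sound-alike spellings should collapse to the same Metaphone key.
      for w in ('phonetic', 'fonetik', 'Smith', 'Smyth'):
          print(w, jellyfish.metaphone(w))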

    • faizshah 4 years ago

      There's a pretty cool python library with a huge number of these if you want to experiment (GPLv3): https://github.com/chrislit/abydos
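
      A tiny usage sketch (API details from memory of the abydos docs, so treat them as an assumption):

        # Assumes abydos's phonetic classes expose an encode() method.
        from abydos.phonetic import Metaphone, Soundex

        for algo in (Metaphone(), Soundex()):
            print(type(algo).__name__, algo.encode('necessary'))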

    • ars 4 years ago

      Metaphone is good if you typed in something that sounds like the correct word. But often spelling typos are missing letters, or pressing a nearby letter instead of the one you meant.

      For those, the sound of the word isn't right; you need something that checks for nearby keys (I guess it needs to know the keyboard layout) and for missing or extra letters.
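
      As a sketch of that direction, here is an edit distance where substituting a physically adjacent key costs less (the neighbour table is abbreviated and a QWERTY layout is assumed):

        # Partial QWERTY adjacency; fill in the rest for real use.
        NEAR = {'a': 'qwsz', 'e': 'wsdr', 'r': 'edft', 'n': 'bhjm'}

        def sub_cost(a, b):
            # Substituting an adjacent key counts half as much.
            return 0.0 if a == b else (0.5 if b in NEAR.get(a, '') else 1.0)

        def typo_distance(s, t):
            # Standard DP edit distance with the weighted substitution.
            d = [[float(i + j) if i * j == 0 else 0.0
                  for j in range(len(t) + 1)] for i in range(len(s) + 1)]
            for i in range(1, len(s) + 1):
                for j in range(1, len(t) + 1):
                    d[i][j] = min(d[i - 1][j] + 1,   # deletion
                                  d[i][j - 1] + 1,   # insertion
                                  d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
            return d[len(s)][len(t)]

        print(typo_distance('tesr', 'test'))  # 0.5: r sits next to t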

      • kevincox 4 years ago

        There are definitely two related concerns here:

        - I mistyped the word.

        - I don't know how to spell the word.

        The errors in these two look quite different and may need different methods of correcting.

  • ksec 4 years ago

    Same here; Apple often autocorrects my word into something else, or fails even to suggest the correct word. I have to use Google to get a correct spelling. (I am not sure if I am alone in this; sometimes I don't think I can spell any more.)

  • hashingroll 4 years ago

    Comparing the spellcheck quality of Firefox to Google (search, not Chrome) might not be completely fair. The browser most likely uses a client-side spelling model with tight memory & performance constraints. The suggestions from Google come from high quality models hosted on their servers.

  • phkahler 4 years ago

    https://en.wikipedia.org/wiki/Soundex

    Soundex is a rough phonetic "spelling" used for such things. It's also something I never see mentioned.
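
    For reference, a simplified Soundex fits in a few lines (the official rules also special-case 'h' and 'w', skipped here):

      def soundex(word):
          # Keep the first letter, code the rest, collapse runs of the
          # same code, skip vowels, pad/trim to 4 characters.
          codes = {**dict.fromkeys('bfpv', '1'),
                   **dict.fromkeys('cgjkqsxz', '2'),
                   **dict.fromkeys('dt', '3'), 'l': '4',
                   **dict.fromkeys('mn', '5'), 'r': '6'}
          w = word.lower()
          out, prev = w[0].upper(), codes.get(w[0], '')
          for ch in w[1:]:
              code = codes.get(ch, '')
              if code and code != prev:
                  out += code
              prev = code
          return (out + '000')[:4]

      print(soundex('Robert'), soundex('Rupert'))  # R163 R163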

    • ok123456 4 years ago

      Soundex was designed as a way to index last names before computers. Its primary benefit is that it's designed to be done by hand. It performs pretty poorly as a general-purpose natural-language hashing scheme.

      The precision and recall are too poor to use it as a spell checker.

    • eloff 4 years ago

      It's also pretty simple; I'm sure there are much better options readily available as libraries these days.

  • RobertMiller 4 years ago

    Firefox's spelling correction is very poor. For instance, try spelling the name of a country starting with a lowercase letter and omitting a single letter.

    "argetina" suggests "retina", Argentina isn't in the list.

    "nethrlands" suggests "fatherland", Netherlands isn't in the list.

    "ukriane" suggests "hurricane", Ukraine isn't in the list.

    It fails for every country I try.

cb321 4 years ago

As jamra correctly points out in a sibling comment, the entry point to this (which gets a lot of traction on HN) is indeed attacking a strawman tutorial-written-on-an-airplane-in-Python algorithm. So, the 1M speed-up is very over-hyped.

That said, the technique is not wholly without merit, but it does carry certain "average-worst case" trade-offs related to latency in the memory/storage system, because of SymSpell's reliance upon large hash tables. For details see https://github.com/c-blake/suggest

EDIT: Also, I am unaware of other implementations even saving the large, slow-to-compute index. The others I am aware of seem to rebuild it every time, which seems kind of lame. EDIT2: I guess there is a recent Rust one that is persistent, as well as the "mmap & go" Nim one. Still, what should be standard is very rare.

  • ccleve 4 years ago

    Links to the other implementations would be very helpful.

    • cb321 4 years ago

      There are many such links in the main article linked here, actually. That's where I just saw the Rust one using RocksDB. (No link to the Nim one, though...which has no external dependencies beyond the Nim stdlib.)

jamra 4 years ago

This seems very suspicious to me. They’re comparing performance to a tutorial blog post that is extremely inefficient.

How about comparing it to a Levenshtein automaton or another state-of-the-art approach?

  • cb321 4 years ago

    You are absolutely right to be suspicious. https://news.ycombinator.com/item?id=30580011

    The TL;DR is that on a "worst case" basis, it may be 10x faster or even 10x slower than a non-indexed linear scan depending upon latency of backing stores. Physical memories in computers are pretty big these days, but this technique would have been a total non-starter just 10 years before the 2007 article by Norvig. Meanwhile its linear scan competitor would have worked fine. It is also not a great approach if you have a big dictionary of long words and want truly large edit distances (>>5, say). I would say "mid-size edit distances of 2-4" are SymSpell's sweet spot.
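
    For comparison, the non-indexed linear scan is easy to picture (a sketch using an early-exit, band-checked Levenshtein):

      def within(s, t, k):
          # True iff edit_distance(s, t) <= k, bailing out early.
          if abs(len(s) - len(t)) > k:
              return False
          prev = list(range(len(t) + 1))
          for i, a in enumerate(s, 1):
              cur = [i] + [0] * len(t)
              for j, b in enumerate(t, 1):
                  cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (a != b))
              if min(cur) > k:          # whole row over budget
                  return False
              prev = cur
          return prev[-1] <= k

      def scan(words, query, k=2):
          return [w for w in words if within(w, query, k)]

      print(scan(['hand', 'and', 'banned'], 'hnd'))  # ['hand', 'and']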

injidup 4 years ago

Slightly OT, but: trying to look up friends' names using Android Auto speech recognition whilst driving. I have two Austrian friends, "Viktor" and "Patric". If I say "Hey Google, call Viktor", Google says "Sorry, you have no Victor in your phone book." Same with Patric: "Sorry, you have no Patrick in your phone book." I'm surprised that there is not even basic scoring done when looking up names in the phone book, with the most likely one offered.

*EDIT* I just found a solution to this problem: https://support.google.com/assistant/thread/559644?hl=en&msg... You can supply a phonetic name, and this helps Google match. This seems a bit low-tech, though.

danielscrubs 4 years ago

It was a bit fun comparing the Haskell version and the C# version:

https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/S...

https://github.com/cbeav/symspell/blob/master/src/SymSpell.h...

  • 0xcoffee 4 years ago

    I don't really see a reason why the C# version couldn't be written in a similar functional style to the Haskell version. I don't see them using LINQ anywhere. I see shallow clones, double constructors, no use of records. So I wouldn't be too quick to judge. I'd love to see someone try their hand at optimizing the C# version using the full capabilities of modern C#.

nicoburns 4 years ago

I'd love to see better open-source spell checking. The state-of-the-art spell checkers (in, say, MS Word or Google Search) are excellent and more than good enough for my needs. And yet the spell checkers in browsers (e.g. Chrome and Firefox) and other apps tend to be terrible, catching only very basic cases and often failing to suggest the correct word.

Does anyone have any insight into what's holding this back?

tgv 4 years ago

So, this is time vs memory?

  • LordGrey 4 years ago

    Very much so.

    Edit: To determine the number of records created from permuting a single string, use this:

  long numRecs = 0;                                  // total variants created
  for (r = 1; r <= e; r++)                           // delete r of the n chars
     numRecs += fact(n) / (fact(r) * fact(n - r));   // add n-choose-r
    
    where e = edit distance and n = length of string; fact() is your standard factorial function.

    • idealmedtech 4 years ago

      This actually has a closed form!

      numRecs = -1 + 2^n - (n nCr e+1) * 2F1(1, e-n+1; e+2; -1)

      where n nCr k is the binomial coefficient (n choose k) and 2F1 is the Gauss hypergeometric function.
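
      A quick numeric check, using scipy's hyp2f1 for the hypergeometric term:

        from math import comb
        from scipy.special import hyp2f1

        n, e = 9, 2
        direct = sum(comb(n, r) for r in range(1, e + 1))
        closed = -1 + 2**n - comb(n, e + 1) * hyp2f1(1, e - n + 1, e + 2, -1)
        print(direct, round(closed))  # 45 45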

      • cb321 4 years ago

        In a dictionary with more than a few words, some "created corruptions" in the lookup table/index collide with each other. (EDIT: e.g. for "hand" and "and" the overlap of the deletion-corrupted sets is substantial.) You only know which/how many by running the algo against a concrete dictionary. So, these expressions are at best a rough guide.

wodenokoto 4 years ago

It says it does fuzzy search as well. Wonder how it compares to fzf

tootie 4 years ago

Is this how modern spell checkers work? I assumed they were more heuristic at this point. For example, Google's "did you mean" is based on mapping common misspellings to what people actually clicked on.

amelius 4 years ago

Can this be used to match DNA sequences?

Or sounds (like Shazam does)?
