Settings

Theme

Building a full-text search engine in 150 lines of Python code (2021)

bart.degoe.de

84 points by matt_daemon a year ago · 22 comments

Reader

jankovicsandras a year ago

This is a good intro to text search. Shameless plug: If you throw in a bit more, ca. 250 SLOC, you can have BM25 search: https://github.com/jankovicsandras/bm25opt

  • marginalia_nu a year ago

    You can probably have phrase matching in a hundred lines more, maybe less.

    Most of the difficulty in search is dealing with the sheer volume of data. The algorithms themselves are pretty trivial for the most part.

  • benob a year ago

    Mine: bm25 I use for teaching (sorry for the French example)

    https://gist.github.com/benob/69d48421f88f5dcc2b26a204d3251d...

    • heresie-dabord a year ago

      Merci! Mais faut-il s'en excuser? L'exemple est en français, voilà tout!

      • usrnm a year ago

        This is an English-speaking forum, so yes, it is expected to use English and link to other stuff in English (or provide a translation). It doesn't mean that French (or any other language) is a bad thing, it just isn't the language used by this community.

        • heresie-dabord a year ago

          Sure... but for the record, the code that the parent posted is in Python -- one of the global tech economy's pre-eminent programming languages for more than a decade. The parent's example contains some variables with Unicode values that are French words. It strikes me as odd that the parent would say sorry at all. Maybe the parent has a better understanding than I of linguistic sensitivities.

cocodill a year ago

> return [token for token in tokens if token]

I love that kind of bullshit poetry.

  • ks2048 a year ago

    If you're used to this, it's nice and readable.

    Or you can do,

        return filter(None,tokens)
    
    Not obvious, but giving None is like giving "lambda x : x" to filter().
    • evnp a year ago

      Is this any different from `filter(bool, tokens)`? (similar to the JS equivalent mentioned in a sibling)

      If not, I'm now curious why the `None` special case was added to `filter`..

      • hacked_greenhat a year ago

        https://stackoverflow.com/questions/8748036/is-there-a-built...

        While this doesn't explain why filter accepts None as the lambda, I think the fact that there are no builtin identity function kind of explains why. It's easier to write None rather than put an identity lambda. I reckon it also makes it easier for a compiler to optimize this specific case.

        • evnp a year ago

          Thanks for the explanation. I take your point, but in the `filter` case `bool` seems just as easy as `None`, but with the added benefit that no awareness of a special case is needed -- any python programmer should understand how the `bool` builtin works. Curious if there's any downside.

    • sroussey a year ago

      You can do the same thing in JavaScript with Boolean

      Eg, arr.filter(Boolean)

    • chaos_emergent a year ago

      Holy shit, I’ve been writing python for 15 years and it’s the first time I’ve seen None used as an identity function. That’s nuts! How does it work under the hood? Does filter have a special case for evaluating None?

  • dkga a year ago

    It grew on me, too. Originally I frowned upon this kind of python shenanigans but now I must confess it makes me a bit happier inside whenever I have to type a similar thing.

    • graemep a year ago

      I think its lovely if its reasonably readable (which this is) but they can be convoluted. I have written Python list comprehensions that I could not read myself the next day and so I am more careful now.

  • pastage a year ago

    Yeah, but what is worst? I tried to understand the Haskell way in an article last week[0]

      pure (n, guard (factor /= n) $> factor)
    
    Which I think is more or less the same as this python line.

      return [factor for factor in factors if factor and not factor == n]
    
    The article does fancy stuff with memory caches which I believe is easy to do in python but I need to understand the Haskell code better.

    [0] Haskell: A Great Procedural Language https://entropicthoughts.com/haskell-procedural-programming#... https://news.ycombinator.com/item?id=42754098

    • mrkeen a year ago

      > pure (n, guard (factor /= n) $> factor)

      Returns a tuple: Left side is n, right side is (Just factor) if factor is not n, or Nothing if it is.

    • itishappy a year ago

      You're being way to clever! The machinery from the article is only needed to deal with IO.

      A more direct translation:

          # py
          [token for token in tokens if token]
      
          -- hs
          filter (not . null) tokens
      
      The full functions:

          # py
          def analyze(text):
            tokens = tokenize(text)
            tokens = lowercase_filter(tokens)
            tokens = stopword_filter(tokens)
            tokens = stem_filter(tokens)
            return [token for token in tokens if token]
      
          -- hs
          analyze :: String -> [String]
          analyze = filter (not . null) . stem_filter . stopword_filter . lowercase_filter . tokenize
      • pastage a year ago

        Thanks I will look at that too. Clever and ignorant are synonyms in my book, and I am ignorant here. In this case IO and memory was what was interesting in the article.

        • itishappy a year ago

          I'm being way to clever... I read your article when it was originally posted, but hadn't realized that you're quoting from it! My bad...

          If you'll humor me, I'd love to take another shot, but from the more interesting direction!

          Adding a type might clear stuff up a bit:

              -- hs
              pure (n, guard (factor /= n) $> factor) :: IO (Int, Maybe Int)
          
              # py
              print((n, if factor == n then factor else None))
          
          So without the for loop, there's no list to comprehend. Adding the list back gives something more equivalent to a standard for loop in Python:

              -- hs
              for numbers $ \n ->
                  pure (n, guard (factor /= n) $> factor) :: IO [(Int, Maybe Int)]
          
              # py
              for n in numbers:
                  print((n, if factor == n then factor else None))
          
          Which could also be written as a list comprehension in either language:

              -- hs
              [pure (n, guard (factor /= n) $> factor) | n <- numbers] :: IO [(Int, Maybe Int)]
          
              # py
              [print((n, if factor == n then factor else None)) for n in numbers]
          
          Note that they behave a bit differently though, and it's the reason I haven't included `return` in the python line (what does `print` return?). Python will loop through the list and actually run the `print` function on each element, while Haskell will loop through and collect all the `IO` into one function to run later. Although it's starting to get pretty un-Pythonic, you can hack the behavior into Python with something like:

              # py
              return lambda: [f for f in (print((n, if factor == n then factor else None)) for n in numbers)]
          
          Which would need to be run using something unholy like:

              # py
              factorize(numbers)()

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection