Building a full-text search engine in 150 lines of Python code (2021)
bart.degoe.deThis is a good intro to text search. Shameless plug: If you throw in a bit more, ca. 250 SLOC, you can have BM25 search: https://github.com/jankovicsandras/bm25opt
You can probably have phrase matching in a hundred lines more, maybe less.
Most of the difficulty in search is dealing with the sheer volume of data. The algorithms themselves are pretty trivial for the most part.
Mine: bm25 I use for teaching (sorry for the French example)
https://gist.github.com/benob/69d48421f88f5dcc2b26a204d3251d...
Merci! Mais faut-il s'en excuser? L'exemple est en français, voilà tout!
This is an English-speaking forum, so yes, it is expected to use English and link to other stuff in English (or provide a translation). It doesn't mean that French (or any other language) is a bad thing, it just isn't the language used by this community.
Sure... but for the record, the code that the parent posted is in Python -- one of the global tech economy's pre-eminent programming languages for more than a decade. The parent's example contains some variables with Unicode values that are French words. It strikes me as odd that the parent would say sorry at all. Maybe the parent has a better understanding than I of linguistic sensitivities.
> return [token for token in tokens if token]
I love that kind of bullshit poetry.
If you're used to this, it's nice and readable.
Or you can do,
Not obvious, but giving None is like giving "lambda x : x" to filter().return filter(None,tokens)Is this any different from `filter(bool, tokens)`? (similar to the JS equivalent mentioned in a sibling)
If not, I'm now curious why the `None` special case was added to `filter`..
https://stackoverflow.com/questions/8748036/is-there-a-built...
While this doesn't explain why filter accepts None as the lambda, I think the fact that there are no builtin identity function kind of explains why. It's easier to write None rather than put an identity lambda. I reckon it also makes it easier for a compiler to optimize this specific case.
Thanks for the explanation. I take your point, but in the `filter` case `bool` seems just as easy as `None`, but with the added benefit that no awareness of a special case is needed -- any python programmer should understand how the `bool` builtin works. Curious if there's any downside.
You can do the same thing in JavaScript with Boolean
Eg, arr.filter(Boolean)
Holy shit, I’ve been writing python for 15 years and it’s the first time I’ve seen None used as an identity function. That’s nuts! How does it work under the hood? Does filter have a special case for evaluating None?
Yes, that’s just a special case. You can’t call None as a function.
https://docs.python.org/3/library/functions.html#filter
> If function is None, the identity function is assumed, that is, all elements of iterable that are false are removed.
It grew on me, too. Originally I frowned upon this kind of python shenanigans but now I must confess it makes me a bit happier inside whenever I have to type a similar thing.
I think its lovely if its reasonably readable (which this is) but they can be convoluted. I have written Python list comprehensions that I could not read myself the next day and so I am more careful now.
Yeah, but what is worst? I tried to understand the Haskell way in an article last week[0]
Which I think is more or less the same as this python line.pure (n, guard (factor /= n) $> factor)
The article does fancy stuff with memory caches which I believe is easy to do in python but I need to understand the Haskell code better.return [factor for factor in factors if factor and not factor == n][0] Haskell: A Great Procedural Language https://entropicthoughts.com/haskell-procedural-programming#... https://news.ycombinator.com/item?id=42754098
> pure (n, guard (factor /= n) $> factor)
Returns a tuple: Left side is n, right side is (Just factor) if factor is not n, or Nothing if it is.
You're being way to clever! The machinery from the article is only needed to deal with IO.
A more direct translation:
The full functions:# py [token for token in tokens if token] -- hs filter (not . null) tokens# py def analyze(text): tokens = tokenize(text) tokens = lowercase_filter(tokens) tokens = stopword_filter(tokens) tokens = stem_filter(tokens) return [token for token in tokens if token] -- hs analyze :: String -> [String] analyze = filter (not . null) . stem_filter . stopword_filter . lowercase_filter . tokenizeThanks I will look at that too. Clever and ignorant are synonyms in my book, and I am ignorant here. In this case IO and memory was what was interesting in the article.
I'm being way to clever... I read your article when it was originally posted, but hadn't realized that you're quoting from it! My bad...
If you'll humor me, I'd love to take another shot, but from the more interesting direction!
Adding a type might clear stuff up a bit:
So without the for loop, there's no list to comprehend. Adding the list back gives something more equivalent to a standard for loop in Python:-- hs pure (n, guard (factor /= n) $> factor) :: IO (Int, Maybe Int) # py print((n, if factor == n then factor else None))
Which could also be written as a list comprehension in either language:-- hs for numbers $ \n -> pure (n, guard (factor /= n) $> factor) :: IO [(Int, Maybe Int)] # py for n in numbers: print((n, if factor == n then factor else None))
Note that they behave a bit differently though, and it's the reason I haven't included `return` in the python line (what does `print` return?). Python will loop through the list and actually run the `print` function on each element, while Haskell will loop through and collect all the `IO` into one function to run later. Although it's starting to get pretty un-Pythonic, you can hack the behavior into Python with something like:-- hs [pure (n, guard (factor /= n) $> factor) | n <- numbers] :: IO [(Int, Maybe Int)] # py [print((n, if factor == n then factor else None)) for n in numbers]
Which would need to be run using something unholy like:# py return lambda: [f for f in (print((n, if factor == n then factor else None)) for n in numbers)]# py factorize(numbers)()