BM25opt
faster BM25 search algorithms in Python
based on https://github.com/dorianbrown/rank_bm25 by Dorian Brown
Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0
News:
- 1.1.0 supports updating the index with the new `add_documents()`, `delete_documents()` and `update_documents()` functions, see Example 4
Usage:
Input:
- `corpus` is a list of strings, e.g. `[ 'bla bla bla', 'this is document two', ... ]`
- `question` is a string, e.g. `'which text contains the word two?'`
- optional arguments:
  - `algo`: the BM25 algorithm; the default is `'okapi'`, with `'l'` and `'plus'` also available
  - `tokenizer_function`: the default is `tokenizer_default`, which is split-on-whitespace, lowercase, remove common punctuation
  - `idf_algo`: by default the same IDF as `rank_bm25` is used; values `'okapi'`, `'l'` and `'plus'` can override it to fix dorianbrown/rank_bm25#35
  - `k1`, `b`, `epsilon`, `delta`: constants with standard default values, see https://en.wikipedia.org/wiki/Okapi_BM25
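The real `tokenizer_default` ships with the library; purely as an illustration of the split-on-whitespace, lowercase, strip-punctuation behavior described above, a comparable tokenizer might look like this (the exact punctuation set handled by `tokenizer_default` may differ):

```python
import string

def tokenizer_sketch(text):
    # Hypothetical stand-in for tokenizer_default: lowercase,
    # replace punctuation with spaces, then split on whitespace.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.lower().translate(table).split()

print(tokenizer_sketch('Which text contains the word two?'))
# → ['which', 'text', 'contains', 'the', 'word', 'two']
```

Registering a function like this via `tokenizer_function=` guarantees that the same tokenization is applied to the corpus and to every later query.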
Example 1:
This example uses the default tokenizer and the default BM25Okapi algorithm and returns the top 5 highest scoring document ids and scores.
```python
# creating the index
bm25opt_index = BM25opt( corpus )
# search
results = bm25opt_index.topk( question, 5 )
print( 'results[0] id', results[0][0], 'results[0] score', results[0][1], 'results[0] document', corpus[ results[0][0] ] )
```
Example 2:
This example returns the list of document scores (in the same order as the documents in `corpus`) and shows algorithm selection and a custom tokenizer function.
```python
bm25opt_index = BM25opt( corpus, algo='plus', tokenizer_function=some_tokenizer_function )
doc_scores = bm25opt_index.get_scores( question )
```
Example 3: comparison with rank_bm25
This example shows the score list and that it matches `rank_bm25`, but NOTE: BM25opt input is not tokenized beforehand.
```python
corpus = [ ... ]
question = '...'
tokenized_corpus = [ tokenizer_default(document) for document in corpus ]
tokenized_question = tokenizer_default( question )
rank_bm25_index = BM25Okapi( tokenized_corpus )
bm25opt_index = BM25opt( corpus, algo='okapi' )
rank_bm25_scores = rank_bm25_index.get_scores( tokenized_question )
bm25opt_scores = bm25opt_index.get_scores( question )
```
Example 4: updating the index
```python
# creating the index
bm25opt_index = BM25opt( corpus )
# add new documents
bm25opt_index.add_documents( corpus2 )
# delete from the index
delete_ids = [ 1, 3, 5 ]  # list of document ids (indices in corpus) to delete from the index
bm25opt_index.delete_documents( delete_ids )
# in-place update changed documents in the index
update_ids = [ 1, 3, 5 ]  # list of document ids (indices in corpus) to change
updated_documents = [ 'first changed document', 'second changed document', ... ]
bm25opt_index.update_documents( update_ids, updated_documents )
```
Notes:
This is an optimized variant of `rank_bm25`. The key insight is that almost everything can be calculated at index creation time in `__init__()`, resulting in a word-to-document-scores dict, e.g.
```python
wsmap = {
    'word1': [ word1_doc1_score, word1_doc2_score, ... ],
    'word2': [ word2_doc1_score, word2_doc2_score, ... ],
    ...
}
```
then the query function just adds up the score lists for each word in the question, e.g.
```python
question = 'word1 word2'
doc_scores = [ wsmap['word1'][0] + wsmap['word2'][0],
               wsmap['word1'][1] + wsmap['word2'][1],
               ... ]
```
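To make the idea concrete (this is a toy illustration, not the actual BM25opt internals), here is a minimal sketch that precomputes a per-word score list at index time, using plain TF×IDF as a stand-in for the full BM25 weighting, and answers queries by summing the precomputed lists:

```python
import math

def build_wsmap(tokenized_corpus):
    # Index time: one score list per word, one entry per document.
    n_docs = len(tokenized_corpus)
    df = {}  # document frequency per word
    for doc in tokenized_corpus:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    wsmap = {}
    for word, d in df.items():
        idf = math.log(1 + (n_docs - d + 0.5) / (d + 0.5))
        wsmap[word] = [doc.count(word) * idf for doc in tokenized_corpus]
    return wsmap

def get_scores(wsmap, tokenized_question, n_docs):
    # Query time: just sum the precomputed score lists.
    scores = [0.0] * n_docs
    for word in tokenized_question:
        for i, s in enumerate(wsmap.get(word, [])):
            scores[i] += s
    return scores

corpus = [['bla', 'bla', 'bla'], ['this', 'is', 'document', 'two']]
wsmap = build_wsmap(corpus)
scores = get_scores(wsmap, ['word', 'two'], len(corpus))
# the second document contains 'two', so it scores higher
```

The query loop does no per-document recomputation at all, which is where the speedup over recomputing BM25 terms at query time comes from.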
Another important change is that inputs are un-tokenized and the tokenizer function is registered with the index, which avoids situations where the corpus is tokenized with a different function than the queries later. A simple `tokenizer_default()` function is provided as a default.