FeatureBase: Open-Source, Real-Time Database Built on Roaring Bitmaps
github.com
I've seen bitmaps mentioned a number of times lately. I must admit it is not something I am all that familiar with. Can someone explain to me why bitmaps are more valuable than standard column-oriented databases?
I haven't wrapped my head around how this helps speed up queries while data is being ingested.
They are useful for categorical variables. For example, is a record in the "Likes motorcycles" category? They are fast because (well, one reason) bitwise logical operations are very fast for CPUs to do.
Adtech is an example of a sector that benefits from this...they slice and dice datasets a lot to target ad campaigns and such. Being able to do that quickly is useful.
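To make the "Likes motorcycles" example concrete, here is a minimal sketch in Python. It is not FeatureBase's actual implementation; it only assumes that each category gets one bitmap in which bit i is set when record i belongs to that category. The category names and record IDs are made up for illustration, and a plain Python int stands in for the bitmap.

```python
# Hypothetical data: record IDs that belong to each category.
memberships = {
    "likes_motorcycles": [0, 2, 3, 7],
    "lives_in_texas": [2, 3, 5],
    "opted_out": [3],
}

def to_bitmap(record_ids):
    """Build a bitmap (a Python int) with bit i set for each record ID i."""
    bitmap = 0
    for i in record_ids:
        bitmap |= 1 << i
    return bitmap

def to_record_ids(bitmap):
    """List the record IDs whose bits are set, in ascending order."""
    ids = []
    i = 0
    while bitmap:
        if bitmap & 1:
            ids.append(i)
        bitmap >>= 1
        i += 1
    return ids

bitmaps = {name: to_bitmap(ids) for name, ids in memberships.items()}

# "Likes motorcycles AND lives in Texas AND NOT opted out"
# becomes a handful of bitwise operations over whole bitmaps.
segment = (
    bitmaps["likes_motorcycles"]
    & bitmaps["lives_in_texas"]
    & ~bitmaps["opted_out"]
)

print(to_record_ids(segment))  # [2]
```

Slicing a dataset for an ad campaign is then just a different combination of `&`, `|`, and `~` over the same bitmaps, with no per-record scan of the original values.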
So are you saying that the data is stored in categories which allows for those types of lookups to run faster? Do you have specifics on how the design of a bitmap-based database achieves this? How does it maintain these relationships? Just through 0s and 1s?
I guess it's easy for me to visualize both row and column based. I'm struggling with the bitmaps concept.
Here's a good write up on some of what you are asking in their blog: https://www.featurebase.com/blog/bitmaps-making-real-time-an...
So is the thinking you run them alongside a traditional RDBMS as sort of a cache or view optimized for bitwise operations?
I'm super hyped about this, I've been working on this for the last couple-few years and I'm optimistic about the return to being primarily an open-source thing.
Why is the bitmap database faster than other distributed databases? And what are the differences?
Instead of storing values like "dog", "cat", or "mouse", it stores (in this example) a three-bit binary number, with one bit per animal:
000 - the record can be associated with animals, but has no associations currently
001 - the record is associated with "mouse"
111 - the record is associated with "dog", "cat", and "mouse"
In the past, high cardinality data sets weren't good for storing in binary form, or a binary index, but nowadays there are ways around this. So, that list of animals could be quite large.
The primary reason it's so much faster is that a CPU works on many of these bits at once: a single bitwise instruction operates on a whole 64-bit word, and SIMD instructions operate on even wider registers. That makes them extremely fast.
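A rough sketch of that "many bits per instruction" idea, assuming one bitmap per animal value, packed into 64-bit words. The record IDs and the `pack` and `count_both` helpers are hypothetical and only illustrate the layout; Python itself loops slowly here, whereas in a compiled language each loop iteration would boil down to roughly one AND plus one population count over 64 records.

```python
WORD_BITS = 64

def pack(record_ids, n_records):
    """Pack record IDs into 64-bit words: record i is bit i % 64 of word i // 64."""
    words = [0] * ((n_records + WORD_BITS - 1) // WORD_BITS)
    for i in record_ids:
        words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words

def count_both(a, b):
    """Count records present in both bitmaps, one 64-bit word at a time."""
    total = 0
    for wa, wb in zip(a, b):
        # One AND covers 64 records; counting the set bits is a popcount.
        total += bin(wa & wb).count("1")
    return total

n_records = 200
dog = pack([1, 64, 65, 130, 199], n_records)  # records associated with "dog"
cat = pack([1, 65, 100, 199], n_records)      # records associated with "cat"

print(len(dog))              # 4 words cover 200 records
print(count_both(dog, cat))  # 3 records have both "dog" and "cat"
```

With one bitmap per value, a long list of animals just means more bitmaps, and a query only touches the bitmaps for the values it names.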
Any ideas on real-life use cases?
I am planning to use it in a project to make the new Congressional API data more approachable for people.
Hopefully make it easy to find all the bills that your specific congress person was involved in for example.
From what I understand, machine learning models may use this in ETL pipelines as well as serving as part of the models themselves. There's an article on that here: https://medium.com/analytics-and-data/overview-of-the-differ...
FeatureBase could be the "feature store" in the middle of the batch prediction section's diagram, or simply be a drop-in replacement for the model's registry.
Many! It was originally developed for marketing use cases: helping marketers understand up-to-date user behavior and find interesting segments.
But really it's useful anytime you need low latency analytics on fresh data.