Empowering data scientists with a feature store
yiksanchan.com
Interesting, thanks for sharing.
How do you handle historical backfill for new features? As in, some feature that can be updated in streaming fashion but whose initial value depends on data from the last X years, e.g., total # of courses completed since sign-up.
Also, who is responsible for keeping the Flink jobs running: the data scientists, or do you have a separate streaming platform team?
Thanks for your kind words!
> How do you handle historical backfill for new features?
Currently, our feature store doesn't come with built-in feature backfilling; some manual work is needed to do that. We're working on a brand-new version of the feature store that hopefully addresses this need.
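To give a flavor of that manual work: a backfill for something like the "total # of courses completed since sign-up" feature mentioned above could be a one-off batch query over the historical event log before the streaming job takes over. A minimal sketch, assuming hypothetical Hive tables (not our actual schema):

INSERT OVERWRITE TABLE offline_feature_table
SELECT
    user_id,
    COUNT(*) AS total_courses_completed
FROM course_completion_events  -- historical events going back years
GROUP BY user_id;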
> Who is responsible for keeping the Flink jobs running: the data scientists, or do you have a separate streaming platform team?
We have a separate data infra team who is responsible for managing the YARN cluster for us.
This thing reads like it was written a few years ago, to my mind (source: I've been working in ML most of a decade now).
Disintermediation of data pipeline creation is definitely nothing new, and the technologies aren't that novel at this point either. I'd normally be surprised to see this on the front page, but it takes time for the lessons in this article to be learnt by enough people that it becomes humdrum.
Above all, it reminds me of a consultant friend telling me he had two clients who built feature stores - one with an open-ended goal of enabling people and one because they had some specific things they wanted to achieve. The outcomes they got were as dissimilar as their motives!
Thanks for the comments!
> This thing reads like it was written a few years ago.
Yeah, the technology here is nothing novel; Hive, Kafka, Flink, and Redis have been around for years. What I find missing on the internet is that the people who have been doing this for years are not writing about it. Uber has done a relatively good job of publishing how they built Michelangelo, but still, not enough detail for outsiders to replicate it.
> it takes time for the lessons in this article to be learnt by enough people that it becomes humdrum.
Maybe :)
> two clients who built feature stores - one with an open-ended goal of enabling people and one because they had some specific things they wanted to achieve.
Could you add more color to this part: what were their goals, and what did they end up achieving? I didn't fully get it.
> What I find missing on the internet is that the people who have been doing this for years are not writing about it.
In my last job I implemented a feature store from scratch, with ca. 500 hand-crafted features and ca. 2,500 features generated automatically by a code generator. It didn't only serve the current value of the features: the data scientists could manually populate an 'init' table with (customer_id, reference_date, target_value) tuples, and the pipeline re-calculated the historic feature values for the given customer and reference_date. So if a data scientist came up with a new feature definition, then after implementation (5 minutes to 2 hours per feature) he, and all other data scientists, immediately got access to the feature's history. We had so many features that I had to implement automatic feature pruning, otherwise the users got lost. We could train, test, validate and deploy models within 24 hours (model fitting ran overnight). When I left the company, we had ca. 40 models in production, managed by one person (me) part-time (3-4 hours a week).
This was in an offline business, so we didn't have to deal with feature-serving latency and didn't have to be able to change a feature's value during the day, so everything could run batch-based overnight.
Why didn't I write about it? Because it was implemented in PL/SQL running on Oracle Exadata, and in SAS. No one cares about feature stores implemented with tech like that. People care about models trained in Python, ported to Scala by Java devs, running in Docker on k8s, with features coming from HiveQL, Sqoop, Oozie or Spark and stored in Cassandra, MySQL or Elasticsearch. But do they have a feature store with built-in time-travel functionality?
Hi NumberCruncher, thank you for your reply!
> do they have a feature store with built-in time-travel functionality?
My feature store doesn't support time-travel yet, but many SaaS implementations do, including Tecton, Hopsworks, Splice Machine, etc. Open-source feature stores haven't implemented this critical feature AFAIK.
> Why didn't I write about it? Because it was implemented in PL/SQL running on Oracle Exadata, and in SAS. No one cares about feature stores implemented with tech like that.
Haha, I do. Actually I am thinking about implementing feature stores using some "old-school" DB technology. By the way, just curious since I've never used PL/SQL: Are you able to implement the time-travel functionality using pure SQL?
> My feature store doesn't support time-travel yet, but many SaaS implementations do, including Tecton, Hopsworks, Splice Machine, etc.
Good to know, I wasn't aware of them. At the end of the day, the make-or-buy decision is influenced by the cost factor. In our case the development cost ca. 50k€ (obviously no SV salary), we had a user base of 10 data scientists (incl. myself), and someone in-house who knew the ins and outs of the system and could implement new stuff. I have to admit I was sharing my office with our DB architect; it would be hard to put a price tag on his support. I don't know how the pricing of the SaaS solutions compares to this.
Almost all features (maybe with the exception of gender) were aggregations over time, like "share of emails the customer opened in the last X days", or "did the customer use service Y in the past: y/n", etc. If you use pseudo-SQL code like:
CREATE OR REPLACE TABLE target_table AS
SELECT
    i.customer_id,
    ... some aggregation processing data from t ...
FROM init_table i
LEFT JOIN transactional_data t
    ON i.... = t....
   AND t.time_stamp BETWEEN i.reference_date - (X days) AND i.reference_date
GROUP BY i.customer_id
then, depending on what you put into the init_table, you can obtain:
- in the prod environment: actual feature value(s) for one or more customers (reference_date = current_date)
- in the model-dev environment: historic feature value(s) for one or more customers for one or more past dates (reference_date = [past_date1, past_date2, ...])
In different parts of the system (prod, model-dev) you can use the same SQL code; you only have to swap the init_table and the target_table. Implementing a new feature system-wide basically means adding a new aggregation to the SQL code, once.
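For illustration, populating the init_table could look roughly like this (the table and column names here are made up):

-- prod environment: one row per active customer, scored as of today
INSERT INTO init_table (customer_id, reference_date, target_value)
SELECT customer_id, CURRENT_DATE, NULL
FROM active_customers;

-- model-dev environment: one row per (customer, past date) pair,
-- with the label observed after that reference date
INSERT INTO init_table (customer_id, reference_date, target_value)
SELECT customer_id, observation_date, churn_label
FROM labelled_training_population;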
I used PL/SQL as glue code, like someone would use Python: executing SQL code in a controlled manner, creating dynamic SQL the right way (the out-of-the-box Oracle solution) and the wrong way too (sticking SQL fragments together as strings), writing code generators, managing indexes/partitions/views/materialized views/etc. for performance tuning, implementing centralized monitoring for bug and performance tracking (and to know how my colleagues' projects were going), and, last but not least, for packaging. At the end of the day you have a library/module (PL/SQL package) exposing functions (PL/SQL procedures). The "library" is partially type-checked at compilation on the DB, and then you grant the necessary execute rights to the DB users (distribution to the data scientists) and they work with the historic results in their own VM (DB schema). Current code is searchable in the DB, all users use the same (current) version, and the code history is checked into Bitbucket. If a user wants to "fork" and create new features, he/she can just write plain SQL against the prod DB on his/her own, and the new code gets migrated through pair programming into the prod environment.
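A rough sketch of the packaging/grant pattern (the names below are purely illustrative, not the real package):

CREATE OR REPLACE PACKAGE feature_store AS
  -- a procedure the data scientists can call; name and signature are illustrative
  PROCEDURE refresh_features(p_init_table   IN VARCHAR2,
                             p_target_table IN VARCHAR2);
END feature_store;
/

-- "distribution" = granting execute rights to the data scientists' DB users
GRANT EXECUTE ON feature_store TO ds_user;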
Sorry for the formatting.
Thanks for sharing!
This SQL-intensive approach actually reminds me of the feature store solution by Splice Machine (https://splicemachine.com/press-releases/splice-machine-laun...). They implement an HTAP database and use it in a similar way to how you use PL/SQL. I had a write-up on this as well (https://yiksanchan.com/posts/splice-machine-feature-store).
Hopsworks supports time-travel since version 2.4 and it is open-source.
Sorry, I missed that.
I'm struggling to understand what a feature store is.
Is it another name for an OLAP or BI cube? I.e., a huge precomputed GROUP BY query with rollups.
The only new thing I see is that it combines both historical and recent data. Kinda like an OLAP cube with a lambda architecture.
It may be worth reading this article to get an idea of what feature stores are and how they are used https://feast.dev/blog/what-is-a-feature-store/
A feature store is just a set of features stored somewhere, together with the instructions for how to compute them. The benefit is that you can easily add a new computed feature to the store by telling it how it is computed, and then that new feature just works like any other feature. This lets you create a lot of reusable features and signals just by adding code, without touching the data. So you can do code review, add documentation, and spread the knowledge of how these signals work throughout the organization.
I'm not sure how this relates to OLAP cubes since I am not aware of that term.
SQL also enables stored computation of columns through views. How does a feature store differ from a view? Or is it just a different name for basically the same thing?
I suggest checking out the link from @willempienaar. He is indeed the domain expert since he is building the open source feature store https://feast.dev/ :)
If I understand it correctly, a feature store is an API that serves data obtained through some sort of (materialized) view. The feature logic can be partially implemented outside the database (which is either stored in memory or on disk).
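As a sketch of that picture (all names illustrative): the current-value side could be little more than a materialized view, while the point-in-time history side (what others in this thread call time-travel) is the part a plain view alone doesn't give you.

-- hypothetical materialized view serving current feature values
CREATE MATERIALIZED VIEW current_customer_features AS
SELECT
    customer_id,
    COUNT(*) AS emails_opened_last_30d  -- a typical aggregation feature
FROM email_events
WHERE event_type = 'open'
  AND event_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY customer_id;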
There are some nice insights and engineering ideas in here. Thanks for writing this and sharing!
Thank you for your kind words!