Organizing and scaling an effective data team
robdearborn.com
Good overall post, but it somewhat conflates data analysis with machine learning. The issue with conflating the two is that, in practice, many engineers want to do ML work but few want to do analysis work.
One additional factor for success:
Hire with the right expectations. If most of the value from the role comes from data analysis, communicate that. Often ML/data roles are filled by dangling the carrot of developing and deploying complex machine learning products. ML is sexy: a lot of managers want to manage ML projects and a lot of engineers want to work on them. In reality, most teams need someone who is good at SQL and can code a simple metric/heuristic. It’s also important to communicate to the team that shipping simple solutions and simple analysis is a great outcome. Your analysis showed that you can achieve 90% of the initial goal with a simple if/else on one metric? Great! Deploying and maintaining ML models is hard and should be a last resort. I’m saying that as someone whose entire career depends on complex machine learning models.
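To make the "simple if/else on one metric" point concrete, here is a minimal sketch of such a baseline and how you might score it against labeled examples. The metric (`days_since_last_login`), the threshold of 14, and the toy examples are all made up for illustration; nothing here comes from the post.

```python
def flag_at_risk(days_since_last_login, threshold=14):
    """One-metric rule: flag a user who has been inactive for too long."""
    if days_since_last_login > threshold:
        return True
    else:
        return False


def accuracy(rule, examples):
    """Share of (metric, label) pairs where the rule agrees with the label."""
    hits = sum(1 for metric, label in examples if rule(metric) == label)
    return hits / len(examples)


# Hypothetical labeled examples: (days_since_last_login, actually_at_risk).
toy_examples = [(1, False), (3, False), (20, True), (45, True), (10, True)]

baseline = accuracy(flag_at_risk, toy_examples)
print(f"baseline accuracy: {baseline:.0%}")
```

A number like this gives the team something to beat before anyone commits to training and maintaining a model.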
In my experience, ironically, most of the model gains come from understanding and fixing data pipelines, datasets, tokenizers, and vocabs. It’s surprising how a team can spend time on a complex model when nobody bothered to run stats and see that 20% of samples are garbage or that the top tokens are nonsense. So in this sense a lot of “ML” work is data analytics or code debugging. I usually say that we should work on products, and do whatever work is required to advance the product at the moment.
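The kind of stats check described above can be very small. The sketch below assumes text samples, whitespace tokenization, and a deliberately crude definition of "garbage" (empty or whitespace-only); a real pipeline would use its own tokenizer and its own notion of a bad sample. The function name and sample data are hypothetical.

```python
from collections import Counter


def dataset_stats(samples, top_k=5):
    """Return (share of garbage samples, top_k most frequent tokens)."""
    # Assumption: "garbage" means empty or whitespace-only text.
    garbage = sum(1 for s in samples if not s or not s.strip())
    # Assumption: whitespace split stands in for the real tokenizer.
    tokens = Counter(tok for s in samples if s for tok in s.split())
    garbage_share = garbage / len(samples) if samples else 0.0
    return garbage_share, tokens.most_common(top_k)


samples = ["the cat sat", "", "the dog ran", "   ", "the end"]
share, top = dataset_stats(samples, top_k=3)
print(f"garbage share: {share:.0%}")
print("top tokens:", top)
```

Eyeballing the garbage share and the top tokens before training is often enough to catch a broken pipeline.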
Yeah, I absolutely agree that dangling an ML carrot in recruiting is a common tactic, even when the work isn't ML-related at all.
I've historically seen the ML team kept separate from Data Science/Analytics. I wonder if that helps with this.
I think it's reasonable as an early-ish stage startup to say to a candidate that ~"eventually, with scale, there will be cool and impactful ML opportunities here" as long as you're realistic and upfront about the facts that ~"right now most of the impact is in simple but foundational analyses" and ~"there'll be some amount of fires to put out and rote work to automate".
Love the post and I don’t think enough businesses today see data as core to their ability to execute well.
First question, how do you see definitions getting managed between the client team and data team?
e.g., “this is the canonical definition of churn”
Second question, where do you think custom and 3rd party infrastructure management sits (especially in the case where the data team doesn’t sit under the engineering org)?
To your first question, an overly general answer is "client and data team collaborate to find the canonical definition, document it in one place comfortable to both teams, and both teams know how to update or iterate on that definition when needed". It probably looks like analytics engineers trying a handful of SQL queries until the other team agrees with the results, focusing on the tricky/edge cases and not drowning the client team in a big CSV. Then formalize it as a dbt model.
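One way to picture "focus on the tricky/edge cases" is a single definition plus a short list of cases both teams have signed off on. This is a sketch in Python rather than SQL or dbt, and the 30-day inactivity window, the function name, and the cases themselves are assumptions for illustration, not a recommended definition of churn.

```python
from datetime import date, timedelta

# Assumption for illustration: "churned" = no activity in the last 30 days.
CHURN_WINDOW_DAYS = 30


def is_churned(last_active, as_of, window_days=CHURN_WINDOW_DAYS):
    """Apply the single agreed churn definition to one customer."""
    if last_active is None:  # never active counts as churned
        return True
    return (as_of - last_active) > timedelta(days=window_days)


AS_OF = date(2024, 3, 1)

# Tricky cases both teams agree on: (description, last_active, expected).
EDGE_CASES = [
    ("never active", None, True),
    ("active on the as-of date", date(2024, 3, 1), False),
    ("exactly 30 days inactive", date(2024, 1, 31), False),
    ("31 days inactive", date(2024, 1, 30), True),
]

# Any case where the definition and the agreed answer differ needs discussion.
mismatches = [
    name for name, last_active, expected in EDGE_CASES
    if is_churned(last_active, AS_OF) != expected
]
print("disagreements:", mismatches)
```

A dozen agreed cases like these are easier for a client team to review than a large CSV, and they can be carried over as tests once the definition is formalized.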
Second question, I think platform teams make sense, and at a large enough org this could be multiple layers with "data platform" sitting on top of "cloud platform". It's much less clear to me how to allocate those responsibilities when all the folks involved fit in a team or two instead of 3+. It's also unclear to me if something like Kafka is more "cloud platform" or "data platform".
Remember to ask, "Are you an effective team?" [1]