AI product development is being held back by data engineering
arch.dev
Hey HN, Arch CEO here! Our team has been working at the intersection of data engineering and software engineering for a few years now with Meltano (https://meltano.com). This year, the rise of Generative AI has made it clear that the bottleneck in unlocking the potential value of data has shifted from data integration on data teams to data engineering on software teams, so we’ve decided to do something about it.
This post is about the high-level motivations and long-term vision for https://arch.dev; please let me know if any of it resonates or if you think I’m totally off the mark :) We’ve also got a post that goes into more detail on the Arch product itself: https://www.arch.dev/blog/announcing-arch-the-data-backend-f...
I was saying this like 5 years ago.
What have you seen in the past five years that's made progress in this area? Plenty of "Modern Data Stack" tools have certainly helped, but those usually feel separate from ML/AI workloads.
I'm not aware of a recent "killer app" in that area.
Practically, people who succeed at this get it working end-to-end with whatever compromises that entails. They might tell anyone that they never would have done it that way had they known how it would turn out, yet they have a bird in the hand.
There's a problem of scale mismatch.
My RSS reader needs the toolbox of
https://scikit-learn.org/stable/model_selection.html#model-s...
to make reliable scripts that can rebuild a model when the data changes. The version built into scikit-learn has the features I need; other ones don't. scikit-learn is great for problems of a certain size that take, say, 10 minutes to run.
That scale turns out to be appropriate for very fast prototyping of systems that need about a human-week of judgements to light up, that can be updated daily, etc.
Someone is going to insist on using slower models that take two hours to train (wrapped up in a model selection process), where you worry the machine might crash, and have to take a "distributed systems" approach that adds a terrible overhead for jobs that don't need it. If I liked the model selection story I could probably live with that but so far I don't.
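For anyone who hasn't used it, here's a minimal sketch of the kind of "rebuild the model when the data changes" script I mean, using scikit-learn's model selection tools. The synthetic dataset, the logistic regression pipeline, the parameter grid and the `rebuild_model` helper are all placeholders I made up for illustration, not what my RSS reader actually runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def rebuild_model(X, y, seed=0):
    """Re-run model selection from scratch on the current data.

    Hypothetical helper: returns the fitted search object and a held-out score.
    """
    # Hold out a test split so the reported score is not the one tuned on.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # make_pipeline names each step after its lowercased class name.
    param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
    # 5-fold cross-validated grid search, refit on the best setting.
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    # score() uses the same scorer as the search (ROC AUC here).
    held_out_auc = search.score(X_test, y_test)
    return search, held_out_auc


if __name__ == "__main__":
    # Stand-in for "the data changed": regenerate and rebuild.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    search, auc = rebuild_model(X, y)
    print(search.best_params_, round(auc, 3))
```

At this size the whole search finishes in seconds on one machine, which is the point: you can rerun it from cron every day without any distributed machinery.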