Dev / Stage / Prod is the wrong pattern for data pipelines

enigma.com

39 points by merinid 2 years ago · 18 comments

SOLAR_FIELDS 2 years ago

Yet another discovery that ephemeral environments are the more robust way to design things. Yes, they are harder to set up and require some proper thought and engineering, but they pay for themselves many times over down the line.

  • ryan_green 2 years ago

    Totally agree. And creating ephemeral environments for data pipelines is quite a bit more challenging than for systems with a less complicated data state. Nonetheless, this has already paid off for us many times over.

  • noman-land 2 years ago

    What does this mean exactly? Containerized everything? Cloud IDEs?

    • klysm 2 years ago

      However you do it, the ability to spin up a fresh env with one click or so

      • SOLAR_FIELDS 2 years ago

        Yep, basically “cattle, not pets” in code. There should only ever be “what runs in production” and “the best imitation of what runs in production”, dynamically spun up. Anything else is going to hurt you in the long run.
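
        For a very rough sketch of what “dynamically spun up” could look like, here's a hypothetical Python wrapper around Docker Compose that stamps out a disposable copy of the whole stack per branch. The compose file, service layout, and project-naming convention are all assumptions about a particular setup, not a prescription:

          # ephemeral_env.py - hypothetical one-command throwaway environment
          import subprocess, sys

          def spin_up(branch: str) -> None:
              project = f"env-{branch}"  # namespace the whole stack by branch
              # Build and start an isolated copy of everything, from scratch.
              subprocess.run(["docker", "compose", "-p", project,
                              "up", "--build", "-d"], check=True)

          def tear_down(branch: str) -> None:
              # Remove containers *and* volumes so nothing lingers into a "pet".
              subprocess.run(["docker", "compose", "-p", f"env-{branch}",
                              "down", "-v"], check=True)

          if __name__ == "__main__":
              spin_up(sys.argv[1])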

      • water9 2 years ago

        So, like starting from a virtual machine snapshot?

        • yencabulator 2 years ago

          Beware, recreating a VM snapshot can be very difficult, and its lineage may become an unreproducible "pet".

          Something you are able to construct from scratch on demand, automatically, is preferable.

        • klysm 2 years ago

          I don’t think so. I prefer it to be reproducible from scratch so you don’t have the state from production. That way you can figure out what’s in there and remove the fear of change.

NBJack 2 years ago

This doesn't scale well. It is a perfectly fine approach for smaller systems with a few dependencies, but you are going to have serious headaches whenever you (1) start to see more complex internal system dependencies, and/or (2) start taking on deeper integration with external systems like cloud infrastructure, other services, etc. Once you hit this inflection point, you either start getting very robust with your integration boundaries (and likely developing more complex 'stubs'), or go the dev/stg/prod route.

You start with a database? Great. But wait, you need bulk storage now, so you start sticking it in a cloud bucket (and ensure you use a separate namespace for it). But then Team 2 introduces a new service you now need to spin up in a separate container, so you pull their repo. Then there's a production issue that could have been solved by proper A/B testing, so you decide to go with a third-party solution that offers it. The party continues, and soon your simple one-click setup becomes so complicated that you end up with a full-time person just keeping it alive. Whoops! Someone got the cloud namespace wrong on their desktop instance, and production data got hosed. Etc.

  • saltcured 2 years ago

    Yeah, there's a lot of hidden magic/assumptions in having a "writable snapshot of a specific version" of production data. For a complex system that has more than one stateful store, this is no small feat.

    The dev/staging sandboxes are essentially the pragmatic hack to create these snapshots. Ugly sacrifices are made to construct the writable snapshot across disparate pieces. It becomes a headache when there is too much contention to use these sandboxes, or too much manual effort to reset them to a desired testing state. Also, if the sandbox copy-on-write mechanism differs too much, you end up changing the test environment so much that you are no longer emulating how it will behave in production. So the old-school approach is a replica of the full environment on redundant hardware matching the same characteristics as production.

    But before I read the linked article, I was expecting a different anti-pattern to be discussed: where people forget that the dev/staging processes are for software testing, to prepare for when you deploy high-quality software to production. They are not for data preparation. Your deployment eventually needs to combine new software with the existing production data, and not depend on accumulated state of the sandbox data. I've seen people twist themselves into pretzels conflating software and data, and trying to somehow move data from the sandbox into production in a misguided "upgrade".

    Software flows from developer, through the sandbox(es), to eventually be in production use by users. Data flows the opposite direction, from production users into snapshots loaded into sandboxes, and eventually into developer's hands with their experimental code. Ignoring of course situations where developers are not authorized to see real user data...

    • ryan_green 2 years ago

      saltcured, I find these comments super insightful!

      > Yeah, there's a lot of hidden magic/assumptions in having a "writable snapshot of a specific version" of production data.

      That's absolutely a huge assumption. This technology has been a game changer for us: https://lakefs.io/

      > It becomes a headache when there is too much contention to use these sandboxes, or too much manual effort to reset them to a desired testing state.

      This is exactly the situation we were encountering.
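
      To make that concrete, the copy-on-write "writable snapshot" is essentially one branch-create call. Here's a rough sketch using the lakefs-client Python package; the endpoint, credentials, and repo/branch names are made up, and the exact method names may differ between client versions:

        import lakefs_client
        from lakefs_client import models
        from lakefs_client.client import LakeFSClient

        # Placeholders for your lakeFS installation and credentials.
        config = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1",
                                             username="ACCESS_KEY", password="SECRET_KEY")
        client = LakeFSClient(config)

        # A zero-copy, writable "snapshot" of the production data for one experiment.
        client.branches.create_branch(
            repository="pipeline-data",
            branch_creation=models.BranchCreation(name="experiment-123", source="main"),
        )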

  • ryan_green 2 years ago

    NBJack, your point about the difficulty of managing external dependencies is well taken. That said, our data pipeline uses cloud storage and multiple external services, and the scenarios you're describing haven't materialized so far. We have found that we need to take extreme care in managing the logical state of the data pipeline (e.g. ensuring that we use explicit versions of external services). And we can certainly end up in trouble if an external service provider violates their API contract; I don't think this is a replacement for a strong data testing regimen, which is what would hopefully save us if that occurred. I also think you can encounter these same issues if you go the dev/stage/prod route. Curious to get your thoughts.
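
    To illustrate the kind of thing I mean by "explicit versions" (the service names and version strings below are entirely made up), the idea is that nothing in the pipeline resolves to "latest" at run time:

      # pinned_services.py - hypothetical illustration: every external dependency
      # the pipeline touches is recorded as an exact version, never "latest".
      PINNED = {
          "warehouse_image": "postgres:15.4",        # exact image tag
          "geocoder_api_version": "2023-06-01",      # date-versioned HTTP API
          "enrichment_service_path": "/v3/enrich",   # path-versioned REST endpoint
      }

      def pinned(name: str) -> str:
          # Fail loudly if someone adds an unpinned dependency.
          if name not in PINNED:
              raise KeyError(f"external dependency {name!r} has no pinned version")
          return PINNED[name]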

    • NBJack 2 years ago

      Believe me, I'm not claiming that the dev/stage/prod pattern is any kind of cure-all. It has its own problems, which are probably too numerous for a late-night post.

      From what you've described, you're doing the right thing for you and your team. Keep it simple as long as you possibly can. I can only advise you to keep weighing the time needed to maintain your approach against the return you get from it.

      The key advantage of the dev/stage/prod approach only shows up at sufficient scale and with proper discipline among teams, each maintaining their own version of their product at the dev and stage points. This has plenty of headaches, but you at least get a chance to exercise your work in something that is as close to production as possible without actually being there. It tends to work 'best' when you only start holding other teams accountable at the stage point.

      Cloud dependencies are where I've seen things get the weirdest and most volatile. There are all kinds of limitations that can crop up even if you try to maintain the highest level of separation and discipline.

      For example, did you know that AWS limits an account to 5 Elastic IP addresses per region by default, and that there's an upper limit to how many Elastic Network Interfaces can be held in a region? [1] It sounds stupid, but I've actually seen these limits hit even after politely asking AWS to make them as large as possible; keeping developers empowered to deploy their own, compartmentalized version of the product became a real pain.

      [1] https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-...

erhaetherth 2 years ago

Sounds like a complicated way of saying every developer should have their own DB instead of a shared dev instance.

  • ryan_green 2 years ago

    Sort of. The main issues we've found with each developer having their own DB on a complex data pipeline are: 1) if that DB contains petabytes of data, creating one for each developer is non-trivial from a time and cost perspective; 2) the developer often needs to develop and test multiple changes they want to isolate from each other (think git branches); 3) the data state the developer is operating on gradually becomes stale, so results deviate from prod.

    Anyway, hope that perspective is helpful.

Raminj95 2 years ago

Are there more examples or blog posts that talk about this? I find the idea interesting and possibly very applicable in my work, but from this post alone I don’t feel like I have grasped it well enough to implement it.

  • ryan_green 2 years ago

    Apologies for not getting into more detail -- I wanted to start by covering things at a high level. There are a few key concepts that might be helpful:

    * Data state - the contents of both your data and metadata at a given point in time. If your data doesn't fit into a single database, this can be difficult to manage. We use this technology to help us: https://lakefs.io/

    * Logical state - everything you use in processing the data in your pipeline (i.e. code, config, info for connecting to external services, etc.). This can all reside in git.

    We found the key was associating our logical state (git branch) with our data state (lakeFS branch). We make this association during our branch deployment process.
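
    As a rough sketch of what that association can look like at deploy time (the helper and repo name below are hypothetical -- it just names the data-state branch after the git branch and hands both to the pipeline's environment):

      import subprocess

      def current_git_branch() -> str:
          # The logical state: code + config live on this git branch.
          return subprocess.run(["git", "rev-parse", "--abbrev-ref", "HEAD"],
                                capture_output=True, text=True, check=True).stdout.strip()

      def deploy_branch_env() -> dict:
          branch = current_git_branch()
          # The data state: a lakeFS branch with the same name, created from main
          # if it doesn't exist yet (creation call omitted for brevity).
          # The pipeline reads both values from its environment at run time.
          return {
              "GIT_BRANCH": branch,
              "LAKEFS_REF": f"lakefs://pipeline-data/{branch}/",  # made-up repo name
          }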

    Let me know if this helps at all. I was planning to write a follow up post about what we learned about managing the logical state of a data pipeline. If you have suggestions for a different topic to dive into, I'd love to hear about it.
