We cut our CI pipeline execution time in half

tinybird.co

69 points by alrocar 3 years ago · 99 comments

Scubabear68 3 years ago

Maybe I am just an old fuddy duddy conservative, but this struck me from the post:

“In the grand scheme of things, one week isn’t that long. But to us, it felt like forever. We are constantly iterating and release multiple changes every day”.

I assume they mean multiple production releases? Is this because the product lacks maturity or stability, or is it just your culture?

I am asking because I am trying to imagine the impact of this on existing customers. It sounds like an awful lot of churn.

This obviously happens a lot in the “you are the product” space like Facebook, Google, etc. But this looks to be a data analytics product with paid tiers. Curious what tooling and processes you have to support this, and how you keep customers happy with this model.

  • mariosisters 3 years ago

    I think it’s that you are an old fuddy duddy :P

    Actually, if you work with SMBs/enterprises, I agree with you on customer facing changes. In my past life we would ship very frequently (often more than once a day) but always had to feature flag changes that large clients might see or be affected by. Even something as simple as tweaking the layout of a core flow could cause support headaches and angry customers — customers worth 10s of thousands of dollars per month. Is it worth losing a customer to CD a new button placement?

    • chrisandchris 3 years ago

      I can only imagine how clean code looks & works when it's full of feature flags. Glad that I don't need to do that too often :)

      • mariosisters 3 years ago

        The right approach is to immediately remove the flags after rollout…

        The actual approach is to maintain a million fucking feature flags, ensuring that almost all possible combinations are essentially untested… better hope you did a good job separating concerns!

      • tempest_ 3 years ago

        They keep adding them until they need an internal library to manage their collection of feature flags.

        Then after a while they graduate to feature flags as a service (which a bunch of cloud services are having a go at).

  • scott_w 3 years ago

    This is considered the norm for high performing product teams in the modern day.

    We keep customers happy because we push changes live incrementally, reduce our chances of major outages and improve our response time when they do occur.

    • hedora 3 years ago

      As a customer, if I find a competitor that does not do this, then I will switch to it.

      For example, I cancelled my netflix subscription because they are unable to reliably operate microservices, and the UI was always in some semi-broken state. As a software engineer, this stressed me out during my relaxing TV time.

      Even if continuous delivery is somehow reliably delivered, if the changes are customer visible, then they break my muscle memory, and increase my cognitive load -- I have to re-learn the damned UI every fucking time I log in. If the changes are not customer visible, then what business value do they deliver?

      • scott_w 3 years ago

        And yet the numbers show that companies that continuously deliver updates massively outperform those that don't. You claim to make decisions based on a company's engineering practices, but I can guarantee you have no idea what the engineering practices of the companies you do support actually are.

        • deely3 3 years ago

          Wait, can I see these numbers grouped by industry, if possible? Please?

          • igetspam 3 years ago

            Yes. Read Accelerate and read the State of DevOps reports from 2017 through 2022. The reports have the data and explain their methodologies for evaluating said data. It's all there for you to consume.

            • deely3 3 years ago

              Oh, so these reports are just surveys... I was hoping for some hard analysis/statistics...

              • scott_w 3 years ago

                Is there anything that will satisfy you, or will anything that anyone presents be dismissed? I could spend days hunting out the evidence and you'd start questioning whether it's peer reviewed, or if the peer review was stringent enough, etc.

                Meanwhile the rest of the world is moving forward without you.

          • scott_w 3 years ago

            Read any book on the subject, they’ve done the research for you. The Phoenix Project, The State of DevOps Report.

            Spend some time in companies that move slow vs fast and you’ll see the difference in their success first hand. You’ll see the metrics on their incidents and severity and customer satisfaction with them.

            Oh, and the fact two companies mentioned (Google and Facebook) are two of the most successful companies on earth.

            • Scubabear68 3 years ago

              See my comment elsewhere taking “people are the product” out of this conversation.

              Facebook and Google and like companies do not care if they piss users off. They churn features constantly, break people’s flows on a regular basis, A-B test features so different people get different experiences.

              They get away with this because the “users” aren’t users, they and their data are the product. You pay nothing to use their services, and you get what you pay for.

              • scott_w 3 years ago

                Good thing my argument didn’t hinge on them, then. Mind picking up on anything else?

                • Scubabear68 3 years ago

                  You used them as exemplars.

                  Can you point out specifically where someone says releasing up to 20x a day into production for a paid platform is a good thing?

                  I am not talking about the capability. I grant that being able to release quickly and often is desirable. I question the wisdom of actually releasing multiple times a day into production as a best practice. In my mind it is highly inefficient and offers too much churn for users (e.g. exactly my experience with Facebook, a chaotic, broken product that changes hourly and makes people feel like the product is gaslighting them).

                  • scott_w 3 years ago

                    Read above: Accelerate and The State of DevOps are two reads.

                    From experience, across teams, we deploy probably 20x a day and we're actively trying to deploy more often. Our customer NPS scores are pretty high and our change failure rate is low. We're also a very successful B2B SaaS.

                    • Scubabear68 3 years ago

                      Assuming an 8 hour day, that’s a deployment every 24 minutes, and that’s not good enough?

                      Why on Earth wouldn't you coordinate releases? Seems like utter chaos.

                      • igetspam 3 years ago

                        Have you taken the time to read any of the reference material? Have you looked at any of the actual data? Are you aware of how CD with trunk based development works?

                        You're making assertions that imply you haven't considered anything but your own bias. Your question only works if you have no understanding of how these things actually function.

                        The goal is to reduce the amount of time it takes for a change to land in production. Your eight hour work day implies you have developers only in a single location too, instead of the distributed work force most of us have now.

                        You seem to be (blissfully?) unaware of how high performing organizations operate.

                      • scott_w 3 years ago

                        Where is your data to back up your assertion that it's "chaos"? What research can you cite in opposition to the information I've provided?

                        Where are the customer satisfaction surveys that say deploying to production multiple times a day is making their experience worse? Where are the articles showing deploying more reduces revenue? Where is the link between more deployments and reduced reliability?

                        We’ve provided our evidence, now put up or shut up.

            • deely3 3 years ago

              Look, you're not convincing at all.

              1. Should I skip books that do not confirm your point of view?

              2. What if I already spent time in these companies and found no big difference?

              3. So why do companies that use a different approach still exist and are even (mamma mia) profitable?

  • recfab 3 years ago

    This just sounds like Continuous Delivery. We never achieved it in my last job, so I can't speak from experience, but my understanding is that typically "deploy" is separated from "release" using feature flags of some kind.

    • Scubabear68 3 years ago

      The article starts with “Last year, we made the difficult decision to stop deploying any changes to production for one week” and goes on to talk about releases.

      In that context I assume this means they make multiple production releases per day (which makes me shudder). I am curious how they do this while maintaining high quality and not driving customers insane.

      • sdairs 3 years ago

        Hey, Al from Tinybird here (co-author of the post). We've made up to 20 production releases per day some days. It's transparent to our users, they aren't even aware the upgrade is happening, there's no upgrade button to hit, there's no downtime. We release often because we release small and fast. It's not like those 20 releases are always fundamentally changing the product. We would rather fix a minor bug or two and get that out to our customers ASAP, than hold on to it for a few months and drop a huge change. In a vast majority of cases, a user won't even consciously notice something changed.

        Doing this kind of fast iteration has its risks, but it has its benefits too. We de-risk it, in part, by having extensive CI, which is why it was so important to us that the CI is fast & reliable.

        Delivering larger, less-frequent updates has its own risks. You're not practising your release process as frequently, so it's a much bigger event. You're pushing many, many more changes in one go, so there's a lot more surface area for something to go wrong, and rolling it back is a much bigger job. And dropping many/bigger changes to the user experience is much more noticeable.

        Again, this isn't the right process for everyone, but it works for us and it's how we've managed to build a product that delivers value to our users.

        • chrisandchris 3 years ago

          It's interesting how you distinguish between "deploy ASAP, within half a day" and "hold the fix back for a few months".

          Like, is there nothing in between? Like once a week, once every other week?

          If you had to estimate, wouldn't there be fewer bugs if you deployed less quickly (and used that time for validation)?

          • sdairs 3 years ago

            Sure :D

            I've worked on products across the spectrum, Enterprise software that releases twice a year, smaller stuff that ships once a month, and now a SaaS that releases many times a day. I don't have the data to compare, but I genuinely don't believe the rate of bugs was materially different between any of them.

            We make faster changes, but the changes are much smaller, and so the surface area to test is smaller. The more changes you make, the more time you need to validate. If you make a week's worth of changes, you're going to need an appropriately longer validation cycle than if you make 1 change. It scales, and in my experience, many products with slower release cycles aren't appropriately scaling up the validation time to match.

            That said, this doesn't necessarily mean we're writing code and 30 minutes later it's in production...there's still an iterative dev cycle with lots of validation happening...but if something is ready to go, ship it!

        • Scubabear68 3 years ago

          Thanks for the response Al!

          Very interesting. I agree different strokes for different folks, you guys seem to be on the extreme end of CI/CD.

          Have you done any sort of analysis you could share on what it costs to release up to 20 times per day?

          • sdairs 3 years ago

            We do track the real $ cost of time & materials, but tbh I don't think it's anything too exciting. I'll see what we can share!

      • fragmede 3 years ago

        > multiple production releases per day

        Most (but not all) SaaS businesses are expected to these days, so I'm curious what your business/industry is that not only do you not, but that it gives you the shudders.

        • Scubabear68 3 years ago

          I consult across a few different industries, but it includes SaaS offerings. Many places could do multiple deploys to prod a day, but choose weekly releases or other longer cadences. This is to allow for documentation, client notifications, etc. It is also more efficient: constant releasing requires a lot of resources; as this blog implies, they are churning dozens of K8s pods several times a day.

          I would think constant releasing would also make debugging prod issues pure hell.

    • drewcoo 3 years ago

      It sounds like continuous deployment, not continuous delivery.

      Continuous deployment deploys code to production frequently, as soon as it's ready.

      Continuous delivery has some ready-to-deliver branch that's constantly being updated as above, but changes aren't deployed to production until someone (Product Owner?) or something (Yay - end of sprint!) triggers it.

      Different people may use the word release for at least this many things: 1) a deployment, 2) an unveiling via feature flags, 3) a public announcement despite the code already having been live.

  • alrocarOP 3 years ago

    Yep, it's part of our culture: we do many releases per day so we can constantly iterate. Also, as in any other project, there are maintenance changes and bug fixes we want to bring to production as soon as possible.

    Our context is that of a startup that is constantly validating things. Also, in our context a release does not necessarily mean releasing to users; sometimes stuff is behind feature flags or out for beta testing.

  • drewcoo 3 years ago

    Hey there! I'm an old fuddy duddy, too.

    Continuous deployment has been around long enough that even IBM (remember never getting fired for buying IBM?) talks about it.

    https://www.ibm.com/topics/continuous-deployment

    "Dark deploys" and "feature flags" are often used to keep customers safe from incomplete features while still giving all of the advantages of CD plus allowing testing in production.

    I'd never heard of Flagship, but this is a nice writeup on that (kudos, Flagship.io):

    https://www.flagship.io/glossary/dark-launch/

  • tetha 3 years ago

    Mh, I'm interacting with teams with wildly different release strategies and stability requirements at work.

    One of the more fundamental things actually pushing towards faster releases is what I call the relativistic deployment speed. We have products that will need at least 2 months to get a remotely deployable version ready. The average fast hotfix usually takes more like 4 months until an installation on a prod system actually can start. Our fastest products can go from code to prod in like 15 minutes with the automated tests being the bottleneck.

    This in turn shapes choices for the product managers, but also for security. If something like Log4shell hit these slow products, I'd have to plan to be vulnerable for two months at least, and usually more like 4 - 8 months depending on the customers. I have no choice, because that's their light speed of deployment. No code goes to prod faster than two months latency. That, quite frankly, fucking sucks.

    Other products were much better in that situation. We were lucky to have the right devs around, but we went from the decision to emergency-patch log4shell at a decidedly risky speed to the first of many log4shell patches in prod within 30 minutes.

    However, that's not the normal speed, and that's when you get into the second decision area. Given a lightspeed of deployment, how fast do you want to go?

    Some of our possibly faster-moving products are B2B products, with a lot of internal training for support and consulting going into a release, and also training happening at larger customers. This means product chooses to only release bigger changes and heavily customer-visible changes every 6 weeks. They could do this a lot faster, but they choose to slow down because it fits their customers well. And, for example, December is usually frozen entirely because customers want it that way.

    But then there is the third decision area. What happens if there is an entirely customer-invisible change, such as an optimization in database handling, some internal metric generation for an optimization, or an internal change to prepare a new feature for the next scheduled rollout? And we have the tested, vetted and working option to just push that into prod without downtime, which also gives us the opportunity to build experience with, and confidence in, our no-downtime deployment system? I don't see a reason why I wouldn't exercise this at least once daily.

  • igetspam 3 years ago

    Read the State of DevOps reports over the years and you'll see why this is the direction we're all heading now. It turns out all that safety we thought we were building by making complex commit flows to multiple branches and environments was not only more complex than it needed to be but has also slowed us down and not made things better. Trunk-based development is back again, and this time with data. Push early, push often, push small changes and iterate quickly. It's not just easier but it also seems to increase quality. (There are a lot of reasons why this turns out to be true. Read Accelerate. I won't do a better job explaining in a comment.)

  • lelandfe 3 years ago

    That’s continuous delivery, right? You make great tests and you should feel comfortable releasing after review.

    • mariosisters 3 years ago

      In my experience, and to the parent's point, it's not about your comfort; it's about documenting, notifying clients, updating support, etc. All the non-code parts of selling software. As you suggest, if the code has been reviewed, tested and merged, it "should" be ready to go. Right?

dijit 3 years ago

I have a somewhat related question.

I'm using gitlab-ci with its docker executor, and overall I'm very happy with it.

I use it on some rather beefy machines, but most of the CI time is not spent compiling, it is spent instead on setting up the environment.

Are there any tips/tricks to speed up this startup time? I know stuff like ensuring that artifacts are not passed in if not needed can help a lot, but it seems that most of the execution time is simply spent waiting for docker to spin up a container.

  • danpalmer 3 years ago

    The short answer is "do as little as possible". What this means in practice is breaking down every step of CI, figuring out the dependencies for that step, and then ordering the graph of dependencies such that you start as much as possible as early as possible. This process also usually shows you where things are slow and what the critical path is.

    Unfortunately, doing this in most CI services is actually quite difficult. It usually means a complex graph of execution, complex cache usage, and being careful to not re-generate artifacts you don't need to.

    In my experience, building this, at a level of reliability necessary for a team of more than a few devs, is hard. Jenkins can do it reliably, but doing it fast is hard because the caching primitives are poor. Circle and Gitlab can do it quickly, but the execution dependency primitives aren't great and the caches can be unreliable. Circle also has _terrible_ network speeds, so doing too much caching slows down builds. GitHub Actions is pretty good for all of this, but it's still a ton of work.

    The best answer is to use a build system or CI system that is modelled in a better way. Things like Bazel essentially manage this graph for you in a very smart way, but they only really work when you have a CI system designed to run Bazel jobs, and there aren't many of these that I've seen. It's a huge paradigm shift, and requires quite a lot of dev work to make happen.
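
    To make the "graph of execution" concrete, here's a rough sketch of what that ordering looks like in GitLab CI using the `needs:` keyword (job names and scripts are made up):

    ```yaml
    # .gitlab-ci.yml sketch: jobs start as soon as their dependencies finish,
    # instead of waiting for whole stages to complete.
    stages: [build, test, package]

    build-deps:
      stage: build
      script: ./scripts/build_deps.sh
      artifacts:
        paths: [vendor/]

    lint:
      stage: test
      needs: []                      # no dependencies, starts immediately
      script: ./scripts/lint.sh

    unit-tests:
      stage: test
      needs: [build-deps]            # runs as soon as build-deps is done
      script: ./scripts/unit_tests.sh

    package:
      stage: package
      needs: [build-deps, unit-tests]
      script: ./scripts/package.sh
    ```

    GitHub Actions has the same idea via `needs:` on jobs; the hard part is usually the caching between the nodes of the graph, as described above.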

    • packetlost 3 years ago

      It's so surprising to me that this is such a poorly supported paradigm in commodity CI systems. Caching artifacts and identifying slow stages is like... super important for scaling CI for large enough orgs. We need better tools!

      • duped 3 years ago

        The more time you spend debugging this and the worse job you do at it, the more money they make.

        • mgkimsal 3 years ago

          While it sounds cynical... it doesn't strike me as 'wrong' entirely. It's a non-trivial problem, but until some service provides great tools to handle this, and makes the experience 10x better (to encourage more use/experimentation/etc), everyone will keep offering the same experience all around. If a service could automatically cut build times by, say, 70%, that's a lot of revenue they may lose from charging for the build time. They could raise the price, or hope that enough new people get onboard to make up the loss... ?

        • sleepychu 3 years ago

          That feels counterintuitive to me. I would probably use even more CI minutes if they had higher value.

      • nerdponx 3 years ago

        Maybe it's because CI service providers don't want to be responsible for a lot of cache storage.

        Given how lacking this feature is, maybe a CI vendor could offer it as a premium paid feature.

        • danpalmer 3 years ago

          CircleCI sort of do! They have something called Docker Layer Caching, which basically puts all the Docker layers from your previous build on the execution machine.

          The problem is that it's a) very slow to download those layers from their cache storage, and b) very expensive. It works out to costing ~20 minutes of build time.
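
          For reference, enabling DLC is just a flag on the job; a rough sketch with a generic convenience image:

          ```yaml
          # CircleCI .circleci/config.yml sketch
          version: 2.1
          jobs:
            build:
              docker:
                - image: cimg/base:stable
              steps:
                - checkout
                - setup_remote_docker:
                    docker_layer_caching: true   # the paid DLC feature described above
                - run: docker build -t myapp:ci .
          ```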

    • brightball 3 years ago

      I have found Gitlab and runners the best option here.

      • danpalmer 3 years ago

        The problem I had with GitLab was that the mechanisms for controlling dependencies between stages were fairly basic. They only added them in ~2020 I think, and they weren't well documented.

        Additionally, there's no cache guarantees between jobs within one execution. This means that you can't reliably cache an artifact in one job, and then share it with multiple downstream jobs. It mostly works, but it's hard to debug when it doesn't, especially if the cache artifact isn't versioned.

        GitLab is "fine", and has some nice usability features for basic pipelines, but it's definitely not doing anything better than the other major providers with respect to these problems.

        • brightball 3 years ago

          Dependency controls have improved quite a bit. They went through a couple of variations of this and the current solution is nice.

          I haven't ever run into an issue with artifact hand off to this point though. Maybe it's one of the more rare concerns, but it's not something I've experienced (fortunately). I imagine it would be a concern to debug though.

  • david_allison 3 years ago

    Set up the environment on the runner, not in the job: https://docs.gitlab.com/ee/ci/runners/configure_runners.html...

    At the very least, see if you can keep heavy dependencies on the local network rather than depending on the internet.
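
    Concretely, for the Docker executor this mostly means the runner's config.toml rather than the .gitlab-ci.yml. A rough sketch (the image and cache paths are just placeholders):

    ```toml
    # /etc/gitlab-runner/config.toml sketch
    [[runners]]
      executor = "docker"
      [runners.docker]
        image = "golang:1.20"
        # don't re-pull the job image if it's already on the runner host
        pull_policy = "if-not-present"
        # persistent host mounts so package/module caches survive between jobs
        volumes = ["/cache", "/srv/ci-cache/go-mod:/go/pkg/mod"]
    ```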

  • recfab 3 years ago

    I'm not sure how to speed up the spin-up-a-container time (at least not without more details), but I have two suggestions that may help mitigate it. Based on your wording ("waiting for docker to spin up a container"), the second one may not be relevant.

    ## 1. Do more in the job's script

    If you have multiple jobs that use (or could use) the same image, perhaps those jobs can be combined. It's definitely a tradeoff, and it depends on what you want from your pipeline. For example, normally you may have separate `build` and `test` jobs, but if they take, say (30s init + 5s work) + (30s init + 10s work), then combining them into a single job taking (30s init + 15 s work) _might_ be an acceptable trade-off. (These numbers are small enough that it probably isn't, but you get the idea.)

    ## 2. Pre-build the job's image

    If your job's script uses an off-the-shelf image, and has a lot of setup, consider building an image that already has that done, and using that as your job's image instead. For example, you might be using a `node` image, but your build requires pulling translations from a file in an S3 bucket, and so you need to install the AWS CLI to grab the translation file. Rather than including the installation of the AWS CLI in the script, build it into the image ahead of time.
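
    A minimal sketch of that pre-built image for the node + AWS CLI example (base image and install method are assumptions):

    ```dockerfile
    # Dockerfile.ci sketch: bake the slow, rarely-changing setup into the image
    FROM node:18

    RUN apt-get update \
        && apt-get install -y --no-install-recommends awscli \
        && rm -rf /var/lib/apt/lists/*
    ```

    The job then points its `image:` at the pre-baked image in your registry instead of installing the CLI in `before_script` on every run.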

    • dijit 3 years ago

      > If you have multiple jobs that use (or could use) the same image, perhaps those jobs can be combined. It's definitely a tradeoff, and it depends on what you want from your pipeline. For example, normally you may have separate `build` and `test` jobs, but if they take, say (30s init + 5s work) + (30s init + 10s work), then combining them into a single job taking (30s init + 15 s work) _might_ be an acceptable trade-off. (These numbers are small enough that it probably isn't, but you get the idea.)

      This is a good idea and something I will seriously consider

      I'm already doing #2, but I'm glad to see others come to the same conclusion as me. :D

  • hedora 3 years ago

    Make sure your docker build is being cached properly, and break infrequently running stuff into their own steps, then move them to the top of the docker file.

    Crucially: Make sure that the large layers say they are "cached" when you rebuild the container. Docker goes out of its way to make this difficult in CI environments. The fact that it works on your laptop doesn't mean that it will be able to cache the big layers in CI.

    Once you've done that, make sure that the CI machines are actually pulling the big layers from their local docker cache.

    30-90 seconds to pull docker images for each run of a golang project's CI environment is too high. You might look into using "go mod vendor" to download the dependencies early in the docker build, then using a symlink and "--mod=vendor" to tell the tests to use an out-of-tree vendor directory. (I haven't tested this; presumably go will follow symlinks...)
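
    A rough sketch of that layer ordering for a Go project (Go version and commands are assumptions; the point is that the dependency layer is only rebuilt when go.mod/go.sum change):

    ```dockerfile
    FROM golang:1.20
    WORKDIR /src

    # Infrequently-changing step first: this layer is reused as long as
    # go.mod / go.sum are unchanged.
    COPY go.mod go.sum ./
    RUN go mod download

    # Frequently-changing source last, so edits don't bust the layer above.
    COPY . .
    RUN go vet ./... && go test ./...
    ```

    On a fresh CI runner there is no local layer cache at all, so you typically also need `docker build --cache-from <previously pushed image>` (or BuildKit's registry cache) for those layers to actually be reused.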

  • denzil 3 years ago

    My usual strategy is to ensure that the lengthy parts are executed only once. So for example one of the lengthy parts is environment setup for me too. So what I did is to put as much as possible on the docker image I build and then I start tests from image mostly ready to run. Of course something similar can be done during runtime. If starting the software you test takes long time, you could set it up only once and run multiple tests without tearing down the setup. Of course this has disadvantage of having possibly tainted environment and there is risk of making the tests depend on previous state. On the other hand this could also help discover problems that are hidden by always running tests on clean slate, so it's a tradeoff. And I have to note that I mostly do integration testing, so the long parts are probably in different places than for unit testing.

  • charrondev 3 years ago

    I had similar problems with CircleCI and its docker executor. We recently switched to GitHub actions and the following led to huge improvements:

    - Much faster network speeds.
    - We no longer run on the docker executor. Instead we run on Ubuntu. These boot in a second or 2 pretty consistently.
    - The bulk of our test suite was able to be pulled out of docker entirely (a lot of jest and PHPUnit tests).
    - We have a bigger suite of E2E PHPUnit tests that we spin up a whole docker compose stack for. These are slower but still manageable.

    Parallelism is key in all of this too. Our backend test suite has a full execution time of something like 250 minutes, but we just split it over a bunch of small workers and the whole thing completes in about 8 minutes.
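
    The splitting can be as simple as a matrix of shards. A rough sketch of the jest part in GitHub Actions (shard count and versions are arbitrary; jest's built-in --shard flag needs Jest 28+, and PHPUnit would need a separate splitter):

    ```yaml
    # .github/workflows/tests.yml sketch: fan one suite out over 4 workers
    name: tests
    on: [pull_request]
    jobs:
      jest:
        runs-on: ubuntu-latest
        strategy:
          matrix:
            shard: [1, 2, 3, 4]
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with:
              node-version: 18
          - run: npm ci
          - run: npx jest --shard=${{ matrix.shard }}/4
    ```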

  • hinkley 3 years ago

    Pulling snapshots helps, particularly with slowdowns over time. Pulling deps is a problem that deserves its own initiatives.

    For me the controlling factor with build time and to a lesser extent production performance is to divorce visibility from vigilance. You can’t watch things 24/7 waiting to pounce on any little size or time regressions. You need to be able to audit periodically and narrow the problem to a commit or at least an hour in a day when the problem happened. Otherwise nobody will be bothered to look and it’s just a tragedy of the commons.

    Graphs work well. Build time, test count, slow test count, artifact sizes, and so on.

    • growse 3 years ago

      > Pulling snapshots helps, particularly with slowdowns over time. Pulling deps is a problem that deserves its own initiatives.

      I just had some success running android builds on a self-hosted github runner. One of the big setting-up stages was having sdkmanager pull down large dependencies (SDK, emulator images etc.) on startup.

      Forcing sdkmanager into http_only mode and pointing it at a properly-configured squid took a large percentage off the build time.

      Similar story for the gradle build, where running a remote gradle cache node locally to the job means gradle steps get automatically cached without any magic CI pipeline steps.

      • hinkley 3 years ago

        dep caches are great until they aren't. We had to turn them off because we had some weird thing going on with 2 artifactory instances due to M&A issues and we were getting weird behaviors from it.

        Being able to pull deltas sure is fast, but it also violates some of the principles of CI. Artifactory or similar tools can split the difference. As long as nobody is doing something dumb that is.

  • colechristensen 3 years ago

    How long are we talking? Are the containers getting pulled from somewhere across the internet and it’s a network bottleneck?

    • DiggyJohnson 3 years ago

      This is what I’m working on next week. The majority of time is spent building the first n steps of our Dockerfiles (which aren’t cached in our test/deploy pipeline).

      I’ll be baking some images with dependencies included, so the only stuff in the updated Dockerfile will be pulling the pre-baked images from our registry and commands to build and run our app code.

      • mikepurvis 3 years ago

        We do the pre-baked dependency images too, and it's definitely workable, but I feel like it's a lot of overhead maintaining those— you have to build and distribute and lifecycle them, and it's extra jobs to monitor. Plus you now have an implicit dependency between jobs that adds complication to black-start scenarios. I wish tools like GitLab CI had more automated workflows for being able to automatically manage those intermediate containers, eg:

        - Here's a setup stage, the resulting state of which is to be saved as a container image and used as the starting point for any follow-on stages that declare `image: <project/job/whatever>`

        - Various circumstances should trigger me to be rebuilt: weekly on Saturday night, whenever repo X has a new tag created, whenever I'm manually run, whenever a special parallel "check" stage takes more than X minutes to complete, etc.

        Ultimately, I think the necessity for all this kind of thing really just exposes how weak the layered container image model is— something like Nixery that can deliver ad-hoc environments in a truly composable way is ultimately a much better fit for this type of use-case, but it has its own issues with maturity.
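
        Lacking that, the triggers described above usually end up hand-rolled as `rules:` on the image-building job. A rough sketch (file names, tags and the schedule itself are assumptions):

        ```yaml
        # .gitlab-ci.yml sketch: job that (re)builds the intermediate deps image
        build-deps-image:
          image: docker:24
          services: [docker:24-dind]
          rules:
            - if: '$CI_PIPELINE_SOURCE == "schedule"'   # e.g. a weekly Saturday-night schedule
            - if: '$CI_PIPELINE_SOURCE == "web"'        # manually triggered from the UI
            - changes:
                - Dockerfile.deps                       # the recipe itself changed
          script:
            - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
            - docker build -f Dockerfile.deps -t "$CI_REGISTRY_IMAGE/deps:latest" .
            - docker push "$CI_REGISTRY_IMAGE/deps:latest"
        ```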

    • dijit 3 years ago

      it's a 30-90s 'setup' for a compile that usually lasts about 10-30s.

      The setup time is fairly constant even for very quick jobs.

      For longer jobs where it takes less of a percentage of the total time it's not a bother, like when we run integration tests for a few minutes.

  • kyriakos 3 years ago

    I noticed that many times using cache makes gitlab ci take longer than just fetching node dependencies again via npm install.
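
    For what it's worth, the commonly recommended setup is to cache npm's download cache (not node_modules) and key it on the lockfile, so the cache is only fetched when it can actually be reused; whether that beats a plain install still depends on how fast the cache storage is. A rough sketch:

    ```yaml
    # .gitlab-ci.yml fragment: lockfile-keyed npm cache
    test:
      image: node:18
      cache:
        key:
          files: [package-lock.json]
        paths: [.npm/]
      script:
        - npm ci --cache .npm --prefer-offline
        - npm test
    ```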

  • nicoburns 3 years ago

    If you're mostly just compiling Go then why not cut out docker entirely? Just run your CI on bare metal.

    • dijit 3 years ago

      Main reason is because honestly I'm too scared of a dirty filesystem wrecking builds.

      • actionfromafar 3 years ago

        So, clean it, run in a tmpfs or whatever.

        • kvdveer 3 years ago

          Docker images in CI are typically just that: a tmpfs with a chroot and some network isolation. If you have it working once, you're pretty much guaranteed it will work again.

          Doing this on bare metal with a script to clean the FS, ensure correct dependencies and maybe isolate the network (for safe unit tests) means you're just reimplementing a non-trivial portion of docker or other container tools. Maybe that's worth it, but without justification, it just smells like risky NIH to me.

  • buttersbrian 3 years ago

    How do you setup or provision your environment? And what does this environment look like?

    • dijit 3 years ago

      Docker on Debian 11 bare metal with gitlab-ci installed the "blessed" way (by adding gitlabs apt repos).

      No optimisation to the base OS other than mounting /var/lib/docker on a RAID0 array with noatime on the volume, and CPU mitigations disabled on the host.

      Compilation is mostly go binaries (with the normal stuff like go vet/go test).

      Rarely it will do other things like commit-lint (javascript) or KICS/SNYK scanning.

      the machines themselves are Dual EPYC 7313 w/ 256G DDR4.

      • hdjjhhvvhga 3 years ago

        Where do you keep your bare metal machines, if I may ask? I wanted to do a similar setup a while ago (building/testing on Hetzner bare metal, deployments and the rest on AWS) but due to Amazon's pricing policy the cost of traffic would be enormous.

        • tomislavpet 3 years ago

          Not the person you asked, but we have something similar to what you described - our GitLab is self-hosted on Hetzner cloud and the build machine is a beefy bare metal machine in the same datacenter (plus an additional Mac in our office just for iOS). Built images are stored in GitLab repository and deployed from there.

          We deploy to AWS (among others) and had no issues regarding traffic price since it's ingress into AWS.

        • dijit 3 years ago

          We use Gitlab SaaS with our own CI runners on-prem (I have a small server room in the office).

          I push our artefacts back to gitlab, and we deploy from gitlab's registry to google cloud (GKE).

teach 3 years ago

Am I a curmudgeon? Not to take away from this cool writeup, but I'm familiar with a few CI/CD tools, particularly QuickBuild, Jenkins and Spinnaker. So this jumped out at me:

> Our CI process was pretty standard: Every commit in an MR triggered a GitLab Pipeline, which consisted of several jobs.

me: nodding silently

> Those jobs would run in an auto-scaling Kubernetes cluster with up to 21 nodes

me: what the actual deuce?

Is this really "pretty standard"?

  • amenod 3 years ago

    It's not "pretty standard", but we're working towards it and it looks like a pretty great solution. Our problem is that CI job runners sleep most of the day (low number of commits), but then you have spikes where the jobs are waiting on each other and times get really long. Autoscaling sounds great - you can have lots of runners when you need them and only a single one (or maybe even none? not sure yet) otherwise.

  • tempest_ 3 years ago

    Only if your company is "Cloud Native" and thus real concerned about paying for over provisioned compute.

    Gitlab makes it pretty easy to just toss a ci runner process on a vm or a physical box. You can get real far with a couple rack servers and some xeons for < $1000. You do have to over-provision if your workload is not very consistent (and of course pay for the power and rack space, and someone to mind them from time to time).

  • brightball 3 years ago

    If you have it, it’s awesome. You can get parallel execution of so much, spin up environments for each branch for QA and dynamic scans.

    IMO it’s the optimal use case for K8s

    • hedora 3 years ago

      You have to be at a certain scale for k8s to make sense in a CI environment. In particular, it needs to be economical to spend 10-50% of a full time employee to maintain the Kubernetes cluster (even if it is some managed thing like EKS).

      Also, the duty cycle on the 21 nodes needs to be low enough to justify the complexity over just buying 21 computers (or getting annual pricing on 21 VMs). You could use spot instances for the EKS nodes, but then PRs will randomly fail because their instances disappear. That wastes developer salary money and productivity.

      Assuming you have a ventilated room you don't care about, you could run 21 desktop towers off of ~ two-four 120V circuits. (Or buy a rack and pay ~ 2x as much for the hardware.) 21 build hosts would cost ~$21-42K. Power is probably averaging 50W per machine (they are probably mostly idle even when running tests, since they have to download stuff.) That's about 720KWh per month. US average electrical pricing is $0.20 / kWh; punitive California rates are about $0.40. So, in the punitive case, that's $288 / month.

      Running 21 machines probably requires as much annoying maintenance work as EKS, though the maintenance includes swapping bad hardware, fiddling with ethernet cables, and wearing ear protection (if a rack is involved) instead of debugging piles of yaml and AWS roles, optimizing to stay in budget, etc, etc.

      • brightball 3 years ago

        I actually find that K8s in a CI environment is a better use case than in production environments.

        In production, you're going to have clearly defined deployment rules, traffic patterns, scalability approaches where the code for each service probably belongs on its own VM rather than sharing cluster resources.

        In a non-production environment, you can feel much more free to overload what's deployed on a node because it's not seeing production traffic. You could have a single k8s instance with 30 different environments (each with their own web, worker, databases, redis, etc) for 30 different branches that represent the issues moving through the pipeline from 5 developers. In prod, every piece of that would be better represented by its own VM.

      • fragmede 3 years ago

        If you've only got 21 worker machines that probably works out, but if you've got 210, or 2,100 of them to spin up/down, I'd rather be dealing with yaml config (even though I hate yaml config) rather than get PXE booting working for an on-prem cluster.

  • fragmede 3 years ago

    Using kube for that is pretty fancy if you aren't already using kube elsewhere, but you don't just have a single Jenkins worker, you have multiple. All that kube is doing is giving a very convenient lever for autoscaling, but other platforms give you this lever as well. If you're not scaling Jenkins workers (or whatever) to match demand, even manually (spin workers down on weekends), you're wasting developer time, compute resources, or both.

    Someone's got a new project for Q2 if they aren't doing this already - it's a pretty easy sell if you calculate out the time savings for developers during busy time of day + savings on spinning down compute resources in the middle of the night/weekends, and being able to put "I saved the company $X in idle compute and saved developers Y hours per day" on your yearly performance review looks pretty good.

  • actionfromafar 3 years ago

    Yeah... I don't know. We don't, but we have talked about it though, because the Azure pipelines are. just. so. slow. On the other hand, more complexity and Rube Goldberg-machinery is not something we long for.

    I have started tinkering with Fastbuild, and preliminary testing makes it seem either too good to be true, or the best thing since sliced bread. I'm sure there are drawbacks somewhere, but it's really fast.

    Then again, a big chunk of our pipelines is not actually the compilation, but stuff like downloading nuget packages, uploading artifacts and stuff, all of which are. very. very. slow.

  • alrocarOP 3 years ago

    Thanks for this comment. I guess sometimes we (developers) take things for granted when they are not, and that puts a lot of pressure on us instead of letting us celebrate our wins.

    I would now change "pretty standard" to "we didn't reinvent the wheel" xD :pray:. In the end, I meant that we use existing tools and "just" put them together.

  • recfab 3 years ago

    Yes. In fact, it's standard enough that it's a little odd that they specify "autoscaling" and "21 nodes", when they could have simply said "we use the kubernetes executor".

    Even if you are using SaaS GitLab, there are still good reasons to have custom runners, and kube is one option for running them.

imiric 3 years ago

> We noticed a strong correlation between crazy utilization spikes and CI failure rates.

This is interesting, and is something I've also suspected on many CI systems that offer free public runners (CircleCI, GitHub Actions, etc.).

For seemingly no reason at all, tests were very flaky and unstable in CI, which couldn't be reproduced on local machines. I tried everything from resource-limited containers, to identically spec'd VMs, and never was able to reproduce certain failures. This made issues very hard to troubleshoot and fix.

Of course, you might say that this unstable environment surfaced race conditions in our tests or product, and that's true, but it's incredibly frustrating to have random failures that are impossible to reproduce locally, and having to wait for the long experiment-push-wait for CI development loop.

I suspect this is caused by over provisioning of the underlying hardware, where many VMs are competing for the same resources. This seems quite frequent on Azure (GH Actions).

In the article's case they patched it by making their environment more stable, which is a solution we can't apply on public runners, but I'd caution them that they're only patching the issue, and not really fixing the root cause. The flakiness still exists in their code, and is just not visible when the system is not under stress, but will surface again when you least want it to, possibly in production.

  • alrocarOP 3 years ago

    Yep, default runners in most CI platforms share resources, so they are prone to producing flakiness (depending on your setup).

    That was one of the reasons we ended up setting up our own runners. Didn't mention it in the post, but we use spot VM instances.

CottonMcKnight 3 years ago

TL;DR: how a data company uses their own product.

berkle4455 3 years ago

CI has been such a productivity killer. You don’t need it. Stick with CD only and you can ship.

  • mailund 3 years ago

    Interesting, how do you define CI in that case? IIRC, CI was originally defined as integrating continuously (i.e. daily or more frequently) and CD is delivering said code continuously. How does CI hurt productivity and how do you do CD without CI?

  • alrocarOP 3 years ago

    I'm really interested in different points of view.

    I guess you mean something like trunk-based development? But still, some sort of CI happens, maybe locally.

    I've never worked any way other than with a local/remote CI pipeline; that's why I'm curious.

  • jdkoeck 3 years ago

    CI is a prerequisite for CD.

    • nerdponx 3 years ago

      I've done smaller projects before where "CI" only consists of merging to trunk/main/master, while testing, linting, etc. is covered by code review and the honor system. I wouldn't advocate this for something business critical that a lot of developers collaborate on, but you can do CD with only trivial CI.

    • berkle4455 3 years ago

      It’s literally not.

      • teach 3 years ago

        I'm tempted to just downvote you and move on with my life but I'm genuinely curious.

        Given that it's meaningless to Deploy something without Integrating the changes, what do you _actually_ mean by "You don’t need [CI]. Stick with CD only."

        Are you just talking about testing the changes? Help us out here.

        • twic 3 years ago

          What if people push unmerged feature branches, and then the automation just deploys the most recent one?

          I'm not advocating this, just to be clear.

        • xboxnolifes 3 years ago

          I guess you could just have an unchanging project redeploy itself every hour or so.

      • nerdponx 3 years ago

        It might be required in that it's impossible to deliver continuously if changes are not integrated continuously, for some definition of "integrated".
