Ask HN: How do you keep track of releases/deployments of dozens of micro-services?
Our apps are made up of 5-15 (micro)services each. I'm not sure this approach would scale to hundreds of services managed by different teams.
We store the source code for all services in subfolders of the same monorepo (one repo <-> one app). Whenever a change in any service is merged to master, the CI rebuilds _all_ the services and pushes new Docker images to our Docker registry. Thanks to Docker layers, if the source code for a service hasn't changed, the build for that service is super-quick: it just adds a new Docker tag to the _existing_ Docker image.
Then we use the Git commit hash to deploy _all_ services to the desired environment. Again, thanks to Docker layers, containers that haven't changed from the previous tag are recreated instantly because they are cached.
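A minimal sketch of what that CI step can look like, assuming a shell-based pipeline (registry host, service names and paths are placeholders, not our actual setup):

    # build and push every service, tagged with the commit hash
    SHA=$(git rev-parse HEAD)
    for svc in service_a service_b; do
      # unchanged services hit the Docker layer cache, so this is near-instant
      docker build -t "registry.example.com/$svc:$SHA" "services/$svc"
      docker push "registry.example.com/$svc:$SHA"
    done
    # deploying is then just "run every image at tag $SHA" in the target environment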
From the CI you can check the latest commit hash that was deployed to any environment, and you can use that commit hash to reproduce that environment locally.
Things that I like:
- the Git commit hash is the single thing you need to know to describe a deployment, and it maps nicely to the state of the codebase at that Git commit.
Things that do not always work:
- if you don't write the Dockerfile in the right way, you end up rebuilding services that haven't changed --> build time increases
- containers for services that haven't changed get stopped and recreated --> short unnecessary downtime, unless you do blue-green
At work we also use a monorepo that consists of subfolders of services. We use Kubernetes and we store the config files for each service (and each environment: dev/staging/prod) inside the same repo. The k8s config files are placed in directories following this pattern: `<service>/deployment/<environment>/<service>.yaml`.
To avoid rebuilding all services on every commit, we use Bazel to help determine which services need to be rebuilt. Note that we don't use Bazel as a build system, just as a tool to see which services have changed -- essentially we only use the `filegroup` Bazel rule. After a push to the git repo, we basically do (1) `git diff --name-only <before> <after>` to get the changed files, (2) run `bazel query 'rdeps(..., set(list of changed files))'` at both the `<before>` and `<after>` commits, and (3) combine the results of the `bazel query` calls and look for the affected services.
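Roughly, and hedging a bit because the real script has more plumbing, the detection step looks something like this (file names and target patterns are illustrative):

    # files touched by the push (may need converting to Bazel labels first)
    CHANGED=$(git diff --name-only "$BEFORE" "$AFTER" | tr '\n' ' ')
    # ask Bazel, at both commits, what depends on those files
    git checkout -q "$BEFORE" && bazel query "rdeps(//..., set($CHANGED))" > before.txt
    git checkout -q "$AFTER"  && bazel query "rdeps(//..., set($CHANGED))" > after.txt
    # the union of the two results gives the affected services to rebuild
    sort -u before.txt after.txt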
Once we know which services need to be rebuilt, we trigger the Jenkins jobs for those services. Each service has its own Jenkins job and Jenkinsfile (we use Pipeline). Here we also package the application as a Docker image and push it to the internal registry.
We keep track of what is released using a "production" branch for each service. Once we have a build to release, we (1) create a "release candidate" branch from the commit of the build, (2) update the k8s config file, (3) apply the k8s config, and (4) merge this branch into the production branch of the service if everything is OK. Then we merge the production branch back into the master branch.
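A hedged sketch of that flow for a single service (branch names, paths and the editor step are made up; the real steps are wrapped in tooling):

    git checkout -b rc/myservice "$BUILD_COMMIT"
    # bump the image tag in the k8s config for the target environment
    $EDITOR myservice/deployment/prod/myservice.yaml
    kubectl apply -f myservice/deployment/prod/myservice.yaml
    # if the rollout is healthy, record the release on the production branch
    git checkout production/myservice && git merge rc/myservice
    # and finally merge production back into master
    git checkout master && git merge production/myservice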
We follow a very similar pattern and we are at over 150 micro services right now (AWS Lambda).
A couple of things we do differently, since we are building and then deploying to AWS:
- Build only on dedicated deployment branches (beta, qa, preview, prod)
- Build all functions (transpile, yarn, lint, etc) on every merge into the branch, but only deploy functions with different checksums (saves on API calls to AWS) -- a sketch of that check follows this list
- We cache node_modules, but otherwise don't have any special build requirements and babel takes care of targeting node6.10 for Lambda
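One way to do that checksum check (a hedged sketch comparing against Lambda's CodeSha256, which may or may not be exactly how they do it; function and file names are placeholders):

    # skip the update if the deployed bundle already matches the local one
    LOCAL_SHA=$(openssl dgst -sha256 -binary function.zip | openssl base64)
    REMOTE_SHA=$(aws lambda get-function --function-name "$FN" \
      --query 'Configuration.CodeSha256' --output text)
    if [ "$LOCAL_SHA" != "$REMOTE_SHA" ]; then
      aws lambda update-function-code --function-name "$FN" --zip-file fileb://function.zip
    fi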
Total build time is between 8-13 minutes. There are some things we could do to speed up installs that we haven't done yet, because it's not an issue yet, but here's a short list of things to note:
- Each function has its own package.json for its own packages. We maintain a list of npm packages that we download into a single folder first (one that doesn't get deployed) so that yarn can use those files from cache. We will eventually switch to an offline install for each function, which essentially just copies the package folder and sets up anything it needs.
- We have a tarball package that includes all of our shared code / config files. Yarn seems to always want to download this file, regardless of whether we pre-download it.
- We deploy a single API endpoint for all of our micro-services through API Gateway, which cuts down on the time to deploy since API Gateway has a pretty hard throttle. This means we create a deployment on API Gateway on every merge. We have one APIG for each environment.
Just to be clear, all of your functions exist inside one mono-repo, correct? You don't use git submodules at all?
Looks like a pretty solid build process. Thanks for the insight!
Yes, we have all of our functions in a single mono repo, broken into projects, and then folders for each function, similar to this:
- src/project1/function1/
- src/project1/function2/
- src/project2/function1/
- src/project3/function1/
- src/project3/function2/
- src/project3/function3/
Deploying the functions is done by project, so we deploy all of one project, then move to the next, and so on and so forth.
That's a great model. Do you use CloudFormation for each deployment? If so, have you thought of creating a single CloudFormation template for the whole deployment so you can do the entire deployment in one stack update?
Have you encountered any issues to watch out for when using only one APIG for each environment (150 micro-services)? Have you encountered any downsides to doing this versus 1 micro-service to 1 APIG? I'm also running into the Gateway throttle limits, and I think deploying many micro-services (like you have done) to 1 APIG is the best solution.
We don't use CloudFormation because, honestly, it sucks. Its hard limits are a pain in the ass to get around (with 150 Lambda functions we need hundreds of resources, so that means nested stacks, which just suck), and having it manage the API Gateway just doesn't do what we want.
We have a custom script to deploy our own API Gateway using the AWS SDK and we generate a swagger file from simple json config files.
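For reference, the rough shape such a script can take with the AWS CLI instead of the SDK (a sketch, not their actual tooling; the API id, stage and file names are placeholders):

    # import the generated swagger definition into the existing REST API
    aws apigateway put-rest-api --rest-api-id "$API_ID" \
      --mode overwrite --body fileb://swagger.json
    # then create a deployment to make the change live on a stage
    aws apigateway create-deployment --rest-api-id "$API_ID" --stage-name prod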
As for API Gateway issues, so far there are a few things we have to watch out for:
- All lambda endpoints through APIG are lambda proxy type. This means we can have a framework handle standard request / response stuff. The downside is that we can't support binary endpoints easily because they haven't fixed that issue yet.
- HTTP proxy pass-through endpoints have to be added to the swagger somehow before we deploy. This is a little annoying, but not a huge issue
- Merge vs Override for deployments. We merge in beta, and override in other environments. This allows us to keep endpoints exactly as they are, but allow flexible testing in beta
1 APIG for 1 micro-service isn't great IMO at scale, since we run all our endpoints under one domain and mapping all of them would be a pain.
That's very valuable advice. Thank you. I've been following the serverless.com model of 1 APIG to 1 lambda, but that quickly puts you over the AWS limits when trying to manage hundreds or thousands of micro-services.
Yup yup, I went down that path and converted our then very basic deployment process to use serverless and instantly hit hard limits.
> Whenever a change in any service is merged to master, the CI rebuilds _all_ the services and pushes new Docker images to our Docker registry.
Why are you rebuilding _all_ the services? Wouldn't it make sense to just rebuild the ones that have changes? You're now rebuilding perfectly working services without any new changes just because some other service changed, or am I misunderstanding something here?
Because we want to make sure that in the Docker registry we have _all_ services tagged with the latest commit.
For example you might have a Git history like this:
* 89abcde Fix bug in service_b
* 1234567 Initial commit including service_a and service_b
When 89abcde is pushed, the CI rebuilds both service_a and service_b, so you can simply "deploy 89abcde" and there is always only one hash for all services, which is also conveniently the same hash as the corresponding Git commit.
The trick to avoid rebuilding perfectly working services is to use Docker layer caching so that when you build service_a (that hasn't changed) Docker skips all steps and simply adds the new tag to the _existing_ Docker image. The second build for service_a should take about 1 second.
In our Docker registry we end up with:
service_a:1234567
service_a:89abcde
service_b:1234567
service_b:89abcde
But the two service_a Docker images are _the same image_, with two different tags.
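You can see this by asking Docker for the image ids behind the two tags -- they come back identical (the output here is just illustrative):

    docker inspect --format '{{.Id}}' service_a:1234567 service_a:89abcde
    # sha256:3f4e...   (the same id is printed for both tags)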
Why? Microservices are supposed to be truly independent.
For ease of deployment and to solve the problem of "what version of service_b is compatible with version x of service_a"?
IMHO this makes sense if the microservices are developed by the same team. If we're talking about services developed and managed by different teams... maybe it's not a good idea.
My guess is that it is because of the mono-repo. Since it would take some work to figure out what changed and what to build, they just did it the easy way and re-build everything :-)
> We store the source code for all services in subfolders of the same monorepo (one repo <-> one app).
So I'm curious: does each service instance have its own server, or do you have multiple services on one server instance?
I have some experience working with microservices. I saw the clear business benefits of being able to map design domain boundaries to repos and specific teams, and to let those teams be able to control their deployments while minimizing external dependencies.
But we seemed to be paying a lot in network chattiness, slow site response times, and networking costs. I'm wondering if we could have minimized those costs by sticking some of those microservices on the same server instance. Not really change service boundaries or interfaces, but change the methods that the microservice interfaces use to communicate.
Quick question: Does docker, or any other higher level service, let you "tag" images? Ideally, you could build only changed stuff, and use that sha to tag every image. That way you still get the benefits of one hash, and that hash representing the state of the codebase as well, while cutting down on build time.
Yes, but the real win is in a different way than I think you're describing. With Docker, containers are built in a layered fashion with each 'step' of the build creating a new layer (think version control hashes). The benefit here is twofold:
First - If your change to the container is near the end of the build process (see earlier comment about smart container design), then the rebuild will only change the final few hashes and Docker is smart enough to not rebuild earlier hashes.
Second - Hashes are global, so if you have multiple containers that start with the same base (say, Alpine Linux + Python + NPM + etc.), Docker will share the existing hashed layers. This means a much smaller distribution payload.
To (what I think is) your original question - you can tag the 'final' container itself. Tagging it with the Git hash is one way to get exactly what you're talking about.
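Concretely, assuming an image called myapp (the names are just for illustration):

    docker build -t myapp:latest .
    docker tag myapp:latest "myapp:$(git rev-parse --short HEAD)"
    docker push "myapp:$(git rev-parse --short HEAD)"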
How long do your deployments take on average?
Depends on many factors... It can vary from 5 to 30 minutes, from the moment someone presses "merge" on a pull request and the moment that change is live on a test environment. The average is probably around 10-15 minutes.
The builds for all services happen in parallel, so the longest one determines the total time. Big Scala services take much longer than small React frontends. We cache both Maven and NPM modules from previous builds.
Ideally, if the pull request only modified a React component and didn't touch any Scala files, no Scala build is triggered because Docker finds a cached layer and skips the "sbt compile" step. To be honest, we are still working to make sure this always happens; we still trigger unnecessary sbt compiles because the Docker cache isn't always used correctly.
At the Guardian we use https://github.com/guardian/riff-raff
It takes a build from your build system (typically TeamCity, but not exclusively), deploys it, and records the deployment.
You can then check later what's currently deployed, or what was deployed at some point in time in order to match it with logs etc.
Not sure how usable it would be outside of our company though.
We have separate repos for each service, and use https://gocd.org/ to build, test and deploy each separately. But, you could also configure it to only trigger builds from changes in certain directories. There is a single pipeline template from which all pipelines are instantiated.
Independent deployments are one of the key advantages of microservices. If you don't use that feature, why use microservices at all? Just for scalability? Or because it was the default choice?
My experience with micro-services is code-bases that have prematurely adopted the pattern. Based on this, my advice is as follows...
You can deploy the whole platform and/or refactor to a monolith, and maintain one change log, which is simple.
That however has its own downsides, so you should find a balance. If you're having trouble keeping track, perhaps re-organize. I read in one HN article that Amazon had 7k employees before they adopted microservices. The benefits have to outweigh the costs. Sometimes the solution to the problem is taking a step back. Without more details it's hard to say.
So basically one option is to refactor [to a monolith] and re-evaluate the split so that you no longer have this problem. Just throw each repo in a sub-folder, make that your new mono-repo, and go from there. It's worth an exploratory refactoring, but it's not a silver bullet.
"Amazon had 7k employees before they adopted microservices"
Sounds like the services were no longer 'micro' :)
"micro" doesn't refer to level of usage, but level of domain responsibility.
There's something called 'integration testing' that has to be done before the final build, which clearly flags any compatibility issues between components.
Every component comes with a major/minor release number, which indicates the nature of the change that has gone in. For example, the major release number is incremented for a change that introduces a new feature/interface; minor release numbers are reserved for bug fixes/optimizations that are more internal to the component.
The build manager can go through the list of all the delivered fixes and cherry pick the few which can go to the final build.
In the company I worked for, they had their own CI/CD system which tracked information about each service and the systems it had to be deployed onto. Once it was all configured, it was basically button pushes. The system also tracked feedback after deployment to confirm whether the build was good or needed to be fixed -- if certain parameters were unhealthy, it basically did a rollback! There were also canary deployments, so code was deployed to only a portion of systems first to make sure it was indeed pushed correctly and worked. If not, they were rolled back!
We’ve been using our own setup for 4 years now. https://github.com/wballard/starphleet
We have 200 services, counting beta and live test variants. Most of the difficulties vanished once we had declarative versioned control of our service config in the ‘headquarters’ repository.
Not aware of anyone else using this approach.
In the past I've used a single repo with all the code which gets pushed everywhere, and each service only runs its portion of the code. No guesswork involved, but this may not work for a lot of setups, of course. That, and your graceful restart logic has to be slightly more involved.
At an old company, we wrote this, er "model driven orchestration framework for continuous deployment":
https://github.com/tim-group/orc
Basically, there's a Git repo with files in that specify the desired versions and states of your apps in each environment (the "configuration management database").
The tool has a loop which converges an environment on what is written in the file. It thinks of an app instance as being on a particular version (old or new), started or stopped (up or down), and in or out of the load balancer pool, and knows which transitions are allowed, e.g.:
(old, up, in) -> (old, up, out) - ok
(old, up, out) -> (old, up, in) - no! don't put the old version in the pool!
(old, up, out) -> (old, down, out) - ok
(old, up, in) -> (old, down, in) - no! don't kill an app that's in the pool!
(old, down, out) -> (new, down, out) - ok
(old, up, out) -> (new, up, out) - no! don't upgrade an app while it's running!
Based on those rules, it plans a series of transitions from the current state to the desired state. You can model the state space as a cube, where the three axes correspond to the three aspects of the state, vertices are states, and edges are transitions, some allowed, some not. Planning the transitions is then route-finding across the cube. When I realised this, I made a little origami cube to illustrate it, and started waving it at everyone. My colleagues thought I'd gone mad.

You need one non-cubic rule: there must be at least one instance in the load balancer at any time. In practice, you can just run the loop against each instance serially, so that you only ever bring down one at a time.
This process is safe, because if the tool dies, it can just start the loop again, look at the current state, and plan again. It's also safe to run at any time - if the environment is in the desired state, it's a no-op, and if it isn't, it gets repaired.
To upgrade an environment, you just change what's in the file, and run the loop.
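A very rough sketch of what one pass of that loop does per instance (the helper names are invented for illustration, not orc's actual code):

    for instance in $INSTANCES; do                    # serial, so the pool never empties
      remove_from_pool "$instance"                    # (old, up, in)    -> (old, up, out)
      stop_app "$instance"                            # (old, up, out)   -> (old, down, out)
      install_version "$instance" "$DESIRED_VERSION"  # old              -> new
      start_app "$instance"                           # (new, down, out) -> (new, up, out)
      health_check "$instance"
      add_to_pool "$instance"                         # (new, up, out)   -> (new, up, in)
    done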
We wrote https://github.com/kiwicom/crane which posts and updates a nicely formatted Slack message with the status of releases. It also posts release events to Datadog (in a version we're publishing soon) and to an API that records them in a Postgres DB we keep for analytics queries.
Full disclosure: I'm on the Spinnaker team
What do you mean by keep track? Do you want to be aware of deployments?
A Slack notification could do it. Or do you want to correlate deployments with other metrics?
In this case we instrument our deployments into our monitoring stack (influxdb/grafana) and use this as annotations for the rest of our monitoring.
We can also graph the number of releases per project on different aggregates.
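For example, a deploy event can be written to InfluxDB with a single HTTP call, which Grafana can then render as an annotation (the host, database, tags and field here are placeholders):

    curl -XPOST "http://influxdb:8086/write?db=deployments" \
      --data-binary 'deploy,service=service_a,env=prod commit="89abcde"'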
I think Slack notifications are really nice to see what's going on right now, but not so great for seeing the state of dozens of services, i.e. what version is deployed to what environment.
Then there is the issue of linking the Git release/tag with the corresponding changes, say from a ticketing system such as Jira. That can be helpful to communicate changes to other people within the organization and to users.
How do you define dependencies when releasing new versions of a service? That's likely to come up at some point when you have non-trivial changes to services.
> I think Slack notifications are really nice to see what's going on right now, but not so great for seeing the state of dozens of services
Completely agree, that's why we instrument our releases so we can easily see what's deployed by service and environment.
> Then there is the issue of linking the Git release/tag with the corresponding changes, say from a ticketing system such as Jira. That can be helpful to communicate changes to other people within the organization and to users.
Each commit is related to a ticket, which helps generate a changelog. We enforce a lot of things in each of our releases. We have an internal release tool heavily inspired by Shipit from Shopify. We have the concept of soft/hard checkers to make sure it won't break, or that you're aware of what could break with the current diff.
> How do you define dependencies when releasing new versions of a service? That's likely to come up at some point when you have non-trivial changes to services.
As I said, we instrument our releases and can easily track how changes affect our performance/bugs.
We also try hard not to release non-trivial changes in one big release, by doing things like releasing part of the changes behind a feature flipper first, or routing only part of the traffic to the new code path, etc.
That said, we don't have dozens of different services deployed and we're still a relatively small team (~20), so I'm pretty sure I don't have the full picture just yet :)
Thanks for adding more color to your original answer.
I like that you enforce the commit/ticket relationship. Is this purely an agreed process, or do you use other measures to keep things consistent? E.g. we typically add the ticket ref to each commit, but at times that gets omitted.
Also, I think that (internal) release tool is something crucial as the team grows. Will check shipit a bit further.
Would you mind expanding a bit on the things you enforce for each of your releases?
Sure my pleasure.
> I like that you enforce the commit/ticket relationship. Is this purely an agreed process, or do you use other measures to keep things consistent? E.g. we typically add the ticket ref to each commit, but at times that gets omitted.
We're not enforcing it, but we might in the future if the team grows and this gets out of hand. At the moment we're just reminding people that they should, and it works great so far.
> Would you mind expanding a bit on the things you enforce for each of your releases?
It's still early, but so far we check (a toy version of one of these checks is sketched after the list):
- it's not Friday afternoon; we want to avoid issues on the weekend as much as possible
- it's not outside office hours - we're all still in the same time zone
- there's no lock (we can lock the release in case something goes wrong)
- there's no schema migration. If there is, we remind you how to safely migrate a schema and who to ping if you have a doubt (usually it should have been caught at the PR review)
- there's someone from the ops/core team around (connected on slack)
- there are no missing translations for our main languages (French/English)
- plus we do a few sanity checks, like checking that our master staging is healthy (a release means promoting our master staging)
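The Friday-afternoon check, as a toy example in shell (the real checker is an internal tool, so this is just the idea):

    # warn if it's Friday afternoon (date +%u: Monday=1 ... Sunday=7)
    if [ "$(date +%u)" -eq 5 ] && [ "$(date +%H)" -ge 12 ]; then
      echo "soft check: it's Friday afternoon -- are you sure you want to release?"
    fi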
edit: I also forgot, but this is the Shipit I'm talking about: https://github.com/Shopify/shipit-engine
Thank you!
Service discovery contains all the versions and who should be directed at what.
We also store stats in the service discovery app so versions can be promoted to "production" for a customer once the account management team has reviewed and updated their internal training.
Got about 80+ services. One repo per service; each service has its own Kubernetes yaml that details the service's deploys to the cluster. K8s has a huge ecosystem for monitoring, versioning, health, autoscaling and discovery. On top of that, each repo has a separate Slack channel that receives notifications for repo changes, comments, deployments, container builds, Datadog monitoring events, etc. There are also core maintainers per repo to maintain consistency.
For anyone that has begun the microservice journey: Kubernetes can be intimidating, but it's well worth it. Our original microservice infrastructure was rolled way before k8s, and it's just night and day to work with now; the Kubernetes team has thought of just about every edge case.
Keep track as in having a version-controlled state of all revisions/versions deployed? That's something I'd be interested in solutions for too, especially in a Kubernetes environment with CI.
I could probably snapshot the Kubernetes state to have a trail I can use to roll back to a point in time. Alternatively, I've thought about having CI update manifests in an integration repo and deploy from there, so that every change to the cluster is reflected by a commit in that repository.
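The snapshot idea can be as simple as dumping the relevant objects and committing them somewhere (the resource list and commit message are just assumptions):

    # dump the current state and commit it, so every change leaves a trail
    kubectl get deployments,services,configmaps --all-namespaces -o yaml > cluster-state.yaml
    git add cluster-state.yaml && git commit -m "cluster state $(date -u +%FT%TZ)"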
We built a small internal service that receives updates from the build & deployment scripts we run, and presents us with an HTML page that shows what branch & commit of everything is deployed (along with the branch and commit of every dependency), where, when and by whom. It's totally insecure so it can be trivially spoofed, but it's our V1 for our fleet of golang services and it works well.
Have a CI/CD pipeline that does the following:
- unit tests each service
- all services fan in to a job that builds a giant tar file of source/code artefacts. This includes a metadata file that lists service versions or commit hashes
- this "candidate release" is deployed to a staging environment for automated system/acceptance testing
- it is then optionally deployed to prod once the acceptance tests have passed
https://github.com/ankyra/escape (disclaimer: I'm one of the authors)
We use Escape to version and deploy our microservices across environments, and even relate them to the underlying infrastructure code so we can deploy our whole platform as a single unit if need be.
Just to add: I've worked on pipelines like this for dozens of clients and I'd be happy to talk more in-depth about your options, as business requirements do tend to influence your delivery pipeline a lot. Email is in my profile if you're interested.
We use GitLab CI for pipelines, which is great. You can figure out when everything was last deployed, etc. We even built our own dashboard using the GitLab API that shows all the latest deploys, just so it's easier to track down what was recently deployed if we are investigating issues.
Did you open-source it? I'd love to use that too :)
No. It's currently very specific to our org. But I'll consider making it more generic and open-sourcing it now.
Maybe I'm misunderstanding the question, but you may want to have a look at Envoy: https://www.envoyproxy.io
We use Jenkins for releases and Kubernetes for deployments, if I understand the question correctly. We'd like to use something like Linkerd to simplify finding dependencies.
Master=stable and in prod, non-master branches=dev & staging. Jenkins deploys automatically on git commits.
In what technical environment? More info needed.