GitHub was down
githubstatus.com

Hi all - I'm the head of engineering at GitHub. Please accept my sincere apology for this downtime. The cause was a bad deploy (a db migration that changed an index). We were able to revert in about 30 minutes. This is slower than we'd like, and we'll be doing a full RCA of this outage.
For those who are interested, on the first Wednesday of each month, I write a blog post on our availability. Most recent one is here: https://github.blog/2021-03-03-github-availability-report-fe...
Thanks for taking the time to personally give a status update while things are on fire. I hope you and all the others who are dealing with this emergency will have an especially restful weekend.
There's a reason why deploying on a Friday is not really a good idea.
I was just griping on Twitter yesterday about how many developers won't immediately revert an update that causes downtime, but will actually spend time trying to solve the problem while Rome burns.
Thank you for not doing that.
Sometimes reverting is not reasonably possible--suppose you updated a database schema and clients immediately started filling it with new data that would have no home in any backup--you'd end up in another unanticipated state.
Any comment or insight you can share on the overall increase in downtime over the past few years?
Growth.
@keithba I have built a - private - GitHub action around https://github.com/sbdchd/squawk - for Postgres - that lints all our migration files on each PR. The action extracts the raw SQL from the codebase and passes it into squawk. It catches many exclusive-lock migrations or missing `index concurrently` clauses that would otherwise have been released to production, causing downtime or degraded service. Maybe something you should start doing.
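Not our actual action, but a minimal sketch of that kind of check; the db/migrations/ path and the base branch are placeholders for whatever your repo uses:

    #!/bin/sh
    # Lint Postgres migrations changed in this PR with squawk (sbdchd/squawk).
    # Paths and the base branch are placeholders -- adjust for your repo layout.
    set -eu
    files=$(git diff --name-only origin/main...HEAD -- 'db/migrations/*.sql')
    if [ -z "$files" ]; then
        echo "No migration changes to lint."
        exit 0
    fi
    # squawk exits non-zero when it flags problems (locking ALTERs, indexes
    # created without CONCURRENTLY, ...), which fails the PR check.
    # $files is intentionally unquoted so each path becomes its own argument.
    squawk $files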
GitHub uses MySQL, not Postgres. They built the best-in-class online schema change tool gh-ost [1], and have a custom declarative schema change execution system built around Skeema [2], which contains a wealth of linters [3].
Even so, it's always possible for an engineer to submit a schema change which is detrimental to performance. For example, dropping an important index, or changing it such that some necessary column is no longer present. Linters simply cannot catch some classes of these problems, as they're application/workload-specific. Usually they must be caught in code review, but people make mistakes and could approve a bad change.
Disclosure: I'm the author of Skeema, but have not worked for or with GitHub in any capacity.
[1] https://github.com/github/gh-ost
[2] https://github.blog/2020-02-14-automating-mysql-schema-migra...
Thanks, I didn't know about this. Indeed, nothing is foolproof.
I wonder if they track Github Status traffic volume as some sort of meta-indicator? Is it even viable?
I was futzing around with the description for a PR and hitting save wouldn't update it, yet clicking edit would show the text I expected to see.
Suspecting something was up I checked Github Status but it was green across the board. Assuming enough other people hit the same chain of events, could it provide a reliable enough indicator of an issue?
> I wonder if they track Github Status traffic volume as some sort of meta-indicator? Is it even viable?
Sure, the previous decent-sized company I worked at (~1000+ devs) had that exact metric available.
Visits to the status page generally, that is. Now whether you could actually correlate that to an increase in errors for a particular component, not so much ;)
I'm sure it's totally feasible, but it requires a certain amount of discipline to have consistent logging/metric standards across all your applications.
Even worse, some applications would return a shared error page but internally, I believe it was logged as a 301 redirect until someone spotted it :)
Hah, that's awesome.
> Now whether you could actually correlate that to an increase in errors for a particular component, not so much ;)
Yep, makes sense. I was picturing a broad "Something Bad Happened, Go Investigate" notification. But I imagine the sensitivity would have to be tuned, especially to account for massive traffic increases from places like HN.
> Even worse, some applications would return a shared error page but internally, I believe it was logged as a 301 redirect until someone spotted it :)
Yikes!
This is becoming a regular occurrence now...
I wonder if reliability has become less of a priority. As somebody with little to no experience of running things at scale I’m finding myself attributing this to some form of “move fast and break things”.
Nobody remembers the unicorn days? Earlier in GitHub's history, it seemed like a weekly outage was the norm. You just kind of expected it and built workflows in ways where you had a backup path to your code.
Often git operations were unaffected though.
Given that the change happened in the mid-morning PST (timezone where GitHub HQ and most devs are located), I'm going to bet it's almost certainly something messed up from a regular update or deployment.
I remember something after their acquisition about new offers being lower than what certain people had previously, leading to important staff members leaving. This and some other issue I can't quite remember ... it was probably posted on HN :)
Good point; I think they might be deploying big features on Fridays? Lately I've noticed most of the issues I've seen happen on Fridays.
Why would that be the case? Shouldn't it be more common to find deployments happening early Monday morning? It's common practice to avoid potentially bug-inducing changes right before the weekend hits
If you’re trying to minimize impact to employees you deploy Monday or Tuesday morning.
If you’re trying to minimize impact to business customers you deploy Friday night.
Yea exactly, Tuesdays are the ideal days to ship features; this has always been the case everywhere I've worked. Deploying on Friday is just asking for pain, and Mondays are often too chaotic for a release.
Because they likely get a large number of users doing git pull first thing Monday morning.
> I’m finding myself attributing this to some form of “move fast and break things”
That was the case when they were the small and hungry startup.
Meanwhile they've been acquired by a giant corporation with a less than stellar reputation for reliability or quality. So it's actually most likely a case of "move slow and break things".
>stellar reputation for reliability or quality
That is unfair, Minesweeper never crashed, and the print spooler is not up for debate here ;)
Unfortunately right when we were trying to deploy a hot-fix to production, our CI can't clone the PR to run tests.
What do other folks use to avoid this situation? Have a Gitlab instance or similar that you can pull from instead for CI?
It's surprisingly easy, depending on your scale/scope of course. But in general, I've managed to build CI/CD pipelines that are tolerant of GitHub (or any service) failures by following these steps:
1. Use as little of the configuration language provided by the CI as possible (prefer shellscripts that you call in CI instead of having each step in a YAML config for example)
2. Make sure static content is in a Git repository (same or different) that is also available on multiple SCM systems (I usually use GitHub + GitLab mirroring + a private VPS that also mirrors GitHub; rough sketch below)
3. Have a bastion host for doing updates, make CI push changes via the bastion host, and give at least four devs access to it (if you're at that scale; otherwise whoever you have), requiring multisig of 2 of them to access
Now when the service goes down, you just need 2 developers to sign the login for the bastion host, then manually run the shellscript locally to push your update. You'll always be able to update now :)
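For step 2, here's a rough sketch of what the mirroring can look like using plain git remotes with multiple push URLs; all repo URLs below are placeholders, not my actual setup:

    # Push to GitHub, GitLab, and a private VPS in one go (URLs are placeholders).
    git remote add origin git@github.com:example/app.git
    git remote set-url --add --push origin git@github.com:example/app.git
    git remote set-url --add --push origin git@gitlab.com:example/app.git
    git remote set-url --add --push origin git@vps.example.com:mirrors/app.git
    # A single push now updates all three mirrors, so any one of them can
    # serve as the clone source while another host is down.
    git push origin main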
What facility are you using for multi signature login?
Great advice, thank you.
> our CI can't clone the PR to run tests. What do other folks use to avoid this situation?
Multiple remotes can help and is certainly something you should have as a backup. However, I don't think it solves the root cause, which is how the CI is configured.
I'm a firm proponent of keeping your CI as dumb as possible. That's not to say unsophisticated; I mean it should be decoupled as much as possible from the how of the actions it's taking.
If you have a CI pipeline that consists of Clone, Build, Test, and Deploy stages, then I think your actual CI configuration should look as close as possible to the following pseudocode:
    stages:
      - clone: git clone $REPO_URL
      - build: sh ./scripts/build.sh
      - test: sh ./scripts/test.sh
      - deploy: sh ./scripts/deploy.sh

Each of these scripts should be something you can run on anything from your local machine to a hardened bastion, at least given the right credentials/access for the deploy step. They don't have to be shell scripts; they could be npm scripts or makefiles or whatever, as long as all the CI is doing is calling one with very simple or no arguments.
This doesn't rule out using CI-specific features, such as an approval stage. Just don't mix CI-level operations with project-level operations.
As a side benefit this helps avoid a bunch of commits that look like "Actually really for real this time fix deployment for srs" by letting you run these stages manually during development instead of pushing something you think works.
More importantly though, it makes it substantially easier to migrate between CI providers, recover from a CI/VCS crash, or onboard someone who's responsible for CI but maybe hasn't used your specific tool.
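To make the "runnable anywhere" point concrete, this is roughly what running that same pipeline by hand looks like; the repo URL and the DEPLOY_ENV variable are made up for illustration:

    # Run the same stages from a laptop or bastion host (URL/env var are placeholders).
    git clone git@github.com:example/app.git && cd app
    sh ./scripts/build.sh
    sh ./scripts/test.sh
    DEPLOY_ENV=production sh ./scripts/deploy.sh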
You really just need a TCP pathway between your CI and some machine with the git repo on it.
Or take your local copy and use git-fu commands to create a bare repo of it that you can compress and put somewhere like S3. Then download it in CI and checkout from that.
Or just tarball your app source, who cares about git, and do the same (s3, give it a direct path to the asset)
All of this is potentially useless info though. Hard to say without understanding how your CI works. If all you need is the source code, there are a half dozen ways to get that source into CI without git.
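A rough sketch of the bare-repo-to-S3 route; the bucket name and paths are placeholders:

    # On any machine that still has the repo: make a bare copy and park it on S3.
    git clone --bare . /tmp/app.git
    tar -czf /tmp/app.git.tar.gz -C /tmp app.git
    aws s3 cp /tmp/app.git.tar.gz s3://example-bucket/mirrors/app.git.tar.gz

    # In CI: pull it back down and clone from the local bare repo instead of GitHub.
    aws s3 cp s3://example-bucket/mirrors/app.git.tar.gz .
    tar -xzf app.git.tar.gz
    git clone app.git app && cd app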
Can you make a local build? Our fallback is that someone does the CI/CD steps manually.
Presumably, manually running builds locally is an automatic failsafe option if you have the same people around who originally set the build pipeline up in the first place.
In 2021, basic business continuity plans for software companies should incorporate these sorts of concerns. You should have a published procedure somewhere that a person could follow for producing the final build artifacts of your software on any machine once backups are made available. Situations like these are why I check in 100% of my dependencies to source control as well.
Unfortunately we heavily shard our tests; running on a single laptop would take a while.
Ideally your CI/CD is just calling Make/Python/Whatever scripts that are one shot actions. You should be able to run the same action locally from a clean git repo (assuming you have the right permissions).
The anti-pattern to watch out for is long, complex scripts that live in your CI system’s config file. These are hard to test and replicate when you need to.
Well, unfortunately it seems everything I said 11 days ago has become a reality, and I was still downvoted for pointing this truth out. [0]
Too many times I've suggested that everyone begin self-hosting, or at least have that as a backup, but once again some think "going all in on GitHub" is worth it. (It really is not.)
Something I've learned on HN is that upvotes/downvotes mean nothing but how popular an opinion is. You can be 100% right, honest, straightforward and kind, but if the hive-mind does not agree, it does not agree and will downvote your well-written opinion.
Don't read too much into it and comment freely as normal. In the end, it's just internet points.
Doesn't help that this occurred in the same week as a patent-pending MS Patch Tuesday that borked a lot of corporate machines. I'm still cleaning up messes from the changes they pushed out that broke Kyocera drivers.
When I built CI stuff at my previous job there were two remote repos that could be cloned from: GitHub, and a repo on a system on the LAN that the CI's user had ssh access to. Which one was used was controlled by a toggle-able environment variable in the CI system.
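A minimal sketch of that kind of toggle; the variable name and host names here are made up:

    # Pick the clone source based on a toggle set in the CI system's settings.
    # USE_LAN_MIRROR and both URLs are hypothetical names.
    if [ "${USE_LAN_MIRROR:-0}" = "1" ]; then
        REPO_URL="git@git-mirror.lan:team/app.git"
    else
        REPO_URL="git@github.com:team/app.git"
    fi
    git clone "$REPO_URL" app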
> Have a Gitlab instance or similar that you can pull from instead for CI?
Gitlab, mirrored repo basically.
Could you just pull the PR from the submitter’s machine? They could serve it via e.g. git http-backend and you could then point the CI there to pull.
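In a pinch, something like this works; shown with git daemon rather than git http-backend since it's a one-liner to start, and the hostname and branch name are placeholders:

    # On the submitter's machine: serve repos under ~/src read-only over git://
    git daemon --reuseaddr --base-path="$HOME/src" --export-all --verbose

    # From CI or a reviewer's machine (hostname and branch are placeholders):
    git remote add submitter git://submitter-laptop.example.com/app
    git fetch submitter my-feature-branch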
> What do other folks use to avoid this situation?
Don't use Microsoft?
GitLab deleted some DB at some time, kernel.org was hacked years ago...nothing is perfect.
GitHub folks--this is really getting bad. I find it strange that your leadership will spend weeks of time and pen hundreds of words about making right the wrongs they created with censorship (see: https://github.blog/2020-11-16-standing-up-for-developers-yo...), yet there's almost no attention given to these major outages that have kept happening for a year now.
Where is the acknowledgment of a problem, root-cause analysis, and followup for new practices and engineering to prevent issues? Who is responsible for these issues and what are they doing to make it right? What positions are you hiring for _right now_ to get to work making your service reliable?
Again? Just 11 days ago [0], GitHub Actions had a degraded service and now it is the whole of GitHub. It's becoming a regular thing for them and it really is disappointing.
I don't know how many times [0] I have to say this: just get a self-hosted backup rather than "going all in on GitHub" or "centralising everything".
Running a highly available service at this scale is hard. Especially when the service is a ripe target for DoS and other attacks.
With that out of the way... GH has had a lot of issues in recent months. More than in the past. I would hope those things are on the road to being fixed.
I started building Multiverse because of problems like this. Ironically it’s hosted on GitHub. Check it out if you are interested in decentralized VCS and code hosting.
What's the best practice for high availability (self-hosted?) repositories?
Is there a pass-through proxy for git? Or a nice leader-follower arrangement with a proxy server?
Best practice for me is to hope this happens on Friday afternoon and then take a half day... so all according to plan!
I just want to be able to run self-hosted CI without it failing due to github
I don't understand. Almost all of GitHub is a centralized service, while the self-hosted version is available to a select few enterprise partners.
What's the roadblock on doing self-hosted CI without failing due to GitHub? It would be as simple as not using GitHub I think
If you're already using a cloud provider, they [1] all [2] have [3] private git repo services.
You can set up a cronjob to sync them (rough sketch below), or some have built-in config to do the mirroring [4].
I used Google's mirroring option before. It was fine, but we never had to use it (local copies were sufficient when GH was slow one day).
[1] https://cloud.google.com/source-repositories
[2] https://aws.amazon.com/codecommit/
[3] https://azure.microsoft.com/en-us/services/devops/repos/
[4] https://cloud.google.com/source-repositories/docs/mirroring-...
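A bare-bones version of that cronjob sync, using a mirror clone; the paths, schedule, and backup URL are placeholders rather than any particular provider's format:

    # One-time setup: keep a mirror clone and point a second remote at the backup host.
    git clone --mirror git@github.com:example/app.git /srv/mirrors/app.git
    cd /srv/mirrors/app.git
    git remote add backup ssh://git@backup.example.com/srv/git/app.git

    # crontab entry: refresh from GitHub and push everything to the backup every 15 minutes.
    */15 * * * * cd /srv/mirrors/app.git && git fetch --prune origin && git push --mirror backup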
We shouldn't forget about smaller but better providers, like DNS Made Easy or its younger brother Constellix. I also read one guy at WebHostingTalk who said DNSimple has some good new features.
GitHub had better use something more trusted. AWS outages are so well known :/
You could have a self-hosted Gitea or GitLab instance that mirrors your GitHub repo. I use that as a backup for some GitHub repos.
Self-hosting GitLab seems like a good approach here.
We use a cluster of self-hosted GitLab instances. Their update cadence has been on a roll, and their development process is much more transparent than GitHub's, imo; it's a lot easier to see how they comment and discuss decisions since "all-remote" is baked into the core of their workflow.
Believe it or not, we have higher service availability hosting GitLab ourselves than we get from GitHub.
Great opportunity to try out decentralised alternatives like https://radicle.xyz/
I noticed that Radicle claims similar functionality to centralized code collaboration platforms like GitHub, but Radicle itself is being developed on GitHub.
They could use some dogfooding, and a new website.
This is probably the worst time to try out an alternative.
Website looks like garbage on mobile
Interesting, and I love the design
Seeing the unicorn on GitHub, I opened HN and the first post confirms that GitHub is down for everyone else too :)
In the wake of recent events, are there any methods to do CI/CD that will fall back to other providers/local automagically?
My heart can't handle another rollercoaster of unicorns for long...
I was in the middle of some last minute pre-weekend PR review, and midway I discover it can't actually submit any of my comments. Is there a way to review and save (intermediate) state offline?
I believe Github saves your input to local browser storage, so just make the comments and don't submit them til later.
Only for as long as I keep the window open, and for me it's really time to sign off and enjoy the weekend ;)
Unicorn'd on PR, but eventually got it through.
Sent in a commit to fix a few PR suggestions. It went through, but it's nowhere to be found on the PR. Guess I'll have to wait.
Ha. Had trouble with a PR, checked status page, no problems. Merged manually, open Hacker News, and there it is.
rooting for y'all at GitHub!
darn, right when I had to set up a new machine and can't get Homebrew :(
Yep, can't confirm a PR: Unicorn.
Happy Friday :/
Hopefully this will spark a productive conversation about the advantages and disadvantages of centralizing a decentralized VCS.