GitHub incident: ongoing issues with Actions, Issues and other Git operations
githubstatus.com
I debugged for one min at 11:59 trying to push, then my eat-lunch notification came in at 12:00 and I opened hackernews with a tuna sandwich, and this is super helpful because it means I won't need to debug it locally for 10 mins before figuring out GitHub is down.
Edit - Just want to clarify: when I say "opened hackernews with a tuna sandwich" I mean I did indeed full-on mash the keyboard keys with my sandwich. It's costing me a fortune in keyboards every day and it's ruining my sandwich most days as well, I think I have an issue.
This is why I don't touch other people's keyboards.
Yeah, this is a good argument for licking them instead.
Little bit of bacteria helps build up your immune system for when you really need it :)
You jest, but it's actually a legit hypothesis.
Well, from another perspective, a bootstrapping perspective, and hackernews does like a bootstrapping perspective, I could make a case that a good meal could be licked out of most keyboards, saving money on lunch once a week.
Ideally both GitHub and HN would be down at the same time. With nowhere to discuss the former, people might actually get something done.
Easy to fix, just adjust your system prompt and add the following:
“Never ever type on keyboard with sandwich not even the most delicious tuna sandwich”
Next time try coffee, it's another worthy experience.
The GitHub Status shows 14 incidents affecting Git Operations this year alone [1]. That's quite a lot, considering it's only May. I wonder if the outages were always this frequent and just get more publicity on here now, or whether there was a significant increase in outages fairly recently.
Many outages happen because something changed, and someone/something missed one of the effects of said change, bringing the platform down immediately, or after a while.
There was a period of time when GitHub basically didn't change, for years. And the platform was relatively stable, although "unicorns" (downtime) still happened from time to time.
But nowhere near as often as now; then again, there are a lot more moving pieces now compared to before.
I wonder if it has anything to do with layoffs, if crucial Ops people left
Interested to hear whether anyone actually managed to get some credits as per their SLA [1]? Over the last quarter they probably went sub 99.9% in some services.
[1] https://github.com/customer-terms/github-online-services-sla
Does GitHub provide uptime stats? Seems very unreliable recently.
About 10 years ago someone said we should move to self hosting because bitbucket, which we used, was unreliable. I looked at the status page and saw 2 hours of downtime over 3 months, while we have 3-4 days of downtime on our self hosted jenkins during the same time. I always think of that when I see people complain about services being unreliable. Often we see one or two problems in a short span and forget about the months where we didn't see any issues.
GitHub is probably as reliable now as it has been for the past 10 years. It's always had downtime.
> while we have 3-4 days of downtime on our self hosted jenkins during the same time
Wow. This sounds like a broken organization?
Sounds like a startup with no devops, from before a lot of the tooling existed to make things better.
And why would such a startup install a Jenkins server instead of relying on some SaaS free tier?
It was 10 years ago. There weren't really any SaaS with free tiers.
It sounds like literally every organization.
No it isn't. It's been very unstable in recent times (1 year or so).
I've personally only experienced 1 outage that affected me. And I've been using it heavily for the past year and moderately for years before that.
They used to, some screenshots of it existing here:
> Let’s hope it’s temporary and GitHub error 500 won’t become their own version of Blue Screen of Death. In this case it would be Green Screen of Death (GSoD or GhSoD).
Heh, if anything it has gotten worse.
https://www.githubstatus.com/history seems to show incidents in the last 3 months:
* March - 20 incidents
* April - 12 incidents
* May - 4 incidents (so far)
We notice API outages that break core functionality every single day. It's gotten so bad over the last couple years.
Performance of the GitHub website in general feels increasingly bad.
Yes, my experience too. I feel it the most when reviewing code. It felt much snappier before.
It's a surprisingly unreliable service. It's been great for code management / reviews. But I can't imagine relying on it as the only option for deployments via CD. Imagine needing to deploy an important bug fix or a big project with a release date, but you can't because Github's having an outage.
You can check out https://github.com/GitHub-Incident-History/GitHub-Incident-H... which provides all recorded GitHub incidents.
It's been like this since M$FT took over, so not that recent.
Once again another GitHub incident, 4 days after the last one [0]: GitHub Actions goes down.
You are better off self-hosting at this point, rather than centralizing everything on GitHub [1], as it has been chronically unreliable for years, ever since the Microsoft acquisition.
> You are better off self-hosting at this point
For medium/large companies, I fully agree. For smaller projects, specifically open source projects, I'd look at something like https://codeberg.org/
Rumor has it Microsoft is pushing them to move their infrastructure to Azure, which explains much of the frequent downtimes lately.
Maybe that's the case, but GitHub has always had issues with uptime and stability.
"Always"? I don't remember it being nearly as frequent before they were bought by Microsoft, some data to back this up would be nice.
Where'd you get this from?
1) A friend in DevOps who supposedly has contacts there.
2) It's not like this push is a big secret: https://www.cnbc.com/2022/10/12/microsoft-github-relying-mor...
Where is their infra right now?
Really looking forward to kicking the tires on Gitea next month.
We switched to self-hosted Gitea last month, no regrets. Only the CI story could be a bit better. We're currently using Woodpecker but need macOS runners, and Woodpecker's "local" agent implementation is still unstable. I'm watching Gitea Actions' progress with great interest.
Did you try the macOS runners? Been using Linux, Windows and macOS runners without any issues for quite some time.
Any reason to use Gitea over the Forgejo fork? Based on the drama I'd rather use Forgejo if they're functionally equal.[1]
Great to be here. Next week same time?
Been off work for a month-ish. Everything is as it was I see.
I'm the lone person on my team who still believes in keeping most of our stuff local, with online versions primarily as backup.
Every time some global service goes down, the internal internet/intranet goes down, there's a security breach, or a WFH person has a power outage, I'm reminded I'm right.
I'm no luddite, but these services make you dependent on them. The worst thing I'm dependent on here is a bad computer. We have backups and keep our files on our network, so it seems fine. We are slowly moving to an online system, and I'm constantly reminded of all the problems with shifting online.
Meanwhile, if I had a linux server, we would be in control of our own destiny.
I don't know about you, but the stuff I self-manage usually has worse uptime than SaaS products
Doesn't matter if your uptime is 80% as long as that 20% of downtime is happening when nobody is working
Additionally an 80% uptime architecture is really simple to maintain and restore and so on.
Complexity increases exponentially the more 9's you add.
Depends.
Having our programs offline means I can run them even if the internet isn't working.
Instead of getting 0 data from downtime, I can still get the data, run the programs, and give it to the person who needs it.
If we are fully online and the servers are down, we basically lose the entire time.
Not to mention, I think 'uptime' is a pretty optimistic number; an unusably slow service doesn't seem to hit any metrics I'm aware of.
Really depends on the "stuff". GitLab pretty much manages itself through their Helm and Omnibus installs.
Git is actually a great protocol for keeping distributed copies of code. You can pretty easily cycle through a list of backup URLs for a git repo with bash, looking for updates.
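A minimal sketch of what I mean (all mirror URLs below are made-up placeholders):

    # Try each mirror in order and fetch from the first one that responds.
    mirrors="git@github.com:acme/foo.git git@gitlab.com:acme/foo.git git@git.internal:acme/foo.git"
    for url in $mirrors; do
        if git fetch "$url" '+refs/heads/*:refs/remotes/mirror/*'; then
            echo "fetched updates from $url"
            break
        fi
    done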
Git itself is nice, but then there's the issue tracker and CI that are more difficult to set up.
Perhaps everyone should stop complaining and be thankful for a chill morning. You can't create a PR right now - go get a pastry and some fresh air. Be in the moment for once. It's beautiful outside*
* Where I am
Why don't we all collectively mirror our repos to gitlab and switch development to there during the monthly outage?
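The git side of that is simple enough; roughly something like this (repo path and branch name are placeholders):

    # One-time setup: add GitLab as a second remote and mirror everything to it.
    git remote add gitlab git@gitlab.com:acme/foo.git
    git push --mirror gitlab

    # During an outage, push work-in-progress branches to the mirror instead:
    git push gitlab my-branch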
GitHub outages aren't nearly long or often enough to consider this. Git is distributed, just keep working locally until GitHub is back up. GitHub outages are nowhere near the threshold of pain I'd require to introduce a second Git hosting provider to the mix.
Really, GitHub outages barely hurt at all. It's not like an AWS or Cloudflare outage which is more likely to be a production disaster. Every outage a bunch of people on HN start screaming about owning their own on-prem destiny or wondering why we're still on GitHub. Nothing changes because it's not nearly as bad as those people are making it out to be. Life is all about tradeoffs.
Depends how your company is set up. Some people can't run tests locally and just push commits to have some magic run the tests online.
Enterprise says I have to use this unreliable garbage.
It's definitely no less reliable than GitLab, where a good 300GB of database data got deleted in production by accident...
And how often has that happened? Seems a little harsh.
Just like calling GitHub "unreliable garbage"...
GitHub has been down hundreds of times this year alone. They have reported 72 outages this year, and there are multiple times when services are unavailable and they don't report it on the status page.
I don't see how the two situations are comparable
> there are multiple times when services are unavailable and they don't report it on the status page.
There's no evidence that the exact same doesn't happen with GitLab. I've had it (consistently) 500 on me in the past when there's nothing on their status page to indicate any issues.
Okay, agreed, except:
That's not the point of discussion. I didn't say GitLab doesn't lie about it or, heck, that it doesn't have worse uptime than GitHub.
My argument is that a company erasing a 300GB production database once is not a stain on their competency, and that it cannot be compared to a company which has very frequent outages and also happens to lie when they have outages.
gitlab.com or self-hosted?
gitlab.com is implied since it happening on a self-hosted instance would have nothing to do with gitlab as a service (they can't be responsible for your on-site backups).
> Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.
https://about.gitlab.com/blog/2017/02/10/postmortem-of-datab...
Ah I see the link. I'd caution that many people choose between github.com, gitlab.com, and gitlab self-hosted. The reliability of self-hosted gitlab is meaningful, especially when operated competently. People need to know if there are safeguards or foot guns. Backups alone can't prevent data loss.
Substitute capital expenditure for operating expenses? With interest rates going up? It was already a tough sell with negative real rates...
Yep. We could self host. But it's forbidden.
We migrated recently. On prem was never down, but since moving to GitHub we're more down than up.
https://status.gitlab.com/ lists 27 incidents this year, so far.
https://www.githubstatus.com/history lists 72 incidents since January
The point isn't that GitLab has more, the point is that running these things at global scale is pretty complicated, and everyone has problems. "Just switch to GitLab" is pithy but isn't in itself an actual solution.
You can self-host GitLab and have few, if any, incidents that get resolved very quickly. Worked for a company that had no incidents that I observed in ~3 years, now work at a company that had ~2 incidents in 1.5 years.
We have a self-hosted Premium instance and have 30min of downtime _every day_ while the database is frozen and backed up. We've been told that it's a known issue being discussed with GitLab but that could just be CYA. But in any case, it's the "at scale, while changing" that tends to cause problems.
Perhaps this is a continuing argument for self-hosting, especially if you don't have to expose the instance publicly. But then, if that's an option, you can also self-host GitHub (though I have heard fewer anecdotes about the stability of that).
> We have a self-hosted Premium instance and have 30min of downtime _every day_ while the database is frozen and backed up.
I'm confused. You can do zero-downtime backups and replication of databases. I don't know what your company / Gitlab are doing but it seems wrong.
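For PostgreSQL this is a standard facility. A minimal sketch, not GitLab's documented procedure; host, role and target directory are placeholders:

    # Take an online base backup over the streaming replication protocol;
    # the database stays fully available while this runs.
    #   -Ft -z     tar format, gzip-compressed
    #   -X stream  stream WAL alongside the backup so it is consistent
    #   -P         show progress
    pg_basebackup -h db.internal.example -U replication_user \
        -D "/backups/pg/$(date +%F)" -Ft -z -X stream -P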
And why aren't they doing it at 3am?
You can self-host GitHub Enterprise too.
GitLab is quite a bit more expensive. If you have GitHub Enterprise with the security features, it's $70/month/user whereas you'll need to get GitLab Ultimate for the security features, which is $99/month/user.
Feature mismatch for anything outside Git. And no one has written the tooling needed to synchronize issues, pull/merge requests, ... back and forth.
Even so, for a lot of devs it's still easier as a temporary collaboration point than sending patches via mail.
why particularly GitLab?
From the outside, it appears GitHub doesn't have any internal sharding going on. Outages always affect _all_ repos.
Architecturally this seems rather suboptimal?
E.g. AWS doesn't roll out changes globally - they start with an internal shard within a region and progressively roll out to more shards and more regions.
Why do GitHub not do the same?
Last I heard they've sharded some stuff, but some of the core tables/databases are still using a monolithic architecture.
And I guess you're only as strong as your weakest link, which may not be that bad, as long as it isn't your core tables.
Though take this with a grain of salt, this is mostly hearsay =D
You will probably notice it also when trying to push a new branch:
`error: failed to push some refs to ` when using --set-upstream
Indeed, I notice it when trying to push also:
    remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
    remote: fatal error in commit_refs
    To github.com:acme/foo.git
     ! [remote rejected] HEAD -> acme/foo (failure)
    error: failed to push some refs to 'github.com:acme/foo.git'
"Actions, Issues" are not "Git operations".
would sound better without "other" in title
good catch
If that makes you mad, I still need help with https://github.com/MichaelMure/git-bug ;-) Coming at some point, kanban and pull-request support, offline-first!
Hm, we had this last week as well. Guess who had a demo yesterday, needed to deploy things on Thursday/Friday and couldn't :(
If this is your first time being affected, I guess you have a good reason to adjust your deployment structure so you can deploy manually if needed.
If this is not the first time, hopefully at least it will be the last one :)
No worries, we can. But I like to rely on automation to spend my time on other stuff. Worked out good in the end though ;)
Maybe a good time to investigate whether it's possible to hold those demos without relying on third party services.
You're always relying on third parties. Always. Except if you run it locally. We're way beyond that. I deployed to production just fine. It's just a helper. It adds to the stress tho.
Can anyone from GH weigh in on this? We've had several major outages from GH over the last month or two, and the company has been completely silent on the causes, as well as any sort of remediation steps to fix stability.
As a somewhat large org, we're now exploring other options for code hosting.
Was unable to merge PRs.
Earlier, I also got GitHub PR comment emails about 6 hours late.
Whatever it is, it’s been happening for more than 6 hours.
Could you apply the patch from the URL? Pull requests aren't really needed.
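For anyone who hasn't done this before, a minimal sketch with placeholder URLs (any host serving the patch file works):

    # Apply a patch fetched from a URL directly, no pull request needed.
    curl -fsSL https://example.com/fix-login-bug.patch | git am
    # Or, for a plain diff without commit metadata:
    curl -fsSL https://example.com/fix-login-bug.diff | git apply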
Thankfully git is a distributed version control system, so such outages are not of major concern.
Ignoring the fact that git is such a small part of what people actually do with GitHub. Issues, PRs, CI/CD and basically everything that isn't git doesn't happen over git (besides the wiki, which somehow miraculously actually is via git).
Some people have their entire roadmap in GitHub, and every single bug report / feature request, without any offline backup. Don't ask me why, I don't get it. Especially since they have proven for the last few years that they cannot keep the platform up in a stable manner.
I mean, you joke but that's actually fairly true. P4 users always notice when the central server goes down because you can't reliably look at changelist history, draft CLs, and do a host of other operations that are possible on git locally. (using a central VCS confers other advantages of course).
You're kidding, right? People / companies rely on git being available for deployments, builds, config changes, developer workflow, etc.
But "git" is available. It's most likely under /usr/bin.
It's not git people are paying for.
You are confusing git and GitHub.
No I'm not. This outage affects Github, not git itself - but if you're storing your git repos (and automation) on Github then you cannot git clone, push etc... from or to them - all of which are critical to CI/CD.
They are adding affected services to the status entry title (started with Issues, Actions, Operations). Can't even do a simple push due to this so-called "degraded performance".
They’ve convinced me to shut off the GitHub status alerts I get on slack (or at least move them to their own channel that I can ignore).
CivitAI.com is down and Hugging Face is having 502s for some operations. Perhaps there is a larger issue?
Wow huggingface is down completely
Wondered why I couldn't push my notes, appears that pushing commits is also impacted.
Goddamn, and here I thought I was getting errors because the files I pushed were too large or something.
Ever since the Microsoft acquisition.
This is a monthly incident at this point. Maybe they have a rogue cron job??
more like weekly when you look at the past incidents https://www.githubstatus.com/#past-incidents
I wonder if all these incidents are related to the Azure DevOps integration.
This happens constantly, what is the Github team doing?
I've started working on a Forgejo instance for myself (Gitea fork). It's honestly disappointing how bad GitHub has gotten lately, just in terms of uptime. I hope they get their stuff together.
I haven't been able to push for a bit now
"We gave ChatGPT root access to our infrastructure servers, and unexpectedly it crashed everything."
That's almost believable at this point. ;)
2 hours going strong!
Having issues with `git push` from The Netherlands. :(
Cannot create a new repo
Now pages too
Again?
Yes. Once again. [0]
Due to GitHub's chronic unreliability, it is guaranteed to continue happening every month.
Looks like the advice to avoid 'centralizing everything to GitHub' has aged very well [1], and at this point you would get better uptime with self-hosting instead of using GitHub.
Just ask the many open source organizations like RedoxOS, ReactOS, WireGuard, GNOME, KDE, etc.