GitHub incident: ongoing issues with Actions, Issues and other Git operations
githubstatus.com
I debugged for one min at 11:59 trying to push, then my eat-lunch notification came in at 12:00 and I opened hackernews with a tuna sandwich, and this is super helpful because it means I won't need to debug it locally for 10 mins before figuring out GitHub is down.
Edit - Just want to clarify: when I say "opened hackernews with a tuna sandwich" I mean I did indeed full-on mash the keyboard keys with my sandwich. It's costing me a fortune in keyboards every day and it's ruining my sandwich most days as well, I think I have an issue.
This is why I don't touch other people's keyboards.
Yeah, this is a good argument for licking them instead.
Little bit of bacteria helps build up your immune system for when you really need it :)
You jest, but it's actually a legit hypothesis.
Well, from another perspective, a bootstrapping perspective, and hackernews does like a bootstrapping perspective, I could make a case that a good meal could be licked out of most keyboards, saving money on lunch once a week.
Ideally both GitHub and HN would be down at the same time. With nowhere to discuss the former, people might actually get something done.
Easy to fix, just adjust your system prompt and add the following:
“Never ever type on keyboard with sandwich not even the most delicious tuna sandwich”
Next time try coffee, it's another worthy experience.
The GitHub Status shows 14 incidents affecting Git Operations this year alone [1]. That's quite a lot, considering it's only May. I wonder if the outages were always this frequent and just get more publicity on here now, or whether there was a significant increase in outages fairly recently.
Many outages happen because something changed, and someone/something missed one of the effects of said change, bringing the platform down immediately, or after a while.
There was a period of time when GitHub basically didn't change, for years. And the platform was relatively stable, although "unicorns" (downtime) still happened from time to time.
But nowhere near as often as now; then again, there are a lot more moving pieces now compared to before.
I wonder if it has anything to do with layoffs, if crucial Ops people left
Interested to hear whether anyone actually managed to get some credits as per their SLA [1]? Over the last quarter they probably went sub 99.9% in some services.
[1] https://github.com/customer-terms/github-online-services-sla
Does GitHub provide uptime stats? Seems very unreliable recently.
About 10 years ago someone said we should move to self hosting because bitbucket, which we used, was unreliable. I looked at the status page and saw 2 hours of downtime over 3 months, while we have 3-4 days of downtime on our self hosted jenkins during the same time. I always think of that when I see people complain about services being unreliable. Often we see one or two problems in a short span and forget about the months where we didn't see any issues.
GitHub is probably as reliable now as it has been for the past 10 years. It's always had downtime.
> while we have 3-4 days of downtime on our self hosted jenkins during the same time
Wow. This sounds like a broken organization?
Sounds like a startup with no devops, from before a lot of the tooling existed to make things better.
And why would such a startup install a Jenkins server instead of relying on some SaaS free tier?
It was 10 years ago. There weren't really any SaaS with free tiers.
It sounds like literally every organization.
No it isn't. It's been very unstable in recent times (1 year or so).
I've personally only experienced 1 outage that affected me. And I've been using it heavily for the past year and moderately for years before that.
They used to, some screenshots of it existing here:
> Let’s hope it’s temporary and GitHub error 500 won’t become their own version of Blue Screen of Death. In this case it would be Green Screen of Death (GSoD or GhSoD).
Heh, if anything it has gotten worse.
https://www.githubstatus.com/history seems to show incidents in the last 3 months:
* March - 20 incidents
* April - 12 incidents
* May - 4 incidents (so far)
We notice API outages that break core functionality every single day. It's gotten so bad over the last couple years.
Performance of the GitHub website in general feels increasingly bad.
Yes, my experience too. I feel it the most when reviewing code. It felt much snappier before.
It's a surprisingly unreliable service. It's been great for code management / reviews. But I can't imagine relying on it as the only option for deployments via CD. Imagine needing to deploy an important bug fix or a big project with a release date, but you can't because Github's having an outage.
You can check out https://github.com/GitHub-Incident-History/GitHub-Incident-H... which provides all recorded GitHub incidents.
It's been like this since M$FT took over, so not that recent.
Once again another GitHub incident, 4 days after the last one [0]: GitHub Actions goes down.
You are better off self-hosting at this point, rather than centralizing everything on GitHub [1], as it has been chronically unreliable for years, ever since the Microsoft acquisition.
> You are better off self-hosting at this point
For medium/large companies, I fully agree. For smaller projects, specifically open source projects, I'd look at something like https://codeberg.org/
Rumor has it Microsoft is pushing them to move their infrastructure to Azure, which explains much of the frequent downtimes lately.
Maybe that's the case, but GitHub has always had issues with uptime and stability.
"Always"? I don't remember it being nearly as frequent before they were bought by Microsoft, some data to back this up would be nice.
Where'd you get this from?
1) A friend in DevOps who supposedly has contacts there.
2) It's not like this push is a big secret: https://www.cnbc.com/2022/10/12/microsoft-github-relying-mor...
Where is their infra right now?
Really looking forward to kicking the tires on Gitea next month.
We switched to self-hosted Gitea last month, no regrets. Only the CI story could be a bit better. We're currently using Woodpecker but need macOS runners, and Woodpecker's "local" agent implementation is still unstable. I'm watching Gitea Actions' progress with great interest.
Did you try the macOS runners? Been using Linux, Windows and macOS runners without any issues for quite some time.
Any reason to use Gitea over the Forgejo fork? Based on the drama I'd rather use Forgejo if they're functionally equal.[1]
Great to be here. Next week same time?
Been off work for a month-ish. Everything is as it was I see.
I'm the lone person on my team who still believes in keeping most of our stuff local, with online versions primarily as backup.
Every time some global service goes down, the internal internet/intranet goes down, there's a security breach, or a WFH person has a power outage, I'm reminded I'm right.
I'm no luddite, but these services make you dependent on them. The worst thing I'm dependent on here is a bad computer. We have backups and keep our files on our network, so it seems fine. We are slowly moving to an online system, and I'm constantly reminded of all the problems with shifting online.
Meanwhile, if I had a linux server, we would be in control of our own destiny.
I don't know about you, but the stuff I self-manage usually has worse uptime than SaaS products
Doesn't matter if your uptime is 80% as long as that 20% of downtime is happening when nobody is working
Additionally an 80% uptime architecture is really simple to maintain and restore and so on.
Complexity increases exponentially the more 9's you add.
Depends.
Having our programs offline means I can run them even if the internet isn't working.
Instead of getting 0 data from downtime, I can still get the data, run the programs, and give it to the person who needs it.
If we are fully online and the servers are down, we basically lose the entire time.
Not to mention, I think 'uptime' is a pretty optimistic number; an unusably slow service doesn't seem to hit any metrics I'm aware of.
Really depends on the "stuff". GitLab pretty much manages itself through their Helm and Omnibus installs.
Git is actually a great protocol for keeping distributed copies of code. You can pretty easily cycle through a list of backup URLs for a git repo with bash, looking for updates.
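A minimal sketch of what I mean (all mirror URLs below are made-up placeholders):

    # Try each mirror in order and fetch from the first one that responds.
    mirrors="git@github.com:acme/foo.git git@gitlab.com:acme/foo.git git@git.internal:acme/foo.git"
    for url in $mirrors; do
        if git fetch "$url" '+refs/heads/*:refs/remotes/mirror/*'; then
            echo "fetched updates from $url"
            break
        fi
    done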
Git itself is nice, but then there's the issue tracker and CI that are more difficult to set up.
Perhaps everyone should stop complaining and be thankful for a chill morning. You can't create a PR right now - go get a pastry and some fresh air. Be in the moment for once. It's beautiful outside*
* Where I am
Why don't we all collectively mirror our repos to gitlab and switch development to there during the monthly outage?
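The git side of that is simple enough; roughly something like this (repo path and branch name are placeholders):

    # One-time setup: add GitLab as a second remote and mirror everything to it.
    git remote add gitlab git@gitlab.com:acme/foo.git
    git push --mirror gitlab

    # During an outage, push work-in-progress branches to the mirror instead:
    git push gitlab my-branch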
GitHub outages aren't nearly long or often enough to consider this. Git is distributed, just keep working locally until GitHub is back up. GitHub outages are nowhere near the threshold of pain I'd require to introduce a second Git hosting provider to the mix.
Really, GitHub outages barely hurt at all. It's not like an AWS or Cloudflare outage which is more likely to be a production disaster. Every outage a bunch of people on HN start screaming about owning their own on-prem destiny or wondering why we're still on GitHub. Nothing changes because it's not nearly as bad as those people are making it out to be. Life is all about tradeoffs.
Depends how your company is set up. Some people can't run tests locally and just push commits to have some magic run the tests online.
Enterprise says I have to use this unreliable garbage.
It's definitely no less reliable than GitLab, where a good 300GB of database data got deleted in production by accident...
And how often has that happened? Seems a little harsh.
Just like calling GitHub "unreliable garbage"...
GitHub has been down hundreds of times this year alone. They have reported 72 outages this year, and there are multiple times when services are unavailable and they don't report it on the status page.
I don't see how the two situations are comparable
> there are multiple times when services are unavailable and they don't report it on the status page.
There's no evidence that the exact same doesn't happen with GitLab. I've had it (consistently) 500 on me in the past when there's nothing on their status page to indicate any issues.
Okay, agreed, except:
That's not the point of discussion. I didn't say GitLab doesn't lie about it or, heck, that it doesn't have worse uptime than GitHub.
My argument is that a company erasing a 300GB production database once is not a stain on their competency, and that it cannot be compared to a company which has very frequent outages and also happens to lie when they have outages.
gitlab.com or self-hosted?
gitlab.com is implied since it happening on a self-hosted instance would have nothing to do with gitlab as a service (they can't be responsible for your on-site backups).
> Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.
https://about.gitlab.com/blog/2017/02/10/postmortem-of-datab...
Ah I see the link. I'd caution that many people choose between github.com, gitlab.com, and gitlab self-hosted. The reliability of self-hosted gitlab is meaningful, especially when operated competently. People need to know if there are safeguards or foot guns. Backups alone can't prevent data loss.
Substitute capital expenditure for operating expenses? With interest rates going up? It was already a tough sell with negative real rates...
Yep. We could self host. But it's forbidden.
We migrated recently. On prem was never down, but since moving to GitHub we're more down than up.
https://status.gitlab.com/ lists 27 incidents this year, so far.
https://www.githubstatus.com/history lists 72 incidents since January
The point isn't that GitLab has more, the point is that running these things at global scale is pretty complicated, and everyone has problems. "Just switch to GitLab" is pithy but isn't in itself an actual solution.
You can self-host GitLab and have few, if any, incidents that get resolved very quickly. Worked for a company that had no incidents that I observed in ~3 years, now work at a company that had ~2 incidents in 1.5 years.
We have a self-hosted Premium instance and have 30min of downtime _every day_ while the database is frozen and backed up. We've been told that it's a known issue being discussed with GitLab but that could just be CYA. But in any case, it's the "at scale, while changing" that tends to cause problems.
Perhaps this is a continuing argument for self-hosting, especially if you don't have to expose the instance publicly. But then, if that's an option, you can also self-host GitHub (though I have heard fewer anecdotes about the stability of that).
> We have a self-hosted Premium instance and have 30min of downtime _every day_ while the database is frozen and backed up.
I'm confused. You can do zero-downtime backups and replication of databases. I don't know what your company / Gitlab are doing but it seems wrong.
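For PostgreSQL this is a standard facility. A minimal sketch, not GitLab's documented procedure; host, role and target directory are placeholders:

    # Take an online base backup over the streaming replication protocol;
    # the database stays fully available while this runs.
    #   -Ft -z     tar format, gzip-compressed
    #   -X stream  stream WAL alongside the backup so it is consistent
    #   -P         show progress
    pg_basebackup -h db.internal.example -U replication_user \
        -D "/backups/pg/$(date +%F)" -Ft -z -X stream -P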
And why aren't they doing it at 3am?
You can self-host GitHub Enterprise too.
GitLab is quite a bit more expensive. If you have GitHub Enterprise with the security features, it's $70/month/user whereas you'll need to get GitLab Ultimate for the security features, which is $99/month/user.
Feature mismatch for anything outside Git. And no one has written the tooling needed to synchronize issues, pull/merge requests, ... back and forth.
Even so, for a lot of devs it's still easier as a temporary collaboration point than sending patches via mail.
why particularly GitLab?
From the outside, it appears GitHub doesn't have any internal sharding going on. Outages always affect _all_ repos.
Architecturally this seems rather suboptimal?
E.g. AWS doesn't roll out changes globally - they start with an internal shard within a region and progressively roll out to more shards and more regions.
Why do GitHub not do the same?
Last I heard they've sharded some stuff, but some of the core tables/databases are still using a monolithic architecture.
And I guess you're only as strong as your weakest link, which may not be that bad, as long as it isn't your core tables.
Though take this with a grain of salt, this is mostly hearsay =D
You will probably notice it also when trying to push a new branch:
`error: failed to push some refs to ` when using --set-upstream
Indeed, I notice it when trying to push also:
    remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
    remote: fatal error in commit_refs
    To github.com:acme/foo.git
     ! [remote rejected] HEAD -> acme/foo (failure)
    error: failed to push some refs to 'github.com:acme/foo.git'
"Actions, Issues" are not "Git operations".
would sound better without "other" in title
good catch
If that makes you mad, I still need help with https://github.com/MichaelMure/git-bug ;-) Coming at some point, kanban and pull-request support, offline-first!
Hm, we had this last week as well. Guess who had a demo yesterday, needed to deploy things on Thursday/Friday and couldn't :(
If this is your first time being affected, I guess you have a good reason to adjust your deployment structure so you can deploy manually if needed.
If this is not the first time, hopefully at least it will be the last one :)
No worries, we can. But I like to rely on automation to spend my time on other stuff. Worked out good in the end though ;)
Maybe a good time to investigate whether it's possible to hold those demos without relying on third party services.
You're always relying on third parties. Always. Except if you run it locally. We're way beyond that. I deployed to production just fine. It's just a helper. It adds to the stress tho.
Can anyone from GH weigh in on this? We've had several major outages from GH over the last month or two, and the company has been completely silent on the causes, as well as any sort of remediation steps to fix stability.
As a somewhat large org, we're now exploring other options for code hosting.
Was unable to merge PRs.
Earlier, I also got GitHub PR comment emails about 6 hours late.
Whatever it is, it’s been happening for more than 6 hours.
Could you apply the patch from the URL? Pull requests aren't really needed.
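For anyone who hasn't done this before, a minimal sketch with placeholder URLs (any host serving the patch file works):

    # Apply a patch fetched from a URL directly, no pull request needed.
    curl -fsSL https://example.com/fix-login-bug.patch | git am
    # Or, for a plain diff without commit metadata:
    curl -fsSL https://example.com/fix-login-bug.diff | git apply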
Thankfully git is a distributed version control system, so such outages are not of major concern.
Ignoring the fact that git is such a small part of what people actually do with GitHub. Issues, PRs, CI/CD and basically everything that isn't git doesn't happen over git (besides the wiki, which somehow miraculously actually is via git).
Some people have their entire roadmap in GitHub, and every single bug report / feature request, without any offline backup. Don't ask me why, I don't get it. Especially since they have proven for the last few years that they cannot keep the platform up in a stable manner.
I mean, you joke but that's actually fairly true. P4 users always notice when the central server goes down because you can't reliably look at changelist history, draft CLs, and do a host of other operations that are possible on git locally. (using a central VCS confers other advantages of course).
You're kidding, right? People / companies rely on git being available for deployments, builds, config changes, developer workflow, etc.
But "git" is available. It's most likely under /usr/bin.
It's not git people are paying for.
You are confusing git and GitHub.
No I'm not. This outage affects Github, not git itself - but if you're storing your git repos (and automation) on Github then you cannot git clone, push etc... from or to them - all of which are critical to CI/CD.
They are adding affected services to the status entry title (started with Issues, Actions, Operations). Can't even do a simple push due to this so-called "degraded performance".
They’ve convinced me to shut off the GitHub status alerts I get on slack (or at least move them to their own channel that I can ignore).
CivitAI.com is down and Hugging Face is having 502s for some operations. Perhaps there is a larger issue?
Wow huggingface is down completely
Wondered why I couldn't push my notes, appears that pushing commits is also impacted.
Goddamn, and here I thought I was getting errors because the files I pushed were too large or something.
Ever since the Microsoft acquisition.
This is a monthly incident at this point. Maybe they have a rogue cron job??
more like weekly when you look at the past incidents https://www.githubstatus.com/#past-incidents
I wonder if all these incidents are related to the Azure DevOps integration.
This happens constantly, what is the Github team doing?
I've started working on a Forgejo instance for myself (Gitea fork). It's honestly disappointing how bad GitHub has gotten lately, just in terms of uptime. I hope they get their stuff together.
I haven't been able to push for a bit now
"We gave ChatGPT root access to our infrastructure servers, and unexpectedly it crashed everything."
That's almost believable at this point. ;)
2 hours going strong!
Having issues with `git push` from The Netherlands. :(
Cannot create a new repo
Now pages too
Again?
Yes. Once again. [0]
Due to GitHub's chronic unreliability, it is guaranteed to continue happening every month.
Looks like the advice to avoid 'centralizing everything to GitHub' has aged very well [1], and at this point you would get better uptime with self-hosting instead of using GitHub.
Just ask the many open source organizations like RedoxOS, ReactOS, WireGuard, GNOME, KDE, etc.