Incident with GitHub Actions, API requests, Codespaces, Git operations, Issues

githubstatus.com

267 points by naglis 4 years ago · 122 comments (121 loaded)

Wavelets 4 years ago

Whew, glad I decided to scroll HN right now. I've been puzzling over why I'm getting "! [remote rejected] master -> master (Internal Server Error)" as well while trying to push and decided to take a break.

  • adelarsq 4 years ago

    Time to take some coffee and configure Vim

    • polishdude20 4 years ago

      Don't you guys have other features and stuff to work on locally? What is this "time to take a break when GitHub is down"? I'm saying this a bit tongue in cheek btw :)

      • websap 4 years ago

        All my features are part of 1 PR, the PR contains no code to avoid bugs, the features are in my head.

    • jamil7 4 years ago

      But muh plugins are all on Github...

    • wvh 4 years ago

      I hope you have a lot of coffee.

  • forgingahead 4 years ago

    It's been like that for at least 6 hours, randomly appearing. I would take a pause and try again and then it would work, but now it's definitely much more persistent.

    Guess it's time to go play some video games....

    https://xkcd.com/303/

    • 5e92cb50239222b 4 years ago

      Here you go:

        $ while ! git push my; do sleep 1; done
      
      Works for me eventually, although commits do not appear in web interface (they do in the actual repository).

      • klibertp 4 years ago

        Having been on the receiving end of things like this: please, make the sleep longer. Adding more requests to an already malfunctioning system is not a good way to help fix it.

      • forgingahead 4 years ago

        Thanks but no thanks - no way am I doing anything to my core app repos when the repo host is fritzing out. This is one of those moments to go for a walk (or bed, depending on your timezone).

      • TimWolla 4 years ago

        -f does not sound like a good idea to me in a script like that.

        • renata 4 years ago

          Also pretty much every usage of -f would be better off being --force-with-lease so you're less likely to accidentally clobber someone else's work. I have git fpush aliased to "push --force-with-lease" and try to spread the gospel when possible :)

          • kubanczyk 4 years ago

            Yeah, I learned it by using magit or vscode's other magit and they both default to --force-with-lease.

        • 5e92cb50239222b 4 years ago

          Good point. I just copy-pasted it from the terminal, as it made sense in my particular situation. I'll remove it.

    • cik 4 years ago

      Also yesterday depending on where you were in the world.

  • dgellow 4 years ago

    Yep, same here! Good time to make a new coffee :)

  • ahmadrosid 4 years ago

    Same here, got rejected when pushing: ! [remote rejected] HEAD -> main (Internal Server Error)

    • mullikine 4 years ago

      haha I thought I had finally made one too many git commits (I'm an over-committer).

  • distartin 4 years ago

    Never really realized that github had many technical incidents lol

  • lukeinator42 4 years ago

    same here, I was having internet issues yesterday, and now that my internet is working github isn't, haha.

  • m3nu 4 years ago

    ditto

avar 4 years ago

I'm finding that pushes do go through eventually. This is probably grossly irresponsible, so I don't recommend using it, but I remembered I had this old alias to "push harder" in my ~/.gitconfig:

    [alias]
    thrust = "!f() { until git push \"$@\"; do sleep 0.5; done; }; f"

I've done a few pushes so far, and found that it's going through in <10 tries or so.

  • gfunk911 4 years ago

      # Retries a command with backoff.
      #
      # The retry count is given by ATTEMPTS (default 100), the
      # initial backoff timeout is given by TIMEOUT in seconds
      # (default 5.)
      #
      # Successive backoffs increase the timeout by ~33%.
      #
      # Beware of set -e killing your whole script!
      function try_till_success {
        local max_attempts=${ATTEMPTS-100}
        local timeout=${TIMEOUT-5}
        local attempt=0
        local exitCode=0
    
        # Arithmetic comparison: [[ a < b ]] compares strings, so "2" < "100" would be false.
        while (( attempt < max_attempts ))
        do
          "$@"
          exitCode=$?
    
          if [[ $exitCode == 0 ]]
          then
            break
          fi
    
          echo "Failure! Retrying in $timeout.." 1>&2
          sleep $timeout
          attempt=$(( attempt + 1 ))
          timeout=$(( timeout * 40 / 30 ))
        done
    
        if [[ $exitCode != 0 ]]
        then
          echo "You've failed me for the last time! ($@)" 1>&2
        fi
    
        return $exitCode
      }

  • hackandtrip 4 years ago

    Add some kind of exponential backoff to be a good citizen!
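
    A minimal sketch of what that could look like in POSIX sh, assuming a hypothetical helper named retry_with_backoff and made-up BASE_DELAY / MAX_DELAY variables (none of these names come from the thread). It doubles the wait after each failure and caps it, rather than retrying every half second:

    ```shell
    # Sketch: retry a command with capped exponential backoff (POSIX sh).
    # retry_with_backoff, BASE_DELAY and MAX_DELAY are illustrative names only.
    # The body runs in a subshell "( ... )" so its variables stay local.
    retry_with_backoff() (
      max_attempts=$1
      shift
      delay=${BASE_DELAY:-5}        # first wait, in seconds
      max_delay=${MAX_DELAY:-300}   # never wait longer than this
      attempt=1
      while :; do
        if "$@"; then
          exit 0
        else
          status=$?                 # exit status of the failed command
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
          echo "Giving up after $attempt attempts: $*" >&2
          exit "$status"
        fi
        echo "Attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))        # double the wait each time
        if [ "$delay" -gt "$max_delay" ]; then
          delay=$max_delay
        fi
      done
    )
    ```

    Usage would be something like `retry_with_backoff 8 git push origin main`; the attempt count and delays are arbitrary starting points, not recommendations from GitHub.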

  • totony 4 years ago

    >Service degradation

    >Time for some manual DoS

  • doersino 4 years ago

    TIL about "until" loops! How neat.

  • svnpenn 4 years ago

    half a second? Jesus dude calm down.

  • mkoubaa 4 years ago

    The delay makes me think you should use the German word for thrust

  • 5e92cb50239222b 4 years ago

    It's fine. Maybe it will force them to finally start paying attention to the quality of their work. If crap I'm writing for a living was misbehaving that frequently, I'd be sweeping the streets by now (or doing some other work that's actually useful to society).

    • ctvo 4 years ago

      It's OK to be frustrated since we rely on GitHub so much, but this is unkind. Software is complex. GitHub operates at a scale few of us work at. There are people at the other end doing their best traversing complex internal systems (organization and tech).

      I would argue GitHub has done more for societal good than most tech ventures, by the way.

      • 5e92cb50239222b 4 years ago

        I was pretty pissed off, alright, so my comment probably gave out wrong vibes. I'm not arguing I could do any better (I probably wouldn't get past their interview process), and they certainly do have the talent (which is obvious by their technical blog posts).

        It doesn't change the fact that the company has absolutely crap dev culture which seems to put features first and foremost, at the expense of everything else. There are products with even more complexity that don't fall over and die almost every single day. It's just not funny anymore. Facebook is pretty complex, it had major issues like this one, what, once in its entire life?

        I don't remember Google Search (or other Google products) ever not answering my queries, and I've been using it for about 18 years.

        And so on. I reckon it's because those companies have strong engineering culture (Google certainly does, at least), and this one doesn't.

      • deckard1 4 years ago

        GitHub actions has been like this for years now. Years. Years!!!!

        And the crazy thing is you see people on HN demanding that some one person side project/SaaS has to be at 100% uptime with multiple failovers, automatic scaling, etc. etc. There is such an emphasis on scalability on HN and yet... you just brush that all away because "software is tough." Yeah, no shit. Poor Github. They are also Microsoft now. One of the wealthiest corporations in the entire world. And people are paying Github. This isn't Twitter fail whale we're talking about.

        • ctvo 4 years ago

          > And the crazy thing is you see people on HN demanding that some one person side project/SaaS has to be at 100% uptime with multiple failovers, automatic scaling, etc. etc. There is such an emphasis on scalability on HN and yet... you just brush that all away because "software is tough."

          I'm not one of those people. I may have been when I was much more inexperienced.

          Software is hard. Full stop. Organizational politics, engineering culture, business / tech alignment are all hard. Distributed systems are hard.

          > Yeah, no shit. Poor Github. They are also Microsoft now. One of the wealthiest corporations in the entire world. And people are paying Github. This isn't Twitter fail whale we're talking about.

          I may have also thought this when I was much more inexperienced. This isn't a resource problem. Even a small startup, when they start having failures due to scale from growth, it's not a money problem. Throwing money at this doesn't make it go away.

          By the way, the Twitter fail whale impacted paying customers (advertisers).

        • sascha_sl 4 years ago

          That’s because GitHub Actions is Azure DevOps, or if you want to go back further, Team Foundation Server Pipelines.

      • BukhariH 4 years ago

        People tend not to be very kind when any product they pay for goes down.

        At the end of the day - our companies also have people that rely on our software working in order to do a lot of societal good.

        • ilkkal 4 years ago

          Sure, but it’s incredibly naive to see gh having problems and go “they must not know what they are doing”

          • hackandtrip 4 years ago

            It is probably caused by postmortem culture not being shared in the community.

            "Having problems" in this world (any kind, not only due to the github scale!) is something that happens - we are not perfect and we work on an incredible amount of layers of complexity.

            It is sufficient to actually touch production code on a daily basis to see that it can happen to the best, with the best observability systems or processes. The key is avoiding blaming, and understanding iteratively how to fix the problems underneath (faster recovery, detection time, and so on).

        • tedunangst 4 years ago

          Everybody should be refunded $0.05 for the unavailability of the service they paid for.

    • TechBro8615 4 years ago

      You should probably look for a new job then, because it's pretty difficult to get fired for underperformance as a software engineer these days. There are plenty of places you can write shit code, or if you prefer Rust, places where you can blog about other people writing shit code.

      Anyway, you shouldn't fire someone for causing bugs in production since it indicates a systemic failure of all the checks that should come before the bug is deployed. Even if you can trace the root cause to one person, it would be counterproductive to fire them, because now they've made the mistake they probably won't make it again. Whereas their replacement doesn't have the same wisdom.

everfrustrated 4 years ago

Does anybody else remember when GitHub's outage page used to have little graphs showing downtime?

Eventually they took it down as their outages were just too often.

GitHub has _always_ had terrible uptime. It's a great product - wish something would change but it seems cultural at this point.

  • 15characterslon 4 years ago

    They had massive problems with their main database cluster (MySQL). If you read through their engineering blog, most of the outages were related to their growth and the main database cluster. They moved workloads for some features to different clusters, but that's only to buy more time. Eventually they'll do proper sharding (by user or org I guess, not by feature) but that takes time.

    Their engineering blog is full of articles about MySQL and the main "mysql1" database cluster, e.g. https://github.blog/2021-09-27-partitioning-githubs-relation...

    • throwusawayus 4 years ago

      i've noticed this too .. the real head-scratcher is how a solid chunk of github's db & infra folks left to join a database startup, one of them even becoming its ceo!!

      if they had made github db/infra super-stable before this, it would be a vote of confidence in their new company, but instead imho it is the opposite

      • avar 4 years ago

        DB and infra folks are often tasked with shoveling shit uphill, and aren't in total control over how data or schemas get organized.

        • throwusawayus 4 years ago

          that's fair. i am just raising an eyebrow to github's apparent lack of sharding, as described in their incident reports -- while these engineers all left to join a db company that focuses specifically on sharding -- it seems like an experience mismatch.

          if they were all sharding experts why wasn't github sharded properly. other large mysql shops have solved this, all the way back to the days of yahoo and flickr and livejournal

      • dimitrisnl 4 years ago

        Which one are you referring to?

        • throwusawayus 4 years ago

          maybe i shouldn't have mentioned it, i don't want to name names and have this come off as an off-topic attack subthread about a different company, sorry! it's a db company that has raised a lot of money and is mentioned on hn a lot, there are only a handful of these

        • tedmiston 4 years ago

          my guess is:

              rot13 cynargfpnyr
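
          For anyone unfamiliar with rot13: it shifts each letter 13 places, so applying it twice gives back the original text. A small sketch using tr, where the function name rot13 is just an illustrative choice:

          ```shell
          # rot13 via tr: map A-M <-> N-Z and a-m <-> n-z; other characters pass through.
          rot13() {
            tr 'A-Za-z' 'N-ZA-Mn-za-m'
          }
          ```

          Because the mapping is its own inverse, the same function both encodes and decodes, e.g. `echo 'uryyb' | rot13` prints `hello`.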

  • pythux 4 years ago

    I have no idea if this is remotely close to reality but, what if, their culture of breaking things and bad uptime is what allowed them to move fast and build a great product in the first place?

    • hn_throwaway_99 4 years ago

      GitHub was founded in 2007. They were acquired by MS years ago. They should be well beyond any startup culture of "move fast at the expense of reliability".

      • pythux 4 years ago

        I don't disagree with this, they could/should have transitioned already. But for one, cultures are hard/slow to change. And second, as an example, Facebook had the motto "move fast and break things" until 2014, and by that time they also were beyond the startup phase (*), so this kind of culture is not only for early days.

        (*) They were founded in 2004, that's 10 years in. By that time in 2014 they had 800M+ monthly active users and $12 Billion revenue; and they had this culture internally until this point.

        • hn_throwaway_99 4 years ago

          Facebook is a social media app that hardly anyone (except for advertisers) pays for.

          GitHub is an enterprise product crucial to tons of businesses.

          Cultural comparisons between the two really shouldn't apply.

          • pythux 4 years ago

            Aren't both companies potentially losing money when their products don't work? The fact that it's crucial to businesses seems to be the client perspective, not the company perspective. It could also be seen as critical for some businesses to advertise on Facebook. This could call for a different culture internally but I'm not convinced this is necessarily the case.

intsunny 4 years ago

Whew, outage timestamps in UTC.

Now I won't have to know what time it is in California, and whether California is currently on PST, PDT, PTSD, etc.

  • pdenton 4 years ago

    As someone with diagnosed PTSD, I never thought I'd psychologically level with an entire state ;)

  • omegalulw 4 years ago

    To anyone who is reading this and genuinely wants to know: it's PDT, UTC-7.

candiddevmike 4 years ago

This is causing actions jobs to hang after completing, consuming precious minutes. I don't think I've ever seen a refund when this happens, so I recommend everyone check their jobs and cancel them for now.

deckard1 4 years ago

Two days they have been down now. Github has, by far, the worst uptime of any critical service I've seen going on multiple years now.

jetpackjoe 4 years ago

The github.com homepage, as well as api (via `gh`) are not working for me either.

  • jetpackjoe 4 years ago

    Their status page is reflecting the new outages. Good on GitHub for actually updating that quickly.

    • jhugo 4 years ago

      We've been experiencing problems here in Asia for almost 12 hours now, and it's been "all green" the whole time.

    • cube00 4 years ago

      It's a shame they're not open about the extent. Sign in/out hitting a 500 internal error isn't really "degraded"

  • niel 4 years ago

    > The github.com homepage

    Only while logged in, it seems.

arpinum 4 years ago

These incidents have to hurt Azure's brand value. It's a monster task to run something as big as GitHub, if they ever get it stable it will lend a lot of credibility to Microsoft's cloud skills.

  • ryanbrunner 4 years ago

    There's not really all that much pointing to an infrastructure level failure - it's possible, but it's just as likely it's an application-level failure somewhere in Github's code. The API is returning 500s and not 503s and the failure is relatively quick, so it's not obviously a server outage.

    • kortex 4 years ago

      It's yellow lights across the board, literally nothing is green. That's usually indicative of some sort of software infrastructure level failure or cascade failure, not an application-level failure, which usually manifests as one or two specific services going down (depending on how you define "infrastructure" and "application" - with IAC, arguably the software defined infrastructure _is_ an application). I doubt it's a physical hardware issue. It's rarely hardware (except when your DS catches on fire).

      No red lights, so it's probably not something catastrophic like that facebook DNS SNAFU, but it definitely smells infrastructure- or deployment-scoped. Like either small DNS issue, or some load balancers are sending traffic to servers which cannot handle it programmatically (schema change?) so they are barfing.

      • egberts1 4 years ago

        Only load balancer (as an infrastructure) can hit the lights across the board. Not much else.

        • TimWolla 4 years ago

          Databases, Caches or the authentication service? For me read-only requests are working fine and I've not seen any issues. Submitting new contents (e.g. comments) is where it's failing for me. It might be that their database primary is falling over.

  • zinekeller 4 years ago

    Serious questions:

    1) Is GitHub running under Azure's technology stack?

    2) Is GitHub under Azure's management (in contrast to Visual Studio's team)?

    I'm not sure about two but I'm pretty sure that GitHub doesn't run under Azure at all, considering that GitHub has fully separate networking from MSN's/Azure's (and GitHub's machines do pingback unlike most of Microsoft's machines which don't).

    • Serow225 4 years ago

      The last time I checked, the only meaningful parts of GitHub that ran on Azure was/is Actions. Everything else is AWS.

  • gtirloni 4 years ago

    GitHub is pretty stable. What are you talking about? I doubt most GitHub users know it's on Azure.

  • jaywalk 4 years ago

    I don't consider this a reflection on Azure at all. It's really just a reflection on GitHub under Microsoft's leadership.

    • jamil7 4 years ago

      Eh, I'm no Microsoft fan but it used to have issues before the acquisition too. I can't really remember if it was better or worse.

jakub_g 4 years ago

At least one good thing about GH is that while things break, the status page is updated relatively fast compared to other companies, where all of HN knows about an outage for 1h+ before it's acknowledged.

bloopernova 4 years ago

And of course my developer teammates are still trying to merge PRs.

I don't care that it works "some of the time"! Don't mess with the repos when the repo host is having seemingly random issues.

  • fritzo 4 years ago

    For example: while actions are down, branches can be merged without ci tests passing, even for protected branches. This just happened on one of my repos.

PeterBarrett 4 years ago

One of our systems runs AWS code repository in parallel to Github and builds are triggered from there (but not in us-east-1). Time to migrate the rest of our systems to having that fallback.

lebski88 4 years ago

It's almost the same time as their incident yesterday too. Although today the scope is wider - yesterday it was Webhooks and Actions. Today core git is broken as well as the APIs.

  • pm90 4 years ago

    Yep. I hope they post an aws style postmortem… this is kinda ridiculous (although I do empathize as an ops person). Webhooks breaking broke all of our pr bots bringing development to a standstill yesterday; today everything seems f’d.

WFHRenaissance 4 years ago

Looks like the drinking started early at GitHub... good on them!

timeimp 4 years ago

It’s not DNS

There’s no way it’s DNS

It was DNS

rvz 4 years ago

Here we go again. GitHub going completely down at least once a month as I said. [0] So nothing has changed. That is excluding the smaller intermittent issues. Let's see if anyone implemented a self-hosted backup or failsafe just in case.

Oh dear.

[0] https://news.ycombinator.com/item?id=30149071

  • bastardoperator 4 years ago

    The entire point of git is that it's decentralized, lol. If I've cloned locally like millions of people do daily, I have a backup.

    • rvz 4 years ago

      > The entire point of git is that it's decentralized, lol.

      No-one here is criticizing git itself. That is not the point.

      It is GitHub that is defeating the whole point of it all, once their hosted central server goes down.

      The majority of these projects went all in on GitHub, including using GitHub actions, npm packages, hosting their whole website, etc hence as soon as it goes down, they can't push or update anything; especially if it was very urgent. It has become a giant single point of failure for nearly everything.

      There is a reason why the Linux Kernel, Mozilla, Qt, Chromium, GNOME, ReactOS, etc self-host their own repositories and have fail-safe repositories in case GitHub goes down or becomes unreliable.

      • bastardoperator 4 years ago

        If you're not building some downtime into your model you're not being realistic. It's easy to point fingers but the reality is every product and company will experience unexpected downtime. It's an easy business decision for executives/buyers, pay a team of top engineers to home grow a durable product assuming it can even be done at extreme cost now and later or be okay with a couple of hours of downtime here and there with far less cost.

        Every single project you listed uses Github as a mirror meaning when they go down internally, Github is the backup which from my perspective is a little ironic.

        • rvz 4 years ago

          > Every single project you listed uses Github as a mirror meaning when they go down internally, Github is the backup which from my perspective is a little ironic.

          And? It is a read-only mirror. It just 'pulls' changes from the self-hosted copy. It can't be used for direct development by the maintainers. If the main official repository was on GitHub and that goes down, then everything will be down as well, including issues, pull requests, actions, etc. Then you will be totally reliant on GitHub to 'fix it'.

          There is a reason why those same projects do not use GitHub as their main repository and tell you 'We don't accept issues or patches here'. They have control over their issue trackers, review process and CIs, and their projects won't halt due to GitHub's unpredictable and intermittent issues.

          For those projects, GitHub is only used as a read-only mirror for cloners, and is useless for sending patches or tracking issues, PRs, etc. That is all done on their self-hosted repositories, and it has been like that for years.

          • bastardoperator 4 years ago

            It's a remote origin, once I clone and branch which I can do from a mirror, I can write and commit as much as I want to the repo, where I push the change up to is ultimately my decision assuming I have access. The point stands, these companies use Github to act as a mirror/backup for their project in the event of something like a disaster (e.g. datacenter fire).

            There is no perfect solution and there never will be. Everything has associated cost. You're focused on the distribution of devops tooling, but that is only a fraction of the story. Many large companies have moved to SaaS-based products because they realize doing it themselves comes with significant cost. An hour or two of downtime is cheaper than a datacenter, equipment, bandwidth, licensing, and expertise to manage all of it.

            It's a simple cost-benefit analysis. My advice would be to look at this issue through the lens of a business and not just an engineer. Interestingly enough, you can only point to OSS projects, which rarely pay for tooling anyway.

      • uplebian 4 years ago

        > It is GitHub that is defeating the whole point of it all, once their hosted central server goes down.

        server != service

        assuming its a distributed service vs one server for a multi-billion$ company also group of humans built this service, so its not gonna be perfect :shrug:

        companies that use such tools and in trust all the business process to a provided service and do consider an event like this is a blocker should build in contingency plans or accept that there is no real 5-nines of availability more like 90-98%

        • rvz 4 years ago

          > assuming its a distributed service vs one server for a multi-billion$ company also group of humans built this service, so its not gonna be perfect :shrug:

          Regardless of any of that, it still is proven to be unreliable. It is also not an excuse to go all in and risk being fully dependent on GitHub (and their services) and tolerate such downtimes and run to HN and complain about it each month.

          > companies that use such tools and in trust all the business process to a provided service and do consider an event like this is a blocker should build in contingency plans or accept that there is no real 5-nines of availability more like 90-98%

          Then I should see no-one being surprised or complaining about 'GitHub having issues' or 'GitHub is down again' whilst also using it for GitHub actions, pages, issues or pushing their changes and they are not paying for GitHub Enterprise or some higher plan; especially serious open source project like Mozilla, Chromium, etc. That's why they self-host.

          Until the next time GitHub goes down again (hopefully that won't be in another month's time).

          • uplebian 4 years ago

            > Then I should see no-one being surprised or complaining

            Oh agree 100%, this is the equivalent of the "reply-all email threads" and people replying to ask to be removed or for it to stop. I find it entertaining overall.

            > Until the next time GitHub goes down again

            Cheers

    • fritzo 4 years ago

      Good point! This would have been a bigger issue back in the days of cvs and svn.

can16358p 4 years ago

At some point GitHub main page 500'ed for me. The problem is probably somewhere down to the core, not at something isolated.

lambda_dn 4 years ago

This is why you should have your code on multiple remotes, e.g. Azure DevOps, GitLab, or a self-hosted Git server.
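
A hedged sketch of one way to do that with plain git: give a single remote several push URLs so that one `git push` updates every host. The helper name add_push_mirror and the URL in the usage note are placeholders, not anything from the thread.

```shell
# Sketch: add an extra push URL to an existing remote (run inside a git repo).
# add_push_mirror is a hypothetical helper name; $1 = remote, $2 = mirror URL.
# The body runs in a subshell "( ... )" so its variables stay local.
add_push_mirror() (
  remote=$1
  mirror_url=$2
  # Once any pushurl is set, git no longer pushes to the fetch URL,
  # so register the existing URL as a push URL first.
  if ! git config --get-all "remote.$remote.pushurl" >/dev/null; then
    git remote set-url --add --push "$remote" "$(git remote get-url "$remote")"
  fi
  git remote set-url --add --push "$remote" "$mirror_url"
)
```

For example, `add_push_mirror origin git@gitlab.example.com:me/project.git` would make `git push origin` push to both hosts. Fetches still come from the original URL only, and a push to one host can fail while the other succeeds.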

kitten_mittens_ 4 years ago

Can't push changes at the moment.

Xarodon 4 years ago

Pushing to repos is also not working

cedric 4 years ago

I downloaded a GitHub repo from Software Heritage [0]. I searched and found the repo was in the archive. Software Heritage saved my day.

[0]: http://archive.softwareheritage.org/

bombcar 4 years ago

It's intermittent, I was able to get a push through eventually, and am now hung trying to convert a draft PR to ready for review. It took many tries to get to draft.

I'm probably not helping by repeatedly trying, but I don't want to forget this PR.

Yay it finally went through.

  • Saig6 4 years ago

    I'm able to occasionally push commits, but PRs aren't picking up the update or rerunning CI

Sydneyco 4 years ago

Why is GitHub having so many issues recently? do you think it's due to the recent events?

RapperWhoMadeIt 4 years ago

Do they regularly publish post-mortems after their repeated incidents? Might be interesting...

jhugo 4 years ago

In Asia I've been having problems for almost 12 hours now (both locally and from our CI/CD which is in a different country). Also had similar problems on Tuesday.

i_like_waiting 4 years ago

Wow, suddenly staying on-prem with old rusty Jenkins is not so bad. (It has its issues, but at least I had better service levels in last 12 months)

soraminazuki 4 years ago

Ah, so this is the reason for the mystery failure I encountered with GitHub Actions. My job just failed without emitting a single error message.

fishywang 4 years ago

They just had a (smaller) outage yesterday. At first I thought yesterday's incident had finally got enough points on HN.

thomassharoon 4 years ago

Pull review comments and approvals as well

alexambarch 4 years ago

I'm unable to even sign out. It gives me a 500 and then drops me right back at the homepage on a refresh.

mml 4 years ago

zenhub appears to be having issues as well (can't load ticket at all) due to their GitHub integrations I assume.

anarsdk 4 years ago

Y’all do know Git is a distributed VCS, right? It’s OK for the remote to be offline.

svnpenn 4 years ago

I can't even comment on issues...

anunay_i 4 years ago

Do they publish postmortems? gist.github.com was down too for some time.
