GitHub Outage

140 points by mre 3 years ago · 86 comments


Just noticed issues with GitHub handling requests and overall flakiness. I'm getting a lot of status code 500 errors and decided to open this thread for status updates.

omn1 3 years ago

Even though the status page (https://www.githubstatus.com/) shows no issues, I'm still getting the occasional 500. It seems to be happening quite irregularly. They are possibly facing a lot of load.

  • adamscybot 3 years ago

    In my experience it's often the case that the status site does not reflect reality -- until someone intervenes.

    • naikrovek 3 years ago

      that's the whole point of a manually updated status page. you don't want automation to update it because that automation can fail. automation likely caused the outage you want to know more about.

      you also don't want your automation guessing at what the problem is, or what the effects are. you want real info from a real person even if it isn't given to you the millisecond you look for it.

      this is why status pages aren't updated by automation. if they're updated by a person, you know that people know about the problem, you know that people are working on the problem, and so on, which is good, but while they figure out what's going on, you see a "green" status page.

      this is normal.

      (this is for future readers, more than the person I am replying to.)

      • mejutoco 3 years ago

        IMO the reason status pages are not updated automatically is legal. SLA and other legal contracts might change if every time something is down the status page reflects that accurately, so people try to hide it.

        Approached in that way, a status page is almost useless, since it is not reliable and is only updated after I have already found out via other sources.

        I am perfectly happy with a status page that shows the, mm, status of the service. Could be as easy as not reachable, slower than usual or any generic information (a traffic light). I disagree that a status page has to show the why of the error, although of course it would be nice.

        • naikrovek 3 years ago

          > IMO the reason status pages are not updated automatically is legal. SLA and other legal contracts might change if every time something is down the status page reflects that accurately, so people try to hide it.

          you are right about legal reasons; some companies count SLA by the time and date stamps on the status page.

          people hiding a real outage when users know damned well there is an outage is thankfully not common at all.

          if you can design and run a 100% reliable status page which never reports incorrect information, while also reporting useful information, you will be a hero to many.

          • mejutoco 3 years ago

            > people hiding a real outage when users know damned well there is an outage is thankfully not common at all.

            Thankfully people are not hiding it as in a conspiracy to pretend nothing is wrong. But, as you see in many comments in this thread, status pages rarely reflect that something is down immediately (because they are updated manually by humans).

            This delay, codified in processes, is very convenient, and to me this is purposely hiding that a service is down. People are not hiding it, but the processes that control the status page are, indeed, hiding this information. This makes status pages less useful, IMO.

      • judge2020 3 years ago

        Yet Discord has an “API Response Time” graph[0] and Reddit has a “5xx error rate” graph[1]. No, it doesn’t automatically create incidents, but it’s nice to confirm an issue is happening site-wide after experiencing it.

        Actually, it looks like the metrics part of Reddit's status page broke over 2 weeks ago.

        0: https://discordstatus.com/

        1: https://www.redditstatus.com/

        • grepfru_it 3 years ago

          GitHub has such a graph too, but limited to internal employees only. In fact that’s how they detect failures with their deployments, elevated 500 statuses. Some individual teams will have their own dashboard, but some do not (like k8s upgrades) and they only monitor 500s.

          Rest assured someone is looking into this problem right now

        • naikrovek 3 years ago

          > Yet Discord has an “API Response Time” graph[0] and Reddit has a “5xx error rate” graph[1].

          that's awesome. doesn't exist for github, yet. would be nice if it did.

      • tpxl 3 years ago

        I'm going to disagree here. The point of a manually updated status page is appearances.

        With proper reporting it's trivial to know which subsystem is experiencing problems, if any. It doesn't have to be very granular, just "normal", "experiencing issues", "offline". If reporting doesn't work, you should be alerted it doesn't work, and if alerting doesn't work, there needs to either be out-of-band alerting for that or someone monitoring the status at all times.

        Manual overrides for status pages should exist for when the automation doesn't work of course.

        At my last job we had a big screen in the office we monitored (Grafana) and we usually saw problems before the alerting kicked in - it had about a minute delay. When not in-office/during work hours, the on-call received alerts. It wasn't technically nor organisationally complex.
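
        The three-state reporting described above can be sketched in a few lines of shell. This is a hypothetical example, not anyone's real monitoring code; the function name and thresholds are made up for illustration:

        ```shell
        # Hypothetical probe logic: map an HTTP status code and a response
        # time (in seconds) onto the three states described above.
        classify() {
          code=$1; secs=$2
          if [ "$code" -ge 500 ]; then
            echo "offline"
          elif [ "$code" -ge 400 ] || [ "${secs%%.*}" -ge 2 ]; then
            echo "experiencing issues"
          else
            echo "normal"
          fi
        }

        classify 200 0.3   # -> normal
        classify 200 4.8   # -> experiencing issues
        classify 502 0.1   # -> offline
        ```

        In practice you would feed it the output of a probe running outside the monitored infrastructure, e.g. `curl -s -o /dev/null -w '%{http_code} %{time_total}' <url>`.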

      • smt88 3 years ago

        This is so wrong that it makes me wonder if it's satire.

        "The whole point" (as you put it) of status pages was to publish high-level monitoring data to users. The monitoring process should occur outside the system that is being monitored, perhaps even on a different cloud.

        Eventually, many companies realized this revealed expensive SLA violations and ended that level of transparency.

        Your status page can and should report important metrics to users, like elevated error rates. Most status pages used to.

        • naikrovek 3 years ago

          not satire.

          no company will put any amount of monitoring online for anyone to see, no matter how high level. for it to be useful info, it must contain details, and information about infrastructure is usually well guarded for very good reasons.

          • smt88 3 years ago

            > no company will put any amount of monitoring online for anyone to see, no matter how high level

            Many companies used to do this. I remember the first time someone on HN commented, "Hey, is it possible this status page is just a useless blog now?" And people were trying to figure it out.

            Companies arguably have a contractual obligation to be transparent about this data with their customers anyway, so a company like Github (where such a huge percentage of the industry is a customer) is going to leak the data one way or another.

    • MattIPv4 3 years ago

      Exactly -- status pages tend to be updated by the humans responding to the incident; they're not automatic (that'd be pretty useless: you already know it's down, you want to know when they know it's down). Coordinating what to put on the status page when an incident happens can take time: getting the correct scope of impact from responding engineers, and so on.

      • pbhjpbhj 3 years ago

        Sorry, I'm not following you: how do I know it's down when the status page says it's all working? At that point you assume it's not down and start checking your own systems. They're just lying to avoid fallout; it's no better than an automated page.

      • smt88 3 years ago

        "Humans responding to the incident" is what Twitter and email communications are for. Status pages are supposed to be realtime status, and they should show downtime as soon as users suspect it.

        As a user, you often don't know if the vendor's system is really down or if there's something wrong with your own system.

    • hericium 3 years ago

      Static sites need editors and editors sometimes have to ask permission to post.

      At least that's what AWS Health[1] looks like to me.

      [1] https://health.aws.amazon.com

    • deepstack 3 years ago

      From the message I'm getting, it seems like the load balancer is not able to spin up servers to handle new connections. Again, the status page needs to reflect that, which means the status page must NOT be running on the same infrastructure as the main server group. Stop using the same hosting provider (AWS or whatever fill-in-the-blank) for both the status and production environments.

  • rozenmd 3 years ago

    I monitor GitHub externally here: https://github.onlineornot.com/

    Seems like a huge spike in load.

    • capableweb 3 years ago

      Load as in load average? Or load as in traffic?

      Spikes in request latency can be caused by a bunch of things, including more traffic, but in my experience it's usually a missing optimization for some data structure that starts to hurt after N items, or a new deploy containing code that wasn't as optimal as its author thought. Especially when dealing with distributed systems, where sub-optimal code in one part can cascade performance issues to various parts of the system.

  • bowsamic 3 years ago

    Status pages are absolutely useless. I've never seen them accurately reflect an outage

    • MattIPv4 3 years ago

      You are missing the point of a status page. They're not automatic things that tell you instantly when something is down -- that'd be pretty pointless, you already know it's down. They're updated by the folks responding to the incident, so you know they know there's an issue and that they're looking into it.

      • XCSme 3 years ago

        > that'd be pretty pointless, you already know it's down

        How would I know? What if my website doesn't have any monitoring and I use a payment system, shouldn't I automatically be notified when that payment system is down? What if it's down for a week? I think service-providing companies should always announce outages and even suspected outages.

      • mejutoco 3 years ago

        I agree with GP. If I am trying to, let's say, watch something on Netflix and it is not working, a status page would confirm that Netflix is down in my region, and I would know that there is nothing wrong with my connection, DNS, or any other potential cause.

        For this reason, I believe they would not be pointless if they were simply status pages instead of "incident response pages". My hypothesis for why they are this way is that it is too much transparency for some companies, for PR and legal reasons.

      • bowsamic 3 years ago

        Then it's not an operational status, it's an engineering status. Clearly that is very misleading. I think most people, even devs, think these pages are supposed to reflect the current situation. Btw, the GitHub one still doesn't.

      • rozenmd 3 years ago

        That's the premise of my whole business - there's definitely a market for an automated status page!

        (https://onlineornot.com)

MattIPv4 3 years ago

Been running into many unicorns for the last few minutes, had a moment where it came back but seems to be down again. Even the unicorn image won't load on the unicorn page.

  • omn1 3 years ago

    Which is weird because the unicorn is an inlined image (png), encoded in base64. Seems like they broke it.

qwertox 3 years ago

I haven't pushed to GitHub in over a year. Now I'm setting up a new page on github.io with a new repo and GitHub goes 500 just when I try to push.

Those GitHub badges... they are as ugly as it gets.

  • naikrovek 3 years ago

    well, I for one would like more unrequested critique of artwork on a code sharing website. ಠ_ಠ

  • irajdeep 3 years ago

    > Those GitHub badges... they are as ugly as it gets.

    Bingo. Not everything in this world needs to be gamified.

blueflow 3 years ago

Do we need to create a HN post for every outage? It happens every other week.

  • capableweb 3 years ago

    In the early days of status pages, most of them were automatic one way or another, or engineers quickly threw up "We know of the problem, stay tuned" messages there.

    But soon after, legal/executive teams apparently took ownership of them; status pages no longer automatically show downtime/response times, and notice that things are actually down can take a while.

    So I think it's nice that there is at least one place where I can see if it's a problem on my end, or if it's global. It helps to remove some frustration at least.

  • RcouF1uZ4gsC 3 years ago

    What else are you supposed to do when you can't work because you don't have access to your source code?

    • naikrovek 3 years ago

      you still have access to your source code, you just can't push. or pull, but you can sneakernet around that, or have a second remote set in the repo for just this occasion, so you can collaborate as a stop-gap measure while GH gets fixed up.
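
      The "sneakernet" option maps directly onto `git bundle`, which packs a branch into a single file you can share any way you like. A minimal sketch using hypothetical throwaway paths under /tmp:

      ```shell
      # Hypothetical throwaway repo standing in for your real working copy.
      rm -rf /tmp/bundle-demo /tmp/restored /tmp/offline.bundle
      git -c init.defaultBranch=main init /tmp/bundle-demo
      cd /tmp/bundle-demo
      git -c user.name=dev -c user.email=dev@example.com commit --allow-empty -m "offline work"

      # "Sneakernet": pack the branch into a single file you can pass
      # around by email, USB stick, or chat -- no remote involved.
      git bundle create /tmp/offline.bundle main

      # A colleague verifies the file and pulls straight from it:
      git bundle verify /tmp/offline.bundle
      git -c init.defaultBranch=main init /tmp/restored
      cd /tmp/restored && git pull /tmp/offline.bundle main
      ```

      The second-remote approach is just `git remote add backup <url>` pointing at any other host you control, so pushes keep working while GH is down.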

      • LtWorf 3 years ago

        git works completely fine offline.

        However, I have a feeling that most companies are set up to download 50 MiB of dependencies at every run, so a website being down makes the entire thing not work.

corford 3 years ago

Yep, noticed it with comments on an issue (had timeouts while submitting but it eventually went through).

Now, 30 mins later, I've refreshed the issue and see that my reply and the comment I was replying to (by another user) are both gone. Hopefully it's eventually consistent and these comments will re-appear later.

kid64 3 years ago

It's completely down for me. Status page says "all systems operational".

fredrikaverpil 3 years ago

The service seems very flaky right now. Even the unicorn isn't loading properly.

sivapil 3 years ago

Still getting the same error for the past 10-12 hours. Tried at different times.

{ "code": 500, "message": "internal server error" }

Does anyone have luck? Any workaround to fix it?

jmartens 3 years ago

According to Metrist monitoring (disclosure: I work there), the errors were very rare, and didn't happen enough for us to call the product "down." Looks like around 1% of requests.

WFHRenaissance 3 years ago

I'm unicorning hard rn

bilalq 3 years ago

How reliable are GitHub cron action workflows? I set one up recently to run every 15 minutes, but it seems to actually be running closer to once an hour.
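
For reference, scheduled workflows are declared with POSIX cron syntax, and GitHub's docs note that `schedule` events run on a best-effort basis and can be delayed (or dropped) during periods of high Actions load, which matches the behaviour described. A minimal workflow sketch (the file name and job name are made up):

```yaml
# .github/workflows/cron.yml -- hypothetical example
name: every-15-min
on:
  schedule:
    - cron: '*/15 * * * *'   # POSIX cron syntax, evaluated in UTC
jobs:
  tick:
    runs-on: ubuntu-latest
    steps:
      - run: date -u
```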

view 3 years ago

I'm trying to clone a repo at a whopping 6KB/s from Kenya.

EDIT: Seems to be a routing issue. I've enabled a UK VPN and it's working fine now.

cherryblossom00 3 years ago

‘No server is currently available to service your request. Sorry about that. Please try refreshing and contact us if the problem persists.’

bityard 3 years ago

Interesting how outages like this seem to happen mostly on Monday^w Tuesday mornings.

max23_ 3 years ago

I get an error saying the action can't be performed when trying to star a repo.

azeemh 3 years ago

this is what happens when you sell important community infrastructure to M$FT

  • dave4420 3 years ago

    This predates them getting bought by MS. GitHub was notoriously flaky from the very beginning.

adamscybot 3 years ago

It's back

alexandremonjol 3 years ago

Seems to be back, right?

isusmelj 3 years ago

Same issue. Site is also very non-responsive.

tambourine_man 3 years ago

They’ve been having issues since yesterday.

ohmahjong 3 years ago

Time for some deep thumb-twiddling

JSDevOps 3 years ago

Yeah I noticed it earlier.

planxty 3 years ago

Seeing the same.

maxcan 3 years ago

GitHub outages are the bored engineer's equivalent of getting a surprise snow day when you were in school, full of unbridled joy.

For engaged, happy engineers, it's the equivalent of getting a surprise snow day when you're grown up and have to go dig your car out of the snow, and it's a normal day just with extra steps.

  • ramigb 3 years ago

    I can be both bored and engaged, don't test me buddy /s

  • lysergia 3 years ago

    > equivalent of getting a surprise snow day

    Not if you self-host Git

    • capableweb 3 years ago

      Self-hosting Git is easy: throw up an ssh server and point git at it.

      Self-hosting everything else GitHub does is harder. Which is why they are building out all of those things, they don't want people to move to other places so easily.

      Hopefully these constant outages make more developers pissed off that issues are not stored in git as well, and get them working on tooling to solve this shitty problem once and for all.

      P2P/Local First software for everyone! \o/
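
      The "throw up an ssh server" setup amounts to a bare repository, which is the whole "server". A sketch using hypothetical local paths; over ssh the remote URL would be e.g. `git@myserver:repos/project.git` instead (myserver being a made-up host):

      ```shell
      # A bare repository is all a plain git "server" needs.
      rm -rf /tmp/selfhost.git /tmp/selfhost-work
      git -c init.defaultBranch=main init --bare /tmp/selfhost.git

      # In a working repo, add it as a remote and push as usual.
      git -c init.defaultBranch=main init /tmp/selfhost-work
      cd /tmp/selfhost-work
      git -c user.name=dev -c user.email=dev@example.com commit --allow-empty -m "first commit"
      git remote add selfhosted /tmp/selfhost.git
      git push selfhosted main
      ```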

      • chrisseaton 3 years ago

        > Self-hosting everything else GitHub does is harder.

        You can self-host the whole of GitHub can’t you?

        • capableweb 3 years ago

          As far as I know, you can self-host an enterprise version of GitHub, but it'll still be available from one location (the server you deploy it on). I cannot run it locally and federate with my colleagues' instances, for example, the way I can with Git.

      • lionkor 3 years ago

        gitlab

        • capableweb 3 years ago

          GitLab is not P2P/Local First software like Git itself. It suffers from exactly the same problem as GitHub, reliance on a central server (which is run either by the company making the product, or by your own team).

          What I'm talking about is being able to access everything like issues, wikis, PRs and whatever, even when you're 100% offline.

  • naikrovek 3 years ago

    Not really. I'd rather be getting stuff done. ¯\_(ツ)_/¯

    edit: oopsie I misread.

    • mort96 3 years ago

      Would you characterise yourself as a bored engineer?

      • naikrovek 3 years ago

        today I might characterize myself as someone who made a mistake while reading a comment, and replied to the mistaken understanding instead of the intended one.

wheelerof4te 3 years ago

You can still use git, you just can't push the code.

Not a huge problem, unless it lasts for hours or, gasp, days.
