GitLab is down (gitlab.com)
GitLab team member here. We're aware of the incident and the status page has been updated. We will provide further updates on the status page as they become available.
(Edited now that the status page has been updated).
Awesome, thank you! Godspeed!
Thanks. Seems to be back for me.
It took (by my measure) 13 minutes for a full outage to be represented on the status page.
I was under the impression that GitLab uses GitLab.com for their work. Surely someone would have noticed within seconds that it was down?
Why have the misleading "updated a few seconds ago" text if it doesn't update on complete failure? :)
Your impression is correct. We use GitLab.com and notice these incidents as they happen.
The delay in updating status is a result of our Incident Management process [0]. We have a Communications Manager on Call (CMOC) who leads communication throughout an incident. One of their responsibilities includes updating the status page. The slight delay between noticing the issue and updating the status page is a result of the time it takes for the CMOC to get alerted, assess the situation, and write the communication that is shared on the status page.
I'm not sure how the "updated a few seconds ago" messages are generated but I'll try to find out once the incident has been resolved.
0 - https://about.gitlab.com/handbook/engineering/infrastructure...
Why is "Active Incident" and "System Wide Outage" on the status page with a background color of green? Why not red?
At first glance it looks like everything is operational with no issues.
The color is green because GitLab.com is accessible again.
"Active Incident" remains because our team is still working towards full recovery.
"System Wide Outage" is the description of the incident at its onset.
I noticed that too. Pretty confusing to read "Operational" with a green background and "System Wide Outage" on the left side.
Not a "status page" then, but merely "a page where Communications Manager post messages on after assessing the situation and consulting/getting permission from management"
Why? Because there's a human decision involved?
After you notice, I assume you have to declare an incident, get a call going, assess the extent of the issues, get the needed people involved, and then announce on the status page. 13 minutes isn't amazing, but it also isn't terrible. Perhaps you have better ways of keeping status pages updated much faster without also ramping up false positives.
13 minutes is pretty solid compared to any of the recent AWS outages.
It doesn't matter if each individual detects the outage, because they'll start by blaming the local source and move further up the tree rather than assigning blame to a full system failure right off the bat. 99.9% of the time it's going to be a local failure affecting the individual.
Also, most alerting systems check multiple times before declaring a public outage; often 2 to 3 failures some seconds apart are needed.
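As a rough sketch of that pattern (everything here is illustrative, not GitLab's actual monitoring: the URL, the 30-second interval, and the 3-failure threshold are all assumptions):

    #!/bin/sh
    # Probe the site and require 3 consecutive failed checks before
    # declaring an outage, so a single transient blip doesn't page anyone.
    fails=0
    while true; do
      if curl -sf --max-time 10 https://gitlab.example.com/ >/dev/null; then
        fails=0
      else
        fails=$((fails + 1))
      fi
      if [ "$fails" -ge 3 ]; then
        echo "outage confirmed after $fails consecutive failed probes"  # page the on-call here
        fails=0
      fi
      sleep 30
    done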
To add onto this, my experience is you never want a fully automated status page for another two reasons:
1. External engineers will start to automate recovery/mitigation processes around your status page if it has real time status.
2. You now need to bug test your status page thoroughly because of #1. It basically becomes an actual API.
That sounds like a problem for the external engineers, not for GitLab.
Users make their problems yours.
And the status page is all green (sigh) - https://status.gitlab.com/ - whereas Downdetector definitely shows that there are issues - https://downdetector.com/status/gitlab/
I guess status pages should now have a button to pull in data from the public... a crowd-sourced status page?
GitLab team member here - sorry for the delay, SREs are investigating.
https://status.gitlab.com/ is updated. Edit: https://status.gitlab.com/pages/incident/5b36dc6502d06804c08...
It is updated by now. A few seconds' delay is fine, I think, and if they didn't cache the status page it might go down in a blink now too ;-)
It's there now, added seconds ago.
yes, you're right.
I was just working on gitlab, so I would say the status page reflected the issue about 5 minutes later
mmhhh, make it 10
overleaf.com is also down, https://status.overleaf.com/
Maybe some common servers?
Definitely related, overleaf probably depends on gitlab. The overleaf outage ended right when the gitlab outage did.
It's interesting that different pieces of gitlab.com appear to be running on a hodge-podge of GCP, DO, AWS and AZ... I wonder why that would be the case?
This could make good sense if they want to provide service where the customers use it.
But but CLOUD NATIVE! https://about.gitlab.com/cloud-native/
Maybe they fell for the polynimbus meme.
https://status.gitlab.com/pages/incident/5b36dc6502d06804c08... – January 31, 2022 15:22 UTC – System Wide Outage
Prob just a coincidence, but our Memorystore (hosted redis) instance went down with a "repairing" status around the same time.
For SaaS, it is down. But not if you are self-hosting your own.
Just look at Gnome: [0]. They are doing it right.
Is gitlab.gnome.org/GNOME set to forward to git.gnome.org?
And this is why you self-host on your own instance.
Indeed, I can't remember a single time where a self-hosted server crashed. They run for decades with 0 downtime.
Exactly. That is the whole point. I keep saying that about GitHub, since it goes down once a month. [0][1] GitLab SaaS is the same, but a self-hosted backup is better.
FYI GitHub Enterprise can be self-hosted https://docs.github.com/en/get-started/learning-about-github...
Everyone should have backups and these things aren't infallible.
Gitlab is a perfect example. They had database issues and had to restore from backups already.
Not sure if this is irony (I often don't recognize irony as such).
But I have fat-fingered a lot of self-hosted stuff in my time.
It's full-blown sarcasm. Sorry for omitting the /s.
And I was thinking, AHHH, he has an old IBM mainframe at home ;)
It doesn't, but I can fix it as opposed to waiting for their team to do it.
Also at gitlab.com scale the problems they face are very different from a typical deployment.
It is like maintaining your own car versus using the train.
On average, if you can fix your car (or hire a good mechanic, i.e. consulting), you would probably have a better experience than with public transport breaking down, which you are powerless to do anything about.
I would rather run a business that depends on my car than on the train.
As a customer of Gitlab, I'm satisfied with their uptime and I have no reason to believe that they can't fix these issues in good time.
Yes, I could also fix it if the server were mine, but more than likely I'll be busy doing my actual job (which does not involve fiddling with self-hosted GitLab instances), so I'll take my chances with the GitLab engineering team. They do fix things, and me being busy, asleep, sick, or travelling has no impact on their response. I intend to keep it this way.
Nothing wrong with that outlook; plenty of people do take the train, after all, to extend the analogy further.
However, ridiculing people who want to control their infrastructure better, on the grounds that they lack the expertise or time of the people running the railway, gets old.
I had no intention of ridiculing anybody, there are things that I self-host but I like to pick and choose my battles instead of a blanket "I must have control over everything" approach. My response was specifically to this comment:
> And this is why you self-host on your own instance.
This was the commentary on the outage, and it's just outright wrong. Your self-hosted instances will also experience outages. That's the point I intended to raise.
I meant the OP, who was making fun of self-hosting; I didn't mean to imply you did.
I am not denying that different people have different needs; not everyone wants to or has the time to drive either. Just that self-hosting is also valid, even with uptime or other concerns.
Having said that, achieving better uptime is quite possible when self-hosting, not because we are better but because we have simpler challenges than gitlab.com. Poorer uptime is also quite likely if we don't know what we are doing.
Simply put, managed services don't automatically mean better uptime.
> On average if you can fix your car (or hire a good mechanic i.e. consulting) you would probably have a better experience than public transport breaking down, that you are powerless to do anything about
Spoken as someone who has never taken a train, I suppose? Transit at scale can handle maintenance much better than a single vehicle and/or mechanic, and they do so proactively and on schedules. And when things get really bad (catastrophic failure of some component you can't just "fix" on the spot), public transit will organise a backup (a new train or a bunch of buses) to get you to your destination.
I think perhaps you are interpreting my statement as an indictment of public transit. That is usually an American debate; I am not against public transit, and in most denser countries it is absolutely essential.
I commuted on Mumbai trains for years; the experience is terrible and dangerous, but in most dense cities there is no other cheap, reliable way to get anywhere. Even in richer cities like NY, the system is pretty bad if you care at all about your journey beyond getting from point A to B.
Scaling is hard for public transit, very very hard, and it does not matter how wealthy the city is either. Poor cities don't have the money to expand; rich cities have a ton of legacy infra, politics [1], and other systemic issues. The NY 2nd Avenue line is 100 years in the making, costs $15-20B, and is just 9 miles long. There are some good transit systems, but most of them have a ton of problems.
For the vast majority of people, trains (or managed SaaS) are a good fit for their needs. However, that does not mean they are better on every metric or fit everyone; for some people, control, experience, and other aspects are more important than what managed solutions can offer, and there is nothing wrong with that.
[1] Large-scale SaaS apps unsurprisingly also have similar problems
And even if it goes down, you might have more options to get it working again.
I mean.. you can still use your git repos.
Need to do a launch? Build it and push it.
Need to share a change with someone so they can review? `git diff` and send a patch via email. Want to use a server? Spin up a server, add users and keys, and push to it.
GitLab, GitHub and these hosted solutions haven't always existed. They're convenient, but not an OMGWTF moment... unless of course you don't have backups.
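For instance (a minimal sketch; the host, user, and paths are placeholders, and it assumes you have SSH access to some machine):

    # Share a change for review without a forge:
    git format-patch -1 HEAD          # writes 0001-<subject>.patch
    git send-email 0001-*.patch       # or attach the patch to a plain email

    # Stand up a temporary remote on any box you can SSH into:
    ssh you@example.com 'git init --bare /srv/git/project.git'
    git remote add fallback you@example.com:/srv/git/project.git
    git push fallback main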
They could, if you stuck to the yak shaving full-time.
What? Running your own Gitlab instance is one docker command away. No need to shave any yaks.
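Something along these lines (a sketch of roughly the shape of GitLab's documented Docker invocation, from memory; the hostname, published ports, and /srv/gitlab paths are placeholders you would adjust):

    # Run the Community Edition image, persisting config, logs and data on the host
    docker run --detach \
      --hostname gitlab.example.com \
      --publish 443:443 --publish 80:80 --publish 2222:22 \
      --name gitlab \
      --restart always \
      --volume /srv/gitlab/config:/etc/gitlab \
      --volume /srv/gitlab/logs:/var/log/gitlab \
      --volume /srv/gitlab/data:/var/opt/gitlab \
      gitlab/gitlab-ce:latest

Keeping it patched, backed up, and monitored is still on you, of course.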
How do you get fault-tolerance, monitor and manage your K8s cluster to avoid service disruptions?
Well, my GitLab instance at some point started to have its Prometheus eat 100% CPU all the time until I disabled the Prometheus component altogether, so there’s that. A cursory glance at the tracker just now says the issue is still open. That’s the kind of problems you get for self-hosting, it’s not all rainbows and unicorns.
Hi, Developer Evangelist at GitLab here.
Can you link the issue please? :)
For context, Prometheus and observability will be handled with Opstrace in the future [0]. I'd like to learn about your use-case and see which troubles you have been running into. Thanks!
This one I guess? https://gitlab.com/gitlab-org/gitlab/-/issues/210365
To be clear, I don't care about Prometheus on my instance, I only care when it's causing trouble (e.g. by eating 100% CPU all the time).
Thanks for the URL, and the additional context. Appreciate it. :)
Seeing the same error
yes, while the status page https://status.gitlab.com says everything is fine :(
grrr... I am stuck with my work now... :(
Must be caching issue - it shows "System Wide Outage" for me.