Oracle Cloud is having a major outage

53 points by resdev 3 years ago · 61 comments · 1 min read

What's worse is that their "real time" status is not showing any issues.

https://ocistatus.oraclecloud.com/#/

I had to confirm the outage using community reports on Downdetector.

https://downdetector.com/status/oracle-cloud/

All of our services, instances and backups for https://searchadsoptimization.com are in Oracle cloud.

This shows a critical issue when relying on a single Cloud provider. It's time to build a cross cloud infrastructure design to handle these issues.

Update: It looks like they have updated their “real time” status page after a good 25 minutes of severe outage. My trust in, and assumptions about, real time status pages have changed completely.

I don't understand the point of real time status pages if they are clearly not real time and not accurate.

My error notifications were blowing up my phone. The first thing I did was check their status page, and since it showed nothing I assumed the issue was within my application, yet I couldn't even access my backend application. Out of desperation, I had to check downdetector to confirm the issue. I have formed new respect for downdetector.

hn_throwaway_99 3 years ago

Wow, there must be literally tens of people who are worried right now!

My shitty, snarky comment aside, I am genuinely curious about why someone would choose Oracle as a cloud provider. If you look at their capex spend, it's undeniable they have so vastly underinvested in their cloud compared to AWS, Azure and GCP, that even if you were an "Oracle shop" I'm genuinely curious what benefits their cloud would offer.

Edit: Just want to say I really do appreciate the responses, lots of good info! I didn't know Oracle cloud offered a decent free tier, will take a look.

NikolaNovak 3 years ago

I think that's a perspective of people who compare Oracle Cloud IaaS offering to AWS, and find it deficient.
But I don't think that's the actual Oracle Cloud play, for the most part. To the best of my awareness, their cloud is realistically focused on hosted applications and SaaS - HCM Cloud, PeopleSoft Cloud, etc.
As such, their customers are not so much folks and small companies whose client-managed VMs may go down. Their customers are more likely to be large corporations whose Enterprise Resource Planning applications are fully hosted and may be impacted - Financials Management, Human Capital Management, Customer Relationships Management, etc.
I think for the most part Oracle Cloud does not have the same sales pitch and does not compete, in any practical sense, with AWS/Azure/GCP IaaS.
I could be wrong! There might be tons of clients who are renting bare VMs for Oracle! But to your point, I don't know why :P
SadTrombone 3 years ago

I'm personally using it because they give you multiple VMs for free. 2x x86 VMs with 1GB of RAM, and 4 ARM (Ampere) VMs with 4 cores + 24GB to split up 4 ways as you see fit. All of that with a combined 200gb of block storage + 10tb of data transfer.
I'm no fan of Oracle, but that's a good amount of free stuff for my hobby projects.
- hdjjhhvvhga 3 years ago
  
  Unfortunately these can disappear without warning. With AWS this basically never happens.
  - mbajkowski 3 years ago
    
    Can you elaborate on this? These are not preemptible, why would they disappear without warning? I've been running on these for 2 or 3 years without any issues thus far.
    
    BackBlast 3 years ago
    
    When your account changes from the time based free tier to the only forever free tier they often shut things down. Even if you're careful not to use the time based free tier stuff, which I was, it went pop anyway.
    If you're below their "idle system" threshold, they'll shut it down.
    Get past the 3 month or whatever it is tier. Put a low level load on the system to get it above the idle threshold and you should be good to have a nice personal playground. Probably.
    I've had systems go down for both of these reasons. They have also rejected every credit card I tried to throw at them, which is a known problem for them as well as some other cloud providers. I no longer put anything critical on OCI.
    
    devrand 3 years ago
    
    They might be referring to their idle reclamation policy: https://docs.oracle.com/en-us/iaas/Content/FreeTier/freetier...
    
    mbajkowski 3 years ago
    
    Thanks, very useful, I was not aware of this. There really should be a warning before this happens. I really need to test my incremental backups now.
    
    andrewinardeer 3 years ago
    
    I can confirm I was given an email about 4 weeks out before they reclaimed my idle VMs.
procd 3 years ago

It is the only thing I can call a cloud provider in Saudi Arabia vs Alibaba Cloud. You see, not everyone gets all the choices. Why shouldn’t I use something abroad? Laws.
nine_zeros 3 years ago

At my last company, the VP got the largest discount from Oracle, so Oracle cloud it was.
Engineer attrition rose. Services took longer to build and maintain. Company's stock went down. Layoffs ensued.
But the VP got promoted to SVP.
- revelio 3 years ago
  
  Why would services take longer to build? You mean, compared to going all in on every aws service?
  - nine_zeros 3 years ago
    
    Lack of reliability and lack of documentation about corner cases.
topspin 3 years ago

"I didn't know Oracle cloud offered a decent free tier"
It's more than their free tier[1]. There are a number of nice things in OCI. For example: Redundant control plane hosts for their non-"free tier" k8s clusters are free. The equivalent of AWS's cross-AZ traffic is free (as opposed to $0.02/GB at AWS); a huge win for certain use cases. They're using an open, platform agnostic "specification" (the Fn Project[2]) for serverless cloud functions, which is wonderful for local dev and test. Terraform is tier 1 with OCI; from documentation through support Terraform is the reference "infrastructure as code" solution on OCI, always comprehensive and robust. Oracle Linux is pretty good; better than Amazon Linux has been, although with AL2023 Amazon is starting to close the gap. OCI instance shapes are very flexible. Overall costs are lower; Oracle is aggressively competing on price. Instance live migration (à la KVM live migrate) is a thing at OCI, so Oracle can live migrate running instances to isolate failing hardware.
I could go on.
Yes, I'd say OCI isn't as stable as AWS. Anecdotally: I get occasional "event notification" in my inbox; perhaps 4 in roughly 2 years, which is fewer (1) than I've seen from AWS in the same time. All but 1 were tangential, didn't actually impact anything that matters, and were quickly resolved. I actually received the OCI notice in my email today before it popped up on HN, which is "different" than how it usually goes with AWS.
[1] My experience is limited to AWS and OCI.
[2] https://fnproject.io/
p.s. I appreciate that Oracle has earned the hate it receives from most, and I too have been a victim in my prehistoric past. OCI is, however, different; it's a pay-as-you-go platform that provides Oracle with no customer-abuse opportunities given the strength of the competitors, and my experience with it has been entirely cromulent.
remram 3 years ago

They have a good free tier: ARM Ampere instance with 24 GB memory.
- space_ghost 3 years ago
  
  I snagged one of those as soon as they were available. And, so far, I've only used it to host my portfolio, which is entirely static. :D
  - yabones 3 years ago
    
    I abuse mine to run a big-ass Elasticsearch instance with about a year of syslogs/weblogs for my personal machines. Not terribly useful, but a good outlet for my hoarding I suppose.
- natrys 3 years ago
  
  Even in pay as you go they have it much cheaper than in azure or gcp.
chesiffon 3 years ago

I use Oracle for certain workloads because their ARM compute offering is very flexible and cost effective. They let you select CPU count and memory semi-separately, without charging a huge premium for it. You can have 64 CPUs and 64 GB RAM (max CPU, min RAM) or 1 CPU and 64 GB RAM (min CPU, max RAM). I say "semi-separately" because they don't allow less than 1 GB RAM or more than 64 GB per CPU. For the workloads I'm talking about, they need a lot of RAM compared to CPUs. So Oracle's offering is very attractive for it. I've also found that the CPUs significantly outperform even "high performance" dedicated Intel CPUs at other providers for one of my compute heavy workloads (I still don't understand how that could be the case, but I'll take it). It only underperforms my desktop i7-11700k by about 25%.
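The ratio rule described above (no less than 1 GB and no more than 64 GB of RAM per CPU) can be sketched as a small check. This is only an illustration of the rule as stated in the comment: the function name `is_valid_shape` and its parameters are hypothetical, and any other limits Oracle applies (total CPU or RAM caps, per-shape differences) are not modelled.

```python
def is_valid_shape(cpus: int, ram_gb: float,
                   min_gb_per_cpu: float = 1.0,
                   max_gb_per_cpu: float = 64.0) -> bool:
    """Return True if RAM per CPU falls within the stated bounds.

    Hypothetical helper: the 1 GB and 64 GB bounds come from the comment
    above, not from Oracle's documentation.
    """
    if cpus < 1 or ram_gb <= 0:
        return False
    gb_per_cpu = ram_gb / cpus
    return min_gb_per_cpu <= gb_per_cpu <= max_gb_per_cpu


if __name__ == "__main__":
    # The two extremes mentioned in the comment.
    print(is_valid_shape(64, 64))  # 1 GB per CPU
    print(is_valid_shape(1, 64))   # 64 GB per CPU
```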
bastard_op 3 years ago

Anyone stupid enough to enter into Oracle licensing for ERP and financials in the first place simply moves to it eventually as Microsofties inevitably move to Azure. In the name of "better" native integration, ie it sucks less as at least Oracle knows how bad Oracle is to deal with.
Doesn't make it better, just makes for Vaseline to make the shagging less painful.
nray 3 years ago

You'll care about it if you can't pay for your coffee at Starbucks or whatever. I'm not sure if there is overlap between Oracle's public cloud and their own hosted services, but a lot of hospitality runs on Oracle (formerly Micros Fidelio) Simphony, and they are moving their customers from on-prem to Oracle Cloud.
heavyset_go 3 years ago

It's cheaper than AWS and has a good free tier. I've worked with clients who use it because they got a really good deal with Oracle over AWS.
It's not something I'd choose right away, but there are use cases where it can be cheaper/free compared to other options.
lcenjoyer 3 years ago

Oracle's vendor lock-in should be given its own category in the Diagnostic and Statistical Manual of Mental Disorders
cypress66 3 years ago

I use it because it's much cheaper.
Bandwidth alone is an order of magnitude cheaper than AWS.
BryantD 3 years ago

Hosted Oracle database services. Otherwise, very favorable contract terms.
- hn_throwaway_99 3 years ago
  
  Thanks very much, was assuming they must be giving a sweetheart deal given that AWS and GCP have hosted Oracle DB solutions, and even Oracle itself touts running on Azure, https://www.oracle.com/cloud/azure/oracle-database-for-azure....
hamburglar 3 years ago

I just read a claim in the Wall Street Journal the other day that said Elon Musk supposedly just bought up their entire available fleet of GPU instances.
Edit: adding link [1] sorry for the paywall
[1] https://www.wsj.com/articles/the-ai-boom-runs-on-chips-but-i...

gtirloni 3 years ago

Cross cloud is really complex and error prone. You'll probably cause more outages than you'll prevent by going down that path.

Maybe you should consider moving to a major cloud provider that has better services.

BackBlast 3 years ago

It's not that hard. You only need a few rules to make it work.
* Manage your own VPN and don't depend on the vendor's solution.
* Only use base level services that are available on every cloud you want to leverage. Which probably means you're on a container based system and not using anything like Lambda.
Then a service being in another cloud just doesn't matter. You're always making requests via your VPN anyway, if it's in a local cloud or remote cloud it doesn't matter.
You'll feel the egress pricing if you integrate cross cloud services that are chatty. So it really depends on your immediate goals.
If you just want redundancy then you need to keep the resources standing in the backup cloud/AZ and just move your entry point via DNS, in which case you don't really need "cross cloud", just the ability to provision elsewhere.
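The "move your entry point" decision in the last paragraph could look roughly like the sketch below. It only decides which endpoint should receive traffic; the DNS update itself is provider-specific and is left out. The names `http_probe` and `choose_entry_point`, the retry count, and the idea of a health URL are all assumptions for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP errors and malformed URLs
        # are all treated as "not healthy".
        return False


def choose_entry_point(primary: str, backup: str,
                       probe: Callable[[str], bool] = http_probe,
                       attempts: int = 3) -> str:
    """Prefer the primary; fall back only if every probe attempt fails."""
    for _ in range(attempts):
        if probe(primary):
            return primary
    return backup
```

Retrying a few times before failing over is one way to avoid flapping on a single dropped request; whether the switch should then happen automatically or by hand is a separate decision.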
AnthonyMouse 3 years ago

It's not really that bad?
If you're operating at large scale you should have enough control over your own infrastructure to distribute load to multiple providers. Then if one of them is down, spin up more instances on another one.
If you're operating at small scale, store your backups on another provider, and periodically test that you can quickly restore them to that provider.
This isn't just about redundancy. Doing this is necessary to keep you from getting locked in.
- gtirloni 3 years ago
  
  > If you're operating at small scale, store your backups on another provider, and periodically test that you can quickly restore them to that provider.
  Sure, but that's not what people mean when they say cross cloud. It usually implies running active workloads.
  Orchestrating hundreds of workloads that depend on one another across several clouds, and having them reorganize upon a failure, introduces a lot of new failure modes.
  - AnthonyMouse 3 years ago
    
    > Sure, but that's not what people mean when they say cross cloud. It usually implies running active workloads.
    For workloads large enough to justify load balancing, that isn't that hard. Things should only rarely care which "cloud" they're running on. The amount of the load assigned to each one is a knob you can turn.
    For workloads smaller than that, the primary way to achieve redundancy is failover. Whether it's worth the cost and complexity of making that happen automatically instead of manually doesn't have a universal answer.
    But "just use a more reliable provider" doesn't work if being down matters. AWS, Azure and Google Cloud have all had major outages. More than that, sometimes a piece of equipment fails in a way that takes adjacent equipment with it, or there is a fire or a burst pipe. They call it "cloud" but somewhere there is physical hardware under your bits and it can fail. Each failure may only affect a limited number of customers, but those customers can include you. If your systems can't be down, you need a plan in place to have them running again somewhere else in short order. And putting the system you use for this on a different provider can save you from a major provider outage.
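The "knob you can turn" idea from the comment above can be sketched as weighted selection between providers, with a failed provider's weight set to zero. This is a toy illustration, not a description of any real load balancer: the provider names and the functions `pick_provider` and `drain` are hypothetical.

```python
import random
from typing import Dict, Optional


def pick_provider(weights: Dict[str, float],
                  rng: Optional[random.Random] = None) -> str:
    """Pick a provider at random, in proportion to its weight."""
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise RuntimeError("no provider has a positive weight")
    chooser = rng if rng is not None else random
    names = list(live)
    return chooser.choices(names, weights=[live[n] for n in names], k=1)[0]


def drain(weights: Dict[str, float], failed: str) -> Dict[str, float]:
    """Return a copy of the weights with the failed provider set to zero."""
    return {name: (0.0 if name == failed else w) for name, w in weights.items()}


if __name__ == "__main__":
    weights = {"cloud_a": 0.7, "cloud_b": 0.3}   # hypothetical split
    after_outage = drain(weights, "cloud_a")
    print(pick_provider(after_outage))
```

The remaining providers still need enough spare capacity to absorb the shifted load, which is the cost side of this approach.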

compumike 3 years ago

> My trust and assumptions with real time status pages changed completely.

FYI this is why we show real-time status on https://heiioncall.com/status including the time of the last inbound check-in or last HTTP probe.
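One way to read "including the time of the last inbound check-in" is that a status should never be shown as healthy without also showing how old the evidence is. A minimal sketch of that idea follows; it is not how any particular status page is implemented, and the function name `status_from_checkin` and the two-minute threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone


def status_from_checkin(last_checkin: datetime, now: datetime,
                        max_age: timedelta = timedelta(minutes=2)) -> str:
    """Report "up" only when the last check-in is recent; otherwise "unknown"."""
    age_seconds = int((now - last_checkin).total_seconds())
    if now - last_checkin <= max_age:
        return f"up (last check-in {age_seconds}s ago)"
    # Stale data is reported as unknown rather than as an unchanging green.
    return f"unknown (no check-in for {age_seconds}s)"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(status_from_checkin(now - timedelta(seconds=30), now))
    print(status_from_checkin(now - timedelta(minutes=25), now))
```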

internetter 3 years ago

This is a beautiful product with excellent pricing and the design is lovely. Great work.

jareds 3 years ago

Not all applications need high availability across multiple clouds and the cost increases that go with that. Some applications can afford a couple hours of downtime if the underlying hosting platform has issues instead of needing to do a hot failover to completely different infrastructure.

cheeze 3 years ago

The idea that "we'll just seamlessly failover to another provider" is a bit optimistic IMO. With that comes additional complexity. Some applications need this, but it's a huge cost and complexity tradeoff for almost all businesses.
I'm a fan of sticking with one provider, but going with something bigger that has a good track record. AWS, GCP, and Azure aren't immune to outages, but I think for almost all companies, having redundant stacks in separate regions is enough to maintain high availability.
I don't know enough about Oracle Cloud to comment on them, but my general take is these companies all inevitably hit a "showstopper" global outage, realize they aren't investing enough in separation of regional stacks, and put a ton of energy into making their platforms more fault tolerant.
Thinking that Johnny dev shop is going to be able to do better than a major player is, IMO, wishful thinking.
I know that at GCP at least, they actually have monitoring set up for things like tweets, downdetector, etc. Ideally they catch every issue with their own monitoring, but they do their best to know if anyone is having an issue, whether they can detect it or not.
resdevOP 3 years ago

I agree, I should've been more clear; I was referring to the database. The main issue is that if the database, storage and backups are all located with a single cloud provider, that provider becomes a single point of failure.

qwertyuiop_ 3 years ago

Oracle Cloud Infrastructure Customer,

We've identified a cooling system issue affecting multiple services in the US East (Ashburn) region. Our engineers are actively working to mitigate the issue.

sicklife 3 years ago

......

bastard_op 3 years ago

The reason for outage report should be interesting.

My cousin mentioned their erp was down mid-day, and I laughed citing HN like "oh yeah, forgot you're a poor bastard oracle user." It was entirely dead, like everything apparently, most of the day. Sadly the financial people don't care, they will still cut a check to daddy Ellison monthly.

At least one large California municipality I worked with made a multi-year concerted effort to abandon the misery that is oracle erp. That said, never heard how that venture panned out with the replacement. Something about a frying pan to the fire comes to mind.

hu3 3 years ago

I have a VM there but it wasn't affected. Got this e-mail from Oracle 50 minutes ago:

> Oracle Cloud Infrastructure Customer,

> Engineers and the colocation partner have successfully installed additional cooling systems to reduce ambient temperatures and mitigate the issue affecting multiple Oracle Cloud Infrastructure (OCI) services in the US East (Ashburn) region. We will continue to closely monitor this situation.

internetter 3 years ago

Well, if the status page says they are up do the SLAs really apply? (half kidding, this seems like an oracle thing to do)

resdevOP 3 years ago

I don't understand the point of real time status pages if they are clearly not real time and not accurate.
My error notifications were blowing up my phone. The first thing I did was check their status page, and since it showed nothing I assumed the issue was within my application, yet I couldn't even access my backend application. Out of desperation, I had to check downdetector to confirm the issue. I have formed new respect for downdetector.
- gtirloni 3 years ago
  
  1. It only goes to red after a set of humans determine it's really high impact and should be made public. Minor or localized outages rarely qualify.
  2. Previous point is ignored very often and outage is only made public when major clients or news organizations take notice and inquire.
- Spivak 3 years ago
  
  I like AWS's approach of having your own personal incidents page. Still not exactly real time but better than an unchanging wall of green. And they include performance degradations as incidents which is nice.
ben7799 3 years ago

As a former employee this REALLY seems like something Oracle would actually do, along with instructing employees to tell customers it wasn't actually down.
It was before Oracle cloud but I literally was told to do things like that.
firstSpeaker 3 years ago

They need SVP+ level approval to mark disruption/unavailability.
- stusmall 3 years ago
  
  I love this because I honestly can't tell if it's a joke. It's obviously a terrible idea.... but also seems like something Oracle would do.
  - tatersolid 3 years ago
    
    AWS does the same; it stays green unless the outage is so bad an exec has to approve changing the status page.
    Surprisingly azure is very open with outages of all services big and small in my experience, and notifies if any service our tenant is using is impacted.

s-xyz 3 years ago

Very honest question, who would use Oracle Cloud in 2023?

hamburglar 3 years ago

https://downdetector.com/status/tiktok/

berbec 3 years ago

It's now showing errors in US East (Ashburn)

bastard_op 3 years ago

So did any actual RFO come out about this yet? Inquiring minds want to know, also point and laugh.

phendrenad2 3 years ago

Maybe Larry plugged in his solar-powered yacht backwards and took out the local grid.

not_enoch_wise 3 years ago

If you can’t trust Oracle, who can you trust?

hn_throwaway_99 3 years ago

Even people/companies who use and depend on Oracle, I wouldn't quite say their relationship is one of "trust".
- dralley 3 years ago
  
  I'm pretty sure that's the joke.
