Our Server’s Hard Drive is Dead. We didn’t have a backup.
It's noble of you to come clean and own your mistake, but let me say this over and over:
You should never, ever provide an environment that stores people's hard work without having professionals who know how to safeguard it.
If it makes you feel any better, I recently had to clean up a mess in a huge enterprise IT shop (if I named the organization you would immediately recognize them), involving hundreds of thousands of man-hours of work lost to a lazy, incompetent DBA and the clueless management above her.
This "DBA" was the kind of person who came in at 9:45 AM, took a two-hour lunch at noon, and left at 3:30. Did I mention she refused a work-from-home option?
She didn't know how to set up cron jobs, so all of her backup scripts had to be run manually. If she was on vacation, they didn't get run. Surprise, surprise: the DB died after her long pre-Christmas vacation. Zero backups for the first three weeks of December.
Even "professionals" can be suspect sometimes.
Running cron-job backups and glancing at the output to confirm it looks like a valid backup is not sufficient for any serious website or web service.
Automated backups need automated backup restoration and testing. Otherwise, the backups might not be created properly, or they might be perfect backups with some hidden error that will cause them to fail when they're put to use.
As an example, see Jeremiah Wilton's case study of Amazon's Oracle database outage in 1997: http://www.bluegecko.net/download/disaster-diary.pdf
Other than the one missed backup, the backup procedures were fine. An Oracle bug, tied to a database format/schema change made weeks earlier, caused Oracle to refuse to start. Testing the backups would have caught the error and allowed them to fix it before they took down their production database and triggered the bug on the next attempt to start it up. (A minimal sketch of an automated restore check follows.)
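To make the restore-testing point concrete, here is a minimal sketch of an automated nightly restore check. It assumes a gzipped mysqldump is written to /backups/latest.sql.gz and that a throwaway MySQL instance is available to load it into; the hostnames, table name, and alert addresses are placeholders for illustration, not details from the article.

    #!/usr/bin/env python3
    """Nightly restore test: load the latest dump into a scratch DB and sanity-check it."""
    import smtplib
    import subprocess
    from email.message import EmailMessage

    DUMP = "/backups/latest.sql.gz"          # hypothetical nightly mysqldump output
    SCRATCH_HOST = "restore-test.internal"   # hypothetical throwaway MySQL instance
    SCRATCH_DB = "restore_test"

    def alert(reason: str) -> None:
        """Email the on-call address; silence should never mean 'probably fine'."""
        msg = EmailMessage()
        msg["Subject"] = "Backup restore test FAILED"
        msg["From"] = "backups@example.com"
        msg["To"] = "ops@example.com"
        msg.set_content(reason)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    def main() -> None:
        # Recreate the scratch database, then load last night's dump into it.
        subprocess.run(
            ["mysql", "-h", SCRATCH_HOST, "-e",
             f"DROP DATABASE IF EXISTS {SCRATCH_DB}; CREATE DATABASE {SCRATCH_DB}"],
            check=True)
        restore = subprocess.run(
            f"gunzip -c {DUMP} | mysql -h {SCRATCH_HOST} {SCRATCH_DB}", shell=True)
        if restore.returncode != 0:
            alert(f"Dump {DUMP} failed to restore (exit {restore.returncode}).")
            return
        # The dump restored; now confirm it actually contains data.
        count = subprocess.run(
            ["mysql", "-h", SCRATCH_HOST, "-N", "-e",
             "SELECT COUNT(*) FROM users", SCRATCH_DB],
            capture_output=True, text=True, check=True)
        if int(count.stdout.strip()) == 0:
            alert(f"Dump {DUMP} restored but the users table is empty.")

    if __name__ == "__main__":
        main()

Run it from cron right after the backup job; the value is simply that someone is told the day a dump stops restoring, rather than a year later.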
> You should never, ever provide an environment that stores people's hard work without having professionals who know how to safeguard it.
Even if they do know how to safeguard the data, that doesn't mean that everything else is going to work properly.
I had recently taken over IT after working for six years as a developer. In fact, this happened only a month or so into my new role.
Our mail server died. Three of the four drives in the hardware RAID 10 failed. I'd been seeing bounces to root@localhost from root@localhost in the nightly reports, but the way things were configured made it nearly impossible to figure out where the mails were coming from. Thanks, Zimbra. We suspect these were constant alerts from our RAID card warning us of the impending disaster.
Oh, and the only backups for the mail store were on the machine itself, and in the local Thunderbird installs that half the company used instead of the Zimbra web interface. The machine was in a colo downtown, not local, and running backups over our pathetic little DSL connection was unmanageable.
Both of these things were known problems, both marked high priority, but both months away from being addressed when things went south.
This happened on a Friday. By Monday morning, I'd moved us over to a hosted service and manually sorted all of the mail that hit a catch-all mailbox on a VM I'd set up. By Tuesday, I'd audited every one of our other machines to make sure that mail to root was deliverable (it wasn't, on about a dozen machines) and that every machine with hardware RAID had both local and remote monitoring.
Some people, including directors and C-levels, lost up to ten years of mail. It was the worst IT disaster the company had ever faced. But that's not the worst part. No, the worst part is that we're in the IT industry and knew the entire time that what we were doing was wrong... fixing it had just never been prioritized, because it was never seen as urgent.
That lesson has been learned.
Sounds like they need a technical founder. Stock well spent imho...
"We need to get programming talent on-board." Sounds to me like they still haven't learned their lesson...
I read that more as they don't know the difference between programming and ops.
Glad I wasn't the only one who had that as a first thought. As an ops guy, I wasn't sure whether to be offended or just shake my head at the irony of it.
Well, apparently someone didn't like my comment, since it got down-voted. Whatever. As someone with a background in systems administration, it bothers me that this profession gets left out of the equation far too often.
As someone with a background in programming, it bothers me when your profession gets left out too :-)
Sometimes I think the sysadmin profession needs a messiah like NNT (for finance), with perhaps the outspokenness/attitude of Zed Shaw. That way, there would be a whole lot more awareness of the challenges involved in systems administration.
I dream of that day...
Judging by some of the other comments, I think people believe sysadmins can be, and have been, replaced by Heroku, AWS and such.
I would have to agree. But I would also have to agree that Heroku or AWS may have been a better choice than a 1&1 dedicated server in this particular instance.
However, Heroku and AWS are no substitute for good systems administration. They are good substitutes for data centers given signed and documented SLAs.
It's a recurring theme on HN. I'm a sysadmin and cringe when reading these rather common posts.
Developers are like any other person. "How hard can it be?"
What do you mean?
I mean that Systems Administration is as thoroughly exhausting a career path as programming. They didn't lose information because they coded something wrong or inadequately.
The most basic backup would be a remote copy of your files and databases. That's a simple shell script and cron job combination. I can understand that one may not get around to setting it up (I had a massive project and only started doing this after about seven years of running it...), but it's very important and doesn't need a ton of sysadmin experience or setup time.
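As an illustration of that "shell script and cron job" idea, here is a minimal nightly off-site backup sketch in Python, driving the usual command-line tools. It assumes a MySQL database and an uploads directory; every path, database name, and hostname is a placeholder rather than anything from the article.

    #!/usr/bin/env python3
    """Bare-bones nightly off-site backup: dump the database, tar the app files,
    and rsync both to another machine. Schedule from cron, e.g.:
        30 3 * * * /usr/local/bin/nightly_backup.py
    """
    import datetime
    import subprocess

    STAMP = datetime.date.today().isoformat()
    DB_DUMP = f"/var/backups/app-{STAMP}.sql.gz"
    FILES_TAR = f"/var/backups/files-{STAMP}.tar.gz"
    REMOTE = "backup@offsite.example.com:/srv/backups/"   # hypothetical off-site box

    # 1. Dump the database (credentials live in ~/.my.cnf so they stay off the command line).
    subprocess.run(f"mysqldump --single-transaction appdb | gzip > {DB_DUMP}",
                   shell=True, check=True)

    # 2. Archive user-generated content and server configuration.
    subprocess.run(["tar", "czf", FILES_TAR, "/srv/app/uploads", "/etc/nginx"],
                   check=True)

    # 3. Ship both off the machine. A backup that lives only on the server that
    #    just died is not a backup.
    subprocess.run(["rsync", "-a", DB_DUMP, FILES_TAR, REMOTE], check=True)

It's crude (no rotation, no encryption, no restore verification), but it is infinitely better than nothing, which is the point being made here.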
> a remote backup of your files and databases
Are you guaranteeing that those files are in a good state when you ship them off? (i.e., flushed and synced to disk)
How about the DBs? Are you backing up the *SQL data files directly (and if so, are you guaranteeing that _those_ are in a good state?), or doing an SQL or other export?
Exactly which files do you need to back up -- user-generated content, system configuration, logs, spool files? Do those files give you enough information to rebuild your machine(s) from bare metal? Or have you documented your setup procedures?
Where are you storing these backups? And how many copies of them, and on what frequency, should you be keeping?
Are you testing your backups? Are you testing your recovery procedures? If not, how do you _know_ they're all working?
---
Sorry for the question dump, but there is a fair bit more to even basic backups than just syncing your filesystem to S3 / tarsnap / The Cloud TM every day.
Files are pretty straightforward, especially at the low-volume, can't-yet-afford-a-dedicated-sysadmin startup stage, so I'm not going to address that here.
A simple database backup would be to replicate to a slave (just addressing MySQL here for the simple case): stop replication, back up, restart replication. Easy as that (see the sketch after this comment). More complicated scenarios need more complicated setups, but hey, we're not talking about saving the last 0.0000001 seconds of data here; this is about losing all your data, from day one.
Obviously I cannot comment on the specific backup needs of these guys since I don't know their app. However, I can still ensure that an app running in a shared hosting environment (meaning it's not that complex) has a reasonably good backup (a 24-hour snapshot?) within a few minutes of setup.
I think even a backup from yesterday would have been a decent compromise, for a few minutes or hours of up-front effort.
Not saying here that a sysadmin role isn't justified, just that there are a few steps you can follow so that while you don't have a sysadmin, nothing gruesome happens to you.
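Here is a minimal sketch of that stop/dump/restart approach, assuming a MySQL slave is already replicating and that credentials live in ~/.my.cnf on the machine running the script; the host name and paths are placeholders.

    #!/usr/bin/env python3
    """Back up from a MySQL replica: pause replication, dump, resume."""
    import datetime
    import subprocess

    REPLICA = "db-replica.internal"   # hypothetical slave host
    DUMP = f"/var/backups/appdb-{datetime.date.today().isoformat()}.sql.gz"

    def mysql(statement: str) -> None:
        subprocess.run(["mysql", "-h", REPLICA, "-e", statement], check=True)

    # Pause only the SQL thread so the data stops changing; the IO thread keeps
    # pulling binlogs from the master, so nothing is lost while we dump.
    mysql("STOP SLAVE SQL_THREAD")
    try:
        subprocess.run(f"mysqldump -h {REPLICA} --all-databases | gzip > {DUMP}",
                       shell=True, check=True)
    finally:
        # Always restart replication, even if the dump blew up.
        mysql("START SLAVE SQL_THREAD")

The master never notices any of this, which is exactly why dumping from a replica is the usual low-effort answer for small shops.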
It also isn't a backup unless it's tested.
You're right and the cosmos has just proved it. I just went in to check my backups and found some that hadn't been running since Nov. 19th 2012. Well, lesson learnt :) - gotta test those backups!
I can't recommend https://deadmanssnitch.com/ highly enough. This would have told you your backups weren't working a long time ago!
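For context, the service works as a dead man's switch: a job checks in over HTTP when it succeeds, and you get an email when the check-ins stop. A minimal sketch of wiring that up, with a made-up check-in URL and the hypothetical backup script from the earlier sketch:

    #!/usr/bin/env python3
    """Check in with a dead man's switch only after the backup succeeds,
    so silence means failure."""
    import subprocess
    import urllib.request

    SNITCH_URL = "https://nosnch.it/your-snitch-token"   # placeholder check-in URL

    result = subprocess.run(["/usr/local/bin/nightly_backup.py"])
    if result.returncode == 0:
        # Only check in on success; if this request never happens,
        # the service emails you that the backup job went quiet.
        urllib.request.urlopen(SNITCH_URL, timeout=30)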
Perhaps it's just me, but I don't see the roles of SA and dev as interchangeable. Yet HN seems to present a world where all devs are SAs. As an SA, I find that frustrating.
All of that to say: I agree with what you're saying. I just wonder how to get the message across that the two roles aren't the same...
I think the point is when you're starting off you have to wear many hats, and since HN is a lot about start-ups, you'll see that often here. At a certain size/load I think most people on HN would recognize that you need specific talent sets (and they might not be interchangeable).
They aren't interchangeable, but there is a significant amount of overlap in the position, especially when you begin operating at scale. The amount and type of automation required when you're dealing with thousands of servers demands a breadth of knowledge that spans both domains.
They're synergistic roles, yes, and ideally you would like someone with a mix of both [either a dev with sufficient experience in system administration or a system administrator with sufficient experience in development] to tackle the messes that crop up, but I haven't seen anyone claiming they're interchangeable - just that most of the HN population falls somewhere in the overlap of the two.
This is likely because many of us are part of small startups where the roles are often interchangeable in the early days: the same small engineering team is responsible for "writing code," backups, infrastructure, and everything else technical.
It's very difficult to have your foot in both worlds and remain current in either. I did for a while, but I'm more firmly planted in the dev world now. So I can say, with some confidence, that they aren't very interchangeable.
Ah. Yeah, you have a point, but a good dev would at least suggest doing backups of some sort.
I don't understand: you had a backup HD - that means you had a RAID setup. Why didn't your host replace the damaged hard disk? In my experience hosts usually monitor RAID health on their servers, and if there is a problem they replace the bad hard drives at the quickest opportunity... and I'm talking about budget hosts.
EDIT: Too many to respond to below, so just editing in here. The author mentioned that the primary hard disk had failed over a year ago - but he didn't know about it (the host informed him of this... now?). That points to a RAID setup where the mirror was basically doing the work all this while. That's what I'm talking about in this post.
Repeat after me: RAID is not a backup. It will mitigate certain drive failures, but it is not a backup. Period, end of statement. Controllers will forget their configuration, OSes will eat their partition tables, and any number of other things will ruin your data. If you're not backing up your stuff to a completely separate system, preferably with a completely separate service, you will lose data. Period.
I don't think that's what ashray meant. A backup (as in failover) HD, not an HD that stores backups... Because apparently to add insult to injury, in this case they ignored a failed RAID drive and didn't have backups.
When you buy unmanaged servers, the host isn't monitoring RAID health -- they don't have any remote access to your machine except maybe IPMI for reboots. I've rented servers from various providers for a decade and none has ever monitored my hard drives... plenty have failed, including disks in a RAID and RAID adapters themselves; they get replaced when I call up and tell someone the server won't boot and I need someone to go take a look.
You're right, I seem to have gotten lucky with my unmanaged hosting (3 times over with different hosts). They seem to have some sort of hardware interface to monitor RAID health, of course, this is hardware RAID so maybe that's where the setup differs. I was surprised when I received an email from them one morning about a year ago saying "Hey, one of your RAID drives failed so we replaced it, just FYI".
It's true that a RAID failure may go unnoticed by a sysadmin for a year or more if they don't have proper checks set up for themselves.
I guess the only thing that could've been done in this case was to have a backup cron job or use a provider that takes care of this stuff.
Thank you. This was exactly what I was getting at. 1&1 is on the short list of budget hosts that are notorious for doing the bare minimum to get your money.
"The backup HD" doesn't necessarily equate to "that means you had a RAID setup". How did you arrive at that conclusion? Budget hosts are notorious for having little to no backup solutions in place simply because, well I don't really know why? Costs? Complexity?
The author said that the primary hard disk failed and the host had told him there was a backup hard disk that hummed along for a year.
I cannot imagine any other practical situation in which a 'backup hard disk' would automatically kick in - apart from a RAID setup.
I know that budget hosts do not back up data off-site, but they do tend to maintain their hardware, RAID arrays, etc.
However, I admit that I am all too unfamiliar with shared hosting environments in this day and age; it's either a cloud or a dedicated server for me - and for my budget dedicated servers, hosts have been pretty proactive about replacing bad hard drives.
Repeat after me: RAID is not a backup.
Chances are that most of the information is still physically there; it's just inaccessible. The first thing I would do is physically obtain the drive, so that even if it's inaccessible you have the information in your possession.
From your blog post, I'd assume you don't have the knowledge to attempt recovery yourself, so call in an expert to handle the data recovery for you. At this stage, it is a matter of what the information is worth to you, compared to the cost of recovery. Almost any intervention is possible, for a price.
I feel for you. I really do. You did a lot of things right: learned how to program, bootstrapped your startup, released a product (!), got users, went viral, etc. etc.
But.
1) All of this could have been solved with money, specifically money used to pay professionals. You got 30 THOUSAND signups and you didn't think of trying to get funding? I'm surprised VCs weren't pounding at your door. At the very least, that might even be enough for a bank loan from a savvy lender. Hell, you could probably find a recently graduated ('tis the season) CS student willing to take sweat equity alone with those numbers. This is especially frustrating for me, as I currently have a startup that recently garnered a whopping 400 (count 'em!) hits on its signup page, and yet I still got emails from people trying to invest. Not nearly platinum tier, and thus far none have panned out, but still!!!
2) You claim to have worked in web design/development for a while, and you didn't hear about 1&1's horrific reputation? That's hard for me to believe. In fact, of any community, the PHP/JS crowd is probably most familiar with being burned by 1&1. (Not even going into the slimy overselling).
I hate to say it, but you should have known better. That said, I sincerely wish you the best of luck. You've succeeded pretty spectacularly thus far, and in the big scheme of things this is a pretty minor setback. Just keep shipping and you'll get it eventually.
Edit: I realize that it might seem foolish to some to go after funding when it's not needed, but I would argue that if you are making it up as you go along (not an indictment, it's how we learn) and you get these kind of numbers, you should feel at least a little obligated to your users to secure your product. If that requires money that you don't have, get funding.
Not to be depended on as a substitute for backup, but 'dead' doesn't necessarily mean dead. Forensic recovery (either DIY or professional, depending on the nature of the failure) may still be an option.
Logic board failures are common, and replacements are cheap (the cost of a new HD of the same model); data can be highly recoverable from soft failures. Mechanical failure is the worst case, but as long as the platter(s) are intact, it's not insurmountable.
"Technical support informed me that my first HD died 20 days into my contract. The backup HD hummed along for a year."
That sounds more like RAID than a backup HD.
This doesn't sound like RAID at all. How are you guys coming to this conclusion? At best it sounds like a server with a master/slave hard drive setup...you know, something from the early 2000s.
Two mirrored disks technically form a RAID level that can survive the failure of one disk. I worked at a company in the late 1990s that had mirrored disks; they would "split" the mirror at the end of the business day, back up to tape from one disk, and run nighttime batch jobs on the other disk. Before the start of the next business day they would back up the batch disk to tape, then resync the mirror.
A master/slave hard drive setup would have meant data loss. If the main hard drive had died and it wasn't a RAID, surely they would have noticed.
Mirrored disks are RAID 1: http://en.wikipedia.org/wiki/Standard_RAID_levels
Looks like RAID 1.
You probably expected something like RAID 5; well, you're right, it's not.
Where did it say these were mirrored drives?
"We messed up bad. We launched without having a backup procedure in place, and without the resources to make it happen. This was a hard-learned lesson that won’t happen again. We have no one except ourselves to blame."
You know how you messed up? By not using something like AWS EC2 snapshots, or even S3 or Glacier (a sketch of automating snapshots follows this comment). What is this trend of devs doing operations? As a sysadmin with a comp-sci/dev background, it blows my mind constantly.
Great, you know how to move around the CLI, but are you versed in how to maintain a proper and robust system?
Also, why weren't you using something like SES for your email alerts?
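For illustration, here is a minimal sketch of what "something like EC2 snapshots" looks like when automated with boto3; the region, volume ID, and retention count are placeholders, and credentials are assumed to come from the usual AWS config or an IAM role.

    #!/usr/bin/env python3
    """Nightly EBS snapshot of the data volume, plus pruning of old snapshots."""
    import boto3

    REGION = "us-east-1"
    VOLUME_ID = "vol-0123456789abcdef0"   # hypothetical EBS volume holding the data
    KEEP = 14                             # nightly snapshots to retain

    ec2 = boto3.client("ec2", region_name=REGION)

    # Take tonight's snapshot.
    ec2.create_snapshot(VolumeId=VOLUME_ID, Description="nightly backup")

    # Prune: keep only the newest KEEP snapshots of this volume.
    snaps = ec2.describe_snapshots(
        Filters=[{"Name": "volume-id", "Values": [VOLUME_ID]}],
        OwnerIds=["self"],
    )["Snapshots"]
    snaps.sort(key=lambda s: s["StartTime"], reverse=True)
    for old in snaps[KEEP:]:
        ec2.delete_snapshot(SnapshotId=old["SnapshotId"])

Snapshots of a running database volume are crash-consistent at best, so you'd still want the restore testing discussed elsewhere in this thread, but it removes the single-disk single point of failure.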
Long story short, money was tight, priorities were set incorrectly, and they got fucked.
Sounds about right. My favourite part was leasing a server from 1&1. Even a little industry knowledge with regards to infrastructure would have caused someone to avoid them.
a. My business has a partnership with a good data recovery outfit. We might be able to get you a good deal on a data recovery if you want to try going that route.
b. It takes a particular kind of personality to be good at sysadmin work. (And a lot of trial-and-error -- I just recently had to do an emergency server build due to a Debian update whoops, and I've been doing this stuff for a while.)
c. I usually recommend BackupPC (http://backuppc.sourceforge.net/) for easy set-it-and-forget-it backup infrastructure. It's compatible with everything, it will notify you if there are problems, it does pooling and de-duplication and compression, it's fast and reliable, and you can usually store months of backups on a small offsite server. I store 12 months of all hosted and customer data with it, and we've used it to meet other clients' needs too.
d. If you need affordable help, let me know. I'm way too cheap, and I do this stuff all day, every day. I opened a business specifically to address problems like this: you need something, but money is a problem.
That goes for anybody else too. If your lack of backups is keeping you awake at night, or if you've suddenly outgrown your infrastructure, or if looking at config files gives you an ulcer, get in touch with me. I'll help you out.
How many people did it affect? I would be too ashamed to admit it if I were a company offering services to programmers but didn't back up my server.
I see this far too often... and have read about it more than once on HN in recent memory.
Hire a proper system administration company early to work with you on these types of things. There are many companies out there that do this. I happen to run a company that does this, so I know that you can add an expert admin to your team for $100-200/mo.
That is actually surprisingly cheap. Mind if I ask what types of services one would get at those rates?
You're absolutely right though, for a company like OP's, if they are so short on cash, it makes a lot of sense to get someone in even if just for the week to address these types of fundamental problems.
For a monthly service, you generally receive an initial:
- System architecture review
- Backup strategy / DR review
- Security scan, and detailed review
- System monitoring design, and implementation
and on-going:
- 24x7 monitoring, and response to outages
- Server patch management
- Ad-hoc system admin time available to be used on-demand
Many more details, and capabilities, but you get the idea ;)
> you can add an expert admin to your team for $100-200/mo
Hahaha.
It's what we do -- so I laugh with you ;)
Expert admins are in the $100-150/hr range when they're consulting, and $100K+ salary range when hired. $100-200/month? Oh lord.
Agreed with the consulting price point, and we charge this. But we also let you get in the door for a few hours a month at a reasonable rate. It's been working for many years now.
You're also based in Michigan, a little south of Detroit, so I'm sure you can get away with a lower base rate given the cost of living there.
Sorta, except we generally hire where the talent is available -- Anywhere USA.
Good luck with your below-the-minimum-wage admin.
I can understand why you may think that, given those numbers. These numbers are generally per server and fit a small shop pretty well. We also work with many companies that have hundreds of servers, where the rate is multiplied by the number of servers. I could have been clearer in my initial post.
Additionally, for clarification about your wage comment -- we pay above average US sysadmin wages, and just posted a job for our 7th full-time admin.
> These numbers are generally per server
Well, that's a quite important detail you had omitted. If you start dividing salaries by the number of servers, all kinds of crazy numbers come up.
This site is an excellent way to find out if you've covered all your bases in your backup protocol:
Note: the website is an extended advertisement for a piece of backup software, and the user account was created 3 minutes before the comment was posted.
Yeah, that's true. But then do you really think those guys are spamming HN at 8 p.m.?
I wish I had said: if you ignore the advertising, it's a great resource. If you apply it to proposed backup solutions, it's an effective way to find out whether they are viable.
I said it for you, since a reader would probably be interested in that piece of information. I apologise for the cynical comment about the account being 3 minutes old, as that is ad hominem and being new to HN shouldn't come into it. Welcome to HN.
Oh it's okay. Thanks for pointing it out. Truth is I lost my password and I hadn't given them my email, so I just made a new account. (I'm not upset about this.)
On the plus side, this will probably only ever happen to you once. Once you've felt the pain, you'll never let it happen again.
However remember that backups are only half of the disaster recovery picture. You need to have a tested restore process ready to go as well.
Automated and tested is even better :)
I wonder if it still would've happened if they were SSDs...
I tend to think they're safer due to no moving parts.
Actually it's been my experience that SSDs are less reliable than HDDs...
And for the love of all you hold holy, do not ever put drives bought in the same batch into the same RAID at the same time. They tend to fail at the same time. Check those serial numbers first.
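A minimal sketch of that serial-number check, assuming smartmontools is installed and the script runs with root privileges; the device names are placeholders for whatever is in your array.

    #!/usr/bin/env python3
    """Print model and serial numbers for a set of drives so you can spot
    same-batch (adjacent serial) disks before building an array from them."""
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]   # adjust to your box

    for dev in DEVICES:
        # 'smartctl -i' prints identity lines such as 'Serial Number: ...'.
        info = subprocess.run(["smartctl", "-i", dev],
                              capture_output=True, text=True).stdout
        for line in info.splitlines():
            if line.startswith(("Device Model:", "Model Number:", "Serial Number:")):
                print(dev, line.strip())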
That's most unfortunate to hear, because I just bought 10 used Intel X25-E SSDs for my server after SSDLife.exe said they'd only had 6 months of use, with 99% of their life left and an expected runtime of 10 more years.
With these drives typically around $750 each, it'll be difficult, if not impossible, to find deals on 8 separate lots of _good_ used drives.
I can only afford them when I see an exceptional deal on eBay, and I feel lucky to have found the ones I just got for $200 each. I strongly doubt I'll be able to afford one drive per lot for my server, as your warning implies would be the right thing to do.
It seems you'd have to buy them all new at different times of the year to be able to implement such a thing, which I certainly can't afford.
Well, I went with buying one drive a month from each of my vendors for my RAID.
Use them, but take some precautions. Do real backups and test the backup to make sure it will actually restore. Replace blown drives very quickly. I had a group of C4 SSDs that tanked within 24 hours of each other. I was not amused and thus learned what I need to do in the future.
If you really want to have some fun, test the hot-swap capabilities in the middle of a very busy, data-centric day.
We do that before the equipment goes into production. We do not tempt Mr. Murphy to visit us in production.
SpinRite!