Dribbble is Back With a Day of Data Lost
blog.dribbble.com

There is a lot of armchair sysadmining going on, but remember: the team that built Dribbble is essentially 4 people total, 2.5 of them engineers, working with no outside funding. The fact that they've built the designer's equivalent of GitHub and keep it running as smoothly as they do is amazing. It's fine to offer suggestions, but this is a minor blip in an otherwise impeccable record of performance.
If there's one social norm I'd love for HN, it would be "If you build things, we're on your side." (I hope that the normative intent of this is clear enough to not require 2 paragraphs of inoculations against nitpickery. On second thought, if there were two social norms I'd like for HN, that plus "Default to not nitpicking.")
Strongest possible agree.
If there's one thing my gut says has changed for the worse since I joined, it's the tilt away from supporting people who build things toward tearing them down.
I asked in a "how do we improve HN comments" thread a while back whether Paul Graham could just add this to the guidelines, but it got drowned out by all the nerdy feature requests and didn't get much discussion.
That's a great idea. (I wish people on the internet were more supportive and less antagonistic in general. It's a tendency we should all be aware of.)
But we are hackers; it's our intrinsic pedantry and relentless drive to improve that got us where we are today! Of course you're right that this leads to nitpicking and one-upmanship in comment threads.
On the bright side, looking to the quality comments here is a great way to learn constructive criticism (as well as learn about a variety of other topics too).
Good disclosure on the part of Dribbble.
I have some sympathy; I've seen a Linux server randomly corrupt its file cache, no idea why. Google's study found 8% of DIMMs experienced at least one memory error a year. If you can't trust your RAM, what can you trust? http://research.google.com/pubs/pub35162.html
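Some back-of-the-envelope arithmetic on that figure: if roughly 8% of DIMMs see at least one error per year, and you assume (a simplification) that errors strike DIMMs independently, the odds of a whole machine getting through a year clean shrink quickly with memory size:

```python
# Rough arithmetic on the ~8%-of-DIMMs-per-year figure from the Google
# study. Assumes errors hit DIMMs independently, which is a
# simplification (the study found errors cluster on bad modules).
p_dimm = 0.08  # annual probability that a single DIMM sees >=1 error

for dimms in (4, 8, 16):
    # Probability that at least one DIMM in the machine has an error.
    p_machine = 1 - (1 - p_dimm) ** dimms
    print(f"{dimms} DIMMs: {p_machine:.1%} chance of a memory error per year")
```

With 8 DIMMs that works out to nearly a coin flip per year, which is why the "in ECC we trust" reply below isn't a joke.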
In ECC we trust.
One of the great ironies of modern computing is that we stopped building ECC into consumer hardware right when we got enough RAM to really need it. What fraction of server hosting has ECC RAM? No one seems to know if Amazon EC2 does, for instance, which suggests it probably doesn't.
I would be really surprised if the memory on EC2 is not ECC. 32GB of ECC only runs $399.00 on Crucial.
It could have been a lot worse, for sure. I'm just happy they resolved the issue and it's back online with minimal damage.
Building & maintaining a website is always a learning experience because there are so many different areas to study.
Not trying to be an ass here or anything, but something doesn't add up. I understand the memory-corruption idea, but I wouldn't think that would replicate to the other PostgreSQL server. So am I right in thinking there was never a slave here?
It really depends on how you have configured replication and what the exact issue was. Postgres replication either works by directly streaming the WAL archive or by manually shipping older archived WAL files. If these files were corrupted on the master, then the slave would also get the corrupted files.
Now, the files (and, when streaming directly, the packets) have a header containing some metadata, and the actual WAL entries have a fixed format, so it's likely that the slave would have detected this corruption (unless you were really unlucky, in which case the corruption would easily replicate over to the slave).
But that would just cause the slave to stop replicating. Unless you monitor your slaves to check that they are still OK, still streaming from the master, and that the replication lag is reasonably low, you would not notice replication stopping. When you fail over, you get the state the database was in when the first corrupted packet arrived.
So either you check your slaves, or you use two-phase commit, ensuring that your data has reached the slaves, but that has some serious performance costs.
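The detection step described above can be sketched with a toy record format. This is not the real WAL layout; the CRC-plus-length header here is made up for illustration. The point is that a bit flipped *after* the checksum was computed is detectable on the slave:

```python
# Toy record format (NOT Postgres's actual WAL layout): a checksum in
# the header lets the receiver detect corruption that happened after
# the record was written out.
import struct
import zlib


def make_record(payload: bytes) -> bytes:
    # Hypothetical layout: 4-byte CRC32 + 4-byte length + payload.
    header = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + payload


def verify_record(record: bytes) -> bool:
    crc, length = struct.unpack("<II", record[:8])
    payload = record[8 : 8 + length]
    return len(payload) == length and zlib.crc32(payload) == crc


rec = make_record(b"INSERT INTO shots ...")
assert verify_record(rec)

# Flip one bit in the payload, as a faulty DIMM or disk might:
corrupted = bytearray(rec)
corrupted[10] ^= 0x01
assert not verify_record(bytes(corrupted))
```

As the parent notes, detection alone just halts replication; it doesn't tell anyone unless the lag is being watched.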
BTW: I would assume this was far more likely an issue with their storage, not with RAM.
thanks for the explanation
I don't know exactly how replication in Postgres works, but I can think of scenarios where a bit flip in RAM gets propagated to the slave(s) (e.g., newly generated data that currently resides only in RAM gets corrupted and then fsync'd to disk, at which point it gets replicated).
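That scenario is worth spelling out, because it's the mirror image of the detectable case: if the bit flips in RAM *before* any checksum is computed, the checksum is computed over the already-corrupted buffer, so it verifies cleanly everywhere downstream. A toy sketch (the buffer and checksum here are illustrative, not Postgres internals):

```python
# Toy illustration: a bit flip that happens in memory BEFORE the
# checksum is computed is invisible to checksum verification, so the
# corrupted data would be fsync'd and replicated as if it were valid.
import zlib

buffer = bytearray(b"user likes shot 42")

# A faulty DIMM flips a bit while the data sits in RAM...
buffer[5] ^= 0x04

# ...and only then is the checksum computed, over the corrupted bytes.
checksum = zlib.crc32(bytes(buffer))

# Verification on the slave passes: nothing looks wrong.
assert zlib.crc32(bytes(buffer)) == checksum
```

No integrity check computed after the flip can catch this; only ECC (or end-to-end application-level checksums computed before the data ever hits that DIMM) helps.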
The real question is, why on earth would they use non-ECC memory on their database server.
> The real question is, why on earth would they use non-ECC memory on their database server.
Perhaps because the database server is a cloud server from someone like Amazon, Media Temple or Linode, where you have no control over the underlying hardware.
The links at the top don't really work.
Perhaps we will start to see posts about people abandoning Postgres and moving back to MongoDB, completing the circle?
I chuckled :)
The point is, no software can prevent hardware failure, not even MongoDB ;)
Except for programs that listen to sensor input specifically to ensure safe operating conditions. E.g., if it's a deliberate design choice not to allow operations when external factors are out of bounds, I'd consider it a success if the control software decides to shut down when that happens. In doing so, it has prevented the hardware from entering a potentially devastating failure mode. Sorry for the pedantry.
Glad you are back; hope you'll do what it takes to avoid another data loss ;) Good luck.