Dribbble is Back With a Day of Data Lost

blog.dribbble.com

62 points by 19_ploT 13 years ago · 21 comments

replicatorblog 13 years ago

There is a lot of armchair sysadmining going on, but remember, the team that built Dribbble is essentially 4 total people, 2.5 engineers, working with no outside funding. The fact that they've built the designer's equivalent of GitHub and keep it running as smoothly as it does is amazing. It's fine to provide suggestions, but this is a minor blip in an otherwise impeccable record of performance.

  • patio11 13 years ago

    If there's one social norm I'd love for HN, it would be "If you build things, we're on your side." (I hope that the normative intent of this is clear enough to not require 2 paragraphs of inoculations against nitpickery. On second thought, if there were two social norms I'd like for HN, that plus "Default to not nitpicking.")

    • tptacek 13 years ago

      Strongest possible agree.

      If there's one thing my gut says has changed for the worse since I joined, it's the cant away from supporting people who build things to tearing them down.

      I asked in a "how do we improve HN comments" thread a while back if Paul Graham could just add this to the guidelines, but it got drowned out by all the nerdy feature requests and didn't get much discussion.

    • tedsanders 13 years ago

      That's a great idea. (I wish people on the internet were more supportive and less antagonistic in general. It's a tendency we should all be aware of.)

    • wikwocket 13 years ago

      But we are hackers; it's our intrinsic pedantry and relentless drive to improve that got us where we are today! Of course you're right that this leads to nitpicking and one-upmanship in comment threads.

      On the bright side, looking to the quality comments here is a great way to learn constructive criticism (as well as learn about a variety of other topics too).

NelsonMinar 13 years ago

Good disclosure on the part of Dribbble.

I have some sympathy; I've seen a Linux server randomly corrupt its file cache, no idea why. Google's study found 8% of DIMMs experienced at least one memory error a year. If you can't trust your RAM, what can you trust? http://research.google.com/pubs/pub35162.html
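To put that 8% figure in perspective, here is a rough back-of-the-envelope sketch. It assumes every DIMM independently has an 8% chance of at least one error per year; the independence assumption and the DIMM counts are mine for illustration, not something the study states.

```python
def p_any_error(dimms: int, p_per_dimm: float = 0.08) -> float:
    """Chance that at least one of `dimms` modules sees an error in a year.

    Assumes the modules fail independently, which is an illustrative
    simplification rather than a claim from the study.
    """
    return 1.0 - (1.0 - p_per_dimm) ** dimms


if __name__ == "__main__":
    # Hypothetical server configurations.
    for n in (1, 4, 8, 16):
        print(f"{n:2d} DIMMs: {p_any_error(n):.1%}")
```

Under those assumptions the odds climb quickly with the number of modules, which is why a single well-populated database box is not a small risk.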

  • manmal 13 years ago

    In ECC we trust.

    • NelsonMinar 13 years ago

      One of the great ironies of modern computing is that we stopped building ECC into consumer hardware right when we got enough RAM to really need it. What fraction of server hosting has ECC RAM? No one seems to know if Amazon EC2 does, for instance, which suggests it probably doesn't.

      • nodesocket 13 years ago

        I would be really surprised if the memory on EC2 is not ECC. 32GB of ECC only runs $399.00 on Crucial.

jakerocheleau 13 years ago

It could have been a lot worse, for sure. I'm just happy they resolved the issue and it's back online with minimal damage.

Building & maintaining a website is always a learning experience because there are so many different areas to study.

hijinks 13 years ago

Not trying to be an ass here or anything, but something doesn't add up. I understand the memory corruption idea, but I wouldn't think that would replicate to the other PostgreSQL server. So am I right in thinking there was never a slave here?

  • pilif 13 years ago

    It really depends on how you have configured replication and what the exact issue was. Postgres replication works either by streaming WAL records directly or by manually shipping older archived WAL files. If these files were corrupted on the master, then the slave would also get the corrupted files.

    Now the files (and, when streaming directly, the packets) have a header containing some metadata, and the actual WAL entries have a fixed format, so the slave would likely have detected this corruption (unless you were really unlucky, in which case the corruption would simply replicate over to the slave).
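    A minimal sketch of that detection idea, not Postgres's actual WAL format: the record layout below (a length plus a CRC-32 in front of the payload) and the helper names are made up for illustration. It shows both cases: corruption after the checksum was computed is caught, while corruption that happened in RAM before checksumming is sealed in and passes verification.

```python
import struct
import zlib


def pack_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC-32 (hypothetical layout)."""
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload


def unpack_record(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if verification fails."""
    length, expected_crc = struct.unpack_from("<II", record)
    payload = record[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != expected_crc:
        raise ValueError("record failed checksum; stop replaying")
    return payload


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate a single flipped bit at position `bit`."""
    corrupted = bytearray(data)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    original = b"INSERT INTO shots ..."
    good = pack_record(original)
    assert unpack_record(good) == original

    # Corrupted after checksumming (disk or network): the receiver notices.
    try:
        unpack_record(flip_bit(good, 8 * 10))
    except ValueError as exc:
        print("detected:", exc)

    # Corrupted in RAM before checksumming: verifies fine and replicates.
    bad_in_ram = flip_bit(original, 3)
    assert unpack_record(pack_record(bad_in_ram)) == bad_in_ram
```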

    But that would just cause the slave to stop replicating. Unless you monitor your replication clients to check that they are still OK, still streaming from the master, and that the replication lag is reasonably low, you would not notice that replication had stopped. When you fail over, you get the state the database was in when the first corrupted packet arrived.

    So either you check your slaves, or you use two-phase commit, ensuring that your data has reached the slaves, but that has some serious performance costs.
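    For the "check your slaves" option, a rough sketch of a lag check. It assumes you have already fetched the master's current WAL position, the slave's replayed position and the streaming state from somewhere (for example the pg_stat_replication view); the function names and the 16 MiB threshold are arbitrary choices of mine.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to an absolute byte position.

    An LSN is printed as two hex numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replica_is_healthy(
    master_lsn: str,
    replica_lsn: str,
    streaming: bool,
    max_lag_bytes: int = 16 * 1024 * 1024,  # arbitrary threshold
) -> bool:
    """True if the replica is streaming and not too far behind the master."""
    lag = lsn_to_int(master_lsn) - lsn_to_int(replica_lsn)
    return streaming and 0 <= lag <= max_lag_bytes


if __name__ == "__main__":
    # Hypothetical positions: the replica is 4 KiB behind and still streaming.
    print(replica_is_healthy("16/B374E848", "16/B374D848", streaming=True))
    # A replica that stopped streaming should raise an alert, whatever its lag.
    print(replica_is_healthy("16/B374E848", "16/B374D848", streaming=False))
```

    Run something like this from cron or your monitoring system and alert when it returns False, so a stalled slave gets noticed before you need to fail over.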

    BTW: I would assume this was far more likely an issue with their storage, not with RAM.

  • pfg 13 years ago

    I don't know exactly how the replication in postgres works, but I can think of scenarios where bit flipping in RAM gets propagated to the slave(s) (e.g., newly-generated data which currently resides only in RAM gets corrupted and then fsync'd to disk, at which point it gets replicated).

    The real question is, why on earth would they use non-ECC memory on their database server?

    • runarb 13 years ago

      > The real question is, why on earth would they use non-ECC memory on their database server?

      Perhaps because the database server is a cloud server from someone like Amazon, Media Temple or Linode, where you have no control over the underlying hardware.

timmm 13 years ago

The links at the top don't really work.

benjaminwootton 13 years ago

Perhaps we will start to see posts about people abandoning Postgres and moving back to MongoDB, completing the circle?

  • bobx11 13 years ago

    I chuckled :)

  • camus 13 years ago

    The point is, no software can prevent hardware failure, not even MongoDB ;)

    • brokenparser 13 years ago

      Except for those programs which listen to sensory input, specifically to ensure safe operating conditions. E.g. if it's a deliberate design choice not to allow operations when external factors are out of bounds, I'd consider it a success if control software decides to shut down when that happens. In doing so, it has prevented the hardware from entering a potentially devastating failure mode. Sorry to wit.

camus 13 years ago

Glad you are back, hope you'll do what it takes to avoid another data loss ;) Good luck.
