A self-inflicted data breach

32 points by aiguuu 4 years ago · 12 comments


> This blog post is not about Gravatar.

Stack Overflow used those MD5 hashes to display gravatars. Downloading old Stack Exchange data dumps is simply an easy way of collecting Stack Overflow's gravatar hashes.
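For readers unfamiliar with the scheme, here is a minimal sketch of how such a hash is derived, assuming the classic Gravatar recipe (trim whitespace, lowercase, then MD5). The function name and the sample address are made up for illustration.

```python
import hashlib


def gravatar_hash(email: str) -> str:
    """Return the MD5 hex digest of the trimmed, lowercased address.

    Assumption: this mirrors the classic Gravatar recipe; the function
    name is hypothetical and not part of any library.
    """
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hypothetical address: case and surrounding whitespace do not change the hash.
digest = gravatar_hash("  Alice@Example.com ")
print(len(digest))  # 32 hex characters, i.e. a 128-bit digest
```

Because there is no salt and no secret, anyone who can guess the address can recompute the same digest and confirm the guess.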

> Some time in 2013 the email hash issue was brought to light and Stack Exchange promptly removed the email hashes from their dumps, but the damage was already done.

The issue has been known since at least 2009: http://www.developer.it/post/gravatars-why-publishing-your-e...

And I warned of this exact issue on meta.stackexchange in 2011: https://meta.stackexchange.com/a/84734/152255

In 2011 the dump contained 95k unique email hashes plus 10k IPv4 addresses, while the dump mentioned in the article had grown to 1.8M.

Stack Exchange ignored the issue for several years instead of fixing it promptly. This delay increased the number of affected email addresses roughly 20x.

sneak 4 years ago

What is the basis for the claim that md5 hashed addresses are easily reversed?

You can of course hash public email lists but that won't work for unique emails.

quag 4 years ago

I too was skeptical, but the author actually reversed the email addresses for half of the entries in the leak in 3 minutes.
The second to last paragraph in the article says: “With my Nvidia RTX 3080 it only took 3 minutes and 17 seconds to process. Hashcat was able to recover 51.81% (972,933) of the hashes.”

shkkmo 4 years ago

MD5 was intended to be a message digest hash, so its primary function was to provide collision-resistant checksums. (Edit: which it does poorly in an adversarial environment, so it is now retired for security uses.)
Unlike hashing algorithms designed to be hard to reverse, MD5 has a very fast algorithm, so rainbow tables are fast and easy to create.
A public list of emails only speeds up the process and is not required as you could just build a completely random rainbow table. Prioritizing common email patterns in the table also is just another speed boost.
- kingcharles 4 years ago
  
  I wonder if there are two "sane" looking email addresses that collide on the same MD5 hash?
  - CodesInChaos 4 years ago
    
    It's very likely they exist. But I don't think the existing attacks on MD5 will allow you to construct one. So you'd have to spend about 2^64 hash computations on a brute-force birthday search, which is expensive but feasible.

xarope 4 years ago

This is probably going back 20+ years, but given old tools like john the ripper, newer tools like ighashgpu, or just writing your own iterative tool to go through [a-z][A-Z][0-9]@<populardomain>, and the advent of high speed CPUs and GPUs, you'd be able to generate rainbow tables of millions of md5 hashes in a few hours/days/weeks (depending on length and complexity of course).
So yes, literally every security expert out there knows md5 hashes are weak, hence the advent of salted hashes, and nowadays sha and variants, as well as well-known hash types for passwords like bcrypt/argon2.
- Mandatum 4 years ago
  
  Millions of md5s in hours? More like microseconds.

CodesInChaos 4 years ago

In 2011 I attempted to reverse email MD5s published by stackoverflow. I recovered ~28% of 95k unique email hashes using a slow CPU based hasher, by trying simple patterns based on common names and providers. I did not use any already known email addresses for this.
https://meta.stackexchange.com/a/84734/152255
I think a bit later somebody else did a similar attack on a different dataset, and recovered about 50% using a GPU based hasher (a GPU can burn through billions of MD5 hashes per second).
https://arstechnica.com/information-technology/2013/12/crypt...
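A rough sketch of the pattern-based approach described above, not the code actually used in either attack. The candidate patterns, name lists, and function names are illustrative assumptions; a real run would use far larger lists and a GPU hasher.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def candidates(first_names, last_names, providers):
    """Yield guessed addresses built from simple name patterns.

    Assumption: the patterns here (first, first.last, firstlast, flast)
    are examples only; names are assumed to be non-empty and lowercase.
    """
    for provider in providers:
        for first in first_names:
            yield f"{first}@{provider}"
            for last in last_names:
                for local in (f"{first}.{last}", f"{first}{last}", f"{first[0]}{last}"):
                    yield f"{local}@{provider}"


def recover(leaked_hashes, first_names, last_names, providers):
    """Map each leaked hash that matches a guess back to the guessed address."""
    leaked = set(leaked_hashes)
    found = {}
    for email in candidates(first_names, last_names, providers):
        digest = md5_hex(email)
        if digest in leaked:
            found[digest] = email
    return found


# Hypothetical data: one guessable address and one that no pattern covers.
leaked = [md5_hex("jane.doe@example.com"), md5_hex("unique-xyz-123@example.org")]
print(recover(leaked, ["jane", "john"], ["doe"], ["example.com"]))
```

No known email list is needed: every guess is generated, hashed, and checked against the set of leaked digests, and addresses that fit no pattern simply stay unrecovered.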

andreareina 4 years ago

Not many people have unique addresses.
ETA: the author isn't claiming to have found a preimage attack, but brute force is a legitimate tactic when the search space is constrained, and I'd argue it counts for claiming that those particular hashes were reversed.

taspeotis 4 years ago

You can do many billion MD5 hashes a second and there are a few common suffixes @gmail.com @outlook.com etc.
The search space is mostly lowercase and maybe . or +
So it literally isn’t a reverse lookup but it’s still pretty broken.
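A back-of-the-envelope version of that argument, as a sketch. The hash rate (10 billion MD5/s), the 28-character alphabet (lowercase plus `.` and `+`), the maximum length, and the number of domains are all assumed figures, not measurements from the article.

```python
ALPHABET_SIZE = 26 + 2      # assumed: lowercase letters plus '.' and '+'
HASH_RATE = 10e9            # assumed: 10 billion MD5 hashes per second


def local_parts_up_to(max_len: int, alphabet: int = ALPHABET_SIZE) -> int:
    """Count every local part of length 1..max_len over the alphabet."""
    return sum(alphabet ** n for n in range(1, max_len + 1))


def seconds_to_exhaust(max_len: int, domains: int, rate: float = HASH_RATE) -> float:
    """Time to hash every local part against every domain suffix."""
    return local_parts_up_to(max_len) * domains / rate


# All local parts up to 8 characters, tried against 5 common domains.
print(f"{seconds_to_exhaust(8, 5):.0f} seconds")
```

Under these assumptions the whole space of local parts up to 8 characters falls in a few minutes, and each extra character multiplies the cost by 28, which is why long or unusual addresses survive while short, common ones do not.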
