Déjà vu: Fast efficient probabilistic deduplication

48 points by f483 9 years ago · 12 comments

Reader

Bloom filters don't seem useful as the primary means of deduplication for an actual data storage system to me.

- False positives means marking data as duplicate when it's not.

- Bloom filters are not associative. So unless the application is very special and can find/identify/retrieve the data later in some other (likely inefficient?) way, a separate index is required anyway. But if you have a separate index of the data you already have, you can just use that to deduplicate. This index is the main memory cost of deduplicating storage.

Bloom filters are of course interesting in certain scenarios e.g. to statistically reduce network traffic.

f483OP 9 years ago

> Bloom filters are of course interesting in certain scenarios e.g. to statistically reduce network traffic.
Yes I originally wrote it for exactly this such a scenario.
This is not for you if you cannot live with any chance of a false positive, even an extremely small one. Note that even in most cases a false positive is acceptable if the chance of a false positive is less then the chance of a hardware error.
Not that just because it is probabilistic does not mean its not useful in many cases. The default setup with 1mil entrie memory and 1/1mil chance false positive chance will run at ~8M mem usage. This is a quite an acceptable trade off in many cases.

maxdemarzi 9 years ago

May want to take a look at using a Cuckoo Filter instead of a Bloom Filter. See https://maxdemarzi.com/2017/07/13/using-a-cuckoo-filter-for-...

f483OP 9 years ago

Thanks for the feedback, will look into it.

adrianratnapala 9 years ago

Perhaps like the other ZFS fanboys on this thread, I was hoping that "efficient" would by some magic mean "sublinear in memory", but it doesn't look like that. :(

f483OP 9 years ago

Sorry I cant do magic just yet :(

jwilk 9 years ago

Poor name choice.

https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_(software)

https://github.com/worldveil/dejavu

https://github.com/appbaseio/dejavu

https://github.com/IndigoUnited/js-dejavu

f483OP 9 years ago

Yeah, should have put more thought into it then I did. But I guess its to late now.
chongli 9 years ago

You missed one! One of my favourite adventure games!
https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_(video_game)
ComputerGuru 9 years ago

I must say his is the most-fitting use of the name, though.

bradknowles 9 years ago

Could you use this to deduplicate files in a filesystem?

Could you use this as the deduplication method in ZFS?

f483OP 9 years ago

> Could you use this to deduplicate files in a filesystem?
Depends on what you mean and your constraints. Deduplicate file entries if you can live with a rare false positives, sure. A setup of 1million entrie limit with 1/1billion false positive chance will get ~80M mem usage.
If you want to check for duplicate files, probably not.
> Could you use this as the deduplication method in ZFS?
I would say no.

Settings

Déjà vu: Fast efficient probabilistic deduplication

Keyboard Shortcuts