Déjà vu: Fast efficient probabilistic deduplication
github.comBloom filters don't seem useful as the primary means of deduplication for an actual data storage system to me.
- False positives means marking data as duplicate when it's not.
- Bloom filters are not associative. So unless the application is very special and can find/identify/retrieve the data later in some other (likely inefficient?) way, a separate index is required anyway. But if you have a separate index of the data you already have, you can just use that to deduplicate. This index is the main memory cost of deduplicating storage.
Bloom filters are of course interesting in certain scenarios e.g. to statistically reduce network traffic.
> Bloom filters are of course interesting in certain scenarios e.g. to statistically reduce network traffic.
Yes I originally wrote it for exactly this such a scenario.
This is not for you if you cannot live with any chance of a false positive, even an extremely small one. Note that even in most cases a false positive is acceptable if the chance of a false positive is less then the chance of a hardware error.
Not that just because it is probabilistic does not mean its not useful in many cases. The default setup with 1mil entrie memory and 1/1mil chance false positive chance will run at ~8M mem usage. This is a quite an acceptable trade off in many cases.
May want to take a look at using a Cuckoo Filter instead of a Bloom Filter. See https://maxdemarzi.com/2017/07/13/using-a-cuckoo-filter-for-...
Thanks for the feedback, will look into it.
Perhaps like the other ZFS fanboys on this thread, I was hoping that "efficient" would by some magic mean "sublinear in memory", but it doesn't look like that. :(
Sorry I cant do magic just yet :(
Poor name choice.
https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_(software)
https://github.com/worldveil/dejavu
Yeah, should have put more thought into it then I did. But I guess its to late now.
You missed one! One of my favourite adventure games!
https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_(video_game)
I must say his is the most-fitting use of the name, though.
Could you use this to deduplicate files in a filesystem?
Could you use this as the deduplication method in ZFS?
> Could you use this to deduplicate files in a filesystem?
Depends on what you mean and your constraints. Deduplicate file entries if you can live with a rare false positives, sure. A setup of 1million entrie limit with 1/1billion false positive chance will get ~80M mem usage.
If you want to check for duplicate files, probably not.
> Could you use this as the deduplication method in ZFS?
I would say no.