Chunking Attacks on File Backup Services Using Content-Defined Chunking [pdf]
daemonology.net
123 points by cperciva a month ago
Less pdfblobby blog post https://www.daemonology.net/blog/2025-03-21-Chunking-attacks...
Yeah I considered submitting that, but all the interesting details are in the paper.
The 'uninteresting' meta-details are what get people to the interesting details, so your first instinct was probably right. I don't think it matters much in your case, though: you have enough local reputation to just blobpost, plus you're around to talk about the stuff in either form.
What is pdfblobby?
An adjective. He was saying "less of a PDF blob".
Thanks, I had interpreted "pdfblob" as some kind of PDF rendering tool that left noticeable artefacts or quirks in the output.
This is great, thank you! This was on my wishlist for a few years:
https://www.reddit.com/r/crypto/comments/7imejm/monthly_cryp...
I took a stab at this problem, but wasn't sure whether it worked at all:
https://gist.github.com/dchest/50d52015939a5772497815dcd33a7...
It's a modified BuzHash with the following changes:
- Substitution table is pseudorandomly permuted (NB: like Borg).
- Initial 32-bit state is derived from key.
- Window size slightly varies depending on key (by ~1/4).
- Digest is scrambled with a 32-bit block cipher.
I also proposed adding (unspecified) padding before encrypting chunks to further complicate discovering their plaintext lengths. Glad to see I was on the right track :)
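For illustration, here's a rough Python sketch of a keyed BuzHash along the lines of the four changes above. This is not the gist's code; the KDF, the window jitter, and the toy Feistel standing in for the 32-bit block cipher are all assumptions made for the example.

```python
# Rough sketch of a keyed BuzHash variant (illustrative; not the gist's code).
# Assumes a key of at most 64 bytes (BLAKE2b keyed-hash limit).
import hashlib
import random

MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    r &= 31
    return ((x << r) | (x >> (32 - r))) & MASK32

def derive(key: bytes, label: bytes, n: int) -> bytes:
    """Derive n pseudorandom bytes from the key (BLAKE2b as a stand-in KDF)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.blake2b(label + counter.to_bytes(4, "big"),
                               key=key, digest_size=64).digest()
        counter += 1
    return out[:n]

class KeyedBuzHash:
    def __init__(self, key: bytes, base_window: int = 64):
        # 1. Substitution table of 256 random 32-bit words, permuted per key.
        rng = random.Random(derive(key, b"table", 32))
        table = [rng.getrandbits(32) for _ in range(256)]
        rng.shuffle(table)
        self.table = table
        # 2. A 32-bit state derived from the key, mixed into each digest.
        self.init_state = int.from_bytes(derive(key, b"state", 4), "big")
        # 3. Window size jittered by up to ~1/4 depending on the key.
        jitter = int.from_bytes(derive(key, b"window", 4), "big") % (base_window // 4)
        self.window = base_window + jitter
        # 4. Key material for the final digest scramble.
        self.scramble_key = derive(key, b"scramble", 16)

    def _scramble(self, h: int) -> int:
        # Toy 4-round Feistel on 16-bit halves, standing in for a 32-bit block cipher.
        left, right = h >> 16, h & 0xFFFF
        for rnd in range(4):
            f = int.from_bytes(
                hashlib.blake2b(right.to_bytes(2, "big") + bytes([rnd]),
                                key=self.scramble_key, digest_size=2).digest(), "big")
            left, right = right, left ^ f
        return (left << 16) | right

    def digests(self, data: bytes):
        """Yield (offset, scrambled rolling hash) once the window is full."""
        h, w = 0, self.window
        for i, b in enumerate(data):
            h = rotl32(h, 1) ^ self.table[b]
            if i >= w:
                # Remove the byte that just left the window (rotated w times since added).
                h ^= rotl32(self.table[data[i - w]], w)
            if i >= w - 1:
                # Mix in the key-derived state so cut decisions stay content-defined.
                yield i, self._scramble(h ^ self.init_state)
```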
> I'm also exploring possibilities for making the chunking provably secure.
Seems like that's possible[1] to do in a fairly straightforward manner; the question is whether you can do it without computing a PRF for each byte.
[1] Obviously you’re always going to leak the total data size and the approximate size of new data per each transfer.
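For a concrete (if inefficient) picture of the "PRF for each byte" approach: cut points depend on a keyed PRF evaluated over a fixed window at every offset, so an attacker without the key can't predict where boundaries fall. This is a hedged sketch, not any particular scheme; the window, avg_size, and min/max bounds are made-up parameters.

```python
# Illustrative (and deliberately slow) keyed content-defined chunking:
# a keyed PRF is evaluated at every byte position.
import hmac
import hashlib

def keyed_chunks(data: bytes, key: bytes, window: int = 48,
                 avg_size: int = 4096, min_size: int = 512, max_size: int = 65536):
    """Cut after offset i when HMAC(key, data[i+1-window : i+1]) % avg_size == 0."""
    chunks = []
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        cut = False
        if length >= max_size:
            cut = True                      # hard upper bound on chunk size
        elif i + 1 >= window:
            tag = hmac.new(key, data[i + 1 - window:i + 1], hashlib.sha256).digest()
            cut = int.from_bytes(tag[:8], "big") % avg_size == 0
        if cut:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])         # trailing partial chunk
    return chunks
```

The cost concern above is visible here: one HMAC over a window-sized slice per byte is far more work than a single rolling-hash update per byte.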
Could this be mitigated by randomising the block upload order?
A fresh backup will be uploading thousands of blocks. You don't want to create all the blocks before uploading, but a buffer of a hundred might be enough?
Yes that's one of the things we're planning on doing. Doesn't help with small archives (or archives which don't contain much new data) of course.
I was originally planning on having that as part of 1.0.41, but the implementation turned out to be harder than I expected.
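For illustration only (not Tarsnap's implementation), the buffered-shuffle idea from the question above might look like the following; `upload` is a hypothetical callback and the buffer size of 100 is the figure suggested above.

```python
# Minimal sketch of randomized upload ordering with a bounded buffer.
import secrets

def upload_shuffled(blocks, upload, buffer_size=100):
    """Buffer up to `buffer_size` blocks and upload them in random order,
    so the observed upload order no longer matches chunk creation order."""
    buffer = []
    for block in blocks:               # blocks: iterator of chunk payloads
        buffer.append(block)
        if len(buffer) >= buffer_size:
            i = secrets.randbelow(len(buffer))
            upload(buffer.pop(i))      # send a randomly chosen pending block
    while buffer:                      # flush the remainder in random order
        i = secrets.randbelow(len(buffer))
        upload(buffer.pop(i))
```

As noted above, this only blurs ordering within the buffer window, so it helps little when an archive contains only a handful of new chunks.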
Having not read the paper: does this impact Restic or Borg, which encrypt chunks?
> The reason rolling hashes are relevant to our topic is that many file backup services use variations of rolling hashes to achieve CDC. This paper will primarily look at Tarsnap [9], a project by the second author, but we will also look at other schemes such as Borg [2] and Restic [6]
> It seems like compression as default (or even required) is important. Without compression, Borg and Restic are susceptible to known plaintext attacks. With compression, we still have theoretically sound (and harder) chosen-plaintext attacks but no known-plaintext attacks. Sadly, compression can also leak information for post-parameter extraction attacks, as shown in Example 4.3.
My reading is that the primary vector is based on the size of the chunks (due to deterministic chunking and length-preserving encryption). Would padding chunks with random-length data (prior to encryption) help mitigate this at the cost of additional storage (and complexity)?
Borg has supported the "obfuscate" pseudo-compressor since 1.2.0 (Feb 2022); it adds random extra length to the chunks, using one of two different algorithms to determine the extra length.
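As a sketch of the padding idea suggested above (this is not Borg's "obfuscate" code, just the general technique): append a random number of bytes plus a length field before encryption, and strip them after decryption. The max_pad value here is an arbitrary illustrative parameter.

```python
# Minimal sketch of random-length padding applied before encryption.
import secrets

def pad_chunk(chunk: bytes, max_pad: int = 1024) -> bytes:
    """Append 0..max_pad random bytes plus a 4-byte pad-length suffix."""
    pad_len = secrets.randbelow(max_pad + 1)
    return chunk + secrets.token_bytes(pad_len) + pad_len.to_bytes(4, "big")

def unpad_chunk(padded: bytes) -> bytes:
    """Recover the original chunk after decryption."""
    pad_len = int.from_bytes(padded[-4:], "big")
    return padded[:-4 - pad_len]
```

The trade-off is the one already noted: extra storage and bandwidth in exchange for noisier ciphertext lengths.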
Would SipHash be too slow? I think it would help mitigate the problem since you can key it to prevent known-plaintext attacks, right?
EDIT: or maybe this keyed rolling hash https://crypto.stackexchange.com/questions/16082/cryptograph...
borg discussion + wiki page:
On page 10, should the ring R be GF(2)[X]/(X^32 - 1) and the map p be from {0,1}^{32} to R?