Breaking and Fixing Content-Defined Chunking

This work will be presented at the Real World Crypto Symposium 2025 in Sofia, Bulgaria

At the beginning of 2023, Matteo (link to his website) and I started supervising a Bachelor's thesis with the objective of analyzing the state of end-to-end encrypted backup applications. This is part of the ongoing effort from our group (among others) to analyze cryptographic applications "in the wild". Martin Albrecht and Kenny Paterson wrote a nice retrospective on this sort of exploration here.

The thesis is available on our group's website and presents a fingerprinting attack on the Borg backup system. In short, due to the peculiar way deduplication is done (by splitting files into smaller parts called chunks), some applications are vulnerable to a fingerprinting attack: an adversary with access to the encrypted chunks can check whether a file is present in the backup or not, despite the encryption. While some applications, like Borg, try to protect against this, the project showed that these mitigations can be insufficient.

Inspired by this early result, we set out to break more backup systems. With some procrastination in between, as we do.

TL;DR: We found attacks that compromise the mitigations for Borg, Restic, Tarsnap, Bupstash and Duplicacy (under some specific assumptions, see the paper and below). We also provide a cryptographically-secure mitigation and we discuss weaker mitigations that may sometimes be more practical. Here is a link to the pre-print of our paper.

This is joint work with Simon-Philipp Merz, Matteo Scarlata, Felix Günther, and Kenny Paterson.

Preface: on independent work

When we contacted Colin Percival, the author of Tarsnap, on the 9th of January 2025, we discovered that they had previously been contacted by two researchers, Boris Alexeev and Yan X. Zhang, discussing attacks in the same spirit as ours. Up until that moment, we had no knowledge of their results, and our work was completely independent. After our disclosure, but before the release of this post, they completed a writeup of their work, which you can find here.

We are thankful to Colin, Boris, and Yan for the discussions we had during our disclosure.

Syncing, in the face of redundancy

Imagine the following abstract setting: two devices hold nearly identical copies of the same set of files and have to synchronize them. Let's assume that synchronization happens in one direction only (this would be the case in which one device is a client and the other is the server).

This appears in practice in various settings, from remote backups to file synchronization and software patching.

One direct solution would be to simply transfer the whole set of files, which is clearly inefficient for large sets of big files. Another easy solution would be to exchange succinct checksums of files (e.g., SHA-256 hashes) and only send the files that differ. However, a single-byte edit in a file would require transferring the whole file again, which is inefficient if the edited file is large. The solution is to split files into chunks and then exchange checksums at the granularity of chunks.

Fundamentally, the consequence is that chunks are deduplicated: only one copy of every distinct chunk needs to be uploaded, and everywhere else a pointer to that copy suffices, allowing for very efficient synchronization. But here's the million-dollar question: how does one chunk files such that deduplication is efficient?
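To make the deduplication concrete, here is a minimal Python sketch under some assumptions of ours: the hypothetical `sync_chunks` helper treats a plain dictionary `server_store` as a stand-in for remote storage, keyed by SHA-256 digests, and uploads only the chunks the server has not seen before.

```python
import hashlib

def sync_chunks(chunks, server_store):
    """Upload only the chunks the server does not already hold."""
    manifest = []          # ordered digests, enough to rebuild the file later
    uploaded = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        manifest.append(digest)
        if digest not in server_store:      # deduplication happens here
            server_store[digest] = chunk    # "upload" the new chunk
            uploaded += 1
    return manifest, uploaded
```

The ordered manifest of digests, together with the chunk store, is all that is needed to reconstruct the file on the other side.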

One idea could be to split files into chunks of the same fixed size. Changing a single byte in a chunk ensures that only that chunk changes, so only that chunk has to be transferred. However, prepending a single character causes something catastrophic to happen: every byte shifts by one and every single chunk changes, which completely defeats the purpose of chunking in the first place. This is known in folklore as the boundary shift problem.
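A toy example makes the boundary shift problem visible; the `fixed_size_chunks` helper below is purely illustrative:

```python
def fixed_size_chunks(data: bytes, size: int = 8) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"The quick brown fox jumps over the lazy dog."
edited = b"!" + original                    # prepend a single byte

before = fixed_size_chunks(original)
after = fixed_size_chunks(edited)
# Every boundary shifts by one byte, so essentially no chunk survives:
print(sum(x == y for x, y in zip(before, after)))   # 0
```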

At this point, the chunking scheme must employ a clever trick: instead of setting chunk boundaries at absolute offsets, it sets chunk boundaries depending on the plaintext. This means that if it, say, chunks between the words "Hello" and "World", it will keep chunking there even if those words get shifted around the file. Thus, inserting a single byte does not affect the location of the chunk boundaries relative to the file content. The consequence is that, while the first chunk may include more data, the subsequent ones remain identical (with good probability) and hence do not need to be transferred. Success!
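Here is a minimal, deliberately slow sketch of the idea in Python. The window size, the modulus, and the use of SHA-256 over the window are illustrative choices, not those of any real system; a boundary is declared whenever the hash of the last few bytes hits a target value, so boundaries move with the content rather than with absolute offsets.

```python
import hashlib
import random

WINDOW = 8      # bytes of context that decide a boundary
MODULUS = 16    # average chunk length is roughly MODULUS bytes (tiny, for the demo)

def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut whenever the hash of the last WINDOW bytes hits a target value.

    Real systems use a *rolling* hash here for speed; recomputing SHA-256 at
    every position is only for illustration.
    """
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        h = int.from_bytes(hashlib.sha256(data[i - WINDOW:i]).digest()[:4], "big")
        if h % MODULUS == 0:                 # the boundary depends on content only
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

random.seed(0)
doc = bytes(random.randrange(256) for _ in range(2000))
shifted = b"X" + doc                         # prepend one byte

a, b = cdc_chunks(doc), cdc_chunks(shifted)
# Apart from the chunk(s) at the very start, the boundaries re-align and
# almost all chunks are shared again:
print(len(set(a) & set(b)), "shared chunks out of", len(a))
```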

This technique, called Content-Defined Chunking (or CDC for short), is so successful that it has been introduced in rsync, IPFS, the League of Legends patcher, and basically every single end-to-end encrypted backup system we managed to find including Borg, Kopia, Bupstash, Restic, Duplicacy, and Tarsnap.

A strawman backup system

Imagine you are a software developer and you want to create an efficient backup system that does not leak anything to the server hosting the backup. In this case, you are the one hosting the backups of the users and you definitely don't want to see anything, lest you get some national security agency knocking on your door with a subpoena in their hand.

This seems like an easy problem to solve: generate a symmetric key to be used in an appropriate Authenticated Encryption with Associated Data (AEAD) scheme. Then, encrypt and send the files to the server. Sounds good, right?

Ah, right: we forgot to add deduplication, so now the users are complaining that their backups are exploding in size (and your AWS bills are looking very grim now despite having few users). Easy enough: use a CDC scheme to chunk the file first. Then, encrypt these chunks and send them over the network.
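A sketch of this strawman client, assuming chunks produced by a CDC routine like the toy one above and using the ChaCha20-Poly1305 AEAD from the Python `cryptography` package (the function name and parameters are our own illustration):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def backup_chunks(chunks: list[bytes], key: bytes) -> list[bytes]:
    """Strawman client: AEAD-encrypt each (already CDC-chunked) chunk."""
    aead = ChaCha20Poly1305(key)
    blobs = []
    for chunk in chunks:
        nonce = os.urandom(12)                   # 96-bit nonce, never reused
        blobs.append(nonce + aead.encrypt(nonce, chunk, None))
    # Each blob is exactly len(chunk) + 12 + 16 bytes: the content is hidden,
    # but the sequence of chunk lengths remains visible to the server.
    return blobs

key = ChaCha20Poly1305.generate_key()            # 256-bit backup key
```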

Is it true that you can't deduce anything about the users' backups now?

Fingerprinting attacks

If you've ever followed a course in cryptography, you (should have!) heard something about encryption not protecting the length of messages. This can pop up when explaining the one-time pad ("The OTP provides information-theoretic security, and only leaks the length of the message") or concepts like IND-CPA security, where the adversary is only allowed to query for pairs of same-length messages, to prevent trivial wins.

Usually this is not a big issue, but, in the setting of encrypted backups, it's a lethal weakness. Recall what we said earlier: "we set chunk boundaries depending on the plaintext". This means that the lengths of the chunks depend on the plaintext, and the lengths of the chunks are not protected by our encryption scheme! Like sodium and water, the combination of the two is explosive.

Here's the setting: a user uploads a large file as a sequence of encrypted chunks. Let's say that this results in chunks of lengths \(L_1, \ldots, L_n\). Again, these lengths are visible despite encryption. Now, a national security agency wants to check whether this file is a copy of some controversial book. All they need to do is take their own copy of the book and chunk it: if it results in chunks of lengths \(L_1, \ldots, L_n\), then they have strong evidence that this user has the book in their backup. Fundamentally, the sequence of chunk lengths forms a fingerprint for a file, and this fingerprint is transparent to encryption.
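In code, the attack is almost embarrassingly simple. The 28-byte overhead and the helper names below refer to our strawman sketch from earlier; the crucial point is that the adversary can run the very same public chunker on its own copy of the file:

```python
CIPHERTEXT_OVERHEAD = 12 + 16   # nonce + tag in the strawman scheme above

def fingerprint(chunker, data: bytes) -> list[int]:
    """The chunk-length sequence of a file, computed with the public chunker."""
    return [len(c) for c in chunker(data)]

def file_in_backup(chunker, observed_blobs: list[bytes], suspect: bytes) -> bool:
    observed = [len(b) - CIPHERTEXT_OVERHEAD for b in observed_blobs]
    target = fingerprint(chunker, suspect)
    # Look for the suspect file's length pattern inside the observed stream.
    return any(observed[i:i + len(target)] == target
               for i in range(len(observed) - len(target) + 1))
```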

Protecting against fingerprinting attacks: a first attempt

The astute reader might have spotted something in the previous paragraph: the whole attack works only because the attacker (the server provider) can compute the fingerprint of any file. What if we injected some secret into the chunking algorithm that prevents the adversary from computing this fingerprint? Ideally, unless the adversary can somehow recover the key, this should make it impossible for them to run fingerprinting attacks.
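A sketch of this idea, continuing the toy chunker from before: mix a secret key into the per-window value so that an outsider cannot recompute where the cuts fall. (The per-byte HMAC below is illustrative only and would be far too slow in practice, which is exactly the tension discussed next.)

```python
import hashlib
import hmac

def keyed_cdc_chunks(data: bytes, chunking_key: bytes,
                     window: int = 8, modulus: int = 4096) -> list[bytes]:
    """Same boundary rule as the unkeyed sketch, but the per-window value now
    depends on a secret key, so an outsider cannot predict the cut points."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        tag = hmac.new(chunking_key, data[i - window:i], hashlib.sha256).digest()
        if int.from_bytes(tag[:4], "big") % modulus == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks
```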

Shoving a cryptographic key in an arbitrary algorithm in the hope that it provides security is like throwing a scalpel at a patient hoping to perform an appendectomy: it won't work. In fact, designing a secure chunking algorithm is not a trivial task, and many backup systems got it wrong, either trivially or in more involved ways.

We do not blame the developers for getting it wrong since, as just mentioned, this is quite tricky. Furthermore, there is another quite nasty requirement: the chunking algorithm has to be blazingly fast. Intuitively, it has to be called for every byte of the file, so we need to run it billions, even trillions of times in the span of a few minutes. The design of such a fast CDC algorithm is an interesting problem and there is a lot of literature on it. In the end, to obtain this efficiency, most CDC schemes rely on nice algebraic properties, often involving polynomials. Intuitively, one would want to preserve this algebraic structure as much as possible even when mitigating fingerprinting attacks, so that the scheme remains efficient.
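To see where that structure comes from, here is a buzhash-style (cyclic polynomial) rolling hash, the flavour of rolling hash used by several chunkers; the table and parameters below are illustrative. Sliding the window by one byte costs only a rotation and two XORs, but the resulting value is just an XOR of rotated table entries, which is precisely the kind of algebraic structure a cryptanalyst can untangle.

```python
import random

random.seed(1234)                 # in a keyed chunker this table would be secret
TABLE = [random.getrandbits(64) for _ in range(256)]
MASK = (1 << 64) - 1

def rotl(x: int, r: int) -> int:
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK

def buzhash_stream(data: bytes, w: int):
    """Yield the hash of every w-byte window with O(1) work per byte."""
    h = 0
    for j in range(w):                       # hash of the very first window
        h = rotl(h, 1) ^ TABLE[data[j]]
    yield h
    for i in range(w, len(data)):
        # Slide by one byte: rotate, remove the outgoing byte, add the incoming one.
        h = rotl(h, 1) ^ rotl(TABLE[data[i - w]], w) ^ TABLE[data[i]]
        yield h

hashes = list(buzhash_stream(b"an example buffer that we slide a window over", 16))
# A chunker would then declare a boundary whenever, e.g., h % 4096 == 0.
```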

However, algebraic structure is the favourite breakfast of every cryptanalyst. All our attacks abuse this algebraic structure to recover the chunking secret of the (keyed) CDC schemes of Borg, Restic, Tarsnap, Bupstash and Duplicacy. This allows fingerprinting again.

Mmhhh... tasty polynomials

Did you really break them?

In the paper, we discuss at length the real-world impact of our attacks. One requirement for our attacks is that we know at least one file and the exact lengths of its corresponding chunks. The first part of the assumption can be fulfilled if, say, the user backs up their Downloads folder and the attacker sends them an email with an attachment. The second part is a bit trickier to achieve: compression, padding, and packing chunks together add a lot of uncertainty when trying to learn the actual length of a chunk. This actually prevents some of our attacks (which are rather focused on the chunking scheme itself) from being applied to the full backup system.

This is not the case for all systems: we managed to build attacks against three backup systems when considered as a whole.

We disclosed our findings to the appropriate vendors. More details in the paper.

Are my backups secure?

Maybe, likely yes. First of all, for someone to have fingerprinted your backup, you would have to have put your backup data on an untrusted server. Some people might be hosting their backups on servers under their control, which would be fine. This is the case with my backup, though I am assuming that my office mate Matteo has not tampered with my server.

Second, there are a few obstacles to exploitation, which we explained above and, at greater length, in the paper.

How do we even fix this?

Here is the second major contribution of our paper. We propose a novel security notion for (keyed) CDC schemes. Intuitively, you can think of the chunking scheme as a deterministic scheme that, for any position, tells you 1 if you need to chunk there and 0 otherwise, depending on the content. Our security notion says that, for computationally bounded adversaries, this decision bit should be indistinguishable from tossing a coin (of the appropriate bias) independently of the file content. This is a rather strong notion, but we show that it is achievable with a simple construction.
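As a rough sketch (not the paper's formalization), one can phrase the notion as a distinguishing game: the adversary gets an oracle for the decision bit at positions of its choice and must tell the real keyed chunker apart from an independent biased coin.

```python
import random
import secrets

def indistinguishability_game(adversary, keyed_boundary_bit, p: float) -> bool:
    """One run of a (sketched) chunking-security game.

    keyed_boundary_bit(key, data, i) -> bool is the scheme's decision bit.
    """
    world = random.randrange(2)              # 0 = independent coin, 1 = real chunker
    key = secrets.token_bytes(32)
    coin_cache: dict[tuple[bytes, int], bool] = {}   # ideal world answers consistently

    def oracle(data: bytes, i: int) -> bool:
        if world == 1:
            return keyed_boundary_bit(key, data, i)
        return coin_cache.setdefault((data, i), random.random() < p)

    guess = adversary(oracle)
    # The scheme meets the notion if every efficient adversary wins with
    # probability only negligibly better than 1/2.
    return guess == world
```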

Despite its simplicity, our construction does incur some overhead. In particular, it requires one AES block cipher evaluation per byte of the file. We think this is almost inevitable in order to obtain cryptographic security (though we would like to be proven wrong) for the specific class of chunkers we analyze.
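To give a flavour of what such a construction can look like (an illustrative sketch, not necessarily the exact scheme or parameters from the paper): treat AES under a secret chunking key as a PRF over a 16-byte sliding window and place a boundary whenever its output falls below a threshold tuned for the desired average chunk length.

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def make_prf_chunker(key: bytes, avg_len: int = 4096):
    """One AES call per position: place a boundary whenever the PRF output
    on the last 16 bytes falls below a threshold tuned for avg_len."""
    enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()  # raw AES as a PRF
    threshold = (1 << 128) // avg_len

    def is_boundary(data: bytes, i: int) -> bool:
        if i < 16:                                   # not enough context yet
            return False
        block = enc.update(data[i - 16:i])           # ECB: one independent block
        return int.from_bytes(block, "big") < threshold

    return is_boundary

is_boundary = make_prf_chunker(os.urandom(32))       # 256-bit chunking key
```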

Due to what we discussed above, one could be tempted to use a weaker mitigation, such as using padding to cause uncertainty in the actual length of the chunks. In the paper, we discuss this in more detail. However, the general gist is: if your CDC algorithm is too weak, padding alone won't save you.
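For instance, a simple padding mitigation only hides the low-order part of each length:

```python
def padded_length(chunk_len: int, bucket: int = 256) -> int:
    """What the server sees if every encrypted chunk is padded up to the
    next multiple of `bucket` bytes before upload (illustrative parameters)."""
    return -(-chunk_len // bucket) * bucket      # ceiling to the next bucket
```

Padding coarsens the fingerprint but does not remove it: a long enough sequence of bucketed lengths can still single out a file, and it does not help at all if the chunking secret itself can be recovered.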

Still, our security model is focused only on the CDC part of these backup systems. In fact, one might observe that some files will induce very regular chunking patterns. For example, backing up a file consisting mostly of 0x00 bytes will create all chunks of the same size. This is an inherent limitation of CDC schemes: they must make these patterns evident in order to allow deduplication.

One good open question for the community is then the following: can we come up with a security model that captures the security of a backup system as a whole, including encryption and chunking, but also compression, padding, and deduplication, while still allowing for efficient deduplication with efficient constructions?

Oof, that's a long question. Let's break it down.

We think this is a great open research question, since there are many challenges involved. Here is an illustrative example. Let's say that we want a sort of left-or-right definition where the adversary gets to submit two files to the client, and then has to guess which one was uploaded by the client (after chunking, encryption...). How does one prevent trivial wins? It's not easy to come up with a definition which is not a self-referencing tautology ("the adversary can only submit pairs of files which end up being indistinguishable to the adversary after the backup") since files which are somewhat similar can still behave quite differently after chunking and deduplication. Min-entropy arguments seem to fail because files with the same entropy can have very different chunking behaviour as well.

If you have an idea, or you want to discuss this, feel free to message me! I'd be happy to chat about this. Also, if you happen to be in Sofia, come talk to me at RWC!