tl;dr
I’ve been comparing crates on crates.io against their upstream repositories in
an effort to detect (and, ultimately, help prevent) supply chain attacks like
the xz backdoor[1], where the code published in a package doesn’t match the
code in its repository.
The results of these comparisons for the most popular 999[2] crates by download count are now available. These come with a bunch of caveats that I’ll get into below, but I hope it’s a useful starting point for discussing code provenance in the Rust ecosystem.
No evidence of malicious activity was detected as part of this work, and approximately 83% of the current versions of these popular crates match their upstream repositories exactly.
Background
I’ve been employed by the Rust Foundation[3] to work on security matters for a bit over a year now. My focus has mostly been on the crate ecosystem thus far, and especially around supply chain security.
After the xz backdoor[1], one question that immediately came up was “could
a crate be compromised in the same way?”. Perhaps an even more urgent question
was “is a crate compromised in the same way?”. We need tooling to be able to
answer those questions in an ongoing way.
Analysis
I’ve built a standalone tool to analyse crate versions, which I intend to eventually turn into something that performs ongoing analysis of newly published crate files. (Once that’s ready, this will be open sourced — the current tool relies on personal tooling I’ve built to mirror crates.io, which is (a) ugly as hell, and (b) not useful to anyone except me in its current form. The methodology is described below, though.)
For now, I’ve analysed the top 999[2] crates on crates.io.
As part of that, I also built a rough and ready tool to visualise the results for spot checking purposes, which I’ve now turned into a static site generator and have used to publish the results of that analysis.[4]
Methodology
The short version here is:
- Take every version of each crate.
- See if the manifest defines a repository.
- See if the crate file includes VCS metadata.
- Clone the repository at the given revision.
- Run cargo package to rebuild the crate file.[5]
- See if it matches!
Simple, right? Right?
Well, as it turns out, there are a few issues.
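Before getting into those, the final "see if it matches" step can be pictured roughly as follows. This is not the code of the tool described here; it is a minimal sketch that assumes the published crate file and the rebuilt one have already been unpacked into two directories, and the names collect_files and differing_paths are made up for illustration.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every file under `root`, keyed by its path relative
/// to `root`, with the raw file contents as the value.
fn collect_files(root: &Path) -> io::Result<BTreeMap<PathBuf, Vec<u8>>> {
    let mut files = BTreeMap::new();
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
            } else {
                let relative = path
                    .strip_prefix(root)
                    .expect("every visited path is under root")
                    .to_path_buf();
                files.insert(relative, fs::read(&path)?);
            }
        }
    }
    Ok(files)
}

/// Return the relative paths that are missing on one side or whose contents
/// differ. An empty result means the two trees match exactly.
/// Assumption: both arguments are directories holding unpacked crate files.
fn differing_paths(published: &Path, rebuilt: &Path) -> io::Result<Vec<PathBuf>> {
    let a = collect_files(published)?;
    let b = collect_files(rebuilt)?;
    let mut diffs: Vec<PathBuf> = Vec::new();
    for (path, contents) in &a {
        if b.get(path) != Some(contents) {
            diffs.push(path.clone());
        }
    }
    for path in b.keys() {
        if !a.contains_key(path) {
            diffs.push(path.clone());
        }
    }
    diffs.sort();
    Ok(diffs)
}
```

Comparing file by file, rather than comparing the two archives byte for byte, makes it possible to report which files differ, which matters for the special cases that follow.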
Issues
There are, in fact, a bunch of ways the above can fail.
Symlinks
This isn’t an issue for the vast majority of crates, but I spent a fair bit of time tracking this down.
Basically, if you have a symlink in your repo, and you clone it on Windows without symlink support enabled[6], Git will helpfully turn each symlink into a regular file. That file’s contents will be the path of the symlink’s target, rather than the contents of the file it points to.
Now, if it’s a source file, you’re probably going to notice right away (since your builds will fail), but for things like README files and licences, you probably won’t. And, to make matters worse, they’re pretty much the most commonly symlinked files, particularly in multi-crate workspaces.
For now, I’ve elected to give crate versions that otherwise match their repos a special yellow sort-of-OK state. Still, it’s not lost on me that this might be a potential vector of attack in the future. Realistically, the fix here is probably to encourage crate developers to publish their crates out of (non-Windows) CI. (More on that in a bit.)
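To illustrate the kind of check involved, here is a minimal sketch of how a flattened symlink might be recognised. It is an assumption about how such a heuristic could work, not the tool's actual logic: it takes the bytes of the file as published and the link target recorded in the repository, and the function name is hypothetical.

```rust
/// Returns true when `published` (the bytes of a file in the crate) looks like
/// a symlink that was flattened into a regular file, i.e. its entire content
/// is the link's target path as recorded in the repository.
/// Hypothetical helper for illustration only.
fn looks_like_flattened_symlink(published: &[u8], link_target: &str) -> bool {
    // A path that is not valid UTF-8 is out of scope for this sketch.
    let text = match std::str::from_utf8(published) {
        Ok(t) => t,
        Err(_) => return false,
    };
    // Tolerate a trailing line ending in case some tool appended one.
    let trimmed = text.trim_end_matches(|c: char| c == '\n' || c == '\r');
    !link_target.is_empty() && trimmed == link_target
}
```

A README that contains only the text ../README.md would be flagged by this check, while a README with real content would not.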
Stuff straight up not existing
Just because you declare a repository in a Cargo manifest, doesn’t mean that
the repo still exists. (Or, indeed, ever existed.) And that’s before even
getting to submodules. Or revisions — just because Cargo saw a revision in a
local repository doesn’t mean it ever got pushed to a public code host. It just
has to be committed locally to avoid needing to use --allow-dirty.
Speaking of…
Dirty crates
If you publish with something like cargo publish --allow-dirty, then that
lovely VCS info file doesn’t get included in the crate file.
This is probably the right choice on the Cargo side, but I do feel that we’ll
lose the ability to otherwise verify repos in some cases where someone just has
an extra set of test cases in their working directory and used --allow-dirty
to get around the requirement that the Git tree is clean.
Build failures
This isn’t really an issue in the top crates, but in testing the deeper reaches of the crate ecosystem, some crate versions just straight up don’t build any more — presumably because they relied on submodules that no longer populate correctly, or because they relied on nightly features that no longer exist.
It’s hard to generate a crate file to test when the crate doesn’t build.
Workspaces, again
I mentioned workspaces earlier, but another problematic case is users of workspaces on older versions of Cargo. Before Cargo 1.57 (December 2021), crate files built from subdirectories of repositories didn’t have that fact annotated in their VCS info.
In theory, it would be possible to search the repository to try to discover where a member crate is built from. Alas, that exercise has been left for another day, so those crates will currently show up as not being found in the repository.
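For what it's worth, that search could look something like the sketch below. It is hypothetical and deliberately naive: it matches a name line inside [package] by scanning lines rather than parsing TOML properly, and it returns the first hit, so a real implementation would need to be more careful. The function names are invented for this example.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Return true if the manifest text appears to declare `name = "<crate_name>"`
/// inside its [package] section. Line-based on purpose; not a TOML parser.
fn manifest_declares(manifest: &str, crate_name: &str) -> bool {
    let mut in_package = false;
    for line in manifest.lines() {
        let line = line.trim();
        if line.starts_with('[') {
            in_package = line == "[package]";
        } else if in_package {
            if let Some((key, value)) = line.split_once('=') {
                if key.trim() == "name" && value.trim().trim_matches('"') == crate_name {
                    return true;
                }
            }
        }
    }
    false
}

/// Search `repo_root` for a directory whose Cargo.toml declares the given
/// package name, returning that directory relative to the repository root.
fn find_member_crate(repo_root: &Path, crate_name: &str) -> Option<PathBuf> {
    let mut pending = vec![repo_root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        if let Ok(text) = fs::read_to_string(dir.join("Cargo.toml")) {
            if manifest_declares(&text, crate_name) {
                return dir.strip_prefix(repo_root).ok().map(Path::to_path_buf);
            }
        }
        let entries = match fs::read_dir(&dir) {
            Ok(entries) => entries,
            Err(_) => continue,
        };
        for entry in entries.flatten() {
            // Skip Git internals and build output; neither holds members.
            let skip = matches!(entry.file_name().to_str(), Some(".git") | Some("target"));
            let path = entry.path();
            if !skip && path.is_dir() {
                pending.push(path);
            }
        }
    }
    None
}
```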
Very old things
And, finally, Cargo added support for generating the VCS info file in version 1.30, which was released in October 2018. Very old crate versions simply won’t have this file, and hence can never be verified.
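For reference, Cargo writes the VCS info as a small JSON file named .cargo_vcs_info.json at the root of the crate file, holding the Git revision under git.sha1 and, on newer Cargo versions, the subdirectory under path_in_vcs. The sketch below pulls those two values out with plain string searching. It is an illustration only: the helper names are hypothetical, the revision in the usage example is a placeholder, and real code should use a proper JSON parser.

```rust
/// Naively pull the string value of `key` out of a small JSON document.
/// This is not a JSON parser: it ignores escapes and nesting, and simply
/// takes the first occurrence of the quoted key.
fn string_field(json: &str, key: &str) -> Option<String> {
    let needle = format!("\"{}\"", key);
    let after_key = &json[json.find(&needle)? + needle.len()..];
    let after_colon = after_key.trim_start().strip_prefix(':')?.trim_start();
    let value = after_colon.strip_prefix('"')?;
    let end = value.find('"')?;
    Some(value[..end].to_string())
}

/// The two pieces of VCS info that matter for verification.
#[derive(Debug, PartialEq)]
struct VcsInfo {
    /// Git revision the crate file was packaged from.
    sha1: String,
    /// Subdirectory of the repository, absent in files from older Cargo.
    path_in_vcs: Option<String>,
}

/// Returns None when no revision can be found, in which case the crate
/// version cannot be checked against its repository.
fn parse_vcs_info(json: &str) -> Option<VcsInfo> {
    Some(VcsInfo {
        sha1: string_field(json, "sha1")?,
        path_in_vcs: string_field(json, "path_in_vcs"),
    })
}
```

Calling parse_vcs_info on a file such as {"git":{"sha1":"0123abcd"},"path_in_vcs":"crates/foo"} yields both fields, while a file from an older Cargo yields a revision and no path.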
Results
Given the above, here’s what I found. Of the most recent versions of the top 999 crates:
- 826 crates match their upstream repositories at the revision they were built at.
- 74 crates have revisions that cannot be found in their repositories, whether due to later squash merges, rebases or revisions simply not being pushed.
- 73 crates do not have VCS info, either because they were built with old Cargo versions, built with --allow-dirty, or not built from a repo clone at all.[7]
- 7 crates do not declare a repository in their Cargo manifest.
- 7 crates would match their upstream repository but for one or more symlinks being incorrectly handled.
- 3 crates declare repositories that do not exist.
- 3 crates have submodules that do not exist.
- 3 crates cannot be found within their repositories.
- 3 crates cannot be built due to cargo package errors.
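As a quick sanity check on those numbers, the counts can be tallied and turned into percentages. The figures are copied from the list above; the labels and function names are just for this sketch.

```rust
/// Outcome counts for the most recent version of each of the top 999 crates.
const RESULTS: [(&str, u32); 9] = [
    ("matches repository", 826),
    ("revision not found", 74),
    ("no VCS info", 73),
    ("no repository declared", 7),
    ("matches apart from symlinks", 7),
    ("repository does not exist", 3),
    ("submodule does not exist", 3),
    ("crate not found in repository", 3),
    ("cargo package failed", 3),
];

/// Sum of all outcome counts.
fn total() -> u32 {
    RESULTS.iter().map(|(_, n)| n).sum()
}

/// Share of the total that `count` represents, as a percentage.
fn percentage(count: u32) -> f64 {
    f64::from(count) * 100.0 / f64::from(total())
}
```

The counts add up to 999, and percentage(826) comes out at roughly 82.7, which is the "approximately 83%" quoted at the top.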
Going back further, those 999 crates have published 33,085 versions in total. The major trends looking back further into history are that fewer crates have repository metadata, and there are more errors related to not being able to find a crate in a workspace and more missing repositories. Both of these feel intuitively correct: the further back into history we go with these crates, the more likely it is that they were packaged with older versions of Cargo, and the more likely it is that their repository history has shifted in ways that we can’t unpick in 2024.
Only 8 crate versions straight up don’t match their upstream repositories. None
of these were malicious: seven were updates from vendored upstreams (such as
wrapped C libraries) that weren’t represented in their repository at the point
the crate version was published, and the last was the inadvertent inclusion of
.github files that hadn’t yet been pushed to the GitHub repository.
Future work
An obvious next step here is to extend this to the entire crates.io corpus. I intend to perform this analysis in the next couple of weeks.
Rather than further extending the static site that I’ve published today, I would also like to integrate this into crates.io for every crate, and run this check each time a new crate version is published. Doing this will, of course, require the consensus of the crates.io team, and work to design the UI and UX for this in a way that is immediately useful to the casual Rust user.
I think it’s also critical that we start providing off-the-shelf GitHub actions (and equivalents for other popular code hosts) that make it easier to publish directly out of repositories on the host, rather than crates being published from developer desktops. This is also a critical step on the road towards supporting a full trusted publishing pipeline.
And, of course, there’s plenty that can be done to improve the analysis.
The handling of broken symlinks is a late-added heuristic that I’m still not 100% sure I like.
Discovering crates within workspaces published from old Cargo versions would improve the accuracy of the checks.
Finally, getting a better idea of what types of changes exist is also important: analysing the top 999 crates didn’t really result in enough crate versions that didn’t match their upstreams to perform any real analysis, but a larger dataset will likely give us a lot to dig into. This is important because it will allow us to develop tailored best practice advice for different real world scenarios.
In summary
If there’s a backdoor attack lurking in the crates ecosystem, then it’s lurking pretty deep at present. The popular crates that we all rely on day to day generally appear to be what they say they are.
Of course, just because a package is the same as its upstream repository, that doesn’t mean that the repository itself is safe. This just mitigates one potential area of supply chain interest. (Alas, there are no silver bullets.)
I’m looking forward to developing this work further in conjunction with the Rust project, Rust Foundation, crates.io team, and others, but also ensuring that we broaden the analysis and scanning work that we do as we go.
Lots to be done!