Select Topic Area
Bug
Body
Our builds are breaking because sha256sums of downloaded .tar.gz files have changed. This happened sometime over the weekend or this morning (2023-01-30). Did something change in how GitHub generates .tar.gz files?
In an older discussion:
What it sounds like you’re asking for is a 100% guarantee that the release tarballs will not change. That’s something that we aren’t going to guarantee. So if you need to be absolutely certain that a release artifact won’t change, uploading a release artifact is the way to go, whether it’s a duplication in the common case or not.
So it could be a change on the GitHub side, but I'd expect to see an announcement before it happens.
Edit: But per @SanjayVas's link, .tar.gz with URLs under /archive/refs/tags/ should have stable hashes, so this looks like a GitHub bug.
Edit 2: Official response
The default compression for Git archives has recently changed.
As a result, archives downloaded from GitHub may have different checksums even though the contents are completely unchanged. GitHub doesn't guarantee the stability of checksums for automatically generated archives.
These are marked with the words "Source code (zip)" and "Source code (tar.gz)" on the Releases tab.
If you need to rely on a consistent checksum, you may upload archives directly to GitHub Releases.
These are guaranteed not to change.
Edit 3: from @vtbassmatt:
Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).
E.g. one error (from Bazel):
Error in download_and_extract: java.io.IOException: Error downloading
[https://github.com/AprilRobotics/apriltag/archive/refs/tags/v3.2.0.tar.gz] to [...]/v3.2.0.tar.gz: Checksum was 3ce5fae0355961a0be846363ce6b6b394b7e179f8ee5354907a47c8764f40639 but wanted 111a93a5315f8b8c2a36fa911403236032a819e8f50e8845548fe2d7dd1e5db5
The new checksum matches what I get when downloading manually:
$ curl -sS https://github.com/AprilRobotics/apriltag/archive/refs/tags/v3.2.0.tar.gz | sha256sum
3ce5fae0355961a0be846363ce6b6b394b7e179f8ee5354907a47c8764f40639  -
Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).
3 replies
I think the comment you linked says that only your own release archives will be stable, not the source code archives generated by github. They explicitly say the source code archives are not guaranteed to be stable.
That cannot be said, however, for repository code download archives found at archive/v6.0.4.
I guess the part that's unclear to people is what was meant by "repository release archives".
Both:
- archive/v6.0.4.tar.gz
- archive/refs/tags/v6.0.4.tar.gz
(when fitted into the url template: https://github.com/user/repo/{link_location})
are generated by GitHub. Currently, they both redirect to the same third location:
- https://codeload.github.com, an internal subdomain of GitHub, followed by the repository slug user/repo, followed by tar.gz/refs/tags/v6.0.4
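You can observe the redirect yourself (a sketch; user/repo and the tag are placeholders):
$ curl -sI https://github.com/user/repo/archive/refs/tags/v6.0.4.tar.gz | grep -i '^location'
location: https://codeload.github.com/user/repo/tar.gz/refs/tags/v6.0.4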
I am not sure what the original GitHub employee who stated this meant. Perhaps they meant that GitHub reserves the right to change the internal implementation of archive/v6.0.4 and make it do something else instead.
I think the comment you linked says that only your own release archives will be stable, not the source code archives generated by github. They explicitly say the source code archives are not guaranteed to be stable.
Following up on @eli-schwartz's reply, what the comment stated is that archives accessed via that URL template would be stable. This means source archives that are automatically generated for a release/tag.
We are having the same problem. Please can someone get some eyes on this?
0 replies
0 replies
From our tests we found only folder permissions to be changed: some folders now have group write added (for a file from Feb 25, 2021).
2 replies
This looks similar to #8149
@gudmundur provided some good feedback last time, maybe they can help again?
1 reply
@gudmundur provided some good feedback last time, maybe they can help again?
I wouldn't hope for it; they have been at Vercel for a year.
0 replies
And just so it's recorded, this also breaks Homebrew install from source:
$ brew install --build-from-source mockolo
==> Fetching mockolo
==> Downloading https://github.com/uber/mockolo/archive/1.8.0.tar.gz
==> Downloading from https://codeload.github.com/uber/mockolo/tar.gz/refs/tags/1.8.0
Error: mockolo: SHA256 mismatch
Expected: 9d289eede8080bcf3f0fbd2ec4f6a0616a78743c09cd58575e48e6e4b946f357
Actual: 8a3ff3707c58215e48eb62d39f6347f95b7099cea6d0370f50da61f018a9c2d2
File: /Users/matt.robinson/Library/Caches/Homebrew/downloads/2d1930525a06b4198573b12c74e2b17cca9aea951a4f38c05feb9b466953b5a1--mockolo-1.8.0.tar.gz
To retry an incomplete download, remove the file above.
0 replies
0 replies
A reply from GitHub (@bk2204):
bazel-contrib/SIG-rules-authors#11 (comment)
Hey,
I'm one of the engineers in the Git Systems org at GitHub. I think there's been a misinterpretation of what we guarantee as far as stability.
If you generate a release for a particular tag, and you upload your own assets, such as a tarball or binaries, we'll guarantee those don't change. However, the automated "Source code (tar.gz)" and "Source code (zip)" links, as well as any automated archives we generate, aren't guaranteed to be stable. That's because Git doesn't guarantee stability here and we rely on Git to generate those archives on the fly, so as we upgrade, things may change.
If you need a stable source code archive, please generate a release and upload your own archive as part of this process, and then you can reference those with stable hashes.
To give you an example as to what's stable and what's not, if you look at the latest Git LFS release at https://github.com/git-lfs/git-lfs/releases/tag/v3.3.0, all of the Assets entries except the two "Source code" links at the bottom are guaranteed to be stable (since those two are autogenerated). You'll notice we ship our own stable tarball and signed hashes as part of the assets, and that works.
I apologize for the confusion here, and hopefully this clarifies things.
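For anyone unfamiliar with that upload flow, a minimal sketch using the official gh CLI (the tag, file names, and notes are placeholders):
$ tar -czf myproject-1.2.3.tar.gz myproject/
$ sha256sum myproject-1.2.3.tar.gz > myproject-1.2.3.tar.gz.sha256
$ gh release create v1.2.3 myproject-1.2.3.tar.gz myproject-1.2.3.tar.gz.sha256 --notes "Release 1.2.3"
Assets uploaded this way are covered by the stability guarantee; only the autogenerated "Source code" links are not.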
3 replies
The problem with this is that this approach is commonly used to depend on third-party repos' releases. It's not as though you can upload the tarballs yourself to the releases pages of your dependencies, and maintainers have been relying on these .tar.gz files as good for their users to consume. I think this was completely unknown, and the result is that the ecosystem around GitHub releases is massively broken now.
The problem is that people assumed a guarantee that doesn't exist.
Neither Git nor GitHub guarantees stable hashing, but everyone assumed it was stable.
It is the assumption that is wrong here, not the actual behavior.
However, GitHub support had provided some level of guarantee of stable hashing for some URLs: bazel-contrib/SIG-rules-authors#11 (comment)
Also, GitHub had previously broken this and rolled it back, which at least implies some intention of providing stable hashing (since the rollback was in response to community reports of sha256 changes).
.tar.gz stability is very important for several projects.
0 replies
0 replies
Our bazel build also broke today because of a change in the SHA256 of com_github_googleapis_google_cloud_cpp.
I suspect that even if we update this SHA, we might still be broken because external dependencies have their own changed SHA256s.
1 reply
The majority of tensorflow/* repositories won't build because of the chain of external dependencies.
3 replies
0 replies
Surely there is someone at GitHub/Microsoft with some understanding of the impact of this who is able to draw a line under it.
If not, I think they will be discussing this in a board meeting well after most devs have forgotten about it.
0 replies
0 replies
I have been thinking about this problem for a while, "safe and comfortable in the knowledge that it will never break". 🤣
So one good outcome of this actually occurring and then being reverted: I have gone ahead and posted an email to the git mailing list about the possible solution I've been thinking of for a while now: https://public-inbox.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/T/
I live in hope that we'll eventually see a world where the manpage for git-archive says "git archive is reproducible and here is why", and then no one ever has to have this debate again.
1 reply
Could you please make sure that they make hard guarantees? I understand that they don't want to guarantee anything about an external gzip, but keeping the door open to changing the tar format seems wrong and will just lead to the same issue all over again. If they want to change the format, they should give it a new name, e.g. tarx where x is just some number for the new format, or make sure there is some way to get the old tar back.
This may not be the right place to say it, but I am going to anyway, just in case anyone is curious about how to solve this issue in the long term.
This change didn't break Nix, because Nix computes a NAR Hash. It doesn't hash the tarball itself, because that is unreliable. It hashes the result of extracting it. The NAR Hash is the recursive hash of the directory after it has been extracted. This NAR Hash is then what is used to create lockfiles, and is what is relied upon and stored, rather than the hash of the actual tarball itself, which is sensitive to input library changes, such as the zlib compression library.
You can use functions in the Nix expression language to fetch tarballs without their hash affecting builds, as long as their extracted content remains the same.
nix-repl> x = builtins.fetchTarball https://files.pythonhosted.org/packages/0c/e6/cad3afda7c4ea91bc10e39af109aa16f4225b164350645c98b23a1a31ae5/hledger-utils-1.9.0.tar.gz
nix-repl> x
"/nix/store/ixlibf24im3ja9w2hqhgf9rp7qj2kj2n-source"
$ nix hash path /nix/store/ixlibf24im3ja9w2hqhgf9rp7qj2kj2n-source
sha256-/bGpun2avILSwlpt3L0hqAJjpfp/KV8GhnG2knIf4vI=
Here is a relevant Twitter thread, and it would be helpful if more developers knew not to rely on tarball hashes, but only on the post-extraction content: https://twitter.com/MatthewCroughan/status/1620204622639149056
PS: I made the same comment here https://github.com/orgs/community/discussions/45830#discussioncomment-4824814, but am re-posting it here just in case it gets lost in the GitHub thread timeline and is automatically hidden in future.
12 replies
@eli-schwartz In Nix, a Fixed Output Derivation is a directory or file whose contents are hashed. If the corrupt tarball produces a different directory, it's not going to produce the same sha256. I don't understand what you think a malicious tarball is going to do to circumvent this check. The extraction in Nixpkgs is handled inside a sandbox, so you don't have to worry about tar itself exploiting your host system, as explained in https://github.com/orgs/community/discussions/45830#discussioncomment-4825276
I don't understand what you think a malicious tarball is going to do to circumvent this check.
It could extract files to /etc/shadow or ~/.bashrc or ~/.very_important_tokens, for starters.
So... we're back to "this isn't safe, unless further conditions are fulfilled, by the way which sandbox do you use and what security auditing has it been through".
Which is a far cry from "guys, it's simple, just checksum the extracted contents".
Because that is not simple. It adds more work that has to be done, and the code for doing it is more complicated, and you have to start thinking about sandboxes, then integrating it into your workflows.
It could extract files to /etc/shadow or ~/.bashrc or ~/.very_important_tokens, for starters.
How is it going to do that in a sandbox? Assume trust of the sandbox, otherwise it's a pointless discussion since there is no threat model.
You have to start thinking about sandboxes, then integrating it into your workflows.
With Nix, I can't think of one time I've had to think deeply about the sandbox. It's just there, and it's the only way you can use Nix, unless you turn it off, which takes more effort than working with it.
My test library just reverted to the correct hash! Thanks to everyone at GitHub for your hard work getting this reverted.
$ curl -sS https://github.com/AprilRobotics/apriltag/archive/refs/tags/v3.2.0.tar.gz | sha256sum
111a93a5315f8b8c2a36fa911403236032a819e8f50e8845548fe2d7dd1e5db5  -
0 replies
0 replies
Just fixed some hash mismatches from today, hopefully this issue is now fixed (fingers crossed)
URL: https://github.com/bminor/glibc/archive/2.36-84-g93967a2a7bbdcedb73e0b246713580c7c84d001e/glibc-2.36-84-g93967a2a7bbdcedb73e0b246713580c7c84d001e.tar.gz
-oid sha256:38ff9db8cd2ad2d9c0779711fbe9c9c7d20bc7a5f88b2c903832f683bd7f9a39
-size 37451492
+oid sha256:f14c59bf3c1c772ba21c0d2b8d9eeada2a7609f60f1a026bdbba86c06ed06423
+size 37657461
We use hashes in Buildroot to check package integrity, so when these change, the build breaks:
https://github.com/skiffos/SkiffOS/actions/runs/4050223636/jobs/6967405410
0 replies
@vtbassmatt: thanks for the quick fix on this, and for (hopefully) advance notice of future changes.
Many of us here represent packaging ecosystems, and many of us are still confused about the guarantees provided by release archives on GitHub. Would it make sense to start some sort of working group?
It would be nice if more than just the Bazel folks were looped in on decisions about stable release hashes and other software supply chain issues. On this thread alone, I see:
- @memsharded from Conan
- @eli-schwartz from arch
- @MatthewCroughan from nix
- others from Bazel, Homebrew, Easybuild, Envoy
- me from @spack
As @rsc mentioned on HN, this is an opportunity for GitHub to take the lead on supply chain security.
This isn’t the first hash change I’ve seen on GitHub, but I remember the last one, and it didn’t cause this much uproar. There was no revert. We just quietly re-hashed everything. I think the difference is indicative of how important stable release artifacts have become to many (all?) software communities. They’re at the bottom of every stack.
I’m sure a lot of these folks would be eager to be in closer discussions with GitHub, and I think you’d get valuable feedback on potential impacts in advance.
Thanks again.
4 replies
add @BillyONeal for vcpkg (I am just making sure vcpkg is included ;) )
You probably also want to add eli-schwartz in the meson WrapDB category.
Craig Scott for CMake FetchContent stuff?
+1, representing EasyBuild here, it would be great to treat this as an opportunity to improve, in cooperation with experienced package managers...
A change to the tarballs served by GitHub also happened back in Sept 2017, and we indeed just dealt with it then (there was no revert), see:
- checksums for source tarballs downloaded from github.com/.../.../archive can change over time easybuilders/easybuild-easyconfigs#5151 (re-opened today since the problem re-emerged)
- Github Archive URL checksums have changed spack/spack#5411
- fix checksums for packages that download from github.com/x/y/archive easybuilders/easybuild-easyconfigs#5162
The latter PR shows that the impact was relatively limited back in Sept'17, but:
- i) This happened only shortly after we started keeping track of checksums for everything we download;
- ii) We have way more software supported in EasyBuild now (~2,000 unique "titles" in Sept'17 vs ~5,000 today) - I'm sure the same is true for other packaging tools;
- iii) Way more software is obtained through GitHub today than there was in Sept'17;
- iv) We live in a different world now, where supply-chain security is a hot topic (as it should be), and many more people are paying attention to things like changes in source tarballs (for good reason), as is clear from the size of this discussion.
👋 hey there, sorry for the delay in responding. I would definitely like to talk to communities with a dependency on archive hashes. I've started that discussion here. If anyone would prefer email for privacy (including if you'd like to schedule a face to face video chat), you can use my GitHub handle @github.com.
5 replies
I was just about to quote the Yocto manual on this. While Yocto users are probably an insignificant number to GitHub, if more projects follow this advice, it will ultimately mean a visible increase in load on GitHub.
And yes, this sanity check saved a lot of people a lot of work. Thanks for that.
FWIW: Even this src-uri-bad QA check isn't always enough; I've just noticed one recipe fetching:
https://opensource.apple.com/tarballs/mDNSResponder/mDNSResponder-878.30.4.tar.gz
which is just redirected to these regenerated archives:
$ wget https://opensource.apple.com/tarballs/mDNSResponder/mDNSResponder-878.30.4.tar.gz
--2023-01-31 10:19:21-- https://opensource.apple.com/tarballs/mDNSResponder/mDNSResponder-878.30.4.tar.gz
Resolving opensource.apple.com (opensource.apple.com)... 17.253.73.201, 17.253.73.206, 2a01:b740:a26:f100::6, ...
Connecting to opensource.apple.com (opensource.apple.com)|17.253.73.201|:443... connected.
HTTP request sent, awaiting response... 302 Redirect
Location: https://github.com/apple-oss-distributions/mDNSResponder/archive/refs/tags/mDNSResponder-878.30.4.tar.gz [following]
--2023-01-31 10:19:21-- https://github.com/apple-oss-distributions/mDNSResponder/archive/refs/tags/mDNSResponder-878.30.4.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/apple-oss-distributions/mDNSResponder/tar.gz/refs/tags/mDNSResponder-878.30.4 [following]
--2023-01-31 10:19:21-- https://codeload.github.com/apple-oss-distributions/mDNSResponder/tar.gz/refs/tags/mDNSResponder-878.30.4
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2479712 (2.4M) [application/x-gzip]
Saving to: 'mDNSResponder-878.30.4.tar.gz.2'
git fetcher to the rescue :).
I'm not that deep into bitbake, but this sounds like a patch waiting to happen - look for GitHub.com in wget's output and issue the QA.
And yes, pretty much git fetcher to the rescue.
look for GitHub.com in wget's output and issue the QA.
That's significantly more complex than the current regex on SRC_URI (which doesn't need to actually fetch anything) :).
I've just had a thought. When GitHub does update the hashing for better compression, everyone relying on the tar hash will update their hashes. This is the ultimate opportunity to change the tar contents, affect the supply chain, introduce vulnerabilities, and have everyone trust you. Something like Nix, which computes the NAR hash (the hash of the extracted contents), will not be affected by this, since it only cares about the content. I think this is much better than worrying about an unlikely tar vulnerability. In a system that only trusts the tar hashes, the original source is not able to take advantage of better compression over time without massive risk of supply chain attack. If you think you can hand me a tarball that can run arbitrary code, for any version of tar that has ever existed, please give it to me so I can experiment with exploits, and I'll buy you a drink of your choice at FOSDEM if you're there!
10 replies
And most of these changes are usually code changes, not tarball refreshes on GitHub.
Again, with Nix and NAR, you actually know the difference between a tarball refresh and a content change, whereas without you do not know the difference, and it is an opportunity for people to introduce malicious content into tarballs.
You trust the tarball, I trust the contents.
You trust the tarball, I trust the contents.
I don't trust any tarball :).
Who did an audit of the tarball contents before the NAR hash was updated?
I don't have enough time in the day to audit 1000 tarballs just because their compression scheme changed; with NAR I don't need to.
Again, with Nix and NAR, you actually know the difference between a tarball refresh and a content change, whereas without you do not know the difference, and it is an opportunity for people to introduce malicious content into tarballs.
And again, this only helps in this specific case, where a lot of tarballs were changed without expected changes in content.
It doesn't help in all the other, usual cases where the content as well as the tarball changed during a package update.
I wonder whether something like an "official hashing script" would be a viable option. Something that GitHub distributes and that looks something like
1. Fetch the corresponding git repo in some way other than the archive download.
2. Take version 1.2.3 of compression tool C.
3. Apply command "C ..." to compress the repo.
4. Verify that the hash of the archive you created matches the hash of the downloaded archive.
While this may be somewhat expensive to execute, I could at least get the git repo and the compression tool using arbitrary third-party channels. With this I wouldn't have to decompress. Also, this is only an issue if hashes actually change.
This would also let GitHub change hashes/compression since it gives users a way to re-verify archives for themselves if hashes change.
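For illustration, steps 1-4 might boil down to a shell recipe like this (the repo, tag, and archive prefix are placeholders; it assumes the pinned git and gzip versions reproduce whatever GitHub runs server-side, which is exactly the information such a script would have to encode):
$ git clone --depth 1 --branch v1.2.3 https://github.com/user/repo
$ git -C repo archive --format=tar --prefix=repo-1.2.3/ HEAD | gzip -n | sha256sum
Compare the resulting hash against the hash of the archive you downloaded; if they match, the download is exactly what the recipe reproduces.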
5 replies
@eli-schwartz If Git guaranteed git-archive reproducibility we wouldn't need such additional scripts, right?
I wish there was a way to get the "cool kids" compressions working in a reproducible way though...
git archive --compression-plugin 😂
I could at least get the git repo
AFAIK, git commit hashes are hashes of the content, so if you know the commit hash and cloned the repo, you don't have to do anything more.
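For example, a minimal sketch with a placeholder repo and commit id: pin the commit instead of an archive hash, and let git verify the objects as it fetches them.
$ git clone https://github.com/user/repo
$ git -C repo checkout 0123456789abcdef0123456789abcdef01234567
$ git -C repo rev-parse --verify HEAD
0123456789abcdef0123456789abcdef01234567
Because the commit id covers the entire tree, a successful checkout of that exact id means the content is what was pinned (modulo SHA-1 collision concerns).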
I wish there was a way to get the "cool kids" compressions working in a reproducible way though...
I believe zstd actually has pretty strong reproducibility guarantees if you use the exact same version.
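For instance, something like this should be deterministic for a given zstd build (a sketch; the tag and compression level are placeholders, and stability across zstd versions is explicitly not promised):
$ git archive --format=tar v1.2.3 | zstd -19 --single-thread -o repo-1.2.3.tar.zst
Running it twice with the same zstd build should produce byte-identical output, since --single-thread avoids the multithreading nondeterminism mentioned later in this thread.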
AFAIK, git commit hashes are hashes of the content, so if you know the commit hash and cloned the repo, you don't have to do anything more.
Yeah but I want to be able to verify an archive hash before I extract it. I don't want to extract to verify its contents. I'm potentially not even interested in the contents, just in why a hash changed. As long as there is a "recipe" to recreate the archive, I'd be totally fine with changing archive hashes.
I think it'd be even better if GitHub published a list of git archive hashes for every commit. Considering the actual commit frequency of even the largest projects, it may not even be that computationally expensive. This way one wouldn't even have to download archives 😆
I believe zstd actually has pretty strong reproducibility guarantees if you use the exact same version.
Thanks for this info. Thinking about it, as long as a compression algo is deterministic, it shouldn't even matter whether it's reproducible between versions or not. It only matters whether it is possible to recreate the archive somehow. I think as long as local timestamps etc are not part of an archive, everything goes.
Different needs then.
Re: determinism in compression algos. Everything's fine and dandy when you're single-threaded. Multithreading introduces nondeterministic factors, especially if you don't know how many threads will be used. And all the "cool kids" compression software is multithreaded nowadays.
Different needs then.
I'm actually surprised how difficult this seems to get right for everyone's needs. As SBOMs become increasingly relevant, I'd expect the requirements to become even more diverse.
Thanks for pointing out the multithreading issue. I wasn't aware, but it totally makes sense that this could produce nondeterminism.
I am thinking that neither "relying on stable checksums" nor "computing the checksum of the content" is a good solution, both for different reasons. I think GitHub should offer a programmatic way to retrieve the expected checksum of a tarball as produced now (maybe as well as checksums of those tarballs as produced before?). This would be similar to DNSSEC/SSHFP. Tools could validate the checksum from the source, rather than relying on it not to change or extracting the content of an unverified tarball.
1 reply
Most of this discussion has moved to https://github.com/orgs/community/discussions/46034#discussioncomment-4846247, but I'll just paste here for posterity on why this isn't safe:
Using an API to get the current checksum, then immediately verifying with it, is not safe. The point of the checksum is to ensure that the file you are receiving is the same one that maintainers approved. It's a very strong way to verify that.
You are trusting two things when you download a tarball from GitHub:
- GitHub (via cert from https download)
- The project
With checksums, contributors or maintainers verify that the project provided a specific release at a specific time, and they record it in their package recipe, which is stored somewhere. When you download and verify an archive, you're verifying against that. The reason you do that is so that if the file is compromised over time, you can detect it. And yes, the initial contribution could have the checksum of a compromised archive, but the key assumption here is that most projects are not compromised most of the time. That's where the security comes from.
With the API model you propose, you are essentially trusting GitHub twice, and you're not verifying anything against the project at all. GitHub may not know that a malicious maintainer has taken over a project, and a GitHub API is going to tell you the current checksum for that tarball regardless of what has happened to the project.
Using this API approach invalidates the check -- you might as well not have a checksum at all, and just accept whatever GitHub sends you over https.
IMHO, as a solution for the future, checksumming the tar before compressing it would eliminate the dependency on the compressor (gzip, zstd, xz...). But it's also not backward compatible.
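A client-side approximation of that idea is to hash the decompressed tar stream instead of the .tar.gz (the URL is a placeholder; this still depends on the tar layer itself staying stable):
$ curl -sSL https://github.com/user/repo/archive/refs/tags/v1.0.0.tar.gz | gunzip -c | sha256sum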
0 replies
Is this happening again?
https://github.com/grpc/grpc/blob/master/bazel/grpc_deps.bzl#L496 wants the sha256 checksum to be 1e490b98005664d149b379a9529a6aa05932b8a11b76b4cd86f3d22d76346f47, but currently:
$ curl -sL https://github.com/envoyproxy/protoc-gen-validate/archive/4694024279bdac52b77e22dc87808bd0fd732b69.tar.gz | sha256sum
21a61b5cbc67a70e8dccd7595f6cccbdc83c8cfc7b7e575168bc51c324a61367 -
32 replies
@vtbassmatt I think it's just the repository name, right?
The point is that it's critical for the hash of the archive to stay the same, as downstream projects check the hash and reject the archive if it changes. The directory name is less important, as it's stripped by strip-components (see the example below).
So keeping the hash the same matters more than keeping the directory name consistent. But since the top-level directory name is part of the archive (and therefore of its hash), it would be best if there were some way to be sure that the directory name stays constant even if the repository is renamed, like what happened with Docker/Moby.
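For example, an unpack step like this (hypothetical file name, GNU tar) never depends on the top-level directory name:
$ mkdir src && tar -xzf v3.2.0.tar.gz --strip-components=1 -C src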
Or that can be considered a docker bug, manually reverified, and you can do a one-time hash update. It's semantically no different from cases where a project force-pushes a tag with different contents (and I've dealt with those and had to do full-tree diffs to verify that the code changes were benign).
Reminder: repository renames aren't guaranteed to maintain any links, although it will redirect for as long as the old repository name isn't reused. This is very frequently an issue when a person has a private project, then moves it to an org, and then re-forks the project to their personal account a month later. Tons of redirects that were working and "stable", post-move, break after the fact. And orgs can and do reuse the names of projects that they renamed.
@eli-schwartz We (buildroot) cannot retroactively change hashes in previous releases. In the case of a hash mismatch we would mirror the original file. Ideally users would not have to trust that mirror and could instead depend on the download from GitHub not having a hash mismatch.