Git hash function transition plan

github.com

215 points by vszakats 8 years ago · 58 comments

benjaminjackman 8 years ago

I only had time to skim it and didn't see it mentioned anywhere, so this might be a good time to suggest multihash: https://multiformats.io/multihash/

Having git use that could be a great opportunity to settle on a de facto standard for encoding hash functions.

What would be the best way to suggest that (if it hasn't been already, though I am guessing it likely has)?
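
For context, a multihash value is a digest prefixed with two unsigned varints: a function code and the digest length. A minimal Python sketch, assuming the codes 0x11 (sha1) and 0x12 (sha2-256) from the multiformats table; the helper names are made up for illustration and are not the API of any multihash library:

```python
import hashlib

# Codes assumed from the multiformats table: 0x11 = sha1, 0x12 = sha2-256.
CODES = {"sha1": 0x11, "sha256": 0x12}
NAMES = {code: name for name, code in CODES.items()}


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, high bit set means more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, position of the next unread byte)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def multihash(name: str, data: bytes) -> bytes:
    """Prefix the digest with its function code and its length."""
    digest = hashlib.new(name, data).digest()
    return encode_varint(CODES[name]) + encode_varint(len(digest)) + digest


def split_multihash(mh: bytes):
    """Return (hash name, digest), checking the length prefix."""
    code, pos = decode_varint(mh)
    length, pos = decode_varint(mh, pos)
    digest = mh[pos:]
    if len(digest) != length:
        raise ValueError("digest length does not match the length prefix")
    return NAMES[code], digest


mh = multihash("sha256", b"hello world\n")
print(mh[:2].hex())            # code and length prefix: 1220
print(split_multihash(mh)[0])  # sha256
```

The prefix costs two bytes for these functions, and a reader that meets an unknown code can at least fail loudly instead of misreading the digest.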

  • ris 8 years ago

    But that does not solve the problem. Multihashes are not unique identifiers of a message, which is what git mostly uses hashes for. Now, instead of a single unique identifier, you have N possible ones, where N is the number of hash implementations your multihash library has. And it is not possible to convert between two hash types without having the original message.

  • rjzzleep 8 years ago

    Wasn't there an issue with JWT that was summarized as this:

    "This is a good idea, but it doesn't solve the underlying problem: attackers control the choice of algorithm" ?

    Here's another quote from the Wireguard paper[1]:

    "Finally, WireGuard is cryptographically opinionated. It intentionally lacks cipher and protocol agility. If holes are found in the underlying primitives, all endpoints will be required to update"

    [1]: https://www.wireguard.com/papers/wireguard.pdf

    • benjaminjackman 8 years ago

      Sorry, I wasn’t suggesting that any algorithm be allowed, just that whichever one is chosen next be encoded in a way that lets it be replaced again if needed, and also, if possible, that the numeric id for that algorithm be standardized beyond just git.

      https://github.com/multiformats/multihash/blob/master/README...

    • abritinthebay 8 years ago

      That’s only true of JWT if you allow your server to accept all algorithms.

      You don’t actually have to.

      • rblatz 8 years ago

        Correct, your token authority should specify which algorithms are valid, and your clients should self configure via a secure back channel to only accept the algorithms your token authority issues.
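
A minimal Python sketch of that point, using only the standard library: the verifier pins the accepted algorithms itself rather than trusting the token's `alg` header. The function names and the HS256-only allow-list are assumptions for illustration, not any particular JWT library's API.

```python
import base64
import hashlib
import hmac
import json

ALLOWED_ALGS = {"HS256"}  # chosen by the verifier, never by the token


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(payload: dict, key: bytes) -> str:
    """Build a compact HS256 token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    mac = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(mac)


def verify(token: str, key: bytes) -> dict:
    """Return the payload, or raise ValueError if the token is not acceptable."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    # Reject anything outside the allow-list before touching the signature.
    if header.get("alg") not in ALLOWED_ALGS:
        raise ValueError(f"algorithm {header.get('alg')!r} is not allowed")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))


token = sign_hs256({"sub": "alice"}, b"shared-secret")
print(verify(token, b"shared-secret"))  # {'sub': 'alice'}
```

A token whose header claims `"alg": "none"` fails the allow-list check regardless of what follows the second dot.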

        • abritinthebay 8 years ago

          Exactly! JWT is a much-misunderstood system, it seems, though it doesn’t exactly help itself by being quite complex.

  • cryptonector 8 years ago

    Well-designed protocols generally include algorithm identifiers. It doesn't mean that upgrade will always be easy though.

    I really don't like giving this a new name ("multihash"). We have a name already: algorithm agility. We should use that name.

    I also don't like this idea of having a standard for algorithm agility for hash functions (and another for encryption algorithms, and...).

    It's also not obvious that making every hash/MAC/public key payload carry an algorithm ID is the right design for every protocol (it's not), though for git it is.

    • benjaminjackman 8 years ago

      Yeah, this came out of the IPFS camp. It might still be sensible, all other things being equal, to use the same numeric ids for the hashing algorithms.

      Generally, and this is just my gut feeling, I think that for any hash code written to disk or stored in some way having an identifier for the hashing algorithm used is such a common bite you in the ass later thing that it makes sense to always just do it from day one. To that end it’s easier to do day one if everyone agrees to a standard set of numeric codes.

      Multihash is the standard set of numeric codes for different algorithms I am aware of.

      Unifying here might allow git objects to be served natively over IPFS.

      • whyrusleeping 8 years ago

        Just a quick note: while we would still really love to have git use multihash, you can already serve git objects natively over ipfs via: https://github.com/magik6k/git-remote-ipld

        Which uses our new plugin system: https://github.com/ipfs/go-ipfs/blob/master/docs/plugins.md

      • lgierth 8 years ago

        > Generally, and this is just my gut feeling, I think that for any hash code written to disk or stored in some way having an identifier for the hashing algorithm used is such a common bite you in the ass later thing that it makes sense to always just do it from day one. To that end it’s easier to do day one if everyone agrees to a standard set of numeric codes.

        Yes, that's the basic idea of all multiformats: "it's never gonna change" is considered harmful.

        > Unifying here might allow git objects to be served natively over IPFS.

        IPFS can already do that thanks to the CID format: https://github.com/ipld/cid

        There are no good examples for Git specifically yet, but there's a good bunch of working code for transporting e.g. Ethereum and Zcash transaction blobs over IPFS. For Git it's in principle the same: import the raw object into IPFS, and start addressing it with /ipfs/<git-cid><original-git-hash>

  • oconnor663 8 years ago

    Something about multihash makes me worry it's a security risk. Like I worry that it encourages this mistake:

    1. Define a new protocol with multihash somewhere in it.

    2. Import a super convenient multihash library.

    3. Verify all hashes with a simple library function.

    That sounds super natural and convenient to me, but if it means that you support MD4 by default, then you've introduced a downgrade attack into your protocol.

    • lgierth 8 years ago

      You can lock it down to specific hash functions no problem.

      • stouset 8 years ago

        If I’ve learned anything from being in this field it’s that:

          1) many if not most implementations will support lots of algorithms by default, and
          2) as a result, approximately zero users will lock it down

        • acdha 8 years ago

          3) the users who do lock it down will be harangued about not being compatible with less secure versions barring a major incident

        • oconnor663 8 years ago

          Yeah that's exactly what I'm worried about. The nature of the beast makes it tricky to define a safe default.
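
A sketch of what "locking it down" could look like in Python: the verifier takes an explicit allow-list with no permissive default, so a digest made with a weaker function is rejected even when it matches. It assumes the multihash codes 0x11 (sha1) and 0x12 (sha2-256), handles only single-byte codes and lengths, and the function names are hypothetical.

```python
import hashlib
import hmac

# Codes assumed from the multiformats table; both fit in one varint byte.
HASHLIB_NAME = {0x11: "sha1", 0x12: "sha256"}


def make_multihash(code: int, data: bytes) -> bytes:
    digest = hashlib.new(HASHLIB_NAME[code], data).digest()
    return bytes([code, len(digest)]) + digest


def verify_multihash(mh: bytes, data: bytes, allowed: frozenset) -> bool:
    """Check data against mh, but only for function codes in `allowed`.

    `allowed` has no default on purpose: the caller must name the functions
    it trusts instead of inheriting everything the library implements.
    """
    if len(mh) < 2:
        raise ValueError("multihash is too short")
    code, length, digest = mh[0], mh[1], mh[2:]
    if code not in allowed:
        raise ValueError(f"hash function 0x{code:02x} is not allowed")
    if code not in HASHLIB_NAME or len(digest) != length:
        raise ValueError("malformed or unsupported multihash")
    actual = hashlib.new(HASHLIB_NAME[code], data).digest()
    return hmac.compare_digest(actual, digest)


ONLY_SHA256 = frozenset({0x12})
msg = b"commit contents"
print(verify_multihash(make_multihash(0x12, msg), msg, ONLY_SHA256))  # True
```

Making the allow-list a required argument is one way to get a safe default: there is no call that silently accepts every function the library happens to ship.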

ris 8 years ago

Funny, I always expected Git to transition by adding a stronger hash as a piece of metadata to each commit and continuing to use SHA-1 as the day-to-day identifier, seeing as most of the time Git doesn't actually go back and verify the whole commit chain unless you ask it to.
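
To make the idea concrete: git names an object by hashing a short header plus the content, so the same bytes can be given a second, stronger name alongside the SHA-1 one. A small Python sketch; `git_object_id` and `both_names` are illustrative helpers, and using SHA-256 as the stronger hash is an assumption for the example.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash b'<type> <size>\\0' + content, the way git names loose objects."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


def both_names(obj_type: str, content: bytes) -> dict:
    """Day-to-day SHA-1 name plus a stronger digest kept as extra metadata."""
    return {
        "sha1": git_object_id(obj_type, content, "sha1"),
        "sha256": git_object_id(obj_type, content, "sha256"),
    }


names = both_names("blob", b"hello world\n")
print(names["sha1"])         # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(len(names["sha256"]))  # 64 hex digits
```

The SHA-1 value is the same one `git hash-object` reports for a file containing `hello world` and a newline.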

  • a_t48 8 years ago

    They actually considered the reverse (search for `Using hash functions in parallel`)

colinbartlett 8 years ago

Previous discussion: https://news.ycombinator.com/item?id=13906804

styfle 8 years ago

This doesn’t render very well on mobile. I wish the Git team would write their docs as a .md so GitHub could render as HTML with word wrap in all its glory.

pwagland 8 years ago

So, this is the transition plan. Is there anywhere where we can find what progress has been made on the plan? As far as I can tell, it is only a plan at the moment?

I also like the idea of a transition plan, but is there anywhere a proposed timeframe, for phasing out the non "post-transition" modes of operation? That is, as an organisation, is there anything that we can do with this now towards our future planning?

  • Piskvorrr 8 years ago

    For something as widespread as Git, there is no "post-transition", I'm afraid: while maintained code will get migrated, old repositories will hang around Forever.

    Note that Git is a protocol - all of its implementations will eventually need to change, and each repo using it as well. This is decentralized by the very purpose of Git.

bjackman 8 years ago

So it says the protocol won't be extended initially, only the repo format. I'm trying to figure out the implications of that. IIUC this basically boils down to: can we make sure that when you have a signed tag (i.e. a hash signed with GPG), the content of your repo is truly the same as what the signer intended, and not a collision generated by a bad actor.

It says that there will be a new format for signed objects, i.e. you will now be able to sign tags with NewHash. But if the protocol is not extended, does that mean you can't push or fetch those objects? If so then I believe this is just foundational work with no immediate functional impact, right?

(Not shitting on it btw, it's obviously still a Good Idea!)

  • mathw 8 years ago

    It explains this rather later in the document.

    There's a compatibility mode, where it understands a translation between SHA-1 named objects and NewHash named objects, and translates them at the boundary - i.e. during a pull or a push.

    Obviously you're at risk to some extent of flaws in SHA-1 being exploited in your remote, although presumably if the translation layer detects the SHA-1 of something didn't change but the NewHash did then it'll scream.

    It does seem this is a temporary situation though, as it mentions in one small sentence that for the final transition stage they envisage the protocol also supporting NewHash, so they can throw away all SHA-1 metadata everywhere. What they don't address in that plan is how the protocol gets extended, but they do clearly rely on that happening for the full transition to take place.
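
The translation idea can be sketched in Python as a two-way table built from object contents. This is a simplified model under stated assumptions: SHA-256 stands in for NewHash, only self-contained objects such as blobs are handled (real commits and trees embed other object names, which would also need rewriting), and the class and method names are invented for illustration.

```python
import hashlib


def object_name(algo: str, obj_type: str, content: bytes) -> str:
    """Name an object by hashing b'<type> <size>\\0' + content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


class TranslationTable:
    """Two-way map between SHA-1 names and NewHash (here SHA-256) names."""

    def __init__(self):
        self.sha1_to_new = {}
        self.new_to_sha1 = {}

    def add(self, obj_type: str, content: bytes):
        """Record both names for an object; the content itself is required."""
        old = object_name("sha1", obj_type, content)
        new = object_name("sha256", obj_type, content)
        known = self.sha1_to_new.get(old)
        if known is not None and known != new:
            # Same SHA-1 name but different content: refuse loudly.
            raise ValueError(f"SHA-1 collision detected for {old}")
        self.sha1_to_new[old] = new
        self.new_to_sha1[new] = old
        return old, new


table = TranslationTable()
old, new = table.add("blob", b"hello world\n")
print(table.new_to_sha1[new] == old)  # True
```

Because `add` needs the content, the table can only be filled in for objects the repository actually has, which is why the translation happens at the fetch and push boundary.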

cdancette 8 years ago

Torvalds on signing commits: http://git.661346.n2.nabble.com/GPG-signing-for-git-commit-t...

  • scrollaway 8 years ago

    He makes excellent points on tags; the one I hadn't considered before is that tags indeed can be separated from the tree, which makes them a unique asset in a git tree.

    The problem with that however is how we use tags today. Creating a tag in the modern lingua franca of git means creating a new version. If you push that tag to Github or Gitlab or what have you, a handy "release" will be created for you. If you're signing all your commits for some security reason, you don't want that, aye?

    So you'd want tags that are tracked separately and that's not easy to do. `git commit --gpg-sign` is going to include the signature in the commit, not create a separately-tracked tag with an appropriate name or whatever. It certainly sounds interesting, albeit unintuitive, and that summarizes git perfectly :)

    • finnthehuman 8 years ago

      >The problem with that however is how we use tags today.

      "Doctor, it hurts when I cargo-cult workflow from GitHub..."

    • jrochkind1 8 years ago

      pushing a tag to github-hosted repo certainly does not automatically create a 'release'.

      • stevekemp 8 years ago

        Github "releases" - as listed on the repository index - are solely based on the tags of the repository. So yes, pushing a tag does create a release.

        See here for documentation:

        https://help.github.com/articles/about-releases/

        • jrochkind1 8 years ago

          All releases are tags, not all tags are releases. Have you used the feature?

          > 1. On GitHub, navigate to the main page of the repository.

          > 2. Under your repository name, click Releases.

          > 3. Click Draft a new release.

          https://help.github.com/articles/creating-releases/

          Pushing a tag does not create a release. You can have lots of tags that are not releases. You have to choose to create a release, as a separate step. All your releases are tagged though, yes (as they should be, using github and its release feature or not, to identify the state of the repo from which the release was built).

    • bsimpson 8 years ago

      Tags are also how the code review tool Phabricator sends diffs to CircleCI for testing. If you have that integration enabled, you quickly end up with more GitHub Releases than your project has commits.

    • yes_or_gnome 8 years ago

      Based on my own research, it appears that the first git tag was created before the first git commit.

      The first tag (?) points to a tree.

        $ git cat-file -p v2.6.11-tree
        object c39ae07f393806ccf406ef966e9a15afc43cc36a
        type tree
        tag v2.6.11-tree
      
        This is the 2.6.11 tree object.
      
        NOTE! There's no commit for this, since it happened before I started with git.
        Eventually we'll import some sort of history, and that should tie this tree
        object up to a real commit. In the meantime, this acts as an anchor point for
        doing diffs etc under git.
        -----BEGIN PGP SIGNATURE-----
        Version: GnuPG v1.2.4 (GNU/Linux)
      
        iD8DBQBCeV/eF3YsRnbiHLsRAl+SAKCVp8lVXwpUhMEvy8N5jVBd16UCmACeOtP6
        KLMHist5yj0sw1E4hDTyQa0=
        =/bIK
        -----END PGP SIGNATURE-----
      
      
      First commit

        $ git cat-file -p v2.6.12-rc2
        object 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
        type commit
        tag v2.6.12-rc2
      
        Linux v2.6.12-rc2 release
        -----BEGIN PGP SIGNATURE-----
        Version: GnuPG v1.2.4 (GNU/Linux)
      
        iD8DBQBCbW8ZF3YsRnbiHLsRAgFRAKCq/TkuDaEombFABkPqYgGCgWN2lQCcC0qc
        wznDbFU45A54dZC8RZ5JxyE=
        =ESRP
        -----END PGP SIGNATURE-----
      
      Unfortunately, I don't think I can confirm my suspicion using git alone. Maybe if I look at some mailing lists around July/August 2005 I could get a more accurate confirmation.

      This is due to the fact that those tags pre-date the tagger header which came a short while later.

        $ git cat-file -p v2.6.13
        object 02b3e4e2d71b6058ec11cc01c72ac651eb3ded2b
        type commit
        tag v2.6.13
        tagger Linus Torvalds <torvalds@g5.osdl.org> 1125272548 -0700
      
        Linux 2.6.13 release
        -----BEGIN PGP SIGNATURE-----
        Version: GnuPG v1.4.1 (GNU/Linux)
      
        iD8DBQBDEkvwF3YsRnbiHLsRAp5tAKCEK1XmOropxvWp+k9eiTcafNMXXACcDAVb
        +hOwdKI+bi84SSNNSGnSXGg=
        =cnNS
        -----END PGP SIGNATURE-----
      
      Edit:

      Just to reinforce the "First Commit" claim, here's the rev-list for the commit and the commit contents. (Notice it has no "parent" commit.)

        $ git rev-list 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
        1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
      
        $ git cat-file -p 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
        tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
        author Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
        committer Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
      
        Linux-2.6.12-rc2
      
        Initial git repository build. I'm not bothering with the full history,
        even though we have it. We can create a separate "historical" git
        archive of that later if we want to, and in the meantime it's about
        3.2GB when imported into git - space that would just make the early
        git days unnecessarily complicated, when we don't have a lot of good
        infrastructure for it.
      
        Let it rip!
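
The "no parent" observation can also be checked mechanically. A small Python sketch that parses the header block of a raw commit object (the text printed above) and lists its parents; the function name is made up for illustration, and the sample message is shortened to its first line.

```python
def commit_parents(raw_commit: str) -> list:
    """Return the parent ids listed in a commit's header block."""
    parents = []
    for line in raw_commit.splitlines():
        if not line:  # a blank line ends the headers; the message follows
            break
        key, _, value = line.partition(" ")
        if key == "parent":
            parents.append(value)
    return parents


root = """tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
author Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
committer Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700

Linux-2.6.12-rc2
"""
print(commit_parents(root))  # []
```

An empty list marks a root commit; a merge commit would yield two or more entries.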
  • rvern 8 years ago

    Mike Gerwitz on signing commits: https://mikegerwitz.com/papers/git-horror-story.

CobrastanJorji 8 years ago

The main downside to switching the hash function is that, when explaining why developers should stop worrying about hash conflicts, we'll need to calculate a new analogy to replace the standard, 160-bit "every member of your programming team being attacked and killed by wolves in unrelated incidents on the same night" scenario.

  • deathanatos 8 years ago

    That analogy presumes that the hash function's output is uniformly random; when you know how to manipulate it s.t. its output is not random, then obviously it doesn't hold.

    The question of accidental collisions is still relevant, even with SHA-256, and the answer is still the same: it's so vanishingly improbable that it is assumed to be impossible.
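
To put rough numbers on "vanishingly improbable": assuming a hash behaves like a uniformly random function, the chance of any accidental collision among n objects follows the birthday bound, roughly n(n-1)/2^(b+1) for a b-bit hash. A quick Python sketch; the object count is an arbitrary illustrative figure, and the estimate says nothing about deliberately constructed collisions.

```python
import math


def accidental_collision_probability(n_objects: int, hash_bits: int) -> float:
    """Birthday-bound estimate: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_objects * (n_objects - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


n = 10**10  # ten billion objects, an arbitrary and generous figure
for name, bits in (("SHA-1", 160), ("SHA-256", 256)):
    print(f"{name}: {accidental_collision_probability(n, bits):.3e}")
```

Both results are tiny; the 256-bit figure is smaller by a factor of about 2^96.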

westurner 8 years ago

> Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256.

amelius 8 years ago

How does it prevent this exact same problem in the future?

joseluisq 8 years ago

> In early 2005, around the time that Git was written, Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 collisions in 2^69 operations. In August they published details. Luckily, no practical demonstrations of a collision in full SHA-1 were published until 10 years later, in 2017.

> The hash function NewHash to replace SHA-1 should be stronger than SHA-1 was: we would like it to be trustworthy and useful in practice for at least 10 years.

hwc 8 years ago

Why is SHA-3 not explicitly mentioned as a candidate?

anton_gogolev 8 years ago

NewHash is a terrible name - on par with Xbox One [X] and iPad New. Googling stuff will be hard, and good luck explaining to less technically savvy users what this is all about.

Plus, in 100 years, when SHA-256 is compromised, what would be the name of a new new format?

  • MaxfordAndSons 8 years ago

    I was under the impression that it's just a placeholder until the actual new hash function was decided on.

    Also the whole point of this transition plan is that it will be a completely optional, per local repository, transition. So less technically savvy users won't have to worry about it in the first place.

  • Ixio 8 years ago

    Can someone explain the name? It does not look like a good name. Or is NewHash just a placeholder name for the git project because they haven't made a final decision on a new hash function? (It's hard to google and find out.)

derekmhewitt 8 years ago

Can someone explain why they would transition to a new hash function and not a block chain based system of tracking? If one of the goals of introducing a stronger hash function is signing of individual commits it seems like a block chain would be ideal.

  • milkey_mouse 8 years ago

    Chains of Git commits are already a blockchain - at least, already a DAG, and to be more specific, they are both Merkle trees. Internally, each commit contains the hash of the previous commit it was based on:

        $ git cat-file -p HEAD
        tree e013f4d121199d60b70043f525aef4a7e641b5f6
        parent 152bbb43b30ced1b32e9ed6f5ba2ac448de725b6
        author Linus Torvalds <torvalds@linux-foundation.org> 1510512373 -0800
        committer Linus Torvalds <torvalds@linux-foundation.org> 1510512373 -0800
    
        Linux 4.14
    
    You can even GPG sign each commit if you want to ensure authenticity. The other aspects of cryptocurrency blockchains don't really apply here: we don't need a single "true chain," in fact that's the point of branching.

    (Kids these days with their blockchains...)
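
A toy Python illustration of the chaining described in this comment: each commit's id covers its parent's id, so changing an old commit changes every id after it. The commit text format here is simplified and the helpers are hypothetical; real commits also carry tree, author, and committer headers.

```python
import hashlib


def commit_id(message: str, parent: str = "") -> str:
    """Hash a simplified commit whose body names its parent, git-style."""
    body = (f"parent {parent}\n" if parent else "") + f"\n{message}\n"
    data = body.encode("utf-8")
    header = f"commit {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def build_chain(messages: list) -> list:
    """Return the ids of a linear history, oldest first."""
    ids, parent = [], ""
    for message in messages:
        parent = commit_id(message, parent)
        ids.append(parent)
    return ids


original = build_chain(["first", "second", "third"])
tampered = build_chain(["first (edited)", "second", "third"])
# Only the first message changed, yet every later id differs too.
print(all(a != b for a, b in zip(original, tampered)))  # True
```

Signing the newest id therefore vouches for the whole history behind it, as long as the hash function resists collisions.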

  • 0T-T0 8 years ago

    git already is a hash linked datastore with the ability to sign your 'transactions'. The doc just points out that SHA1 is not a reliable hash to address objects anymore.

  • hdhzy 8 years ago

    > ...it seems like a block chain would be ideal

    If I could get just 1 satoshi every time I see this suggestion...
