Beagle: git, URIs and all the dirty words

Human authored

Git's basic model is a wonderfully simple system of blob trees and commit chains that one can explain in 5 minutes to anyone. Further up the stack, that wonderful simplicity devolves into a mess of commands and flags that even developers with 20 years of git experience have difficulty remembering.

That is doubly so when multi-tasking with LLMs. "I believe we implemented it on Tuesday, but it is not here. Where is it?" "Which branch corresponds to that remote?" And so on.

If only we had some universal language to address and access local and remote resources, files and locations in files! Oh wait, we have HTTP and URI, which are as standard as it gets. Those were specifically designed for this task. Supported in so many apps and libs. Can we apply that to git?

Sure we can. GitHub URIs, for example, map the git space onto the HTTP URI space. The interesting part of this work is to define a set of orthogonal operations (a basis), so that any git wizardry can be represented unambiguously as a sequence of such steps, while no basic step can be represented as a combination of the others. The merge/rebase/cherry-pick example will clarify this point later.

URIs

The URI layout we all remember by heart:

scheme: -- the access protocol / addressing scheme,
//authority -- most often the network host,
/path -- path in the remote filesystem,
?query -- other stuff (like arguments),
#fragment -- location within the document.

Can we retrofit that to a versioned store? Well, if all the versioning info goes into the query, the rest is obvious: http://somehost/dir/file?branch#L101, for example. In fact, Beagle is a git-compatible SCM doing exactly that.
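To make the five-part layout concrete, here is a minimal POSIX shell sketch that splits a URI into scheme, authority, path, query and fragment. The function uri_split is a hypothetical helper written for this post, not Beagle's parser, and it skips percent-decoding and validation.

```shell
# uri_split is a hypothetical helper, not part of the Beagle CLI.
# It prints one "name=value" line per URI component; absent parts stay empty.
uri_split() (
    rest=$1
    scheme= authority= path= query= fragment=
    # The fragment is everything after the first '#'.
    case $rest in *'#'*) fragment=${rest#*\#}; rest=${rest%%\#*} ;; esac
    # The query is everything after the first '?'.
    case $rest in *'?'*) query=${rest#*\?}; rest=${rest%%\?*} ;; esac
    # A scheme ends at the first ':' and cannot contain a '/'.
    case $rest in
        *:*) head=${rest%%:*}
             case $head in
                 */*) ;;   # the colon belongs to the path, not to a scheme
                 *)   scheme=$head; rest=${rest#*:} ;;
             esac ;;
    esac
    # "//" introduces the authority; whatever follows it is the path.
    case $rest in
        //*) rest=${rest#//}
             authority=${rest%%/*}
             case $rest in */*) path=/${rest#*/} ;; esac ;;
        *)   path=$rest ;;
    esac
    printf 'scheme=%s\nauthority=%s\npath=%s\nquery=%s\nfragment=%s\n' \
        "$scheme" "$authority" "$path" "$query" "$fragment"
)

uri_split 'http://somehost/dir/file?branch#L101'
```

Relative references such as '?feature!' go through the same function: everything except the query simply comes out empty.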

HTTP verbs

The case of HTTP is more interesting. HTTP has a whole vocabulary of verbs: HEAD, GET, PUT, POST, PATCH, DELETE. Everyday web browsing, though, rarely goes beyond GET and POST. But there was some reason for the other verbs to exist, right?

GET "retrieves information"
HEAD is like GET, but no body
POST makes the server "accept the entity"
PUT requests the entity to be "stored"
DELETE does what it says
PATCH requests "changes" to be "applied"

While the vocabulary is a bit vague, fundamentally it grows out of the need to access a remote filesystem. That naturally fits the git model, which is, as described, a content-addressed filesystem. For that reason, Beagle uses the HTTP verbs exclusively.

Wait, but it only has PATCH? What about merge vs rebase?

Git's dirty words

There is always plenty of confusion around merge, rebase, squash, cherry-pick and all the related techniques of git-handling the twisted history of edits. Each command does several often unrelated things and each thing can be done by several commands, subtly differently.

Beagle decomposes those practices into a set of orthogonal operations, building on that wonderfully simple underlying model of git:

GET moves data from repo to worktree (including remotes)
HEAD is like GET's dry-run - fetch and report
POST moves data from worktree to repo (commits)
PUT only edits the reflog (sets branches/tags, stages)
DELETE is like PUT, but deletes
PATCH applies another version's changes to the worktree

As you can see, there is no way to substitute one operation for another: they are strictly orthogonal. Let's see how that applies to the pandemonium of merge/rebase/squash/cherry-pick.

Here is what all the git merge variants do:

they apply changes from a diverging commit or branch,
they reuse the message (rebase) or add a new one (merge, squash),
they refer to the original (merge) or not (rebase, squash).

Consequently, we have 2 x 2 x 2 = 8 options, one choice per axis: commit/branch, reuse/retitle, and refer/forget. In fact, only some of these 8 have git terms defined. For example, to squash, we apply a diverging branch in its entirety, add a new commit message, and do not refer to the original branch. To rebase, we apply separate commits, reuse the messages, and do not refer back. To merge, we apply all of a branch, add a new message, and refer back (the parent header).
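The three binary choices can be enumerated mechanically. The shell sketch below lists all eight combinations and labels the three that have a git name according to the paragraph above. The function name merge_variants and the output layout are made up for illustration.

```shell
# merge_variants is a hypothetical helper, not part of the Beagle CLI.
# It walks commit/branch x reuse/retitle x refer/forget and prints
# the git term for a combination, or "-" when the text names none.
merge_variants() {
    for scope in commit branch; do
        for message in reuse retitle; do
            for parent in refer forget; do
                case "$scope $message $parent" in
                    'branch retitle forget') name=squash ;;
                    'commit reuse forget')   name=rebase ;;
                    'branch retitle refer')  name=merge ;;
                    *)                       name=- ;;
                esac
                printf '%s %s %s: %s\n' "$scope" "$message" "$parent" "$name"
            done
        done
    done
}

merge_variants
```

Five of the eight lines end with "-": those combinations are expressible as a patch followed by a post, but have no established git name.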

The way to express it in the Beagle CLI:

    # quote the URIs: an unquoted # starts a shell comment, ? is a glob

    # rebase one commit: first apply it to the tree...
    be patch '?feature'
    # then post it with the same author/message, no parent ref
    be post '#!'

    # merge a branch: apply all the commits...
    be patch '?feature!'
    # and post with a new message (retain the parent ref)
    be post '#merge the feature'

    # squash a branch: first apply all the changes...
    be patch '?feature!'
    # then commit it with a new message, no parent ref
    be post '#add a new feature!'

    # rebase the entire branch, commit by commit
    while be patch '?feature'; do
        # check every tree state is valid, then commit;
        # stop at the first state that fails to build, test or post
        make && make test && be post '#!' || break
    done

    # cherry pick one commit: just apply the diff
    be patch '#391a0d33'
    # then post it (same author/message, no parent ref)
    be post '#!'

Here we use the bang modifier to:

?branch! apply the entire branch (default: one commit),
#message! forget the original commit (skip the parent ref).

When we supply no message, the original one gets reused. With rebases, we keep the message/author but drop the reference to the original, so the spell is: #! (reuse the message, forget the parent commit).
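The bang rules can be read as a tiny grammar: a leading ? or # picks the URI part, and a trailing ! is the modifier. The sketch below spells that reading out in POSIX shell. spec_parse is a hypothetical helper, and it assumes a trailing ! is always the modifier, which may not match how Beagle treats messages that genuinely end in an exclamation mark.

```shell
# spec_parse is a hypothetical helper, not part of the Beagle CLI.
# It splits a spec such as '?feature!' or '#!' into kind, value and bang flag.
spec_parse() (
    spec=$1
    case $spec in
        \?*) kind=query ;;       # ?branch selects a version
        \#*) kind=fragment ;;    # #message carries the commit message
        *)   echo "unsupported spec: $spec" >&2; exit 1 ;;
    esac
    value=${spec#?}              # drop the leading '?' or '#'
    case $value in
        *'!') bang=yes; value=${value%'!'} ;;   # strip the modifier
        *)    bang=no ;;
    esac
    printf 'kind=%s value=%s bang=%s\n' "$kind" "$value" "$bang"
)

spec_parse '?feature!'
spec_parse '#!'
```

Under this reading, '#!' parses to an empty message with the bang set, which matches "reuse the message, forget the parent commit".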

A branch rebase here can only happen as a loop, because we make one post per commit. This also ensures that all the committed revisions build and pass the tests (which is a clear gap in the git model).

Overall, this model is optimized more for formal correctness and non-ambiguity, on the general idea that it is the LLM that will be spelling the URIs most of the time.

FAQ

So, how is PUT different from POST?

POST does commit and/or fast-forward. PUT resets a branch or marks a file for commit/removal (reflog-only operations).

How does that compare to the URIs git uses?

git only uses URIs to access repos, e.g. git://github.com/gritzko/beagle.git. That is very limiting, so we want to extend that addressing scheme to access files, revisions, and locations in files.

How does that compare to GitHub URIs?

GitHub URIs have a typical web-app structure, which makes them inconvenient for our case.

https://github.com/gritzko/beagle/blob/main/keeper/README.md

In particular, Beagle URIs separate out all the versioning information into the query part, to avoid overloading the path with everything (project, user, branch, path). Beagle branches form a filesystem-like tree whose top-level entries are project trunks, so the GitHub URI above becomes

be://replicated.live/keeper/README.md?/beagle

Note that a Beagle repo may host any number of projects, and the default way to convey a project is the query. If we want to peek into a branch, the URI becomes

be://replicated.live/keeper/README.md?/beagle/MEM-issues
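Both URIs above follow one pattern, which a small shell helper can spell out. This is a sketch under assumptions: be_uri is a hypothetical function (not a Beagle command), and it assumes that a branch is always appended to the project trunk after a slash, as in the two examples.

```shell
# be_uri is a hypothetical helper, not part of the Beagle CLI.
# usage: be_uri HOST PATH PROJECT [BRANCH]
# The file path stays in the URI path; project and branch go into the query.
be_uri() {
    # ${2#/} tolerates a leading slash in PATH;
    # ${4:+/$4} expands to "/BRANCH" only when a branch is given.
    printf 'be://%s/%s?/%s%s\n' "$1" "${2#/}" "$3" "${4:+/$4}"
}

be_uri replicated.live keeper/README.md beagle
be_uri replicated.live keeper/README.md beagle MEM-issues
```

The two calls reproduce the trunk and branch URIs shown above.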