What it takes to create the simplest, barest self-hosted git remote, and learn stuff about git in the process.
Let’s pretend we wanted to use our computer skills to escape corporate kangaroo courts1, own our stuff, and host our code ourselves. What would it take to leave GitHub?2 I only care about the gen-pop use cases: source control, and build. If you want all the additional stuff like pull requests and such, you’re probably better off giving your money to someone who knows what they’re doing. In my case, why would I use GitLab or Codeberg when I can get most of what I need out of the box, and with minimal code?
Git is supposed to be decentralized. How hard can it be to have my own remote?
into the git remote internals
Stuff I learned about git that you didn’t need to know, but I will tell you anyway
Turns out, a git remote is really nothing much. It’s just another copy of the repository. All the fetch, pull, push operations are “just” commands synchronizing commits from/to that other repository through some kind of transport.
In its simplest form, a remote is just a filesystem link. This works:
- Create a new “bare” repo (a repo with no worktree):
mkdir source && mkdir remote.git
cd remote.git && git init --bare && cd ..
# Initialized empty Git repository in ~/remote.git/
- Create another repo, and add the first one as remote origin:
cd source && git init
# Initialized empty Git repository in ~/source/.git/
git remote add origin ../remote.git
# (a plain path works as a remote; a file:// URL would need an absolute path)
- Push to origin:
echo "AHAH" > plop.txt && git add -A \
&& git commit -m "Ohohoh" && git push origin main && cd ..
# [main (root-commit) e8c76d4] Ohohoh
# 1 file changed, 1 insertion(+)
# create mode 100644 plop.txt
# Enumerating objects: 3, done.
# Counting objects: 100% (3/3), done.
# Writing objects: 100% (3/3), 206 bytes | 206.00 KiB/s, done.
# Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
# To ../remote.git
# * [new branch] main -> main
- Clone somewhere else:
git clone "file://$PWD/remote.git" other-src
# (file:// URLs require an absolute path, hence the $PWD)
# remote: Enumerating objects: 3, done.
# remote: Counting objects: 100% (3/3), done.
# remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
# Unpacking objects: 100% (3/3), 186 bytes | 93.00 KiB/s, done.
# From file:///…/remote.git
# * [new branch] main -> origin/main
git init --bare creates a repo with no work tree. That means nothing is checked out; you just have the git internal files3. The .git extension on the folder is just a convention, but it can turn out useful if you want to implement filters in your HTTP reverse-proxy - more on that later. That’s our remote. This gives us a first glimpse into what the core of a git remote is: a server and a transport. In this example, our transport is the filesystem, and the server is, well, git itself.
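Peek inside the bare repo and you’ll find exactly what usually hides inside a .git folder, and nothing else (the exact listing varies a bit across git versions):
ls remote.git
# HEAD  config  description  hooks  info  objects  refs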
Server
On a --bare target, git uses the “smart” git server4, and runs it locally to interface with the “remote” (that is, in fact, local). This server is composed of two binaries:
- git-upload-pack handles uploading packs from the perspective of the server - that is, when doing fetch/clone/pull
- git-receive-pack handles receiving packs when pushing.
The communication follows a protocol specific to git that runs over regular stdin/stdout pipes. The server starts by advertising its capabilities. The client sends commands on stdin, and receives responses on stdout. “Objects” are packed on one side, and sent to the other. From there, packs are checked, “unpacked” into objects, and git updates the relevant refs, including remote branches.
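You can poke that server yourself. A flush packet (0000, the protocol’s “I’m done” marker) on stdin is enough to make it print its ref advertisement and exit - here against the bare repo from above, output abridged:
printf '0000' | git-upload-pack remote.git
# 00e8e8c76d4… HEAD\0multi_ack thin-pack side-band-64k … agent=git/2.x
# 003de8c76d4… refs/heads/main
# 0000
Each line is length-prefixed, and the server’s capabilities ride along after a NUL on the first ref line.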

These two git commands are the crux of remotes; as you’ll see below, they implement what’s basically the entire logic of git remote synchronization.
Servers using these commands are called “smart”, because they’re “git aware”. You tell the server which commits you have and what you need (“I want HEAD on main”), it packs up everything remotely, and sends you all that, and just that. If there are smart servers, of course there are also stupid servers for protocols, like ftp, that can’t accommodate the smart server. In that case you’re pretty much in for enumerating files one by one.
OK, but remotes should be, well, remote. The filesystem is obviously not the most appropriate transport5! If you’re familiar with GitHub, you’ve probably used either SSH or HTTPS for transport.
SSH
The smart server communicates through stdio. Oh, the simple and flexible beauty of it! With that architecture, SSH transport is as straightforward as running a cat. So when you use an SSH remote, git connects through SSH, runs the server (either the git-upload-pack or git-receive-pack command, depending on whether you’re pulling or pushing), then communicates through stdin/stdout over the SSH connection. Proper, physical separation of the OSI transport and application layers.
Authentication and all security aspects are entirely delegated to SSH and the receiving server: if you have the right key or password, and you have access to the files, you’re in.
Assuming that your remote is ssh://git@myserver.com/git/repo1.git, git determines the appropriate server:
- fetch/clone/pull are pretty much just running ssh git@myserver.com git-upload-pack '/git/repo1.git'
- push does the same but with git-receive-pack
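Don’t take my word for it: GIT_TRACE shows the command git spawns under the hood (output trimmed to the interesting line, and roughly reproduced):
GIT_TRACE=1 git fetch origin 2>&1 | grep ssh
# trace: run_command: ssh git@myserver.com 'git-upload-pack '\''/git/repo1.git'\'''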
HTTPS
HTTPS transport relies on an additional piece between client and server called git-http-backend (because consistent naming is only a concern of unsuccessful projects). It’s a lightweight
proxy to… can you guess?! git-*-pack. You gotta love the level of reuse there: it’s all abstracted in the form of a series of executables, and whatever you need is assembled at runtime. This is in prod in millions of companies around the world and it just works.
The implementation of git-http-backend is a few thousand lines of C, most of which are utility functions. It’s not a server properly speaking, but a CGI backend - so you’ll need an HTTP server to handle the HTTP stuff. Like with SSH, all communication, authentication, encryption and other such concerns are delegated to the transport itself. If you want auth, you have to set up auth on your server. If you want encryption, you have to set up TLS.
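And since it’s plain CGI, you can invoke git-http-backend straight from a shell by faking the environment a web server would provide. A quick sketch, assuming the bare repo from earlier and the usual git-core location:
# pretend to be a web server handling the first half of a fetch
REQUEST_METHOD=GET \
GIT_PROJECT_ROOT="$PWD" \
GIT_HTTP_EXPORT_ALL=1 \
PATH_INFO=/remote.git/info/refs \
QUERY_STRING=service=git-upload-pack \
/usr/lib/git-core/git-http-backend
# Content-Type: application/x-git-upload-pack-advertisement
# …plus cache headers, then the same pkt-line ref advertisement as before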

Everything is very similar to FS or SSH, but the connection isn’t maintained, and it takes two calls to perform a single command:
- To fetch, first GET remote.git/info/refs?service=git-upload-pack does the negotiation to figure out what needs to be transferred. Then second, POST remote.git/git-upload-pack does the actual transfer.
- Respectively for git-receive-pack.
From there, the behaviour just follows the git protocol.
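With a reverse-proxy in front (here a hypothetical https://myserver.com serving our repo, and wants.pkt standing in for the pkt-line request body git would normally build for you), that two-call dance looks like this from the client side:
# step 1: negotiation - returns the pkt-line ref advertisement
curl -s 'https://myserver.com/remote.git/info/refs?service=git-upload-pack'
# step 2: the actual transfer - the body is a pkt-line framed list of
# want/have lines
curl -s -X POST \
  -H 'Content-Type: application/x-git-upload-pack-request' \
  --data-binary @wants.pkt \
  'https://myserver.com/remote.git/git-upload-pack'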
There are a few other HTTP endpoints provided by the backend. As far as I can tell, they’re only there to support stupid-non-smart git remotes over HTTP, in case you have a stupid-non-smart client which basically navigates the remote repository using file browsing. Idiot client.
… and more!
Take a minute to appreciate that: when using SSH, git opens a plain old, regular SSH connection, starts a small binary on the other side, and just uses stdio to do all its magic. This is RPC in its purest, simplest form. Flexible to the core: as long as you can speak text, you can implement the git protocol on top of it. If that’s not a beautiful architecture built from clever choices, I don’t know what is. I find it utterly clever.
SSH, file system, HTTP and the git protocol are built-in (http is a bit of an “in-between”: there’s some specific treatment for it within the code, but some of the piping is done with separate binaries). For other protocols, including HTTP/HTTPS, git uses “remote helpers”, which are simply commands named git-remote-protocolname. If you ls /usr/lib/git-core, you’ll find the remote helpers your copy of git shipped with, such as git-remote-https. Mine ships with fd (file descriptors), ext (don’t ask), http/https, and ftp; I think that’s standard. You can create your own by following the protocol, and finally support whatsapp://myrepo and bypass airline limitations. Why not imap://myrepo? Sky’s the limit!
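For flavour, here’s roughly what the smallest useful helper could look like: a toy git-remote-myproto (protocol name and repo path made up) that advertises the connect capability and hands the pipe straight to the pack programs, per the gitremote-helpers protocol:
#!/bin/sh
# git runs this when it meets a myproto:// URL, and talks line-based
# commands on stdin/stdout
while read -r cmd; do
  case "$cmd" in
    capabilities)
      # all we know how to do is hand over a full-duplex connection
      echo "connect"
      echo ;;
    "connect "*)
      echo  # empty line means "connection established"
      # run git-upload-pack or git-receive-pack on a made-up local path
      exec ${cmd#connect } /srv/git/myrepo.git ;;
    "") exit 0 ;;
  esac
done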
Onwards to our own git remote
Lots of stuff we didn’t need to know; but don’t you feel better about it?
Initially I was trying to make my own git remote. How complex is it? Not complex. Not very complex at all.
For now I’ll focus on creating a remote. I want a build system, but we’re already 2000 words in, so I’ll get to that another day. For that remote, I will use SSH, just because it opens a bunch of other opportunities that I’ll get into another day6.
SSH container
But using SSH creates a bit of a problem: security-wise, an SSH port dangling open in the wild is as close to a “come and hack me” sign as it gets. If someone bypasses your HTTP auth, they can retrieve data. If they bypass your SSH auth, they can run commands. Terrifying shit. I could close that port and connect over VPN, but I don’t wanna, for a lot of good reasons. So instead I’m going to be a big courageous little boy, and use a small hardened container (sketched after this list):
- run on a non-standard port because that’s cutting 95% of the non-targeted attacks. (If you want proof, just open port 22 for a couple hours and check your logs, then switch to port 23)
- use a minimal Alpine image with just git and openssh installed
- run under a separate account, can’t sudo
- only has access to my git repos
- only authenticates using keys
- has no access to network, internal or outbound
- uses git-shell to limit the commands that can be run. It’s not fool-proof in preventing remote execution, but it’s limiting
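A minimal sketch of that image, assuming key-based auth with an authorized_keys file next to the Dockerfile (names, paths and the port are illustrative, and the network lockdown happens at the container-runtime level, not in the image):
FROM alpine:3.20

# just git and openssh, nothing else; generate host keys
# (on some distros git-shell ships in a separate package)
RUN apk add --no-cache git openssh && ssh-keygen -A

# dedicated user with git-shell as login shell; unlock the account so
# key-based login works despite there being no password
RUN adduser -D -s /usr/bin/git-shell git \
    && echo /usr/bin/git-shell >> /etc/shells \
    && sed -i 's/^git:!/git:*/' /etc/shadow \
    && mkdir -p /home/git/.ssh /home/git/git-shell-commands

COPY authorized_keys /home/git/.ssh/authorized_keys
RUN chown -R git:git /home/git && chmod 600 /home/git/.ssh/authorized_keys

# keys only, no root login
RUN sed -i -e 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' \
           -e 's/^#\?PermitRootLogin.*/PermitRootLogin no/' \
           /etc/ssh/sshd_config

# non-standard port
EXPOSE 2222
CMD ["/usr/sbin/sshd", "-D", "-e", "-p", "2222"]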
I feel pretty good about that setup. I’m sure I’ll get proven wrong ultimately, but I’ll live to tell the story!
Migrating
I have 187 repos to migrate from GitHub (don’t ask what’s in there, I don’t know), and I’m not going to do that manually. Luckily GitHub provides us with a CLI, so we can script it easily:
#!/bin/bash
VISIBILITY="$1" # private/public
GITHUB_USER="$2"
repos=$(gh repo list --visibility "$VISIBILITY" --limit 1000 --json "name" -q ".[].name")
for repo in $repos; do
echo "Cloning $repo..."
git clone --bare "git@github.com:$GITHUB_USER/$repo.git"
done
echo "All done!"
I’m separating public and private repos; that makes it pretty obvious which is what.
Git works at the scale of a single repo7. One thing missing out of the box because of that: creating and listing repositories. I don’t want to have to connect to my server and initialize a repo myself like a peasant. I want the comfort that I had with GitHub. I’m going to use my own custom git command: git repo8.
- git repo create MY_REPO --visibility public does what it says. The --visibility flag specifies which top-folder the repo will be initialized into. By default everything is private.
- git repo list --visibility private returns a list of all repositories, in a format that can directly be cloned.
- git repo clone finds the repo, then clones it.
Because we’re using git-shell, we can’t directly run ls or even git from the command. So we have to add a couple of scripts in the git-shell-commands directory of the user we are connecting with. These then become valid commands to run in git-shell:
create-repo:
#!/bin/sh
if [ -z "$1" ]; then
echo "Missing parameter"
exit 128
fi
git init --bare "$1"
list:
#!/bin/sh
if [ -z "$1" ]; then
echo "Missing parameter";
exit 128;
fi;
ls "$1"
Then our local git command can “just” call these.
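The local side can stay tiny too. Here’s a hypothetical sketch of the git-repo script - git treats any executable named git-something on your PATH as a subcommand; flag parsing is boiled down to positional arguments for the sketch, and $GIT_REPO_ADDRESS stands in for the real host:
#!/bin/sh
# usage: git repo {create|list|clone} NAME [public|private]
set -e
host="${GIT_REPO_ADDRESS:-git@myserver.com}"
cmd="$1"; name="$2"; visibility="${3:-private}"
case "$cmd" in
  # each branch maps to one of the git-shell-commands defined above
  create) ssh "$host" create-repo "$visibility/$name.git" ;;
  list)   ssh "$host" list "$visibility" ;;
  clone)  git clone "ssh://$host/~/$visibility/$name.git" ;;
  *)      echo "usage: git repo {create|list|clone} NAME [public|private]" >&2
          exit 1 ;;
esac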
Because I want to be able to access my repos from a bunch of different places, I also created a small install script to fetch the command, which I can run simply with bash -c "$(ssh $GIT_REPO_ADDRESS remote-install)".
Backup
YOLO.
Automation
This blog’s magnificent infra makes use of GitHub Actions. I also want to keep pushing public code to GitHub. So I need some kind of alternative. I have a solution for that. We’ll get to it next time6.