Plane, like many other open-source B2B products, offers both a Cloud-hosted edition and a self-managed one. While our Cloud users far outnumber our self-hosted users, it's encouraging for an early-stage start-up like ours to see an estimated 50,000+ active self-managed Plane instances.
We don't collect telemetry, so there's no sure-fire way to know the exact number of active self-hosted Plane instances, but we do some back-of-the-envelope math for our best guesstimates.
However, our deployment methods weren't always what they are today. Docker Compose is the most popular, best-documented, and, one can argue, the simplest of all deployment methods available to self-hosters. Stupidly, it took us more than half a year to get it to its current standard of ease and even longer to simplify its use. That we got to today is no small testament to our community, which stayed with us through upgrading, migrating, and templatizing their deployments while giving us critical feedback.
We intend for this post to be instructive to other open-source early-stage start-ups on the merits, pitfalls, and lessons of our choices, both past and present.
Start-stop-start
Plane is a Django-NextJS application. Our Cloud runs the Django backend independently of the NextJS front-end on two different domains, typical of cloud-hosted software.
We naively followed that pattern for the first few versions of our self-hosted edition. The front-end talked to the back-end via an .env variable called NEXT_PUBLIC_API_BASE_URL. Sticking with an already bad choice, we also made that variable and others like it—all beginning with the prefix NEXT_PUBLIC—build-time variables. Anyone who wanted to self-host Plane had to,
- clone our repo
- find and edit three of the several .env files to specify values for
  - NEXT_PUBLIC_API_BASE_URL
  - SECRET_KEY, for keeping your usernames and passwords salted in your self-hosted instance
  - WEB_URL, for redirecting from emails to the domain your app was hosted on
- build images locally with docker-compose
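Concretely, a first install looked roughly like this sketch—the file names and paths here are illustrative, not the exact ones:

```sh
# A sketch of the original flow, not exact paths.
git clone https://github.com/makeplane/plane.git
cd plane

# Hunt down and edit the relevant .env files to set
# NEXT_PUBLIC_API_BASE_URL, SECRET_KEY, and WEB_URL ...

# ... then build every image locally before anything runs.
docker-compose up -d --build
```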
There were no pre-built Docker images, no Docker Hub presence, and no out-of-the-box way to run the stack as a cohesive app. It was the Wild West for our self-hosters when the rest of open source had long moved on to deployment sophistication.
Expected trouble
Most users didn't care for Docker Hub or pre-built images when they were trying out Plane. They just needed to set values for the three .env variables mentioned above and build their images locally. Not the best, but not the worst either. Trouble brewed when they were beginning to adopt Plane.
We have twenty-five .env variables that an admin can set values for. 25 customization options. 25.

If an admin wanted to change even one of those, they had to,
- stop the instance
- hunt for the variable in one of four .env files
- specify values for the variable
- rebuild all the containers
And they had to do it for every change. Let that sink in for a second.
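In shell terms, the loop every admin lived in looked something like this—a sketch, with an illustrative file path:

```sh
docker-compose down              # stop the whole instance
vi apiserver/.env                # hunt for the one variable across four files (path illustrative)
docker-compose up -d --build     # rebuild every container for a single change
```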
As you can imagine, in the first seven days from installation, admins were rebuilding their images from scratch an average of ten times, if not more. As you can also imagine, the brickbats we got for this bordered on the poetic. Some users said we didn't know what we were doing—we didn't—while others said we were intentionally making self-hosting difficult—we weren't.
What we learnt painfully
- The more you sweat in training, the less you bleed in battle.
We could have looked at other self-hosted products and their architectures instead of copying our Cloud's, taking a little more time to ship our self-hosted edition right.
- Simplify for your users, not for yourself.
We shipped this self-hosted version as a replica of our own Cloud, which was less work for us and a lot more for our users. It sounds clichéd that we should have simplified for the user, but it's one of our commandments now.
Solve, phase #1: Removing builds
Our first big hill to scale was taking builds out of the process. The user story went,
I should be able to download pre-built images and specify .env variable values to get my instance up with docker-compose.
Come February 2023, we were pushing pre-built images to Docker Hub, but with NEXT_PUBLIC_API_BASE_URL hard-coded to the value http://localhost.
"But why", you ask perplexed.
Well, we weren't running automated pipelines to compile images from our repository. We were building them locally and uploading them to Docker Hub. But that didn't matter right then—you will see how much it mattered just a little later—because admins could now download these images, both an all-in-one image and five separate ones, one per service running inside the Docker network. They just had to change the value of NEXT_PUBLIC_API_BASE_URL to their hosted domain. No building images, no rebuilding images.
All for nothing
But remember how NEXT_PUBLIC_API_BASE_URL was a build-time variable, not a run-time variable? So, admins edited the .env file to specify its value and realized they had to git clone the repo to build images locally anyway. You would be hard-pressed to find a better example of adding insult to injury if you hunted for one with a vengeance.
Wake up!
Luckily, not very many people complained—we are chalking this up to having the best self-hosters in the world, period—until one day, ↓ happened.

We repeat, our users are just the nicest.
Five months from the time we went to Docker Hub, this came as the sign we were waiting for. Docker Hub had to work without shenanigans, so that became our bigger hill to scale.
Setting up GitHub Actions was too daunting and time-intensive back then—we were shipping weekly with a team of one back-end engineer—so the genius solve was to find a way to replace the hard-coded NEXT_PUBLIC_API_BASE_URL value at service start-up with whatever the admin had specified as its value.
The faux-clever solve
Enter Cal.com's community with their Bash script! Turns out, Cal.com had a similar issue, and their community, which overlaps with ours big time, had written ↓ to replace statically built strings with values from .env variables.
```sh
FROM=$1
TO=$2

if [ "${FROM}" = "${TO}" ]; then
    echo "Nothing to replace, the value is already set to ${TO}."
    exit 0
fi

echo "Replacing all statically built instances of ${FROM} with ${TO}."

find apps/web/.next/ apps/web/public -type f |
while read file; do
    sed -i "s|${FROM}|${TO}|g" "${file}"
done
```

Awesome! All we had to do was change the script for our folder structure and set it as the first thing to execute for the front-end container when a self-hoster ran docker-compose.
```sh
#!/bin/sh

FROM=$1
TO=$2

if [ "${FROM}" = "${TO}" ]; then
    echo "Nothing to replace, the value is already set to ${TO}."
    exit 0
fi

echo "Replacing all statically built instances of ${FROM} with ${TO}."

find apps/app/.next -type f |
while read file; do
    sed -i "s|${FROM}|${TO}|g" "${file}"
done
```
That script ran with the following arguments.
```sh
/usr/local/bin/replace-env-vars.sh "$BUILT_NEXT_PUBLIC_API_BASE_URL" "$NEXT_PUBLIC_API_BASE_URL"
```
And it worked! It worked like a charm. Or so we thought.
Chokepoint
Here's the thing about Bash scripts: they stall, fail, and crash when you have a lot of minified JavaScript files—in our case, hundreds, courtesy of NextJS. Finding and replacing http://localhost across an average of 300 minified JS files held up the front-end from starting for a good while. In one instance, as reported by an admin, it took thirty minutes before the script exited with an error that made no sense. Add to that the case of one invalid character in the admin-specified value for NEXT_PUBLIC_API_BASE_URL and, yep, it spat out another unhelpful error.
What do you know, it was not the elegant solution we thought it was. 🙄
The true-clever solve
In fact, it was so simple, we had to swat ourselves on the head for not thinking of it first. We had shipped our repo and the Docker Hub images with NGINX to limit the number of ports an admin opened to one. Turns out, we never used NGINX as a reverse proxy because, well, the front-end ran independently and on a different domain from the back-end. 🤦‍♂️
All we had to do was run all services in the container on the same domain and let NGINX tell the front-end how to talk to the back-end. Simple!
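Conceptually, the NGINX config boils down to something like this—a sketch with assumed service names and ports, not our exact file:

```nginx
server {
    listen 80;

    # API calls go to the Django back-end...
    location /api/ {
        proxy_pass http://plane-api:8000;
    }

    # ...everything else goes to the NextJS front-end.
    location / {
        proxy_pass http://plane-web:3000;
    }
}
```

With both services behind one domain, the front-end can call the API over a relative path, so nothing about the domain needs baking into the build.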
Took us two months after we shipped the Bash script to get to this, but we did. Starting v0.13, there's no Bash, no replacements, no builds, and no chokepoints. Admins specify values for .env files and run docker-compose to get their instance running.
Phew!
What we learnt even more painfully
- Duct tape = unpredictable crashes
Short-term solves are good only when you are close to a good mid-term solution. We now err on the side of delays instead of shipping something untested, half-baked, and quick.
- What works for others may not work for you
The Cal.com community found a solution for the Cal.com repo and for that community's usage of the repo. It couldn't have worked for us with our set-up.
Solve, phase #2: Unblocking asset and file uploads
A typical Plane instance stores and serves a few thousand images and files over just a month. Think profile pictures, project backgrounds, and the screenshots + files attached not just to issues and comments but also to Pages, a knowledge-capture feature in Plane. On self-hosted instances, this is handled by an S3-like object store called Minio.
To optimize upload queues and prevent abuse, we restrict per-file size to 5MB on the Cloud. Those restrictions don't apply to self-hosted Plane, but we ship with the 5MB default that admins can change. Turns out, we accidentally let a restriction of 1MB slip through.
The culprit was NGINX, not Minio
Remember NGINX as our router of choice for all it surveys inside our Docker container? How many of you know NGINX ships with a 1MB file-size limit by default?

We are guessing most of you have your hands raised right now, so call us ignorant—no get-out-of-jail-free card for us—because we didn't know. We just naively thought we had f-ed up our Minio implementation.
Some debugging later, we saw the files weren't even reaching the back-end, which would then talk to Minio to store them.
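The fix, once found, is a single NGINX directive—sketched here with illustrative placement in the config:

```nginx
http {
    # NGINX rejects request bodies larger than 1MB by default.
    # Raise the cap to match the app's own per-file limit.
    client_max_body_size 5M;
}
```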



The error was reported in v0.7 on May 29 last year and we shipped the fix two weeks later.

That's not the fastest for us, but we had had to swallow a bitter pill for shipping untested fixes before. We weren't playing fast and loose anymore.
But Minio wasn't far behind
Turns out, bitter pills weren't done with us yet. Minio, an open-source alternative to Amazon S3, was going to land us in more trouble.
Minio is great for folks who don't have S3 and want to get started with Plane quickly before making .env changes for their infra. So, when a user uploaded an avatar or a file to Issue details,
- it was sent to NGINX
- which sent it to the Django backend
- which used Django S3 storage, a Django library
- that talked to Minio at http://plane-minio:9000
- to save the file at http://plane-minio:9000/<filename>
- and send that URL back to the front-end
- so it showed up on the app's interface
It worked exactly like that when we tested our Minio implementation. Why wouldn't it? After all, we were running the app locally on http://localhost.
Minio, like any other service running as a container in our Docker network, can't be accessed from outside. Users talk to our network via NGINX, which in turn talks to the front-end and the back-end. For admins running the app on https://anysubdomain.anydomain.tld and users accessing it from that address, http://plane-minio:9000 would be inaccessible, and thus, unusable. And it was for everyone who tried to upload images.

Add to that the absence of validation for a successful upload to the storage server before rendering the image on the interface and we had led admins and users down a rabbit hole.
The image was there when uploaded. It isn't there when I switch out from the page to another and come back. What's happening?
For the technical and curious: we were rendering the image in Issue details straight from the browser without it ever being uploaded to the instance's storage. That's how drag-and-drop used to work on Google Drive or Dropbox—you could see the file before clicking Upload, but refresh the page without that click and you lost the file.
Reintroducing WEB_URL
In case you didn't look closely at the list of .env variables under Start-stop-start in this post, we have shipped WEB_URL from the time we walked on to GitHub in November 2022. That variable tells the back-end which domain to use for redirections to the front-end—useful for link clicks in emails and Slack. It is the same domain that the front-end is rendered on and that users access Plane from.
Instead of introducing another variable, we just reintroduced WEB_URL to the back-end, specifically to Django Storage, so it could post-process file URLs to anysubdomain.anydomain.tld/<filename>.

Admins were specifying values for WEB_URL anyway, so that worked great.
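A minimal sketch of the idea in django-storages terms—illustrative, not Plane's exact code, though AWS_S3_CUSTOM_DOMAIN is the setting the library exposes for this:

```python
# settings.py (sketch)
import os
from urllib.parse import urlparse

# Fall back to the in-network endpoint when WEB_URL is absent —
# the very default we still shipped until v0.17.
WEB_URL = os.environ.get("WEB_URL", "http://plane-minio:9000")

# django-storages rewrites generated file URLs to this domain, so users
# get anysubdomain.anydomain.tld/<filename> instead of the unreachable
# http://plane-minio:9000/<filename>.
AWS_S3_CUSTOM_DOMAIN = urlparse(WEB_URL).netloc
```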
Even in v0.17, our latest release, we still ship code that defaults to http://plane-minio:9000 when WEB_URL is not specified. We have an elegant fix for this that we will talk about below.
What we learnt embarrassingly
- Beta-test in the real world
Greens all the way when testing internally is a big illusion that will get you shown up when you go live. As we learnt through phase #2, it is much better to let willing users test pre-releases than to roll out early-stage changes to everyone.
- Slow down to speed up
Our faux pas with NGINX told us we needed to slow down with the day-to-day to speed up in the mid-term. The mad rush to ship every week gave way to a comprehensive release plan, headed by a new-hire release manager, which helped immensely in phase #3.
Solve, phase #3: Upgrades, migrations, and the Plane CLI

Quick recap of the road so far
- Admins didn't need to build images.
- They still needed to clone the repo for the .env files we referred to in docker-compose.
- They needed to specify .env values in those files—four of them.
While Docker Hub + a cloned repo worked okay for most, there were troubles, not the least of which was just how long it took to get started. Admins didn't like it.

We weren't trying to make self-hosting harder so we could monetize the Cloud—we are monetizing self-hosting, too, as you will see in Phase #4 below—but we got the spirit of those comments.
Removing git clone from the equation
We started our first closed beta with the lowest-hanging fruit, the one with the most grunt work for admins—cloning the repo to get the four .env files that docker-compose reads values from and editing those four files to specify those values. To solve for those two, we moved all our variables into a new docker-compose file called docker-compose-hub that was available to our beta testers. These testers would simply download the images of our containers from Docker Hub along with the docker-compose-hub.yaml file, edit the file to specify .env values, and run it. No cloning the repo. No hunting for variables in four files.
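The single-file structure looked roughly like this—images, service names, and variables here are illustrative:

```yaml
# docker-compose-hub.yaml (sketch, not the actual file)
services:
  api:
    image: makeplane/plane-backend:latest
    environment:
      - SECRET_KEY=replace-me
      - WEB_URL=http://localhost
  web:
    image: makeplane/plane-frontend:latest
```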
Persistent .env values and simplifying docker-compose-hub.yaml's structure
Writing the .env variables from the four previous files into docker-compose-hub as blocks of text had some merit—admins used to the old, sub-optimal structure wouldn't come at us with pitchforks over radical changes when this method became generally available—but it became clear soon enough that our seasoned testers found the new structure painful to edit, especially when making changes after first set-up.

There's a whole lot more that couldn't be included because the image would become illegible.
The bigger problem was upgrades. A single docker-compose file with .env key-value pairs meant that if we introduced new variables, admins would have to download that version's docker-compose-hub. If you are thinking, "Oh, no! That'd replace all custom values with the default ones", ten points to you. And zero to us.
So, fast-following a week later, we moved all those .env variables out into a single file called variables.env, which was saved as just .env on the host machine, and kept a copy of it alongside docker-compose-hub, too. That way, if an admin accidentally deleted the .env file, they could always recreate it with their previous values, using the copy as a reference.
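In compose terms, the change swaps inline environment blocks for an env_file reference—again a sketch, not the actual file:

```yaml
services:
  api:
    image: makeplane/plane-backend:latest
    env_file: .env   # custom values now survive a docker-compose upgrade
```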

Partially automating set-up with setup.sh
Because we weren't shipping these changes for general availability, we were free to think about the next problem—swapping out manually run docker pull commands and admin-edited .env values with command-line automation.
Enter setup.sh, a Bash script that admins could download with a simple cURL command to get a command-line menu.
```sh
$ mkdir plane-selfhost
$ cd plane-selfhost
$ curl -fsSL -o setup.sh https://raw.githubusercontent.com/makeplane/plane/master/deploy/selfhost/install.sh
$ chmod +x setup.sh
```

Behind the interface, we would still download docker-compose—docker-compose-hub was now docker-compose and still in beta, but getting ready to be generally available—the images from Docker Hub, and the .env file, but admins wouldn't have to bother with any of that. They would choose Install or Upgrade and voila! Everything after that would happen automatically.



We included an option to see logs for each service in the Docker network, too.

This worked beautifully for both new and existing admins and became our first generally available, stable improvement for the community. Configuring .env variables still had to be done with edits to the .env file, but that was a lesser problem at this point. We had bigger fish to fry.
Launching Kubernetes
Plane, by this time, had enough mid-market traction that we started seeing requests for Kubernetes support, with some even looking to the community for help.

Launching Kubernetes support was a no-brainer for what we had planned next, so we put our Helm charts on our site and ArtifactHub simultaneously, with default values for the Memory, CPU, and Replicas variables. The quick move to Kubernetes helped in two ways.
- Mid-market companies started deploying Plane on Kubernetes way more than before.
- We discovered a problem with database migration on Kubernetes that would become a problem, if it wasn't already, on Docker, too.
Database migration, replicas, and Migrator
Admins deploying on Kubernetes started reporting what looked like a weird problem.

Users of self-hosted instances on Kubernetes saw this for a while when it looked like an upgrade had been successful.
They also reported this happened almost exclusively during upgrades. So perplexing. Our upgrades were seamless on Docker and ran something like this.
- The API service would attempt a migration of the database during an upgrade.
- It would run the migration successfully.
- The app would come up on the registered domain and users would use the app as usual.
Especially on fresh installs, API ran these Django migrations without a hitch for all our Docker-based deployments.
Turned out, our default value of 3 for the variable Replicas was not only creating three instances of API, Worker, and Scheduler on Kubernetes—the expected outcome—but also making all three instances of the API micro-service attempt a database migration—an unexpected outcome. While one of those three successfully locked on to the migration and ran it until done, the other two kept attempting to access the database, failed, crashed, and restarted. Making a bad situation worse was ↓, a log-line confirmation that told admins the deployment had succeeded.

Consequence? The app would continue to show the spinner until the migration got done.
"Why didn't that happen with Docker and why only Kubernetes-based upgrades", we hear you asking.
Our docker-compose and variables.env shipped with a default of 1 for Replicas. So, while admins saw ↓ with fresh installs and patiently waited for the app to come up, they didn't see it with upgrades—there was just one API micro-service running the migration. Same with Kubernetes fresh installs, although why admins didn't report it for those, we don't yet know.

Nevertheless, wow, what a reveal!
A good solve had to work not just for Kubernetes but also for Docker scaling, i.e., when admins changed the value of Replicas to more than 1.
Enter migrator, a separate micro-service and job on Docker and Kubernetes that moved migrations out of API and blocked all replicas of API, Worker, and Scheduler from starting until migrations were done. migrator would first say ↓.

and then ↓.

The success message ↓ was true success and admins could ask their users to log on with confidence.

Even cooler, migrator was a one-and-done job on Kubernetes, so each time you upgraded, you would essentially download migrator, remove the previous job, and run another one-and-done job.

No more competition to run migrations. No conflicts. Just smooth upgrades all around.
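For the curious, the pattern in docker-compose terms looks roughly like this—service names, image, and command are illustrative; on Kubernetes, a one-shot Job plays the same role:

```yaml
services:
  migrator:
    image: makeplane/plane-backend:latest
    command: python manage.py migrate   # one-and-done (command illustrative)
    restart: "no"

  api:
    image: makeplane/plane-backend:latest
    deploy:
      replicas: 3                       # now safe to scale
    depends_on:
      migrator:
        condition: service_completed_successfully  # wait for migrations to finish
```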
What we learnt surprisingly
- Ship incrementally
We could have beta-tested setup.sh and Kubernetes together. Pulling back on our ability to ship a large block of code and releasing smaller enhancements helped gain beta-testers' confidence and gave us the breathing room to act on their feedback.
- Shoot for parity, settle for familiarity
Our Docker admins didn't worry too much about scaling. Even if you chalk that up to our defaults—Replicas = 1—not one of them told us about the problem with migrations. Kubernetes, though, stands for scaling by its nature. So, while we would have loved parity between the two methods, shipping K8s with Replicas = 3 let us stick to what Kubernetes admins found familiar. That it uncovered a problem and led to a solution for everyone is a happy byproduct.
Solve, phase #4: Data, one-click install, the Go binary, and our own command-line interface
Back to Docker
The larger community on Docker hadn't been party to our behind-the-scenes experiments in Phase #3, so most of them just chose Upgrade whereas all new admins chose Install. By any standard, Phase #3 was going superbly, what with our beta-testing admins loving the installer Bash script and the command-line configure menu. That's when we launched our first pricing experiment—a bespoke Plane implementation and dedicated support for self-hosted users at a flat per-user-per-month price for the entire year.
Plane One and the path to COSS sustainability
Before that experiment, we had offered the Community + the Free Cloud editions to all our users. A significant chunk—we won't disclose those numbers right now—had asked us about pricing and Plane's viability as a long-term solution for org-wide project management. The experiment we ran from January this year got enough traction to signal a set of unmet needs that admins struggled with.
1. Quickstarts for self-hosters
2. Regular, even weekly security and performance updates
3. Support for more hosting services like Railway
4. Monitoring and logs
5. Back-ups and recovery
That started us on the first step toward making the last on the list possible—back-ups and recovery—and what would mark the inception of Plane One.
Making the database more accessible
Despite the success of Phase #3, one evolutionary problem persisted—Docker volumes and how difficult it was for non-expert Docker users to access, back up, and recover that data.

If the instance crashed—which it sometimes did—the data would be lost to such admins. Even for non-disaster needs like expanding storage, Docker volumes posed a challenge.
Phase #4 started with moving the data out of Docker volumes and mapping it to the host machine that admins were installing Plane on. This had two immediate benefits.
- Admins could now access and back up the data without being Docker experts.
- They could increase storage over time because they controlled the host machine's configs.
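In compose terms, the move is from a named volume to a bind mount on the host—a sketch with an assumed image and path:

```yaml
services:
  db:
    image: postgres:15
    volumes:
      # Before: a named volume, managed (and hidden) by Docker.
      # - pgdata:/var/lib/postgresql/data
      # After: a bind mount on the host — plain directories an admin
      # can back up, restore, or grow without Docker expertise.
      - ./data/postgres:/var/lib/postgresql/data
```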
Automatic, scheduled backups and choice of storage are all in the works in Phase #4, but we are getting ahead of ourselves.
Fully automating set-up with setup.sh
While the Bash-based setup.sh worked well for most, admins looking at wider adoption of Plane in their orgs called out two time-consuming problems.
1. Fresh installs were still a function of admin input.
2. .env values were still a function of edits to the .env file.
To make quickstarts possible, we extended setup.sh to work with all AArch64 and x64 Linux machines and took the admin-input steps out.
With just a single-line command—curl -fsSL https://raw.githubusercontent.com/makeplane/plane/master/deploy/1-click/install.sh | sh -—admins could now sit back and watch the script,
- check for a previous installation of Plane
- detect the architecture of the machine
- download the images, docker-compose, and variables.env
- ask for a domain name to host Plane,
- start services, and
- show a URL for accessing the app

There was a --help option, too, that'd show a more helpful menu of options.

Selecting --configure would bring up a far more intuitive configuration step-form that was a big, welcome departure from manually specifying .env values in a file.

Typefully-like steps, easy input of values, nice little Save button. Awesome!
We had made quickstart real for our Community and in record time!
A new CLI and the Go binary
For paid self-hosted instances, we had three non-negotiables.
- Validate the license at first install and then periodically to ensure one license was used with one domain
- Download the license, docker-compose, variables.env, and the container images from a private repository instead of Docker Hub
- Ensure high security for all of it

The last successful iteration of setup.sh, despite being a really cool bit of work, didn't work for this because of how open it was—it was a plain-text, editable file after all—and just how vulnerable Bash generally is. It also wasn't the most intuitive or admin-friendly. Our licensing workflow had to be untouchable, usable, and pretty.
It was time to ditch Bash for an alternative that could turn our script into a secure binary.
We first evaluated popular CLI-builders like Python, NodeJS, and Rust.
- Python was our top choice because we have written Python all our lives, but it ships as plain-text, editable source and is thus not very different from Bash.
- NodeJS was our next top choice. It was easy enough to compile TypeScript into an Electron app for all user operating systems, but the binary would be heavy and a memory hog.
- Rust was in consideration for thirty seconds before we saw the steep learning curve for all of us.
With that, we turned our research to Go and quickly saw how well it fit.
- docker-compose v2 is written in Go, so the compatibility of Go libraries with Docker is as good as it gets.
- Go talks nicely with the Docker daemon for downloading .tar.gz files from a private server.
- It manages our docker-compose operations and associated service actions like start, stop, and restart superbly.
- It can stream Docker logs to the licensing server for troubleshooting and support.
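To make that concrete, here's a minimal sketch using the official Docker SDK for Go—the image ref is illustrative, error handling is trimmed, and exact option types vary a little across SDK versions:

```go
package main

import (
	"context"
	"io"
	"os"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	// Talk to the local Docker daemon using environment defaults.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Pull an image — the same building block a CLI needs for installs
	// and upgrades. The image ref here is illustrative.
	reader, err := cli.ImagePull(context.Background(),
		"docker.io/makeplane/plane-backend:latest", types.ImagePullOptions{})
	if err != nil {
		panic(err)
	}
	defer reader.Close()

	// Stream pull progress to the terminal, like `docker pull` does.
	io.Copy(os.Stdout, reader)
}
```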
Early trials with Go were encouraging enough for us to fully commit to it. By our final internal demo, Go had beaten all our expectations for a secure, usable, pretty CLI that had wings for the future.
Bringing it all to Plane One
This tying-the-loop section should have been called "Ta-da!", but, well, SEO doesn't like that. See just how easy it is to install Plane One for yourself.
What we learnt humbly
- Abstraction is a journey
Before you pat us on our backs, let us be the first to admit there's a lot more we can do with the set-up and our overall deployment landscape, some of which is in the final section below. Abstracting anything away from the user isn't a sudden goal you can achieve. It took us a year to get here, and perhaps it will take us a year more to look back with new learnings, but we will continue to abstract away to the extent possible without trading off controls and smart defaults.
- Open-source engineering isn't COSS engineering
Our journey may have begun with a GitHub repo, but we have always been a COSS company, and we are on the path to engineering Plane for self-hosted scale while offering a parallel Cloud edition. That path, irrespective of the type of COSS software, brings some DevOps considerations into focus quicker than with community-supported open source.
  - Should you prioritize Cloud over self-hosted?
  - How much parity is too much parity?
  - How do you maintain one very large deployment—the Cloud—while shipping features for tens of thousands of smaller deployments—the self-hosted audience?
Lucky for us, we had several playbooks to borrow from to figure out our own deployment thesis. That could only happen because we were always clear about being COSS over community-supported, about owning our roadmap over shared responsibility, and about envisioning self-hosted Plane on paper over figuring it out as we went along. That, in turn, guided our long-term strategy, our journey so far, and our hiring.
What's next for self-hosted Plane
There’s a lot planned for self-hosting in 2024 and beyond. Some of those follow.
- SSL and certs
Encryption at rest and in transit is a baseline expectation for our admins. Caddy is a popular reverse proxy, and it looks like a good candidate for both internal and public certs. It also meets admin requests to replace NGINX.
- Automated, scheduled back-ups
Our CLI will soon support periodic back-ups that admins can set and forget. These back-ups will also be uploadable to S3 or any other external file store of choice.
- Restores before and after upgrades
Not all upgrades work for everyone. We will automatically back up the data before an upgrade and let admins restore their instance to the last-known good config.
- Plane One on Kubernetes
Plane One is Docker-only for now. We will test it with admins like we have since the beginning of Phase #3 and launch it on Kubernetes, too.
- Official support for more IaaS platforms
Our community is already publishing Railway templates and trying out other infra services. We will shortlist our first officially supported IaaS players and ship deployment methods for the first few this year.
- Marketplace apps
Plane is popular with larger companies on AWS and DigitalOcean. Marketplace apps for those two, at the very least, will make for easier set-up and adoption.
More on all of the above soon.