Monorepo: please do (medium.com)
In my experience, this discussion gets convoluted by confusing modularity with monorepo. They are orthogonal to each other; you can have a very modular codebase in a monorepo but also a very coupled (non-modular) codebase with polyrepo.
It's true that monorepos without proper discipline can tend towards coupling. Still, when discussing mono vs poly, we should keep this distinction in mind.
First monorepo I worked on, we used separate compilation units for each 'module'. We paid a tax on build time but it added a bit of friction to adding new cross-module dependencies willy nilly.
I don't know how you maintain that arm's length separation if you don't have compilation units in your language of choice, and that may contribute to some of the muddiness in this kind of discussion. "It depends."
I think the private visibility and shared build chain that Bazel offers could step in here: combined with linting tools and clearly assigned code ownership, it makes it harder to build a project without specifying every dependency...?
It's not; why do I have to check out a terabyte of code that I don't need, even if the code is modularized?
No need to check out a terabyte of code. If your repo is scaling that high, you're going to want a VFS layer. Microsoft made a VFS layer for Git. As you might imagine, you simply grab files as needed, and your version control just deals with diffs for the most part. Google's own monorepo is proprietary, but the Bazel build system is open source and would work great with a VCS hooked up to a VFS layer.
I want to like Bazel. I really do. But on first encounter the syntax is filled with sigils that don't seem to have obvious differences or purpose for existence. Then it turns out that I and others have spent as much time fighting it as using it. Lastly the coverage of ecosystems is sparse and there does not seem to be a lot of activity around extending them -- doing the boring, tedious, unloved work of dealing with everyone's quirks and bugs and corner cases and annoyances (been there, done that).
Again: I wish it was a smooth experience. Because I like the ideas very much. But it wasn't when I tried and I don't know anyone -- outside of Google -- for whom it was a smooth experience.
I can’t speak to the actual implementation, but I’m surprised at your description of the syntax as “filled with sigils”, as the syntax is basically Python -- isn’t that about as easy as you can get?
I find Bazel’s syntax much easier to deal with than other build languages that use JSON (essentially the same Python syntax but with lots of extra quotes everywhere and extra fussiness about where commas are allowed).
I'm sure the double slashes and colon have important differences. It is not obvious what they are.

    bazel build //main:hello-world
It's not instantly obvious why one is :hello-greet and the other is //lib:hello-time.

    cc_binary(
        name = "hello-world",
        srcs = ["hello-world.cc"],
        deps = [
            ":hello-greet",
            "//lib:hello-time",
        ],
    )

I could swear I've seen @ floating around as well.
As I said above, I am sure these are all very sensible. But I am just tired of memorising minilanguages embedded in strings. I don't want to any more.
Completely valid concern not to want to keep memorizing mini-languages.
In this case, the double slashes are absolute "paths" relative to the top of the workspace, and the part after the colon is a relative "path" to another Bazel target.
I put "paths" in quotes because these are meaningfully different from the true filesystem equivalents; avoiding confusion with real absolute and relative filesystem paths is probably why they made their own syntactic mini-language.
[The sibling reply to mine, referencing Piper and Perforce, goes into a bit more detail on the specifics and the origin of the // prefix.]
What would the better way have been for them to do this?
> What would the better way have been for them to do this?
I don't know, off the top of my head (having been on the other side of this conversation, I am aware how frustrating that answer is). But I know I couldn't keep it straight when I was fighting Bazel and that I gave up. And anecdotally I am not alone: I have seen Bazel torn out of multiple projects, sometimes quite painfully.
Bazel is definitely designed for a very different model than how most of the world works. (I.e., Google's internal model.)
This clearly shows in Bazel's Python support: its internal version (Blaze) gets used quite often with Python inside Google's monorepo, and it works very nicely in that role, but that's a very different way of using Python than approximately the entire rest of the world. It's still Python, to be clear, just everything else is pretty different. ;)
Still, Bazel's model is pretty great if you adjust your brain, tooling, and patterns to it. I accept that most people don't. And some of its preferred usage patterns are more trouble than they're worth in a typical small shop anyway, at least with the usual other tooling one has to integrate with.
Tradeoffs...
Piper, Google's source control system, has roots in Perforce. In Perforce, depot roots start with //.
The ":" is a bit different, e.g. just "//lib" means "//lib:lib" - e.g. points to the "lib" target in /lib/BUILD file, while "//lib:hello-time" points to "hello-time" target in /lib/BUILD file. So not having the ":name" in "//dir:name" means name="dir" - e.g. "//dir:dir" - at first this is strange, but then you get used to it. Your default target is named after the folder it's sitting in.
It is not a smooth experience outside of Google because the truth is bootstrapping a proper Bazel setup is not actually that easy. If you want hermetic builds for real, you need a hermetic build environment. Bazel tries to accomplish this with a workspace setup in each repo, but unfortunately it's definitely limited and imperfect.
The Bazel rules for languages are also not perfect imo. For example, I dislike hooking Bazel up to tools like NPM and Webpack. I'd rather have a system that could sync NPM modules into third_party automatically and set up Bazel files for them, plus a bundling system native to Bazel that takes full advantage of its caching and pure building.
Bazel is imperfect on Windows as well. I have tried to help, but admittedly it is hard work and it'll take time. I wanted to get Bazel Watcher working on Windows, but my PR is stalled because the Windows API is truly quite maddening at times. (Feel free to find the PR; it's almost hilarious how convoluted it is to effectively kill a tree of processes. Linux is of course imperfect here too, but it lets you get 95% of the way much more easily.)
However, here's what I will say: if you are in an organization, I think Bazel really shines. If you can take time to write some custom tools and rules and really integrate your software into Bazel, it can be an awesome experience. Sadly the publicly available rules try pretty hard to match existing semantics and fall short of showing off how nice Bazel can be in some cases, but I think C and C++ are a great area where Bazel shines above the pack.
Another plus: it is Amazing having a build system that crosses languages. Does your Python script depend on a C module and connect over TCP to a Go program? No problem, all of that is easy to express. Do you want a Go script that writes a TypeScript file that gets compiled and bundled into your app's JS bundle? Once again this is all fairly natural, and you can easily accomplish it with a simple combination of normal build rules and a genrule.
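For instance, a codegen step like that can be wired up with a plain genrule. This is only a sketch with made-up target names (the Go tool here is hypothetical), not a prescription for any particular layout:

    # A Go binary (declared elsewhere, hypothetically //tools/codegen:emit_constants)
    # writes a TypeScript source file that downstream rules can consume
    # like any checked-in source.
    genrule(
        name = "gen_constants_ts",
        outs = ["constants.ts"],
        tools = ["//tools/codegen:emit_constants"],
        cmd = "$(location //tools/codegen:emit_constants) > $@",
    )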
And Starlark is a reasonably complete almost-subset of Python, so it's easy to compose, extend and refactor your rules. If you want to generate a matrix of targets for say, testing across browsers and platforms, you can do that, and make it reusable too.
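As a sketch of what that can look like (the macro name, rule choice, and browser/platform lists are all invented for illustration), a Starlark macro in a .bzl file can stamp out one test target per combination:

    # browser_matrix.bzl (hypothetical)
    def browser_test_matrix(name, srcs, deps, browsers, platforms):
        """Generates one py_test per (browser, platform) combination."""
        for browser in browsers:
            for platform in platforms:
                native.py_test(
                    name = "{}_{}_{}".format(name, browser, platform),
                    srcs = srcs,
                    main = srcs[0],  # explicit entry point; generated names won't match a filename
                    deps = deps,
                    args = ["--browser=" + browser, "--platform=" + platform],
                )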
Basically my advice with Bazel:
- Check out how well it works with C and C++, and I think Java also works quite well. This should give you an idea of how it looks when done right.
- Don't constrain yourself to what Bazel offers in terms of rules. Starlark is hugely powerful and you can easily make your own rules for things.
P.S.: the weird path syntax is probably many parts legacy, but it's not actually super hard to understand. When you see a colon, the left side of the colon is a path to a folder, and the right side is a target name. When you see double slashes, it means absolute path relative to root of workspace. If the colon is omitted the target name is assumed to be the same as the folder name.
//:base -> the base target in the BUILD file in the root of the workspace
//base -> //base:base -> the base target in the BUILD file in the base folder relative to the root of the workspace
//app/ui:tests -> the tests target in the BUILD file in the app/ui folder relative to the workspace root
:genfile -> the genfile target in the BUILD file in the current directory
There is some context sensitivity about how to refer to files versus targets and whether you're referring to runfiles, output files, or build files, but most of the time it's surprisingly obvious actually. When it comes to files versus targets, it largely works a bit like Make except there's namespacing for input files vs output files (and runfiles, but that's another topic.)
There is also an @ syntax used to refer to paths outside the current workspace. It mainly comes into play when importing rules.
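For the curious, a minimal sketch of where @ comes from (the repository name and URL are invented): an external repository gets declared in the WORKSPACE file, and its targets are then addressed as @repo_name//package:target.

    # WORKSPACE
    load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

    http_archive(
        name = "some_external_lib",
        urls = ["https://example.com/some_external_lib-1.0.tar.gz"],
        # in practice you'd also pin a sha256 here
    )

    # some BUILD file in your own workspace
    cc_binary(
        name = "app",
        srcs = ["app.cc"],
        deps = ["@some_external_lib//:lib"],  # "@" = a target outside this workspace
    )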
> However, here's what I will say: if you are in an organization, I think Bazel really shines. If you can take time to write some custom tools and rules and really integrate your software into Bazel, it can be an awesome experience. ... Another plus: it is Amazing having a build system that crosses languages.
This is pretty much what I think of when I want to like Bazel. I wish we had it on Cloud Foundry. Or, rather, I wish it had existed 5 years ago and had been used on Cloud Foundry from the beginning, because CF and its associated projects have hundreds of repositories and these have mostly been kept in sync through mountains of tests and oceans of automation. It works, but I know that in another universe it works better.
Would you attribute the c/cpp success to a lack of "native" build tool?
I would say it is likely that the lack of a native C++ build tool helped Bazel to not have to compromise on how it integrates compilers into the system. I think that C++ is also just a good fit for the design; not all languages will. Interpreted languages fit into the system a bit less well in my opinion (but I still like that it is treated with some level of consistency.)
And what would you say about https://news.ycombinator.com/item?id=18821549?
Sorry not copying it here to avoid repost.
If a mono-repo has a terabyte of code, or if 10 small repos have 1/10th a terabyte each, what have you really gained? In any case, git LFS solves large file storage effectively, as do a number of other artifact storage solutions, and a repo with a terabyte of code is _not_ going to be trivially split apart, since it would be by a factor of thousands, the biggest codebase ever created by humankind.
If I only need to check out one of the smaller repos then I've gained quite a lot in terms of download speed, storage size, etc. Git LFS adds a lot of complexity I'd rather avoid.
Sure but then you only have some small portion of the total infrastructure, which adds its own layer of complexity for the people reviewing your changes :P It's all trade offs, is all I'm saying - I honestly still can't decide between the two, although for all companies sub 20 people, I'd for sure stick with a single repo.
If I'm working on Application X, wtf do I care about infrastructure code? Or for that matter, as a specific... if someone is working on Google Maps, should they care about the codebase for Google Inbox for Android?
You may be relying on a shared component for your app; you simply put a deps reference to it in your Bazel (Blaze) BUILD file - e.g. "//base:something". That "//base:something" might itself rely on other deps, but those should not be your concern.
So - what's stopping you from depending on (using) anything else? Or how do we stop you from doing this? Bazel (Blaze) has visibility rules, which by default are private - i.e. the rules in your package are hidden unless explicitly made public, or alternatively you can whitelist, one by one, which other packages (//java/com/google/blah/myapp) can depend on you.
Let's say there is a new cool service and your team wants to try it out, but it's not out there for everyone to use - it's in alpha, beta, whatever stage. So you ask the owning team for permission, or simply create a CL adding your package target, name, or "..." folder resolution so that you are whitelisted - and eventually you will be (if it's a good idea, and approved). Or the reverse: if some library got deprecated and has slowly been replaced with another, instead of being "//visibility:public" it may now just whitelist its last remaining users. It's probably not a good idea to get added to that list, as the whole thing is going away soon (yes, Google tends to deprecate internally even faster than externally... which is good!). But such mechanisms are helpful in getting this to work correctly.
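In BUILD-file terms, that whitelisting looks roughly like this (target and package names are made up; the visibility specs themselves are standard Bazel):

    # base/BUILD
    cc_library(
        name = "something",
        srcs = ["something.cc"],
        # Default would be private to this package; this explicitly lists who
        # may depend on //base:something. Everyone else gets a build error.
        visibility = [
            "//java/com/google/blah/myapp:__pkg__",   # a single package
            "//some/other/team:__subpackages__",      # a whole subtree
        ],
    )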
Does Application X rely on particular infrastructure configuration? Or does Google Inbox on Android integrate with Google Maps?
There are dependencies everywhere. Monorepos are one of the tools which can be used to make dealing with them easier in some cases. They're not an absolute solution appropriate for all circumstances, but no tool is!
> If a mono-repo has a terabyte of code, or if 10 small repos have 1/10th a terabyte each, what have you really gained?
If it's a small company where every developer touches every part of the application, sure. Taking the FAANG approach if you're not part of that acronym sounds like introducing inefficiency.
If it's a "small" company then I'd expect that one Git repo would do just fine for all or at least most of the code. When I think small, I think ~10 or 20 developers. If you have reasonable hygiene about things like keeping binaries out of your Git repo (excluding consideration of e.g. LFS here) then the whole repo size will stay fairly reasonable. As long as you have one or two Git mavens on your team it should be dandy.
I'd expect to see problems with this approach once you get into the 100s or 1000s of developers. The tooling for this scale of repository isn't as mature.
Sorry, what am I missing? That's exactly what I was saying - this stops making sense anywhere in between "small" and "the big boys"
> Taking the FAANG approach if you're not part of that acronym sounds like introducing inefficiency.
Is this not saying that small companies should avoid monorepos?
Specifically excluded in the preceding sentence in my post.
Sounds like a tooling problem. We shouldn't use the current state of tooling as an excuse.
Isn't the entire argument about the current (or maybe "immediately foreseeable") state of tooling? We don't really care one way or the other, in a philosophical sense. What works?
When the tools aren't good enough, we can either toss up our hands and say "I guess it's always going to be like this!", or we can get to work and make better tools.
This is an argument about how to use current tools. TFA doesn't argue that mono will be great once we work really hard. It argues that mono is great now. Thread parent has a specific objection to that argument. You don't reasonably counter that objection with statements about morality.
A few things to note:
- I was replying to a comment, not the article.
- The article spoke about points that were largely independent of the current or future state of tooling. Instead, it focused on fundamental issues with mono- vs poly-repo systems. Most directly, being forced to fix migrations and incompatibilities immediately rather than letting versions skew.
If you want to batter someone for not arguing for or against the points in the article, you can do it with the comment I was replying to, or with your own comment just now.
It's not for everyone, but damn, why is there a TERABYTE of code? Just curious - assets? checking in binaries?
Test protos. Evaluated configs. Golden data. JAR archives, etc.
Signs your build system is never going to be adopted outside of people cargo culting you?
- [x] Namespaces and the like without much security benefit
- [x] Giant Java dependency
- [x] Strange syntax and glyphs

We have a perforce monorepo with ~80gb total payload for the whole thing, but everyone uses streams to filter it, so that's not a problem.
I think there's a false dichotomy here.
In the post yesterday one of the arguments was that if nobody checks out all of the code then what's the value of having the code all in one place?
Last monorepo I worked on, individual contributors checked out just the tree they were working on (we had a suite of applications with several shared modules). We made it simple and straightforward for them to get what they wanted and ignore people whose work didn't impact them.
But the senior people, who were better with architecture and version control trivia, checked out the entire thing. They would steward any cross-cutting changes that needed to be done, and make sure any callers to shared libraries were updated in the face of breaking changes. They were also backstopped by the build plans, (some of) which also checked out the entire thing.
Streams aren't modules -- they're views. If someone takes you as a dependency and wants you to have visibility on them they add themselves to your stream so you pull down their directory as well.
Chances are you will end up downloading a lot of dependencies anyways, why not have git deliver it all?
Huh? You'd download dependencies for the repos you need, not the code and dependencies for the entire company.
It could be several orders of magnitude larger, and with a larger organization a lot of it could be unnecessary code that any given dev may never touch.
But imagine the increased productivity of your devs if they only had to check out a single repo. Anyone has the same organization of projects on their machine. All tools are in one place...
I don't understand. Where is the argument for more productivity?
A: You avoid issues such as Readme files stating, "before compiling you have to git clone ../commonA, ../commonB". These always tend to get stale, so in reality you also have to git clone ../commonC, wasting tons of hours of troubleshooting.
B: A developer working daily in component A finds a bug in component B. He just has to change the code and commit it for review, instead of learning the specifics of working with component B's repository.
We have one large-ish repo that keeps showing "This repository currently has approximately 547 loose objects."
We keep pruning and gc'ing with different flags, but pulls just seem far slower than other smaller repos.
The reason I am upvoting this is that it is written in a positive tone. Too many people - especially in the world of DevOps - trash everything ("X is the worst, don't do that", etc.) and more often than not offer no better guidance to go with their whiny tone. We need more "please do's" in this industry. Thank you Adam.
I have personally migrated a medium size polyrepo code base (something like ~20 repos?) into a monorepo and I agonized over the decision. But it lifted a huge weight off my shoulders.
I feel like if you are working completely in the open-source world, and you are contributing one open-source project to a larger array of available projects, then the decision to use a polyrepo makes a lot of sense. You can submit libraries to a package repository like Yarn/NPM/PyPI or you can use Git references for e.g. Go's package manager.
But what I experienced with polyrepos outside this world is that we ended up with a weird DAG of repos. It was always unclear whether a specific piece of code that was duplicated between projects should be moved into one dependency or another, or whether it should have its own repo. Transitive dependencies were no fun at all; if you used git submodules you might end up with two copies of the same dependency. You might have to make a sequence of commits to different repos, remembering to update cross-repo references as you go, and if you got stuck somewhere you had to work backwards. This feels like a step backwards, like the step backwards from CVS to RCS.
Again, in the open-source world you might have some of this taken care of by using a package manager like Yarn. But if your transitive dependencies aren't suitable for being published that way, it can be tough. Monorepo + Bazel right now is a bit rough around the edges but overall it's reduced the amount of engineering time spent on build systems.
On the other hand, it's not like Bazel can't handle polyrepos. In fact, they work quite nicely, and Bazel can automatically do partial checkouts of sets of related polyrepos, if that's your thing.
As for VCS scalability problems, I expect that Git is really just the popular VCS du jour and some white horse will show up any day now with a good story for large, centralized repos with a VFS layer. In the meantime any company large enough to experience VCS performance problems but not large enough to have their own VCS team (like Google and Facebook do) will suffer, or possibly pay for Perforce.
Bazel has a WORKSPACE file that can work with multi-repos, but AFAIK things are still rough there, though they should get better eventually (I'm a bit hand-wavy on the details).
Yes, exactly. Unfortunately, sometimes the partial checkouts can be somewhat limited by the fact that your WORKSPACE code will import Starlark defined in other repos. This can get a bit ridiculous if your repo uses a bunch of different languages, if you browse through e.g. the TypeScript support instructions for Bazel you’ll see some of what you’re in for.
If your project is mostly something like C++ (which has support built-in to Bazel) then the WORKSPACE rules will be much more manageable and partial checkouts become a lot easier.
Its almost as if both approaches have positives and negatives. Some of which are more important depending on your project and organization.
I'd be more interested to read about a project or company that failed due to making one choice or the other. And then by switching things to the other way, things were fixed.
Otherwise, as someone who has worked with both, I imagine there are a host of other decisions that will be much more determinant of your success.
Let's not get too wrapped up in what color to paint the shed.
>Its almost as if
Please don't do this.
I work at a large organization (2000+ devs). We have used both a Monorepo and a Polyrepo. After some extensive experience with both models my conclusion is that the Monorepo is by far a superior model, especially for a large organization.
Of course the Monorepo is not free of downsides, those mentioned in the original article are real, although a bit exaggerated in my opinion. VCS operations can be slow and scaling a VCS system is challenging, but possible. And the risk of high coupling and a tangled architecture is also very real if you don't use a dependency management system like Bazel/Buck/Pants.
But in my opinion the downsides of the Polyrepo are much worse and much much harder to fix. The main problem is that you need a parallel version control system like SemVer on top of your VCS. SemVer is fine for open source projects but for a dynamic organization is a nightmare because it is a manual process prone to failure. SemVer dependency hell is really hard to deal with and creates a lot of technical debt.
Additionally, once you go Polyrepo you lose true CI/CD. Yes, you still have CI/CD pipelines, but those apply only to a fraction of the code. Once you get used to running `bazel test` and knowing you will run every single test of any piece of code that could depend on the code you just changed, you never want to go back. Yes, you could have true CI/CD with Polyrepos, but it requires a lot of work and writing a lot of tooling that does not exist in the wild. It is cheaper to invest in scaling your VCS than in multi-repo tooling.
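That workflow is roughly the following (the target name is invented; `rdeps` and `kind` are standard `bazel query` functions):

    # Run every test that transitively depends on the code you just changed.
    bazel test $(bazel query 'kind(".*_test", rdeps(//..., //base:something))')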
My org went from polyrepo 10 commit semver dependency hell when updating an internal API to monorepo and it saves a lot of time. Unmigrated semver breaking changes are a form of technical debt, and it takes a lot more total man hours to do the 'proper' one by one many commit poly repo migration than the other way around.
If we had the tooling to do multirepo atomic commits and reviews then maybe we would have stuck with polyrepos, but it doesn't really exist out in the wild, so monorepo it was.
> Unmigrated semver breaking changes are a form of technical debt
Maybe you can clear my confusion. If Module B is dependent on Module A, then every version of B should refer to a specific version of A, correct? What is there to break? Development can continue on A without interfering with B, and then you can uptick B once it points to a later A.
I'm not sure what this has to do with the mono/poly discussion.
Engineering resources are not unlimited, so naturally new bugs and features will be updated only on master vs. 2 or 5 major semver branches, because 2-5 module Bs haven't bothered updating yet. If you maintain 5 separate branches, then you're spending that much more engineering resources for little benefit, because you don't have external customers. So the modules that haven't migrated yet decay under a state of deferred maintenance, which is a form of technical debt.
To avoid that, you do 10 migration commits so everyone is on the latest version. If you're going to do that as standard operating procedure anyway might as well make it far easier and have a monorepo.
My org went from a monorepo where every project had to obey the same CI model and you could not introduce entirely new CI tools for new prototypes over to a polyrepo with separated semver library repos for shared dependencies, and it simplified everything so much.
Adding additional PRs across different repos is functionally no different than the same PR with scattered dependencies in a monorepo, except that separating the PRs makes each isolated set of changes more atomic and focused, which has led to fewer bugs and better quality code review and, the hugest win, each repo is free to use whatever CI & deployment tooling it needs, with absolutely no constraints based on whatever CI or deployment tool another chunk of code in some other repo uses.
The last point is not trivial. Lots of people glibly assume you can create monorepo solutions where arbitrary new projects inside the monorepo can be free to use whatever resource provisioning strategy or language or tooling or whatever, but in reality this not true, both because there is implicit bias to rely on the existing tooling (even if it’s not right for the job) and monorepos beget monopolicies where experimentation that violates some monorepo decision can be wholly prevented due to political blockers in the name of the monorepo.
One example that has frustrated me personally is when working on machine learning projects that require complex runtime environments with custom compiled dependencies, GPU settings, etc.
The clear choice for us was to use Docker containers to deliver the built artifacts to the necessary runtime machines, but the whole project was killed when someone from our central IT monorepo tooling team said no. His reasoning was that all the existing model training jobs in our monorepo worked as luigi tasks executed in hadoop.
We tried explaining that our model training was not amenable to a map reduce style calculation, and our plan was for a luigi task to invoke the entrypoint command of the container to initiate a single, non-distributed training process (I have specific expertise in this type of model training, so I know from experience this is an effective solution and that map reduce would not be appropriate).
But it didn’t matter. The monorepo was set up to assume model training compute jobs had to work one way and only one way, and so it set us back months from training a simple model directly relevant to urgent customer product requests.
Had we been able to set this up as a separate repo where there were no global rules over how all compute jobs must be organized, and used our own choice of deployment (containers) with no concern over whatever other projects were using / doing, we could have solved it in a matter of a few days.
In my experience, this type of policy blocker is uniquely common to monorepos, and easily avoided in polyrepo situations. It’s just a whole class of problem that rarely applies in a polyrepo setting, but almost always causes huge issues with monorepo policies and fixed tooling choices that end up being a poor fit for necessary experiments or innovative projects that happen later.
> each repo is free to use whatever CI & deployment tooling it needs, with absolutely no constraints based on whatever CI or deployment tool another chunk of code in some other repo uses.
Hear, hear. Let teams choose the processes and tools that work best for them. In previous release engineering positions, I resisted the many attempts to introduce a single standard workflow for all projects. The support burden of letting a thousand flowers bloom was not great, but the benefit was that devs understood their project and were empowered to make changes when the business requirements changed faster than standardized tooling could.
We had a few contracts for standard behaviours, but they were low-overhead: must respond to 'make/make test', have a /status endpoint that 500'd when it was unhealthy, register a port in the service conf repo, etc.
> except that separating the PRs makes each isolated set of changes more atomic and focused
It makes it less atomic if you need simultaneous changes in multiple repositories.
> Had we been able to set this up as a separate repo where there were no global rules over how all compute jobs must be organized, and used our own choice of deployment (containers) with no concern over whatever other projects were using / doing, we could have solved it in a matter of a few days.
I think this was an organisational problem, but I accept the argument that monorepos will provide a seed around which such pathologies can crystallise. But I don't believe it's the only such seed and I don't think it's an inevitable outcome from monorepos.
> It makes it less atomic if you need simultaneous changes in multiple repositories.
No, each individual set of changes is more atomic (smaller in scope, mutating a system from one state of functionality to a new state of functionality).
The problem is that it’s a linguistic fallacy to act like in the monorepo case “the system” is the sum of a bunch of separate systems (it isn’t, because they are not logically required to depend on simultaneously transitioning). So in that monorepo case, to move subcomponent A from some state of functionality to a new state of functionality, you unfortunately have to also make sure you include totally unrelated (from subcomponent A’s point of view) changes that also correctly transition subcomponent B to a new state of functionality, and subcomponent C, etc., which is exactly less atomic (to transition states, you are required to have simultaneous other transitions that are not logically required for any reason other than the superficial sake of the monorepo).
> simultaneous other transitions that are not logically required for any reason other than the superficial sake of the monorepo
I don't see what's superficial about "everything everywhere is in sync", myself.
And I have absolutely seen PR race conditions. Assuming that everyone perfectly sliced up the polyrepo on the first go is optimistic.
> “I don't see what's superficial about "everything everywhere is in sync", myself.”
Well it is superficial by definition, because two unrelated things are “in sync” only because you say so. The very meaning of “in sync” in your sentence is some particular superficial standard you chose that has nothing to do with the logical requirements of the isolated subcomponents (i.e. “in sync” meaning two independent subcomponents were adjusted in the same large commit or PR is, by definition, superficial... it’s just a cosmetic notion of “in sync” you chose for reasons unrelated to any type of requirement).
I work on a polyrepo. The code in repo A has a dependency on the code in repo B. When I update B, I sometimes need to update A.
In a monorepo that's already done when I finish working on the modules in B.
That I am unable to release from A until it has been synced with the module in B is not "a cosmetic notion". It's being unable to release. I consider releasability at all times to be the most important invariant to be sought by the combination of tests, CI and version control.
> “In a monorepo that's already done when I finish working on the modules in B.”
This is not usually true in monorepos or polyrepos, and is quite a dangerous practice that nobody should use and hasn’t got much at all to do with what type of repo you use.
I worked in a monorepo for a long time where you still had to deploy versioned artifacts. So when you make changes to B, you still have to bump version IDs, pass deployment requirements and upload the new version of B to internal pypi or internal maven or internal artifactory, etc.
Then consumer app A needs to update its version of B, test out that it works and that, from app A’s point of view, it is ready and satisfied to opt-in to B’s new changes, and do build + deploy of its own to deploy A with upgraded B.
Doing this in a way where a successful merge to master (or equivalent notion) of a change for B is suddenly a de facto upgrade for all the consumers of B is insanely bad for so many reasons that I’m not even going to try to list them all. Monorepo or not, nobody should be doing that, that is bonkers, crazy town bad. It’s a similar order of magnitude of bad as naively checking credentials into version control.
I think you're conflating wire format changes (which agreed, should be versioned and backwards compatible) with code level api changes. If V2 of xyz.h adds an argument to some method, a polyrepo just updates the tests and submits the change. In a monorepo, you can't submit until all clients are also updated.
No, even in a monorepo you can submit the code whenever you want, and have CI publish a versioned artifact from just that submodule / package / whatever. Other client code in the same repo can happily keep going along never caring about those new changes until later when explicitly ready to adopt them via adopting the new version.
There’s no reason why CI in a monorepo can’t create versioned code artifacts like Python packages, Java libraries or special jars, Docker containers, whatever. This is a very common workflow, e.g. combining a monorepo with in-house artifactory.
Definitely not talking wire format changes. Talking about publishing versioned libraries, jars, etc., from subsets of monorepo code.
Why would you do versioning in a monorepo?
>There’s no reason why CI in a monorepo can’t create versioned code artifacts like Python packages, Java libraries or special jars, Docker containers, whatever. This is a very common workflow, e.g. combining a monorepo with in-house artifactory.
Correct, and this is necessary. But there's no reason for A to depend on B from the artifactory instead of A just depending on B at a source level, and building A and B simultaneously and linking them together. Now you have fully hermetic, reproducible builds and tests.
Why is not doing versioning so insanely bad that you can't list all the reasons? (This would be a much more interesting discussion if you did.)
> This is not usually true in monorepos or polyrepos, and is quite a dangerous practice that nobody should use and hasn’t got much at all to do with what type of repo you use.
I guess compilers must work differently for you.
Huh? This comment makes no sense. You just use a build file definition to pull in the version of the compiled artifacts you need according to whichever version you want, no different than how you would include third party dependencies.
For example, in the most recent monorepo I worked in, most everything was written in Java and Scala. But when you compile consumer app A that depends on submodule B, it does not just naively use the code of submodule B already sitting at the same commit of the monorepo. That would be terrible, because it would mean if anyone changed some code in submodule B, then app A has been silently upgraded just by default.
Instead, the necessary shared object / jar / whatever is compiled only for submodule B, which is then uploaded with its new version identifier to the internal artifactory that stores the compiled jars, shared objects, whatever (and stores Python packages, containers, and many other types of artifacts too).
Now when you compile app A, it retrieves the right artifacts it needs from artifactory, to treat submodule B like a totally separate third party library, and app A is free to specify whatever version of B that it needs, no different than specifying open source third party dependencies.
It really seems like you are willfully trying to act like you don’t understand what I’m saying. This approach works perfectly for compiled languages and artifacts, that’s one of the primary use cases it is designed from the ground up to solve.
This has nothing to do with monorepos though. Its entirely a company policy issue. There's nothing about the monorepo that prevents you from writing a script that ran on precommit and built and deployed via docker to a test cluster.
Unless you mean your presubmit test would push to production machines, that's bad and shouldn't be allowed, but again has nothing to do with a monorepo.
A company could just as easily have draconian policy about testing and deployment and multiple repos. Maybe you could break the rules (hell you could have broken the rules in monorepo land), but again, that's just a rules issue, not an issue of the repository.
If a tool begets using it wrong all the time, then after a certain point, it’s the tool’s fault.
What you’re saying amounts to something of a No True Scotsman fallacy... “no _real_ monorepo would limit different projects from using individualized tooling if needed...” Yet that limitation suspiciously coexists with monorepo tooling frequently, and does not frequently coexist with polyrepo tooling.
>If a tool begets using it wrong all the time
This is the (wrong) assumption. Like I said, there's nothing about a monorepo that "begets" draconian policy. Your anecdotal experience is not a rule. The monorepo I work with doesn't have draconian policies about how tooling must work. There are apis and recommended tools, and if those don't fit your needs (which is unusual), the teams that maintain those tools are willing to support your uses, but if not, you're also free to hack yourself something that works. Writing additional pre-commit hooks is encouraged.
> What you’re saying amounts to something of a No True Scotsman fallacy
Again, no. Certainly monorepos can do this. They're still real monorepos. But polyrepos can too. They're still polyrepos. It's orthogonal.
There absolutely is something about monorepos that begets monopolicies: exactly the very thing that makes them co-occur. It doesn’t matter if it’s sociological or technological, the co-occurrence itself is the thing.
I’d flip it around and say instead that you are assuming the properties by which to compare the two approaches ought to be properties that are roughly like “first principles” and that no first principles difference really exists between them in terms of limiting what you can do.
But this is the wrong way to look at it because, pragmatically, it’s simply just not the sociological phenomenon that actually happens as a side effect in terms of the practical result. Who cares if there’s a first principles reason for them to be different in terms of effectiveness? I certainly don’t— they just are different in terms of effectiveness.
>There absolutely is something about monorepos that begets monopolicies
Correlation (and a weak one at that) is not causation.
I can just as easily suggest that monopolicies beget monorepos, and that indeed makes a lot more sense. It's easier to enforce global standards when there's a single repo. So companies who wish to enforce draconian standards may move in that direction. That says nothing about companies that don't wish to enforce draconian standards, though.
If monorepos enable monopolicies (even if monorepos are compatible with usage that doesn’t involve monopolicies), that’s perfectly good reason to avoid them regardless of whether they cause monopolicies.
In “The Beginning of Infinity,” physicist David Deutsch makes a point like this about styles of government. Deutsch suggests the defining characteristic of a good governmental system should not be whether it consistently produces good policies, but instead that there is an extremely low-cost barrier to removing bad policies once it becomes clear they are bad.
Thinking this way, if monorepos permit a situation where there are monopolicies about allowed languages, allowed deployment tooling, etc., and those policies cannot be quickly discarded when it becomes clear they are bad for a certain business goal, then this is perfectly good reason to disfavor monorepos regardless of whether they cause the bad policies.
I think your responses continue to miss the point because you’re talking about correlation and causation as if it matters in a situation like this: but it precisely doesn’t matter.
If a tool doesn’t actively prevent certain policy failure modes (even if it does not cause anyone to choose a bad policy), that is a relative failure of the tool.
Contrasting with polyrepos where it is quite harder to enforce failed monopolicy ideas is one area where polyrepos are a better tool: to misuse polyrepos policy-wise you have to go way out of your way and add a lot of draconian policy enforcement tooling that often can still be circumvented. Those inherent barriers are a good thing that monorepos don’t have.
Separately, I’d also say that the political failure mode where central IT wants to enforce draconian policies is extremely common, and those types of organizations specifically see a monorepo as a tool of (their desired) oppression and control.
Since the base rate of occurrence of horrible companies is super high among all companies, it probably does mean that P(bad | monorepo) is pretty high - i.e. that having a monorepo is decent conditional evidence of a bad workplace culture.
No. A user may wish to avoid a company given that they have a monorepo, but you've not given any reason for an it department to avoid a monorepo, which is essentially what this article is about.
Yes, I have clearly given strong reasons for an organization to avoid monorepos. Gainsaying that without supporting comments doesn’t achieve anything.
Can we just move along and get to "Monorepo: Maybe do it, maybe don't. Just think it through and own your decision"?
Both monorepos and polyrepos have advantages and disadvantages. Many factors — scale, overall team quality and experience, level of integration between projects are a few that come to mind — will affect how much those advantages and disadvantages matter to any given company at any given point in time. The right choice for you isn't necessarily the right choice for me.
Much more important than which approach you choose is understanding, and accepting, the consequences of your choice. You'll want to extract value out of the advantages, you'll need to mitigate the disadvantages. You won't be able to adopt tools and processes meant for the other approach without some degree of friction.
That's what most people do. They just don't blog about it.
Observe how the verb "force" gets used 6 times. Monorepos "force the conversation." You the individual contributor are "forced to deal with the situation" and "forced to see the upfront cost" of breaking contracts. Your team is forced to "look up from their component, and see the perspectives of other teams and consumers."
All this forcing people to do things the Right Way (my way) is surely part of the pushback against monorepos.
But set that aside for the moment. Let's suppose defaults should force people to do things the Right Way, and that we also know what the Right Way is.
Instead of letting anyone sloppily depend on any code checked into the monorepo, shouldn't we force people to think long and hard about contracts between components -- the default concern in a polyrepo architecture? When and how to make contracts, when and how to break contracts? Isn't this how Amazon moved past their monorepo woes, adopted SOA, built AWS, and became one of the largest companies on earth? Heck, isn't this how the Internet itself was built?
Author here. The Right Way :tm: is situational - there isn't one right answer to things like when and how to make contracts, or how to break them. When I used the term "force", you'll see that I'm usually talking about dialog between people and teams.
It's not that it's a single right way to do it. There isn't, and anyone who tells you there is has something to sell you, or is inexperienced enough to not have seen enough of the problem domain.
What is for certain: teams need to have tooling that causes the conversations and behavior that lead to the outcomes we want. As systems and teams scale large enough, this tooling becomes essential - without it, teams go their own way, and in so doing, may or may not create the culture needed for the outcomes you want.
I have never once in my career, so far, had to tell a team to communicate less. When we're talking about engineering organizations that are large enough to diverge, you must solve these problems somehow, and it needs to be systemic and intentional.
Thanks for the response. Out of curiosity, how does your engineering organization introduce new dependencies within the monorepo? Can B, C and D all depend on A without A's consent or even awareness? (Suppose A is some checked in code that's useful, going to see updates in future, but is dormant at present.)
Your post puts a lot of the onus on A for breaking B, C, and D, but I think equal care and consideration needs to come from the other side of the contract. Eg, What are you depending on? Is it a dependency you want to take on, or are you and the shared code likely to diverge in life? These are top of mind decisions in a polyrepo architecture, but from my experience they're often not even considered in a monorepo. Anything checked in is fair game for reuse. This is why I suspect you may be "forcing" the wrong thing.
For reference I've worked in companies large and small, both monorepo and polyrepo. When I worked on Windows back in the 00's the monorepo tooling (SourceDepot) was quite amazing for the time, but the costs of that sort of coordination were also painfully apparent to everyone.
The place I currently work has a monorepo for desktop software and polyrepos for everything else. It isn't a straight up A/B experiment, but anecdotally the pain is higher and shipping velocity lower in the monorepo half of the world. Most of the monorepo pain is related to CI or other costs of global coordination, the kind of things Matt touches on midway (albeit probably too subtly). I'd be interested to see your counterarguments to those points as well. Do you need fancy dependency management tooling to make your global CI builds fast and reproducible? Matt argues those end up being equivalent to the kind of dependency tooling that's intrinsic to polyrepo architectures anyway.
Disclaimer: it depends. :) Since that's not a good answer at all, I'm going to write the rest of this as if I have the answer, even though I know I do not, because it's deeply situational.
Equal care does need to come from the other side of the contract. Most frequently, I see teams B, C, and D in a polyrepo world do the worst of all worlds: take dependencies liberally, pin them in place, and try to forget about them. Of course, high functioning engineering teams (and cultures) will try and avoid this: they will be thoughtful about dependencies, and they will keep them up to date. In practice, they most frequently do not. This is especially true in the enterprise broadly. When we get it wrong, and take a dependency we wish we hadn't, how do we know? When do we know? What is our recourse? If I depend on code in the monorepo that diverges, I'm more likely to know near to the point of divergence (because of the nature of the system). That means the conversation about how to fix it happens sooner. I'm not interested in avoiding error - that's going to happen. I'm interested in how close to the introduction of the error do we understand it, and how do we communicate about its remediation.
As far as CI and global coordination goes, the cost is high in either direction if the system is distributed, and the solutions are similar in my experience. I think the worst case is the mixed one (which is a world I inhabit) - you wind up splitting your investment in both style and effort across both approaches. With the monorepo style, one big advantage is where the complex CI interactions can be encoded, since you have access to more of the code itself. Granted, at scale, you likely are testing against artifacts rather than point in time commits outside of the component in question (this is very similar to what you're going to do in a polyrepo, too.)
I think solid testing design requires real effort and understanding of the system under test, regardless of repository layout. Which brings us back to communications again. The more you can see, and the more clearly the pain is experienced across the teams, the more likely you are to have the critical conversations needed to improve the system - rather than making local fixes ("my team's tests are fast", "their component sucks").
> Most frequently, I see teams B, C, and D in a polyrepo world do the worst of all worlds: take dependencies liberally, pin them in place, and try to forget about them.
This has been my observation as well, minus the value judgment. Why is pinning dependencies and moving on with life the worst thing in the world? As you point out in your article, a security fix in A does suddenly force B, C, and D’s hand. Another scenario I’ll add to that: if A provides communication between B, C and D, a synchronized update to all dependents might be required.
Thing is, I’d argue these scenarios are the exception to the rule. If you’re drawing boundaries in the right places (again this may come back to contract design) you’re largely free to change implementation details when you need to, on your own terms, and not because some distant transitive dependency has decided it’s time for your build to break.
With monorepos I see lots of the latter. Lots of breakage for no other reason than “everyone needs to be on the same page.” Lots of conversations — O(N^2) conversations, times some constant factor — that might not need to take place, ever, but it’s critical the entire company have them right now because the global build is broken.
Here’s another way of looking at it. Until a few years ago, it was standard practice to frequently update npm dependencies against fuzzy semvers. Now most people pin their dependencies, and their dependencies’ dependencies, with a lockfile. And in other ecosystems like go’s you also have tooling to support much more controlled, infrequent and minimal dependency upgrades (see MVS).
Why the change? Because people got tired of things breaking all the time. They wanted off the treadmill so they could Get Things Done again. I don’t see how monorepos provide this stability, and frankly it seems like the monorepo idea is where npm was about 5 years ago. Perhaps even farther behind than that, since C, C++ and others haven’t even evolved viable language package managers yet.
You’re a rust fan, so maybe cargo + a monorepo is a sweet spot I haven’t encountered yet? Anyway, I do really appreciate you taking the time to share your perspective on these things. It’s been great having a reasonable discussion about them.
If you've got a pre-merge build check you can't break the global build in a monorepo. That's the benefit: the one introducing the breakage will get a fail in your CI. There is no need for other teams to catch up.
By doing this you only ever "step" a dependency one at a time and one minor version at a time, so you only get very few and very small breakages each time. Instead of locking your depfile and then, 6 months down the road, realizing you need a security fix in component foo - but now you have 1000 other backwards-incompatible changes to fix because of transitive dependencies that also need to be upgraded to satisfy foo 1.2's dependencies.
I think we agree (and it's probably self evident) that it's hard but necessary work to try and get the boundaries right, and it requires a lot of refactoring before things stabilize. In the early stages of work, that refactoring is frequent and often deep. Later, it (usually) becomes infrequent and shallow.
I think it's important to separate internal dependencies from external ones. My personal advice is to treat external dependencies in whatever way the language prefers, and upgrade on a cadence. This is because you can't have any real impact on your external dependencies - even if they are critical, you can essentially treat them as a black box for terms of this conversation. For the rest of my response, lets assume we're talking internal dependencies.
The thing about breakage "for no reason" is that you are still broken, you just don't know it yet. One assumes the team that broke you had a reason. It might be a good or bad reason, from your point of view, but it wasn't no reason. When I talk about forcing the conversation, this is why. It's not better to hide from the changes, or pretend that you are safe. You aren't. All that happens is you move the time between when the breakage was introduced and when you discover it. Most frequently, that discovery happens when the upgrade becomes critical (security) - and by then the time to apply the change has gotten longer, and the team who made the breaking changes no longer clearly remembers the drift. This makes teams even less likely to move.
By ensuring these types of changes hurt, and are understood to be a shared responsibility (the consumer has a responsibility to move, the producer has a responsibility to understand and protect the stability of their consumers), teams have the impetus to design and build systems that ensure their stability. It's one thing to ask for things like circuit breakers, backwards compatible interfaces, etc. It's all theoretical from a single engineer's, or single team's, point of view. It's not a panacea, but when the contract is structured this way, everyone adapts to the issue: producers get more defensive, consumers get less debt.
Like I say in the original, I think this comes down to perspective. When my concern was primarily the efficiency of a single team, who was small enough to stay connected through conversation and shared understanding, it matters way less.
A lot of your reply comes from the perspective of wanting, as an engineer, to just Get Things Done again. I get it, and I'm sympathetic. It is harder to work this way, because you can't take the easy shortcuts (pinning, delaying the upgrade, ignoring your consumers, etc.) - but that's precisely the point. Those things are bad in the long term.
Actually Windows wasn’t a monorepo back then: there were separate repos for the shell, kernel, filesystem, etc. Hence the need for cross-repo tooling like “sdx”.
Source Depot was great (modulo availability issues), but I don’t think they got anywhere near the scale of Piper.
> Actually Windows wasn't a monorepo back then: there were separate repos for the shell, kernel, filesystem, etc. Hence the need for cross-repo tooling like "sdx".
This is a bit misleading to outsiders. Each of these repos was huge for the time, corresponded to a major subsystem with many disparate components, and the default tooling on the ground was the cross-repo tooling. One got the impression that if they could have pulled off one giant monorepo to rule them all, they would have, but they fell just short due to some technical details (cough spinning magnetic disks). In the meantime `sdx` was a convenient abstraction that allowed people to work in a monorepo way.
All in all it wasn't so different from present-day monorepos broken into git submodules for performance reasons.
Thank you so much for writing this. As someone who's worked in the best and worst of these two worlds, the productivity gains are absolutely insane and the limitations, as stated by the author, are no more painful than the limitations of federated/polyrepo code.
Fighting back against monorepo design is dangerous - embrace experimentation.
> Fighting back against monorepo design is dangerous
What's dangerous about it? Monorepos have a lot of benefits, and should absolutely be considered. Maybe even by most. But right now in the community it's almost pushed as the "only true way with all benefits and no drawbacks", and that's absolutely not true. To the point that the knowledge of why and how to run a polyrepo is already starting to get lost.
That's dangerous.
The real danger here is anyone talking about any system architectures or tooling as "dangerous" (or "not dangerous") absent any other context...
What do you even mean by "dangerous"? To a business? To your health?
What is the deal with people trying to make these sorts of global assertions in a vacuum about what's "good" and "bad"? This doesn't make any engineering sense in any way to me. You have a problem and you figure out the best way for your business to solve that problem given some bounded resources. Nothing in the basic problem solving process (scientific method?) necessitates all the arbitrary "should" axioms. Why don't people just analyze their specific situation and figure out a solution?
It's like people arguing vehemently about the optimal design that every company "should" be using for all windshields for all personal vehicles on the road, without even remotely discussing various vehicle body shapes and sizes.
Well just to present the other side, I don't really understand the prevalence of the "there's no one answer that fits for everybody" comment trope. You see a couple of comments like yours in every discussion like this. So no offense, but I'm going to rant about it for a few paragraphs.
If the "no one-size-fits-all" claim happens to be genuinely and axiomatically true for a particular engineering trade-off, then fine. There's no one correct displacement of an internal combustion engine. There's no one correct resolution of an LCD screen. Fine. It's demonstrably true that a trade space exists.
But a lot of times people seem to just throw up their hands and call it a trade space when really they just haven't reached a conclusion yet. "There's nothing inherently better or worse between Ubuntu and Windows, they're basically just ice cream flavors!" No! Maybe we haven't fully realized a more perfect operating system yet to settle the debate, but that doesn't just make it a meaningless question. It's perfectly possible for a system to be architected poorly given both the real world it has to interact in and the future world it makes possible. To say that this question is an unanswerable matter of taste is to be completely unimaginative about how good an operating system _COULD_ be. (See the death of operating system research and all that).
CVS is _worse_ than git. It just is. I don't want to hear this "well maybe if it fits your use case" mumbo jumbo. If you think that you have a unique snowflake reason that CVS is more appropriate than git, then you are almost certainly lying to yourself or misinformed.
And it's strict hierarchies like that that inspire these articles. There are a lot of technologies out there, and a lot of ideas, and most people don't know most of the things you need to know to come up with a good answer to what suits "their specific situation". So people like myself are looking for lessons learned and certain invariants that help them narrow the solution space. I have no idea whether a monorepo would work well for my organization, and if the only thing that your article has to contribute is "monorepos sometimes work for some people, but YMMV! Good luck!" then I have learned nothing. But if somebody thinks that they've learned a fundamental truth about the universe, then that could be useful to me. What's more, most people like me have a situation that _isn't_ that specific. We have to write some code, there's some ML shit in there, and some real-time critical stuff in there. Nothing mindblowing. _Most_ software shops shouldn't need something that is particularly bespoke. So coming in with the prior that everybody will have to do something unique to their organization is bizarre. There is so much commonality between what each software company does, in fact, that if a commonly used technology can be used by shop A but legitimately can't be used by shop B, there's a decent chance that this is a problem or limitation with the tech.
So who knows, maybe saying monorepos are _always_ better or _always_ worse really is too ambitious. But I don't think the concept that they _could_ be is a priori ridiculous. End this software relativism! Things can be made better! Yes, strictly better!
I'm of a mind that true understanding only really comes from questioning the most firmly held "universal truths." I want to know why I know what I know, I have no use for vague quality axioms that put black box abstractions on top of complex systems and processes.
> most people don't know most of the things you need to know to come up with a good answer to what suits "their specific situation".
And most people aren't competent software architects capable of adeptly steering an engineering team in the right choices to make. I'm not sure I understand the point here, or why you want to make a technical field like software engineering dumbed down to the point where "most people" can intuit the right decisions to make simply by asking HN what "the best thing" is.
This complaint belongs one level higher in the thread.
I'm currently working on a monorepo in our ~30 person engineering organization. We have a microservice architecture, and would like to avoid rebuilding the things that don't need to be rebuilt; however, it seems like most tooling assumes that the whole repo is your project. I've tried working around this by diffing against master or diffing against HEAD~1 or keeping a special tag that tracks the last good state, and all of these seem to spawn odd edge cases and generally are tedious when your build target is entirely contained in a single directory--no idea how to solve for build targets that depend on files elsewhere in the repo. A tool like Bazel seems really heavy for an organization our size (and unless I'm mistaken, it seems to assume that all of your code lives in your repository--pulling dependencies from Pypi seems like a fourth class citizen).
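For reference, the diff-based detection I keep reaching for looks roughly like this - a minimal sketch, assuming each service lives in its own top-level directory like services/<name> (that layout, and all the names, are just illustrative):

    # Minimal sketch: find service directories touched since the merge-base with master,
    # so only those get rebuilt. Assumes a hypothetical services/<name>/... layout.
    import subprocess
    from pathlib import PurePosixPath

    def changed_files(base_ref="origin/master"):
        merge_base = subprocess.run(
            ["git", "merge-base", base_ref, "HEAD"],
            capture_output=True, text=True, check=True).stdout.strip()
        diff = subprocess.run(
            ["git", "diff", "--name-only", merge_base, "HEAD"],
            capture_output=True, text=True, check=True)
        return diff.stdout.splitlines()

    def affected_services(files):
        services = set()
        for f in files:
            parts = PurePosixPath(f).parts
            if len(parts) >= 2 and parts[0] == "services":
                services.add(parts[1])
        return services

    if __name__ == "__main__":
        for svc in sorted(affected_services(changed_files())):
            print(svc)

The obvious gap is exactly the one mentioned above: build targets that depend on files elsewhere in the repo, which is where a real build graph (Bazel, Pants, etc.) earns its keep.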
Have you (or anyone reading this thread) encountered similar issues? How do you solve them in a monorepo?
> (and unless I'm mistaken, it seems to assume that all of your code lives in your repository--pulling dependencies from Pypi seems like a fourth class citizen).
My feelings here are independent of your tool of choice (Pypi), so read them with that in mind.
Why are you dependent on 3rd party code that isn't in your repo? I am a huge advocate of the monorepo and vendoring. Depending on your tooling of choice and your workflow, checks for updates on this third-party code should be frequent (security) and done by someone qualified (not a job for the "new guy").
The question is where this should start and end. The answer (for me) is everything, and I have elected to use less (and reduce complexity) to avoid bloat. Really, though, this is an artifact of my use of Git: https://unix.stackexchange.com/questions/233327/is-it-possib... --
> Why are you dependent on 3rd party code that isn't in your repo
Not the parent, but for us, the 3rd party code is in a private package manager (artifactory, private npm, whatever). Having thousands of libraries we didn't write in our repo doesn't sound like fun.
Why are you using thousands of libraries?
The joke would be that even the simplest nodejs app these days pull in hundreds of libraries because of left-pad style libs...
The more serious answer is that when you have hundreds/thousands of applications with as many use cases, countless products and teams, and generally just ship a lot of stuff, it adds up.
There aren't good monorepo solutions out there (yet). Git LFS is great for few large files, but it doesn't help with tons of smaller files. Git submodules are crap when it comes to usability, and have been for a long time, it's even mentioned in the famous Torvalds Git Talk.
Git has had a sparse checkout feature for a long time, but it only affects the checkout itself; all the blobs are still synced.
Now, Git is gaining good monorepo capabilities with the git partial clone feature [1]. The idea is that you can clone only the parts of a repository that are interesting to you. This has been brewing for a while already, but I'm not sure how ready it is. There doesn't seem to be user-level documentation for it yet, to my knowledge, so I am linking to the technical docs.
[1]: https://github.com/git/git/blob/master/Documentation/technic...
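For anyone wanting to try it, the rough shape is a blobless clone plus a sparse checkout. A sketch is below; the --filter and sparse-checkout flags are real git features, while the URL, paths, and branch name are placeholders:

    # Rough sketch: partial ("blobless") clone plus sparse checkout.
    # Requires a reasonably recent git; URL/paths/branch are placeholders.
    import subprocess

    def run(*cmd, cwd=None):
        subprocess.run(cmd, cwd=cwd, check=True)

    # Fetch commits and trees, but defer fetching file contents (blobs) until needed.
    run("git", "clone", "--filter=blob:none", "--no-checkout",
        "https://example.com/big-monorepo.git", "big-monorepo")

    # Only materialize the directories you actually work in.
    run("git", "sparse-checkout", "set", "services/payments", "libs/common",
        cwd="big-monorepo")
    run("git", "checkout", "master", cwd="big-monorepo")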
From earlier discussions around monorepos, I saw references that Google, Facebook, and other large monorepo orgs have been making use of Mercurial.
Yes, Facebook is Mercurial-based to my knowledge. Google is using its custom solution called Piper, I think: https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...
https://vfsforgit.org/ is another option here - MS-originated and Github is adopting it - https://venturebeat.com/2017/11/15/github-adopts-microsofts-...
A monorepo is a total win if you have something like https://github.com/Microsoft/VFSForGit (ex-GVFS) - i.e. any monorepo setup that overlays your changes, while the rest are simply file names with no actual contents.
You can certainly achieve this with Perforce, SVN, Hg, or any other repo system too.
Linux: FUSE + ?
Windows: Dokan? CBFS? Or the new fangled https://docs.microsoft.com/en-us/windows/desktop/projfs/proj... which VFSForGit uses
A monorepo could be a decent choice if your software stack does not require too many external dependencies. Or more precisely, if the ratio of your own code to third-party code is reasonably high.
Let me give a concrete example. The Android Open Source Project (AOSP), which builds the system for Android devices, has a code size in the tens of GB (let alone all the history!). It is already a massive monorepo in itself. And typically you would have many of them, from different OEM/SoC vendors and different major releases. In such a scenario, it would turn into 'a monorepo of monorepos,' which is quite unpleasant to imagine.
I think the root of the argument is that the two authors understand the definition of "large repository" very differently.
With 100 engineers a monorepo might seem a good idea. With 500 it becomes nearly impossible to do anything involving a build. Some isolation is needed.
Also, from my experience, many engineers just don't give a shit about architecture. They create an entangled mess that kind of works for the customer, and go home. Without some enforced isolation it is impossible to maintain it.
That being said I am more inclined to polyrepos.
the fact that essentially 100% of big tech companies use monorepos seems like evidence that it is at least possible to do it in a scalable way...
Definitely not 100%. It also has a lot less to do with company size, and more to do with when the company was created. Before git and similar tools came to be, managing a single repo was a pain, never mind hundreds or thousands of them. So (almost) everyone did it the way these big companies did.
Today, not quite. I work for a multi billion dollar tech company and we have several thousand repos (and it's awesome)
Not true. Google, Facebook, Goldman Sachs and JP Morgan are all companies that run monorepos and predate git by a long way.
Git cannot check out subdirectories, and it slows down dramatically as the number of branches grows. It's the opposite of what is needed to run a monorepo in a large company.
My wording must have been awful...because that's exactly what I was trying to say.
The big companies that predate git and such used monorepos because that was the norm at the time, and it was easier to do with the tools at the time, and as they scaled, they just scaled their process instead of changing everything. But several large tech companies, especially newer ones, do the multi repo approach.
Amazon does not use a monorepo, so you might want to rethink your "statistic".
AMZN doesn’t, unless things have changed drastically in the last 3 years.
Yeah, by writing custom version control software. Am I going to convince my company (which has like 50k software engineers) to do that? Probably not.
How many companies have 50k software engineers? Seems like the handful that do, should do whatever works for them. The rest of us can just use a monorepo.
> With 500 it becomes nearly impossible to do anything involving a build.
Both FB and Google have more than 500 devs and are using a monorepo.
At what cost? Both FB and Google employ hundreds of devs to work on internal tooling only. For most companies this isn't feasible.
It's actually the other way around. The cost of dealing with all the downsides of multiple version control systems is way higher than investing in tooling for the monorepo. This is mentioned in articles by Facebook and Google.
Source? Sounds like you're pulling that number out of thin air.
Isn't it reasonable to assume that FB/Google will do a cost analysis of mono/poly repo approaches and pick the one that is the most cost effective? At that scale they have absolutely no room for dogma; it's all about costs.
Afaik, fb employs less than 20 people to take care of the monorepo.
Why does the build not work at 500 engineers? Having 5 apps/services and their 25 libraries doesn't mean you have to build the 4 other apps during development just because you did a git pull. Libraries would still need to build their dependants in a CI system, whether with poly or mono repos.
The 5 apps and their 25 libraries are in fact one entangled monolith. That is what I am talking about. It is only marketed as 5 apps. That has some hundreds of external dependencies, of course.
I wonder if a star pattern would work, where you have a single, shared repo for all your libraries and a repo for each app.
This would help people working on smaller apps, since they don't need to look at other apps unless they're working on shared library code.
Of course, once you are working on library code, you have to build and test all the apps that use it. But even at Google, the people working on the lowest levels of the system can't use the standard tools anyway.
A star pattern still has most of the downsides of the multirepo approach. Specifically, it has the problem of needing a parallel versioning scheme (e.g. SemVer) on top of your individual repositories. This creates fragmentation, where different applications depend on different versions of the libraries, which ends up in dependency hell, technical debt, and CI hell.
An alternative would be to have a policy where all the app repos must use the same version (nobody can upgrade until they all upgrade). This makes things harder for the library maintainers, but no more than a monorepo.
I don't see why you'd need semver. The apps could sync to a particular commit in the library repo.
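Concretely, "sync to a particular commit" can be as small as a pinned SHA checked into each app repo plus a fetch step. A hypothetical sketch is below (a git submodule buys you roughly the same thing with more ceremony); the URL, paths, and file name are made up:

    # Hypothetical sketch: each app repo carries a libs.lock file holding the commit
    # SHA of the shared library repo it builds against. URL and paths are placeholders.
    import subprocess
    from pathlib import Path

    LIBS_REPO = "https://example.com/shared-libs.git"
    LIBS_DIR = Path("third_party/libs")

    def sync_libs(lockfile="libs.lock"):
        sha = Path(lockfile).read_text().strip()
        if not LIBS_DIR.exists():
            subprocess.run(["git", "clone", LIBS_REPO, str(LIBS_DIR)], check=True)
        subprocess.run(["git", "-C", str(LIBS_DIR), "fetch", "origin"], check=True)
        subprocess.run(["git", "-C", str(LIBS_DIR), "checkout", sha], check=True)

    if __name__ == "__main__":
        sync_libs()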
What you propose is just a fake monorepo, containing the global policy of allowed version of X, disguised as multiple subrepos. OP discusses this.
"Shared responsbility" is one of those ideas that sounds good on paper, but doesn't really scale terribly well in the real world. As the old saying goes "when everybody is responsible, nobody is responsible".
More to the point, as the author of TFA allows, once a system reaches a certain size, nobody can understand it all. At some point you have to engage division of labor /specialization, and once you do that, it doesn't make sense to have just anybody randomly making changes in parts of the code-base they don't normally work in.
I'd rather see a poly-repo approach, with a designated owner for discrete modules, but where anybody can clone any repo, make a proposed fix, and submit a PR. Basically "internal open source" or "inner source"[1].
In my experience, this is about as close as you can get to a "best of both worlds" situation. But, as the author of TFA also says, you absolutely can make either approach work.
That's what we did at my last workplace; 800-odd devs, 3000-odd repos, and a service we wrote to record and gather all the ownership metadata (which team owns this repo? where can I contact them? who should be on the code reviews? how do I page them? which project should tickets be created in?)
It was a gigantic pain trying to find owners for half-dead repos for services still running and in use, where the original authors had left years ago & from teams 4 or 5 restructures ago. The one thing I learned was: never make a user the owner of a repo (unless it is in their personal space), always find a team to accept responsibility for it.
> poly-repo approach, with a designated owner for discrete modules, but where anybody can clone any repo, make a proposed fix, and submit a PR
This is how it works at my company. The issue we run into is that PRs coming from non-core maintainers tend to either get over-scrutinized (e.g. "this diff may work for you but it's not generic enough for X/Y/Z") or flat out ignored at the code review stage and sometimes don't land in a timely enough manner.
Another challenge with this approach is when you have deeply nested dependencies and need to "propagate" an upgrade in some deep dep up the tree. In the JS/Node world, this usually means fixing an issue involves hacking on transpiled files in the node_modules folder of a project to figure out what change needs to be made, and then mirroring said change into the actual repo and then tweaking things until type checking/linting/CI pass. Not really conducive to collaboration.
One other problem is that security/bug fix rollouts are a bit more challenging. We had a case a while back where a crash-inducing bug was fixed and published but people still experienced crashes due to not having upgraded the one out of dozens of packages required by their projects.
> As the old saying goes "when everybody is responsible, nobody is responsible".
Here's my rule: You break it, you fix it.
> I'd rather see a poly-repo approach, with a designated owner for discrete modules, but where anybody can clone any repo, make a proposed fix, and submit a PR.
I'd rather see pairing, extensive tests and fast CI. I see PRs as a necessary evil, rather than a good thing in themselves. If I make a change that breaks other teams, I should fix it. If I can make a change to fix code anywhere in the codebase, I should write the test, write the fix and submit it.
Small, frequent commits with extensive testing creates a virtuous cycle. You pull frequently because there are small commits. You are less likely to get out of sync because of frequent pulls. You make small commits frequently because you want to avoid conflicts. Everyone moves a lot faster. I have had this exact experience and it is frankly glorious.
> You break it, you fix it.
I’ve seen this invoked so many times to shirk responsibility though. Someone piles up all kinds of crap in a tight little closet, complete with a bowling ball on top, and the next unsuspecting dev who comes by and opens it gets an avalanche of crap falling on them while the original author can be heard somewhere in the background saying “it’s not my problem.”
This winds up leading to more crap-stacking just to get the work done ASAP and you wind up with a mountain of tech debt.
I like the zero flaw principle where new feature work stops until all currently known flaws are fixed. Then everyone is forced to pitch in and responsibility is shared whether you want it or not.
> I’ve seen this invoked so many times to shirk responsibility though. Someone piles up all kinds of crap in a tight little closet, complete with a bowling ball on top, and the next unsuspecting dev who comes by and opens it gets an avalanche of crap falling on them while the original author can be heard somewhere in the background saying “it’s not my problem.”
I'm accustomed to collective ownership where, ideally, this never happens and in practice happens rarely (followed by the little closet being torn out and replaced).
> I like the zero flaw principle where new feature work stops until all currently known flaws are fixed.
I agree: stop the line. But I think it's orthogonal to the sins or virtues of n-repology.
I couldn't agree more!
Any good monorepo build tools out there? I've been thinking about this for the past few weeks. Considering creating a general-purpose monorepo tool chain and potentially a monorepo-first CI system.
Unfortunately some of the most popular CI/CD services out there (Travis, Circle, etc.) don't even support cross-repo pipelines, much less monorepo builds.
Pants and Bazel sound like the favorites.
Interesting, thanks! Didn't realize Bazel was open sourced..
Those both look way more in the weeds than what I would have imagined. I guess for Bazel at least it makes sense, given Google's scale, how fine-grained they would get with caching and incremental builds.
For my needs, a simple tool that would allow discovering "WORKSPACES" and constructing a build graph based on what's changed, while handing off the actual building to some entry point in the workspace, would be good enough. We have a weird collection of Gradle projects, node projects, test suites, docs, etc. with their own build processes already in place.
Some things are also on a "critical" path while others can run async given the context(branch, tag, etc)...
I'm rambling though.
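To make the rambling slightly more concrete, something like the sketch below is all I'm really after. It assumes each workspace is marked by a WORKSPACE file and exposes a ./build.sh entry point, both of which are made-up conventions:

    # Rough sketch: discover workspaces, work out which contain changed files, and
    # hand the actual building off to each workspace's own entry point.
    import subprocess
    from pathlib import Path

    ROOT = Path(".").resolve()

    def discover_workspaces():
        # Any directory containing a WORKSPACE marker file is a buildable unit.
        return sorted(p.parent for p in ROOT.rglob("WORKSPACE"))

    def changed_paths(base="origin/master"):
        out = subprocess.run(
            ["git", "diff", "--name-only", f"{base}...HEAD"],
            capture_output=True, text=True, check=True).stdout
        return [ROOT / line for line in out.splitlines()]

    def affected(workspaces, changes):
        return [ws for ws in workspaces
                if any(ws in change.parents for change in changes)]

    def build(workspace):
        # Each workspace keeps its existing Gradle/npm/etc. process behind build.sh.
        subprocess.run(["./build.sh"], cwd=workspace, check=True)

    if __name__ == "__main__":
        for ws in affected(discover_workspaces(), changed_paths()):
            build(ws)

It punts entirely on dependencies between workspaces and on the critical-path/async scheduling, which is exactly where tools like Bazel and Pants start to earn their complexity.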
Does anyone know how please (https://please.build) compares?
I agree, use a monorepo. I anxiously await MSFT's git megamonorepo functionality. Until then there's things like git meta[0].
Yet another chapter in one of the big flamewars. Seeing as I fall in the monorepo camp, I must say I mostly agree; also, I much prefer this tone for an article.
I find it enjoyable how plenty of comments both here and in the other discussion are of people saying "We had a mono/polyrepo and things improved tremendously when we migrated towards a poly/monorepo". The issue might be one of growth and complacency: a drastic change like that forces the team to face the technical debt that was being ignored and do a better implementation using what was learned from past mistakes.
>But I think Matt’s argument misses the #1 reason I’ve flipped quite hard to a monorepo perspective as my own level in the organization has gotten higher
Perhaps the fact that since their level was now higher, they wouldn't have to deal with the nitty gritty details and pain of working with a monorepo as a developer?
E.g. I wasn't for it when I was a dev, but now that I can just impose it on others, I love it. Same with how various 'development process' rituals are adopted...
For those using monorepos, what is your branch strategy? Say that 3 projects share a library, and release on different schedules. How does each project freeze shared library changes? Do you keep N version branches?
How does the library team know which consumers a commit may break? What tools are recommended?
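On the "which consumers may break" question: if the shared code is wired into a graph-aware build tool, a reverse-dependency query answers it directly. A sketch with Bazel follows; the //libs/shared:thing label is made up:

    # Sketch: ask Bazel which packages transitively depend on a shared library,
    # then build/test only those. The label below is a placeholder.
    import subprocess

    LIB = "//libs/shared:thing"

    def consumers(lib=LIB):
        out = subprocess.run(
            ["bazel", "query", f"rdeps(//..., {lib})", "--output", "package"],
            capture_output=True, text=True, check=True).stdout
        return sorted(set(out.splitlines()))

    if __name__ == "__main__":
        for pkg in consumers():
            print(pkg)

As for freezing, the monorepo shops I'm aware of tend toward trunk-based development, cutting per-project release branches (or using feature flags) rather than keeping N long-lived library version branches.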
As engineers we spend vast amounts of time in a constant search for a rival to the "tabs vs spaces" debate.
The more complicated answer is sometimes you should use a mono repo and other times you shouldn't.
This is starting to get a debate of "principles", like forcing A and B to talk, or forcing A and B to have more explicit boundaries, and so on. Guess where that ends (hint: it doesn't).
With a monorepo, the basic effort you have to put in to start scaling is quite high. To properly do a local build, you need bazel or something. But bazel doesn't stop at just building, but it manages dependencies all the way down to libraries and stuff. Let's say you're using certain maven plugins, like code coverage, shading, etc. Would bazel have all the build plugins your project needs? Most likely not. You have to backport a bunch of plugins from maven to bazel and so on. Guess how many IDEs support bazel? Not a lot.
Then you need to run a different kind of build farm. When you check in stuff to a monorepo, you need to split and distribute one single build. Compared to a polyrepo, where one build == one job, a monorepo is like one build == a distributed pool of jobs, which again needs very deep integration with the build tool (bazel again here) to fan out, fan in across multiple machines, aggregate artifacts, and so on.
Then the deployment. Same again. There is no "just works" hosted CI or hosted git or anything for monorepos. People still dabble with Concourse and so on.
And guess what, for a component in its own repo, you don't need to do anything. Existing industry and OSS tooling is built from ground up for that. Just go and use them.
To give a developer a "basic experience" of working on, building, and deploying a single component, the upfront investment you need to make with a monorepo is very high. Most companies cannot spend time on that, because scale means different things to different companies. There is a vast gap in the amount of ops/dev tooling you have for independent hosted components vs monorepo tools. Just search for "monorepo tools" or DAG and see how many you can come up with. So what really happens with a monorepo is, most companies go with multi-module maven and jenkins multi-job. The results are easy to predict. I'm not saying that maven/jenkins are bad, but they are _not_ sophisticated, and are not anywhere close to what Twitter/Facebook/Google or any modern company uses to deal with a monorepo (for a good reason). They are just not good at DAGs. If you're relying on maven+jenkins as your monorepo solution, all I can say is "good luck".
Instead, if you start by putting one component in one repo, you keep scaling for _much longer_ before you hit a barrier.
In principle, monorepos are better. In practice, they don't have the basic "table stakes" tooling that you need to get going. Maybe monorepo devops tooling is a next developer productivity startup space. But until then, it's not mainstream for very good reasons.
So... An article based on equating change recording medium with integration testing procedures.
There's a lot of discussion of Bazel and co in the sub-comments, but I have a question that isn't addressed:
How do the "global build tools" play with language specific build tools?
My primary stack is Rust and Scala. Both have excellent build capabilities in their native tools. How well do pants/bazel integrate with them? I wouldn't want to rewrite complex builds nor would I expect these tools to have 100% functionality of native ones.
Bazel has some level of support for many languages: https://docs.bazel.build/versions/master/be/overview.html#ad...
I know the Scala rules are used in production by multiple companies. Rust support is improving quickly, but it's not perfect. See the dedicated GitHub repositories for more information.
(I work on Bazel)
Please don't. It's just too slow and inefficient. Instead, use common open-source best practices of a shared library architecture. Problem solved! Putting everything into one repo is just a lack of organization and creates a huge mess.
I feel like you've really done no work supporting your argument there. "Slow and inefficient"... what, exactly, is slow and inefficient? Because there are plenty of things slow and inefficient about polyrepos.
I'd say that open-source best practices for shared libraries are appropriate if you're making an open-source shared library. However, these practices are inappropriate for internal libraries, proprietary libraries, and other use cases. In my experience, it's also far from "problem solved". You can point your finger at semantic versioning but in the meantime we go through hell and back with package managers trying to manage transitive library dependencies and it SUCKS. Why, for example, do you think people are fed up with NPM and created Yarn? Or why people constantly complain about Pip / Pipenv and the like? Why was the module system in Go 1.11 such a big deal? The answer is that it's hard to follow best practices for shared libraries, and even when you do follow best practices, you end up with mistakes or problems. These take engineering effort to solve. One of the solutions available is to use a monorepo, which doesn't magically solve all of your problems, it just solves certain problems while creating new problems. You have to weigh the pros and cons of the approaches.
In my experience, the many problems with polyrepos are mostly replaced with the relatively minor problems of VCS scalability and a poor branching story (mostly for long-running branches).
> However, these practices are inappropriate for internal libraries, proprietary libraries, and other use cases.
Why do you say so?
Basically because for certain projects and teams, the effort to package internal / proprietary libraries and other similar dependencies can be much larger than the benefit. Packaging is effort. You decide to cut a release, stamp a version number, write a changelog, package and distribute it, and then backport fixes into a long-running branch.
This effort makes a lot of sense if your consumers are complete strangers who work for other organizations. If your consumers are in the same organization, then there are easier ways to achieve similar benefits. See Conway’s Law. It’s not an accident that code structure reflects the structure of the organization that created it, I would claim that organizational boundaries should be reflected in code. Introducing additional boundaries between members of the same organization should not be done lightly.
One of the main benefits of version numbers is that it tells your consumers where the breaking changes are, but if you have direct access to your consumers’ code and can commit changes, review them, and run their CI tests, then you have something much better than version numbers. If you are running different versions of various dependencies you can potentially have a combinatoric explosion of configurations. Then there’s the specter of unknown breaking changes being introduced into libraries. It happens, you can’t avoid it without spending an unreasonable amount of engineering effort, but the monorepo does make the changes easier to detect (because you can more easily run tests on downstream dependencies before committing).
Cross-cutting changes are also much more likely for certain types of projects. These are difficult with polyrepos for obvious reasons (most notably, the fact that you can’t do atomic commits across repos).
Packaging systems also have administrative overhead. If you shove everything in a monorepo you can ditch the packaging system and spend the overhead elsewhere. These days it’s simple enough to shove everything in the same build system.
Various companies that I’ve worked for have experimented with treating internal libraries the same way that public libraries are treated—with releases and version numbers. Most of them abandoned the approach and reallocated the effort elsewhere. The only company that I worked for that continued to use internal versioning and packaging was severely dysfunctional. One startup I worked for went all in on the polyrepo approach and it was a goddamn nightmare of additional effort, even though there were only like three engineers.
I broadly agree with all of this, though I think it's possible to simplify the business of packaging and releasing with the right automation. But lowering the cost doesn't change the more important question of whether that cost is worth bearing.
> One of the main benefits of version numbers is that it tells your consumers where the breaking changes are, but if you have direct access to your consumers’ code and can commit changes, review them, and run their CI tests, then you have something much better than version numbers.
A small peeve of mine: Semver and version numbers generally are lossy compression. They try to squeeze a wide range of information into a very narrow space, for no other reason than tradition.
I really don't understand what you describe as effort or a huge burden. Writing a script that handles your release tasks is simple. IMO a lot of engineers just want to write code, but a lot of the time building software consists of other things too, such as testing, releasing, documentation, etc. Simply avoiding them doesn't make it better.
If you think that releasing comes down to a simple script then you and I have radically different ideas about what it means to release something.
I’m also completely baffled by your statement that “simply avoiding them doesn’t make it better.” Reading that statement, I can only feel that I have somehow failed to communicate something and I’m not really sure what, because it seems obvious to me why the premise of this statement is wrong. When you avoid performing a certain task, like releasing software, which costs some number of work hours, you can reallocate those work hours to other tasks. It’s not like the tasks of releasing and versioning simply stop happening, but you also get additional hours to accomplish other things which may be more valuable. So it’s never an issue of “simply avoiding” some task, at least on functional teams, the issue is choosing between alternatives.
And it should also be obvious that cutting discrete releases for internal dependencies is not an absolute requirement, but a choice that individual organizations make depending on how they see the tradeoffs or their particular culture.
There really are many different ways to develop software, and I’ve seen plenty of engineers get hired and completely fail to adapt to some methodology or culture that they’re not used to. The polyrepo approach with discrete releases cut with version numbers and changelogs is a very high visibility way of developing software and it works very well in the open source world, but for very good reasons many software companies choose not to adopt these practices internally. It’s very sad when I see otherwise talented engineers leave the team for reasons like this.
Too slow as in "to do it" or too slow as in "to use it"? In either case, I think if that were true there wouldn't be monorepos at Google, Facebook, and Microsoft. I will say it's true that it didn't come for free, e.g. Microsoft had to make GVFS due to the sheer enormity of their codebase, but that's already done and works pretty well.
I agree the shared library style makes more sense in most cases though. The main problem with it is forcing everyone to use the latest library versions, but that isn't insurmountable by any means.
My old boss was an engineering manager at Google in the 90s and early 2000s. He used to tell us that _everyone_ he interacted with at Google _hated_ the monorepo, and that Google’s in-house tooling did not actually produce anything approaching a sane developer experience. He used to laugh so cynically at stories or that big ACM article touting Google’s use of a monorepo (which was a historical unplanned accident based on toppling a poorly planned Perforce repository way back when), because in his mind, his experience with monorepos at Google was exactly why his engineering department (several hundred engineers) in my old company did not use a monorepo.
His experience from the 90s and early 2000s is meaningless in the current era. Version control and Google were in their infancy.
SVN was first released in 2000, Git in 2005. Branching, tagging and diffing were nowhere near what is possible now.
That goes back to desktops with disks smaller than a GB, CPUs in the tens of MHz, and networks so slow and unreliable, if you had one at all.
My understanding from many Google employees is that the properties of the system that caused problems in ~2000 - 2010 are largely still the same today: the canary node model of deployment, fixed small set of supported languages, code bloat, inability to delete code, bias towards feature toggles even when separate library dependency management would be better for the problem at hand, various firefighting when in-house monorepo tooling breaks, difficult on-boarding for people unfamiliar with that workflow, difficult recruiting for candidates who refuse to join if they have to work under the limits of a monorepo like that.
I work at one of the monorepo companies that you mention and there's some truth to the "too slow" part. Although it's been a lot better lately (largely due to the efforts of the internal version control dev teams), I've noticed at times in the past that you could do a '<insert vcs> pull', go on a 15 minute break, and it wouldn't be done by the time you're back.
Personally, I think there's a place for mono repos and there's a place for smaller independent repos. If a project is independent and decoupled from the rest of the tightly coupled code base (for instance, things which get open-sourced), it makes no sense to shove it into a huge monorepo.
I hate how these monorepo pieces gloss over the CI requirements. Just check out the code that's affected by the change? Either you have a shared build job that adds thousands of builds a day & matching a commit to a build takes ages, or you have a plethora of jobs for each subrepo and Jenkins eats all the disk space with stale workspaces. And let's not talk about how to efficiently clone a large repo... our big repo took 5 minutes to clone from scratch, which killed our target time of 10 minutes from push to test results. We ran git mirrors on our build nodes to have fresh git objects to shallow/reference clone from to get it down to 30 seconds, and the whole system had to work perfectly or else hundreds of devs would be blocked waiting to see if their changes could be merged.
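For the curious, the clone trick boiled down to something like this - a rough sketch, with the mirror path and upstream URL as placeholders:

    # Sketch of the build-node clone setup: a local mirror is kept fresh (e.g. by a
    # cron'd `git fetch`), and new clones borrow its objects instead of hitting the
    # network. Paths and URL are placeholders.
    import subprocess

    MIRROR = "/var/cache/git/big-repo.git"
    UPSTREAM = "https://example.com/big-repo.git"

    def reference_clone(workdir):
        # Fast full clone that borrows objects from the local mirror; the clone then
        # depends on the mirror sticking around, which is fine on a build node.
        subprocess.run(
            ["git", "clone", "--reference", MIRROR, UPSTREAM, workdir], check=True)

    def shallow_clone(workdir):
        # Alternative: a history-less clone, when the build only needs the tip.
        subprocess.run(
            ["git", "clone", "--depth", "1", UPSTREAM, workdir], check=True)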
Last time I worked at a massive monorepo, half of my team was running git fetch as a cron job. It was an extremely painful experience.
It would be quite remarkable if in-house corporate software, which face different constraints and challenges than open source software, turned out to nonetheless have exactly the same best practices.
The idea of using open source styled practices for internal development is not exactly new or remarkable. It's something people have been doing for a long time.