Moving all our Python code to a monorepo: pytendi

It is therefore trivial to use any existing code in another project. One possible downside to this approach is that it’s easy to reach into components’ internal implementation, which can result in an overly coupled codebase. We touch on how we tackle this later.

Developer experience

New code mostly builds on top of existing code. If such existing code is distributed as packages, as described in the previous section, it’s not trivial to quickly patch these dependencies. Say you’re calling a function money_printer from a package whose source is located elsewhere:
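Something along these lines (the import path and signature below are illustrative sketches, not the original snippet):

```python
# Consumer code in our project. money_printer comes from remote_package,
# which is published to our internal registry; its source lives in a
# separate repository. (Import path and signature are made up here.)
from remote_package import money_printer

def generate_invoice(amount_cents: int) -> str:
    # Suppose we hit a bug here and need a small change in money_printer itself.
    return money_printer(amount_cents)
```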

Suppose you want to change money_printer. You need to take the following steps to have those changes properly merged into your project:

  1. Install remote_package as editable (pip install -e) such that changes you make to it are directly visible without re-installing. Other options: directly edit the source code in venv/lib/pythonX.Y/site-packages/remote_package; or clone the original repo and import functionality from that folder.
  2. Make your changes and test.
  3. Upstream your changes by creating a pull request and merging it. Since the package lives in a different repo, it is also more likely to be maintained by another team, which can create extra friction and overhead.
  4. Bump remote_package’s version and publish the new version to your package registry.
  5. In your project, install the latest version of the package.

That’s a whole lot of steps. This process introduces delays and decreases productivity, especially for quick fixes or iterative development. What steps would we need to take in the monorepo setup?

  1. Make your changes and test.
  2. Create a pull request and merge.
  3. Pull the latest changes from main.

That’s only 60% of the steps! This immediate feedback loop accelerates development and reduces coordination overhead. While it may not seem like much, over time the added steps needed to make even minor changes to packages add up.

There are also downsides:

  • Your repo’s HEAD is your latest version, and there are no package versions you can use as “checkpoints”. That means consumers are in principle forced to upgrade whenever a dependency’s interface changes. This can actually make it harder to update dependencies, as whoever changes a dependency also has to ensure that its consumers are updated.

    One thing that makes this less of an issue in a monorepo is that we can use our IDE’s refactorings to be relatively sure that a change is applied to both the dependency and all of its consumers. We can for instance use the rename-function refactoring, which should (in theory) update all references to that function.

  • The ease with which implementations and interfaces can be changed might also tempt developers to change them too freely. If the development team is not disciplined, this can lead to more changes that break consumers. An important tool to combat this is a good testing suite.

Tradeoffs, tradeoffs, tradeoffs

Everything we do has tradeoffs, especially in software engineering. We mentioned some of them in the previous sections, and want to explore a couple more general pros and cons.

The good

  • It’s easier to enforce a consistent code style with a centralized point for documentation and coding guidelines. Linting rules and tests are applied in a consistent way as their configuration only has to be specified once.
  • We’re able to unify otherwise disparate CI pipelines. As an example, some of our infrastructure uses Azure Functions (the Azure equivalent of AWS Lambda), which have a specific deployment pipeline. Because we had multiple Azure Functions in different repos, pretty much the same deployment pipeline was replicated across them. This increases the maintenance burden and can result in deployments unexpectedly diverging when the pipelines are only updated in some repos. One downside is that the unified pipeline itself has become a bit more complex in order to deploy different projects. The same holds for linting and testing pipelines.
  • It’s easier to keep up to date with other developers’ work and progress, since all changes are located in one central git history. Forcing people not to work in isolation encourages more frequent peer/code review, so that issues in logic, documentation, coding style, or lack of clarity in a component are raised and addressed more quickly.
  • New devs have a single repository that contains all the code, tools and documentation they need to get started.

The bad

One call away

Every function is a single import away. While this makes reusability easier, it also lets developers reach into a module’s internals that they shouldn’t touch. If this happens, consumer code becomes coupled to that module’s internals, meaning that any change there can potentially break the consumer. This also puts pressure on the consumed module to avoid changing its implementation, leading to components that are harder to refactor. In other words, it’s easier for the codebase to devolve into a big ball of mud™.

As Python has no built-in features to make implementation details inaccessible, preventing this relies on developers being disciplined. Since that doesn’t scale, it requires either special tooling or structuring the codebase in a specific way.

Currently, we try to solve this by having each module export only a specific set of functions in its __init__.py. We only allow other modules to import functionality exported in another module’s __init__.py, which we can think of as the module’s interface. For example:
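A sketch of what this looks like, using our auto_punctuator component as an example (the internal module and function names are illustrative, and the exact path depends on the repo layout):

```python
# auto_punctuator/__init__.py
# Only what is re-exported here counts as the component's interface.
from .punctuate import punctuate_text  # illustrative internal module/function

__all__ = ["punctuate_text"]
```

Consumers then import from the package itself, never from its internals:

```python
from auto_punctuator import punctuate_text            # allowed: uses the interface

# from auto_punctuator.punctuate import _split_lines  # not allowed: reaches into internals
```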

This way, the consumed module is the one in control of which part of its implementation it exposes as its interface, so encapsulation and decoupling between modules are maintained. Currently, we don’t have tooling to enforce this at the code level, so for now it mostly comes down to trusting developers and code reviews.

As the codebase grows, tooling like language servers, type checkers, and linters can get slow. Google’s famous monorepo, google3, has tons of bespoke tooling to keep development productive at that scale. Unfortunately, we don’t have that amount of resources to invest in developer tooling (yet), so we can only hope this doesn’t become an issue too quickly. In the meantime, there are some quick wins to be had, like only running tests for the components that have changed, and their consumers.
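As a rough sketch of that first quick win, assuming the components/ layout described in the next section (the test locations and the changed-to-consumers mapping are simplified away here):

```python
# Sketch: run only the tests of components touched by the current branch.
# Resolving *consumers* of a changed component is left out; the Polylith
# tooling mentioned below can show each project's dependency tree.
import subprocess
import sys

def changed_components(base: str = "main") -> set[str]:
    """Return names of components with changes relative to `base`."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return {path.split("/")[1] for path in diff if path.startswith("components/")}

if __name__ == "__main__":
    components = sorted(changed_components())
    if not components:
        sys.exit(0)
    # Assumes each component's tests live under test/components/<name>.
    subprocess.run(["pytest", *(f"test/components/{c}" for c in components)], check=True)
```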

The ugly

As we will cover in a later section on structuring the monorepo, we have per-project pyproject.toml files (containing each project’s dependencies) and one master pyproject.toml containing all the projects’ dependencies. One issue with a master list of dependencies is that all modules and projects basically have to use the same versions of dependencies, kept in lockstep. The longer the list of dependencies grows, the higher the probability of dependency version conflicts.

While Python has one of the most vibrant language packaging ecosystems, its package management tooling leaves much to be desired. We won’t go into this too deeply here; much has been written about it already. The most glaring issues we expect are around dependency resolution and (supposedly) incompatible transitive dependencies.

We currently use Poetry as our package and project manager. Poetry might fail to install dependencies when it deems different packages incompatible with each other, while in practice they work together perfectly fine. A large part of this is caused by overspecified dependency constraints from package authors, which is a problem in the Python packaging ecosystem in general. See https://iscinumpy.dev/post/bound-version-constraints/#tldr for more on this.

Poetry doesn’t allow overriding a package’s transitive dependency constraints, and the maintainers don’t plan on adding this feature. We therefore hope that newer tooling like uv, which does support overriding dependency versions, will mature quickly enough to be usable in production.

Structuring the monorepo: Polylith

Once we’d made the decision to move to a monorepo, we needed to decide how to structure it. We had the following requirements:

  • Reusing code needs to be painless, from any part of the codebase. All of us have encountered the dreaded ImportError: attempted relative import with no known parent package error.
  • It should be clear to each developer what code they can find where, and modularity should be incentivized.

Luckily, some people had already thought about this. After some digging around, we came across the Polylith architecture. Originating from the Clojure community:

Polylith is a software architecture that applies functional thinking at the system scale. It helps us build simple, maintainable, testable, and scalable backend systems. (official docs)

I will only give a brief introduction here; I encourage the interested reader to read more about it on its website. David Vujic, who ported some of the tooling to Python, has also written an excellent blog post about it. To quote him:

A Polylith code-base is structured in a components-first architecture. Similar to LEGO, components are building blocks. A component can be shared across apps, tools, libraries, serverless functions and services. The components live in the same repository; a Polylith monorepo. The Polylith architecture is becoming popular in the Clojure community.

The Polylith architecture has four main concepts, each of which maps to a top-level folder:

  1. Components: Small building blocks consisting of actual business logic. These can be combined like lego bricks to build more complicated functionality. Importantly, they shouldn’t contain application or infrastructure-level concerns. This keeps them reusable. Components expose their interface in their __init__.py.
  2. Bases: Entrypoints or gateways into your business logic (components). A base can for example be an API or a script. The base should contain as little business logic as possible, delegating the actual implementation to components. That way, components can be reused and business logic can be implemented without being coupled to a specific application. For instance, if you want to build both an API and a CLI tool that mostly contain the same functionality, they can both call the same components.
  3. Projects: Represent all information necessary to build a deployable artifact, combining one or more bases (usually one). For instance, if you’re deploying an API using Docker, your Dockerfile would be here. Project-specific dependencies are located in a project-specific pyproject.toml. This way, infrastructure and application-level concerns are clearly separated.
  4. Development: Code used for development. Can use all dependencies specified in the main pyproject.toml.

As an example, we have a project speech_to_text_postprocessing_app, which uses a base with the same name, which in turn uses components such as text_transforms and auto_punctuator.
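In code, such a base stays thin and only wires components together, roughly like this (the function names are illustrative, not our actual interfaces):

```python
# bases/speech_to_text_postprocessing_app -- illustrative sketch of a thin base.
from text_transforms import normalize_text   # hypothetical component interface
from auto_punctuator import punctuate_text   # hypothetical component interface

def postprocess_transcript(raw_transcript: str) -> str:
    # The base holds no business logic itself; it only delegates to components.
    return punctuate_text(normalize_text(raw_transcript))
```

The speech_to_text_postprocessing_app project then only adds deployment concerns on top of this base, such as its Dockerfile and project-specific pyproject.toml.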

The good

  • It’s very easy to reuse code: imports simply work.
  • Developing and iterating are greatly accelerated. The pyproject.toml in the root of the repo contains all dependencies, so we can easily create a notebook and have instant access to all code in the codebase and its dependencies. This is a powerful feeling, as we don’t have to waste time setting up new packages or downloading them from a package registry.
  • David Vujic has put in the effort to create tooling that makes working with Polylith in Python easy. His tooling is distributed as, for instance, a Poetry plugin. It simplifies the creation of new bases, components, and projects, and can show the dependency tree for each project.
  • The lego-brick philosophy is quite natural and simple to work with, promoting modularity.
  • The forced structure ensures consistency across the repo.

The bad

The Polylith architecture was originally designed to build backend systems. In our case, while we have some backend applications, a lot of the code we write is for research purposes. Different projects might need conflicting versions of libraries such as PyTorch. This is hard to accommodate with a shared pyproject.toml file. We’re still looking for better ways to deal with this.

After some discussion about this, David suggested the following:

A temporary solution to work around it is excluding some components from the root pyproject.toml, letting the individual project use the one that differs. If there are dependencies that are incompatible and need to be that permanently, then it could be a use case for having the specific project in a separate repo with its own dependencies. If that is needed for a project, then it is a simple task to extract it from the Monorepo.

With Polylith, we also had some difficulty keeping bases thin, with business logic in components. Some business logic will only ever be used by a single base. For instance, we have a base priority_queue_functionapp, whose implementation is located in the priority_queue component. This component exports functions that are only used by priority_queue_functionapp. Since these two modules are coupled already, moving the implementation to the component feels a bit artificial and unnecessary.

The migration

After we decided to move forward with Polylith, we started migrating our existing code to our new monorepo ✨ pytendi ✨. The basic process was to copy all the application code to a new base, factoring out (reusable) implementation logic into components. We also took the opportunity to add tests to some legacy code and improve its structure. In total, the migration took around 3 months, most of which was spent refactoring. I’m happy to say that we’ve migrated all of our production applications successfully!

In the meantime, we’ve onboarded a machine learning engineer and a data engineer, both of whom were able to start contributing quickly. While it’s hard to measure properly, I do feel the monorepo has helped them find their way around the codebase more quickly, and has increased developer productivity overall, my own included.

Of course, there are not only upsides. On the flip side:

  • For devs who have never worked with Polylith, the difference between projects, bases, and components is not always immediately clear. This led to some initial confusion about what should go where.
  • Finding a set of dependencies that work well together can sometimes be tricky as certain projects require specific and obscure versions of dependencies that do not work well with others.
  • Such a big repo can be a bit overwhelming, especially for new joiners. Enforcing strict code standards, especially on PRs, is a must to prevent tech debt being buried deep inside the repo, where future developers will probably find it intimidating to take on without fresh context of all the surrounding components.

Onwards

There’s still a lot to learn and questions to be asked. We’ll have to figure out how to keep the monorepo organized as more code is added and more devs work with it, and how we’ll handle potential scaling issues. A first step in this process will be creating a style guide, so the code itself becomes more consistent. In the future, we’ll consider migrating to a package manager like uv that hopefully has more options for dealing with conflicting dependency constraints. Overall, we’re happy with the migration and the tradeoffs it presents, and are excited to continue evolving our codebase to support our growth.

Special thanks to David Vujic for creating the Python Polylith tools and for the discussion.

About Attendi

Attendi is an Amsterdam-based startup focussed on bringing best-in-class speech-to-text to healthcare professionals, optimized specifically for healthcare. We’re always looking for good engineers! Interested in what we’re doing? Feel free to drop us an email at omar[at]attendi.nl