Part 1. Pains of big modular Python projects
This series of blog posts describes the journey of devex, development, and packaging tooling in Apache Airflow®, in the context of building a huge project with multiple components. Each of those components is largely independent and can be worked on separately, but they also have to work together, be tested together, and sometimes you want to make changes across many, or even all, of them at once.
This series consists of 4 parts:
- Part 1. Pains of big modular Python projects (this part)
- Part 2. Modern Python packaging standards and tools for monorepos
- Part 3. Monorepo on steroids — modular prek hooks
- Part 4. Shared “static” libraries in Airflow monorepo
Let’s dive straight into part 1.
Challenges of huge modular projects
Traditionally in Python, and in other languages, you could attempt to solve this by keeping separate repositories and treating each component as an independent project developed on its own. While this helps when you (or your team) want to work on components independently, it creates a lot of problems when it comes to bringing those components together: the integration effort required for separately developed components is often huge, sometimes almost insurmountable.
On the other hand, keeping everything in one repo and one source tree has its own challenges: code and abstractions leak between components, components implicitly start depending on each other, and you often end up with spaghetti code that cuts across all of them. Quickly you stop understanding what is going on, because your logic is spread across the whole repo. When components depend on each other in implicit ways, installing different versions of them at the same time becomes simply impossible, and you end up with a multi-component setup that pretends to be modularised but is in fact a giant monolith.
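To make the "leaky abstraction" concrete, here is a minimal, hypothetical Python sketch (all names invented, not taken from Airflow) of the kind of implicit coupling described above: one component quietly importing another component's private internals, so the two can no longer be versioned or released apart.

```python
# Hypothetical sketch: two "components" in one source tree, with component_a
# reaching into component_b's private internals.
import sys
import tempfile
import textwrap
from pathlib import Path

src = Path(tempfile.mkdtemp())

# component_b exposes a public API but also has an "internal" helper.
(src / "component_b").mkdir()
(src / "component_b" / "__init__.py").write_text(textwrap.dedent("""
    def public_api(x):
        return _internal_helper(x)

    def _internal_helper(x):  # private by convention only
        return x * 2
"""))

# component_a bypasses the public API and imports the private helper directly:
# the implicit dependency that turns a "modular" setup into a de-facto monolith.
(src / "component_a").mkdir()
(src / "component_a" / "__init__.py").write_text(textwrap.dedent("""
    from component_b import _internal_helper  # leaky abstraction!

    def compute(x):
        return _internal_helper(x) + 1
"""))

sys.path.insert(0, str(src))
import component_a  # noqa: E402

print(component_a.compute(10))  # 21 - works, but only with this exact component_b
```

The code runs, but any refactoring of `component_b`'s internals silently breaks `component_a`, which is exactly why such dependencies must be made explicit.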
Can we have our cake and eat it too? Let’s find out by looking at the journey of Apache Airflow, where we have always followed the monorepo approach and developed custom ways of handling independent components. Thanks to recent improvements in Python tooling, it became actually **easy** to eat the cake and still have it: a truly modular application kept in a single monorepo, where you can work on either the parts or the whole of it with equal ease, and integration is simply embedded in your daily work, so you do not pay a separate price for it.
The Monorepo Challenge in Airflow
Managing the CI/DEV environment for Apache Airflow has been my focus for the last five years. Airflow is a decade-old, colossal project. Its sheer size is daunting, featuring over 700 dependencies — a number that often makes Python developers uneasy. While it began as a monolith, we undertook a significant effort in 2020 with Airflow 2 to meticulously separate it into approximately 60 distinct distributions, encompassing Airflow itself and its various providers.
But as the project grew, so did the pain. We’re now releasing close to 100 distributions, often twice a month, and the need to further modularize the “Airflow core” became evident. The goals were clear: reduce complexity, improve maintainer and contributor experience, and embrace a more modern, scalable approach. The community aspect is crucial here: a monorepo offers a unified development experience and shared infrastructure that’s hard to replicate with disparate repositories for a project of Airflow’s scale. At the same time, with projects of this magnitude, people usually focus on a small part of the whole, so being able to modularise and separate the project in a way that lets you laser-focus on one particular part of Airflow, while keeping everything in sync, is crucial.
If you look at our PyPI repository — you will find that we have 146 projects (distributions) now.
And all of them come from the single repository: https://github.com/apache/airflow
We build and release those distributions regularly, from different branches, and we need to make sure that they are isolated but also that they work together when we install all of them in a single virtual environment.
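One standard Python mechanism that makes this kind of co-installation possible is PEP 420 implicit namespace packages, which let separately released distributions contribute subpackages to one shared package namespace. The sketch below simulates two such distributions in-process; all names are invented for illustration, and this is not Airflow's actual layout.

```python
# Sketch of PEP 420 implicit namespace packages: two independently-built
# "distributions" (simulated as two sys.path entries) sharing one namespace.
import sys
import tempfile
from pathlib import Path

# Simulate two separately-installed distributions, each shipping its own
# portion of a shared "myproject.providers" namespace.
dist_a = Path(tempfile.mkdtemp())
dist_b = Path(tempfile.mkdtemp())

for root, provider, greeting in [
    (dist_a, "google", "hello from google provider"),
    (dist_b, "amazon", "hello from amazon provider"),
]:
    pkg = root / "myproject" / "providers" / provider
    pkg.mkdir(parents=True)
    # Deliberately no __init__.py in "myproject" or "providers": that is what
    # makes them implicit namespace packages spanning multiple directories.
    (pkg / "__init__.py").write_text(f"GREETING = {greeting!r}\n")

sys.path[:0] = [str(dist_a), str(dist_b)]

from myproject.providers.google import GREETING as g  # noqa: E402
from myproject.providers.amazon import GREETING as a  # noqa: E402

print(g)  # hello from google provider
print(a)  # hello from amazon provider
```

Because the namespace is merged at import time, each portion can be released, versioned, and installed on its own, yet all of them resolve under one package tree inside a single virtual environment.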
To find out how we are solving these challenges, head to the next parts:
* Part 2. Modern Python packaging standards and tools for monorepos
* Part 3. Monorepo on steroids — modular prek hooks
* Part 4. Shared “static” libraries in Airflow monorepo