lea
lea is a minimalist alternative to SQL orchestrators like dbt and SQLMesh.
lea aims to be simple and provides sane defaults. We happily use it every day at Carbonfact to manage our BigQuery data warehouse. We will actively maintain it and add features, while welcoming contributions.
Examples
Installation
Use one of the following commands, depending on which warehouse you wish to use:
This installs the lea command. It also makes the lea Python library available.
Configuration
lea is configured via environment variables.
BigQuery
# Required
LEA_WAREHOUSE=bigquery
# Required
LEA_BQ_LOCATION=EU
# Required
LEA_BQ_DATASET_NAME=kaya
# Required, the project where the dataset is located
LEA_BQ_PROJECT_ID=carbonfact-dwh
# Optional, allows using a different project for compute
LEA_BQ_COMPUTE_PROJECT_ID=carbonfact-dwh-compute
# Not necessary if you're logged in with the gcloud CLI
LEA_BQ_SERVICE_ACCOUNT=<JSON dump of the service account file>  # not a path ⚠️
# Defaults to https://www.googleapis.com/auth/bigquery
LEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive
# LOGICAL or PHYSICAL, defaults to PHYSICAL
LEA_BQ_STORAGE_BILLING_MODEL=PHYSICAL
DuckDB
# Required
LEA_WAREHOUSE=duckdb
# Required
LEA_DUCKDB_PATH=duckdb.db
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
MotherDuck
# Required
LEA_WAREHOUSE=motherduck
# Required
MOTHERDUCK_TOKEN=<get this from https://app.motherduck.com/settings/tokens>
# Required
LEA_MOTHERDUCK_DATABASE=bike_sharing
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
DuckLake
# Required
LEA_WAREHOUSE=ducklake
# Required
LEA_DUCKLAKE_DATA_PATH=gcs://bike-sharing-analytics
# Required
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
# Optional
LEA_DUCKLAKE_S3_ENDPOINT=storage.googleapis.com
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
DuckLake needs a database to manage metadata, which is what LEA_DUCKLAKE_CATALOG_DATABASE is for.
Usage
These parameters can be provided in an .env file, or directly in the shell. Each command also has an --env flag to provide a path to an .env file.
lea run
This is the main command. It runs the SQL queries stored in the scripts directory:
lea run
You can indicate the directory where the scripts are stored:
lea run --scripts /path/to/scripts
The scripts are run concurrently. They are organized in a DAG, which is traversed in topological order. The DAG's structure is determined automatically by analyzing the dependencies between queries.
File structure
Each script is expected to be placed under a schema, represented by a directory. Schemas can have sub-schemas. Here's an example:
scripts/
schema_1/
table_1.sql
table_2.sql
schema_2/
table_3.sql
table_4.sql
sub_schema_2_1/
table_5.sql
table_6.sql
Each script is materialized into a table. The table is named according to the script's name, following the warehouse convention.
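For example, here is what a script under the structure above might look like. This is a sketch using the placeholder names from the file tree, with a made-up order_date column; the point is that lea reads the FROM clause to figure out that this script depends on another one:

-- scripts/schema_2/table_4.sql
SELECT
    order_date,
    COUNT(*) AS n_orders
FROM schema_1.table_1  -- this reference is what creates the edge in the DAG
GROUP BY order_date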
A script may contain multiple SQL statements, separated by semicolons. The last statement has to be a SELECT statement, while the previous ones are expected to be procedural statements (e.g. DECLARE and SET in BigQuery). The way it works under the hood is by creating a session for the script, and running all statements within the same session.
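For example, a BigQuery script could look like this. This is a sketch; the cutoff_date variable and the staging.users table are made up for illustration:

-- Procedural statements run first, within the same session as the final SELECT
DECLARE cutoff_date DATE;
SET cutoff_date = DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- The last statement must be a SELECT; its result is what gets materialized
SELECT
    user_id,
    created_at
FROM staging.users
WHERE created_at >= cutoff_date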
Jinja templating
SQL queries can be templated with Jinja. A .sql.jinja extension is necessary for lea to recognise them.
You have access to an env variable within the template context, which is simply an access point to os.environ.
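For example, a script named users.sql.jinja could interpolate an environment variable into the query. This is a sketch; the COUNTRY variable is made up, and core.users is just the example table used throughout this README:

SELECT
    user_id,
    country
FROM core.users
-- env maps to os.environ, so any environment variable can be used here
WHERE country = '{{ env["COUNTRY"] }}'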
Development vs. production
By default, lea creates an isolation layer with production. The way this is done depends on your warehouse:
- BigQuery: by appending a _<user> suffix to schema names
- DuckDB: by adding a _<user> suffix to the database file name
In other words, a development environment is used by default. Use the --production flag when executing lea run to disable this behaviour, and instead target the production environment.
The <user> is determined automatically from the login name. It can be overridden by setting the LEA_USERNAME environment variable.
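For example, to force the suffix regardless of the login name (the value max is just an example):

LEA_USERNAME=max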
Selecting scripts
A single script can be run:
lea run --select core.users
Several scripts can be run:
lea run --select core.users --select core.orders
Similar to dbt, lea also supports graph operators:
lea run --select core.users+   # users and everything that depends on it
lea run --select +core.users   # users and everything it depends on
lea run --select +core.users+  # users and all its ancestors and descendants
You can select all scripts in a schema:
lea run --select core/  # the trailing slash matters
This also works with sub-schemas:
lea run --select analytics.finance/
There are thus 8 possible operators:
schema.table (table by itself)
schema.table+ (table with its descendants)
+schema.table (table with its ancestors)
+schema.table+ (table with its ancestors and descendants)
schema/ (all tables in schema)
schema/+ (all tables in schema with their descendants)
+schema/ (all tables in schema with their ancestors)
+schema/+ (all tables in schema with their ancestors and descendants)
Combinations are possible:
lea run --select core.users+ --select +core.orders
There's an Easter egg that allows choosing scripts that have been committed or modified in the current Git branch:
lea run --select git
lea run --select git+  # includes all descendants
This becomes very handy when using lea in continuous integration.
Write-Audit-Publish (WAP)
WAP is a data engineering pattern that ensures data consistency and reliability. It's the data engineering equivalent of blue-green deployment in the software engineering world.
lea follows the WAP pattern by default. When you execute lea run, it actually creates temporary tables that have an ___audit suffix. These audit tables are promoted to replace the existing tables once they have all been materialized without errors.
This is a good default behavior. Let's say you refresh table foo. Then you refresh table bar that depends on foo. If the refresh of bar fails, you're left with a corrupt state. This is what the WAP pattern solves. In WAP mode, when you run foo's script, it creates a foo___audit table. If bar's script fails, then the run stops and foo is not modified.
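Since audit tables are ordinary tables, you can inspect the pending data before it gets promoted. A minimal sketch, reusing the foo example above (the row count query is just an illustration):

-- foo___audit holds the freshly materialized data; foo still holds the previous version
SELECT COUNT(*) AS n_rows
FROM foo___audit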
Testing while running
There is no lea test command. Tests are run together with the regular scripts when lea run is executed. The run stops whenever a test fails.
There are two types of tests:
- Singular tests — these are queries which return failing rows. They are stored in a tests directory. A sketch of one follows this list.
- Assertion tests — these are comment annotations in the queries themselves:
  - #NO_NULLS — checks that all values in a column are not null.
  - #UNIQUE — checks that a column's values are unique.
  - #UNIQUE_BY(<by>) — checks that a column's values are unique within a group.
  - #SET{<elements>} — checks that a column's values are in a set of values.
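Here's what a singular test could look like. This is a sketch: the file name tests/check_n_users.sql matches the example command further down, and core.users is the table used throughout this README. Rows returned by the query are the failing rows, so the test passes when the query returns nothing:

-- tests/check_n_users.sql
-- Returns a row (and therefore fails) only when core.users is empty
SELECT COUNT(*) AS n_users
FROM core.users
HAVING COUNT(*) = 0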
Here's an example of a query annotated with assertion tests:
SELECT
    -- #UNIQUE
    -- #NO_NULLS
    user_id,
    -- #NO_NULLS
    address,
    -- #UNIQUE_BY(address)
    full_name,
    -- #SET{'A', 'B', 'AB', 'O'}
    blood_type
FROM core.users
You can run a single test via the --select flag:
lea run --select tests.check_n_users
Or even run all the tests, like so:
lea run --select tests/  # the trailing slash matters
☝️ When you run a script that is not a test, all the applicable tests are run as well. For instance, the following command will run the core.users script and all the tests that are applicable to it:
lea run --select core.users
You may decide to run all scripts without executing tests, which is obviously not advisable:
lea run --unselect tests/
lea run --select core.users --unselect tests/
Skipping unmodified scripts during development
When you call lea run, it generates audit tables, which are then promoted to replace the original tables. This is done to ensure that the data is consistent and reliable. lea doesn't run a script when its audit table already exists and the script hasn't been modified since that audit table was created. This avoids unnecessary re-runs of scripts that haven't changed.
For instance:
- You execute lea run to sync all tables from sources; no errors, all tables are materialized.
- You modify a script named core/expenses.sql, which depends on staging/customers.sql and staging/orders.sql.
- You execute lea run --select core.expenses+ to run all the impacted tables again.
- core__expenses___audit is materialized in your data warehouse, but the -- #NO_NULLS assertion test on a column fails.
- After reviewing the data in core__expenses___audit, you edit and fix core/expenses.sql to filter out the rows where NULLs appear.
- You execute lea run.
- The staging/customers.sql and staging/orders.sql scripts are skipped because they were modified before staging__customers and staging__orders were last materialized.
- The core/expenses.sql script is run because it was modified after core__expenses was last materialized.
- All audit tables are wiped from the database, as the whole DAG has run successfully! 🎉
You can disable this behavior altogether:
Warehouse specific features
BigQuery
Clustering
Default clustering
At Carbonfact, we cluster most of our tables by customer. This is done to optimize query performance and reduce costs. lea allows you to automatically cluster tables that contain a given field:
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug
You can also specify multiple fields, meaning that tables which contain both fields will be clustered:
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug,brand_slug
For each table, lea will use the clustering fields it can and ignore the others. With the previous configuration, if your table defines account_slug and not brand_slug, it will cluster by account_slug.
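For example (a sketch with made-up column names), a script that selects account_slug but not brand_slug would produce a table clustered by account_slug only:

SELECT
    account_slug,    -- matches a default clustering field, so the table is clustered by it
    spend_kg_co2e    -- made-up metric column; brand_slug isn't present, so it's simply ignored
FROM my_table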
Table specific clustering
You can also define clustering fields for a specific table:
SELECT
account_slug,
-- #CLUSTERING_FIELD
object_kind,
value
FROM my_table
If you define specific clustering fields for a table, they will be added in addition to the default ones. This is done to ensure that project-wide clustering fields are kept up to date in each table.
Big Blue Pick API
Big Blue is a SaaS product to monitor and optimize BigQuery costs. As part of their offering, they provide a Pick API. The idea is that some queries should be run on-demand, while others should be run on a reservation. Big Blue's Pick API suggests which billing model to use for each query.
We use this at Carbonfact, and so this API is available out of the box in lea. You can enable it by setting the following environment variables:
LEA_BQ_BIG_BLUE_PICK_API_KEY=<get this from https://your-company.biq.blue/settings.html>
LEA_BQ_BIG_BLUE_PICK_API_URL=https://pick.biq.blue
LEA_BQ_BIG_BLUE_PICK_API_ON_DEMAND_PROJECT_ID=on-demand-compute-project-id
LEA_BQ_BIG_BLUE_PICK_API_REVERVATION_PROJECT_ID=reservation-compute-project-id
Contributing
Feel free to reach out to max@carbonfact.com if you want to know more and/or contribute 😊
We have suggested some issues as good places to get started.
License
lea is free and open-source software licensed under the Apache License, Version 2.0.