lea
lea is a minimalist alternative to SQL orchestrators like dbt and SQLMesh.
lea aims to be simple and provides sane defaults. We happily use it every day at Carbonfact to manage our BigQuery data warehouse. We will actively maintain it and add features, while welcoming contributions.
Examples
Installation
Use one of the following commands, depending on which warehouse you wish to use:
This installs the lea command. It also makes the lea Python library available.
Configuration
lea is configured via environment variables.
BigQuery
# Required
LEA_WAREHOUSE=bigquery
# Required
LEA_BQ_LOCATION=EU
# Required
LEA_BQ_DATASET_NAME=kaya
# Required, the project where the dataset is located
LEA_BQ_PROJECT_ID=carbonfact-dwh
# Optional, allows using a different project for compute
LEA_BQ_COMPUTE_PROJECT_ID=carbonfact-dwh-compute
# Not necessary if you're logged in with the gcloud CLI
LEA_BQ_SERVICE_ACCOUNT=<JSON dump of the service account file>  # not a path ⚠️
# Defaults to https://www.googleapis.com/auth/bigquery
LEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive
# LOGICAL or PHYSICAL, defaults to PHYSICAL
LEA_BQ_STORAGE_BILLING_MODEL=PHYSICAL
DuckDB
# Required
LEA_WAREHOUSE=duckdb
# Required
LEA_DUCKDB_PATH=duckdb.db
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
MotherDuck
# Required
LEA_WAREHOUSE=motherduck
# Required
MOTHERDUCK_TOKEN=<get this from https://app.motherduck.com/settings/tokens>
# Required
LEA_MOTHERDUCK_DATABASE=bike_sharing
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
DuckLake
# Required
LEA_WAREHOUSE=ducklake
# Required
LEA_DUCKLAKE_DATA_PATH=gcs://bike-sharing-analytics
# Required
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
# Optional
LEA_DUCKLAKE_S3_ENDPOINT=storage.googleapis.com
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
DuckLake needs a database to manage metadata, which is what LEA_DUCKLAKE_CATALOG_DATABASE is for.
Usage
These parameters can be provided in an .env file, or directly in the shell. Each command also has an --env flag to provide a path to an .env file.
lea run
This is the main command. It runs the SQL queries stored in the scripts directory:
lea run
You can indicate the directory where the scripts are stored:
lea run --scripts /path/to/scripts
The scripts are run concurrently. They are organized in a DAG, which is traversed in topological order. The DAG's structure is determined automatically by analyzing the dependencies between queries.
File structure
Each script is expected to be placed under a schema, represented by a directory. Schemas can have sub-schemas. Here's an example:
scripts/
schema_1/
table_1.sql
table_2.sql
schema_2/
table_3.sql
table_4.sql
sub_schema_2_1/
table_5.sql
table_6.sql
Each script is materialized into a table. The table is named according to the script's name, following the warehouse convention.
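For example, here is what a script under the structure above might look like. This is a sketch using the placeholder names from the file tree, with a made-up order_date column; the point is that lea reads the FROM clause to figure out that this script depends on another one:

-- scripts/schema_2/table_4.sql
SELECT
    order_date,
    COUNT(*) AS n_orders
FROM schema_1.table_1  -- this reference is what creates the edge in the DAG
GROUP BY order_date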
A script may contain multiple SQL statements, separated by semicolons. The last statement has to be a SELECT statement, while the previous ones are expected to be procedural statements (e.g. DECLARE and SET in BigQuery). The way it works under the hood is by creating a session for the script, and running all statements within the same session.
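For example, a BigQuery script could look like this. This is a sketch; the cutoff_date variable and the staging.users table are made up for illustration:

-- Procedural statements run first, within the same session as the final SELECT
DECLARE cutoff_date DATE;
SET cutoff_date = DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- The last statement must be a SELECT; its result is what gets materialized
SELECT
    user_id,
    created_at
FROM staging.users
WHERE created_at >= cutoff_date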
Jinja templating
SQL queries can be templated with Jinja. A .sql.jinja extension is necessary for lea to recognise them.
You have access to an env variable within the template context, which is simply an access point to os.environ.
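For example, a script named users.sql.jinja could interpolate an environment variable into the query. This is a sketch; the COUNTRY variable is made up, and core.users is just the example table used throughout this README:

SELECT
    user_id,
    country
FROM core.users
-- env maps to os.environ, so any environment variable can be used here
WHERE country = '{{ env["COUNTRY"] }}'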
Development vs. production
By default, lea creates an isolation layer with production. The way this is done depends on your warehouse:
- BigQuery: by appending a _<user> suffix to schema names
- DuckDB: by adding a _<user> suffix to the database file name
In other words, a development environment is used by default. Use the --production flag when executing lea run to disable this behaviour, and instead target the production environment.
The <user> is determined automatically from the login name. It can be overridden by setting the LEA_USERNAME environment variable.
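For example, to force the suffix regardless of the login name (the value max is just an example):

LEA_USERNAME=max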
Selecting scripts
A single script can be run:
lea run --select core.users
Several scripts can be run:
lea run --select core.users --select core.orders
Similar to dbt, lea also supports graph operators:
lea run --select core.users+   # users and everything that depends on it
lea run --select +core.users   # users and everything it depends on
lea run --select +core.users+  # users and all its ancestors and descendants
You can select all scripts in a schema:
lea run --select core/  # the trailing slash matters
This also works with sub-schemas:
lea run --select analytics.finance/
There are thus 8 possible operators:
schema.table (table by itself)
schema.table+ (table with its descendants)
+schema.table (table with its ancestors)
+schema.table+ (table with its ancestors and descendants)
schema/ (all tables in schema)
schema/+ (all tables in schema with their descendants)
+schema/ (all tables in schema with their ancestors)
+schema/+ (all tables in schema with their ancestors and descendants)
Combinations are possible:
lea run --select core.users+ --select +core.orders
There's an Easter egg that allows choosing scripts that have been committed or modified in the current Git branch:
lea run --select git
lea run --select git+  # includes all descendants
This becomes very handy when using lea in continuous integration.
Write-Audit-Publish (WAP)
WAP is a data engineering pattern that ensures data consistency and reliability. It's the data engineering equivalent of blue-green deployment in the software engineering world.
lea follows the WAP pattern by default. When you execute lea run, it actually creates temporary tables that have an ___audit suffix. These audit tables are promoted to replace the existing tables once they have all been materialized without errors.
This is a good default behavior. Let's say you refresh table foo. Then you refresh table bar that depends on foo. If the refresh of bar fails, you're left with a corrupt state. This is what the WAP pattern solves. In WAP mode, when you run foo's script, it creates a foo___audit table. If bar's script fails, then the run stops and foo is not modified.
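Since audit tables are ordinary tables, you can inspect the pending data before it gets promoted. A minimal sketch, reusing the foo example above (the row count query is just an illustration):

-- foo___audit holds the freshly materialized data; foo still holds the previous version
SELECT COUNT(*) AS n_rows
FROM foo___audit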
Testing while running
There is no lea test command. Tests are run together with the regular scripts when lea run is executed. The run stops whenever a test fails.
There are two types of tests:
- Singular tests — these are queries which return failing rows. They are stored in a tests directory. A sketch of one follows this list.
- Assertion tests — these are comment annotations in the queries themselves:
  - #NO_NULLS — checks that all values in a column are not null.
  - #UNIQUE — checks that a column's values are unique.
  - #UNIQUE_BY(<by>) — checks that a column's values are unique within a group.
  - #SET{<elements>} — checks that a column's values are in a set of values.
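Here's what a singular test could look like. This is a sketch: the file name tests/check_n_users.sql matches the example command further down, and core.users is the table used throughout this README. Rows returned by the query are the failing rows, so the test passes when the query returns nothing:

-- tests/check_n_users.sql
-- Returns a row (and therefore fails) only when core.users is empty
SELECT COUNT(*) AS n_users
FROM core.users
HAVING COUNT(*) = 0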
Here's an example of a query annotated with assertion tests:
SELECT
    -- #UNIQUE
    -- #NO_NULLS
    user_id,
    -- #NO_NULLS
    address,
    -- #UNIQUE_BY(address)
    full_name,
    -- #SET{'A', 'B', 'AB', 'O'}
    blood_type
FROM core.users
You can run a single test via the --select flag:
lea run --select tests.check_n_users
Or even run all the tests, like so:
lea run --select tests/  # the trailing slash matters
☝️ When you run a script that is not a test, all the applicable tests are run as well. For instance, the following command will run the core.users script and all the tests that are applicable to it:
lea run --select core.users
You may decide to run all scripts without executing tests, which is obviously not advisable:
lea run --unselect tests/
lea run --select core.users --unselect tests/
Skipping unmodified scripts during development
When you call lea run, it generates audit tables, which are then promoted to replace the original tables. This is done to ensure that the data is consistent and reliable. lea doesn't run a script when its audit table already exists and the script hasn't been modified since that audit table was created. This avoids unnecessary re-runs of scripts that haven't changed.
For instance:
- You execute lea run to sync all tables from sources; no errors, all tables are materialized.
- You modify a script named core/expenses.sql, which depends on staging/customers.sql and staging/orders.sql.
- You execute lea run --select core.expenses+ to run all the impacted tables again.
- core__expenses___audit is materialized in your data warehouse, but the -- #NO_NULLS assertion test on a column fails.
- After reviewing the data in core__expenses___audit, you edit and fix core/expenses.sql to filter out the rows where NULLs appear.
- You execute lea run.
- The staging/customers.sql and staging/orders.sql scripts are skipped because they were modified before staging__customers and staging__orders were last materialized.
- The core/expenses.sql script is run because it was modified after core__expenses was last materialized.
- All audit tables are wiped from the database, as the whole DAG has run successfully! 🎉
You can disable this behavior altogether:
Warehouse specific features
BigQuery
Clustering
Default clustering
At Carbonfact, we cluster most of our tables by customer. This is done to optimize query performance and reduce costs. lea allows you to automatically cluster tables that contain a given field:
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug
You can also specify multiple fields, meaning that tables which contain both fields will be clustered:
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug,brand_slug
For each table, lea will use the clustering fields it can and ignore the others. With the previous configuration, if your table defines account_slug and not brand_slug, it will cluster by account_slug.
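For example (a sketch with made-up column names), a script that selects account_slug but not brand_slug would produce a table clustered by account_slug only:

SELECT
    account_slug,    -- matches a default clustering field, so the table is clustered by it
    spend_kg_co2e    -- made-up metric column; brand_slug isn't present, so it's simply ignored
FROM my_table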
Table specific clustering
You can also define clustering fields for a specific table:
SELECT
account_slug,
-- #CLUSTERING_FIELD
object_kind,
value
FROM my_table
If you define specific clustering fields for a table, they will be added in addition to the default ones. This is done to ensure that project-wide clustering fields are kept up to date in each table.
Big Blue Pick API
Big Blue is a SaaS product to monitor and optimize BigQuery costs. As part of their offering, they provide a Pick API. The idea is that some queries should be run on-demand, while others should be run on a reservation. Big Blue's Pick API suggests which billing model to use for each query.
We use this at Carbonfact, and so this API is available out of the box in lea. You can enable it by setting the following environment variables:
LEA_BQ_BIG_BLUE_PICK_API_KEY=<get this from https://your-company.biq.blue/settings.html>
LEA_BQ_BIG_BLUE_PICK_API_URL=https://pick.biq.blue
LEA_BQ_BIG_BLUE_PICK_API_ON_DEMAND_PROJECT_ID=on-demand-compute-project-id
LEA_BQ_BIG_BLUE_PICK_API_REVERVATION_PROJECT_ID=reservation-compute-project-id
Contributing
Feel free to reach out to max@carbonfact.com if you want to know more and/or contribute 😊
We have suggested some issues as good places to get started.
License
lea is free and open-source software licensed under the Apache License, Version 2.0.