lea is a simple SQL runner. You write SQL scripts, organize them in folders, and lea takes care of the rest: it figures out the dependency order, runs them concurrently, and materializes the results as tables in your data warehouse.
Think of it as a minimalist alternative to dbt or SQLMesh. We use it every day at Carbonfact to manage our BigQuery data warehouse.
## TLDR
You organize SQL scripts in a scripts/ directory:
```
scripts/
    staging/
        customers.sql
        orders.sql
    core/
        revenue.sql
```
Each script is a SELECT statement:
```sql
-- scripts/core/revenue.sql
SELECT
    customers.name,
    SUM(orders.amount) AS total
FROM staging.customers
JOIN staging.orders ON orders.customer_id = customers.id
GROUP BY 1
```
Then you run them:
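In the simplest case this is a single command (as described under Usage, lea run executes all scripts by default):

```shell
lea run
```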
lea parses the SQL, sees that core.revenue depends on staging.customers and staging.orders, and runs them in the right order. Each script becomes a table in your warehouse.
## Table of contents
- TLDR
- Table of contents
- Installation
- Configuration
- Usage
- Warehouse specific features
- Examples
- Contributing
- License
## Installation
Installing the package provides the lea command, and also makes the lea Python library available.
## Configuration

lea is configured via environment variables. They can be provided in a .env file, or directly in the shell. Each command also has an --env flag for providing a path to a .env file.
### BigQuery

```sh
# Required
LEA_WAREHOUSE=bigquery
# Required
LEA_BQ_LOCATION=EU
# Required
LEA_BQ_DATASET_NAME=kaya
# Required, the project where the dataset is located
LEA_BQ_PROJECT_ID=carbonfact-dwh
# Optional, allows using a different project for compute
LEA_BQ_COMPUTE_PROJECT_ID=carbonfact-dwh-compute
# Not necessary if you're logged in with the gcloud CLI
LEA_BQ_SERVICE_ACCOUNT=<JSON dump of the service account file>  # not a path
# Defaults to https://www.googleapis.com/auth/bigquery
LEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive
# LOGICAL or PHYSICAL, defaults to PHYSICAL
LEA_BQ_STORAGE_BILLING_MODEL=PHYSICAL
```
### DuckDB

```sh
# Required
LEA_WAREHOUSE=duckdb
# Required
LEA_DUCKDB_PATH=duckdb.db
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
```
### MotherDuck

```sh
# Required
LEA_WAREHOUSE=motherduck
# Required
MOTHERDUCK_TOKEN=<get this from https://app.motherduck.com/settings/tokens>
# Required
LEA_MOTHERDUCK_DATABASE=bike_sharing
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
```
### DuckLake
DuckLake needs a database to manage metadata, which is what LEA_DUCKLAKE_CATALOG_DATABASE is for.
Local storage:
```sh
LEA_WAREHOUSE=ducklake
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
LEA_DUCKLAKE_DATA_PATH=/path/to/data
```
S3:
```sh
LEA_WAREHOUSE=ducklake
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
LEA_DUCKLAKE_DATA_PATH=s3://my-bucket/data
LEA_DUCKLAKE_S3_ENDPOINT=storage.googleapis.com
```
GCS (requires HMAC keys):
```sh
LEA_WAREHOUSE=ducklake
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
LEA_DUCKLAKE_DATA_PATH=gcs://my-bucket/data
LEA_DUCKLAKE_GCS_KEY_ID=GOOG1E...
LEA_DUCKLAKE_GCS_SECRET=...
```
## Usage

### Selecting scripts
By default, lea run runs all scripts. You can select specific scripts:
```sh
lea run --select core.revenue
lea run --select core.revenue --select core.users
```
You can select all scripts in a schema:
Graph operators let you include dependencies or dependents:
```sh
lea run --select core.revenue+   # revenue and everything that depends on it
lea run --select +core.revenue   # revenue and everything it depends on
lea run --select +core.revenue+  # both directions
```
You can also select scripts that have been modified in the current Git branch:
```sh
lea run --select git   # modified scripts
lea run --select git+  # modified scripts and their dependents
```

This is handy in continuous integration.
You can exclude scripts with --unselect:
```sh
lea run --unselect tests/
```
### Development vs. production
By default, lea creates a development environment isolated from production by appending _<user> to dataset names. Use --production to target the production environment:
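For example, to run against production:

```shell
lea run --production
```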
The <user> is determined from the login name. It can be overridden with the LEA_USERNAME environment variable.
### Jinja templating
SQL queries can be templated with Jinja. Use a .sql.jinja extension.
You have access to env (i.e. os.environ) and a load_yaml function:
```sql
{% set taxonomy = load_yaml('core/taxonomies/product.yaml') %}

SELECT
    {% for dim in taxonomy.dimensions %}
    MAX(IF(key = '{{ dim.key }}', value, NULL)) AS {{ dim.column }},
    {% endfor %}
    account_slug
FROM core.raw_attributes
```
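The env mapping works the same way; a minimal sketch, assuming a hypothetical REGION environment variable is set:

```sql
SELECT *
FROM core.users
WHERE region = '{{ env.REGION }}'
```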
### Testing
There is no separate lea test command. Tests run alongside regular scripts during lea run. The run stops whenever a test fails.
There are two types of tests:
- Singular tests are queries stored in a tests/ directory. They fail if they return any rows.
- Assertion tests are comment annotations in the queries themselves:
```sql
SELECT
    -- #UNIQUE
    -- #NO_NULLS
    user_id,
    -- #NO_NULLS
    address,
    -- #UNIQUE_BY(address)
    full_name,
    -- #SET{'A', 'B', 'AB', 'O'}
    blood_type
FROM core.users
```
Available tags: #NO_NULLS, #UNIQUE, #UNIQUE_BY(<by>), #SET{<elements>}.
When you run a script, all applicable tests are run as well.
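As an illustration of a singular test, consider a hypothetical file (the table and columns are taken from the TLDR example) that surfaces violating rows:

```sql
-- tests/negative_totals.sql (hypothetical file)
-- The test fails if this query returns any rows
SELECT name, total
FROM core.revenue
WHERE total < 0
```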
### Write-Audit-Publish (WAP)
lea follows the WAP pattern by default. When you execute lea run, scripts are first materialized into temporary ___audit tables. These are promoted to replace the real tables only once everything has succeeded without errors.
This prevents partial updates. If script bar depends on foo and bar fails, foo is not modified either.
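Conceptually, the cycle for a single script can be sketched as follows (simplified pseudo-SQL, not lea's actual statements):

```sql
-- Write: materialize the script into an audit table
CREATE OR REPLACE TABLE core.revenue___audit AS
SELECT ...;  -- the script's SELECT statement

-- Audit: tests run against core.revenue___audit

-- Publish: once every script and test has succeeded,
-- the audit table replaces the real one
CREATE OR REPLACE TABLE core.revenue AS
SELECT * FROM core.revenue___audit;
```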
### Skipping unmodified scripts

lea doesn't re-run a script when its audit table already exists and the script hasn't been modified since. This avoids unnecessary work during development. This behavior can be disabled if needed.
### Quack mode
Quack mode runs your scripts locally with DuckDB instead of your cloud warehouse. This makes local iteration much faster and doesn't incur any cloud costs.
```sh
lea run --select core.users --quack
```
lea automatically pulls the necessary upstream tables from your warehouse into a DuckLake instance, and only pulls what's missing. SQL is transpiled to DuckDB automatically.
You'll need to configure a DuckLake instance for storage, in addition to your regular warehouse configuration.
Local storage:
```sh
LEA_QUACK_DUCKLAKE_CATALOG_DATABASE=quack.ducklake
LEA_QUACK_DUCKLAKE_DATA_PATH=/path/to/quack/data
```
S3:
```sh
LEA_QUACK_DUCKLAKE_CATALOG_DATABASE=quack.ducklake
LEA_QUACK_DUCKLAKE_DATA_PATH=s3://my-bucket/quack/data
LEA_QUACK_DUCKLAKE_S3_ENDPOINT=storage.googleapis.com
```
GCS (requires HMAC keys):
```sh
LEA_QUACK_DUCKLAKE_CATALOG_DATABASE=quack.ducklake
LEA_QUACK_DUCKLAKE_DATA_PATH=gcs://my-bucket/quack/data
LEA_QUACK_DUCKLAKE_GCS_KEY_ID=GOOG1E...
LEA_QUACK_DUCKLAKE_GCS_SECRET=...
```
You can push the DuckLake tables back to your warehouse with --quack-push:
```sh
lea run --quack --quack-push
```
To inspect the resulting DuckLake tables in a browser, use lea's built-in UI command.
This opens the DuckDB UI with your DuckLake catalog attached, so you can easily browse them.
"Quack mode" is mostly intended for cheap local development with a rapid feedback loop. But you could also run your production refreshes with DuckDB and upload the artifacts to your warehouse with --quack-push. This can be a great way to save on compute costs.
## Warehouse specific features

### BigQuery

#### Default clustering
lea can automatically cluster tables that contain given fields:
```sh
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug,brand_slug
```
For each table, lea uses whichever configured clustering fields are present and ignores the rest.
#### Table specific clustering
You can also define clustering fields for a specific table:
```sql
SELECT
    account_slug,
    -- #CLUSTERING_FIELD
    object_kind,
    value
FROM my_table
```
Table-specific clustering fields are added in addition to the default ones.
#### Script-specific compute projects
You can route specific scripts to different compute projects:
```sh
LEA_BQ_SCRIPT_SPECIFIC_COMPUTE_PROJECT_IDS={"dataset.schema__table": "reservation-project-id"}
```

Scripts not listed use the default LEA_BQ_COMPUTE_PROJECT_ID.
#### Big Blue Pick API
Big Blue provides a Pick API that suggests whether to run a query on-demand or on a reservation. lea supports this out of the box:
```sh
LEA_BQ_BIG_BLUE_PICK_API_KEY=<get this from https://your-company.biq.blue/settings.html>
LEA_BQ_BIG_BLUE_PICK_API_URL=https://pick.biq.blue
LEA_BQ_BIG_BLUE_PICK_API_ON_DEMAND_PROJECT_ID=on-demand-compute-project-id
LEA_BQ_BIG_BLUE_PICK_API_REVERVATION_PROJECT_ID=reservation-compute-project-id
```
## Examples
## Contributing
Feel free to reach out to max@carbonfact.com if you want to know more and/or contribute.
We have suggested some issues as good places to get started.
## License
lea is free and open-source software licensed under the Apache License, Version 2.0.