Trino (aka Presto or Starburst) is an open-source component for querying multiple databases simultaneously. It can execute distributed queries in many different data sources.
dbt is the leading ELT framework that everybody uses :)
I use Trino and dbt for medallion architecture implementation. Medallion architecture is a data design pattern for organizing data in a data lake and a data warehouse to incrementally and progressively improve the structure and quality of data as it flows through each stage of the architecture, from bronze ⇒ silver ⇒ gold tables.
Each stage (bronze, silver, goal) has its own Trino schema mapped to an underlying database. In this article, I’ll use a stock market example that maps the bronze and silver data stages to Apache Spark schemas and the gold stage to a Postgres database.
Press enter or click to view image in full size
I use three dbt projects for ELT data transformation in each data stage. The dbt is connected to the Trino database and executes cross-schema queries to transfer data from one stage to another.
Press enter or click to view image in full size
For example, a gold dbt project’s model that moves data from the silver stage to the gold stage performs queries from a silver schema table.
SELECT
dt,
symbol,
price_predicted
FROM warehouse.silver.predicted_dataThis query physically moves data from Spark to Postgres.
Get ZD’s stories in your inbox
Join Medium for free to get updates from this writer.
The dbt project execution is orchestrated by Dagster.
Press enter or click to view image in full size
Trino supports many interesting data sources like local files, ElasticSearch, MongoDB, etc. So you can also use it as a connector that extracts data from an external data source and stores it in the bronze stage.
You can review the complete example with all dbt code, deployment, and configuration in this GitHub repository.
Thank you for reading, and let me know what you think!