Substrait: Cross-Language Serialization for Relational Algebra
github.comI struggle a bit to understand what Substrait wants to accomplish. They want to create a representation of data transformations (eg group by…) and relational algebra at what level? So it’s like a BLAS for relational algebra?
They are targeting the same problem as ANSI SQL or ZetaSQL, but at a lower level. The standardized SQL dialects allow a common way for authors to express a computation. Query engines parse SQL into a "Logical Plan" (LP), which is a graph (generally a DAG) of transformation nodes, and edges representing data flow. For example, `SELECT a FROM t WHERE b > 0` would become something similar to:
PROJECT([COLUMN("a")]) <- FILTER(EXPRESSION(COMPARE(COLUMN("b"), LITERAL(0), GREATER_THAN))) <- READ_TABLE(["a", "b"], TABLE("t"))
Currently, most engines have their own logical plan representation, with its own serialization, etc. This prevents several things that would be nice: 1. Alternative front-ends to get the LP that are not parsing SQL (eg, dataframes or builders). For example, Spark allows multiple ways to create an LP: SQL, Scala Dataset, etc, but this only works for Spark. 2. Standardized tools to analyze, display, store, etc LPs. 3. Cross-communication to different back-ends that will convert the LP to a lower-level representation (eg, execution plan or distributed plan). 4. Reusing other engines LP (they have a lot of similarities), so a new engine will have to re-invent the wheel.
Substrait is providing a standard, cross-language and cross-engine compatible Logical Plan (although people disagree on exactly the definition of LP so they describe it differently). Apache Calcite is another project that is addressing the same space, although it has more components (a SQL parser, etc). The Substrait project has historical ties to Calcite, and they describe why they chose their approach.
Another way of looking at it is that modern query engines have four macro steps: 1. Getting an LP (usually from parsing SQL). 2. Converting the LP to a "physical execution plan", possibly a distributed plan. 3. If distributed, requisitioning workers and distributing the execution sub-plans to them. 4. The workers (if distributed) or the engine (if single-box) executing the execution plan, by performing the actual computations on actual bytes.
(Note that this is a cartoon version and any given engine will have differences, eg Presto/Trino does not have a clear distinction between LP and Physical Plan.)
1) and 4) are broadly similar across engines, while 2) and 3) vary widely partially because of different requirements (reliability, latency, etc). Projects such as Apache Arrow and Velox (https://github.com/facebookincubator/velox) are making common tools for 4), and as mentioned ANSI SQL, ZetaSQL, Calcite, and Substrait are making common tools for 1).
I also don't understand what problem this solves.
Me neither. That's why I posted here to see if someone could give some insights on this.