Dux: Distributed DuckDB-native DataFrames for Elixir



Lazy pipelines compile to SQL CTEs. DuckDB handles the compute. The BEAM handles distribution. Graph analytics, cross-source joins, Nx interop — one library, one engine, fully exposed.


{:dux, "~> 0.3.0"}

the api

Verbs that pipe

Every operation is a function on %Dux{}. Filter, mutate, group, summarise, join — they all return a new struct. Nothing hits DuckDB until you call compute/1.

elixir

require Dux

Dux.from_parquet("s3://data/sales/**/*.parquet")
|> Dux.filter(amount > 100 and region == ^selected_region)
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:product)
|> Dux.summarise(total: sum(revenue), orders: count(product))
|> Dux.sort_by(desc: :total)
|> Dux.to_parquet("results.parquet", compression: :zstd)

Bare identifiers are columns: amount, region, price. The Query macro captures the whole expression as an AST.

^ is interpolation: Elixir values become parameter bindings ($1, $2), so queries are SQL-injection safe by construction.
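
The idea can be sketched in plain Elixir (hypothetical helper names, not Dux internals): walk the quoted expression, emit SQL for operators and columns, and turn each `^` pin into a numbered placeholder while collecting the pinned values as parameters.

```elixir
defmodule PinDemo do
  # Hypothetical sketch of the compile step: walk a quoted Elixir
  # expression, emit a SQL fragment, and turn each ^ pin into a
  # $n placeholder with its value appended to the parameter list.
  def to_sql(ast), do: walk(ast, [])

  defp walk({op, _, [l, r]}, params) when op in [:>, :<, :==, :and, :or, :*, :+] do
    {lsql, params} = walk(l, params)
    {rsql, params} = walk(r, params)
    {"(#{lsql} #{sql_op(op)} #{rsql})", params}
  end

  defp walk({:^, _, [value]}, params) do
    params = params ++ [value]
    {"$#{length(params)}", params}
  end

  # bare identifiers are columns
  defp walk({col, _, ctx}, params) when is_atom(col) and is_atom(ctx),
    do: {Atom.to_string(col), params}

  defp walk(literal, params) when is_number(literal),
    do: {to_string(literal), params}

  defp sql_op(:==), do: "="
  defp sql_op(:and), do: "AND"
  defp sql_op(op), do: Atom.to_string(op)
end

selected_region = "EMEA"
ast = quote do: amount > 100 and region == ^unquote(selected_region)
PinDemo.to_sql(ast)
# => {"((amount > 100) AND (region = $1))", ["EMEA"]}
```

The pinned value never touches the SQL string, which is why injection is impossible by construction.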

DuckDB reads S3 directly. httpfs extension, no ExAws, no Req. Parquet glob expansion is native.

Everything is lazy until to_parquet/2 triggers compute/1. The full pipeline compiles to a single SQL query with CTEs.

under the hood

Elixir expressions compile to SQL

Each verb becomes a CTE. The pipeline is a chain. DuckDB handles all pushdown optimisation.

elixir

Dux.from_parquet("sales.parquet")
|> Dux.filter(year > 2024)
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(revenue))
|> Dux.compute()

sql

WITH __s0 AS (
  SELECT * FROM read_parquet('sales.parquet')
  WHERE year > 2024
),
__s1 AS (
  SELECT *, (price * quantity) AS revenue
  FROM __s0
),
__s2 AS (
  SELECT region, SUM(revenue) AS total
  FROM __s1 GROUP BY region
)
SELECT * FROM __s2

No LazySeries intermediate. Query macro → AST traversal → SQL. One step.
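
The chaining itself is a simple fold. A sketch with simplified op tuples (not Dux's real structs): each op renders a SELECT over the previous stage's name, and the final SELECT reads the last stage.

```elixir
defmodule CteDemo do
  # Fold a list of {verb, fragment} ops into a WITH chain. Each
  # stage selects from the stage before it; the trailing SELECT
  # reads the last CTE.
  def compile(source, ops) do
    {ctes, last} =
      ops
      |> Enum.with_index()
      |> Enum.map_reduce(source, fn {{verb, frag}, i}, prev ->
        name = "__s#{i}"
        {"#{name} AS (#{stage(verb, frag, prev)})", name}
      end)

    "WITH " <> Enum.join(ctes, ",\n") <> "\nSELECT * FROM #{last}"
  end

  defp stage(:filter, frag, prev), do: "SELECT * FROM #{prev} WHERE #{frag}"
  defp stage(:mutate, frag, prev), do: "SELECT *, #{frag} FROM #{prev}"
  defp stage(:group_summarise, {sel, key}, prev),
    do: "SELECT #{sel} FROM #{prev} GROUP BY #{key}"
end

CteDemo.compile("read_parquet('sales.parquet')", [
  {:filter, "year > 2024"},
  {:mutate, "(price * quantity) AS revenue"},
  {:group_summarise, {"region, SUM(revenue) AS total", "region"}}
])
```

DuckDB inlines and reorders the CTEs during planning, so the chain costs nothing at execution time.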

distribution

DuckDB per node. The BEAM for everything else.

Spark proved that DataFrames can be distributed. But Spark carries the weight of the JVM — slow startup, heavy serialisation, complex cluster management. The BEAM already has all of this, battle-tested for 30+ years.

elixir

# Workers auto-register via :pg
workers = Dux.Remote.Worker.list()

Dux.from_parquet("s3://lake/events/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(amount > 100)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(amount))
|> Dux.compute()

No function serialisation. %Dux{} is plain data — an AST of operations. Ship it to any BEAM node, compile to SQL there. The code is already deployed.
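
Because the struct is plain data, shipping it is just term serialisation, with no closures to marshal. A sketch with a stand-in struct (FakeDux is illustrative, not the real %Dux{} shape):

```elixir
defmodule FakeDux do
  # Stand-in for %Dux{}: a source plus a list of op tuples.
  defstruct source: nil, ops: []
end

df = %FakeDux{
  source: {:parquet, "s3://lake/events/**/*.parquet"},
  ops: [{:filter, {:>, :amount, 100}}, {:group_by, :region}]
}

# BEAM distribution serialises plain terms transparently; the same
# bytes could travel through an :erpc call to any connected node.
binary = :erlang.term_to_binary(df)
^df = :erlang.binary_to_term(binary)
```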

No cluster manager. libcluster + :pg. Nodes discover each other, join process groups, start working.

Arrow IPC over BEAM distribution. Same bytes in memory and on the wire; 6 bytes of framing overhead. Backpressure, node monitoring, fault tolerance — all for free.

Lattice-based streaming merge. SUM, COUNT, MIN, MAX — partial results fold as workers complete. Progressive results before the last worker finishes.
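
Why these aggregates can stream: merging partials is associative and commutative, so results fold correctly in whatever order workers finish. A pure-Elixir sketch of the merge step (illustrative shapes, not Dux internals):

```elixir
defmodule MergeDemo do
  # Fold per-worker partial aggregates, keyed by group. Arrival
  # order does not matter: each merge is associative and commutative.
  def merge(a, b) do
    Map.merge(a, b, fn _group, x, y ->
      %{sum: x.sum + y.sum, count: x.count + y.count,
        min: min(x.min, y.min), max: max(x.max, y.max)}
    end)
  end
end

w1 = %{"eu" => %{sum: 10, count: 2, min: 3, max: 7}}
w2 = %{"eu" => %{sum: 5, count: 1, min: 5, max: 5},
       "us" => %{sum: 8, count: 2, min: 1, max: 7}}

# Progressive result after two of N workers have reported:
MergeDemo.merge(w1, w2)
```

AVG folds the same way by carrying sum and count and dividing only at the end.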

graph analytics

A graph is two dataframes

Vertices + edges. Every graph algorithm reduces to joins and aggregations. PageRank is iterative join → group_by → sum → normalise. Shortest path uses DuckDB's USING KEY recursive CTEs (SIGMOD 2025).

elixir

graph = Dux.Graph.new(vertices: users, edges: follows)

# Who is most influential?
graph
|> Dux.Graph.pagerank(iterations: 20)
|> Dux.sort_by(desc: :rank)
|> Dux.head(10)

# Shortest path via USING KEY recursive CTE
graph |> Dux.Graph.shortest_paths(start_node)

# Connected components, triangle counting
graph |> Dux.Graph.connected_components()
graph |> Dux.Graph.triangle_count()

# Graph results are %Dux{} — pipe into any verb
graph
|> Dux.Graph.pagerank()
|> Dux.join(departments, on: :id)
|> Dux.group_by(:dept)
|> Dux.summarise(avg_rank: avg(rank))
|> Dux.compute()
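
The claim that PageRank is just join → group_by → sum → normalise can be seen in a toy in-memory version (plain Elixir over an edge list; Dux runs the same shape as SQL):

```elixir
defmodule PageRankDemo do
  @damping 0.85

  # One iteration: each edge carries rank/out_degree from its source
  # (the join), contributions are summed per target (group_by + sum),
  # then the damping term is applied (normalise).
  def run(edges, iterations) do
    vertices = edges |> Enum.flat_map(fn {s, d} -> [s, d] end) |> Enum.uniq()
    n = length(vertices)
    out_degree = Enum.frequencies_by(edges, fn {s, _} -> s end)
    init = Map.new(vertices, &{&1, 1.0 / n})

    Enum.reduce(1..iterations, init, fn _iter, ranks ->
      contribs =
        edges
        |> Enum.map(fn {s, d} -> {d, ranks[s] / out_degree[s]} end)
        |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))

      Map.new(vertices, fn v ->
        {v, (1 - @damping) / n + @damping * Enum.sum(Map.get(contribs, v, []))}
      end)
    end)
  end
end

# a -> b, a -> c, b -> c, c -> a: c collects from both a and b
PageRankDemo.run([{:a, :b}, {:a, :c}, {:b, :c}, {:c, :a}], 20)
```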

cross-source

Postgres + S3 + Parquet in one pipeline

Attach databases, reference tables, join across sources. DuckDB pushes filters down to remote databases. No ETL, no staging.

elixir

Dux.attach(:warehouse, "postgresql://prod/analytics", type: :postgres)

orders    = Dux.from_parquet("s3://exports/orders/*.parquet")
customers = Dux.from_attached(:warehouse, "public.customers")

orders
|> Dux.join(customers, on: :customer_id)
|> Dux.filter(order_date > ~D[2026-01-01])
|> Dux.group_by(:region)
|> Dux.summarise(revenue: sum(amount))
|> Dux.compute()

Postgres, MySQL, SQLite, Iceberg, Delta Lake, DuckLake — all via DuckDB ATTACH.
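
Under the hood this maps onto DuckDB's ATTACH. A sketch of the SQL the attach call above would issue (connection string simplified):

```sql
-- Load the postgres extension and attach the remote database
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=analytics host=prod' AS warehouse (TYPE postgres);

-- Attached tables are referenced by qualified name; filters on
-- them are pushed down to Postgres
SELECT * FROM warehouse.public.customers WHERE region = 'EMEA';
```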

interop

Nx tensors, zero-copy

ADBC returns query results as Arrow columnar data. After compute/1, numeric columns become Nx tensors via zero-copy pointer passing. Dux implements Nx.LazyContainer.

elixir

tensor = Dux.to_tensor(df, :price)
# #Nx.Tensor<f64[1000] [...]>

escape hatch

When the macro doesn't cover it

The _with variants accept raw DuckDB SQL strings. Window functions, CASE WHEN, PIVOT, arbitrary CTEs — anything DuckDB can do.

elixir

# Window functions
Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)")

# Any DuckDB SQL
Dux.from_query("PIVOT sales ON product USING SUM(amount) GROUP BY region")

architecture

What's in the box

api

Dux

The module is the dataframe. All verbs live here. %Dux{} is a source + ops list — plain Elixir data, freely serialisable.

compiler

Query → SQL

Macro captures expressions as AST. Direct traversal to DuckDB SQL. ^ interpolations become $1, $2 bindings.

engine

DuckDB via ADBC

Pure Elixir ADBC wrapper. Zero Rust. Precompiled driver downloaded at compile time. Arrow columnar storage.

distribution

Coordinator

Partition, fan-out, merge. PipelineSplitter decomposes ops into worker-safe + coordinator-only. Arrow IPC wire format.

runtime

BEAM

:pg for discovery. :erpc for fan-out. Supervisor trees for fault tolerance. No JVM, no cluster manager.

graph

Dux.Graph

PageRank, shortest paths, connected components, triangle counting. Algorithms are verb compositions. Distributable.
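
The PipelineSplitter step in the distribution component can be sketched as a split over the ops list (hypothetical verb names): partition-safe verbs run on every worker, while the first op that needs the whole result, and everything after it, stays on the coordinator.

```elixir
defmodule SplitterDemo do
  # filter/mutate/group_by/summarise run per partition (partial
  # aggregates merge later); sort/head need the full result, so
  # they and all later ops run on the coordinator.
  @worker_safe [:filter, :mutate, :group_by, :summarise]

  def split(ops) do
    Enum.split_while(ops, fn {verb, _} -> verb in @worker_safe end)
  end
end

ops = [{:filter, "amount > 100"}, {:summarise, "SUM(amount)"},
       {:sort_by, :total}, {:head, 10}]

SplitterDemo.split(ops)
```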

livebook

Rich rendering out of the box

Add kino_dux to your Livebook setup cell. Lazy pipelines render with source provenance, accumulated operations, and generated SQL. Computed results become interactive data tables.

elixir

Mix.install([
  {:dux, "~> 0.3.0"},
  {:kino_dux, "~> 0.1"}
])

Smart cells for SQL preview, data source browsing, chart building, and FLAME cluster configuration.

io

Read and write anything DuckDB can

elixir

# Read
Dux.from_csv("data.csv", delimiter: "\t")
Dux.from_parquet("data/**/*.parquet")
Dux.from_ndjson("events.ndjson")
Dux.from_excel("sales.xlsx", sheet: "Q1")
Dux.from_query("SELECT * FROM read_parquet('s3://bucket/data.parquet')")

# Write
Dux.to_csv(df, "output.csv")
Dux.to_parquet(df, "output.parquet", partition_by: [:year, :month])
Dux.to_excel(df, "report.xlsx")
Dux.to_ndjson(df, "output.ndjson")
Dux.insert_into(df, "pg.public.events", create: true)

S3, HTTP, Postgres, MySQL, SQLite, Excel, DuckLake — all via DuckDB extensions. No separate libraries.