Distributed DuckDB-native DataFrames for Elixir.
Pipelines compile to SQL. The BEAM handles the rest.
Lazy pipelines compile to SQL CTEs. DuckDB handles the compute. The BEAM handles distribution. Graph analytics, cross-source joins, Nx interop — one library, one engine, fully exposed.
{:dux, "~> 0.3.0"}
the api
Verbs that pipe
Every operation is a function on %Dux{}. Filter, mutate,
group, summarise, join — they all return a new struct. Nothing hits
DuckDB until you call compute/1.
elixir
require Dux

Dux.from_parquet("s3://data/sales/**/*.parquet")
|> Dux.filter(amount > 100 and region == ^selected_region)
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:product)
|> Dux.summarise(total: sum(revenue), orders: count(product))
|> Dux.sort_by(desc: :total)
|> Dux.to_parquet("results.parquet", compression: :zstd)
Bare identifiers are columns.
amount, region, price — the
Query macro captures the expression as AST.
^ is interpolation. Elixir values
become parameter bindings ($1, $2), making
queries SQL-injection safe by construction.
DuckDB reads S3 directly.
httpfs extension, no ExAws, no Req. Parquet glob
expansion is native.
Everything is lazy until
to_parquet/2 triggers compute/1. The
full pipeline compiles to a single SQL query with CTEs.
under the hood
Elixir expressions compile to SQL
Each verb becomes a CTE. The pipeline is a chain. DuckDB handles all pushdown optimisation.
elixir
Dux.from_parquet("sales.parquet")
|> Dux.filter(year > 2024)
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(revenue))
|> Dux.compute()
sql
WITH __s0 AS (
  SELECT * FROM read_parquet('sales.parquet')
  WHERE year > 2024
),
__s1 AS (
  SELECT *, (price * quantity) AS revenue
  FROM __s0
),
__s2 AS (
  SELECT region, SUM(revenue) AS total
  FROM __s1
  GROUP BY region
)
SELECT * FROM __s2
No LazySeries intermediate. Query macro → AST traversal → SQL. One step.
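A toy version of that traversal fits in a few lines of plain Elixir. This is not Dux's compiler, just a sketch of the shape under stated assumptions: handle a couple of operators, turn pins into numbered bindings, treat bare identifiers as columns.

```elixir
defmodule SqlSketch do
  # Compile a quoted Elixir expression to {sql, bindings}.
  def to_sql(ast) do
    {sql, params} = walk(ast, [])
    {sql, Enum.reverse(params)}
  end

  # Boolean and comparison operators map to their SQL spellings
  defp walk({:and, _, [l, r]}, p) do
    {ls, p} = walk(l, p)
    {rs, p} = walk(r, p)
    {"(#{ls} AND #{rs})", p}
  end

  defp walk({:>, _, [l, r]}, p), do: binop(">", l, r, p)
  defp walk({:==, _, [l, r]}, p), do: binop("=", l, r, p)

  # ^value becomes a numbered parameter binding, never inlined SQL
  defp walk({:^, _, [value]}, p) do
    p = [value | p]
    {"$#{length(p)}", p}
  end

  # A bare identifier is a column reference
  defp walk({name, _, ctx}, p) when is_atom(name) and is_atom(ctx),
    do: {Atom.to_string(name), p}

  defp walk(lit, p) when is_number(lit), do: {to_string(lit), p}

  defp binop(op, l, r, p) do
    {ls, p} = walk(l, p)
    {rs, p} = walk(r, p)
    {"#{ls} #{op} #{rs}", p}
  end
end

{sql, bindings} = SqlSketch.to_sql(quote do: amount > 100 and region == ^selected)
# sql => "(amount > 100 AND region = $1)"
```

The pinned value never touches the SQL string; it travels separately in the bindings list.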
distribution
DuckDB per node. The BEAM for everything else.
Spark proved that DataFrames can be distributed. But Spark carries the weight of the JVM — slow startup, heavy serialisation, complex cluster management. The BEAM already ships that distribution machinery, battle-tested for 30+ years.
elixir
# Workers auto-register via :pg
workers = Dux.Remote.Worker.list()

Dux.from_parquet("s3://lake/events/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(amount > 100)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(amount))
|> Dux.compute()
No function serialisation. %Dux{} is
plain data — an AST of operations. Ship it to any BEAM node,
compile to SQL there. The code is already deployed.
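Because the struct is plain terms rather than closures, it round-trips through Erlang's term serialisation untouched. A sketch with a stand-in map (the real %Dux{} fields are not shown here):

```elixir
# Stand-in shape for the idea: a source plus an ops list, no functions anywhere
df = %{
  source: {:parquet, "sales.parquet"},
  ops: [{:filter, {:>, :amount, 100}}, {:group_by, :region}]
}

# Plain data round-trips byte-for-byte and carries no code-version
# dependency, which is what makes shipping it to another node safe
^df = df |> :erlang.term_to_binary() |> :erlang.binary_to_term()
```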
No cluster manager. libcluster +
:pg. Nodes discover each other, join process groups,
start working.
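The discovery handshake is plain OTP. A single-node sketch using :pg directly (the group name :dux_workers is illustrative, and the scope would normally be started in a supervision tree):

```elixir
# Start the default pg scope
{:ok, _} = :pg.start_link()

# A worker registers itself in a named process group on startup
:ok = :pg.join(:dux_workers, self())

# A coordinator discovers workers by group membership
workers = :pg.get_members(:dux_workers)

# Fan-out goes over :erpc per member node; local node here for illustration
1 = :erpc.call(node(), Enum, :count, [workers])
```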
Arrow IPC over BEAM distribution. Same bytes in memory and on the wire, with 6 bytes of framing overhead. Backpressure, node monitoring, fault tolerance — all for free.
Lattice-based streaming merge. SUM, COUNT, MIN, MAX — partial results fold as workers complete. Progressive results before the last worker finishes.
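The merge step for those aggregates can be sketched with plain maps (the key names are assumptions for this sketch, not Dux's wire format). AVG folds the same way once decomposed into sum and count.

```elixir
# Partial aggregates from two workers
partials = [
  %{sum: 10, count: 3, min: 1, max: 7},
  %{sum: 5, count: 2, min: 0, max: 9}
]

# Each aggregate merges with an associative, commutative fold,
# so partials can arrive and combine in any order
merge = fn partial, acc ->
  Map.merge(acc, partial, fn
    :sum, a, b -> a + b
    :count, a, b -> a + b
    :min, a, b -> min(a, b)
    :max, a, b -> max(a, b)
  end)
end

[first | rest] = partials
merged = Enum.reduce(rest, first, merge)
# merged => %{sum: 15, count: 5, min: 0, max: 9}
```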
graph analytics
A graph is two dataframes
Vertices + edges. Every graph algorithm reduces to joins and
aggregations. PageRank is iterative join → group_by → sum → normalise.
Shortest path uses DuckDB's USING KEY recursive CTEs
(SIGMOD 2025).
elixir
graph = Dux.Graph.new(vertices: users, edges: follows)

# Who is most influential?
graph
|> Dux.Graph.pagerank(iterations: 20)
|> Dux.sort_by(desc: :rank)
|> Dux.head(10)

# Shortest path via USING KEY recursive CTE
graph |> Dux.Graph.shortest_paths(start_node)

# Connected components, triangle counting
graph |> Dux.Graph.connected_components()
graph |> Dux.Graph.triangle_count()

# Graph results are %Dux{} — pipe into any verb
graph
|> Dux.Graph.pagerank()
|> Dux.join(departments, on: :id)
|> Dux.group_by(:dept)
|> Dux.summarise(avg_rank: avg(rank))
|> Dux.compute()
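The "iterative join → group_by → sum" claim can be seen in miniature with plain Elixir data, no Dux involved: one damped PageRank iteration (damping 0.85) over a four-edge toy graph.

```elixir
# Follows as {src, dst} pairs
edges = [{:a, :b}, {:a, :c}, {:b, :c}, {:c, :a}]
ranks = %{a: 1.0, b: 1.0, c: 1.0}

out_deg = Enum.frequencies_by(edges, fn {src, _} -> src end)

# "join": each edge carries its source's rank share to the destination
contribs =
  edges
  |> Enum.map(fn {src, dst} -> {dst, ranks[src] / out_deg[src]} end)
  # "group_by": collect incoming shares per destination
  |> Enum.group_by(fn {dst, _} -> dst end, fn {_, share} -> share end)

# "sum" plus damping; vertices with no inbound edges keep the 0.15 base
new_ranks =
  Map.new(ranks, fn {v, _} ->
    {v, 0.15 + 0.85 * Enum.sum(Map.get(contribs, v, []))}
  end)
```

After one iteration :c (two inbound edges) rises and :b (one half-share inbound) falls, which is the shape each Dux.Graph.pagerank/2 iteration repeats.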
cross-source
Postgres + S3 + Parquet in one pipeline
Attach databases, reference tables, join across sources. DuckDB pushes filters down to remote databases. No ETL, no staging.
elixir
Dux.attach(:warehouse, "postgresql://prod/analytics", type: :postgres)

orders = Dux.from_parquet("s3://exports/orders/*.parquet")
customers = Dux.from_attached(:warehouse, "public.customers")

orders
|> Dux.join(customers, on: :customer_id)
|> Dux.filter(order_date > ~D[2026-01-01])
|> Dux.group_by(:region)
|> Dux.summarise(revenue: sum(amount))
|> Dux.compute()
Postgres, MySQL, SQLite, Iceberg, Delta Lake, DuckLake — all via DuckDB ATTACH.
interop
Nx tensors, zero-copy
DuckDB stores Arrow columnar data internally. After
compute/1, numeric columns become Nx tensors via pointer
passing. Dux implements Nx.LazyContainer.
elixir
tensor = Dux.to_tensor(df, :price)
# #Nx.Tensor<f64[1000] [...]>
escape hatch
When the macro doesn't cover it
The _with variants accept raw DuckDB SQL strings. Window
functions, CASE WHEN, PIVOT, arbitrary CTEs — anything DuckDB can do.
elixir
# Window functions
Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)")

# Any DuckDB SQL
Dux.from_query("PIVOT sales ON product USING SUM(amount) GROUP BY region")
architecture
What's in the box
api
Dux
The module is the dataframe. All verbs live here.
%Dux{} is a source + ops list — plain Elixir data,
freely serialisable.
compiler
Query → SQL
Macro captures expressions as AST. Direct traversal to DuckDB SQL.
^ interpolations become $1,
$2 bindings.
engine
DuckDB via ADBC
Pure Elixir ADBC wrapper. Zero Rust. Precompiled driver downloaded at compile time. Arrow columnar storage.
distribution
Coordinator
Partition, fan-out, merge. PipelineSplitter decomposes ops into worker-safe + coordinator-only. Arrow IPC wire format.
runtime
BEAM
:pg for discovery. :erpc for fan-out.
Supervisor trees for fault tolerance. No JVM, no cluster manager.
graph
Dux.Graph
PageRank, shortest paths, connected components, triangle counting. Algorithms are verb compositions. Distributable.
livebook
Rich rendering out of the box
Add kino_dux to your Livebook setup cell. Lazy pipelines render with source provenance, accumulated operations, and generated SQL. Computed results become interactive data tables.
elixir
Mix.install([
  {:dux, "~> 0.3.0"},
  {:kino_dux, "~> 0.1"}
])
Smart cells for SQL preview, data source browsing, chart building, and FLAME cluster configuration.
io
Read and write anything DuckDB can
elixir
# Read
Dux.from_csv("data.csv", delimiter: "\t")
Dux.from_parquet("data/**/*.parquet")
Dux.from_ndjson("events.ndjson")
Dux.from_excel("sales.xlsx", sheet: "Q1")
Dux.from_query("SELECT * FROM read_parquet('s3://bucket/data.parquet')")

# Write
Dux.to_csv(df, "output.csv")
Dux.to_parquet(df, "output.parquet", partition_by: [:year, :month])
Dux.to_excel(df, "report.xlsx")
Dux.to_ndjson(df, "output.ndjson")
Dux.insert_into(df, "pg.public.events", create: true)
S3, HTTP, Postgres, MySQL, SQLite, Excel, DuckLake — all via DuckDB extensions. No separate libraries.