# Dux

DuckDB-native dataframes for Elixir.

> **Note:** Dux is under active development and not yet production ready. APIs may change between releases.
Dux gives you a dplyr-style verb API backed by DuckDB's analytical engine, with built-in distributed execution across the BEAM. Pipelines are lazy, operations compile to SQL, and DuckDB handles columnar execution, vectorised aggregation, and predicate pushdown.
```elixir
require Dux

Dux.from_parquet("s3://data/sales/**/*.parquet")
|> Dux.filter(amount > 100 and region == "US")
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:product)
|> Dux.summarise(total: sum(revenue), orders: count(product))
|> Dux.sort_by(desc: :total)
|> Dux.to_rows()
```
## Performance
Dux pipelines compile to SQL and execute inside DuckDB — no data crosses into Elixir until you materialise. On a 10M-row dataset (Apple M4 Max, 128GB):
| Operation | Dux | Explorer (Polars) | Winner |
|---|---|---|---|
| Filter (lazy) | 24ms | 59ms | Dux 2.5x faster |
| Filter (eager) | 45ms | 53ms | Dux 1.2x faster |
| Mutate (eager) | 17ms | 28ms | Dux 1.6x faster |
| Group + Summarise (lazy) | 40ms | 63ms | Dux 1.6x faster |
| Group + Summarise (eager) | 81ms | 88ms | Dux 1.1x faster |
| Memory per compute | 11-15 KB | 8-9 KB | Explorer ~1.5x less |
Dux is faster than Explorer/Polars on every operation at 10M rows. The lazy path (view-based `compute/1`) is particularly fast since no data is copied — DuckDB executes the full pipeline when results are read. And Dux can distribute across machines, while Polars is single-node.
## Design

Dux is the successor to Explorer, and like Explorer it borrows its verb design from dplyr and the tidyverse — constrained, composable operations that each do one thing well. If you've used `dplyr::filter()`, `mutate()`, or `group_by() |> summarise()`, the Dux API will feel familiar.
Where Dux diverges from Explorer:
- **The module is the dataframe.** `Dux.filter(df, ...)`, not `Dux.DataFrame.filter(df, ...)`. No Series API — all operations are dataframe-level.
- **DuckDB is the only engine.** No pluggable backends, no abstraction tax. Full access to DuckDB's SQL functions, window functions, recursive CTEs, and 50+ extensions.
- **Lazy by default.** Operations accumulate as an AST in `%Dux{}`. When you materialise (`compute/1`, `to_rows/1`), the whole pipeline compiles to a chain of SQL CTEs and DuckDB optimises it end-to-end.
- **Distributed on the BEAM.** `%Dux{}` is plain data — ship it to any BEAM node, compile to SQL there, and execute against that node's local DuckDB. No function serialisation, no cluster manager, no heavyweight RPC.
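"Plain data" is doing real work in that last point: a lazy pipeline is just terms, so it survives standard BEAM serialisation with no closures attached. A minimal sketch of the idea — the map shape below is illustrative only, not the actual `%Dux{}` struct:

```elixir
# An illustrative pipeline-as-data value. The real %Dux{} fields differ;
# this only demonstrates that a verb AST is ordinary, serialisable terms.
pipeline = %{
  source: {:parquet, "s3://lake/events/**/*.parquet"},
  ops: [
    {:filter, {:==, :year, 2024}},
    {:group_by, [:region]},
    {:summarise, [total: {:sum, :revenue}]}
  ]
}

# Round-trips losslessly through the BEAM's external term format —
# the same property that lets a pipeline be shipped to another node.
^pipeline = pipeline |> :erlang.term_to_binary() |> :erlang.binary_to_term()
```

Because no anonymous functions are embedded, there is nothing version-sensitive to serialise — any connected node that has Dux loaded can compile and run the pipeline.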
## Installation

```elixir
def deps do
  [{:dux, "~> 0.3.0"}]
end
```
Dux is a pure Elixir project. The DuckDB engine is provided via ADBC — a precompiled driver downloaded automatically at compile time. No Rust or C++ compilation needed.
## Getting Started

```elixir
require Dux

# Built-in datasets — no files needed
Dux.Datasets.flights()
|> Dux.filter(distance > 1000)
|> Dux.group_by(:origin)
|> Dux.summarise(avg_delay: avg(arr_delay), n: count(flight))
|> Dux.sort_by(desc: :avg_delay)
|> Dux.head(5)
|> Dux.to_rows()
```
Every verb (`filter`, `mutate`, `group_by`, `summarise`, etc.) takes Elixir expressions via macros. Bare identifiers become column names. `^` interpolates Elixir values safely as parameter bindings:
```elixir
min_amount = 500
Dux.filter(df, amount > ^min_amount and status == "active")
```
All DuckDB functions work inside expressions — `year()`, `lower()`, `coalesce()`, `regexp_matches()`, and hundreds more. `cond` maps to `CASE WHEN`, `in` maps to `IN`:
```elixir
Dux.mutate(df,
  tier:
    cond do
      amount > 1000 -> "gold"
      amount > 100 -> "silver"
      true -> "bronze"
    end
)

Dux.filter(df, status in ["active", "pending"])
```
The `_with` variants accept raw DuckDB SQL for window functions and other constructs the macro doesn't cover:
```elixir
Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)")
```
## IO
Read and write CSV, Parquet, NDJSON, Excel, and database tables:
```elixir
df = Dux.from_parquet("s3://bucket/data/**/*.parquet")
df = Dux.from_csv("data.csv", delimiter: "\t")
df = Dux.from_excel("sales.xlsx", sheet: "Q1")

Dux.to_parquet(df, "output/", partition_by: [:year, :month])
Dux.to_excel(df, "report.xlsx")
Dux.insert_into(df, "pg.public.events", create: true)
```
Cross-source queries via DuckDB's `ATTACH` — Postgres, MySQL, SQLite, Iceberg, Delta, DuckLake:
```elixir
Dux.attach(:warehouse, "host=db.internal dbname=analytics", type: :postgres)

customers = Dux.from_attached(:warehouse, "public.customers")

Dux.from_parquet("s3://lake/orders/*.parquet")
|> Dux.join(customers, on: :customer_id)
|> Dux.group_by(:region)
|> Dux.summarise(revenue: sum(amount))
|> Dux.to_rows()
```
## Distributed Execution

Mark a pipeline for distributed execution with `distribute/2`. The same verbs work — Dux handles partitioning, fan-out, and merge automatically:
```elixir
workers = Dux.Remote.Worker.list()

Dux.from_parquet("s3://lake/events/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(year == 2024)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(revenue))
|> Dux.to_rows()
```
Under the hood: the Coordinator partitions files across workers (size-balanced, with Hive partition pruning), each worker compiles and executes SQL against its local DuckDB, and the Merger re-aggregates results. Workers read from and write to storage directly — no data funnels through the coordinator.
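The size-balancing step can be pictured as a greedy bin-packing pass: take files largest-first and hand each to the least-loaded worker. A toy sketch of that idea only — `PartitionSketch` and its `{path, size}` tuples are hypothetical, not Dux's actual Coordinator code:

```elixir
defmodule PartitionSketch do
  # Assign {path, size_in_bytes} files across n workers, keeping total
  # bytes per worker roughly even. Greedy largest-first heuristic.
  def assign(files, n_workers) do
    files
    |> Enum.sort_by(fn {_path, size} -> -size end)
    |> Enum.reduce(List.duplicate({0, []}, n_workers), fn {path, size}, workers ->
      # Pick the worker with the smallest byte total so far.
      {_, idx} =
        workers
        |> Enum.with_index()
        |> Enum.min_by(fn {{load, _paths}, _idx} -> load end)

      List.update_at(workers, idx, fn {load, paths} -> {load + size, [path | paths]} end)
    end)
    |> Enum.map(fn {load, paths} -> {load, Enum.reverse(paths)} end)
  end
end
```

For example, `PartitionSketch.assign([{"a", 100}, {"b", 60}, {"c", 50}], 2)` puts the 100-byte file alone on one worker and the other two together on the second.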
Distributed writes work the same way:
```elixir
Dux.from_parquet("s3://input/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(status == "active")
|> Dux.to_parquet("s3://output/", partition_by: :year)
```
Attach Postgres and distribute reads with `partition_by:`:
```elixir
Dux.from_attached(:pg, "public.orders", partition_by: :id)
|> Dux.distribute(workers)
|> Dux.insert_into("pg.public.summary", create: true)
```
See the Distributed Execution guide for the full architecture — aggregate rewrites, broadcast vs shuffle joins, streaming merge, and fault tolerance.
## Graph Analytics

A graph is two dataframes. All algorithms return `%Dux{}` — pipe the result into any verb:
```elixir
graph = Dux.Graph.new(vertices: users, edges: follows)

graph |> Dux.Graph.pagerank() |> Dux.sort_by(desc: :rank) |> Dux.head(10)
graph |> Dux.Graph.shortest_paths(start_node)
graph |> Dux.Graph.connected_components()
```
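For intuition about what connected-components labelling produces, here is a toy label-propagation pass over edge tuples in plain Elixir. Dux computes this in SQL inside DuckDB; `CCSketch` is hypothetical and only illustrates the result shape (each vertex labelled with the smallest vertex id reachable from it):

```elixir
defmodule CCSketch do
  # vertices: list of vertex ids; edges: list of {a, b} tuples.
  # Returns a map of vertex id => component label.
  def components(vertices, edges) do
    labels = Map.new(vertices, &{&1, &1})
    propagate(labels, edges)
  end

  # Repeatedly pull the smaller label across each edge until stable.
  defp propagate(labels, edges) do
    updated =
      Enum.reduce(edges, labels, fn {a, b}, acc ->
        m = min(acc[a], acc[b])
        acc |> Map.put(a, m) |> Map.put(b, m)
      end)

    if updated == labels, do: labels, else: propagate(updated, edges)
  end
end
```

Two disjoint edges, `{1, 2}` and `{3, 4}`, yield two components labelled `1` and `3`.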
## Livebook

Add `kino_dux` for rich rendering and smart cells in Livebook:

```elixir
Mix.install([
  {:dux, "~> 0.3.0"},
  {:kino_dux, "~> 0.2"}
])
```
Lazy pipelines render with source provenance, operations, and generated SQL. Computed results become interactive data tables.
## Guides
- Getting Started — core concepts, expressions, pipelines
- Data IO — CSV, Parquet, Excel, NDJSON, database writes
- Transformations — filter, mutate, window functions
- Joins & Reshape — join types, ASOF joins, pivots
- Distributed Execution — architecture, partitioning, distributed IO
- FLAME Clusters — ad-hoc Spark-like clusters with Fly.io
- Graph Analytics — PageRank, shortest paths, components
- Cheatsheet — quick reference for all verbs
## License
Dual-licensed under Apache 2.0 and MIT. See LICENSE-APACHE and LICENSE-MIT.