# Dux

DuckDB-native dataframes for Elixir.

> **Note:** Dux is under active development and not yet production ready. APIs may change between releases.
Dux gives you a dplyr-style verb API backed by DuckDB's analytical engine, with built-in distributed execution across the BEAM. Pipelines are lazy, operations compile to SQL, and DuckDB handles columnar execution, vectorised aggregation, and predicate pushdown.
```elixir
require Dux

Dux.from_parquet("s3://data/sales/**/*.parquet")
|> Dux.filter(amount > 100 and region == "US")
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:product)
|> Dux.summarise(total: sum(revenue), orders: count(product))
|> Dux.sort_by(desc: :total)
|> Dux.to_rows()
```
## Performance
Dux pipelines compile to SQL and execute inside DuckDB — no data crosses into Elixir until you materialise. On a 10M-row dataset (Apple M4 Max, 128GB):
| Operation | Dux | Explorer (Polars) | Winner |
|---|---|---|---|
| Filter (lazy) | 24ms | 59ms | Dux 2.5x faster |
| Filter (eager) | 45ms | 53ms | Dux 1.2x faster |
| Mutate (eager) | 17ms | 28ms | Dux 1.6x faster |
| Group + Summarise (lazy) | 40ms | 63ms | Dux 1.6x faster |
| Group + Summarise (eager) | 81ms | 88ms | Dux 1.1x faster |
| Memory per compute | 11-15 KB | 8-9 KB | Explorer ~1.5x less |
Dux is faster than Explorer/Polars on every operation at 10M rows. The lazy path (view-based `compute/1`) is particularly fast since no data is copied — DuckDB executes the full pipeline when results are read. And Dux can distribute across machines, while Polars is single-node.
## Design

Dux is the successor to Explorer, and like Explorer it borrows its verb design from dplyr and the tidyverse — constrained, composable operations that each do one thing well. If you've used `dplyr::filter()`, `mutate()`, or `group_by() |> summarise()`, the Dux API will feel familiar.
Where Dux diverges from Explorer:
- **The module is the dataframe.** `Dux.filter(df, ...)`, not `Dux.DataFrame.filter(df, ...)`. No Series API — all operations are dataframe-level.
- **DuckDB is the only engine.** No pluggable backends, no abstraction tax. Full access to DuckDB's SQL functions, window functions, recursive CTEs, and 50+ extensions.
- **Lazy by default.** Operations accumulate as an AST in `%Dux{}`. When you materialise (`compute/1`, `to_rows/1`), the whole pipeline compiles to a chain of SQL CTEs and DuckDB optimises it end-to-end.
- **Distributed on the BEAM.** `%Dux{}` is plain data — ship it to any BEAM node, compile to SQL there, and execute against that node's local DuckDB. No function serialisation, no cluster manager, no heavyweight RPC.
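"Plain data" is doing real work in that last point: a lazy pipeline is just terms, so it survives standard BEAM serialisation with no closures attached. A minimal sketch of the idea — the map shape below is illustrative only, not the actual `%Dux{}` struct:

```elixir
# An illustrative pipeline-as-data value. The real %Dux{} fields differ;
# this only demonstrates that a verb AST is ordinary, serialisable terms.
pipeline = %{
  source: {:parquet, "s3://lake/events/**/*.parquet"},
  ops: [
    {:filter, {:==, :year, 2024}},
    {:group_by, [:region]},
    {:summarise, [total: {:sum, :revenue}]}
  ]
}

# Round-trips losslessly through the BEAM's external term format —
# the same property that lets a pipeline be shipped to another node.
^pipeline = pipeline |> :erlang.term_to_binary() |> :erlang.binary_to_term()
```

Because no anonymous functions are embedded, there is nothing version-sensitive to serialise — any connected node that has Dux loaded can compile and run the pipeline.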
## Installation

```elixir
def deps do
  [{:dux, "~> 0.3.0"}]
end
```
Dux is a pure Elixir project. The DuckDB engine is provided via ADBC — a precompiled driver downloaded automatically at compile time. No Rust or C++ compilation needed.
## Getting Started

```elixir
require Dux

# Built-in datasets — no files needed
Dux.Datasets.flights()
|> Dux.filter(distance > 1000)
|> Dux.group_by(:origin)
|> Dux.summarise(avg_delay: avg(arr_delay), n: count(flight))
|> Dux.sort_by(desc: :avg_delay)
|> Dux.head(5)
|> Dux.to_rows()
```
Every verb (`filter`, `mutate`, `group_by`, `summarise`, etc.) takes Elixir expressions via macros. Bare identifiers become column names. `^` interpolates Elixir values safely as parameter bindings:
```elixir
min_amount = 500
Dux.filter(df, amount > ^min_amount and status == "active")
```
All DuckDB functions work inside expressions — `year()`, `lower()`, `coalesce()`, `regexp_matches()`, and hundreds more. `cond` maps to `CASE WHEN`, `in` maps to `IN`:
```elixir
Dux.mutate(df,
  tier:
    cond do
      amount > 1000 -> "gold"
      amount > 100 -> "silver"
      true -> "bronze"
    end
)

Dux.filter(df, status in ["active", "pending"])
```
The `_with` variants accept raw DuckDB SQL for window functions and other constructs the macro doesn't cover:
```elixir
Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)")
```
## IO
Read and write CSV, Parquet, NDJSON, Excel, and database tables:
```elixir
df = Dux.from_parquet("s3://bucket/data/**/*.parquet")
df = Dux.from_csv("data.csv", delimiter: "\t")
df = Dux.from_excel("sales.xlsx", sheet: "Q1")

Dux.to_parquet(df, "output/", partition_by: [:year, :month])
Dux.to_excel(df, "report.xlsx")
Dux.insert_into(df, "pg.public.events", create: true)
```
Cross-source queries via DuckDB's `ATTACH` — Postgres, MySQL, SQLite, Iceberg, Delta, DuckLake:
```elixir
Dux.attach(:warehouse, "host=db.internal dbname=analytics", type: :postgres)

customers = Dux.from_attached(:warehouse, "public.customers")

Dux.from_parquet("s3://lake/orders/*.parquet")
|> Dux.join(customers, on: :customer_id)
|> Dux.group_by(:region)
|> Dux.summarise(revenue: sum(amount))
|> Dux.to_rows()
```
## Distributed Execution

Mark a pipeline for distributed execution with `distribute/2`. The same verbs work — Dux handles partitioning, fan-out, and merge automatically:
```elixir
workers = Dux.Remote.Worker.list()

Dux.from_parquet("s3://lake/events/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(year == 2024)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(revenue))
|> Dux.to_rows()
```
Under the hood: the Coordinator partitions files across workers (size-balanced, with Hive partition pruning), each worker compiles and executes SQL against its local DuckDB, and the Merger re-aggregates results. Workers read from and write to storage directly — no data funnels through the coordinator.
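The size-balancing step can be pictured as a greedy bin-packing pass: take files largest-first and hand each to the least-loaded worker. A toy sketch of that idea only — `PartitionSketch` and its `{path, size}` tuples are hypothetical, not Dux's actual Coordinator code:

```elixir
defmodule PartitionSketch do
  # Assign {path, size_in_bytes} files across n workers, keeping total
  # bytes per worker roughly even. Greedy largest-first heuristic.
  def assign(files, n_workers) do
    files
    |> Enum.sort_by(fn {_path, size} -> -size end)
    |> Enum.reduce(List.duplicate({0, []}, n_workers), fn {path, size}, workers ->
      # Pick the worker with the smallest byte total so far.
      {_, idx} =
        workers
        |> Enum.with_index()
        |> Enum.min_by(fn {{load, _paths}, _idx} -> load end)

      List.update_at(workers, idx, fn {load, paths} -> {load + size, [path | paths]} end)
    end)
    |> Enum.map(fn {load, paths} -> {load, Enum.reverse(paths)} end)
  end
end
```

For example, `PartitionSketch.assign([{"a", 100}, {"b", 60}, {"c", 50}], 2)` puts the 100-byte file alone on one worker and the other two together on the second.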
Distributed writes work the same way:
```elixir
Dux.from_parquet("s3://input/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(status == "active")
|> Dux.to_parquet("s3://output/", partition_by: :year)
```
Attach Postgres and distribute reads with `partition_by:`:
```elixir
Dux.from_attached(:pg, "public.orders", partition_by: :id)
|> Dux.distribute(workers)
|> Dux.insert_into("pg.public.summary", create: true)
```
See the Distributed Execution guide for the full architecture — aggregate rewrites, broadcast vs shuffle joins, streaming merge, and fault tolerance.
## Graph Analytics

A graph is two dataframes. All algorithms return `%Dux{}` — pipe the result into any verb:
```elixir
graph = Dux.Graph.new(vertices: users, edges: follows)

graph |> Dux.Graph.pagerank() |> Dux.sort_by(desc: :rank) |> Dux.head(10)
graph |> Dux.Graph.shortest_paths(start_node)
graph |> Dux.Graph.connected_components()
```
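For intuition about what connected-components labelling produces, here is a toy label-propagation pass over edge tuples in plain Elixir. Dux computes this in SQL inside DuckDB; `CCSketch` is hypothetical and only illustrates the result shape (each vertex labelled with the smallest vertex id reachable from it):

```elixir
defmodule CCSketch do
  # vertices: list of vertex ids; edges: list of {a, b} tuples.
  # Returns a map of vertex id => component label.
  def components(vertices, edges) do
    labels = Map.new(vertices, &{&1, &1})
    propagate(labels, edges)
  end

  # Repeatedly pull the smaller label across each edge until stable.
  defp propagate(labels, edges) do
    updated =
      Enum.reduce(edges, labels, fn {a, b}, acc ->
        m = min(acc[a], acc[b])
        acc |> Map.put(a, m) |> Map.put(b, m)
      end)

    if updated == labels, do: labels, else: propagate(updated, edges)
  end
end
```

Two disjoint edges, `{1, 2}` and `{3, 4}`, yield two components labelled `1` and `3`.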
## Livebook

Add `kino_dux` for rich rendering and smart cells in Livebook:

```elixir
Mix.install([
  {:dux, "~> 0.3.0"},
  {:kino_dux, "~> 0.2"}
])
```
Lazy pipelines render with source provenance, operations, and generated SQL. Computed results become interactive data tables.
## Guides
- Getting Started — core concepts, expressions, pipelines
- Data IO — CSV, Parquet, Excel, NDJSON, database writes
- Transformations — filter, mutate, window functions
- Joins & Reshape — join types, ASOF joins, pivots
- Distributed Execution — architecture, partitioning, distributed IO
- FLAME Clusters — ad-hoc Spark-like clusters with Fly.io
- Graph Analytics — PageRank, shortest paths, components
- Cheatsheet — quick reference for all verbs
## License
Dual-licensed under Apache 2.0 and MIT. See LICENSE-APACHE and LICENSE-MIT.