A Dab of DuckDB


Published on: 4/23/2026

I’ve been hearing about DuckDB for a while now but I hadn’t yet had a good excuse to try it. I proposed it as a possible solution during a call with a prospective client recently, so I thought it was a good time to jump in with both feet and gain a better understanding of how it works and where it could be useful. πŸ˜…

Overview

As DuckDB.org states:

DuckDB is a SQL database that runs everywhere: on your laptop, on a server and even in the browser

This description sells the project quite short. It’s accurate, but it leaves out one of the major differentiators between DuckDB and similar solutions: DuckDB is capable of turning almost any data source into a SQL database. This is a wild concept and it feels magical. With two lines of code, you can turn a directory full of JSON files (also CSV, TSV, etc.) into a functional database:

-- sample JSON: {"jobs_processed": 95384}
CREATE TABLE job_summaries AS SELECT jobs_processed FROM read_json_auto('data/*.json', columns = {jobs_processed: 'INTEGER'}, ignore_errors = true);
SELECT sum(jobs_processed) FROM job_summaries WHERE jobs_processed IS NOT NULL;

-- jobs processed => 1818170796
-- 22,229 files
-- ~6s to create DB and import data
-- ~0.03s to query / sum

Use Cases

As I understand it, DuckDB is well suited to data sets which:

- fit on a single, local SSD
- can be *much* larger than RAM (e.g. 64GB RAM / 1TB data set)
- are read-oriented
- are analytical (i.e. you're asking questions about and operating on the collection instead of looking up individual records)
    - logs
    - DB exports from legacy systems
    - financial data
    - public records
    - sensor data
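
To make the analytical angle concrete, the typical question is about the whole collection rather than a single record. A sketch, with hypothetical paths and fields:

```sql
-- e.g. a status breakdown across an entire set of exported web logs
SELECT status, count(*) AS hits
FROM read_csv_auto('logs/access-*.csv')
GROUP BY status
ORDER BY hits DESC;
```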

Anything else?

Rapid Prototyping

While it’s not necessarily one of DuckDB’s advertised use cases, it’s extremely useful for rapid prototyping. For instance, if I have a historical set of log files generated by A Widget Machine and I want to prototype a system which makes this data discoverable by a web interface, I can do this in short order using code similar to the sample shared above. The added benefit is that because DuckDB’s flavor of SQL is mostly a superset of ANSI SQL, these initial queries can (probably) be rolled into a more traditional RDBMS if you determine that’s a better fit – something like an ETL pipeline which moves data from the source, through a sync/transformation mechanism (I like Airbyte), and into Postgres as the final destination.
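
One hedged sketch of that handoff: once the prototype queries settle, export the DuckDB table in a neutral format for the downstream store (paths here are hypothetical):

```sql
-- Hand the prototype data off to an ETL pipeline in a neutral format
COPY job_summaries TO 'export/job_summaries.parquet' (FORMAT PARQUET);
-- or as a plain CSV that any RDBMS can bulk-load
COPY job_summaries TO 'export/job_summaries.csv' (HEADER, DELIMITER ',');
```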

It’s also superior to β€œdocument storage” solutions because, while the schema is inferred from the contents of the source data, it’s still present and you get all of the DB-level type validation you would expect – if you want it. Common operations, like joins, are also considerably simpler to write in SQL. I’ll touch on this more below, but you have the ability to make data imports as strict or as permissive as you need, using options like ignore_errors and union_by_name.
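
As a quick illustration of how natural joins feel here, a sketch with hypothetical file names and fields:

```sql
-- Join two inferred-schema sources as if they were ordinary tables
-- (file names and fields are hypothetical)
SELECT j.machine_id, j.jobs_processed, m.location
FROM read_json_auto('data/jobs/*.json') AS j
JOIN read_json_auto('data/machines.json') AS m
  ON j.machine_id = m.machine_id;
```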

Creative Tech

I think DuckDB is also very well suited for use in creative tech, interactive installations and art projects. I can think of a number of such projects I’ve worked on over the years where DuckDB would have enabled exploration and presentation and precluded the need for traditional DB or ETL infrastructure. SQLite is often used in these instances but, as simple as it is, it still requires a modicum of planning: carving out a schema, devising a strategy for migrations if the schema changes, and building a data ingestion pipeline to get data from the source (e.g. a CSV of sensor data). DuckDB papers over all of this and lets you start experimenting with and visualizing your data almost immediately.
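
For example, going from a raw sensor dump to something you can chart takes one query; the file and columns here are hypothetical:

```sql
-- Hourly averages straight from an installation's sensor CSV
SELECT date_trunc('hour', recorded_at) AS hour,
       avg(lux) AS avg_light
FROM read_csv_auto('install/light_sensor.csv')
GROUP BY hour
ORDER BY hour;
```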

Using It

I installed DuckDB using the Linux shell script (after carefully reviewing it!) from the Installation page and did not have any issues. There are options for all common platforms.

You can run SQL directly, work in the CLI/REPL, or use client libraries available for most popular languages.

CLI

duckdb -c "SELECT COUNT(*) FROM 'data/*.json'"

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ count_star() β”‚
β”‚    int64     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚        22229 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Rust Client

use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "
        CREATE TABLE job_summaries AS SELECT jobs_processed FROM read_json_auto('data/*.json', columns = {jobs_processed: 'INTEGER'}, ignore_errors = true);
        "
    )?;
    let total_jobs_processed: i32 = conn.query_row(
        "SELECT sum(jobs_processed) FROM job_summaries WHERE jobs_processed IS NOT NULL",
        [],
        |row| row.get(0),
    )?;
    println!("total_jobs_processed: {}", total_jobs_processed); // total_jobs_processed: 246919842
    Ok(())
}

The source for this example can be found here.

Caveats

I ran into a few unexpected issues around installing, building and linking DuckDB and libduckdb-sys when first attempting to use the Rust client. A common approach is to use the β€œbundled” option when installing the library. From the docs:

Uses a bundled version of DuckDB’s source code and compiles it during build. This is the simplest way to get started and avoids needing DuckDB system libraries.

This worked fine in isolation but I have Vim configured to run :Cargo check on write and this resulted in DuckDB being rebuilt every time I saved. This could probably be worked around in my .vimrc but I instead used the DUCKDB_DOWNLOAD_LIB=1 option to have the build step download the binary once.

This approach also taught me something new about setting environment variables for Cargo, so I didn’t have to modify .bashrc or continually prefix commands. You can create a .cargo/config.toml file in your project and set useful config options, like environment variables for Cargo to use. In this case, I added the following and Cargo began doing the correct thing both from the CLI and Vim.

.cargo/config.toml
...
[env]
DUCKDB_DOWNLOAD_LIB = "1"

UPDATE

I dug a bit deeper into what was going on here, and the issue I was actually facing was more complex than I initially thought.

While trying to work around the repeated-builds issue mentioned above with the β€œbundled” option, I installed the DuckDB shared library from the project’s GitHub release page.

This release contains the shared library and the C headers required by the crate. Crucially, however, it does not contain a .pc file, which is what pkg-config uses to find these files. Without the relevant environment variables (DUCKDB_LIB_DIR or DUCKDB_INCLUDE_DIR) set for Cargo to use when building my application, the build failed: the linker was relying on pkg-config to find the correct paths and couldn’t. Interestingly, the runtime linker was able to find the shared library. So when I was initially using the DUCKDB_DOWNLOAD_LIB flag, it was just papering over the build-time linker issue; at runtime, the system files were being used.

That all being said, I spent some time refining my .cargo/config.toml and I can now explicitly control whether the downloaded or system files are used:

# Option 1:
# Force download and explicitly reference its path
# This is *probably* not necessary if you haven't installed a standalone DuckDB
# release but it was in my case because the Linux amd64 release provided by
# the DuckDB project did not include a .pc file and pkg-config couldn't find
# duckdb.h and libduckdb.so, even though they were in the appropriate directory.
[env]
DUCKDB_DOWNLOAD_LIB = "1"
[build]
rustflags = [
    "-C", "link-arg=-Wl,-rpath,$ORIGIN/../duckdb-download/x86_64-unknown-linux-gnu/1.5.2",
]

# Option 2:
# Instruct the build linker where to look for libduckdb.so and duckdb.h
# This probably isn't required for most users and could be worked around by
# manually creating a .pc file which pkg-config would use to find these paths.
[env]
DUCKDB_LIB_DIR = "/usr/local/lib"
DUCKDB_INCLUDE_DIR = "/usr/local/include"

Strict vs. Permissive Imports

By default, DuckDB is strict about data types, field presence, etc. when importing data using read_json_auto and friends. You can work around these β€œlimitations” if you need to, though. I can absolutely see this being a real concern if you’re attempting to import decades’ worth of data whose internal schema may have drifted or that is subject to human error. The following options seem most relevant, and I was able to test and verify that they have the intended results in my sample project:

union_by_name

If you are processing multiple files that have different schemas, perhaps because columns have been added or renamed, it might be desirable to unify the columns of different files by name instead. This can be done by providing the union_by_name option.

DuckDB can be clever and forgiving about how it infers the schema when importing data and you may or may not run afoul of issues resulting from schema drift. It comes down to the order in which files are sampled and the sample size DuckDB uses. You can force its hand and surface the class of errors which union_by_name addresses with the following:

{"title": "A", "pages": 12} // 01.json
{"title": "B", "pages": 16, "author": "Smith"} // 02.json
-- without union_by_name
SELECT * FROM read_json_auto('data/*.json', maximum_sample_files=1);
Invalid Input Error:
    JSON transform error in file "data/02.json", in line 1:
        Object {"title":"B","pages":16,"author":"Smith"} has unknown key "author"
Try increasing 'sample_size', reducing 'maximum_depth', specifying 'columns', 'format' or 'records' manually, setting 'ignore_errors' to true, or setting 'union_by_name' to true when reading multiple files with a different structure.


-- with union_by_name
SELECT * FROM read_json_auto('data/*.json', maximum_sample_files=1, union_by_name=true);
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  title  β”‚ pages β”‚ author  β”‚
β”‚ varchar β”‚ int64 β”‚ varchar β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ A       β”‚    12 β”‚ NULL    β”‚
β”‚ B       β”‚    16 β”‚ Smith   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
-- NOTICE: author is set to null in rows whose underlying JSON is missing the author key

The takeaway here is that union_by_name results in a superset of all fields encountered with missing fields set to null by default.

ignore_errors

There are cases where CSV files may have multiple structural errors, and users simply wish to skip these and read the correct data. Reading erroneous CSV files is possible by utilizing the ignore_errors option. With this option set, rows containing data that would otherwise cause the CSV parser to generate an error will be ignored. In our example, we will demonstrate a CAST error, but note that any of the errors described in our Structural Error section would cause the faulty line to be skipped.

This will ignore errors which would ordinarily result from files like:

{"jobs_processed": "nine"}

I think it’s also worth noting that just specifying the column and type will result in null column values, instead of errors, when you import a file with missing fields. This all depends greatly on the specifics of particular use cases, but some combination of the above should allow users to at least get started working and experimenting with legacy data sets.
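
As a minimal sketch of that null-instead-of-error behavior, using the same columns option as the sample at the top:

```sql
-- Suppose some files lack the key entirely, e.g. {"other_field": 1}
SELECT jobs_processed
FROM read_json_auto('data/*.json', columns = {jobs_processed: 'INTEGER'}, ignore_errors = true);
-- rows from files missing the key come back as NULL rather than raising an error
```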

Performance

It’s worth calling out a few things around the performance metrics noted above and how you’d use DuckDB in a day-to-day workflow.

The cited figures result from a machine with 64GB of RAM and 12 CPU cores, which DuckDB will happily utilize. You can control resource usage using pragmas on a more granular level, if you need to.

For example, to dial in RAM and CPU:

PRAGMA memory_limit = '4GB';
PRAGMA threads = 4;

There are many other options covered in the docs.
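
To confirm the limits took effect, you can inspect current values with DuckDB’s built-in duckdb_settings() table function:

```sql
-- Verify the pragmas above took effect
SELECT name, value
FROM duckdb_settings()
WHERE name IN ('memory_limit', 'threads');
```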

Performance can also be greatly improved by converting JSON to Parquet – either before the initial ingest or for posterity. DuckDB also isn’t limited to in-memory use, and I suspect the in-memory approach is the exception in day-to-day work. If you know your data set is relatively static, you can use a file-backed database (as opposed to in-memory, like the above), which will result in orders of magnitude better performance.
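
Both ideas can be sketched in a few lines; the file paths here are hypothetical:

```sql
-- One-time conversion: re-parse the raw JSON once and keep a Parquet copy
COPY (SELECT * FROM read_json_auto('data/*.json', ignore_errors = true))
  TO 'data/job_summaries.parquet' (FORMAT PARQUET);

-- Subsequent scans of the Parquet file skip JSON parsing entirely
SELECT sum(jobs_processed) FROM 'data/job_summaries.parquet';
```

For a file-backed database, launching the CLI with a path (e.g. `duckdb jobs.duckdb`) persists any tables you create to disk, so the import cost is paid once.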

Summary

I’m very excited by DuckDB and I’m not the only one. Before I was able to publish this post, I saw multiple posts about it appear on Hacker News. This post is very superficial and there are many features and use cases which I haven’t touched on. For example, a feature I’d like to look further into is (basic) full text search. I could see this being extremely powerful in use cases surrounding historical data, log entries, manuals, etc.

As implied above, I have not yet used DuckDB for Real Work and can’t speak to issues like performance at scale, read/write contention, working cross-platform or trying to reuse DuckDB SQL in Postgres. I am going to look for excuses to use it going forward and may write follow-up posts as I learn more. Stay tuned!

P.S.

If you enjoyed this post and need help exploring unconventional data sets, standing up cloud infrastructure in a responsible manner or making computers talk to each other, let’s chat!

Also, check out the next post in my series on DuckDB: Full-Text Search with DuckDB

Updates

  • 4/24/2026 - Use β€œcreative tech” verbiage in use cases
  • 4/27/2026 - Add link to experiment repo; update Rust caveat section
  • 5/02/2026 - Added PS section