DuckDB Isn't Just Fast

csvbase.com

111 points by calpaterson 2 years ago · 39 comments

LunaSea 2 years ago

I know I'm repeating myself (it must be my third comment on HN about this topic), but this does not match my experience at all.

DuckDB will error out with an out-of-memory exception on very simple DISTINCT ON / GROUP BY queries.

Even with a temporary file, an on-disk database, and without preserving the initial order.

On any version of DuckDB.

  • belter 2 years ago

    Curious about your post, and DuckDB, since I have seen many previous posts here on HN about it... I just took 10 minutes to do some quick tests and experiment with DuckDB for the first time. :-)

    While I am a Linux user, I tried this on an available Windows 10 machine (I know... yuck!)

    1) Setup and install: no problem. Tracked the duckdb process memory usage with PowerShell, like this:

      Get-Process -Name duckdb | Select-Object Name, @{Name="Memory (MB)";Expression={[math]::round($_.WorkingSet64/1MB,2)}}
    
    2) Used this simulated Netflix table dataset, available from an S3 bucket, as used in this example blog. Installed the aws extension, not the one mentioned in the blog: https://motherduck.com/blog/duckdb-tutorial-for-beginners/ (a rough sketch of the load is below the row count)

    Section in the blog: "FIRST ANALYTICS PROJECT"

    Table has 7100 rows as seen like this:

       SELECT COUNT(*) AS RowCount FROM netflix;
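
    For reference, the load itself is only a couple of statements. A rough sketch, with the bucket path as a placeholder rather than the blog's real one:

       INSTALL httpfs; LOAD httpfs;   -- the aws extension adds credential handling on top of this
       CREATE TABLE netflix AS
       SELECT * FROM read_csv_auto('s3://some-public-bucket/netflix.csv');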
    
    3) Did 8 different queries using DISTINCT ON / GROUP BY

       The one below is one example:
    
        SELECT DISTINCT ON (Title) 
          Title, 
          "As of", 
          Rank 
        FROM netflix 
        ORDER BY Title, "As of" DESC;
    
    I am not seeing any out-of-memory errors or memory leaks in these quick tests.

    I also tested with the Parquet file from the thread mentioned below by mgt19937. It has 89,475 rows. Did some complex DISTINCT ON / GROUP BY queries on it, without seeing either explosive memory use or anything resembling a memory leak.

    Do you have a more specific example?

    • sethammons 2 years ago

      > 7100 rows

      For analytic queries, I don't feel that is anywhere near what I've seen at multiple companies that have opted for columnar storage. That would be at most a few seconds of incoming data.

      With so few rows, I would not be surprised if you could use standard command line tools to get the same results from a text file of similar size in an acceptable time.

      • belter 2 years ago

        Certainly not. We typically would talk about hundreds of millions to billions of rows.

        Just trying to reproduce the use case...Also tried with the Parquet file from the support ticket. Around 90,000 rows...

    • LunaSea 2 years ago

      Quick example:

        COPY (
          SELECT DISTINCT ON (b, c) *
          FROM READ_PARQUET('input.parquet')
          ORDER BY a DESC
        ) TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');

      The input file has 25M rows (500 MB in Parquet format) and 4 columns: a and b are BIGINTs, c and d are VARCHARs.

      On a MacBook Pro M1 with 8 GB of RAM (16x the original file size), the query will not finish.

      This is a query that could very easily be optimised to use little memory (hash the DISTINCT ON key, and replace the already-seen row in place whenever the value of "a" is larger than the stored one).
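
      One way to write that hash-based idea directly in SQL is a plain GROUP BY with arg_max (a sketch; note that on ties in "a" the surviving columns may come from different rows):

        COPY (
          SELECT b, c,
                 max(a)        AS a,
                 arg_max(d, a) AS d   -- d taken from the row with the largest a per (b, c)
          FROM READ_PARQUET('input.parquet')
          GROUP BY b, c
        ) TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');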

      • belter 2 years ago

        With the information in your example... I created a Parquet file with 50 million rows of random data, same data types. The Parquet file is 391 MB on disk (NTFS).
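
        (A file like that can be generated in DuckDB itself; a sketch with arbitrary value ranges, not necessarily the exact data used here:)

          COPY (
            SELECT
              CAST(random() * 1e9 AS BIGINT) AS a,
              CAST(random() * 1e6 AS BIGINT) AS b,
              md5(CAST(random() AS VARCHAR)) AS c,
              md5(CAST(i AS VARCHAR))        AS d
            FROM range(50000000) t(i)
          ) TO 'input.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');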

        The query will complete, but it takes approx. 3.5 to 4 min and you will need up to 14 GB of memory (4 cores, Win10, 32 GB RAM).

        You can see below the memory usage in MB throughout the query, sampled at 15-second intervals.

          duckdb      321.01 -> Start Query
          duckdb     6302.12
          duckdb    13918.04
          duckdb    10963.74
          duckdb     8586.76
          duckdb     7613.86
          duckdb     6749.53
          duckdb     5990.96
          duckdb     5293.35
          duckdb     4205.53
          duckdb     3153.59
          duckdb     1482.86
          duckdb      386.29 -> End Query
        
        So yes, there are some opportunities for optimization here :-)

        • LunaSea 2 years ago

          Thanks for the benchmarks! :)

          Indeed, 14 GB seems really high for a 400 MB Parquet file; that's a 35x multiple of the base file size.

          Of course, the data is compressed on disk, but even the uncompressed data isn't that large, so I do believe quite a lot of optimisations are still possible.

          • wenc 2 years ago

            It’s also the aggregation operation. If there are many unique groups it can take a lot of memory.

            Newer DuckDB versions are able to handle out-of-core operations better. But in general, just because the data fits in memory doesn't mean the operation will — and as I said, 8 GB is very limited memory, so it will entail spilling to disk.

            https://duckdb.org/2024/03/29/external-aggregation.html
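
            The relevant knobs, for anyone hitting this, are a memory cap, a spill directory, and (since ordering costs memory) dropping insertion-order preservation. A sketch; these are current DuckDB settings, so names may vary by version:

              SET memory_limit = '4GB';                   -- cap on in-memory usage
              SET temp_directory = '/tmp/duckdb_spill';   -- where larger-than-memory operators spill
              SET preserve_insertion_order = false;       -- relaxes ordering and reduces memory pressure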

      • wenc 2 years ago

        What is the error message you’re getting? Or is it simply that the query will not finish? (Does it continue to run?)

        Parquet files are compressed, and many analytic operations require more memory than the on-disk size. When you don't have enough RAM, DuckDB has to switch to out-of-core mode, which is slower. (It's the classic performance trade-off.)

        8 GB of RAM is usually not enough to expect performance from analytic operations - I usually have a minimum of 16. My current instance is remote and has 256 GB of RAM. I never run out of RAM, DuckDB never fails, and it runs super fast.

  • sirfz 2 years ago

    In general DuckDB is great but I had similar out-of-memory issues specifically when using "distinct on" and "unnest". In such cases I usually delegate some queries to chdb / clickhouse-local instead.

    DuckDB has been evolving nicely; I especially like their SQL dialect (things like list(column1 order by column2), columns regex / replace / etc.) and trust they'll eventually resolve the memory issues (hopefully).

  • sztanko 2 years ago

    Thanks for repeating yourself; this comment potentially influences my decision-making about DuckDB. Good to know about the negatives too.

    At what data volumes does it start erroring out? Are these volumes larger than RAM? Is there a minimal example to reproduce it? Is this ticket related to your issue? https://github.com/duckdb/duckdb/issues/12480

  • JacobiX 2 years ago

    In my experience, recent versions work well for out-of-core computation (I use it frequently). What is the size of the Parquet file that you are using? I can't find the details in your previous comments.

  • mgt19937 2 years ago

    Sounds interesting. Is there an open issue for this? I found https://github.com/duckdb/duckdb/issues/8505 but it seems that specific issue is closed.

  • CodeCompost 2 years ago

    Sounds like the very early versions of EF Core, which did not translate C# GroupBy into SQL GROUP BY but instead loaded the /entire/ data set into memory and did the aggregation there.

  • dmw_ng 2 years ago

    I've had similar experiences with DuckDB; it feels nicer to use on the surface, but in terms of performance and actual functionality I've had a better experience with clickhouse-local.

    • wenc 2 years ago

      Are you using it for simple SQL retrieval or complex analytic queries? They’re both similar for the former use case, but DuckDB — being an analytic engine — supports the latter use case much better.

felipemesquita 2 years ago

DuckDB has great ergonomics for moving data between different databases and making copies for local analysis. The one thing that differed between my experience and the author's is how much of the Postgres SQL dialect (and extensions) it supports. Attempting to run my Postgres analytics SQL code in DuckDB errors out on most JSON operations - to be fair, the DuckDB JSON functions have cleaner names than jsonb_path_query - and DuckDB has no support for handling XML, so all xpath calls fail as well.
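
For comparison, a typical extraction on the DuckDB side looks like this (a sketch using the json extension; the column and table names are made up):

  INSTALL json; LOAD json;
  SELECT payload->'$.user.id'    AS user_id,    -- json_extract
         payload->>'$.user.name' AS user_name   -- json_extract_string
  FROM events;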

  • RobinL 2 years ago

    You may know this already but the postgres extension[1] may help:

    If I understand it correctly, when you use it, it:

    - Pulls the minimal data required (inferred from the query) from postgres into duckdb

    - Executes your query using duckdb execution engine

    BUT, if your Postgres function is not supported by DuckDB, I think you can use `postgres_execute` [2] to execute the function within Postgres itself.

    I'm not sure whether you can, e.g., build a CTE pipeline that starts with postgres_execute and then executes DuckDB SQL in later stages of the pipeline.

    [1] https://duckdb.org/docs/extensions/postgres.html#running-sql... [2]https://duckdb.org/docs/extensions/postgres.html#the-postgre...
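
    A rough sketch of both modes, going by those docs (the connection string and table names are placeholders, and exact syntax may vary by version):

      INSTALL postgres; LOAD postgres;
      ATTACH 'dbname=analytics host=localhost' AS pg (TYPE postgres);
      -- executed by DuckDB, pulling only the needed rows/columns out of Postgres:
      SELECT count(*) FROM pg.public.orders WHERE created_at > DATE '2024-01-01';
      -- run inside Postgres itself, e.g. for functions DuckDB lacks:
      SELECT * FROM postgres_query('pg', 'SELECT jsonb_path_query(doc, ''$.items[*]'') FROM docs');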

    • felipemesquita 2 years ago

      Thanks for the suggestion! As I understand it, you can only postgres_execute against a running Postgres DB. It does work and I've used it in my tests; I think I could get around the limitations that I ran into by running a pg instance alongside DuckDB. For now I think I'll stick with just pg, as I was looking into DuckDB to replace pg in my local analytic workloads: load data from REST APIs, dump it into a database, and use SQL in a custom dbt-like pipeline to build the tables for analysis in BI tools. Unfortunately, many endpoints return XML data and much of the SQL I've already written deals with JSON, meaning it would have to be adapted to work with DuckDB.

dgan 2 years ago

Really, is this what's getting praised? I mean specifically the first point: the whole "just paste the URL into the DB" thing, plus inferring the column names. That looks like the laziest and shakiest basis, and if I ever saw that in production I'd be both stunned and scared.

  • calpaterson (OP) 2 years ago

    It is a really useful feature for ad hoc examination of datasets. Not every data analysis is going into a massive, mission-critical production build chain. Quite often you just want to look at the data in a REPL, and IMO that sort of usage could stand to get a little easier.
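
    Concretely, a whole ad hoc look at a remote dataset can be a single query (a sketch; the URL and column names are placeholders):

      SELECT station, avg(temp_c)
      FROM 'https://example.com/readings.parquet'   -- fetched via httpfs; schema comes from the file
      GROUP BY station;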

  • la_fayette 2 years ago

    Inference for column names and datatypes seems only relevant for CSV? Other formats, e.g. Parquet, already convey that metadata...

    CSV metadata inference in DuckDB is amazing. There is some research in this domain and DuckDB does a great job there, but yes, there might be some really strange CSV files which require manual intervention.
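
    When the sniffer gets it wrong, the usual fallback is to spell everything out in read_csv; a sketch, with made-up column names and types:

      SELECT *
      FROM read_csv('weird.csv',
                    header  = false,
                    delim   = ';',
                    columns = {'id': 'BIGINT', 'label': 'VARCHAR', 'amount': 'DECIMAL(10,2)'});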

  • sagarm 2 years ago

    It would be perfectly reasonable IMO to put your metadata elsewhere and still use DuckDB for processing.

  • buremba 2 years ago

    What's wrong with the URL in the query?

uptime 2 years ago

observablehq.com has built-in support for DuckDB, and I have found it to be very easy to use. Getting window functions, CTEs and derived columns is great, and being able to just refer to SQL query cells as arrays of rows makes things much easier for me than breaking out into JS right away.

Someone wrote an export function, so I can run a select into a table and grab that as CSV to use elsewhere.
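
In plain DuckDB the same export is a one-liner; a sketch, with a made-up table name:

  COPY (SELECT * FROM my_results) TO 'results.csv' (HEADER, DELIMITER ',');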

I wish Simon Willison would adopt DuckDB as he has SQLite, to see what he would create!

koromak 2 years ago

I very, very nearly migrated to a full DuckDB solution for customer-facing historical stock data. It would have been magical, and ridiculously, absurdly, ungodly fast. But the cloud costs ended up being close to a managed analytics solution, with significantly more moving parts (on our end). But I think that's just our use case; going forward I'd look at DuckDB as an option for any large-scale dataset.

Using ECS/EKS containers reading from a segmented dataset in EFS is a really solid solution; you can get sub-second performance over 6 billion rows / 10,000 columns with proper management and reasonably restrictive queries.
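
For a sense of what segmented data plus restrictive queries looks like, the reads are roughly of this shape (a sketch; paths and columns are made up), so partition pruning and column projection keep each query cheap:

  SELECT ts, close
  FROM read_parquet('/mnt/efs/prices/ticker=*/year=*/*.parquet', hive_partitioning = true)
  WHERE ticker = 'AAPL' AND year = 2023;   -- partition filters prune files before anything is read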

Another option is to just deploy a couple of huge EC2 instances that can fully fit the dataset. Costs here were about the same, but with a little more pain in server management. But the speed, man, it's just unbelievable.

  • vgt 2 years ago

    Co-founder and head of produck at MotherDuck here - would love to chat. We're running DuckDB in a serverless fashion, so you're only paying for what you consume.

    Feel free to reach out to tino@motherduck.com.

  • koromak 2 years ago

    To note, we migrated from Redshift, which had 7-30 second performance. Our current managed solution is something like 1-5 seconds. DuckDB just smashes everything else, at least on our data.

  • mritchie712 2 years ago

    Did you check the cost to run it on Motherduck (https://motherduck.com/)?

  • panyang 2 years ago

    What do you mean by segmented dataset in EFS?

aargh_aargh 2 years ago

Great, I didn't know about fsspec!

Woshiwuja 2 years ago

Uhm, why would you ever use this instead of SQLite?

  • maleldil 2 years ago

    DuckDB is optimised for analytical tasks, whereas SQLite isn't.

    • ayhanfuat 2 years ago

      For those not aware: analytical tasks involve a lot of groupings, aggregations, running sums/averages, sorting, ranking, etc. Columnar databases like DuckDB are more focused on those tasks, so you can do them much faster.
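
      A typical example of that kind of query (monthly totals, a running sum, and a rank), with a made-up table and columns:

        SELECT region,
               month,
               sum(revenue) AS monthly_revenue,
               sum(sum(revenue)) OVER (PARTITION BY region ORDER BY month) AS running_total,
               rank() OVER (PARTITION BY month ORDER BY sum(revenue) DESC) AS region_rank
        FROM sales
        GROUP BY region, month;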

      • Woshiwuja 2 years ago

        Ty! Never got into the analytics side of things

        • belter 2 years ago

          Have a look at this section of the AWS Redshift documentation (also a columnar database) to understand the advantages of these types of systems: https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_st...

          Or the advantage of columnar file formats, like ORC or Parquet, for analytical queries. Normally you are only interested in a few columns.

          • PartiallyTyped 2 years ago

            Columnar formats allow for some pretty interesting optimizations with respect to filtering: rewriting and pushing down constraints/checks, evaluating multiple of them simultaneously, and pulling as little data as necessary.
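
            Concretely, for a query like the one below only the referenced columns are read, and the filter can be checked against Parquet row-group statistics before any data is decoded (a sketch; file and column names are made up):

              SELECT customer_id, sum(amount)
              FROM read_parquet('orders/*.parquet')    -- the files may have dozens of other columns
              WHERE order_date >= DATE '2024-01-01'    -- min/max stats let whole row groups be skipped
              GROUP BY customer_id;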
