Throwing lots of data at DuckDB and Athena

fet.dev

3 points by captaintobs 3 years ago · 2 comments

sterlinm 3 years ago

Nice article, thanks!

It would be great if DuckDB handled this itself, but it seems that, to be competitive with Athena on really massive datasets, you need a metadata layer that figures out which parquet files in S3 DuckDB actually needs to query, and then potentially runs those queries in parallel. This seems to be the architecture of Puffin [3] (which I haven't personally tried yet).

[1] https://www.boilingdata.com/
[2] https://boilingdata.medium.com/lightning-fast-aggregations-b...
[3] https://github.com/sutoiku/puffin
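
For anyone curious what that metadata-pruning pattern could look like in practice, here is a minimal Python sketch against DuckDB. The bucket, file paths, and the event_date filter column are all hypothetical; a real metadata layer would build the manifest from parquet footer statistics rather than hardcoding it.

    # Minimal sketch: keep per-file min/max stats in a small manifest,
    # prune to the files a query can actually touch, and hand only
    # those to DuckDB.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")  # enables reading from S3

    # Hypothetical manifest built offline from parquet footer stats:
    # (path, min(event_date), max(event_date)) per file.
    manifest = [
        ("s3://my-bucket/data/part-000.parquet", "2022-01-01", "2022-01-31"),
        ("s3://my-bucket/data/part-001.parquet", "2022-02-01", "2022-02-28"),
    ]

    lo, hi = "2022-02-01", "2022-02-15"
    # Keep only files whose [min, max] range overlaps the query range.
    files = [p for p, mn, mx in manifest if mn <= hi and mx >= lo]

    file_list = ", ".join(f"'{p}'" for p in files)
    query = f"""
        SELECT count(*)
        FROM read_parquet([{file_list}])
        WHERE event_date BETWEEN '{lo}' AND '{hi}'
    """
    print(con.execute(query).fetchall())

DuckDB still does row-group-level filtering inside each file it opens; the manifest just keeps it from touching files that can't match at all.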

One thing worth looking into is whether this dataset is partitioned too finely. My understanding is that the recommended size for individual parquet files is 512 MB to 1 GB [4][5], whereas here they are around 50 MB. It would be interesting to see how the partitioning strategy affects these benchmarks.

[4] https://parquet.apache.org/docs/file-format/configurations/
[5] https://www.dremio.com/blog/tuning-parquet/
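
One way to test the over-partitioning theory would be to compact the small files with DuckDB's parquet COPY and rerun the queries. A rough sketch with placeholder paths; note that ROW_GROUP_SIZE in DuckDB is counted in rows, not bytes, so the right value depends on row width.

    # Hedged sketch: compact many small parquet files into one larger
    # file, then benchmark against the compacted copy.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")

    con.execute("""
        COPY (SELECT * FROM read_parquet('s3://my-bucket/small/*.parquet'))
        TO 'compacted.parquet'
        (FORMAT PARQUET, ROW_GROUP_SIZE 1000000)
    """)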

zX41ZdbW 3 years ago

I've added a benchmark of ClickHouse and Athena to ClickBench:

https://pastila.nl/?0198061e/f2e0e7b2d61d0fe322607b58fc7200b...

There, ClickHouse operates in a "data lake" mode, simply processing a bunch of parquet files on S3. Obviously it is faster than Athena. But I also want to add Presto, Trino, Spark, Databricks, Redshift Spectrum, and Boilingdata, which are currently missing from the benchmark.
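
For reference, that "data lake" mode amounts to pointing ClickHouse's s3() table function at the parquet files. A small Python sketch via clickhouse-driver, with a placeholder host, bucket, and glob, and assuming a running ClickHouse server:

    # Sketch of querying raw parquet on S3 through ClickHouse's s3()
    # table function; everything below the format name is made up.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    result = client.execute("""
        SELECT count(*)
        FROM s3('https://my-bucket.s3.amazonaws.com/hits/*.parquet', 'Parquet')
    """)
    print(result)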

Please help me add them: https://github.com/ClickHouse/ClickBench

Also, the benchmark includes another ClickHouse mode, named "web": MergeTree tables hosted on an HTTP server (which is more efficient than parquet). See https://github.com/ClickHouse/web-tables-demo

About R2: it is currently slow, and also not fully compatible with S3's API (e.g., no multipart uploads).
