Querying Postgres Tables Directly from DuckDB (duckdb.org)
SQLite and DuckDB are excellent demonstrations that the computers we have on our desks (and even in our pockets) are ridiculously fast and capable for almost everything but the largest workloads. They're a stark contrast to the framework bloat that has stolen our CPU cycles and given us perceptible lag when typing and scrolling.
It makes one wonder what the software world would look like if the whole notion of 'premature optimization is the root of all evil' had never existed. Because that advice, well intentioned and perfectly reasonable when applied thoughtfully, gradually morphed into 'optimization is the root of all evil', contributing to this ongoing race between computers speeding up and software slowing down.
And that's a scenario which, a cynic would observe, was almost certainly anything but undesired by the top players on either side. It's much easier to require what not long ago would have been considered a supercomputer just to run a word processor than it is to create scenarios where such power is meaningfully applied in a mass-market product.
A lot of developers don't even realize how fast a web page or text editor could be. When your starting point is a massive framework, you don't even realize that a computer should be able to fetch, parse, render, and display in milliseconds. For example: the modern reinvention of serving HTML from a server as "server-side rendering".
I don't think the idea morphed into "any optimization is evil". But that is the unfortunate consequence of leaving optimization until after functional requirements are met. Same with any kind of tech debt. A mentality of "let's just get this out the door now and fix it later" results in "later" meaning never.
It has morphed into "premature pessimization is the gold standard."
The way they do this read is interesting, especially why they don't just read the files on disk directly.
> The Postgres Scanner uses the standard libpq library, which it statically links in. Ironically, this makes the Postgres Scanner easier to install than the other Postgres clients. However, Postgres’ normal client-server protocol is quite slow, so we spent quite some time optimizing this. As a note, DuckDB’s SQLite Scanner does not face this issue, as SQLite is also an in-process database.
> We actually implemented a prototype direct reader for Postgres’ database files, but while performance was great, there is the issue that committed but not yet checkpointed data would not be stored in the heap files yet. In addition, if a checkpoint was currently running, our reader would frequently overtake the checkpointer, causing additional inconsistencies. We abandoned that approach since we want to be able to query an actively used Postgres database and believe that consistency is important. Another architectural option would have been to implement a DuckDB Foreign Data Wrapper (FDW) for Postgres similar to duckdb_fdw but while this could improve the protocol situation, deployment of a postgres extension is quite risky on production servers so we expect few people will be able to do so.
> Instead, we use the rarely-used binary transfer mode of the Postgres client-server protocol. This format is quite similar to the on-disk representation of Postgres data files and avoids some of the otherwise expensive to-string and from-string conversions. For example, to read a normal int32 from the protocol message, all we need to do is to swap byte order (ntohl).
This might end up being the best way to ETL Postgres tables to Parquet. Of everything else I tried, doing a COPY to CSV and then converting to Parquet was the fastest, but it can be a pain when dealing with type conversions.
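For reference, a minimal sketch of what that could look like with the new scanner (connection details, table name, and output path are made up):

  INSTALL postgres_scanner;
  LOAD postgres_scanner;
  -- expose the Postgres tables as views inside DuckDB
  CALL postgres_attach('host=localhost port=5432 dbname=mydb user=me password=secret');
  -- stream a table through DuckDB straight into a Parquet file
  COPY (SELECT * FROM my_table) TO 'my_table.parquet' (FORMAT PARQUET);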
It is very easy with ClickHouse. All you need is:
SELECT ... FROM postgresql(...) FORMAT Parquet
And you can run this query without installing ClickHouse, using the clickhouse-local command-line tool.
It can be downloaded simply as:
curl https://clickhouse.com/ | sh
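A one-shot export run inside clickhouse-local could look roughly like this (host, credentials, table, and output file are placeholders):

  SELECT *
  FROM postgresql('pg-host:5432', 'mydb', 'orders', 'pg_user', 'pg_password')
  INTO OUTFILE 'orders.parquet'
  FORMAT Parquet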
I didn't know DuckDB could read parquet natively[0]. Thanks for mentioning this. I recently had the displeasure of trying to deal with parquet files in nodejs and it was a disaster. I might try going through DuckDB instead.
This won't let you ETL from PG to parquet, but I used this in anger the other day https://github.com/manojkarthick/pqrs Worked quite well for my purposes!
I was trying https://github.com/sfu-db/connector-x and hacking around with https://github.com/spitz-dan-l/postgres-binary-parser, but it turned out that a COPY to CSV using asyncpg and then converting to Parquet was the fastest.
Ah. connector-x claims to be able to do a lot. Seems useful if it can actually deliver on that. Good to know that COPY with asyncpg is a quick approach.
In my case, I had parquet to begin with because I accidentally deleted some production data (oopsies) and when you export a snapshot from RDS to S3, it is in Parquet. Thankfully, I now have a few tricks up my sleeve to quickly restore data, but that was stressful for a bit haha
Exactly, I almost started writing a tool for that, because nothing else I found supports arrays and composite types. Now let's hope DuckDB does support that :)
If you don't use arrays and composites, Spark should be able to do it, right?
Yes, DuckDB supports writing lists and structs to Parquet! https://github.com/duckdb/duckdb/pull/2832
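For example, something like this should round-trip nested values (column names and output path are made up):

  COPY (SELECT [1, 2, 3] AS my_list, {'a': 1, 'b': 'two'} AS my_struct)
  TO 'nested.parquet' (FORMAT PARQUET);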
Does that help or do you have any other questions?
Awesome. However, the postgres_scanner has problems with jsonb; it won't even connect to a database that contains those :/
Thank you for filing a bug, we will have a look at that I'm sure!
It's more of a feature request. Thank you :)
DuckDB is awesome! I find it the easiest way to ingest data from various sources then query it into a form I can do analytics on.
The datasets I work on are a bit too big for pandas, but Spark is way overkill for them. DuckDB lets me efficiently work on them using only a single computer.
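As a rough illustration of the single-machine workflow I mean (the file glob and column names are made up), DuckDB can aggregate over a directory of Parquet files without first loading them wholesale:

  SELECT customer_id, sum(amount) AS total
  FROM read_parquet('events/*.parquet')
  GROUP BY customer_id;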
Doesn't DuckDB also need to load datasets into memory, like pandas? I guess it depends what you're doing to the data.
DuckDB is brilliant. Something I'm working on uses SQLite; there is an analytical search users can perform that scans most of the DB with multiple joins. It would take about 45 seconds to run the query. I exported the dataset and imported it into DuckDB. The exact same SQL query now takes only 1.5 seconds.
It really shows how much better a column store is for certain queries.
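For anyone curious, a sketch of one way to do the import on the DuckDB side, assuming the SQLite tables were first dumped to CSV (file and table names are made up); the SQLite Scanner mentioned elsewhere in the thread may avoid the intermediate export entirely:

  CREATE TABLE search_data AS
  SELECT * FROM read_csv_auto('search_data.csv');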
D CALL postgres_attach('host=blabla.us-east-2.rds.amazonaws.com port=5432 dbname=my_db connect_timeout=10 password=mypass user=my_user');
Error: Invalid Error: IO Error: Unsupported Postgres type interval

Can I expect pretty significant boosts in query speed when using DuckDB on my existing PG database? I'm very intrigued now that it's possible to directly query PG. We don't yet have an analytics DB and instead run all analytics stuff on a read replica. Some of the queries are pretty slow. If I can just slap DuckDB on top and speed things up, I'd be really thrilled.
I wasn't aware of DuckDB, but now that I've read about it I wonder: should I replace my SQLite drivers with it, since it seems more versatile?
It’s pretty cool, but I wouldn’t switch over unless you know you need a column store approach.
I would say, maybe not switch over if you need something that has been rock-solid for years and years.
If you want a shit-ton of built-in features, DuckDB has them, and it is pretty damn fast even with queries that aren't column-store friendly. Surprisingly so! I have done side-by-side tests, and you really have to have a lot of rows for it to start to make a difference. And their built-in functions are awesome. And they (company/group) are awesome and very responsive!
Having said that, DuckDB is still young enough to make me nervous about putting it in critical application stacks, and I feel about SQLite the same way I feel about PostgreSQL ... it is solid as hell and will just work.
I will say that I have not had any stability issues in any of my personal projects that use DuckDB; it is just that I know for a fact that SQLite is solid.
One thing to be aware of is that DuckDB doesn't currently allow concurrent readers and writers (unlike SQLite). Depending on your use case, that might be a showstopper.
Nice job, this will be useful to many, I think. The DuckDB folks sure seem to have a knack for figuring out what their audience needs.
This is exciting, but it left me wondering whether the approach will continue to scale with larger TPC-H scale factors. A scale factor of 1 is honestly very small.
Also, I didn't quite understand whether, in order to achieve this, DuckDB must:
1. Read the Postgres row formatted data
2. Transform the row formatted data into its internal columnar representation
3. Keep the representation in memory
You can think of the attach operation as creating views in DuckDB with Postgres tables underneath! DuckDB will then query those Postgres rows (using the typical Postgres wire protocol, except in binary mode).
No data is persisted in DuckDB unless you do an insert statement with the result of the Postgres scan. DuckDB does process that data in a columnar fashion once it has been pulled into DuckDB memory though!
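In other words, roughly (the table names here are just examples):

  -- after postgres_attach, this runs against Postgres over the wire each time
  SELECT count(*) FROM lineitem;
  -- this is what actually copies data into DuckDB's own storage
  CREATE TABLE lineitem_local AS SELECT * FROM lineitem;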
Does that help?
Yes, that's what I thought, thanks for an explanation.
What happens if the dataset you want to post-process is, let's say, 1 TB, or for that matter any size larger than the physical memory available to DuckDB?
Wonder if there is anything like this for ClickHouse.
edit: to query PG from ClickHouse, not ClickHouse from DuckDB.
Yes, ClickHouse has been able to query Postgres for maybe two years already.
https://clickhouse.com/docs/en/integrations/postgresql/postg...
You can connect to Postgres using a table function:
SELECT ... FROM postgresql(...)
You can create a table with ENGINE = PostgreSQL and use it like a normal table.
You can create a database with ENGINE = PostgreSQL and it will represent all the tables from the PostgreSQL database.
And finally, you can replicate data from PostgreSQL in realtime with CDC into ClickHouse tables.
Similar features available for MySQL, MongoDB and SQLite.
What is surprising: if you query MySQL from ClickHouse, the queries work faster despite the fact that the data is simply read from MySQL.
And two more use cases:
- you can create a table in ClickHouse pointing to multiple MySQL servers for sharding and failover, so ClickHouse can be used to query sharded MySQL;
- you can create a key-value dictionary on top of MySQL tables and use it in JOINs.
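To make the table function and table engine options above concrete, a couple of hedged sketches (host, credentials, table, and column definitions are placeholders):

  -- ad-hoc query via the table function
  SELECT count(*)
  FROM postgresql('pg-host:5432', 'mydb', 'orders', 'pg_user', 'pg_password');

  -- persistent proxy table via the table engine
  CREATE TABLE orders_pg (id UInt64, amount Float64)
  ENGINE = PostgreSQL('pg-host:5432', 'mydb', 'orders', 'pg_user', 'pg_password');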
I was looking into exporting a database from DuckDB to Postgres for use with Google Data Studio, but whenever I exported, DuckDB was attaching decimals to integers. That was throwing off the entire import into Postgres. Think I'll just try this out.
Crazy idea, but... use the Python client in a Django app and, for some of the larger/worst-performing queries, use DuckDB. You can even use the ORM to construct the queries, then get the raw SQL and execute it with Duck. Crazy?
Does the inverse of this also exist?
Unofficially maybe? There's a 3rd party duckdb_fdw plugin for postgres that allows querying DuckDB from Postgres: https://github.com/alitrack/duckdb_fdw
You can use the DuckDB FDW (linked in the blog post) for this! https://github.com/alitrack/duckdb_fdw
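The usual FDW ceremony applies; a rough sketch, with the caveat that the option names here follow the sqlite_fdw conventions duckdb_fdw appears to be derived from, so check the repo's README for the authoritative syntax (paths and table definitions are made up):

  CREATE EXTENSION duckdb_fdw;
  CREATE SERVER duckdb_srv FOREIGN DATA WRAPPER duckdb_fdw
    OPTIONS (database '/path/to/analytics.duckdb');
  -- 'table' is an assumed option name; see the duckdb_fdw README
  CREATE FOREIGN TABLE events (id bigint, ts timestamp, payload text)
    SERVER duckdb_srv OPTIONS (table 'events');
  SELECT count(*) FROM events;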