Introducing pg_lake: Integrate Your Data Lakehouse with Postgres

Today, we are thrilled to announce that Snowflake is open sourcing pg_lake, a powerful set of PostgreSQL extensions that allow Postgres to easily work with your data lakehouse.

With pg_lake, you can query, manage and write to Iceberg tables using standard SQL. Crunchy Data spent several years developing pg_lake as the foundation for its enterprise-ready warehouse offerings before joining Snowflake earlier this year. pg_lake underpinned Crunchy Bridge for Analytics, which introduced new analytical capabilities to Postgres and later became Crunchy Data Warehouse.1 Today we’re open sourcing that same functionality under the Apache license so the broader Postgres community can benefit. 

Now you can use Postgres to:

  • Manage Iceberg tables directly in Postgres: pg_lake introduces a new Iceberg table type where Postgres itself acts as the catalog. This means you can easily, cheaply and durably create Iceberg tables directly from Postgres with full Postgres transaction semantics. 
  • Query raw data files in your data lake: Most data lakes consist of CSV/JSON/Parquet files in S3, which are passed between different systems. With pg_lake you can easily query data files and directories that are already in S3, or insert them into Iceberg tables. You can also query external Iceberg tables, Delta tables and various geospatial file formats.
  • Import and export data flexibly: You can load data directly from an S3 bucket or http(s) URL into Iceberg or regular Postgres tables, and you can write query results back to S3 to create advanced data pipelines.

Putting pg_lake into action

All of these capabilities come with the Postgres we all know and love. 

Query files in your data lake

Have a CSV in S3 or some Parquet data in Azure Blob Storage you’re looking to query? It’s as simple as:

create foreign table hits ()
server pg_lake
options (path 'az://your_container_name/hits.parquet');

SELECT count(*)
FROM hits;

pg_lake will automatically detect the column types of your files (including CSVs) and allow you to easily query the data. 
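The same pattern works for other formats. A sketch with a hypothetical bucket and file name, showing schema inference on a CSV:

```sql
-- Hypothetical example: point a foreign table at a CSV in S3.
-- No column list is needed; pg_lake infers the names and types.
create foreign table events ()
server pg_lake
options (path 's3://your_bucket_name/events.csv');

-- Inspect the inferred schema, then query as usual
\d events

SELECT count(*) FROM events;
```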

Load data with the familiar COPY command

If you want to load data directly from a bucket, pg_lake supports the Postgres COPY mechanism to easily load data without requiring additional tools. Just like directly querying your data (above), loading data from your data lake is as simple as: 

COPY hits 
FROM 's3://your_bucket_name/your_file_name.xx';

This makes it easy to move data in and out of Postgres. Instead of building ETL pipelines, you can simply load data as it exists, or query it where it lives.
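COPY works in the other direction too: as noted above, you can write query results back to S3. A minimal sketch, assuming a hypothetical bucket path and that the output format follows the file extension:

```sql
-- Hypothetical example: export an aggregate back to S3 as Parquet
COPY (
  SELECT date_trunc('day', event_time) AS day, count(*)
  FROM hits
  GROUP BY day
) TO 's3://your_bucket_name/daily_counts.parquet';
```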

Create and manage native Iceberg tables

What about working with open formats such as Iceberg? pg_lake introduces the ability to define a table as an Iceberg table. Postgres itself manages the full Iceberg catalog for these tables, while still giving you full Postgres semantics, including transactions. You can move data from Postgres, with its row-based tables, into Iceberg, which stores data in columnar files, in a fully transactional manner. Let’s take a look at creating an Iceberg table from Postgres:

create table logs (
   /* column names and types */
)
using iceberg;

-- Query it like any other table
SELECT date_trunc('day', event_time) AS day, count(*)
FROM logs
GROUP BY day
ORDER BY day;
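Because Postgres acts as the catalog, writes to Iceberg tables participate in ordinary Postgres transactions. A minimal sketch, assuming the logs table above and a hypothetical regular Postgres table named logs_staging:

```sql
-- Hypothetical example: move rows from a regular Postgres table into
-- Iceberg atomically. If anything fails, both changes roll back together.
BEGIN;

INSERT INTO logs
SELECT * FROM logs_staging;

DELETE FROM logs_staging;

COMMIT;
```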

Do you already have data sitting in Parquet that you want to use to bootstrap your Iceberg table? You can create an initial Iceberg table directly from that data, then use COPY to load additional files. Let’s see how that looks with data related to taxi trips in New York City:

create table trips_yellow ()
using iceberg
with (load_from = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet');

\d trips_yellow

┌───────────────────────┬─────────────────────────────┬───────────┬──────────┬─────────┬─────────────┐
│        Column         │            Type             │ Collation │ Nullable │ Default │ FDW options │
├───────────────────────┼─────────────────────────────┼───────────┼──────────┼─────────┼─────────────┤
│ vendorid              │ integer                     │           │          │         │             │
│ tpep_pickup_datetime  │ timestamp without time zone │           │          │         │             │
...
│ airport_fee           │ double precision            │           │          │         │             │
└───────────────────────┴─────────────────────────────┴───────────┴──────────┴─────────┴─────────────┘

copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-02.parquet';
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-03.parquet';
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-04.parquet';

-- or, load files programmatically (and transactionally) in a function or DO block
do $do$
declare month text;
begin
  for month in select to_char(m, 'YYYY-MM') from generate_series('2024-03-01', '2024-10-31', interval '1 month') m loop
    execute format($$copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_%s.parquet'$$, month);
  end loop;
end
$do$;

That’s it. pg_lake gives you a native Postgres experience for interacting with open data lakehouse formats like Iceberg and Parquet.

The journey to pg_lake

The story of pg_lake began with a simple observation. The team behind it, first at Crunchy Data and now at Snowflake, has a deep history of building foundational Postgres extensions like PostGIS, Citus, pg_cron, pg_partman and many more. This background gave the team a front-row seat to a recurring challenge customers faced: Their data was split between their Postgres database and object storage, and bridging that gap was a constant struggle.

The team knew it could address this with a native and intuitive Postgres experience.

Crunchy Data decided to invest in this capability, and when it first launched Crunchy Bridge for Analytics more than 18 months ago, other solutions were virtually nonexistent. The team didn’t know it then, but it was helping build a movement toward using Postgres as part of a data platform that combines transactional and analytical capabilities. As this space matured and Crunchy Data was acquired by Snowflake and launched Snowflake Postgres, the value of making Postgres more interoperable and unified with analytics became undeniable.

Why we are open sourcing pg_lake

We believe Postgres is a core component of the modern data stack — it's far more than just a relational database. With powerful features for JSON, geospatial (PostGIS), and vector search (pgvector), Postgres is already an operational data platform. Meanwhile, the data lakehouse has emerged as the standard for managing analytical data at scale. 

We’ve also seen that companies increasingly use a mix of operational and analytical capabilities to deliver modern applications and customer experiences. Snowflake offers an exceptional data platform to do this, and by expanding the capabilities of Postgres, we know we’ll see more people gain the power to develop modern, data- and AI-driven agents and applications.

By open sourcing this technology, we aim to:

  • Establish a standard: We want to help create a robust, open standard for a more unified Postgres that benefits the entire ecosystem.
  • Empower developers: Modern applications and AI demand a mix of operational and analytical capabilities. With pg_lake, the Postgres community will be able to unlock new use cases and accelerate innovation.
  • Commit to Postgres: Snowflake is deeply committed to the success of Postgres. This release is a testament to our commitment to pushing the boundaries of what's possible with the world’s most beloved open source database.

As pg_lake and Snowflake Postgres become part of the broader Snowflake ecosystem, we look forward to a future where the gap between operational and analytical data is finally closed.

Get started today

We are incredibly excited to share pg_lake with you. It’s ready for you to explore and use now.


1  These blogs are an archive from Crunchy Data, published prior to its acquisition by Snowflake.