Add Apache Iceberg Streaming Writes and Batch Reads (MVP) by zliang-min · Pull Request #928 · timeplus-io/proton

This PR introduces MVP-level support for streaming writes and batch reads with Apache Iceberg tables, fully implemented in C++ (no JNI). While existing C++ projects like ClickHouse and DuckDB focus on read-only Iceberg integration, this implementation adds native write capabilities, enabling end-to-end data pipelines directly from SQL.

Key Highlights

Streaming writes: Continuously write data to Iceberg tables via materialized views or direct INSERT statements.
Zero Java dependencies: Native C++ integration leveraging Apache Arrow for file I/O and AWS SDK for S3/Glue.
SQL-first workflows: Manage Iceberg catalogs, tables, and writes using familiar SQL syntax.

What’s Working (MVP) ✅

Documentation: https://docs.timeplus.com/iceberg

Catalog & Setup

Iceberg REST Catalog support (verified with AWS Glue and Amazon S3 Tables).
Create new Iceberg tables via SQL.
AWS SigV4 authentication for both the catalog (Glue) and storage (S3).

Write Operations

Append data via INSERT INTO or streaming materialized views.
AWS S3 storage with environment/IAM credentials.

Read Operations

Batch read entire Iceberg tables (v1/v2 formats).

Usage Example

-- Connect to an Iceberg database managed by AWS Glue, using the AK/SK or IAM credentials from the host
CREATE DATABASE demo
SETTINGS type='iceberg', warehouse='(aws-12-id)',
  catalog_type='rest', catalog_uri='https://glue.us-west-2.amazonaws.com/iceberg',
  storage_endpoint='https://bucket.s3.us-west-2.amazonaws.com',
  rest_catalog_sigv4_enabled=true,
  rest_catalog_signing_region='us-west-2',
  rest_catalog_signing_name='glue';

-- Switch to the Iceberg database namespace
USE demo;

-- List existing Iceberg tables
SHOW STREAMS;

-- Append rows to an existing Iceberg table
INSERT INTO demo.existing_table VALUES(..);

-- Or create a new Iceberg table and use a materialized view (MV) to write data to it
CREATE STREAM transformed(
  timestamp datetime64,
  org_id string,
  float_value float,
  array_length int,
  max_num int,
  min_num int
);

-- Stream data to Iceberg
CREATE MATERIALIZED VIEW mydb.mv_write_iceberg INTO demo.transformed AS
SELECT now() AS timestamp, org_id, float_value,
       length(`array_of_records.a_num`) AS array_length,
       array_max(`array_of_records.a_num`) AS max_num,
       array_min(`array_of_records.a_num`) AS min_num
FROM mydb.msk_stream
SETTINGS s3_min_upload_file_size=1024;
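
The example above covers the write path. To round it out, here is a minimal sketch of a direct append followed by a batch read of the same table. It assumes that a plain SELECT against an Iceberg-backed table runs as a bounded batch query and that now64() is accepted for the datetime64 column. The literal values and the row_count alias are illustrative only and are not part of this PR.

```sql
-- Append one row directly (column list matches demo.transformed as created above).
INSERT INTO demo.transformed (timestamp, org_id, float_value, array_length, max_num, min_num)
VALUES (now64(), 'org-1', 1.5, 3, 9, 1);

-- Batch read: scan the whole Iceberg table and aggregate per org_id.
-- Assumption: no streaming semantics apply here, so the query returns and terminates.
SELECT org_id,
       count() AS row_count,
       max(max_num) AS max_num,
       min(min_num) AS min_num
FROM demo.transformed
GROUP BY org_id;
```

Because only whole-table batch reads are supported in this MVP, every run of the SELECT scans the full table. Snapshot-based incremental reads are on the list below.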

What’s Next (Help Wanted!) 🔧

Write Improvements

DELETE and UPSERT operations.
Partitioning support (bucket, truncate).
INSERT OVERWRITE operations.
Merge-on-read for updates/deletes.

Read Improvements

Streaming incremental reads (snapshot tracking).
Time travel queries.

Catalog & Security

Support Amazon S3 Tables (Done in preview 3)
Support Apache Gravitino catalog (Done in preview 3)
Support Apache Polaris catalog
Database/Hive catalog

Maintenance

Snapshot management, version/branch/tag management
Schema evolution enhancements

Try it now:

We are still working on the test cases and fixing CI issues. Until a Timeplus Proton release includes this PR, you can install Timeplus Enterprise 2.8 on Linux or macOS by following the guide at https://docs.timeplus.com/enterprise-v2.8#2_8_0

You can use the web console at http://localhost:8000/ to run SQL.

Use the SQL examples above to connect to the Iceberg databases and read/write data.

You can also use this Docker image on Linux/macOS/Windows: docker.timeplus.com/timeplus/timeplusd:2.8.14. For example, start a container that takes the AWS AK/SK from the host's environment variables:

docker run --name timeplus_iceberg -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -d -p 7587:7587 -p 8463:8463 docker.timeplus.com/timeplus/timeplusd:2.8.14

Demo video: how to use Timeplus to read data from Amazon MSK (Managed Streaming for Apache Kafka), apply stream processing, write the results to S3 in the Iceberg table format, and query them with Athena: https://www.youtube.com/watch?v=2m6ehwmzOnc

Contribute:

Review the code (focus on IcebergSink.cpp/IcebergSource.cpp)
Test with your Iceberg setup and share feedback
Help tackle the "What’s next" list!

Tech notes:

Built on Apache Arrow C++ for Parquet/ORC file handling.
Minimal runtime dependencies (no Hadoop/JVM).
AWS SDK integration for Glue/S3 auth.

Note: Starting with preview 3, the catalog configuration syntax changed from ENGINE to SETTINGS.