Show HN: QuestDB with Python, Pandas and SQL in a Jupyter notebook – no install
play.questdb.io

Hi, I'm Adam Cimarosti, one of the core engineers at QuestDB.
We built play.questdb.io to make it easy for anyone to try our database. No installation.
There's a Jupyter Lab notebook, data, sample code, queries and graphs.
We'd love to hear what you think.
This is really cool -- congrats on the launch! Similarly, the team over at AuthZed has created a playground for SpiceDB [0] using WebAssembly and Monaco.
We debated for hours whether or not to go the notebook route. I'm sure y'all did something similar; would you care to share your reasons for going with the notebook?
We actually do have a web-based demo at https://demo.questdb.io/, preloaded with millions of rows of data.
That one focuses on SQL queries though.
The notebook in https://play.questdb.io/ offers a more rounded experience to try out any and all of our features.
You can use the notebook to try out data ingestion, dropping partitions, and other operations that simply aren't possible in a more sandboxed environment.
The other part is that we value our Python users and wanted to provide an example of how to use our database in conjunction with other tools commonly used in the data science space to slice and dice time series data.
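For example, ingesting a few rows from Python with our questdb client package looks roughly like this (a minimal sketch: the 'trades' table and its columns are made up, and it assumes a local QuestDB instance listening on the default ILP port 9009):

    from datetime import datetime, timezone
    from questdb.ingress import Sender

    # Assumes QuestDB's default ILP TCP listener on localhost:9009.
    with Sender('localhost', 9009) as sender:
        sender.row(
            'trades',                                   # table name (hypothetical)
            symbols={'pair': 'BTC-USD'},                # indexed string columns
            columns={'price': 16799.5, 'amount': 0.1},  # regular columns
            at=datetime.now(timezone.utc))              # designated timestamp
        # the buffer is flushed automatically when the block exits

From there the same data can be pulled back into a Pandas dataframe for analysis, which is the kind of workflow the notebook walks through.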
The demo does not work at all: https://github.com/questdb/questdb/issues/1525
I think this is a little bit harsh; you are pointing to one specific query that does not work, out of a dataset of 1.6 billion rows exposed live to the internet that can sustain tens of thousands of concurrent users. In any case, we're grateful to the CTO of ClickHouse for pointing these things out so we can improve the product further. But a generalisation such as "the demo does not work at all" is not a fair comment, nor beneficial IMO.
Sorry, but it has not worked for 1.5 years already. Every time I check, it does not work.
Reading your pitch here, I'd love to have at least a vague idea of what QuestDB is and why I should care.
Most databases store the latest state of something. We don't: we ingest events. After all, life is a function of time :-) The whole world ticks, and we take those ticks and store them. If part of your application tracks anything happening over time (trades, ocean pollution levels, ships moving, rocket simulation metrics, or whatever else), then it makes sense to store those events in a time series database.

What we provide, primarily, is two basic pieces of functionality:

(1) Taking in lots of events FAST. Our ingestion rate is high, and we also integrate with things like Kafka and Pandas (see the notebook). Each of our time series tables (we support regular ones too) comes with a special timestamp column.

(2) Specialized SQL to make sense of data that changes over time, such as grouping and resampling by time and more. Take a look at our docs for things like SAMPLE BY, LATEST ON, ASOF JOIN, LT JOIN and more.

On disk, we also guarantee that all records are sorted by time, which gives us great query performance for these time-based queries.
PS. We're also wire-compatible with PostgreSQL.
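To make that concrete, here's a minimal sketch of querying into a Pandas dataframe over the Postgres wire protocol with psycopg2, assuming QuestDB's default connection settings and a hypothetical 'trades' table:

    import pandas as pd
    import psycopg2

    # QuestDB's default Postgres wire settings: port 8812, user admin,
    # password quest, database qdb.
    conn = psycopg2.connect(host='localhost', port=8812,
                            user='admin', password='quest', dbname='qdb')

    # SAMPLE BY resamples raw events into time buckets; here, hourly averages.
    df = pd.read_sql('SELECT timestamp, avg(price) FROM trades SAMPLE BY 1h',
                     conn)
    conn.close()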
I was once in the market for time series databases, but all I could find required downsampling of older data. I don't know if this has changed, and to be fair I haven't been looking for quite some time, but does yours allow for keeping data at the captured precision in perpetuity (or until my hard drive fills up)? My guess from the way you describe your approach is yes, but I wanted to check.
Yes. We're pretty good with large volumes of data.
Eventually all local drives fill up though.
When ingesting data, we partition it by time; by default, by day. This gives you the flexibility to detach partitions, store them somewhere slower and cheaper with more capacity for longer-term storage, and reattach them later if need be.
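A hypothetical sketch of that workflow from Python, assuming the ALTER TABLE ... DETACH/ATTACH PARTITION LIST syntax from our docs and a made-up 'trades' table partitioned by day:

    import psycopg2

    conn = psycopg2.connect(host='localhost', port=8812,
                            user='admin', password='quest', dbname='qdb')
    conn.autocommit = True
    with conn.cursor() as cur:
        # Detach an old daily partition so its directory can be moved
        # to cheaper, slower storage.
        cur.execute("ALTER TABLE trades DETACH PARTITION LIST '2022-11-01'")
        # ... archive the detached partition files; copy them back and
        # make them attachable again before reattaching ...
        cur.execute("ALTER TABLE trades ATTACH PARTITION LIST '2022-11-01'")
    conn.close()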
We also have a cloud variant of QuestDB, built on top of our open-source primary product, which runs on AWS. One of the things we're building there is cold storage: it will automate this process onto S3, so that if a query ever needs to access older data, it is reinstated automatically with no admin overhead.
Thanks for the detailed reply, I was curious as well. How does this compare with InfluxDB? I was actually looking into a way to store my own financial data of US equities for backtesting and experimentation a while back. I never did get any further than the planning phase, but this seems like it would almost be ideal for that use case.
[One edit, adding one additional paragraph at the end]
Note that I'm one of the co-founders of QuestDB, but let me try to be as objective and unbiased as possible. Under the hood, InfluxDB and QuestDB are built differently. Both storage engines are column-oriented. InfluxDB's storage engine uses a Time-Structured Merge Tree (TSM), while QuestDB uses a linear data structure (arrays). A linear data structure makes it easier to leverage modern hardware with native support for the CPU's SIMD instructions [1]. Running close to the hardware is one of QuestDB's key differentiators from an architectural standpoint.
Both have a Write-Ahead Log (WAL) that makes the data durable in case of an unexpected failure. Both use the InfluxDB Line Protocol (ILP) to ingest data efficiently. Hats off to InfluxDB's team; we found the ILP design very neat. However, QuestDB's implementation of ILP runs over TCP rather than HTTP, for performance reasons. QuestDB is Postgres wire compatible, meaning that you could also ingest via Postgres, although for market data it would not be the recommended way.
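As an illustration of why ILP is cheap to produce, here's a bare-bones sketch of sending a single ILP line over a raw TCP socket from Python (the table and column names are made up; it assumes the default ILP port 9009):

    import socket
    import time

    # One ILP line: table name, comma-separated symbols, a space,
    # the fields, a space, and a nanosecond timestamp.
    line = 'trades,pair=BTC-USD price=16799.5 %d\n' % time.time_ns()

    with socket.create_connection(('localhost', 9009)) as sock:
        sock.sendall(line.encode('utf-8'))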
One characteristic of QuestDB is that data is always ordered by time on disk, and out-of-order data is dealt with before touching the disk [2]. The data is partitioned by time. For queries spanning time intervals, the relevant time partitions and columns are lifted to memory, while others are left untouched. This makes such queries (downsampling, interval search, etc.) particularly fast and efficient.
From a developer experience standpoint, one material difference is the language: InfluxDB has its own native language, Flux [3], while QuestDB uses SQL with a bunch of native extensions to manipulate time-series data efficiently: SAMPLE BY, LATEST ON, etc. [4]. QuestDB also includes SQL joins, including the time-series join (ASOF JOIN) popular for market data. Since QuestDB speaks the PostgreSQL protocol, developers can use their standard Postgres libraries to query from any language.
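For instance, pairing each trade with the most recent quote at or before it, the classic market-data join, is a short query runnable from any Postgres client (a sketch; the 'trades' and 'quotes' tables are hypothetical):

    import pandas as pd
    import psycopg2

    conn = psycopg2.connect(host='localhost', port=8812,
                            user='admin', password='quest', dbname='qdb')

    # ASOF JOIN: for each trade row, pick the latest quote at or before it.
    df = pd.read_sql('SELECT t.timestamp, t.price, q.bid, q.ask '
                     'FROM trades t ASOF JOIN quotes q', conn)
    conn.close()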
From a performance perspective, InfluxDB is known to struggle with ingestion and queries on high-cardinality datasets [5]. QuestDB deals with such high-cardinality datasets better and is particularly good at ingesting data from concurrent sources, with a max throughput that can now reach nearly 5M rows/sec on a single machine. Benchmarks on TSBS [6] with the latest version will follow soon.
InfluxDB is a platform, meaning they provide an exhaustive offering around the database, while QuestDB is less mature. QuestDB is not yet fully compatible with several tools (a dashboard like Metabase, for example), as some popular ones have been prioritised instead (Grafana, Kafka, Telegraf, Pandas dataframes). The charting capabilities of InfluxDB's console are excellent, while QuestDB users would mostly rely on Grafana instead.
[Adding this via post edit #1] One area where Influx currently has an edge is storage overhead. QuestDB does not support compression yet, and time-series data can often be compressed well [7]. Chances are QuestDB will use more disk space to store the same amount of data.
Hope this helps!
[1] https://news.ycombinator.com/item?id=22803504
[2] https://questdb.io/blog/2021/05/10/questdb-release-6-0-tsbs-...
[3] https://docs.influxdata.com/influxdb/cloud/query-data/get-st...
[4] https://questdb.io/blog/2022/11/23/sql-extensions-time-serie...
[5] https://docs.influxdata.com/influxdb/cloud/write-data/best-p...
[6] https://github.com/timescale/tsbs
[7] https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
So I guess it would be fair to say you compete with Timescale and Clickhouse as a timeseries database?
Yes, correct, although ClickHouse is more of an OLAP database. Timescale is built on top of Postgres, while QuestDB is built from scratch with Postgres wire compatibility. You can run benchmarks with https://github.com/timescale/tsbs
I take it you did not visit the link?