Show HN: Time Series Benchmark TurboPFor, TurboFloat, TurboFloat LzX, TurboGorilla (github.com)
Gorilla [1] and Gorilla-based algos [2] are simply overrated.
Store these values as 32-bit floats instead of 64-bit and you get an instant 50% reduction without any compression.
This is valid for almost all time series data.
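To make the downcast concrete, here is a minimal sketch (names and sample values made up for illustration; it assumes float32's ~7 significant digits are enough for the data, which is the lossy part of the trade):

    /* float64 -> float32: halves storage before any codec runs */
    #include <stdio.h>

    int main(void) {
        double src[4] = {21.37, 21.38, 21.38, 21.41}; /* e.g. temperatures */
        float  dst[4];
        for (int i = 0; i < 4; i++)
            dst[i] = (float)src[i];                   /* 8 bytes -> 4 bytes */
        for (int i = 0; i < 4; i++)                   /* check round-trip error */
            printf("%f -> %f (err %g)\n", src[i], (double)dst[i], src[i] - (double)dst[i]);
        return 0;
    }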
Most time series databases (e.g. DuckDB) store floating-point data as 64 bits.
They report extraordinary compression ratios by using a Gorilla/Chimp-like algorithm.
However, as shown in this benchmark, a lot of time series data (e.g. temperature, climate data, stocks, ...) has no more than 1 or 2 fixed decimal digits and can be stored losslessly in 16/32-bit integers after scaling.
Integer compression [1] algorithms can then be used, which yields significantly better compression ratios and runs several times faster than the Gorilla-like algorithms.
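For illustration, the scale-and-delta preprocessing could look like the sketch below (helper names are hypothetical and the x100 scale factor is an assumption for 2 fixed decimals; this only prepares the integers, with the actual codec such as TurboPFor [1] applied afterwards):

    #include <stddef.h>
    #include <stdint.h>
    #include <math.h>

    /* zigzag maps signed deltas to small unsigned ints */
    static uint32_t zigzag32(int32_t v) { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }

    /* quantize fixed-decimal doubles to ints, then zigzag-delta them */
    void quantize_delta(const double *in, uint32_t *out, size_t n) {
        int32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            int32_t q = (int32_t)llround(in[i] * 100.0); /* 2 decimals -> int */
            out[i] = zigzag32(q - prev);                 /* small deltas compress well */
            prev = q;
        }
    }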
TurboGorilla, the fastest Gorilla (or Chimp) based algo in C, cannot exceed 1 GB/s in decompression, whereas TurboPFor is on the order of 10 GB/s and TurboBitByte is >100 GB/s.
[1] https://github.com/powturbo/TurboPFor-Integer-Compression
As of today, it's almost pointless to use Gorilla or DoubleDelta codecs for time series if the performance constraints make ZSTD an affordable choice.
See https://clickhouse.com/blog/optimize-clickhouse-codecs-compr...
Thanks for the confirmation. If your query engine can't exploit fine-grained access, then general-purpose compressors like lz4 or zstd on large compressed blocks are the better choice for numerical data, especially in combination with a transpose (+delta).
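As a sketch of what transpose + zstd means here (assuming libzstd is available, link with -lzstd; the function name is made up, and a delta pass over the values could precede the shuffle):

    #include <stdint.h>
    #include <stdlib.h>
    #include <zstd.h>

    /* byte-transpose a block of float32 samples, then zstd it:
       grouping the i-th byte of every value together is what makes
       a general-purpose codec competitive on numeric data */
    size_t compress_block(const float *in, size_t n, void **out) {
        const uint8_t *src = (const uint8_t *)in;
        uint8_t *shuf = malloc(n * 4);
        for (size_t i = 0; i < n; i++)
            for (size_t b = 0; b < 4; b++)
                shuf[b * n + i] = src[i * 4 + b];     /* bytes of same rank together */
        size_t cap = ZSTD_compressBound(n * 4);
        *out = malloc(cap);
        size_t csize = ZSTD_compress(*out, cap, shuf, n * 4, 3);
        free(shuf);
        return ZSTD_isError(csize) ? 0 : csize;
    }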