Show HN: Zeekstd – Rust Implementation of the ZSTD Seekable Format

github.com

214 points by rorosen 16 days ago


Hello,

I would like to share a Rust implementation of the Zstandard seekable format I've been working on.

Regular zstd compressed files consist of a single frame, meaning you have to start decompression at the beginning. The seekable format splits compressed data into a series of independent frames, each compressed individually, so that decompression of a section in the middle of an archive only requires zstd to decompress at most a frame's worth of extra data, instead of the entire archive.
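To make the idea concrete, here is a minimal sketch of how a reader could map an uncompressed offset to the frame that contains it, assuming it already knows the compressed and decompressed size of every frame. This is not zeekstd's API: `FrameEntry`, `Location` and `locate` are hypothetical names made up for illustration, and a real implementation would likely use a binary search over cumulative offsets instead of a linear scan.

```rust
/// Sizes of one independently compressed frame (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct FrameEntry {
    pub compressed_size: u64,
    pub decompressed_size: u64,
}

/// Where to start reading in order to reach an uncompressed offset.
#[derive(Debug, PartialEq, Eq)]
pub struct Location {
    /// Index of the frame that contains the offset.
    pub frame_index: usize,
    /// Byte position of that frame in the compressed file.
    pub compressed_start: u64,
    /// Decompressed bytes to discard at the start of that frame.
    pub skip_in_frame: u64,
}

/// Walk the frame list and accumulate offsets until the frame
/// containing `offset` is found. Returns `None` past the end.
pub fn locate(frames: &[FrameEntry], offset: u64) -> Option<Location> {
    let mut compressed_start = 0u64;
    let mut decompressed_start = 0u64;
    for (frame_index, frame) in frames.iter().enumerate() {
        let decompressed_end = decompressed_start + frame.decompressed_size;
        if offset < decompressed_end {
            return Some(Location {
                frame_index,
                compressed_start,
                skip_in_frame: offset - decompressed_start,
            });
        }
        compressed_start += frame.compressed_size;
        decompressed_start = decompressed_end;
    }
    None
}
```

Only the frame at `compressed_start` has to be decompressed, and at most `skip_in_frame` bytes of its output are thrown away, which is the "at most a frame's worth of extra data" mentioned above.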

I started working with the seekable format because I wanted to resume downloads of big zstd compressed files that are decompressed and written to disk on the fly. At first I created and used bindings to the C functions that are available upstream[1]. However, I stumbled over the first segfault rather quickly (it's now fixed) and found out that the functions only allow basic things. After looking closer at the upstream implementation, I noticed that it uses functions of the core API that are now deprecated, and it doesn't allow access to low-level (de)compression contexts. To me it looks like a PoC/demo implementation that isn't maintained the same way as the zstd core API, which is probably also why it's in the contrib directory.

My use-case seemed to require a complete rewrite of the seekable format, so I decided to implement it from scratch in Rust using bindings to the advanced zstd compression API, available from zstd 1.4.0.

The result is a single-dependency library crate[2] and a CLI crate[3] for the seekable format that feels similar to the regular zstd tool.

Any feedback is highly appreciated!

[1]: https://github.com/facebook/zstd/tree/dev/contrib/seekable_f...
[2]: https://crates.io/crates/zeekstd
[3]: https://github.com/rorosen/zeekstd/tree/main/cli

rwmj - 15 days ago

Seekable formats also allow random reads which lets you do trickery like booting qemu VMs from remotely hosted, compressed files (over HTTPS). We do this already for xz: https://libguestfs.org/nbdkit-xz-filter.1.html https://rwmj.wordpress.com/2018/11/23/nbdkit-xz-curl/

Has zstd actually standardized the seekable version? Last I checked (which was quite a while ago) it had not been declared a standard, so I was reluctant to write a filter for nbdkit, even though it's very much a requested feature.

simeonmiteff - 15 days ago

This is very cool. Nice work! At my day job, I have been using a Go library[1] to build tools that require seekable zstd, but felt a bit uncomfortable with the lack of broader support for the format.

Why zeek, BTW? Is it a play on "zstd" and "seek"? My employer is also the custodian of the zeek project (https://zeek.org), so I was confused for a second.

[1] https://github.com/SaveTheRbtz/zstd-seekable-format-go

stu2010 - 15 days ago

This is cool, I'd say that the most common tool in this space is bgzip[1]. Have you thought about training a dictionary on the first few chunks of each file and embedding the dictionary in a skippable frame at the start? Likely makes less difference if your chunk size is 2MB, but at smaller chunk sizes that could have significant benefit.

[1] https://www.htslib.org/doc/bgzip.html

mbreese - 15 days ago

I’m trying to learn more about the seekable zstd format. I don’t know very much about zstd, aside from reading the spec a few weeks ago. But I thought this was part of the spec? IIRC, zstd files don’t have to have just one frame. Is the norm to have just one large frame for a file and the multiple frame version just isn’t as common?

Gzip can also have multiple “frames” concatenated together and be seamlessly decompressed. Is this basically the same concept? As mentioned by others, bgzip uses this feature of gzip to great effect and is the standard compression in bioinformatics because of it (and is sadly hard coded to limit other potentially useful Gzip extensions).

My interest is to see if using zstd instead of gzip as a basis of a format would be beneficial. I expect for there to be better compression, but I’m skeptical if it would be enough to make it worthwhile.

threeducks - 15 days ago

Assuming that frames come at a cost, how much larger are the seekable zstd files? Perhaps as a graph based on frame size and for different kinds of data (text, binaries, ...).

tyilo - 15 days ago

I already use zstd_seekable (https://docs.rs/zstd-seekable/) in a project. Could you compare the APIs of this crate and yours?

ncruces - 15 days ago

How's tool support these days for compressing a file with seekable zstd?

Given existing libraries, it should be really simple to create an SQLite VFS for my Go driver that reads (not writes) compressed databases transparently, but tool support was kinda lacking.

Will the zstd CLI ever support it? https://github.com/facebook/zstd/issues/2121

wyager - 14 days ago

I have a project where I want two properties which are not inherently contradictory, but don't seem to be available together:

1. Huge compression window (like 100+MB, so "chunking" won't work)

2. Random seeking into compressed payload

Anyone know of any projects that can provide both of these at once?

throebrifnr - 15 days ago

gzip has a --rsyncable option that does something similar.

Explanation here https://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/

conradev - 15 days ago

This is really cool! It strikes me as being useful for genomic data, which is always stored in compressed chunks. That was the first time I really understood the hard trade-off between seek time and compression.

mgraczyk - 15 days ago

Maybe a dumb question, but how do you know how many frames to seek past?

For example say you want to seek to 10MB into the uncompressed file. Do you need to store metadata separately to know how many frames to skip?
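As far as I understand the upstream format description, no separate file is needed: the frame sizes live in a seek table stored in a skippable frame at the very end of the file, which ends with a 9-byte footer (frame count, a descriptor byte, and a magic number, all little-endian). A rough sketch of reading that footer follows. The layout and the magic value are written from my reading of the spec and should be checked against it, and `SeekTableInfo` and `read_footer` are hypothetical names, not part of any existing crate.

```rust
/// Magic number that terminates a seekable file (assumed from the upstream spec).
const SEEKABLE_MAGIC: u32 = 0x8F92_EAB1;
/// Footer layout: number of frames (4 bytes) + descriptor (1) + magic (4).
const FOOTER_LEN: usize = 9;
/// Skippable frame header: magic (4 bytes) + frame size (4).
const SKIPPABLE_HEADER_LEN: usize = 8;

#[derive(Debug, PartialEq, Eq)]
pub struct SeekTableInfo {
    pub num_frames: u32,
    /// Whether each entry carries a 4-byte checksum.
    pub has_checksums: bool,
    /// Total size in bytes of the skippable frame holding the seek table.
    pub table_len: usize,
}

/// Inspect the last bytes of a file and report how large the seek table is,
/// so the caller knows how far back from the end it has to read.
pub fn read_footer(tail: &[u8]) -> Option<SeekTableInfo> {
    if tail.len() < FOOTER_LEN {
        return None;
    }
    let f = &tail[tail.len() - FOOTER_LEN..];
    let magic = u32::from_le_bytes([f[5], f[6], f[7], f[8]]);
    if magic != SEEKABLE_MAGIC {
        return None; // not a seekable file, or the tail is truncated
    }
    let num_frames = u32::from_le_bytes([f[0], f[1], f[2], f[3]]);
    // Highest bit of the descriptor byte flags per-frame checksums.
    let has_checksums = f[4] & 0x80 != 0;
    // Each entry: compressed size (4) + decompressed size (4) [+ checksum (4)].
    let entry_len = if has_checksums { 12 } else { 8 };
    let table_len = SKIPPABLE_HEADER_LEN + num_frames as usize * entry_len + FOOTER_LEN;
    Some(SeekTableInfo { num_frames, has_checksums, table_len })
}
```

A reader would fetch the last 9 bytes, call something like `read_footer`, then read the final `table_len` bytes to get every frame's compressed and decompressed size. Summing those sizes tells you which frame holds the 10MB mark and where that frame starts in the compressed file.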

Dwedit - 14 days ago

CHD (compressed hunk of data) is another format that supports seeking, and allows LZMA compression. It's intended for disk images from CD systems, but can be used for other cases.

Imustaskforhelp - 15 days ago

Seekable formats are so cool! I used to think about things like a zip extraction that can be paused and resumed from where it left off, since one of my friends had this massive zip file (ahem) and he said it showed 24 hours, and I was like, pretty sure there's a way...

Then I kinda learned about criu, and I think criu can technically do it, but IDK. I in fact started trying to write the zip project in Go but gave up on it... Pretty nice to know that zstd exists.

It's not a zip file, but technically it's compressed, and I guess you can still encode the data in such a way that it's essentially zip in some sense...

This is why I come on hackernews.

b0a04gl - 15 days ago

How do you handle cases where the seek table itself gets truncated or corrupted? Do you fall back to scanning for frame boundaries or just error out? Wondering if there's room to embed a minimal redundant index at the tail too for safety.

sitkack - 13 days ago

This needs to be standardized, even if only de facto, and added to ZIP64.

77pt77 - 15 days ago

BTW, something similar can be done with zlib/gzip.

DesiLurker - 14 days ago

Great, I can use it to pipe large logfiles and store them for later retrieval. Is there something like zcat also?