Quickwit 0.8: Indexing and Search at Petabyte Scale

quickwit.io

115 points by vvoyer 2 years ago · 30 comments

dracyr 2 years ago

Never had the chance to use Quickwit at a $DAYJOB (yet?), but I really appreciate the fact that it scales down quite well too. Currently running it on my homelab, after a number of small annoyances using Loki in a single-node cluster, and it's been working very well with very reasonable resource usage.

I also decided to use Tantivy (the Rust library powering Quickwit, written by the same team) for my own bookmarking search tool by embedding it in Elixir, and the API and docs have been quite pleasant to work with. Hats off to the team, looking forward to what's coming next!

godber 2 years ago

We did some experimentation with Quickwit about a year ago, writing about 1M docs/second into it for several months. It worked well and was pretty straightforward to learn and operate. If we didn’t also manage our own S3/Ceph, it might be a big win, once feature complete. It’s definitely worth a look.

  • lrx 2 years ago

    I think you can use quickwit with a self-hosted S3-compatible object store.

halvorbo 2 years ago

Amazing to see how far Tantivy has come. I remember using it and making some smaller contributions three years ago (adding slop to phrase queries, for example). Curious how the design has changed to enable large-scale production usage.

up2isomorphism 2 years ago

13.4 GB/s with 200x6 vCPUs gives ~11 MB/s per core; that is good, but hard to call impressive.
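
To spell out the arithmetic (a trivial sketch; the 13.4 GB/s and 200x6 vCPU figures are the ones quoted above):

  total_mb_per_s = 13.4 * 1000   # 13.4 GB/s of ingest throughput
  vcpus = 200 * 6                # 200 nodes x 6 vCPUs
  print(total_mb_per_s / vcpus)  # ~11.2 MB/s per vCPU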

  • francoismassot 2 years ago

    Building the inverted index is quite CPU-intensive, and we are also merging index files called "splits".

    • kikimora 2 years ago

      I have never been able to understand why log indexing has to build an inverted index. A decent columnar store with partitioning by date should be enough to quickly filter gigabytes of logs.

      • nh2 2 years ago

        Because you want to find all occurrences of "error abc123" over the last year, immediately?

      • fulmicoton 2 years ago

        Quickwit co-founder here... I actually agree. For a few GBs, done right, columnar works fine AND is cost efficient.

        After all, it does not matter much if a log search query answers in 300ms or 1s. However, there are use cases where a few GB just does not cut it.

        The tale that you can always prune your dataset using timestamps and tags simply does not always hold.

        • kikimora 2 years ago

          Can you share your experience of when columnar fails?

          It is possible to scan NVMe at multiple GB/sec; scans can run in parallel across multiple disks, over compressed data (10 GB of logs ≈ 1 GB to scan), and data can be segmented and prefixed with Bloom filters to quickly check whether a segment is worth scanning.
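
          A minimal sketch of that per-segment check with a hand-rolled Bloom filter (the `segments` list and its `(bloom, path)` shape are hypothetical, and the sizing constants would need real tuning):

            import hashlib

            class BloomFilter:
              def __init__(self, size_bits=1 << 20, num_hashes=5):
                self.size = size_bits
                self.num_hashes = num_hashes
                self.bits = bytearray(size_bits // 8)

              def _positions(self, token):
                # num_hashes independent bit positions derived from salted hashes.
                for i in range(self.num_hashes):
                  digest = hashlib.blake2b(f"{i}:{token}".encode()).digest()
                  yield int.from_bytes(digest[:8], "little") % self.size

              def add(self, token):
                for p in self._positions(token):
                  self.bits[p // 8] |= 1 << (p % 8)

              def might_contain(self, token):
                # False positives are possible, false negatives are not.
                return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(token))

            def segments_to_scan(segments, token):
              # Only segments whose filter might contain the token get scanned.
              return [path for (bloom, path) in segments if bloom.might_contain(token)]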

          • nh2 2 years ago

            I'm not the person you asked, but say you have 10 TB of logs.

            Assuming 3 GB/s per SSD, 10 SSDs, and the 10x compression you suggested, a query for finding a string in the text would take 10000 / 3 / 10 / 10 ≈ 33 seconds.

            With an index, you can easily get it 100x faster, and that factor gets larger as your data grows.

            In general it's just that O(log(n)) wins over O(n) when n gets large.
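
            Spelled out (the 100x factor is this comment's rough claim, not a measurement):

              data_gb = 10_000          # 10 TB of logs
              scan_gb = data_gb / 10    # 10x compression
              gb_per_s = 3 * 10         # 10 SSDs at 3 GB/s each
              scan_s = scan_gb / gb_per_s
              print(scan_s)             # ~33 s for a full scan
              print(scan_s / 100)       # ~0.33 s with an index, per the 100x figure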

            I didn't take your Bloom filter idea into consideration, as it is not immediately obvious how a Bloom filter can support all the filter operations that an index can. Also, the index gives you the exact position of the match, while the Bloom filter only tells you about existence, so scanning the segment can still incur a large read amplification factor compared to direct random access.

            • kikimora 2 years ago

              I’m thinking of how a data lake with Parquet files can be structured. Each Parquet file has footer metadata with summary statistics about the data, and it can carry Bloom filters too. A scanner would inspect the files falling into the requested time range and, for each file, check that metadata to find the ranges of data worth scanning, as sketched below. In theory, such a scanner is not much slower than index access, while also allowing efficient aggregations over the log data.
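
              A minimal sketch of that pruning flow with pyarrow, using only the min/max row-group statistics (the `ts` column name is hypothetical, and the Bloom-filter check is omitted):

                import pyarrow.parquet as pq

                def row_groups_in_range(pf, ts_col, lo, hi):
                  # Keep a row group only if its [min, max] can overlap [lo, hi].
                  idx = pf.schema_arrow.names.index(ts_col)
                  keep = []
                  for i in range(pf.metadata.num_row_groups):
                    stats = pf.metadata.row_group(i).column(idx).statistics
                    if stats is None or (stats.min <= hi and stats.max >= lo):
                      keep.append(i)
                  return keep

                # Usage sketch:
                # pf = pq.ParquetFile("logs.parquet")  # hypothetical path
                # table = pf.read_row_groups(row_groups_in_range(pf, "ts", t_lo, t_hi))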

  • fulmicoton 2 years ago

    What is your frame of reference?

    • dist1ll 2 years ago

      Per-core store bandwidth is at least 14 GB/s on Zen3, and 35 GB/s for non-temporal stores. Parsing JSON can be done at 2+ GB/s.

      It's very healthy to take maximum bandwidth limits into consideration when reasoning about performance. For instance, for temporal stores, the bottlenecks you see are due to RAM latency and memory parallelism, because of the write-allocate. The load/store uarch can actually retire way more data from SIMD registers.

      So there's already some headroom for CPU-bound tasks. For instance, 11 MB/s would be very slow for a JIT baseline compiler. But if your particular problem demands arbitrary random accesses that regularly exceed L3, maybe that speed is justified.

      • fulmicoton 2 years ago

        What we do is CPU bound and we are not just parsing JSON here.

        The largest work we do is building an inverted index. Oversimplified, it is equivalent to this:

          from collections import defaultdict
          import json

          def tokenize(text):
            return text.split()  # stand-in for a real tokenizer chain

          inverted_index = defaultdict(list)
          for (doc_id, doc_json) in enumerate(doc_jsons):
            doc = json.loads(doc_json)
            for (field, field_text) in doc.items():
              for (position, token) in enumerate(tokenize(field_text)):
                inverted_index[token].append((doc_id, position))

          serialize_in_compressed_way_that_allows_lookup(inverted_index)

        You can implement it in a couple of hours in the language of your choice to get a proper baseline.
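
        One way to get such a baseline (a throwaway, single-threaded harness over synthetic log lines; the doc shape is made up and split() stands in for a real tokenizer):

          import json, time
          from collections import defaultdict

          doc_jsons = [json.dumps({"level": "INFO", "msg": f"req {i} served in {i % 97} ms"})
                       for i in range(100_000)]
          total_mb = sum(len(d) for d in doc_jsons) / 1e6

          start = time.perf_counter()
          inverted_index = defaultdict(list)
          for doc_id, doc_json in enumerate(doc_jsons):
            doc = json.loads(doc_json)
            for field, field_text in doc.items():
              for position, token in enumerate(str(field_text).split()):
                inverted_index[token].append((doc_id, position))
          elapsed = time.perf_counter() - start

          print(f"{total_mb / elapsed:.1f} MB/s on one thread, before serialization")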

        I am sure we can still improve our indexing throughput... but I have never seen any search engine indexing as fast as tantivy.

        If someone knows a project I should know of, I'd be genuinely keen on learning from it.

        • dist1ll 2 years ago

          I'm curious, what is your frame of reference with regards to maximum speed of building inverted indices? Like, what is the maximum throughput you'd expect for this type of task, and what is your reasoning for it?

arisudesu 2 years ago

musl support would be highly appreciated.
