Scribe: Transporting petabytes per hour via a distributed, buffered queueing system

engineering.fb.com

124 points by gluegadget 6 years ago · 43 comments

dekhn 6 years ago

The braggy PR is misleading: the 25GB/s coming from CERN is what's left after they filter the data down from 600TB/s, because there are no commercial systems that can capture data at higher rates.

  • breck 6 years ago

    This is a good point!

    Just for fun, for more perspective on big data, a human body generates around 1-10M new cells per second, and a cell contains about 10-100GB of information. So a single human is generating 1-100PB/s of data just in the new cells! (Give or take a few OOM)
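    A quick sanity check of that multiplication (a sketch in Python; the cell-rate and per-cell figures are the assumed ranges above, and the result lands within an order of magnitude of the quoted 1-100PB/s):

    ```python
    # Back-of-envelope: data "generated" by new cells in a human body.
    cells_per_sec = (1e6, 1e7)      # 1-10M new cells per second (assumed)
    bytes_per_cell = (10e9, 100e9)  # 10-100 GB of information per cell (assumed)

    low = cells_per_sec[0] * bytes_per_cell[0]    # 1e16 B/s
    high = cells_per_sec[1] * bytes_per_cell[1]   # 1e18 B/s
    print(f"{low / 1e15:.0f}-{high / 1e15:.0f} PB/s")  # prints "10-1000 PB/s"
    ```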

    • throwaway_bad 6 years ago

      Are you trying to quantify the "information" by the size of the DNA? I think this is a pretty meaningless number to multiply since most of the DNA will be exact copies and DNA alone doesn't capture all the information about a cell.

      OTOH the amount of "information" needed to perfectly simulate a cell is probably unbounded. Just a corollary of the fact that we currently don't know how to perfectly simulate reality. Even a single "real" number can take up infinite space.

      • glenvdb 6 years ago

        > OTOH the amount of "information" needed to perfectly simulate a cell is probably unbounded. Just a corollary of the fact that we currently don't know how to perfectly simulate reality.

        This is a very good point. The 'information' in a cell isn't the base pairs in its DNA, but all the atoms that make up the whole cell. And then each atom encapsulates properties such as position, velocity, charge, van der Waals radius etc.

        However this considers atoms with classical mechanics. In a quantum mechanical representation it would be very different again and you can start asking really hairy questions about whether information can be created or destroyed.

      • dekhn 6 years ago

        It's worth reading the prior literature: Markus Covert has gotten pretty good at predicting quantitative phenotypes using whole cell simulations (with very limited cell representations, basically just feature matrices).

        https://www.cell.com/abstract/S0092-8674(12)00776-3

      • breck 6 years ago

        Just back of the envelope estimates if you were to do things like scRNAseq, metabolomics, genomics, etc, on every cell. Infeasible but just as a thought experiment. Most DNA is the same, but not exact, and therein lies the rub (cancer). The point on unbounded though is a good one.

  • gnufx 6 years ago

    I'm not sure what that means, but the Cori filesystem is rated at 700GB/s and Summit's 2.5TB/s. See https://docs.nersc.gov/filesystems/cori-scratch/ and https://www.olcf.ornl.gov/olcf-resources/compute-systems/sum...

    • dekhn 6 years ago

      It's pretty simple. The physical data acquisition devices (ATLAS is an example) collect data at rates in the hundreds of terabytes/sec (https://home.cern/science/computing/processing-what-record).

      No storage system can store that data (and most of it is not useful) so they have a series of hardware triggers and buffers that reduce the data down to roughly what modern (general purpose) hardware is capable of handling. They tune the thresholds to match what consumer hardware is capable of.

      With regard to supercomputer filesystems: nobody wants to use GPFS. CERN's EOS sustained (theoretical) 3.3TB/sec in Apr 2015, so it's not like they're uncompetitive with the largest supercomputer...

      • gnufx 6 years ago

        I know how data collection works, but it sounded as if 25GB/s was regarded as high compared with filesystems you can buy.

        Obviously some people do want GPFS, if they can afford it, but Cori uses Lustre. I don't mean to claim that either is ideal for streaming high rate event data, of course.

        • adev_ 6 years ago

          > Obviously some people do want GPFS, if they can afford it, but Cori uses Lustre

          The data model at CERN does not match that of a supercomputer. CERN data is not processed locally but distributed across ~100 participating institutes in the experiment.

          Moreover - personal opinion - GPFS is crap. It's an old relic from the 90s with so many quirks and design problems that it would deserve an entire conference of its own. Plus, it's proprietary and expensive.

          The only reason GPFS is still alive is that, for a long time, the only alternative was Lustre, and Lustre is even worse.

        • dekhn 6 years ago

          Lustre is crap.

          Every single supercomputer meeting I've been to (I've been part of the community for years, they often invite me to their meetings to give an industry perspective), people are just continuously complaining about the filesystems, and it's GPFS and Lustre at the top of the list.

  • packetslave 6 years ago

    it's adorable how you call it "braggy PR" when almost every major technology company these days (FB, Google, Amazon, Uber, Pinterest, etc., pretty much everybody except Apple) has an engineering blog where they share possibly-interesting work they've done.

londons_explore 6 years ago

Keep in mind the network cost of petabytes per hour cross continent.

Those of us who don't own our own cross-ocean fiber can't afford to design systems like this.

  • carapace 6 years ago

    Never underestimate the bandwidth of a cargo ship full of SSDs.

    (I'm paraphrasing an old, old joke.)

    • noir_lord 6 years ago

      Fairly high latency though I guess.

      • trhway 6 years ago

        Using a jet for trans-ocean data delivery gives you several hours of latency - acceptable for logs - at a cost on the order of $0.1-0.3/TB (really depends on the napkin used for the estimate)
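        One way to run that napkin (a sketch in Python; every figure here is an assumption, and drive amortization plus handling would push the freight-only number up toward the quoted range):

        ```python
        # Napkin math for trans-ocean "jet sneakernet" (all figures assumed).
        air_freight_usd_per_kg = 4.0  # rough trans-ocean air cargo rate
        tb_per_drive = 8.0            # assumed 8 TB NVMe drive
        kg_per_drive = 0.05           # ~50 g per drive, packaging excluded

        tb_per_kg = tb_per_drive / kg_per_drive              # 160 TB per kg
        freight_usd_per_tb = air_freight_usd_per_kg / tb_per_kg
        print(f"${freight_usd_per_tb:.3f}/TB freight-only")  # prints "$0.025/TB freight-only"
        ```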

        • derefr 6 years ago

          Big capital costs in setting up the number of parallel writers required on one end and readers required on the other, though. And presumably human "IT teamster" labor, hooking and unhooking drives.

  • gnulinux 6 years ago

    Sounds like a startup idea.

100k 6 years ago

Naming is hard! Facebook used to have a _different_ Scribe (https://en.wikipedia.org/wiki/Scribe_(log_server)).

We used it at a company I worked for, but it had long since been deprecated, so I was confused when I saw this Scribe.

  • johndoe345 6 years ago

    This is the same scribe. Facebook closed-sourced it because it was too hard to maintain an open version and a version that addresses Facebook's needs.

    • naringas 6 years ago

      I wonder why they couldn't just make the open source version address their needs...

      • javiermaestro 6 years ago

        The version that was open-sourced kept evolving and integrating with other internal systems at Facebook. That's what made it hard to continue open-sourcing it (why the open-sourced version was discontinued) and why the current version is also hard to open-source.

        Maybe one day we'll have a version available. In any case, one of the larger parts of the system (LogDevice) is open source :)

        (disclaimer: I work in Scribe)

        LogDevice: https://engineering.fb.com/core-data/open-sourcing-logdevice...

  • saisundar 6 years ago

    > "The growing number of complex components made it difficult to retain an open source version stripped of our internal specifications. This complexity is the main reason that we archived the open source Scribe project. The current version of Scribe's architecture is detailed below, with a focus on the components that comprise the data plane." (Quoted from the article.)

  • gnud 6 years ago

    Scribe is also an enterprise integration/ETL tool. For _many_ years before FB existed.

dividuum 6 years ago

Is it normal for these internal systems to not implement any kind of access control? From the post it seems every reader can access every stream?

  • javiermaestro 6 years ago

    That's actually not the case, there's access control :)

    The article just focuses on certain areas of the system and doesn't go into the security and privacy parts, that's all.

    (I work in Scribe)

  • sjg007 6 years ago

    AKA not a product requirement... users are trusted, etc. Works great until it doesn't...

pvlak 6 years ago

Why can't the producer/scribed write directly to LogDevice storage? Any reasons for routing through the WriteService?

  • thetrooperer 6 years ago

    That's a good question. There are multiple reasons for this; I'll briefly mention two of them. One is the high fan-in ratio - millions of machines are writing relatively small blobs of data, so the middle layer serves as an aggregator (which saves the backend's IOPS, number of connections, etc.). Another reason is the volume of metadata - it would be inefficient to keep all the LogDevice-level metadata on each of the producer hosts.
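    The fan-in idea can be sketched roughly like this (illustrative Python, not Scribe's actual API): a middle layer buffers many small producer blobs and emits far fewer, larger backend writes.

    ```python
    # Minimal sketch of a fan-in aggregator: many small appends become
    # a handful of large backend writes (names are illustrative).
    class Aggregator:
        def __init__(self, backend_write, max_batch_bytes=1 << 20):
            self.backend_write = backend_write
            self.max_batch_bytes = max_batch_bytes
            self.buffer = []
            self.buffered_bytes = 0

        def append(self, blob: bytes):
            self.buffer.append(blob)
            self.buffered_bytes += len(blob)
            if self.buffered_bytes >= self.max_batch_bytes:
                self.flush()

        def flush(self):
            if self.buffer:
                # One backend IOP / connection per batch instead of per blob.
                self.backend_write(b"".join(self.buffer))
                self.buffer, self.buffered_bytes = [], 0

    batches = []
    agg = Aggregator(batches.append, max_batch_bytes=1024)
    for _ in range(1000):
        agg.append(b"x" * 10)  # 1000 small "producer" writes
    agg.flush()
    print(len(batches))        # prints 10: far fewer backend writes than appends
    ```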

    • pvlak 6 years ago

      Would the WriteService (aggregator) make sense for environments with thousands of machines (not millions), all within the data center? In our company, we are moving away from this design of having aggregators toward writing directly to storage wherever possible, as it reduces message loss.

      On the volume of metadata held by producers: would there be any significant difference between holding WriteService and LogDevice metadata?

      • thetrooperer 6 years ago

        The devil is in the details probably, but if you have a single datacenter and all writes are coming from thousands of machines ("edge"), yes, it may make more sense to set up a single LogDevice / Kafka cluster and have all the edge hosts write to it directly.

bradhe 6 years ago

So, if I'm reading this correctly, 2.5GB/s of log data being generated? If we assume (aggressively) that they have 5mil machines in their infrastructure, doesn't that mean that each machine would have to be generating 500kB/s of log data?

Despite that, I find the claims to be underwhelming. So your system can process massive amounts of data by scaling massively horizontally...neat.

  • javiermaestro 6 years ago

    The number in the article is 2.5 TB/s, not GB/s :)
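    For what it's worth, with the corrected unit the parent's per-machine arithmetic does work out (a quick check, taking the parent's assumed 5 million machines):

    ```python
    total_bytes_per_sec = 2.5e12  # 2.5 TB/s, the figure from the article
    machines = 5e6                # the parent comment's (aggressive) assumption
    per_machine = total_bytes_per_sec / machines
    print(f"{per_machine / 1e3:.0f} kB/s per machine")  # prints "500 kB/s per machine"
    ```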

    (disclaimer: I work in Scribe)

    • bradhe 6 years ago

      Right - sorry. But the point still stands: under what circumstances was that much data being generated from (what I'm assuming is) normal logging?

      • javiermaestro 6 years ago

        I'm not following. I understood from your first comment that you think the amount of data is low ("underwhelming") and from your last comment that it's a lot ("that much data").

        In any case, the data is "whatever needs to be logged".

        And it's not "server logs", which is what I'm interpreting from your comment. Scribe transports most data at Facebook to be processed by real-time systems (e.g. Puma, Scuba) and also "batch systems" (data warehouse). So, it's quite a lot, being "the ingestion pipe" for Facebook.

        Does this answer your question? :-?

        Puma: https://research.fb.com/publications/realtime-data-processin...

        Scuba: https://research.fb.com/publications/scuba-diving-into-data-...

        • bradhe 6 years ago

          > So, it's quite a lot, being "the ingestion pipe" for Facebook.

          I see. I walked away from the article with the impression that it was meant to be a log aggregation service à la Flume, Splunk, or Logstash.

          > the amount of data is low ("underwhelming") and from your last comment that it's a lot ("that much data").

          I was remarking on the numbers with regard to generation, not consumption. Based on the article, my estimate points out that generating 2.5TB/s of transactional logs and telemetry data with "millions" of machines would be technically possible but not reasonably practical... and thus likely not real ;). But you corrected my understanding: that number covers a much broader use case than normal logging.

ninju 6 years ago

How does this compare to a robust Splunk infrastructure?

  • packetslave 6 years ago

    For what you'd pay for a Splunk license that can handle petabytes/hour of data, it would probably be cheaper to just buy Facebook and use Scribe. :-)
