Commanding infinite streaming storage with Apache Kafka and Pyrostore

pyrostore.io

68 points by lbradstreet 8 years ago · 28 comments

stingraycharles 8 years ago

I like it. Personally, one of my biggest problems with Kafka is its operational complexity. I’ve just had one too many instances of Kafka brokers getting stuck while doing an upgrade and things like that.

Additionally, I would really, really like to be able to use it as an Event Store, easily accessible by anyone in the org with infinite data retention. I know Kafka kind-of sort-of provides this functionality, but it doesn’t work in practice.

This appears to be a solution to this problem. Will be interesting to see whether it gains traction.

  • linkmotif 8 years ago

    > I know Kafka kind-of sort-of provides this functionality, but it doesn’t work in practice.

    How so?

    • ryanworl 8 years ago

      One potential problem is a Kafka partition’s size is limited to the size of the smallest machine in the replica set. This means if you want infinite retention you have to potentially over-partition so they never get too big, keep buying bigger machines and disks, or deal with a repartition of all data.

      A simple way to get around this problem is to dump messages into a file and put that file in S3, named something like "topic-partition-offset", where offset is the offset of the first message contained within that file. You can then read those files forward starting from offset zero until you reach the end, then start reading from Kafka for recent data.

      The drawback is that this isn't integrated with Kafka, so you're now maintaining what is effectively two different systems for the same data. It also means key-based compaction won't work; you'd have to re-implement that on top of the files in S3 as well.
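
      A rough sketch of that file-naming scheme, with a hypothetical bucket and topic (AWS SDK for Java v2); zero-padding the offset keeps S3's lexicographic listing in the same order as the log:

          // Sketch only: "my-kafka-archive" and the topic/partition names are hypothetical.
          import software.amazon.awssdk.core.sync.RequestBody;
          import software.amazon.awssdk.services.s3.S3Client;
          import software.amazon.awssdk.services.s3.model.GetObjectRequest;
          import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
          import software.amazon.awssdk.services.s3.model.PutObjectRequest;
          import software.amazon.awssdk.services.s3.model.S3Object;

          public class TopicArchive {
              private static final String BUCKET = "my-kafka-archive";

              // Write one batch file keyed by "topic-partition-offset"; the zero-padded
              // offset makes lexicographic key order match log order.
              static void archive(S3Client s3, String topic, int partition,
                                  long firstOffset, byte[] batch) {
                  String key = String.format("%s-%d-%020d", topic, partition, firstOffset);
                  s3.putObject(PutObjectRequest.builder().bucket(BUCKET).key(key).build(),
                               RequestBody.fromBytes(batch));
              }

              // Read every archived batch for a partition, oldest first, before
              // handing off to a live Kafka consumer for recent data.
              static void replay(S3Client s3, String topic, int partition) {
                  String prefix = String.format("%s-%d-", topic, partition);
                  ListObjectsV2Request req = ListObjectsV2Request.builder()
                          .bucket(BUCKET).prefix(prefix).build();
                  for (S3Object obj : s3.listObjectsV2Paginator(req).contents()) {
                      byte[] batch = s3.getObjectAsBytes(
                              GetObjectRequest.builder().bucket(BUCKET).key(obj.key()).build())
                              .asByteArray();
                      // decode `batch` and feed the records downstream here
                  }
              }
          }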

      • beepbeepbeep1 8 years ago

        I've used LVM in AWS to present Kafka with a single volume larger than 16 TB, which is the maximum size of one EBS volume. AWS reports you can attach up to 40 EBS volumes per instance.

        Growing the LVM volume with XFS has worked well: zero downtime and around 60 seconds.

        It allows you to over-provision just enough that you don't have to babysit the drives or pay $$$ for unused disk.

        If you stripe the volumes you'll also distribute your IOPS in AWS.

        Outside AWS, LVM still applies. Kafka's JBOD support is useless without easy/automatic rebalancing.

        This week, onsite at a client's, I discovered ScaleIO, which can present up to a 1 PB volume and does clever sharding/replication in the background.

      • linkmotif 8 years ago

        > or deal with a repartition of all data.

        Why is this difficult? Is mirroring clusters operationally problematic? If one cluster gets too small, in theory can't you spin up another cluster and mirror the first onto the second, then direct writes to the new cluster once they are in sync?

        • ryanworl 8 years ago

          That sounds possible, but it would involve either downtime or, if you mess it up, potential ordering and data-duplication issues. Dynamic expansion and contraction of the partition count should be a feature that doesn't require recreating the entire cluster, as it is in essentially every other data product in the world.

          • gazarsgo 8 years ago

            There's no downtime in mirroring a cluster with MirrorMaker, and it's become common to isolate producers from consumers on separate clusters; see https://medium.com/netflix-techblog/kafka-inside-keystone-pi...

            Ordering of messages is completely unaffected, it's the routing of future events that's affected when you increase partition count. This is critical for some use cases (windowing of data for analytics purposes, for example) but irrelevant to others.

            Data duplication issues? This sounds like FUD also, but common guidance is to design your events to be idempotent, or utilize Kafka's new exactly-once delivery.

            You can expand partition counts for a topic dynamically.

            You can't currently decrease partition counts, because given the current design that could orphan both data and consumers.
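
            For reference, a minimal sketch of the dynamic increase with the Java AdminClient (the "events" topic and target count here are made up); as noted above, existing data stays put and only new keys hash across the larger count:

                import java.util.Map;
                import java.util.Properties;
                import org.apache.kafka.clients.admin.AdminClient;
                import org.apache.kafka.clients.admin.AdminClientConfig;
                import org.apache.kafka.clients.admin.NewPartitions;

                public class ExpandPartitions {
                    public static void main(String[] args) throws Exception {
                        Properties props = new Properties();
                        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
                        try (AdminClient admin = AdminClient.create(props)) {
                            // Grow "events" to 12 partitions in place. The count can only
                            // ever increase; routing of future keyed events changes, which
                            // is the caveat for windowed/analytics use cases.
                            admin.createPartitions(Map.of("events", NewPartitions.increaseTo(12)))
                                 .all().get();
                        }
                    }
                }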

          • jbs40 8 years ago

            The architectural pendulum is starting to swing away from co-location of storage and compute (the trend of the last 10+ years) to decoupling of storage and processing to avoid exactly these issues, but legacy architectures hang on for a while.

            In the streaming and messaging space, Apache Pulsar (pulsar.apache.org) is a more recent solution that has an architecture that decouples processing and storage. That gives you nice properties like independent scaling of storage and processing, infinite data retention, dynamic resizing and others.

            • gazarsgo 8 years ago

              What pendulum do you see? From here, architectural patterns are clearly converging on a "distributed mainframe" model between containerization and lambda/kappa architectures...

              I think Joyent's Manta was ahead of its time in colocating compute and storage and I suspect we'll see more along this vein with the recent open sourcing of FoundationDB.

              • jbs40 8 years ago

                I was thinking more specifically of the internal architectures of data processing platforms, especially the categorizations that emerged from the MPP database world. The "shared nothing" architecture has been dominant in databases (and is also the core architecture of Hadoop), designed around "co-locating data and compute". Kafka largely follows that architecture as well, using local disk on the compute nodes as its persistent storage layer.

                A lot of new data processing platforms, from Snowflake in the data warehouse world to AWS Athena to Apache Pulsar in the broader data processing world, have moved to decoupled architectures.

                Containerization and container management frameworks (e.g. Kubernetes) certainly do change the meaning of "local" storage, will be interesting to see how that plays out.

        • beepbeepbeep1 8 years ago

          You can, but even with TBs of data, never mind PBs, it can take weeks to sync across.

          Also, running two clusters to handle large volumes of data is big money. Even a modest cluster with around 20 TB+ of data was north of $30K a month on AWS. That's a full app cluster though, with consumers/producers as well as brokers.

    • stingraycharles 8 years ago

      It's difficult to search through, query, or run projections over. Also, the API assumes you want to stream realtime data rather than query historical data.

      • sidlls 8 years ago

        Use the correct tool for the job: hook an analytics DB up to the Kafka pipe and store the data for future queries. Kafka was never intended to support your use case.

        If the inbound data you'd like to put into Kafka isn't large, by the way, just write straight to the DB. It's irritating to see Kafka used where it's not necessary; it adds complexity to an infrastructure, and that cost has to be justifiable.

      • linkmotif 8 years ago

        I think you can use Kafka Streams to create whatever projections you want. Kafka can handle millions of small records per second. You can use the Kafka Streams API to process all your data either into local RocksDB stores or out to whatever database you want. This requires some work, but then you have a good thing going if it works.
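
        For example, a sketch of a projection with the Streams API (the topic and store names are hypothetical); the count is materialized into a local RocksDB-backed store on each instance and can also be written back out to a topic:

            import java.util.Properties;
            import org.apache.kafka.common.serialization.Serdes;
            import org.apache.kafka.common.utils.Bytes;
            import org.apache.kafka.streams.KafkaStreams;
            import org.apache.kafka.streams.StreamsBuilder;
            import org.apache.kafka.streams.StreamsConfig;
            import org.apache.kafka.streams.kstream.KStream;
            import org.apache.kafka.streams.kstream.KTable;
            import org.apache.kafka.streams.kstream.Materialized;
            import org.apache.kafka.streams.kstream.Produced;
            import org.apache.kafka.streams.state.KeyValueStore;

            public class EventProjection {
                public static void main(String[] args) {
                    Properties props = new Properties();
                    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-projection");
                    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
                    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
                    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

                    StreamsBuilder builder = new StreamsBuilder();
                    KStream<String, String> events = builder.stream("events");

                    // Project the raw stream into a per-key count, backed by a local
                    // RocksDB state store named "event-counts".
                    KTable<String, Long> counts = events.groupByKey()
                            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("event-counts"));

                    // Optionally push the projection to another topic (or on to a DB via a sink connector).
                    counts.toStream().to("event-counts-by-key", Produced.with(Serdes.String(), Serdes.Long()));

                    KafkaStreams streams = new KafkaStreams(builder.build(), props);
                    streams.start();
                    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
                }
            }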

      • lbradstreetOP 8 years ago

        As a side note, we provide tooling to maintain Amazon Athena/Hive/Presto support over your Pyrostore-archived data in S3, while preserving its ability to be streamed in order.

tomconnors 8 years ago

Everything Distributed Masonry does is very interesting. Wish I had more excuses to use your stuff at work.

Storing all data forever in a single source of truth is awesome until regulation like GDPR comes along. Do you have plans to support excision or is your guidance on personal data to avoid putting it into a system like Kafka/Pyrostore?

  • insensible 8 years ago

    You might enjoy reading Greg Young's https://leanpub.com/esversioning, which covers this topic.

    It covers several strategies, three of which are:

    * Encrypt it and then throw away the key to forget it ("crypto-shredding"; see the sketch after this list)

    * Store private data outside the event with the event just pointing to it

    * Delete events (on systems that support this)
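
    A rough sketch of that first strategy in Java, assuming one AES key per data subject held outside the log (the in-memory map is just a stand-in for a real key store):

        import java.security.SecureRandom;
        import java.util.HashMap;
        import java.util.Map;
        import javax.crypto.Cipher;
        import javax.crypto.KeyGenerator;
        import javax.crypto.SecretKey;
        import javax.crypto.spec.GCMParameterSpec;

        // Events on the log only ever hold ciphertext; deleting a subject's key
        // renders every archived event about them unreadable ("crypto-shredding").
        public class CryptoShredder {
            private final Map<String, SecretKey> keysBySubject = new HashMap<>();
            private final SecureRandom random = new SecureRandom();

            byte[] encryptForSubject(String subjectId, byte[] plaintext) throws Exception {
                SecretKey key = keysBySubject.computeIfAbsent(subjectId, id -> newKey());
                byte[] iv = new byte[12];
                random.nextBytes(iv);
                Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
                cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
                byte[] ciphertext = cipher.doFinal(plaintext);
                // Prepend the IV so the event is self-contained apart from the key.
                byte[] out = new byte[iv.length + ciphertext.length];
                System.arraycopy(iv, 0, out, 0, iv.length);
                System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
                return out;
            }

            void forget(String subjectId) {
                // "Throw away the key": the ciphertext stays in the log but can no longer be read.
                keysBySubject.remove(subjectId);
            }

            private SecretKey newKey() {
                try {
                    KeyGenerator gen = KeyGenerator.getInstance("AES");
                    gen.init(256);
                    return gen.generateKey();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        }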

  • lbradstreetOP 8 years ago

    We will be launching support for native excision and data anonymization soon, as these are extremely important for storing streaming data over the long term.

    Workarounds for excision in Kafka, such as key compaction, are often impossible to use because they depend on the key scheme in place.
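
    For anyone unfamiliar with that workaround: on a compacted topic you "delete" a key by producing a tombstone, i.e. a record with that key and a null value. A minimal sketch with hypothetical topic and key names:

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerConfig;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.kafka.common.serialization.StringSerializer;

        public class Tombstone {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
                props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
                props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    // A null value is a tombstone: after compaction (and delete.retention.ms)
                    // the broker drops earlier records with this key. This only helps if the
                    // topic is keyed by the thing you need to erase, e.g. a user id.
                    producer.send(new ProducerRecord<>("user-profiles", "user-42", null));
                    producer.flush();
                }
            }
        }

    Which is exactly why it breaks down when the key isn't (or doesn't contain) the identifier you need to erase.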

taherchhabra 8 years ago

Integration with Azure Managed Disks: Due to the ingestion-heavy nature of Kafka, the disks attached to the cluster's nodes often become the bottleneck. Traditionally, to get past this bottleneck, more nodes need to be added. Azure Managed Disks is a technology that provides cheaper, scalable disks at a fraction of the cost of a node. HDInsight Kafka has integrated with these disks to provide up to 16 TB per node instead of the traditional 1 TB, giving much higher scale per node at lower cost.

https://azure.microsoft.com/en-us/services/hdinsight/apache-...

Is this the same approach as Pyrostore?

  • lbradstreetOP 8 years ago

    Our approach archives topics to cheap, highly durable and available object stores, while keeping the data available for blending between warehoused and live data sets.

    This reduces operational complexity significantly versus scaling nodes up and dealing with rebalancing, under-replicated partitions, etc.

lmsp 8 years ago

This is what Apache Pulsar (https://pulsar.incubator.apache.org/) already provides: infinite streaming storage, with a simple and flexible messaging/streaming API and Kafka compatibility.

chrisjc 8 years ago

Very interesting and reminds me of Pravega (http://pravega.io/). Seems like unbounded streams will be the next big step in streaming technology.

https://www.youtube.com/watch?v=cMrTRJjwWys

mavdi 8 years ago

These are the guys behind www.onyxplatform.org. That alone tells me this is legit stuff. We will give it a try.

dominotw 8 years ago

> tradeoffs in our operation of Kafka have lossy effects on stream-ability. Balancing costs and operational feasibility, we ask Kafka to forget older data through retention policies.

What does 'lossy effects on stream-ability' mean here? Does the stream slow down, is there data loss, or something else?

  • lbradstreetOP 8 years ago

    Pyrostore co-founder here. When practitioners archive their data from Kafka to other storage products (S3, a SQL database, etc.) today, they give up the log-ordered structure of the data and their ability to consume it in its original ordering, with its original offsets and timestamps. Pyrostore structures and indexes your data in S3 in order to provide a consumer that implements the Kafka consumer interfaces, ensuring you are always able to stream from hot and cold storage alike.
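
    To make the general idea concrete (this is not a description of Pyrostore's internals, and the topic/partition here are hypothetical): a naive version replays the archive in offset order, then seeks a plain Kafka consumer to the first offset the archive doesn't cover.

        import java.time.Duration;
        import java.util.List;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerConfig;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.ConsumerRecords;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.TopicPartition;
        import org.apache.kafka.common.serialization.ByteArrayDeserializer;

        public class BlendedReplay {
            public static void main(String[] args) {
                TopicPartition tp = new TopicPartition("events", 0);

                // 1. Cold path: read the archived batches from object storage in offset
                //    order (see the S3 listing sketch upthread) and note where they end.
                long nextOffset = replayArchive(tp);

                // 2. Hot path: pick up from that offset with a plain Kafka consumer, so
                //    downstream code sees one continuous, ordered stream.
                Properties props = new Properties();
                props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
                props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
                props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
                try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                    consumer.assign(List.of(tp));
                    consumer.seek(tp, nextOffset);
                    while (true) {
                        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                        for (ConsumerRecord<byte[], byte[]> record : records) {
                            // handle(record) -- same code path as the archived records
                        }
                    }
                }
            }

            // Placeholder for the cold read; returns the first offset not present in the archive.
            static long replayArchive(TopicPartition tp) {
                return 0L;
            }
        }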

ah- 8 years ago

I wonder if this would ever be integrated into Kafka proper. Shipping out historical chunks onto infinite storage seems like a generally sensible thing.

This would be even better if it didn't need a modified client.
