Scaling Shopify's Black Friday Live Map: An Apache Flink Redesign
shopify.engineering
Many people abuse Redis as a pub/sub queue. I wonder what the Redis topology looks like, though, because in a native Redis Cluster, keys are hashed, so pub/sub still writes to the same instance if the key is the same.
I also wonder whether this is the enterprise edition of Redis or the open-source one. For the OSS edition, there needs to be a gateway for an HTTP API. More often than not, I have seen the client implementation be faulty rather than Redis itself.
Shopify operates at a huge scale, though, so smaller issues probably pile up and become problems...
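For anyone unfamiliar with the key hashing mentioned above, here is a rough sketch of how a native Redis Cluster maps a key to one of its 16384 hash slots (CRC16 of the key, modulo 16384, with `{...}` hash tags honored). The function name `hash_slot` is my own, and this only illustrates key placement; it says nothing about how the Redis deployment in the article was actually set up.

```python
import binascii

NUM_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def hash_slot(key: str) -> int:
    """Return the cluster hash slot for a key (illustrative helper, not a client API)."""
    data = key.encode("utf-8")
    # Hash tags: if the key has a non-empty {...} section, only that part is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16/XMODEM variant.
    return binascii.crc_hqx(data, 0) % NUM_SLOTS


# Keys sharing a hash tag land in the same slot, hence on the same node.
same_slot = hash_slot("{user1000}.following") == hash_slot("{user1000}.followers")
print(same_slot)  # True
```

So anything routed by the same key (or the same hash tag) ends up on one node, which is why a single hot key cannot be spread across the cluster.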
Redis was being used as a message bus here, so messages were being queued up and delivered in order using sorted sets. Overall, the performance issue was not the fault of Redis but rather how it was being used. The proposal (at the bottom of the post) does not suffer from these issues.
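To make the sorted-set pattern concrete, here is a minimal in-memory sketch of ordered delivery. It is not the code from the post: `SortedSetQueue`, `zadd`, and `pop_up_to` are hypothetical names, the dict stands in for a real Redis sorted set, and using a sequence number as the score is my assumption.

```python
from typing import Dict, List


class SortedSetQueue:
    """In-memory stand-in for a Redis sorted set used as an ordered message queue."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}  # member -> score, like a sorted set

    def zadd(self, score: float, member: str) -> None:
        # As with a sorted set, re-adding a member just updates its score.
        self._scores[member] = score

    def pop_up_to(self, max_score: float) -> List[str]:
        # Deliver members with score <= max_score, ordered by (score, member),
        # then remove them so they are not delivered twice.
        due = sorted((s, m) for m, s in self._scores.items() if s <= max_score)
        for _, member in due:
            del self._scores[member]
        return [member for _, member in due]


queue = SortedSetQueue()
queue.zadd(3, "msg-c")  # messages can arrive out of order
queue.zadd(1, "msg-a")
queue.zadd(2, "msg-b")
first_batch = queue.pop_up_to(2)
second_batch = queue.pop_up_to(10)
print(first_batch, second_batch)  # ['msg-a', 'msg-b'] ['msg-c']
```

Every read here has to scan and sort the pending entries, which hints at why this usage pattern, rather than Redis itself, can become the bottleneck as volume grows.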
Shopify uses GCP. Curious why Flink over Beam/Dataflow?
Great question. We actually tried Dataflow before pivoting completely to Flink. Dataflow has many benefits, and I had a great experience with it for batch use cases at a previous company, but for stateful streaming the team identified limitations. On top of that, Flink is open source, which is a huge benefit for us because we can more easily debug deeper issues and contribute to its direction. That said, it is more work to use Flink compared to Dataflow out of the box; our platform team is looking to reduce this friction and hopefully contribute to Flink in the process.
One of the authors here - happy to answer questions.