Monitoring 9600 banks at scale

blog.plaid.com

96 points by jeandenis 8 years ago · 13 comments

steakknife 8 years ago

Interesting writeup. This is also a major issue for us at TradeIt (we do something similar, but for stock brokers and portfolio/trading), as the brokers we integrate with are not always...ahem..."robust". We've found that our upstream users really appreciate that we can often tell them about brokers' service outages before the brokers even announce them (when the brokers even bother). Sometimes the brokers don't even realize their system is malfunctioning until we poke them to ask what's going on.

Our throughput numbers are much lower and our integrations are much fewer than Plaid's, so we've been able to get away with keeping a close eye on Graphite/Grafana for spikes in request failures/timeouts. It seems like eventually we'll need to implement some kind of statistical monitoring and alerting.

divxflounder 8 years ago

Great article! I'm definitely taking an action item to look into Prometheus. I own DevOps/Monitoring and Alerting at my org, and it's really cool to see how other companies skin this cat.

I saw CloudWatch in the pipeline, which is an Amazon product. I know I'm going to make a very controversial statement here, but - why Amazon? With volumes like yours, you'll eventually hit the point where your costs skyrocket.

Regarding the metrics themselves, you might already do this, but I highly recommend splitting your metrics into a 50th, 95th, and 99th percentile in your Grafana graphs. This will give you a solid idea of not only what your customers experience on average, but edge cases as well.
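The p50/p95/p99 split suggested above can be sketched with a simple nearest-rank percentile over raw latency samples (a hypothetical illustration, not anything from the article; in practice these would come from a Grafana query over histogram data):

```python
import random


def percentile(samples, pct):
    """Nearest-rank percentile: the pct-th percentile of a list of samples."""
    ordered = sorted(samples)
    # nearest-rank: 1-indexed rank = ceil(pct/100 * n)
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]


# Simulated request latencies in milliseconds (lognormal is a rough stand-in
# for the long-tailed distributions typical of network requests).
latencies_ms = [random.lognormvariate(4, 0.8) for _ in range(10_000)]

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct):.1f} ms")
```

The point of the three cuts: p50 shows the typical customer experience, while p95/p99 surface the tail that averages hide.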

Do you have a regular forum with how you are reviewing said metrics and pre-solving problems? We're still trying to solve this in multiple teams where I work and have noticed that some teams are great at it and other teams are a little more reactive.

Love to see this stuff :)

  • jeeyoungk 8 years ago

    One of the authors here. Thanks for enjoying the article!

    Re: AWS. We're not at a point where we're overburdened by AWS spending. Many things are more efficient with AWS, since we have a fairly small engineering team, and we use a number of AWS products (Aurora and Kinesis, to name a few).

    Regarding metrics & percentiles - Yes I agree. 99th percentile is what we try to look at the most, as most other metrics tend to be deceiving.

    Regular forums - This is something that we need to improve on as we move forward. The blog post mostly describes the infrastructure we've built, but it takes time and effort to become a metric-driven organization.

    • Terretta 8 years ago

      Pretty unofficial here, but I prefer engineering channel to biz dev channel... Drop me a note, loop in whoever would be interested? I’ve been meaning to get our companies better acquainted — your fantastic write up reminded me.

syastrov 8 years ago

Nice write up. I love reading these kinds of postmortems.

Unlike a lot of those I read, it sounds like you actually set out with a good set of requirements and really understood the problem.

I had a good experience using Prometheus as well for a smaller project (server monitoring). It’s interesting to know that it can handle so many metrics and scale so well to more complex problem areas.

  • joyzheng 8 years ago

    One of the blog authors here -- thanks!

    > I had a good experience using Prometheus as well for a smaller project (server monitoring). It’s interesting to know that it can handle so many metrics and scale so well to more complex problem areas.

    Yep, we started out with a pretty simple Prometheus setup too (two instances scraping the same metrics, just for redundancy), but we've been adding federated instances and doing some pre-aggregation to scale. The nice part is that we've been able to do it pretty gradually by updating the config (e.g. splitting out one bucket of metrics at a time into a separate node for scraping).
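    The federation setup described above can be sketched as a Prometheus scrape config (a hypothetical fragment; the target names and label matcher here are assumptions, not Plaid's actual setup):

    ```yaml
    # Global Prometheus that federates pre-aggregated series from shard-level
    # instances via the built-in /federate endpoint.
    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 30s
        honor_labels: true          # keep the labels set by the shards
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job="bank-integrations"}'   # pull only this bucket of metrics
        static_configs:
          - targets:
              - 'prometheus-shard-1:9090'
              - 'prometheus-shard-2:9090'
    ```

    Splitting a bucket of metrics onto its own node then amounts to adding a shard target and adjusting the `match[]` selectors, which is what makes the gradual migration possible.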

lordxenu 8 years ago

How do you get the data from banks? Are you scraping the webpage after the user logs in? Not many banks I know of have public apis.

  • throwawaymath 8 years ago

    Yes, for any bank that doesn't provide them with API access they're scraping the login pages. They even do this for banks which implement anti-scraping measures.

Rainymood 8 years ago

How do you guys handle user log-in credentials? I mean, you're basically logging into their bank, right?

wbh1 8 years ago

Really enjoyed this write-up. I'm currently in the process of scaling out a Prometheus-based replacement for an old Nagios setup that was scaled to its limit and posts like this just make me that much more excited for Prometheus as a technology.

beamatronic 8 years ago

With that many integrations, some small set must be broken at any given time. How do you handle this without scaling a support staff accordingly?
