How do you folks monitor external dependencies at scale?
Most software companies have external dependencies: cloud platforms, APMs, functional/business integrations. How do you folks monitor these external dependencies for errors, latency, etc.? RSS feeds and status pages exist, but there are days when AWS is slow to update its status page, and CDN latency is not that obvious in the middle of the night when you get paged. Makes me wonder what other folks are doing about this?

In my experience the best way to monitor is to have passive monitoring for all dependencies (error rate, latency, response time, throughput) across all touch points, and then to have active monitoring (health checks, acks/nacks) for all the things which are performing the passive monitoring, which are usually your services or applications. After that, you usually want to set some sort of anomaly watermarks, either manually based off a baseline or using one of the many anomaly detection solutions available. I've found many issues with providers this way, often before they even knew. It's also helped inform decisions to migrate to alternative providers or services when we were able to measure what the improvements would actually be, rather than relying on hand waving and marketing materials.

This is all pretty easy stuff, but it requires discipline and the resources to invest in instrumenting everything. You need some level of buy-in from leadership, and it's all the more difficult if you have a toilsome ops or on-call rotation. If you are large enough and can afford it, I recommend empowering at least one reliability engineer to be tasked with solving the problem across the stack.

The real problems are when you're operating a service you don't really own (i.e. a vendor) and there are issues related to how it interacts with something else. The only real solution, aside from getting the thing fixed or abandoning it, is to shim or proxy the dependencies so that you can instrument them as a black box. For example, if your vendor gives you a .jar that you configure to use S3, run a local proxy for S3 as a sidecar and collect stats there. This is a contrived example, but the concept should be clear. Often you can't even do this, as vendors hardcode things like AWS, and forget it if you're using something managed like Databricks or Snowflake.

Agree. When I worked at AWS, each of the service teams (e.g. Route53, VPC) would monitor their dependencies ruthlessly: response time, error rate, etc. As a service owner, you keep other teams on the hook, ensuring they are performing as expected.

https://metrist.io/ is working on exactly this.

Thank you for mentioning us. My name is Jeff, one of the co-founders of Metrist. We think a lot about this problem and have been building the solution for the past few months. We are still in beta, currently focused on AWS and Azure, but we do have a number of other services supported, and a few more we can enable for anyone who wants to be an early beta user. We have a big vision beyond our functional checks, and would welcome any feedback.

Looks great. Interested in following this, and wondering when you folks plan to move out of beta to GA.

We expect to have both free and paid products in GA around mid-2022.

+1 for Metrist. Great product!

You could gather and periodically log a latency histogram for each type of request: https://www.fsl.cs.sunysb.edu/project-osprof.html This covers all the examples you gave except CDN latency. You could try polling CDN endpoints from an external client that also gathers histograms.
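To make the passive-monitoring and latency-histogram ideas above concrete, here is a minimal sketch using only the Python standard library. The bucket boundaries, watermark values, and health-check URL are illustrative placeholders, not recommendations; in practice you would feed these numbers into whatever metrics system you already run.

```python
import time
import urllib.request
from collections import Counter

# Illustrative latency buckets (seconds); tune to your own traffic.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


class DependencyStats:
    """Passive monitoring for one external dependency:
    latency histogram, call count, and error count."""

    def __init__(self, name):
        self.name = name
        self.histogram = Counter()   # bucket upper bound -> count
        self.calls = 0
        self.errors = 0

    def observe(self, seconds, ok):
        self.calls += 1
        if not ok:
            self.errors += 1
        for bound in BUCKETS:
            if seconds <= bound:
                self.histogram[bound] += 1
                return
        self.histogram[float("inf")] += 1

    def error_rate(self):
        return self.errors / self.calls if self.calls else 0.0

    def slow_fraction(self, watermark_s):
        """Approximate fraction of calls slower than a baseline watermark
        (exact when the watermark lines up with a bucket boundary)."""
        slow = sum(n for bound, n in self.histogram.items() if bound > watermark_s)
        return slow / self.calls if self.calls else 0.0


def timed_get(stats, url, timeout=5.0):
    """Wrap an outbound call so every touch point feeds the histogram."""
    start = time.monotonic()
    ok, body = False, None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            ok = 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        pass
    stats.observe(time.monotonic() - start, ok)
    return body


# Example: alert when the dependency drifts past watermarks derived from a
# normal baseline (the 0.5 s / 5% / 2% figures and the URL are hypothetical).
cdn = DependencyStats("cdn-origin")
timed_get(cdn, "https://example.com/health")
if cdn.slow_fraction(0.5) > 0.05 or cdn.error_rate() > 0.02:
    print(f"ALERT: {cdn.name} outside baseline")
```

Wrapping every outbound touch point in something like `timed_get` is what lets you notice a provider degrading before their status page does.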
We use https://statusgator.com with great success.

Active monitoring means you have a program which:

- performs queries, measurements, log-anomaly checks, etc.
- collects the data and examines it
- sends alerts via independent channels when appropriate

For example, you might have an end-to-end query that is a curl call to an authenticated API which returns the free space allocated to a database table. When conducted from a host outside your network, this tests all of:

- reachability of your service
- minimum performance of your service
- availability of your database

You then need to establish what the normal values look like, so that you can send an alert when the range is exceeded in a bad way. Do something like this for everything critical.

You can also do negative checks: run a job that pushes some tiny amount of data to your monitor once a minute. If you haven't received a push in the last 90 seconds, send an alert.

If you host your service on cloud A, use a different cloud to run the performance checks. If you use Gmail for email, use another service to send alerts.

I would have thought that monitoring would be against application-level metrics directly, and the AWS-level stuff is more of an informational next step in pinpointing the source?

You ideally want both. If you understand the performance profile of your dependencies (even when it's degraded), then because you control the application you can compensate, through retries or batching or caching or request pooling, etc. The last thing you want to be doing when an issue arises is debugging your whole stack to find the root cause while your customers sit on their hands angrily.

Is it better to OODA-loop faster when a dependency degrades, or is it better to assume it will degrade and build in circuit breakers and auto-recovery?

Why not both?
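On that closing exchange: faster detection and built-in degradation handling are not mutually exclusive, and the circuit-breaker half is small. Here is a minimal sketch; the thresholds (five consecutive failures, a 30-second cool-off) are illustrative assumptions, not recommended values.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker around calls to one external dependency.

    closed    -> calls pass through; consecutive failures are counted
    open      -> calls fail fast until the cool-off expires
    half-open -> one trial call decides whether to close again
    """

    def __init__(self, failure_threshold=5, cooloff_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooloff_s = cooloff_s
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooloff_s:
                raise RuntimeError("circuit open: failing fast")
            # Cool-off elapsed: half-open, allow one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()   # open, or re-open after a failed trial
            raise
        # Success: reset and recover automatically.
        self.failures = 0
        self.opened_at = None
        return result


# Usage with any outbound call, e.g. a hypothetical vendor client function:
# breaker = CircuitBreaker()
# data = breaker.call(fetch_from_vendor, "orders/123")
```

Paired with the passive metrics and active checks described above, the breaker sheds load from a degraded dependency automatically, while the alerts still tell you which dependency tripped it.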