Setup AWS Cloudwatch Monitoring and Alerts Using Bash Scripts

themythicalengineer.com

39 points by sks147 5 years ago · 33 comments

amirkdv 5 years ago

We've been looking at making CloudWatch (CW) alarms an automated part of our infra. Here are some findings that may help:

- The semantics of CW seem convoluted, but once you stare at the API docs for long enough, the core concepts are easy to grok: Metrics (regularly submitted from the machine to CW), Alarms (abstractions that define alarm logic based on the behavior of Metrics), and SNS Topics (what to do when an Alarm goes off; this could be just an email address).

- Once you get the data model right, all implementations (click ops, terraform, bash via awscli, boto3, etc.) look essentially identical.

- Some Metrics come for free, e.g. CPU usage is reported by any EC2 instance to CW. For some other Metrics, notably disk and memory usage, you need to configure your instance to report them to CW. This is where the OP's monitoring scripts come in.

- The monitoring scripts and the cron config the OP refers to are deprecated [0]. Instead there's a new CloudWatch Agent [1]: you install the package on your EC2 instances, provide a configuration file to it, and you're set.

[0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scri...

[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
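
For illustration, here is a minimal sketch of the kind of configuration file the CloudWatch Agent takes for memory and disk usage, generated from Python. The helper name `build_agent_config` and its parameters are hypothetical, and the field names are written as I recall them from the agent's configuration reference, so check them against [1] before relying on this.

```python
import json

def build_agent_config(interval_seconds=60, disk_paths=("/",)):
    """Return a CloudWatch Agent config dict that reports memory and disk usage.

    Assumption: field names follow the agent's documented JSON schema;
    verify against the official reference before deploying.
    """
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": list(disk_paths),
                    "metrics_collection_interval": interval_seconds,
                },
            }
        }
    }

# Serialize to the JSON text you would save as the agent's config file.
config_json = json.dumps(build_agent_config(), indent=2)
print(config_json)
```

The output would be saved to a file and handed to the agent; where that file lives and how the agent is started depends on the install, which [1] covers.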

coredog64 5 years ago

High CPU alerts are terrible alerts. If I'm paying per instance, I want CPU utilization to be high. If it's low, I'm wasting money. So now what I need is an alert where it's not high, but somewhere between "high and too high". You know, like when there's an arbitrary spike because the Java is doing some GC. Or you have a one minute spike of traffic that fires an Ops Genie alert at 2am but auto-clears between when the on-call engineer wakes up and when they log in to check.

For the love of $DEITY, if you're going to set up CloudWatch monitoring, create custom metrics that map to your business outcomes and alert when those go off the rails.
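
The one-minute-spike problem can be softened with the `EvaluationPeriods` and `DatapointsToAlarm` settings on a CloudWatch alarm, which require several breaching datapoints in a window before the alarm fires. Below is a simplified simulation of that "M out of N" idea, not CloudWatch's actual evaluation logic (it ignores missing data, for instance). The function name and the sample CPU series are made up for illustration.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return 'ALARM' or 'OK' for each position where a full window exists.

    A window of `evaluation_periods` datapoints is in ALARM when at least
    `datapoints_to_alarm` of them are above `threshold`.
    """
    states = []
    for end in range(evaluation_periods, len(datapoints) + 1):
        window = datapoints[end - evaluation_periods:end]
        breaching = sum(1 for value in window if value > threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states

# Hypothetical per-minute CPU percentages: one GC-style spike, then sustained load.
cpu = [40, 45, 97, 42, 41, 90, 92, 95, 93, 50]

print(alarm_states(cpu, 85, 1, 1))  # fires on the lone spike at 97
print(alarm_states(cpu, 85, 5, 3))  # fires only once the load is sustained
```

With 1 of 1, the isolated spike pages someone; with 3 of 5, only the sustained stretch does. It still alerts on CPU rather than on a business metric, so it addresses the noise and not the underlying complaint.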

  • sks147 (OP) 5 years ago

    You might want to have two separate alerts for this problem, one labelled WARNING and the other CRITICAL, such as 60 percent CPU usage as a warning and 85 percent CPU usage as a critical situation. You can have two separate SNS Topics for warning and critical alerts. Warning alerts can be thrown to a slack channel and Critical alerts can be configured to invoke the Pager.
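
    A rough sketch of the two-tier setup with boto3, assuming the 60 and 85 percent thresholds above. The instance ID, topic ARNs, period, and helper name are placeholders; the dictionary keys are the parameters of `put_metric_alarm`. The actual API call is left in a comment so the snippet runs without AWS credentials.

```python
def build_cpu_alarm(instance_id, severity, threshold_percent, topic_arn):
    """Build keyword arguments for boto3's cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": f"{instance_id}-cpu-{severity.lower()}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds per datapoint (assumed value)
        "EvaluationPeriods": 2,     # consecutive periods to evaluate (assumed value)
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic notified when the alarm fires
    }

# Hypothetical instance ID and SNS topic ARNs, for illustration only.
INSTANCE_ID = "i-0123456789abcdef0"
TIERS = [
    ("WARNING", 60, "arn:aws:sns:us-east-1:123456789012:cpu-warning"),
    ("CRITICAL", 85, "arn:aws:sns:us-east-1:123456789012:cpu-critical"),
]
alarms = [build_cpu_alarm(INSTANCE_ID, sev, pct, arn) for sev, pct, arn in TIERS]

# To create them for real (needs boto3 and AWS credentials):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   for kwargs in alarms:
#       cloudwatch.put_metric_alarm(**kwargs)
for alarm in alarms:
    print(alarm["AlarmName"], alarm["Threshold"], alarm["AlarmActions"][0])
```

    The warning topic would then be subscribed to Slack and the critical topic to the pager.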

  • vergessenmir 5 years ago

    Custom CloudWatch metrics are expensive to write to, which makes them useful mainly for coarse-grained, high-level service metrics. If you can afford it, go ahead, but setting up some other cloud-native monitoring service may be the way to go.

  • jeppesen-io 5 years ago

    Certainly not perfect, but I've had very good success alerting on load average over 120 to 150 percent of core count.

    What's nice is that it catches a variety of disk issues as well.

    I'm sure it's not perfect for all cases, but for me it covers most of them.
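
    A small sketch of that check using only the Python standard library. Which load-average window to use is not stated above, so the 5-minute figure here is an assumption, as are the function names and the default 120 percent limit. On Linux the load average also counts tasks blocked on I/O, which is presumably why it picks up disk trouble.

```python
import os

def load_ratio(load_average, core_count):
    """Load average as a fraction of available cores (1.0 means fully busy)."""
    return load_average / core_count

def is_overloaded(load_average, core_count, limit=1.2):
    """True when load exceeds `limit` times the core count (1.2 means 120 percent)."""
    return load_ratio(load_average, core_count) > limit

try:
    one_min, five_min, fifteen_min = os.getloadavg()
except (AttributeError, OSError):
    # os.getloadavg is missing on some platforms (e.g. Windows).
    print("load average is not available on this platform")
else:
    cores = os.cpu_count() or 1
    print(f"5-minute load: {load_ratio(five_min, cores):.0%} of {cores} cores, "
          f"overloaded={is_overloaded(five_min, cores)}")
```

    The ratio could be published as a custom metric or checked locally from cron; either way the threshold is relative to core count, so the same rule works across instance sizes.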

  • dimitar 5 years ago

    If you are running software that requires an instance but is not expected to create load, you can put it on a burstable instance and set up such an alert, so you know when it is time to upgrade.

orf 5 years ago

Not sure why you’d ever do this instead of using terraform.

  • hughrr 5 years ago

    As someone who spends two hours a day dealing with buggered terraform state and upgrading terraform and dealing with terraform bugs I can see it.

    It’s one of those things that really works pretty well but there are enough edge cases to make it slightly soul sucking.

    • gizdan 5 years ago

      This sounds like a lack of understanding of terraform. We use Terraform pretty heavily and I've rarely seen bad states across our whole org, and the few that I do see are usually people who don't know the core concepts (often non-devops engineers).

      Terraform has its faults, but it is the best in its class, especially when you need to manage infrastructure beyond a single cloud provider (e.g. we manage our datadog monitors and dashboard, pagerduty alerts and much more). The only other thing that would probably thrash it is pulumi, which has similar concepts, except you can use many different languages as opposed to HCL (no, CDK doesn't count, because it is still very immature and last I checked it only supported one or two languages).

      • throwawaaarrgh 5 years ago

        I work with a large group of engineers that manage a very large array of infrastructure. We see weird Terraform issues all the time. There's a multitude of ways that Terraform gets into a bad state and has to be fixed manually (in production). Even a Terraform expert runs into them, because it's not necessarily an issue "with Terraform", but with a buggy provider, or some feature of Terraform which wasn't tested well in certain scenarios, etc.

        Terraform allows for too much complex configuration/operation, the codebases change too frequently, there's not enough testing, and even extremely simple operations fail in a way that can't be reverted automatically. In practice the tool is clunky, complicated, difficult, and unreliable. Whenever I run "terraform apply" I know I am rolling dice, and plan for how I'm going to recover everything if I need to (which was what Terraform was supposed to prevent!)

        But at the same time, if lots of people need to manage the same infra, you really have to use some common tool. Bash scripts are a great fix for small isolated problems, but they don't scale.

      • hughrr 5 years ago

        I completely agree with your points there and that is probably the issue.

    • ldoughty 5 years ago

      Agreed, I swapped my team from Terraform to Ansible to SAM... SAM has been the most reliable, resilient, and stable for my use cases (general serverless).

      • rantwasp 5 years ago

        SAM is cloudformation. cloudformation is the thing to use if you’re on the AWS cloud

        • void_mint 5 years ago

          CloudFormation is without a doubt the worst cloud technology I have ever used.

          • coredog64 5 years ago

            It's at least second or third worst. Worst would be writing your own deployment tool that does what CloudFormation (or TF or Pulumi) do. Second worst would be writing a tool that uses a templating language to generate CloudFormation and only using that.

            • void_mint 5 years ago

              Yeah I wasn't really considering home rolled stuff. Officially supported tech.

          • rantwasp 5 years ago

            lol. what’s next? you used terraform and it was awesome?

    • orf 5 years ago

      Would it be more soul sucking than emulating it with bash, as the article is almost suggesting?

      • hughrr 5 years ago

        About the same. Just nice for stuff to suck in a different way occasionally.

    • manderson89 5 years ago

      If you don't like Terraform then you should use CloudFormation, not bash scripts.

      • hughrr 5 years ago

        Oh no that’s even worse.

        • manderson89 5 years ago

          If you prefer imperative infrastructure creation to declarative then I think you're doing something wrong. Both Terraform and CloudFormation are quite easy to manage compared to writing and managing scripts (bash or otherwise).

          • hughrr 5 years ago

            I’m only having a gripe. I use terraform because it’s the least bad tool, not because it’s the best. I wish for better.

  • miyuru 5 years ago

    Yes.

    Using terraform for this is great because it removes the unwanted alarms.

    I had to create alarms when the instances auto scale and wrote a python script using cdktf and now the Jenkins job handles it. It even updates the cloudwatch dashboard.
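
    To make the "removes the unwanted alarms" point concrete, here is a hand-rolled sketch of the comparison a declarative tool performs: desired state versus what exists. This is not the cdktf script mentioned above; the function name, the alarm naming scheme, and the sample IDs are all hypothetical.

```python
def plan_alarm_changes(running_instance_ids, existing_alarm_names, suffix="-cpu-critical"):
    """Return (to_create, to_delete) alarm names for the current fleet.

    Only alarms ending in `suffix` are treated as managed, so unrelated
    alarms are never scheduled for deletion.
    """
    desired = {f"{instance_id}{suffix}" for instance_id in running_instance_ids}
    managed = {name for name in existing_alarm_names if name.endswith(suffix)}
    to_create = sorted(desired - managed)
    to_delete = sorted(managed - desired)
    return to_create, to_delete

# Hypothetical state after a scale-in/scale-out cycle:
# i-bbb was terminated and i-ccc was launched.
running = ["i-aaa", "i-ccc"]
existing = ["i-aaa-cpu-critical", "i-bbb-cpu-critical", "billing-alarm"]

to_create, to_delete = plan_alarm_changes(running, existing)
print("create:", to_create)
print("delete:", to_delete)
```

    An imperative script that only ever creates alarms would leave the alarm for the terminated instance behind; computing the difference is what cleans it up.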

  • qvrjuec 5 years ago

    Or CDK... If you're writing code to generate infra why jump through more hoops than you'd need to

  • sks147 (OP) 5 years ago

    Until the team onboards a terraform expert, these scripts might be helpful and cheaper to implement.

    • orf 5 years ago

      “Until the team reads the terraform QuickStart, these scripts will continue to make their infra a hellscape to manage”

ranguna 5 years ago

AWS Cloudwatch Monitoring & Alerts using CDK ?
