Things We Forgot to Monitor

word.bitly.com

232 points by jehiah 12 years ago · 65 comments

AznHisoka 12 years ago

Also: 1) Maximum # of open file descriptors

2) Whether your slave DB stopped replicating because of some error.

3) Whether something is screwed up in your SOLR/ElasticSearch instance so that it doesn't respond to search queries, but still responds to simple heartbeat pings.

4) If your Redis db stopped saving to disk because of a lack of disk space, not enough memory, or because you forgot to set vm.overcommit_memory.

5) If you're running out of space in a specific partition where you usually store random stuff, like /var/log.

I've had my ass bitten by all of the above :)
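
For (1) and (5) above, a minimal sketch in Python of the kind of check involved (Linux-only since it reads /proc; the PID, mount points, and thresholds are placeholders, not recommendations):

# Illustrative only: warn when a process nears its fd limit, or when a
# partition nears full.
import os

def process_fd_usage(pid):
    """(open fds, soft limit) for a given pid, read from /proc."""
    open_fds = len(os.listdir('/proc/%d/fd' % pid))
    soft = 0
    with open('/proc/%d/limits' % pid) as f:
        for line in f:
            if line.startswith('Max open files'):
                soft = int(line.split()[3])   # 4th column is the soft limit
    return open_fds, soft

def free_fraction(path):
    """Fraction of free space on the filesystem holding `path`."""
    st = os.statvfs(path)
    return float(st.f_bavail) / st.f_blocks

fds, limit = process_fd_usage(os.getpid())     # substitute your daemon's pid
if limit and fds > 0.8 * limit:
    print('WARNING: %d of %d file descriptors in use' % (fds, limit))

for mount in ('/', '/var/log'):                # partitions you care about
    if free_fraction(mount) < 0.10:
        print('WARNING: %s is more than 90%% full' % mount)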

  • contingencies 12 years ago

    6) Free inodes (as distinct from space) per filesystem.
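
    A quick sketch of watching that from Python (os.statvfs exposes inode counts; df -i shows the same thing):

    import os

    def inode_free_fraction(path):
        """Fraction of free inodes on the filesystem holding `path`."""
        st = os.statvfs(path)
        if st.f_files == 0:          # some filesystems don't report inode counts
            return None
        return float(st.f_ffree) / st.f_files

    print(inode_free_fraction('/'))  # alert when this approaches zero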

    • caw 12 years ago

      Similar to free inodes, you should also check against the maximum number of directories. The dir_index option helps, but I've seen it become a problem.

      • mnw21cam 12 years ago

        There's a maximum number of directories? On what filesystem is that?

        • caw 12 years ago

          ext3 without dir_index has a limit of 32K directories in any one directory.

          Where I saw it crop up was 32K folders under /tmp on a cluster system. So no, it's not a limit on the total number of directories (that's inodes), but rather on how many subdirectories a single directory can have.

          http://en.wikipedia.org/wiki/Ext4#Features <-- Fixes 32K limit

          • otterley 12 years ago

            ext3/4 has really poor large-directory performance, even with dir_index, especially if you are constantly removing and readding nodes. I would highly recommend XFS for large-directory use cases.

        • 0x0 12 years ago

          I got bit by this once; I think it was related to a maximum of 32K hard links per inode, which effectively sets a limit of 32K subdirs, since each subdir has a hard link to ".."

  • Gracana 12 years ago

    > Maximum # of open file descriptors

    Augh. I ran one of my servers hard into that wall, and now it's something I watch. At least I learned from that mistake.

    • apaprocki 12 years ago

      Related to this, if you've ever built/run anything on Solaris, you probably found out the hard way that even in modern times, fdopen() in 32-bit apps only allows up to 255 fds, because they so badly want to preserve an ages-old ABI. It's a funny bug to hit at runtime in production when you aren't aware of this compatibility "feature".

    • wtracy 12 years ago

      I learned the hard way that MySQL creates a file descriptor for every database partition you create. Someone had a script that created a new partition every week...

      • pbhjpbhj 12 years ago

        So after 5000 years you were running out?

        • wtracy 12 years ago

          I forget the details, but practically speaking the database keeled over after some 200 or 500 files were open at the same time.

  • teddyh 12 years ago

    X) Number of cgroups. We were getting slow performance, apparently related to slow IO, but nothing stood out as being the culprit. Turns out, since vsftpd was creating cgroups and not removing them, the pseudo-filesystem /sys/fs/cgroup had myriads of subdirectories (each representing a cgroup), and whenever something wanted to create a new cgroup or access the list of cgroups, this counted as listing that pseudo-directory, which counted as IO.

    Fixed by using the undocumented option isolate_network=NO in vsftpd.conf.
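
    For what it's worth, a sketch of a check that would have caught this (the cgroup mount point and the threshold are assumptions):

    import os

    CGROUP_ROOT = '/sys/fs/cgroup'   # assumed mount point
    THRESHOLD = 10000                # arbitrary

    count = sum(len(dirs) for _root, dirs, _files in os.walk(CGROUP_ROOT))
    if count > THRESHOLD:
        print('WARNING: %d cgroup directories under %s' % (count, CGROUP_ROOT))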

  • DrJ 12 years ago

    Feels like this list (and the original post) describes problems caused by:

    * Lack of proper/default monitoring being advocated for your tools (2), (4).

    * Choosing poor (default/recommended) settings (1), (4).

    * Keeping stateful servers/instances when you don't need to (5), (6).

    * Not tracking performance as part of monitoring (3), (4).

    Admittedly, I have made the same mistakes too.

    edit: formatting

otterley 12 years ago

Swap rate (as opposed to space consumed) is probably the #1 metric that monitoring agents fail to report.

One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

Another often-overlooked metric that's important for web services in particular is the TCP accept queue depth, per listening port. Once the accept queue fills up, remote clients will get ECONNREFUSED, which is a bad place to be. This value is somewhat difficult to obtain, though, because AFAIK Linux doesn't expose it.
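
On the free/cached/buffered point, a minimal sketch of stacking it yourself from /proc/meminfo (newer kernels expose MemAvailable, which is better; as the replies note, this simple sum overstates what is really reclaimable):

def meminfo():
    """Parse /proc/meminfo into a dict of kB values."""
    values = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            values[key] = int(rest.split()[0])    # first field is the kB count
    return values

m = meminfo()
available_kb = m['MemFree'] + m['Buffers'] + m['Cached']   # rough estimate
print('approx available: %d MB of %d MB' % (available_kb // 1024,
                                            m['MemTotal'] // 1024))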

  • InclinedPlane 12 years ago

    > One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

    Even that is misleading. It's actually non-trivial to find out exactly how much "freeable" memory one has on a Linux system these days, as not all of the cached memory is truly freeable.

  • rodgerd 12 years ago

    Even then there are some wrinkles; the anon shared memory used by e.g. the Oracle SGA will show up as cached memory, but evicting it is a no-no.

  • justincormack 12 years ago

    Yes, I can't find the socket backlog anywhere in Linux. FreeBSD exposes it via kqueue http://www.freebsd.org/cgi/man.cgi?query=kqueue through the data item in EVFILT_READ.

  • marcosdumay 12 years ago

    Swap rate still looks like the wrong metric. It'd be better to have the rate of swap lookups, excluding all writes.

    • otterley 12 years ago

      Swap-in rate, to be more specific. Swap-outs aren't incredibly worrisome.

      • acdha 12 years ago

        That's backwards: things like mmap() will generate page-in activity during normal operation. page-outs means that the operating system had to evict something to satisfy other memory requests, which is what you really want to know.

        • otterley 12 years ago

          Swap-outs and page-outs aren't identical in Linux, and are instrumented separately (pswpout and pgpgout, respectively; see /proc/vmstat). mmap() and other page-ins won't be counted under the swap statistics.

          A pageout might suggest memory pressure, but not nearly as much as a swapout does. (pgmajfault is a better indicator.) Writing dirty pages is just something the kernel does even when there's no memory pressure at all. Also, unfortunately you can't use pgpgout for anything useful as ordinary file writes are counted there.
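
          A sketch of sampling the swap-in rate from /proc/vmstat (pswpin is the swap-in counterpart of the pswpout counter mentioned above; the interval and the alert condition are arbitrary):

          import time

          def vmstat_counter(name):
              """Read a single counter from /proc/vmstat."""
              with open('/proc/vmstat') as f:
                  for line in f:
                      key, value = line.split()
                      if key == name:
                          return int(value)
              return 0

          INTERVAL = 60                          # seconds between samples
          before = vmstat_counter('pswpin')
          time.sleep(INTERVAL)
          after = vmstat_counter('pswpin')
          rate = (after - before) / float(INTERVAL)
          if rate > 0:
              print('swapping in at %.1f pages/sec' % rate)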

bradleyland 12 years ago

Interestingly, an out-of-the-box Munin configuration on Debian contains nearly all of these. I recommend setting up Munin and having a look at what it monitors by default, even if you don't intend to use it as your monitoring solution.

  • hansjorg 12 years ago

    Installation on Debian/Ubuntu is also as simple as installing the munin package (munin-node for subsequent hosts) and pointing a webserver at the right directory.

    Extremely valuable when something is acting up.

tantalor 12 years ago

Some people, when confronted with a problem, think “I know, I'll send an email whenever it happens.” Now they have two problems.

  • marcosdumay 12 years ago

    I really don't get where you are going with that.

    Are you arguing that alerts are useless, and that we should fix the issues once and for all? Because if so, I'd point out that some things cannot be fixed (because the Earth is finite, we don't know everything, etc.), and you are better off being alerted sooner rather than later.

    Now, if you are arguing that email is not the right medium for an alert, well, what medium is better? Really, I can't think of any single candidate. Yeah, email may go down; that's why you complement it with some system external to your network (a VPS is cheap, a couple of them at different providers is almost flawless, and way cheaper than any proprietary dashboard). Yes, there is some delay involved, but it should be a few minutes at most, because you create some addresses specifically for the alerts and make all hell break loose when a message gets there. Some standard IM protocol that federated across your whole network (and an external point of control), could be reached from anywhere, and had plenty of support on all kinds of computers would be better, but it does not exist.

    • alister 12 years ago

      I got the GP's point immediately: He means that system administrators already get an enormous volume of email. Send them another email and it'll get ignored, deleted, or put at the bottom of a gigantic to-do list.

      For airline pilots, an excessive number of warnings (bells, alarms, audible callouts) is known to distract the pilots and cause errors.

    • aryastark 12 years ago

      I think you're being obtuse.

      Once you start sending emails for things, you start sending emails for everything. It's easy to fall into the trap of not accurately categorizing what is critical (like real, real, critical, I mean it this time guys!) and what are merely statuses. So what happens is everything starts being ignored, and your systems become obscure black boxes again.

      • hueving 12 years ago

        I think you were the one being obtuse. There is no assumption that you will start receiving useless email status updates. In fact, most reasonable monitoring tools only email when a status changes to a problem state.

        • dredmorbius 12 years ago

          most reasonable monitoring tools

          20+ years of experience tells me most monitoring tools aren't reasonable.

          • hueving 12 years ago

            Then don't use them? My point is that there is nothing wrong with email alerts, so the statement about them being a problem sounds like a misconfiguration or a failure to understand how to set up email filters.

            • dredmorbius 12 years ago

              there is nothing wrong with email alerts

              You're wrong.

              As a sysadmin, I typically receive something on the order of 1,000 to 10,000 emails daily (the specifics vary by the system(s) I'm admining). Staying on top of my email stream is a significant part of my job, both in not ignoring critical messages which have been lost, misfiled, or spam-filtered, and in not getting bogged down in verbose messages which convey no real information.

              Alerts which tell me nothing have a negative value: they obscure real information, they don't convey useful information, and each person who comes on to the team has to learn that "oh, those emails you ignore", write rules to filter or dump them, etc.

              Worse: if the alerts might contain useful information, that fact has to be teased out of them.

              The problem with emails such as that is that they're logging or reporting data. They should be logged, not emailed, and with appropriate severity (info, warning, error, critical). Log analysis tools can be used to search for and report on issues from there.

              As I said: in a mature environment, much of my work goes into removing alerts, alert emails, etc., which are well-intentioned but ultimately useless.

              • hueving 12 years ago

                >As a sysadmin, I typically receive something on the order of 1,000 to 10,000 emails daily

                Sorry, but you're not a very good sysadmin then. You have chosen poor tools or do not understand how to distill the information. Knowing that, I can see why you think email alerts don't work. They are effectively broken FOR YOU.

                • ersii 12 years ago

                  And you don't think vendors have a responsibility to reflect upon the way they do alerts and/or service monitoring?

                  It's usually not the system administrators who get to decide what the Corporate Overlords purchase or who they do business with. So I think it's pretty unfair to blame the admins for "choosing poor tools".

    • InclinedPlane 12 years ago

      The point being: delegating prioritization and categorization to a human in real time is lazy and dangerous. As much as possible, humans should only receive notifications when something requires action or is too complex to determine programmatically.

    • rhizome 12 years ago

      Some standard IM protocol that federated across your whole network (and an external point of control), could be reached from anywhere, and had plenty of support on all kinds of computers would be better, but it does not exist

      I would recommend an SMS sent via GSM modem for out-of-band emergency notifications.

  • dredmorbius 12 years ago

    Hospitals have a similar problem -- too many devices with too many alarms. As many as 10,000/day on a busy nursing floor.

    NPR covered this a few days back; I've written on it at more length:

    http://www.npr.org/blogs/health/2014/01/24/265702152/silenci...

    http://www.reddit.com/r/dredmorbius/comments/1x0p1b/npr_sile...

  • rmc 12 years ago

    "What if the email goes down? I know I'll send an email"

    • mafro 12 years ago

      Keyboard missing, press F1 to continue

    • dredmorbius 12 years ago

      That's actually a case where sending a regular ping mail to several sentinel systems which report on the LACK of an email can be useful.

      Reminds me of a few times the email queues got backed up to hell and beyond. Fuck you, Yahoo.
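
      A rough sketch of the sentinel side, assuming the pings land in a local Maildir (the path and the silence threshold are made up):

      import glob
      import os
      import time

      MAILDIR_NEW = '/home/sentinel/Maildir/new'   # hypothetical ping mailbox
      MAX_SILENCE = 30 * 60                        # alert after 30 quiet minutes

      mails = glob.glob(os.path.join(MAILDIR_NEW, '*'))
      newest = max((os.path.getmtime(p) for p in mails), default=0)
      silence = time.time() - newest
      if silence > MAX_SILENCE:
          print('ALERT: no ping mail received for %d minutes' % (silence // 60))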

dredmorbius 12 years ago

The corollary of this post is "things we've been monitoring and/or alerting on which we shouldn't have been".

Starting at a new shop, one of the first things I'll do is:

1. Set up a high-level "is the app / service / system responding sanely" check which lets me know, from the top of the stack, whether or not everything else is functioning properly (a rough sketch follows below).

2. Go through the various alerting and alarming systems and generally dial the alerts way back. If it's broken at the top, or if some vital resource is headed into the red, let me know. But if you're going to alert based on a cascade of prior failures (and DoS my phone, email, pager, whatever), then STFU.
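
The rough sketch promised in (1): a top-of-stack "is it responding sanely" probe. The URL, expected content, and timeout are placeholders; the exit codes follow the Nagios plugin convention (0 = OK, 2 = CRITICAL):

import sys
import urllib.request

URL = 'https://example.com/healthz'    # hypothetical end-to-end endpoint
EXPECTED = b'OK'                       # whatever "sane" looks like for you

try:
    resp = urllib.request.urlopen(URL, timeout=10)
    body = resp.read()
except Exception as exc:
    print('CRITICAL: %s unreachable: %s' % (URL, exc))
    sys.exit(2)

if resp.status != 200 or EXPECTED not in body:
    print('CRITICAL: %s responded, but not sanely' % URL)
    sys.exit(2)

print('OK: %s' % URL)
sys.exit(0)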

In Nagios, setting up relationships between services and systems, dependencies for alerting, appropriate thresholds, etc., is key.

For a lot of thresholds you're going to want to find out why they were set to what they were and what historical reason there was for that. It's like the old pot roast recipe where Mom cut off the ends of the roast 'coz that's how Grandma did it. Not realizing it was because Grandma's oven was too small for a full-sized roast....

Sadly, that level of technical annotation is often lacking in shops, especially where there's been significant staff turnover through the years.

I'm also a fan of some simple system tools such as sysstat which log data that can then be graphed for visualization.

jlgaddis 12 years ago

Be sure to monitor your monitoring system as well (preferably from outside your network/datacenters)! If you don't have anything else in place, you can use Pingdom to monitor one website/server for free [0].

I was off work for a few months recently (motorcycle wreck) and removed my e-mail accounts from my phone. Now, I have all my alerts go to a specific e-mail address and those are the only mails I receive on my phone. It has really helped me overcome the problem of ignoring messages.

[0]: https://www.pingdom.com/free/

comice 12 years ago

We monitor outgoing SMTP and HTTP connections from anything that requires those services.

And the best general advice I have is split your alerts into "stuff that I need to know is broken" and "stuff that just helps me diagnose other problems". You don't want to be disturbing your on-call people for stuff that doesn't directly affect your service (or isn't even something you can fix).

mnw21cam 12 years ago

Also: are your backups working?
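
A minimal sketch of that check (the backup location, expected age, and minimum size are all assumptions; actually test-restoring a backup is the only real proof):

import glob
import os
import time

BACKUP_GLOB = '/backups/db-*.dump'    # hypothetical backup location
MAX_AGE = 26 * 60 * 60                # expect at least one backup a day
MIN_SIZE = 1024 * 1024                # an "empty" backup is also a failure

files = glob.glob(BACKUP_GLOB)
newest = max(files, key=os.path.getmtime) if files else None
if newest is None or time.time() - os.path.getmtime(newest) > MAX_AGE:
    print('ALERT: no recent backup found')
elif os.path.getsize(newest) < MIN_SIZE:
    print('ALERT: latest backup %s is suspiciously small' % newest)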

jsmeaton 12 years ago

We had a perfect storm of problems only 2 weeks ago.

1. A vendor tomcat application had a memory leak, consumed all the RAM on a box, and crashed with an OOM

2. The warm standby application was slightly misconfigured, and was unable to take over when the primary app crashed

3. Our Nagios was configured to email us, but something had gone wrong with ssmtp 2 days prior, and it was unable to contact Google Apps

3a. No one was paying any attention to our server metric graphs / we didn't have good enough "pay attention to these specific graphs because they are currently outside the norm" alerting

A very embarrassing day for us that one.

We're now working on better graphing, and have set up a basic ssmtp check to SMS us if there is an issue. Monitoring is hard.

  • berkay 12 years ago

    You may want to check OpsGenie heartbeat monitoring, or essentially implement the same idea yourself. Our heartbeat monitoring expects to receive messages (via email or API) from monitoring tools periodically and notifies you via push/SMS/phone if we don't receive one for over 10 minutes. I think this pattern is very useful for ensuring that alert notification is working.

  • marcosdumay 12 years ago

    > and have set up a basic ssmtp check to SMS us if there is an issue.

    And what will happen when the network (or the alert server) is down?

    You must put some check outside your network, with independent infrastructure. Adding another protocol on the same net is still subject to Murphy's law.

    • berkay 12 years ago

      Independent infrastructure is a good idea but not always feasible for everyone. At OpsGenie, to resolve this problem, we came up with a solution we refer to as "heartbeat monitoring". This basically allows monitoring tools to send us periodic heartbeat messages indicating that the tool is up and can reach us. If we don't receive heartbeat messages from a tool within 10 minutes, we generate an alert and notify the admins. It's not out-of-band management, but it does the trick to prevent situations like the one jsmeaton described.

      http://support.opsgenie.com/customer/portal/articles/759603-...

sp332 12 years ago

You're using icanhazip.com in production? I see from a quick Google search that Puppy Linux seems to use it in some scripts, but how reliable is it?

  • jphines 12 years ago

    $ curl -i -k -L icanhazip.com

    HTTP/1.1 200 OK

    Date: Mon, 10 Feb 2014 20:13:28 GMT

    Server: Apache

    Content-Length: 15

    Content-Type: text/plain; charset=UTF-8

    X-RTFM: Learn about this site at http://bit.ly/14DAh2o and don't abuse the service

    X-YOU-SHOULD-APPLY-FOR-A-JOB: If you're reading this, apply here: http://rackertalent.com/

    X-ICANHAZNODE: icanhazip2.nugget

    Would seem only fair. :D

  • toomuchtodo 12 years ago

    jsonip.com is also usable in production.
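
    A sketch of querying these with a fallback, since either service can vanish (the response formats are assumptions based on how the services behaved at the time):

    import json
    import urllib.request

    def external_ip():
        """Ask each service in turn; return the first answer we can parse."""
        for url, fmt in (('http://icanhazip.com', 'text'),
                         ('http://jsonip.com', 'json')):
            try:
                body = urllib.request.urlopen(url, timeout=5).read().decode().strip()
            except Exception:
                continue                          # try the next service
            if fmt == 'text':
                return body                       # plain text: just the address
            return json.loads(body).get('ip')     # JSON with an "ip" field
        return None

    print(external_ip())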

baruch 12 years ago

Regarding reboot monitoring, I suggest using kdump to dump the oops information and save it for later debugging and understanding of the issue. It may even be an uncorrectable memory or PCIe error you are seeing; the info is logged in the oops but is hard to figure out otherwise. Also, if you consistently hit a single kernel bug, you may want to fix it or work around it.

lincolnpark 12 years ago

Also: are your API endpoints working properly?

  • dredmorbius 12 years ago

    Can you expand on that?

    • AznHisoka 12 years ago

      Ha, I can.

      Sometimes API providers change the damn response format. Or their URLs change. Or they block your IP without notifying you.
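
      A sketch of the kind of contract check that catches those surprises (the endpoint and the required fields are placeholders):

      import json
      import urllib.request

      URL = 'https://api.example.com/v1/widgets/1'     # hypothetical endpoint
      REQUIRED_FIELDS = ('id', 'name', 'updated_at')   # fields your code relies on

      try:
          resp = urllib.request.urlopen(URL, timeout=10)
          data = json.loads(resp.read().decode())
      except Exception as exc:
          print('ALERT: API unreachable or not returning JSON: %s' % exc)
      else:
          missing = [f for f in REQUIRED_FIELDS if f not in data]
          if missing:
              print('ALERT: response is missing fields: %s' % ', '.join(missing))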

      • dredmorbius 12 years ago

        Thanks.

        I was thinking of some sort of end-point test myself; I hadn't considered the specific case of APIs.

jlgaddis 12 years ago

I have gear in three different facilities and I'm typically not visiting any of them unless I'm installing or replacing hardware. Shortly after starting at $job, I realized there was no monitoring of the RAID arrays in the servers we have. That could have ended badly.
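
For Linux software RAID, a minimal sketch (reads /proc/mdstat; hardware controllers need their vendor tools instead):

# Degraded md arrays show up in /proc/mdstat with "_" in the status
# brackets, e.g. [U_] instead of [UU].
try:
    with open('/proc/mdstat') as f:
        mdstat = f.read()
except IOError:
    mdstat = ''       # no md driver loaded (or not Linux)

if '_' in mdstat:
    print('ALERT: a software RAID array looks degraded:')
    print(mdstat)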

herokusaki 12 years ago

How oversold your VPS provider's server is: commonly blamed for slowdowns, but rarely measured.

stephengillie 12 years ago

Between PRTG and Windows, almost all of that is handled for us. And PRTG can query OMSA via SNMP.
