Aggregating Everything with collectd


The first step in any tiered collectd setup is getting the collectd processes in the first tier to ship data to the collectd processes in the second tier. Collectd ships with a few plugins that can help you do this, most obviously the network plugin. The trouble, at least for our setup, is that the network plugin operates over UDP. We perform service discovery by registering instances with load balancers that have fixed endpoints, but our load balancer, ELB, only supports TCP. And since we wanted to run the second tier of collectd within our normal deployment environment, which uses ephemeral instances created by autoscaling groups, we would have had to introduce a new service discovery mechanism to support the network plugin.
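
For comparison, a network plugin setup would look roughly like the sketch below, with the first tier pushing UDP packets at a fixed endpoint and the second tier listening for them. The hostname is a placeholder, and 25826 is collectd’s default network port:

## first tier (the approach we didn't take)
<Plugin "network">
  Server "collectd-aggregator.example.internal" "25826"
</Plugin>

## second tier (the approach we didn't take)
<Plugin "network">
  Listen "0.0.0.0" "25826"
</Plugin>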

Not wanting to increase our system’s complexity, we reviewed the other available plugins for an alternative to UDP and found the AMQP plugin. Its purpose is not, as you might expect, to monitor RabbitMQ, but to ship data between collectd processes using AMQP exchanges. And since we already use RabbitMQ in production, it was simple to set up another server to broker collectd messages. The first-tier and second-tier AMQP plugin configs look something like this:

## first tier
<Plugin "amqp">
  <Publish "collectd-amqp">
    Host "collectd.rabbitmq.private.adstage.io"
    Port "5672"
    VHost "/"
    User "guest"
    Password "guest"
    Exchange "collectd-amqp"
  </Publish>
</Plugin>

## second tier
<Plugin "amqp">
  <Subscribe "collectd-amqp">
    Host "collectd.rabbitmq.private.adstage.io"
    Port "5672"
    VHost "/"
    User "guest"
    Password "guest"
    Exchange "collectd-amqp"
    ExchangeType "fanout"
    Queue "collectd-amqp"
    QueueAutoDelete false
  </Subscribe>
</Plugin>

Update 2017-01-30: Note the use of a named queue on the second tier with auto-delete set to false. We had to add this to make the system robust to unexpected second-tier collectd restarts. Without it, we found that the automatically generated queues sometimes failed to delete and would continue to accumulate messages that were never processed, overloading our RabbitMQ server. A named, non-auto-deleted queue also lets you process messages sent while the second-tier collectd is not running, at the cost of those messages queuing up in RabbitMQ. For us that is desirable because it eliminates any gaps in monitoring during machine restarts. Collectd will still manage the named queue for you automatically, so there’s no need to configure anything in RabbitMQ yourself.

With the first tier shipping data to the second tier, we now needed to aggregate it. This is where the CPU aggregation example came in handy. For each first-tier plugin (CPU, disk, etc.), we set up an aggregation plugin config to group together the hosts’ time series. Since we wanted to group our metrics by autoscaling group, we had each first-tier host’s collectd report its Host as the autoscaling group name rather than the actual host name. Then we could group by “Host” in our aggregations like so:

## second tier
<Plugin "aggregation">
  <Aggregation>
    Plugin "df"
    Type "df_complex"
    GroupBy "Host"
    GroupBy "TypeInstance"
    CalculateAverage true
    CalculateMinimum true
    CalculateMaximum true
  </Aggregation>
</Plugin>
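
One detail not shown above is how the first tier reports the autoscaling group name as its Host. One way to do it is with collectd’s global Hostname option on each first-tier machine; the value below is a placeholder that we assume gets filled in at instance boot (for example from EC2 instance metadata or tags), since that mechanism isn’t part of the configs in this post:

## first tier (global options in collectd.conf)
# Placeholder value; assumed to be templated in at instance boot.
Hostname "my-autoscaling-group"
FQDNLookup false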

Now we had our second tier collectd aggregating our first tier collectd metrics, but unfortunately it was still writing the first tier metrics through to graphite/Sumo Logic. In a smaller setup this might be okay, but we were generating tens of thousands of metrics per minute, and most of those were individual host metrics no one was ever going to look at because we work in terms of autoscaling groups, not individual hosts. It would be nice to keep them, but for us it wasn’t economical. So we needed a way to filter out the first tier metrics and only send the second tier generated aggregation metrics on to Sumo Logic.

Collectd’s filter chains are designed specifically to aid in this scenario. They provide a generic way of defining how data flows between plugins in collectd. Following Librato’s CPU example, we initially created a rule for every plugin to capture its data and send it through aggregation.

## second tier
<Chain "PostCache">
  <Rule>
    <Match regex>
      Plugin "^df$"
    </Match>
    <Target write>
      Plugin "aggregation"
    </Target>
    Target stop
  </Rule>
  Target "write"
</Chain>

The trouble with rules like these is that you have to write one for every plugin you use in the first tier. They also blacklist rather than whitelist the metrics sent to write plugins, so if someone turns on a new plugin in the first tier, it can write tens of thousands of metrics per minute into Sumo Logic unless they also configure aggregation correctly in the second tier. To address these issues, we switched to a chain that sends only aggregated data to graphite and routes everything else to the aggregation plugin.

## second tier
<Chain "PostCache">
  <Rule>
    <Match regex>
      Plugin "^aggregation$"
    </Match>
    <Target write>
      Plugin "write_graphite/SumoLogic"
    </Target>
    Target stop
  </Rule>
  <Target write>
    Plugin "aggregation"
  </Target>
</Chain>

<Plugin "write_graphite">
  <Node "SumoLogic">
    Host "localhost"
    Port "2003"
    Protocol "tcp"
    LogSendErrors true
    StoreRates true
    AlwaysAppendDS false
    EscapeCharacter "_"
  </Node>
</Plugin>

Now when we turn on a new plugin in collectd we only have to update the first tier config and add an aggregation rule in the second tier. Then the second tier collectd process ships the data to a Sumo Logic agent via the graphite protocol, and Sumo Logic gives us pretty charts.
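
As an illustration, enabling the memory plugin (a hypothetical addition, not one of the plugins discussed above) would only require changes like these; the filter chain already routes the new metrics into aggregation:

## first tier: enable the new plugin
LoadPlugin memory

## second tier: add another <Aggregation> block alongside the existing ones
<Plugin "aggregation">
  <Aggregation>
    Plugin "memory"
    Type "memory"
    GroupBy "Host"
    GroupBy "TypeInstance"
    CalculateAverage true
    CalculateMinimum true
    CalculateMaximum true
  </Aggregation>
</Plugin>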

The entire setup process took us a few days, but hopefully you can set up a similar tiered collectd install in just a few hours by following our example. If you see any ways we could improve our system, please let us know in the comments. And if you’re interested in working on site reliability challenges like this for our growing marketing automation platform, be sure to check out the AdStage careers page.