CloudWatch-collectd Integration

Introduction

For monitoring to work in the long term, it has to be easy to add things to it. In fact, adding a check should be less work than fixing the thing that needs monitoring. If it isn't, the check will likely never get added and will instead stay inside the sysadmin's head, where it's no use to anyone else and prone to, er, "corruption".

If your monitoring platform happens to be AWS CloudWatch, there are a couple of easy ways to add new metrics from software running on an EC2 instance or inside a container.

AWS CloudWatch has an agent that you run on your instance. It periodically collects some basic system metrics and sends them to CloudWatch, and it has a variety of 'hooks' to let you add more metrics as you like.

The CloudWatch agent can take in metrics from statsd and/or collectd. Both have their benefits and downsides, but there's no real problem with enabling both in case you want them. The main difference between them is that statsd is just a "listener", so you need something else to send metrics to it. Collectd is a standalone process which has its own means of scheduling the collection of metrics and can then report them in various ways, including to the CloudWatch agent.

Collectd

We can use Collectd's scheduling to periodically run checks on our system, collect metrics and then publish them. Indeed, Collectd has dozens of plugins which can get metrics from a variety of sources, so you may be able to use one of them instead of having to write your own.

If Collectd's plugins are not suitable for you, it has a Python plugin which allows us to run arbitrary Python code on Collectd's schedule and to report metrics that way. This means we have a way to monitor just about anything, any way we want to do it. What's more, there are some Collectd Python projects around to monitor common products, saving us having to write too much ourselves.
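
To give a feel for what such a plugin looks like, here is a minimal sketch that reports how many entries a directory holds. The plugin name, the default path and the Path option are made up for illustration; the collectd calls (register_config, register_read, Values, dispatch) come from Collectd's Python interface, and the collectd module itself can only be imported when the code runs inside Collectd.

```python
"""Minimal collectd Python plugin sketch: report how many entries a directory holds."""
import os

try:
    import collectd  # only importable when running inside the collectd daemon
except ImportError:
    collectd = None  # lets count_entries() be exercised outside collectd

# Hypothetical default; overridden by a `Path "<dir>"` option in the <Module> block.
WATCH_PATH = "/var/spool/myapp"


def count_entries(path):
    """Return the number of directory entries in `path`, or 0 if it is missing."""
    try:
        return len(os.listdir(path))
    except FileNotFoundError:
        return 0


def configure(config):
    """Config callback: pick up the Path option from the <Module> block."""
    global WATCH_PATH
    for node in config.children:
        if node.key.lower() == "path":
            WATCH_PATH = node.values[0]


def read(data=None):
    """Read callback: collectd calls this once per Interval."""
    vl = collectd.Values(plugin="dir_entries", type="gauge", type_instance="count")
    vl.dispatch(values=[count_entries(WATCH_PATH)])


if collectd is not None:
    collectd.register_config(configure)
    collectd.register_read(read)
```

Saved under a name such as dir_entries.py (again, just an example), this would be loaded with an Import line and a Module block in Collectd's Python plugin configuration.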

We should also mention that Collectd has the means to do some alerting and other tasks based on the metrics it is collecting. We haven't explored any of that, preferring instead to do all of that in CloudWatch.

To use Collectd with the CloudWatch agent, first install it on the system:

dnf install collectd
dnf install collectd-python

Here we've installed the Python plugin as well, which isn't mandatory, but as we'll see is pretty useful.

Next we have to configure Collectd. We can do this by creating the file /etc/collectd.d/collectd-aws-default.conf:

Interval 60

LoadPlugin network
<Plugin network>
  Server "127.0.0.1" "25826"
</Plugin>

Here we're telling Collectd to schedule checks every 60 seconds and to send its metrics via the network plugin to port 25826 on 127.0.0.1 (Collectd's default network port).

Next we tell CloudWatch to ingest metrics from Collectd. We put the following into /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/collectd.json:

{
   "metrics":{
      "metrics_collected":{
         "collectd":{
            "collectd_security_level":"none",
            "metrics_aggregation_interval":60
         }
      }
   }
}
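
A stray comma or brace in this fragment is an easy mistake to make, so it can be worth a quick check before restarting anything. The following is a minimal sketch using only Python's standard json module; the function name is ours, and the fragment is inlined so the example is self-contained.

```python
import json

# The fragment from above, inlined so the sketch is self-contained.
FRAGMENT = """
{
   "metrics":{
      "metrics_collected":{
         "collectd":{
            "collectd_security_level":"none",
            "metrics_aggregation_interval":60
         }
      }
   }
}
"""


def collectd_settings(config_text):
    """Parse an agent config fragment and return its collectd section.

    Raises json.JSONDecodeError on malformed JSON and KeyError if the
    collectd section is missing.
    """
    config = json.loads(config_text)
    return config["metrics"]["metrics_collected"]["collectd"]


if __name__ == "__main__":
    print(collectd_settings(FRAGMENT))
```

In practice you would read the text from the file on disk rather than from a string. This only checks that the JSON parses and has the expected shape, not that the agent accepts every value in it.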

Now we need to restart both the CloudWatch agent and Collectd to pick up the config changes.

Collectd Plugins and Checks

With the plumbing in place, we can now build out some checks to start putting some metrics into CloudWatch. We'll just mention some Collectd plugins that may be worth looking into:

fhcount
load
processes

These plugins are all generically useful, and the processes plugin also has the option to monitor specific processes and produce additional metrics for them.

The Python plugin (which we installed already) is also super useful. We found a GitHub project which connects to MySQL (or MariaDB) and pulls hundreds of metrics from it. The project is a little old, and there's an outstanding request to update it to Python 3, but it's still pretty useful. To run this project as a Python plugin, we have to put it somewhere on the system (we picked /opt/collectd/python/mysql_collectd.py) and then configure Collectd. We made a new file, /etc/collectd.d/mysql.conf:

LoadPlugin python

<Plugin python>
  ModulePath "/opt/collectd/python"
  Import mysql_collectd
  <Module mysql_collectd>
    Host "localhost"
    Port 3306
    User "monitoring"
    Password "somepassword"
    Verbose false
  </Module>
</Plugin>

After a restart, Collectd connects to MySQL once per minute, pulls hundreds of metrics and pushes them to CloudWatch. In fact, you get many more metrics this way than you do with an RDS instance (assuming you want to spend the time running MySQL/MariaDB yourself, which on this occasion our client did).

Futures

With the above working pattern, it's possible to have every Ansible role you use drop the necessary monitoring code into /opt/collectd/python and put a suitable config stub into /etc/collectd.d, and then have Ansible restart Collectd if necessary. This way any application you install will also install its own monitoring hooks and so will always be monitored if ever it's installed.
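
To show the shape of such a config stub, here is a small sketch that renders one in the same layout as the MySQL example earlier. The function names are hypothetical, and in a real Ansible role this would more likely be a template file; the Python is only here to make the structure explicit.

```python
def format_value(value):
    """Format a Python value the way the config example above writes it."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_python_stub(module, options, module_path="/opt/collectd/python"):
    """Render a collectd config stub that loads one Python module."""
    lines = [
        "LoadPlugin python",
        "",
        "<Plugin python>",
        f'  ModulePath "{module_path}"',
        f"  Import {module}",
        f"  <Module {module}>",
    ]
    for key, value in options.items():
        lines.append(f"    {key} {format_value(value)}")
    lines += ["  </Module>", "</Plugin>", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_python_stub(
        "mysql_collectd",
        {"Host": "localhost", "Port": 3306, "Verbose": False},
    ))
```

Each role then only has to supply a module name and its options, and the stub that lands in /etc/collectd.d looks the same everywhere.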

The possibilities here are pretty much endless. It's possible to monitor the internals of almost any application, get process metrics, monitor the existence or absence of things, the expiration time of TLS certificates or license files, and anything else the system might need.
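
As one example, a TLS expiry check could look something like the sketch below. It uses only Python's standard library plus Collectd's Python interface; the host name, plugin name and metric name are placeholders we picked for illustration, not anything Collectd or CloudWatch prescribes.

```python
"""Sketch of a collectd Python plugin reporting days until a TLS certificate expires."""
import socket
import ssl
import time

try:
    import collectd  # only importable when running inside the collectd daemon
except ImportError:
    collectd = None  # lets the helpers be exercised outside collectd

HOST = "example.com"  # hypothetical endpoint to check
PORT = 443


def days_until(not_after, now=None):
    """Days from `now` (epoch seconds) until a certificate's notAfter string."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port and return the peer certificate's notAfter field.

    The default context verifies the certificate, so an already expired
    certificate raises an SSL error instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def read(data=None):
    """Read callback: dispatch the remaining days as a gauge."""
    try:
        remaining = days_until(fetch_not_after(HOST, PORT))
    except OSError as exc:  # covers socket errors and ssl.SSLError
        collectd.warning("tls_expiry: check of %s failed: %s" % (HOST, exc))
        return
    vl = collectd.Values(plugin="tls_expiry", type="gauge", type_instance="days_left")
    vl.dispatch(values=[remaining])


if collectd is not None:
    collectd.register_read(read)
```

Because a failed check dispatches nothing, it may be worth alerting on missing data as well as on a low number of days.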

Conclusions

Here we've shown how to incorporate Collectd into a CloudWatch agent environment so that we can easily collect metrics from just about any application or aspect of the system and push them into CloudWatch. From there it's easy to set up alerts or thresholds on any of the monitored metrics.

If you need help with monitoring, AWS, CloudWatch or anything else, Pre-Emptive can help - just get in touch.

Tags: aws cloudwatch monitoring metrics collectd