Monitoring Philosophy

Observation

Amongst our many skills, one of the things we've been doing recently is mentoring and guiding one of our clients' own sysadmins. The person in question has been running a reasonably successful infrastructure, but they haven't been using much automation, configuration management or monitoring. They've been "automating" a few steps, like setting up a new server, by running a handful of Bash scripts to install or configure a few of the applications, and they run another Bash script to check it's all working after they're finished.

In our experience, this is a relatively common approach, even in some fairly large outfits. These days, though, it really doesn't scale - you simply can't run a large infrastructure, with all the complexity that entails, this way. Hence, we're helping to tighten things up as much as possible.

Automation

We'll talk about the value of automating your usual sysadmin tasks with Configuration Management tools such as Ansible, Puppet or Chef another time. Suffice to say, if you're having to log on to servers to get them set up, you're already "losing".

Another area of automation is "verification". That is, checking that a server actually does what you need it to do. You might run some test requests through it, or try logging on from somewhere else to check it works, or whatever. These are all perfectly valid post-installation checks, but if you can automate them you can get a lot more value from them. Not only can you run them immediately after making a change, but you can run them fairly continuously to ensure that your system or service still works the way you think it should.
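
To give a flavour of what "automating the verification" can look like, here's a minimal sketch in Python of the sort of post-installation check we mean - the URL and the "ok" it expects in the response body are purely illustrative, not from any real system:

    #!/usr/bin/env python3
    """Post-install verification: confirm a freshly built web service answers."""
    import sys
    import urllib.request

    URL = "http://localhost:8080/healthz"   # hypothetical endpoint - adjust for your service

    def main() -> int:
        try:
            # Non-2xx responses raise an exception, so reaching the body check
            # means the server answered with a success status.
            with urllib.request.urlopen(URL, timeout=5) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError as exc:              # connection refused, timeout, HTTP error, DNS failure
            print(f"FAIL: {URL} unreachable or unhealthy: {exc}")
            return 1
        if "ok" not in body.lower():
            print(f"FAIL: unexpected response body: {body[:80]!r}")
            return 1
        print("OK: service responding as expected")
        return 0

    if __name__ == "__main__":
        sys.exit(main())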

Monitoring

Running checks of more or less any sort on a reasonably continuous basis and reporting the results is monitoring as most people would understand it. There's nothing wrong with having a cron job run something and email you if there's a problem, but you can get a world of additional value from that little check, and a million others like it, by using a proper monitoring platform.
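
For what it's worth, converting one of those little scripts into something a monitoring platform can run is usually tiny. Most Nagios-compatible platforms (Nagios, Icinga and friends), for example, will happily run any script that prints a one-line status and exits 0 for OK, 1 for WARNING and 2 for CRITICAL. A rough sketch, with a purely illustrative path and thresholds:

    #!/usr/bin/env python3
    """Disk-space check using the common 0/1/2 (OK/WARNING/CRITICAL) exit convention."""
    import shutil
    import sys

    PATH = "/var"          # illustrative mount point
    WARN, CRIT = 80, 90    # illustrative thresholds (percent used)

    def main() -> int:
        usage = shutil.disk_usage(PATH)
        percent = usage.used * 100 / usage.total
        if percent >= CRIT:
            print(f"CRITICAL: {PATH} is {percent:.0f}% full")
            return 2
        if percent >= WARN:
            print(f"WARNING: {PATH} is {percent:.0f}% full")
            return 1
        print(f"OK: {PATH} is {percent:.0f}% full")
        return 0

    if __name__ == "__main__":
        sys.exit(main())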

The advantage of a monitoring platform is that you get to see everything in one place, and any problem, anywhere, shows up. Modern systems also graph various metrics and maintain histories for services, so if something does go wrong, you've got a wealth of "meta" information to help you figure out whether this is a one-off or a regular problem, whether it's caused by something you can control, and so on - all just by looking at the monitoring UI. It's also a great way to keep management apprised of what's going on, and an absolute must-have if you want to have a "NOC" or first-line responders to triage issues before getting other teams (or you) involved.

Server Testing

As a slight aside, we can talk about server testing (ie. the testing you perform after a change, perhaps with ServerSpec, or Gherkin-based frameworks, or perhaps just by manually trying a few things). These tests usually confirm that your deployment has worked as expected and that the system and application are behaving as they should.
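
ServerSpec itself is Ruby-based, but the same idea can be sketched in a few lines of Python. Here we assume a hypothetical web server that should be active and listening on port 443 - the service name and port are placeholders for whatever your change actually touched:

    #!/usr/bin/env python3
    """Post-change server tests: is the service running and is the port open?"""
    import socket
    import subprocess
    import sys

    SERVICE = "nginx"    # hypothetical service under test
    PORT = 443

    def service_active(name: str) -> bool:
        # 'systemctl is-active' exits 0 only when the unit is active.
        return subprocess.run(["systemctl", "is-active", "--quiet", name]).returncode == 0

    def port_open(port: int, host: str = "127.0.0.1") -> bool:
        try:
            with socket.create_connection((host, port), timeout=3):
                return True
        except OSError:
            return False

    def main() -> int:
        failures = []
        if not service_active(SERVICE):
            failures.append(f"{SERVICE} is not active")
        if not port_open(PORT):
            failures.append(f"nothing listening on port {PORT}")
        for failure in failures:
            print(f"FAIL: {failure}")
        if not failures:
            print("OK: all server tests passed")
        return 1 if failures else 0

    if __name__ == "__main__":
        sys.exit(main())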

We'd generally argue that whilst all testing is good, at least some of this sort of testing should become part of a monitoring strategy. If monitoring performs these tests, then the monitoring platform becomes a "trusted third party" that is checking the work of the systems administrator. When the work is done well, the "lights go green", and so the administrator knows they've completed the work.

In some cases, server testing is too heavyweight or otherwise too intrusive to run during normal production hours, in which case we'd advocate for a lighter-weight version that can run at those times. These tests could also run less frequently than the regular monitoring cadence.

Application Testing

Following on from Server Testing is the testing of the application itself. Even if an application has a special monitoring hook or endpoint (eg /status), that doesn't actually confirm that users can log on, or that they can view the reports page or whatever else is important in the application. Here, some Application Testing can help. The aim is to actually try some of the most common operations that users perform to confirm that they are working. You definitely don't need to run through a full QA test for this, but instead focus on a handful of key operations (for most applications, logging on is probably the main thing to test).
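
As a hedged sketch of that kind of check, here's what "log on and look at a key page" might look like using Python's requests library - the URLs, form fields, credentials and the text we look for on the reports page are all placeholders for whatever your application actually uses (and the monitoring account should be a dedicated, low-privilege one):

    #!/usr/bin/env python3
    """Synthetic application test: log on and fetch a key page, like a user would."""
    import sys
    import requests

    BASE = "https://app.example.com"                           # hypothetical application
    CREDENTIALS = {"username": "monitor", "password": "..."}   # dedicated monitoring account

    def main() -> int:
        session = requests.Session()
        try:
            login = session.post(f"{BASE}/login", data=CREDENTIALS, timeout=10)
            if login.status_code != 200:
                print(f"FAIL: login returned {login.status_code}")
                return 1
            reports = session.get(f"{BASE}/reports", timeout=10)
            if reports.status_code != 200 or "Monthly report" not in reports.text:
                print("FAIL: reports page missing or incomplete")
                return 1
        except requests.RequestException as exc:
            print(f"FAIL: request error: {exc}")
            return 1
        print("OK: login and reports page both working")
        return 0

    if __name__ == "__main__":
        sys.exit(main())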

Again, these tests needn't run as frequently as the rest of your monitoring. However, the aim is of course to learn about user-facing problems as soon as possible (and before the majority of your users!).

Developers usually benefit from application testing as much as systems administrators do. This sort of monitoring can capture metrics about performance or the number of times various things have happened. These sorts of metrics can help developers to tune and refine their software in the future. In some cases, developers may have a monitoring platform of their own to do this sort of work. Some of these platforms also "watch" running processes and so can tie metrics to lines of code as they execute. Such systems, called Application Performance Monitors (APMs), are usually more specialised to developers and are less useful for systems administrators. Whilst having umpteen different monitoring platforms is usually a bad thing, this is one case where it actually makes sense.
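
As a small illustration of the metrics side, even a humble check script can push counters and timings to a statsd-compatible collector (statsd itself, Telegraf and various others accept this simple UDP line format). The metric names, values and collector address below are assumptions for the sake of the example:

    #!/usr/bin/env python3
    """Ship simple application metrics (a counter and a timing) in statsd line format."""
    import socket

    STATSD_ADDR = ("127.0.0.1", 8125)    # assumed local statsd/Telegraf UDP listener

    def send(metric: str) -> None:
        # Fire-and-forget UDP datagram; losing the odd metric is acceptable.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(metric.encode(), STATSD_ADDR)
        sock.close()

    # Illustrative metric names - count an event, then record how long something took.
    send("app.report_generated:1|c")        # counter: one more report was generated
    send("app.report_render_ms:187|ms")     # timing: this render took 187 ms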

As a slight aside, we once heard of an application that was tested once every few minutes by a series of user interactions (eg. log on, look at a statement, view payments, log out again). The developers worked out that with just seven interactions they could actually exercise all of the dependencies of their application. This simple check meant that an issue was always flagged if anything the application needed wasn't working, so it could be fixed before the real users even noticed.

Continuous Improvement

Having converted all of the random Bash scripts and other "little" tests and checks that were lying around into monitoring checks, the sysadmin now has a "third party" checking their work. You can get to a situation where, after a change, you can be genuinely confident that the service is up and working correctly, and that there are no far-flung dependents that will raise an error whenever they next try to use it.

Having the "third party" (monitoring system) checking your work is actually a great validation. It's a bit like Test Driven Development, in so much as it all starts out "green", you do some work, during which things go "red". When you're finished though, it all goes back to "green" confirming that you left it all in a working state. You didn't assess it yourself, so you didn't apply any unconscious bias or its-late-and-I'm-tired to it - it's an objective view of the success of your work.

Of course, in the early days you're probably not checking enough details to really be sure that a service is working. That's okay: at some point the monitoring will all be "green" but someone will complain about something anyway. You'll need to fix the underlying problem, but after that you can add a new monitoring check covering whatever you just fixed, so that next time you'll know it's working before you finish up your change. Having a monitoring platform that makes this easy (really easy) is pretty crucial here - if it's in any way difficult or time-consuming to add checks to your monitoring platform, then you can be sure it won't happen - and that means bumping into the same problems time and time again.

In fact, adding in monitoring whenever anything happens is where the real value comes in. If someone says something like "it's running a bit slow", then putting in some monitoring means you get graphs of the speed of the application. Next time someone says it's a bit slow, you can check the graph and see if it really is slow, or if perhaps it's a problem at the user's end of things.
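
One hedged sketch of how that might look: following the common Nagios plugin convention, anything printed after a "|" is performance data that the platform can store and graph, so a single small script gives you both the alert and the response-time graph. The URL and thresholds here are illustrative:

    #!/usr/bin/env python3
    """Measure application response time and emit it as graphable performance data."""
    import sys
    import time
    import urllib.request

    URL = "https://app.example.com/"     # hypothetical application front page
    WARN, CRIT = 1.0, 3.0                # illustrative thresholds in seconds

    def main() -> int:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=10) as resp:
                resp.read()
        except OSError as exc:
            print(f"CRITICAL: {URL} failed: {exc}")
            return 2
        elapsed = time.monotonic() - start
        # Everything after '|' is performance data: label=value;warn;crit
        perfdata = f"response_time={elapsed:.3f}s;{WARN};{CRIT}"
        if elapsed >= CRIT:
            print(f"CRITICAL: {elapsed:.2f}s response | {perfdata}")
            return 2
        if elapsed >= WARN:
            print(f"WARNING: {elapsed:.2f}s response | {perfdata}")
            return 1
        print(f"OK: {elapsed:.2f}s response | {perfdata}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())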

Likewise, even if someone or something has a problem you can't do anything about, it's still worth some monitoring. For example, if a particular (remote) DNS query seems to take a long time occasionally, there may be nothing you can do about it immediately. For fear of sending too many requests to the remote DNS server, you may even choose not to monitor it directly. However, by monitoring DNS more generally, if further problems occur, you can at least prove that the problem is NOT with anything in your own infrastructure, which gives you clear information to use when working with the remote service provider.
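
A minimal sketch of that sort of general DNS monitoring, timing a lookup through the system's configured resolvers using only the Python standard library - the hostname and the threshold are illustrative:

    #!/usr/bin/env python3
    """Time a DNS lookup through the system resolver and warn if it's slow."""
    import socket
    import sys
    import time

    HOSTNAME = "partner.example.com"   # illustrative name that your systems depend on
    WARN_SECONDS = 0.5                 # illustrative threshold

    def main() -> int:
        start = time.monotonic()
        try:
            socket.getaddrinfo(HOSTNAME, None)
        except socket.gaierror as exc:
            print(f"CRITICAL: could not resolve {HOSTNAME}: {exc}")
            return 2
        elapsed = time.monotonic() - start
        if elapsed >= WARN_SECONDS:
            print(f"WARNING: resolving {HOSTNAME} took {elapsed:.2f}s")
            return 1
        print(f"OK: resolved {HOSTNAME} in {elapsed:.2f}s")
        return 0

    if __name__ == "__main__":
        sys.exit(main())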

Conclusions

Whatever you choose to monitor (or not monitor), the point here is that you can (and should) end up monitoring dozens, if not hundreds, of different touch-points on every server in the estate. Picking a monitoring platform that makes it easy to add these checks is absolutely crucial to making it happen. If there's too much friction, people will "forget" and you won't actually move forwards.

For help picking a monitoring platform, implementing one, or working out what to monitor, Pre-Emptive can help. Contact us to see how we can help you.

Image credit: https://flic.kr/p/bum5vd