The art of good alerting

Posted by : Steven Acreman | Monitoring Wisdom, Monitoring, Alerts

In a previous blog post I promised to write about the difference between monitoring and alerting. Often they get bundled together when people talk about 'monitoring systems' as a whole. Monitoring is really just about collecting data: storing state change information, time series metrics, maybe even some strings or JSON. Alerting is about getting notified about the things you care about. As is usually the case, it goes much deeper than that.

In my career I've been at companies where we've had absolutely no alerting, I've been at other companies where I've been bombarded with 5000 Nagios emails in an hour, and I've experienced almost everything in between. We'd go from hardly any coverage and pristine green dashboards up to several thousand checks in a short period of time, get inundated by alerts and then have to scale back the coverage until we could hear ourselves think. It became a bit of a joke: always taking two steps forward, panicking, and then taking one step back so that the person on-call wouldn't get woken up at 3am by accident. You can of course mitigate this with better change control, more testing and so on, but the bottom line is that if you're monitoring a bunch of things then you probably have better things to spend your time on. Things that don't make you take two steps forward and one step back, or consume your day with repetitive testing.

So what's the answer to this issue of getting flooded by alerts? Situations vary and I'm sure there is no silver bullet. But I can say what has worked for me in the past. The approach is very similar to how I plan monitoring, which was discussed in another blog post. You start off by collecting every piece of data that you care about, in context. Then you need to sit down and plan out exactly what you want to get alerted about. There is a temptation just to enable email alerts on every check script failure, which is probably the worst thing you can do. Why would you monitor things that you aren't alerting on? Because that data is valuable for trending, reporting and diagnostics, and may become something you alert on in future depending on observable behaviour or newly discovered information about your environments.

Not only is there the decision about what you want to get alerted on, but equally important, how you want to get alerted. The alert channel is critical as this is the way humans interface with a monitoring system. If our brains were computers and had the equivalent of StatsD built into our frontal cortex we'd probably be fine with receiving 5000 Nagios emails per hour.

Unfortunately, people tend to come with their own built-in credibility rating system. Once they lose confidence in the alerts, you end up in situations where monitoring and alerting are in place but nobody responds to a critical failure. A bit like the story of the boy who cried wolf.

Having had several years to mull this over, this is my list of channels and what usually goes into them. It isn't definitive and the last one is wishy-washy futurist nonsense, but this is a tech blog so I feel inclined to add a little un-reality :)

1. Wake me up!

There are a few things that need to keep working, otherwise they start to affect business value. For a SaaS company this might be that the site is working and you're able to collect money. You want a small number of bulletproof scripts that wake you up when critical services break. It's typical to hook these scripts up to PagerDuty or Twilio so you get alerted via SMS and phone calls.

A lot of the people we've spoken to have resorted to disabling critical checks in their internal monitoring systems and instead using a few checks from external services like Pingdom. This seems like a shame as Pingdom is extremely limited and only really lets you do the equivalent of a curl to see if your site is up. Hopefully when Dataloop.io is released people can start to use a single platform and trust it to only wake them up when something terrible has happened.

2. Tell me during work hours

Anything that isn't affecting business value should probably be relegated to at least this channel. If you're at work then you want to get alerted via email, chat room bot, Growl, RSS or whatever you pay attention to. Typically these will be alerts that tell you that redundant systems have broken, or that performance isn't what it should be. Group up these scripts and make sure you only send their alerts to someone when they are working.
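To make the split between channels 1 and 2 concrete, here is a minimal sketch of a routing function. The channel names, the `affects_business` flag and the 09:00 to 17:30 weekday window are all assumptions made up for illustration; they don't come from any particular monitoring tool.

```python
from datetime import datetime, time

# Hypothetical channel names, mirroring channels 1 and 2 above.
PAGE = "page"      # wake someone up (SMS or phone call)
NOTIFY = "notify"  # email or chat, delivered straight away
HOLD = "hold"      # queue until the next working period


def is_work_hours(now, start=time(9, 0), end=time(17, 30)):
    """Return True on Monday to Friday between start and end (assumed hours)."""
    return now.weekday() < 5 and start <= now.time() < end


def route_alert(affects_business, now):
    """Pick a channel for a single alert."""
    if affects_business:
        # Critical alerts always page, whatever the time of day.
        return PAGE
    # Everything else waits for someone who is actually at work.
    return NOTIFY if is_work_hours(now) else HOLD
```

Calling `route_alert(False, datetime.now())` at 3am returns `"hold"`, so a broken redundant node waits until morning rather than waking anyone up.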

3. Draw my attention to something

Even with all of the complicated machine learning, computers are still a bit stupid when it comes to spotting patterns in graph data. If you have some important metrics then graph them and throw them up on a screen somewhere. Maybe even rotate a few pages of them. I've had a few occasions where I've looked up and noticed something odd, investigated and fixed a problem before my alerting system got involved.

4. Subscribe me to something temporarily

Most monitoring tools are actually really bad at this. So this is kind of a wish list item (and something we're working on for Dataloop.io). The idea being that you may not always want to get alerted about something. You may be on a project, care about that area for a few weeks, then move onto something new and want to unsubscribe. A lot of times this is done via email distribution groups which is quite cumbersome.
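As a rough sketch of the idea, a temporary subscription can be as simple as a subscription with an expiry date attached. The `Subscriptions` class and its method names below are hypothetical and only illustrate the concept; this is not how Dataloop.io or any other tool implements it.

```python
from datetime import datetime, timedelta


class Subscriptions:
    """In-memory store of (person, topic) pairs, each with an expiry time."""

    def __init__(self):
        self._expiry = {}  # (person, topic) -> datetime when it lapses

    def subscribe(self, person, topic, now, days):
        """Subscribe person to topic for the given number of days."""
        self._expiry[(person, topic)] = now + timedelta(days=days)

    def recipients(self, topic, now):
        """Return the people whose subscription to topic is still active."""
        return sorted(
            person
            for (person, t), expires in self._expiry.items()
            if t == topic and now < expires
        )
```

Someone joining a project for a few weeks would call `subscribe("alice", "billing", datetime.now(), days=21)` and simply drop out of `recipients("billing", ...)` once the period has passed, with no distribution group to clean up.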

5. Show me something I can discuss with other groups

Strangely, I've had quite a lot of success alerting other groups to issues by just scripting a weekly report. It contains key business metrics, a + or - diff against the previous week in green and red to show the trend, and graphs with threshold lines on them. Combined with anomaly detection systems like Etsy's Kale stack you can have some quite liberating discussions across teams and plan to focus in on certain areas.
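The week-over-week diff in such a report is easy to script. The sketch below is one possible shape for it, with made-up metric names, and it assumes that higher is better for every metric (which won't hold for things like error counts, so a real report would need a per-metric direction).

```python
def weekly_diff(this_week, last_week):
    """Return {metric: (value, change, colour)} for metrics present in both weeks.

    Assumes higher is better: a rise or no change is green, a fall is red.
    """
    report = {}
    for name, value in this_week.items():
        if name not in last_week:
            continue  # nothing to compare against yet
        change = value - last_week[name]
        colour = "green" if change >= 0 else "red"
        report[name] = (value, change, colour)
    return report


# Hypothetical figures, for illustration only.
report = weekly_diff(
    {"signups": 120, "revenue": 900},
    {"signups": 100, "revenue": 1000},
)
for name, (value, change, colour) in report.items():
    print(f"{name}: {value} ({change:+d}, {colour})")
```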

6. Show me the future!

I got quite excited when the first iPhone had augmented reality apps. Even more so when Google announced Glass and I saw people could walk around with a heads-up display. Like with everything in my life I considered how that might affect monitoring and alerting systems. There is a definite trend towards more devices being connected to the internet. A lot of these have sensors, and the near field communication stuff is really interesting. There is the possibility that in the not too distant future you may want to use augmented reality as a channel for alerts when looking at certain objects. Being able to look at a fuel pipeline and see the rate of flow and pressure metrics, like Tony Stark can when he wears his helmet in the film Iron Man, would be pretty cool. The moment I find a practical application for this we'll start working on it as a channel for Dataloop.io; until then I'm holding back.

There was quite a bit in this post and I'm sure it is only scratching the surface. As you may know from other blog posts and the website, we're currently building a cool new SaaS monitoring tool. It's due for release in 2014 and we're about to start work on the alerting and alert handling components (as well as cool dashboards). If you have any feedback on how you think alerts should work we'd love to hear from you, or alternatively hit sign up and you can see our tool in action during our private beta.
