We use Riak extensively at Dataloop because it’s an awesome object store that gets bigger and faster as you add more nodes. But how do we monitor it?
As with most monitoring, I started by looking for some Nagios-format scripts a while ago. After experimenting with a few, I settled on check_riak.py from https://github.com/zorkian/nagios-plugins, which returns a lot of performance data and has some clever alert logic built in to detect multiple-node failures.
I was happy for a while, but I knew I wasn't getting _all_ of the possible data from Riak. Also, my Nagios scripts are set to run every 30 seconds, so the granularity isn't great when looking at graphs. Nobody likes a stuttery graph.
Adding a Graphite port to Dataloop has opened the door to streaming metrics in at an alarming rate. So today I set about trying to get _every_ piece of data possible from Riak into Dataloop via Collectd (my favourite tool right now). This post is relevant to anyone using Riak and Graphite; you don't need to be using Dataloop. In fact, you can't use Dataloop, because we haven't released yet, sorry.
After a bit of digging I came across http://www.the-eleven.com/tlegg/blog/2012/05/28/monitoring-riak-collectd-5/, which shows how Collectd can stream the data straight from the Riak stats interface using the curl_json plugin. This looked like a nice clean way to do it, but from inspecting the output of `curl http://localhost:8098/stats` I could see the config didn't map __every__ single piece of data available.
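For reference, the general shape of a curl_json config looks roughly like this. This is a minimal sketch, not the full mapping; the URL matches the stats endpoint above, but the two keys shown are just examples:

```
<Plugin curl_json>
  <URL "http://localhost:8098/stats">
    Instance "riak"
    <Key "node_gets">
      Type "gauge"
    </Key>
    <Key "vnode_gets_total">
      Type "counter"
    </Key>
  </URL>
</Plugin>
```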
Being the lazy person that I am, I decided to write some really bad Python to mangle the output into usable Collectd config. Piping the output of the stats curl into a file was easy, then `cat riakstats.txt | python -m json.tool` printed it nicely formatted. Some vi hackery removed everything except the metric names (plus a bit of manual cleanup to strip version numbers and text values).
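If you'd rather skip the vi hackery, the same extraction can be scripted, since the stats endpoint returns one flat JSON object. A sketch, with a made-up sample payload standing in for the real riakstats.txt:

```python
import json

# Made-up sample of what /stats returns; in practice you'd load the
# riakstats.txt file captured from curl http://localhost:8098/stats
sample = '{"node_gets": 0, "memory_total": 28212089, "vnode_gets_total": 5}'

stats = json.loads(sample)
for name in sorted(stats):
    print(name)
```

Sorting keeps the list stable, which makes later diffs against new Riak versions easier.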
I won’t bore you with the full details but essentially I got all of the possible metrics into a file called riaksource.txt, one per line. There were 131 in total!
Now, unfortunately, that isn't the end of the story. Collectd needs a Type for each key, which says whether it's a gauge, a counter, bytes, and so on. Being the 80/20 person I am, I quickly knocked up this script to guess them:
```python
#!/usr/bin/env python
source = open("riaksource.txt", "r")
for line in source:
    key_type = 'gauge'
    if 'avg' in line:
        key_type = 'gauge'
    if 'total' in line:
        key_type = 'counter'
    if 'memory' in line:
        key_type = 'bytes'
    print """<Key "%s">
    Type "%s"
</Key>""" % (line.strip(), key_type)
```
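One subtlety worth noting: the order of the checks is the precedence. A name like memory_total matches both the 'total' and 'memory' rules, and because the memory check runs last, it wins and the metric comes out as bytes. Here's the same heuristic restated as a function (my own restatement for illustration, not part of the script above), exercised with a few real Riak stat names:

```python
def guess_type(name):
    # Same heuristic as the script above: default to gauge, let
    # 'total' mean counter, and let 'memory' override to bytes.
    key_type = 'gauge'
    if 'total' in name:
        key_type = 'counter'
    if 'memory' in name:
        key_type = 'bytes'
    return key_type

print(guess_type('vnode_gets_total'))        # counter
print(guess_type('memory_total'))            # bytes
print(guess_type('node_get_fsm_time_mean'))  # gauge
```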
That gave me what I think is every single metric, and with most of the correct types set. Well, they look correct from a brief skim down the list anyway.
The final config is in this gist: https://gist.github.com/sacreman/10554100
If anyone notices an incorrect type, send me a message and I'll update it manually, or alternatively comment on the gist.
Anyway, that was a fun hour from start to finish. And right now I have 131 scrolling graphs showing me what Riak is doing, as well as my Nagios check script. Once I've got a good baseline and have worked out what every single number means I'll start adding some alert rules.
One day I need to investigate Basho's own Nagios scripts at https://github.com/basho/riak_nagios and see if they would be better than the Nagios script I found a while ago. But that can wait, I’ll probably take on RabbitMQ next.