In my Monitorama pitch in May, I focused on one of the core problems we see with today's monitoring tools: Adoption. By Adoption, I don't mean adoption of the monitoring tools within the Operations team, but by all the teams that interact with Operations, such as Developers, QA, Support and Product Managers. Struggling with this issue at our last company was one of the core reasons we launched Dataloop.IO.
At our last company we spent months building a monitoring system using Nagios, Graphite and a bunch of custom reporting components. We also went down a few dead ends with Sensu, but after some time we had a monitoring system that worked with our environments on AWS. However, no matter how much we customized Nagios and Graphite, no one outside our Operations team knew how to use or modify the system, or would even log in to see the metrics and graphs we'd spent months setting up for them. But who cares, right? Our Operations team was happy. They had spent years working with these tools, so they knew how to use them and how to write Puppet modules to configure and modify them.
Reducing Release Bottlenecks Through Self-Service
The problem was that we really cared about DevOps. We were putting in processes and tools to enable faster releases from the different product teams into our online service, and we wanted to remove as much friction as possible so code could get pushed through quickly. To do this we needed not only to automate, but also to provide self-service tools to other teams like Developers and QA, so they could quickly deploy and test new releases without having a PhD in Operations or Nagios.
One of the tools we built was a Deployment Portal. We developed a beautiful web tool that allowed anyone in Engineering, QA or Support to log in, select a build, and spin up a new environment exactly like production for development or testing, by themselves, in minutes. Within a week we went from 4-5 environments to 15-20 as different people set up new environments to test their releases (we'd destroy them at the end of the day to save costs). Not only did it make our releases faster, it also removed the Operations team from the process. Operations provided the tool with an easy-to-use UI, and everyone else could push out and test releases quickly without Operations being involved and becoming a constraint.
We wanted to do this with monitoring as well. The biggest cause of our spammy alerts and uncaught production issues was the difficulty of keeping our monitoring coverage up to date. As releases were pushed out, our production environment changed quickly, which meant our monitoring configuration and check scripts also needed to change to give us full coverage and no false alerts from out-of-date check scripts. However, the Operations team were the only people who knew how to modify our monitoring system. Whenever a new monitoring check was required, it became another task on our already long JIRA backlog, and it would usually take several days before the check was added to production.
What we wanted, as with the Deployment Portal, was for teams outside Operations to be able to go into our monitoring system and easily set up their own checks, dashboards and alerts as a self-service tool. They could then plan this monitoring work into their sprints and do it themselves. The problem with all the popular Open-Source tools is that, although we could get them to work with our dynamic Cloud environments (albeit with a lot of customisation), they are all too complex for non-Operations people to easily use and adopt. Like our Deployment Portal, we needed an easy-to-use GUI that they could log in to and use to configure the monitoring themselves. This would remove our Operations team from the process and reduce the friction of keeping our monitoring configuration up to date as our environment changed between releases, allowing us to release faster.
Tearing Down Data Silos
The other issue we had with the Open-Source tools was getting other users, especially management, to log in and view the metrics and graphs we were collecting. Operations sometimes forget they are collecting a wealth of really important metrics in real time across the service, such as users online, hosting costs and uptime; all metrics that management care about for an online business. However, because the Open-Source tools tend to have dated, difficult-to-use UIs, teams outside Operations rarely log in to get these metrics for themselves.
Many Operations teams resort to manually creating reports for management in Excel, or building custom Dashboards using frameworks like Dashing that can be shared with other teams. In our case we wrote a brittle Ruby script that emailed the management team every Monday with all the key metrics, comparing this week with last week and showing the change, along with key performance graphs for different parts of the service like logging in and search. Once this data was taken out of Operations and shared with management, it was amazing how quickly Development effort was refocused onto key parts of the service. We had raised issues with these areas before, but with real data we could now have a constructive, data-driven conversation about them.
However, the ultimate monitoring tool for us would be easy enough to use that teams outside Operations could collect and browse the metrics they cared about, and set up their own Dashboards using drag and drop, without creating additional JIRA tasks for the Operations team to develop custom Dashboards. This would mean everyone in the organization could follow the metrics they cared about, opening up the Data Silos typically created by the Open-Source monitoring tools.
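To make the weekly report idea concrete, here is a minimal sketch of the week-over-week comparison in Python. It is not our original Ruby script; the function names, metric names and values are all made up for illustration, and the step that actually sends the email is left out.

```python
def week_over_week(this_week, last_week):
    """Return (metric, previous, current, percent_change) rows, sorted by metric name.

    percent_change is None when there is no usable baseline
    (metric missing last week, or last week's value was zero).
    """
    rows = []
    for name in sorted(this_week):
        current = this_week[name]
        previous = last_week.get(name)
        if previous is None or previous == 0:
            change = None
        else:
            change = (current - previous) / previous * 100.0
        rows.append((name, previous, current, change))
    return rows


def format_report(rows):
    """Render the rows as the plain-text body of a summary email."""
    lines = ["Weekly metrics summary"]
    for name, previous, current, change in rows:
        delta = "n/a" if change is None else f"{change:+.1f}%"
        lines.append(f"{name}: {previous} -> {current} ({delta})")
    return "\n".join(lines)


# Hypothetical values, purely for illustration.
last_week = {"users_online_peak": 1000, "hosting_cost_usd": 2000.0}
this_week = {
    "users_online_peak": 1100,
    "hosting_cost_usd": 1900.0,
    "uptime_percent": 99.9,
}

report = format_report(week_over_week(this_week, last_week))
print(report)
```

A metric that appears for the first time, like the uptime figure here, is reported with "n/a" rather than a misleading percentage. In practice the inputs would come from whatever store holds the metrics, which is exactly the kind of glue code that made our own script brittle.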
And the Problem Gets Worse with Micro-Services
We've now spoken to almost 100 online services about their monitoring issues, and one of the key trends we've seen is a move to Micro-Services. Probably one of the best examples shared publicly is Spotify, which organizes its 24 product teams around 100 Micro-Services. While we never got to this model at our last company, a lot of online services are moving towards it because it allows them to scale complex services while staying agile, and we've spoken to a lot of larger services making strategic moves towards Micro-Services.
From talking to the largest services, we've seen that they're building their own internal monitoring solutions, shared as a self-service tool so the different product teams can set up monitoring, dashboards and alerts for the Micro-Services they own. Unfortunately, not everyone has the engineering talent and resources of a Netflix or Twitter, so most online services are again stuck spending significant time and resources trying to customize the Open-Source tools to work with this model.
How We're Solving Adoption Today
Like us, a lot of online services are bridging the adoption gap with their monitoring systems by developing custom Configuration Management modules and Dashboards so other teams can set up their own monitoring. There have also been a lot of new Open-Source projects (such as Grafana) trying to improve the user experience and adoption of the existing Open-Source tools. While these make it easy for other teams to access the data, you still need to fall back to the complex world of Nagios to set up new checks and alerts.
What's needed is an end-to-end solution that makes collection, visualization and alerting easy to use and self-service, without Operations teams having to write custom Configuration Management and bespoke tools for their organizations.
In my next blog post I will walk through some of the ideas and features we're working on at Dataloop.IO so that monitoring can be shared outside Operations as a self-service solution, while still giving Operations the visibility and tools they need across the whole service.