When it comes to service status pages, most of us feel they are more marketing gimmick than fact. With Amazon Web Services, for example, the first you hear of a problem is not from the status page. It is when Twitter catches fire with people complaining about poor service. The trend is alarming, and it is not just Amazon: almost all service providers do the same thing. For some reason, special authorisation is required to update the status page. Special people need to confirm that this is the right marketing move for the business. That's not how we work.
Monitoring at scale is a hard task, so people often ask us what our architecture looks like. The reality is that it's constantly changing. This post aims to capture our current design, based on what we've learnt to date. It may all be different in another year. For some background, we started Dataloop.IO just under 18 months ago. Before then we had all been involved in creating SaaS products at various companies, where monitoring and deployments were always a large part of our job.
Knowledge is a powerful thing, and a good monitoring solution should provide a wealth of data to help drive intelligent decisions. To complicate things slightly, there is a sharp distinction between monitoring (observing) and alerting (notifying). The two are intrinsically linked but should be looked at separately, which I'll attempt to cover in a future blog post. Without going too deeply into that, there is a basic question to be answered.