Effective monitoring and observability are more important than ever in the technology space, where systems and applications are ever more sophisticated. Beyond simple monitoring, observability is about getting a thorough grasp of how your systems are operating and seeing problems before they become outages, which significantly reduces downtime and customer impact.
Moving from Shadows to Clarity
We'll dive into the idea of observability in this three-part blog series, going over its main ideas, advantages and practical applications, and how to deliver value through informed SLIs/SLOs and user journeys. We'll also go over the different methods and techniques you can use to get the most out of your Infrastructure & Applications.
If you are new to observability, by the end of this series you will be up and running with Observability powered by OpenTelemetry and defining your SLIs/SLOs like a pro.
Unfortunately, placing all local development under the microscope and all clouds under the telescope is a large endeavour, so let's talk through the advantages & process.
What is Observability? O11y?
My answer has always been:
The ability to determine internal application & system state from the outputted or resulting data & interactions.
It is an attribute of any system that describes how well you can understand the inner workings of that system without having to access or change it directly.
Let's break this down with an example:
If I drop a tennis ball down an empty pipe and it does not fall out the other end, we could immediately assume it is stuck in the pipe. Observability is the overall measure of how accurately we can determine where the ball is stuck and why, without modifying or cutting the pipe.
Observability vs Monitoring, Bringing Shadows into view
Monitoring is the art of observing the known: preconfigured measurements of signals/metrics/logs that a team can expect or trust to indicate issues/problems.
A good example of this is a patient's vital signs (Heart Rate, Oxygen, Temperature and Blood Pressure). Or, for our more metal-based friends, the Four Golden Signals:
Latency
Traffic
Errors
Saturation
The above four signals cover the bare necessities of monitoring. From these metrics we can infer the health of our patient, or glimpse some insight into its condition. They do not, however, provide in-depth information as to what is occurring under the covers.
To apply them to our above tennis ball example we would get:
How long it took for the tennis ball to fall through the pipe (Latency)
How many tennis balls are travelling through the pipe. (Traffic)
404 Tennis Ball not found. (Errors)
How many tennis balls are falling down the pipe versus the total number it can hold at any one time. (Saturation)
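As a rough illustration of the mapping above, here is a minimal sketch that derives the four signals from a window of request records. The `Request` type, the `golden_signals` function, the choice of median for latency and the "status of 400 or above counts as an error" rule are all assumptions made for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One observed request (one tennis ball dropped into the pipe)."""
    duration_ms: float  # how long it took to come out the other end
    status: int         # HTTP-style status code (hypothetical convention)


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise one time window as the Four Golden Signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    failed = sum(1 for r in requests if r.status >= 400)
    return {
        "latency_ms": median(r.duration_ms for r in requests),  # Latency
        "traffic_rps": len(requests) / window_seconds,          # Traffic
        "error_ratio": failed / len(requests),                  # Errors
        "saturation": in_flight / capacity,                     # Saturation
    }


# Four balls observed over two seconds; the pipe holds 10 and 3 are inside.
window = [Request(120, 200), Request(80, 200), Request(300, 404), Request(100, 200)]
signals = golden_signals(window, window_seconds=2, in_flight=3, capacity=10)
print(signals)
```

Note that the result tells us one ball in four went missing, but nothing about where in the pipe or why, which is exactly the limit of these signals.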
Below are screenshots of the stock monitoring that Google Cloud Platform (GCP) and Amazon Web Services (AWS) deploy for Google Compute Engine instances & Amazon EC2 instances alike. As described, both dashboards follow the Four Golden Signals methodology: if you are going to have monitoring, at least cover the four areas of Latency / Traffic / Errors / Saturation.
(Google's Stock Monitoring for Instances)
(AWS Stock Monitoring for Instances)
While the above gives us a starting point for monitoring, it does not provide visibility into our use of Infrastructure, the health of our application or the user's experience. This is where traditional monitoring starts to fall short of Observability.
Observability as a practice is the collection of Metrics, Logs and Traces (the 3 pillars of Observability) so that we can explore, identify, alert on and troubleshoot potential issues, not just known issues.
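To make the three pillars a little more concrete, below is a hand-rolled sketch using only the Python standard library. It is not the OpenTelemetry API; `handle_request`, `explain` and the in-memory `metrics`, `logs` and `spans` stores are hypothetical names invented for this illustration. The point it tries to show is that one request produces all three signals, and a shared trace id lets us walk from one to the others.

```python
import time
import uuid
from collections import Counter

metrics = Counter()  # pillar 1: aggregated numbers
logs = []            # pillar 2: discrete events with context
spans = []           # pillar 3: timed units of work, linked by trace id


def handle_request(path):
    """Handle one hypothetical request and record all three signals."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status = 404 if path == "/missing" else 200  # stand-in for real work
    duration_ms = (time.perf_counter() - start) * 1000

    metrics[("requests_total", status)] += 1
    logs.append({
        "trace_id": trace_id,
        "level": "ERROR" if status >= 400 else "INFO",
        "message": f"{path} -> {status}",
    })
    spans.append({"trace_id": trace_id, "name": f"GET {path}", "duration_ms": duration_ms})
    return trace_id


def explain(trace_id):
    """Starting from one trace id, pull the related logs and spans."""
    return {
        "logs": [entry for entry in logs if entry["trace_id"] == trace_id],
        "spans": [span for span in spans if span["trace_id"] == trace_id],
    }


ok_id = handle_request("/ball")
bad_id = handle_request("/missing")
evidence = explain(bad_id)
print(metrics[("requests_total", 404)], evidence["logs"][0]["message"])
```

The metric says that something failed, while the log and span tied to the same trace id say which request it was and how long it took.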
Monitoring will help DevOps/SRE Teams determine if an application is functioning/online, while a good Observability practice would help:
Identify slow burn errors
Identify system instabilities
Root cause analysis
Increase speed of troubleshooting during outages
Offer insight to potential system improvements
Reduce Impact to the user experience
Reduce Alert Fatigue
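As one illustration of the first item, the sketch below contrasts a simple threshold alert with a check for a slow burn. The function names, the 5% alert threshold, the 1% baseline and the sample error rates are all made-up assumptions for this example.

```python
def threshold_alert(error_rates, threshold=0.05):
    """Classic monitoring: fire when any single interval crosses the threshold."""
    return any(rate > threshold for rate in error_rates)


def slow_burn(error_rates, baseline=0.01, min_intervals=6):
    """Flag an error rate that stays above the baseline for many intervals in a row."""
    streak = 0
    for rate in error_rates:
        streak = streak + 1 if rate > baseline else 0
        if streak >= min_intervals:
            return True
    return False


# Hypothetical hourly error ratios: never dramatic, but never recovering either.
hourly_error_rates = [0.005, 0.02, 0.02, 0.03, 0.02, 0.03, 0.02, 0.03]

print(threshold_alert(hourly_error_rates))  # no single hour is bad enough to page anyone
print(slow_burn(hourly_error_rates))        # yet the errors have persisted for hours
```

With these sample numbers the threshold alert stays quiet while the slow-burn check fires, which is the kind of gap a good Observability practice aims to close.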
The gap between Monitoring and Observability can be highlighted through the graph below:
Monitoring will help solve:
Known Knowns - Things you know and understand. Often these are preconfigured dashboards or alerts for previous issues/outages or commonly experienced problems.
Known Unknowns - Things you know about but do not understand, or that are unclear beyond the result, e.g. there are security vulnerabilities that exist today but have not been discovered yet.
Observability targets the other 2 squares, aiming to move cases/issues from there down to Known Knowns.
Unknown Knowns - Things you are not quite sure about but whose result you are aware of, e.g. the WebApp has just stopped working for a particular group of users, with no change in the system or in current reporting.
Unknown Unknowns - A spontaneous and unexpected outage with no warning, change or detectable result, leaving the current status unclear.
Getting Started with Observability
When adding Observability throughout a Tech-Stack we are actively aiming to bring down the number of unknowns, which enable widespread disasters, outages & incidents.
Our journey with OpenTelemetry should deliver:
Better Incident / Outage Detection
Faster Response / Recovery
Less Burn-out amongst engineers
Less reliance on Specialists / Subject Matter Experts (SMEs)
Better codebase and Day 1 feature launches
Application / Architecture Improvements
End-User based monitoring
More Knowns, Fewer Unknowns
To drive Observability into product teams / engineers / developers, try addressing the following questions with the teams:
What was the cause of the last/worst outage?
At what point did you know there was an issue/problem affecting end users?
For how long had it been possible for this to occur?
Were you best equipped to deal with this outage?
What was the metric, log or trace you wish you had before this?
Are you prepared for the next outage?
Next Up: OpenTelemetry
My follow-up blog will discuss setting up Observability for visualisation and analysis using OpenTelemetry, with some steps you can run side by side to demonstrate, or start to break down, the benefits of its use.
Afterwards we will use the above information and details to help inform and drive the SLIs/SLOs that track the use cases and interactions our customers have every day.
(There is not any monitoring configured to tell me if you clicked on the next blog post.......Yet.)
About Innablr
Innablr is a leading consultancy for cloud native, observability and Kubernetes. Frequently championing community events, delivering thought leadership and leading practices, Innablr is recognised in the Australian market as one of the most experienced providers of cloud solutions.
Talk to us about our Blueprints and Observability Patterns for Google Cloud and Amazon Web Services, whether it is Cloud Native, Serverless, or a Kubernetes deployment in Google Kubernetes Engine (GKE) or Elastic Kubernetes Service (EKS).
Matthew Callinan, Senior Engineer @ Innablr