Observability is your Best Friend
With the rapid pace at which apps and services are being built in 2021, observability has become a non-negotiable must-have for the systems and services that form the backbone of your business. Just as key business metrics feed and drive business requirements and decisions, observability and key performance and availability metrics inform the technical decisions that ensure maximum uptime and excellent performance, and highlight the problem areas to focus on.
Let’s start with what observability is, how you get it implemented across your organisation, and what the considerations are.
What is Observability?
Observability is the ability to keep an eagle’s eye on all core systems and to deep-dive into specific problem areas when needed. Most people confuse observability with APM (Application Performance Management) tools and solutions. APM solutions are intended to provide observability, but the implementation and interpretation are up to us, and we usually get both wrong.
To expand on observability even further: consider monitoring tools that gather enough performance and availability time-series metrics and report them to a centralised system for further aggregation. This aggregated data is then transformed into visualisations or actionable insights that give us a quick understanding of system performance and health. In addition, an alerting system configured with the right thresholds can be set up to trigger alarms when system performance deteriorates, while self-healing capabilities kick in to help recover from or resolve issues.
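As a rough illustration of the alerting step, the sketch below checks a series of time-series samples against a threshold and only fires after several consecutive breaches, which is one common way to avoid alarms on a single spike. The `ThresholdAlert` class, the metric name and the threshold value are all hypothetical and not taken from any particular monitoring product.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ThresholdAlert:
    """Fires when a metric exceeds a threshold for N consecutive samples."""
    metric: str
    threshold: float
    consecutive: int = 3  # hypothetical default; tune per metric
    _breaches: int = field(default=0, init=False)

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the alarm should be firing."""
        if value > self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0  # a healthy sample resets the streak
        return self._breaches >= self.consecutive


def evaluate(alert: ThresholdAlert, samples: List[Tuple[int, float]]) -> List[int]:
    """Return the timestamps at which the alarm would be firing."""
    return [ts for ts, value in samples if alert.observe(value)]


# Illustrative samples: (seconds since start, response time in ms)
samples = [(0, 120.0), (60, 480.0), (120, 510.0), (180, 530.0), (240, 140.0)]
alert = ThresholdAlert(metric="response_time_ms", threshold=400.0)
firing_at = evaluate(alert, samples)
print(firing_at)
```

In a real setup the firing timestamps would be handed to a notification or self-healing hook rather than printed.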
Monitoring is NOT Observability
Many people will argue that they have configured monitoring and hence have observability across their systems. This is false. The two are distinct concepts, although they depend on each other. Monitoring contributes to observability, but it is not sufficient on its own. Along with monitoring, we also need application logging, tracing parameters and the right visualisations for the right audiences. Showing network failures to an application team makes no sense, and neither does showing performance metrics such as heap usage to business users.
The Four Pillars of Observability
Observability consists primarily of –
- Logging of key steps in your system along with different request metrics
- Monitoring of your system performance and behaviour
- Traceability of your requests using common identifiers
- Visualisation to help observe the above and make meaningful decisions
Logging or otherwise capturing key metrics and key observation points from your applications or systems into a time-series database is critical, both for getting important feedback on an existing system and for building that feedback into a new one. The need here is to follow a standard logging format for each stack and to use a fast log aggregation and indexing tool to help visualise what your assessment requires.
On other occasions, especially for more technical assessments, it is key to have infrastructure metrics captured in production to help establish capacity requirements or identify bottlenecks. This is where a good Application Performance Management (APM) solution is handy. An investment in a good APM solution goes a long way towards understanding performance needs, establishing which areas to focus on and making critical business decisions, while also providing some RUM (Real User Monitoring) for key user journeys. APM solutions also help in understanding user behaviours that impact performance, which simple log capture does not always reveal.
Where do you begin?
You need to start by ensuring two capabilities are built into your systems.
- Effective logging
- Process-level metric collection
Ensure your system writes logs in a format that a log aggregation and indexing tool can search through easily. Eliminate unnecessary logs, such as those recording every method entry and exit, and avoid simply dumping large volumes of logs (this could end up being a performance issue). Capture key business metrics as name/value pairs so that they can easily be extracted as fields and indexed for later use.
This will help you quickly look at the business side of things, and specifically the workload.
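A minimal sketch of name/value logging with Python's standard `logging` module might look like the following. The `kv` helper, the `orders` logger name and the field names are assumptions made for illustration; the exact format you need depends on what your log aggregation tool parses best.

```python
import io
import logging


def kv(**fields) -> str:
    """Render fields as space-separated key=value pairs, quoting values that contain spaces."""
    parts = []
    for key, value in fields.items():
        text = str(value)
        if " " in text:
            text = '"' + text.replace('"', "'") + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes key=value lines at INFO level and above."""
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)  # keeps noisy DEBUG entry/exit logs out
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s level=%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for a log file or stdout
log = build_logger(stream)
log.info(kv(event="order_placed", product="home loan", amount=250000, duration_ms=182))
log.debug(kv(event="enter", method="validate"))  # dropped: below INFO
output = stream.getvalue()
print(output, end="")
```

Each business metric ends up as its own field (`product`, `amount`, `duration_ms`), so an indexer can extract it without custom parsing rules.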
- Which kinds of requests take the longest to respond?
- What kind of product or service are users requesting at what time of the day?
- How does a typical daily workload compare with a peak seasonal period?
- Does your service behave differently at different periods of the day/week/month?
These are all good questions that a log aggregation and indexing tool can help answer. Examples include Splunk and the ELK stack.
Then let us focus on the technical side of things. How do you get technical observability across your stack that gives you a holistic view? In the good old days we all had metric capture tools that collected different metrics from a system and wrote them into a graph somewhere for someone to make sense of.
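To make the workload questions concrete, here is a small sketch of the kind of aggregation such a tool performs once logs are indexed as fields. It is plain Python over a handful of made-up records, not the query language of Splunk or ELK; the field names (`request`, `product`, `duration_ms`) and the data are assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical, already-parsed log records.
records = [
    {"ts": "2021-03-01T09:15:00", "request": "quote", "product": "car", "duration_ms": 120},
    {"ts": "2021-03-01T09:40:00", "request": "purchase", "product": "car", "duration_ms": 900},
    {"ts": "2021-03-01T13:05:00", "request": "quote", "product": "home", "duration_ms": 180},
    {"ts": "2021-03-01T13:20:00", "request": "purchase", "product": "home", "duration_ms": 1100},
    {"ts": "2021-03-01T13:45:00", "request": "quote", "product": "home", "duration_ms": 150},
]


def slowest_requests(rows):
    """Average duration per request type, slowest first."""
    durations = defaultdict(list)
    for row in rows:
        durations[row["request"]].append(row["duration_ms"])
    averages = {name: mean(values) for name, values in durations.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def demand_by_hour(rows):
    """Number of requests per (hour of day, product)."""
    counts = Counter()
    for row in rows:
        hour = datetime.fromisoformat(row["ts"]).hour
        counts[(hour, row["product"])] += 1
    return counts


slowest = slowest_requests(records)
by_hour = demand_by_hour(records)
print(slowest)
print(by_hour)
```

Grouping by day of week or month instead of hour follows the same pattern and answers the seasonal comparison questions.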
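For a feel of what basic process-level metric collection involves, the sketch below samples a few counters on a timer using only the Python standard library. The metric names and the choice of counters are assumptions; a real collector would also ship each sample to a time-series store rather than keep it in a list.

```python
import gc
import threading
import time
from typing import Dict, List


def sample_process_metrics() -> Dict[str, float]:
    """Take one snapshot of a few process-level metrics."""
    return {
        "timestamp": time.time(),
        "cpu_seconds": time.process_time(),  # CPU time used by this process so far
        "threads": threading.active_count(),
        "gc_tracked_objects": len(gc.get_objects()),
    }


def collect(samples: int, interval_s: float) -> List[Dict[str, float]]:
    """Collect a short time series of snapshots at a fixed interval."""
    series = []
    for _ in range(samples):
        series.append(sample_process_metrics())
        time.sleep(interval_s)
    return series


series = collect(samples=3, interval_s=0.01)
for point in series:
    print(point)
```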
In today’s world we have APM tools (just love ’em), as long as you configure and fine-tune them to meet your needs.
APM tools can capture key performance metrics and key user behaviour metrics and help you make sense of them, which in turn helps you make better decisions. If you define what your key transactions are and add the right thresholds to monitor any deviation, this capability can help you avoid a lot of issues in production.
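The idea of key transactions with thresholds can be sketched as follows. This is not how any specific APM product is configured; the transaction names, the latency budgets and the choice of the 95th percentile as the measure are assumptions for illustration.

```python
from statistics import quantiles
from typing import Dict, List

# Hypothetical key transactions and their p95 latency budgets in milliseconds.
THRESHOLDS_MS: Dict[str, float] = {"login": 300.0, "checkout": 800.0}


def p95(values: List[float]) -> float:
    """95th percentile using the standard library's inclusive method."""
    return quantiles(values, n=100, method="inclusive")[94]


def deviations(latencies: Dict[str, List[float]]) -> Dict[str, float]:
    """Return transactions whose p95 exceeds the budget, with the overshoot in ms."""
    breaches = {}
    for name, budget in THRESHOLDS_MS.items():
        observed = latencies.get(name)
        if not observed or len(observed) < 2:
            continue  # not enough data to compute a percentile
        overshoot = p95(observed) - budget
        if overshoot > 0:
            breaches[name] = round(overshoot, 1)
    return breaches


# Illustrative latencies collected over one monitoring window.
latencies = {
    "login": [200.0] * 20,
    "checkout": [700.0] * 10 + [1200.0] * 10,
}
breaches = deviations(latencies)
print(breaches)
```

Only `checkout` is reported here, because its slowest requests push the 95th percentile past the budget while `login` stays within its limit.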
In the world of APIs and microservices, the common identifiers are correlation IDs. However, you can also use more business-specific identifiers such as account numbers, user IDs, etc. to identify a transaction end-to-end or to follow user behaviour across a group of services.
This helps with timely investigations and with tracing issues.
Making sense of heaps of data from different sources in a way that leads to meaningful decisions is an art. In my experience, it is usually the analysis of different events that drives what is required and helps build the visualisation requirements, which in turn feed into logging and monitoring requirements.
Most log aggregation tools provide visualisation capabilities; however, there are multiple third-party visualisation tools out there that help make more sense of data than ever.
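One way to carry a correlation ID through a request, sketched here with Python's `contextvars` and `logging` modules, is to store it once per request and stamp it onto every log line. The `X-Correlation-ID` header name is a common convention rather than a standard, and the service names and account number are made up.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID for the request currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copies the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()  # stands in for the shared log destination
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("correlation_id=%(correlation_id)s %(name)s %(message)s"))

for name in ("gateway", "accounts"):  # hypothetical service names
    svc = logging.getLogger(name)
    svc.handlers.clear()
    svc.addHandler(handler)
    svc.setLevel(logging.INFO)
    svc.propagate = False


def handle_request(headers: dict) -> str:
    """Reuse the caller's ID if present, otherwise create one, then log along the call path."""
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logging.getLogger("gateway").info("request received")
        logging.getLogger("accounts").info("balance lookup account_id=12345")
        return cid
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request


cid = handle_request({"X-Correlation-ID": "abc-123"})
lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

Searching the aggregated logs for `correlation_id=abc-123` then returns every step of that one transaction across both services, and the `account_id` field gives a business-level handle on the same request.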
>>> This blog is incomplete as some excerpts had to be replaced. I will be adding further to this area in the near future. >>>