Observability is your Best Friend
With the rapid pace at which apps and services are being built in 2021, it is key to ensure you have overall observability across your system or services.
How do you get observability implemented across your services and what are the considerations?
When we set out to assess the testing needs of an application or service and to ensure it is ready for Production, one of the key areas is making sure it can handle the different Workload Models, Application Usage Models (AUM) or Application Simulation Models (ASM).
We derive these from different sources. At times the business gives you a maximum number captured on a random day and terms it the Peak Load. Other times the numbers are captured during peak days, such as a seasonal rush or business-specific peak times. However, what we most often get wrong or lack is key observability points in the system, so that key metrics can be captured when the need arises and the method of capture and calculation actually helps in understanding user behaviour.
Observability consists primarily of:
- Logging of key steps in your system along with different request metrics
- Monitoring of your system performance and behaviour
- Traceability of your requests using common identifiers
- Visualisation to help observe the above and make meaningful decisions
Logging and capturing key metrics at key observation points in your application or service (system) is critical to getting good feedback from an existing system, and these points should be built in from the start for a new system. The need here is to follow a standard set of logging libraries for each stack and to use a fast log aggregation and indexing tool to help visualise what is required for your assessment.
On other occasions, especially for a more technical assessment, it is key to have infrastructure metrics captured in production to help establish capacity requirements or identify bottlenecks. This is where good Application Performance Management (APM) tools come in handy. An investment in a good APM tool goes a long way towards understanding performance needs, establishing which areas to focus on, and making critical business decisions. APM tools also help in understanding user behaviours that impact performance, which simple log capture does not always reveal.
Where do you begin?
You need to start by ensuring two capabilities are built into your systems.
- Effective logging
- Process-level metric collection
Ensure your system writes logs in a format that a log aggregation and indexing tool can parse easily. Eliminate unnecessary logs, such as those recording every method entry and exit, and avoid simply dumping large volumes of logs (this could end up being a performance issue). Capture key business metrics as name/value pairs so they can easily be extracted as fields and indexed for later use.
This will help you quickly look at the business side of things, and specifically the workload.
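As a rough illustration of the name/value idea, here is a minimal Python sketch using only the standard `logging` module. The logger name `orders`, the helper `format_fields` and the fields `product`, `channel` and `duration_ms` are hypothetical names chosen for the example; a real system would follow whatever field conventions its log aggregation tool expects.

```python
import logging
import sys


def format_fields(event, **fields):
    """Render an event name plus name/value pairs as 'event=... key=value ...'."""
    parts = [f"event={event}"]
    for name, value in fields.items():
        text = str(value)
        # Quote values containing spaces so field extraction stays unambiguous.
        if " " in text:
            text = '"' + text.replace('"', "'") + '"'
        parts.append(f"{name}={text}")
    return " ".join(parts)


def build_logger(stream):
    """Create a logger that writes timestamped key=value lines to the given stream."""
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s level=%(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    # One line per business event, not one line per method entry and exit.
    log.info(format_fields("order_placed", product="home loan",
                           channel="web", duration_ms=182))
```

Each business event becomes a single line whose fields can be indexed directly, which keeps log volume down compared with tracing every method call.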
- What kind of requests take the longest to respond?
- What kind of product or service are users requesting at what time of the day?
- How does a typical daily workload compare with a peak seasonal period?
- Does your service behave differently at different periods of the day/week/month?
These are all good questions that a log aggregation and indexing tool, such as Splunk or the ELK stack, can help answer.
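To make the questions above concrete, the following Python sketch shows the kind of calculation such a tool performs once fields have been extracted. The records and the field names `hour`, `request` and `duration_ms` are made up for illustration; in practice you would express this in your tool's own query language rather than hand-rolled code.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records, as they might look after field extraction from logs.
RECORDS = [
    {"hour": 9, "request": "search", "duration_ms": 120},
    {"hour": 9, "request": "checkout", "duration_ms": 480},
    {"hour": 12, "request": "search", "duration_ms": 150},
    {"hour": 12, "request": "checkout", "duration_ms": 520},
    {"hour": 12, "request": "search", "duration_ms": 90},
]


def slowest_request_types(records):
    """Return (request type, mean duration) pairs, slowest first."""
    durations = defaultdict(list)
    for record in records:
        durations[record["request"]].append(record["duration_ms"])
    averages = {name: mean(values) for name, values in durations.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def requests_per_hour(records):
    """Count requests in each hour of the day."""
    return Counter(record["hour"] for record in records)


def peak_hour(records):
    """Return the (hour, request count) pair with the most requests."""
    return requests_per_hour(records).most_common(1)[0]
```

Running the same functions over a normal day and over a seasonal peak day gives a simple, like-for-like comparison of the two workloads.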
Then let us focus on the technical side of things. How do you get technical observability across your stack that gives you a holistic view? In the good old days we all had metric capture tools that collected different system metrics, plotted them on a graph somewhere, and left someone to make sense of it all.
In today’s world we have APM tools (just love ’em), which work well as long as you configure and fine-tune them to meet your needs.
APM tools can capture key performance metrics and key user behaviour metrics and help you make sense of them, which in turn helps you make better decisions. If you define what your key transactions are and add the right thresholds to monitor any deviation, this capability can help you avoid a lot of issues in production.
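The threshold idea can be sketched in a few lines of Python. This is not how any particular APM tool works internally; the transaction names, the threshold values and the choice of a nearest-rank 95th percentile are all assumptions made for the example.

```python
import math

# Hypothetical key transactions and their p95 latency thresholds in milliseconds.
THRESHOLDS_MS = {"login": 300, "checkout": 800}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def find_breaches(samples, thresholds, pct=95):
    """Return {transaction: observed percentile} for transactions over their threshold."""
    breaches = {}
    for name, limit in thresholds.items():
        values = samples.get(name)
        if not values:
            continue  # No data for this transaction in the window.
        observed = percentile(values, pct)
        if observed > limit:
            breaches[name] = observed
    return breaches
```

With samples of `{"login": [100, 120, 110, 500], "checkout": [400, 450, 700]}`, only `login` is reported, because its 95th percentile (500 ms) exceeds the 300 ms threshold while `checkout` stays under 800 ms.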
In the world of APIs and microservices, the common identifiers are correlation IDs. However, you can also use more business-specific identifiers such as account numbers, user IDs, etc. to help identify a transaction end-to-end or to follow user behaviour across a group of services.
This helps in timely investigations and in tracing issues.
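One possible way to carry a correlation ID through a Python service is sketched below, using the standard `contextvars` and `logging` modules. The header name `X-Correlation-ID` is a common convention rather than a standard, and `start_request` and `outgoing_headers` are hypothetical helpers; your framework may already offer an equivalent.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = _correlation_id.get()
        return True


def start_request(incoming_headers):
    """Reuse the caller's ID if present, otherwise generate a new one."""
    cid = incoming_headers.get("X-Correlation-ID") or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to send on downstream calls so the ID travels end-to-end."""
    return {"X-Correlation-ID": _correlation_id.get()}
```

Adding the filter to a handler with `handler.addFilter(CorrelationIdFilter())` and including `%(correlation_id)s` in the formatter puts the same ID on every log line for a request, so a single search in the log aggregator pulls up the whole journey.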
Making sense of heaps of data from different sources in a way that results in meaningful decisions is an art. In my experience, it is usually the analysis of different events that drives what is required and helps build the visualisation requirements, which in turn feed into the logging and monitoring requirements.
Most log aggregation tools provide visualisation capabilities; however, there are also multiple third-party visualisation tools out there that help make more sense of the data than ever.