The World of Observability

As apps and services continue to be built at breakneck speed, Observability has moved from a “nice to have” to an absolute requirement for any system that supports a business. Beyond high-level business metrics that guide strategy and decision-making, Observability — through performance, availability, and reliability metrics — gives teams the insight they need to make better technical decisions, maximize uptime, maintain strong performance, and quickly identify areas that need attention.

Let’s start with some fundamentals: “What is Observability?”, “How do you implement it across your organisation?” and “What are the considerations?”

What is Observability?

Service Observability gives teams a high-level view across all critical systems, while still allowing them to zoom in on specific problem areas when necessary. It’s often confused with traditional APM (Application Performance Management) tools. While APM platforms aim to support observability, their effectiveness depends heavily on how they’re implemented and interpreted—and this is where teams and organisations commonly fall short.

Looking at observability more broadly, it involves monitoring tools that collect meaningful performance and availability time-series metrics and feed them into a centralised platform for aggregation. That aggregated data is then converted into visualisations and actionable insights, enabling a fast, intuitive understanding of system health and behaviour. When paired with well-defined alerting thresholds, observability systems can surface issues as performance degrades, while automated or self-healing mechanisms help restore service or mitigate impact before users are affected.

Monitoring is NOT Observability

It’s common to hear teams claim they have observability simply because monitoring is in place—but that’s a clear misconception. Monitoring and Observability are related, yet fundamentally different concepts that rely on each other. Monitoring contributes to observability, but it cannot deliver it on its own. True observability also requires rich application logs, distributed tracing, and thoughtfully designed visualisations tailored to their audience. Exposing network-level failures to application teams adds little value, just as presenting low-level metrics like heap usage to business stakeholders fails to provide meaningful insight.

The Four Pillars of Observability

Observability consists primarily of –

  • Logging of key steps in your system along with different request metrics
  • Monitoring of your system performance and behaviour
  • Traceability of your requests using common identifiers
  • Visualisation to help observe the above and make meaningful decisions

Capturing the right logs, metrics, and observation points within applications and systems, and storing them in a time-series database, is essential for gaining meaningful feedback from existing services and for designing observability into new ones. This requires adopting consistent, standardised logging formats across technology stacks and using fast, scalable log aggregation and indexing tools to support effective analysis and visualisation.

In more technically focused assessments, collecting infrastructure-level metrics in production becomes equally important, particularly for identifying capacity constraints and performance bottlenecks. This is where a well-implemented Application Performance Management (APM) solution adds significant value. Investing in a strong APM platform helps teams better understand performance characteristics, prioritise areas of improvement, and inform critical business decisions. It also enables Real User Monitoring (RUM) across key user journeys, offering visibility into user behaviour and its impact on performance—insight that basic logging alone often fails to capture.

How do you begin implementing this?

You need to start by ensuring two capabilities are built into your systems.

  • Effective logging
  • Process-level metric collection

Logging

Make sure your systems emit logs in a format that log aggregation and indexing tools can efficiently parse and analyse. Avoid excessive or low-value logging—such as tracing every method entry and exit or indiscriminately dumping large volumes of logs—as this can introduce noise and even impact performance. Instead, focus on capturing meaningful business metrics as structured name–value pairs, allowing them to be easily indexed, queried, and reused for deeper analysis later on.
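As a minimal sketch of structured name–value logging, the example below uses Python's standard `logging` module to emit one JSON object per line. The service name `orders`, the event name, and the field names (`request_type`, `product`, `duration_ms`, `status`) are assumptions made for illustration, not a prescribed schema.

```python
import json
import logging
import sys
import time


class KeyValueJsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)) + "Z",
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Business fields passed via extra={"fields": {...}} become top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream=sys.stdout) -> logging.Logger:
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(KeyValueJsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    # One meaningful business event, rather than a line per method entry and exit.
    log.info(
        "order_placed",
        extra={"fields": {"request_type": "checkout", "product": "broadband",
                          "duration_ms": 182, "status": 200}},
    )
```

Because every value sits under a named key, an aggregation tool can index the fields directly instead of relying on pattern matching against free text.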

This helps you quickly examine business metrics, and in particular the workload side of things:

  • Which kinds of requests take the longest to respond?
  • Which products or services are users requesting, and at what time of day?
  • How does a typical daily workload compare with a peak seasonal period?
  • Does your service behave differently at different times of the day, week, or month?

These are all questions that a log aggregation and indexing tool, such as Splunk or the ELK stack, can help answer.
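Workload questions like these come down to group-by aggregations over structured log records. The sketch below shows the idea in plain Python rather than in any particular tool's query language. The sample records and their field names (`timestamp`, `request_type`, `duration_ms`) are hypothetical.

```python
import json
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical structured log lines, one JSON object per line.
SAMPLE_LOGS = """\
{"timestamp": "2024-03-01T09:15:00", "request_type": "search", "duration_ms": 120}
{"timestamp": "2024-03-01T09:40:00", "request_type": "checkout", "duration_ms": 480}
{"timestamp": "2024-03-01T14:05:00", "request_type": "search", "duration_ms": 140}
{"timestamp": "2024-03-01T14:20:00", "request_type": "checkout", "duration_ms": 520}
{"timestamp": "2024-03-01T14:45:00", "request_type": "search", "duration_ms": 100}
"""


def parse(lines):
    """Turn newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in lines.splitlines() if line.strip()]


def mean_duration_by_type(records):
    """Which kinds of requests take the longest to respond?"""
    durations = defaultdict(list)
    for record in records:
        durations[record["request_type"]].append(record["duration_ms"])
    return {request_type: mean(values) for request_type, values in durations.items()}


def requests_per_hour(records):
    """How is the workload spread across the day?"""
    counts = defaultdict(int)
    for record in records:
        counts[datetime.fromisoformat(record["timestamp"]).hour] += 1
    return dict(counts)


if __name__ == "__main__":
    records = parse(SAMPLE_LOGS)
    by_type = mean_duration_by_type(records)
    print("slowest request type:", max(by_type, key=by_type.get))
    print("requests per hour:", requests_per_hour(records))
```

A log platform runs the same kind of aggregation over indexed fields at far larger volume, which is why consistent field names matter.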

Monitoring

Shifting focus to the technical side, the challenge is achieving meaningful observability across the entire stack—one that provides a complete and coherent view of system behaviour. In the past, this often meant collecting isolated metrics, plotting them on graphs, and relying on someone to interpret what they might indicate.

Today, well-configured and properly tuned APM tools have transformed this process. When aligned with your specific needs, APM platforms can capture critical performance indicators alongside user behaviour metrics, turning raw data into actionable insight. By clearly defining key transactions and setting appropriate thresholds to detect deviations, teams can make better-informed decisions and proactively prevent many production issues before they impact users.
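To make the idea of thresholds on key transactions concrete, here is a small sketch that checks whether the 95th-percentile latency of a window of samples exceeds a limit. It is not the API of any APM product. The `Threshold` type, the transaction name, and the 300 ms limit are assumptions for illustration only.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Threshold:
    transaction: str      # hypothetical key transaction name
    p95_limit_ms: float   # alert when the window's p95 exceeds this value


def p95(samples):
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples, n=100, method="inclusive")[94]


def breaches(threshold, window_ms):
    """Return True when the window's p95 latency exceeds the configured limit."""
    return p95(window_ms) > threshold.p95_limit_ms


if __name__ == "__main__":
    checkout = Threshold("checkout", p95_limit_ms=300)
    healthy = list(range(100, 120))        # 20 samples between 100 and 119 ms
    degraded = [100] * 15 + [900] * 5      # a quarter of requests are slow
    for name, window in (("healthy", healthy), ("degraded", degraded)):
        state = "ALERT" if breaches(checkout, window) else "ok"
        print(f"{checkout.transaction} [{name}]: {state}")
```

Evaluating a percentile rather than the average means a slow minority of requests can trigger the alert even while most requests remain fast.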

Tracing

When it comes to tracing, the goal is to gain end-to-end visibility across the entire stack and understand how requests flow through multiple services. Historically, teams relied on isolated metrics and logs, stitching together fragments of data and manually inferring where latency or failures might be occurring.

Modern tracing capabilities, often delivered through well-implemented APM platforms, have changed this significantly. Distributed tracing allows teams to follow individual transactions across service boundaries, exposing latency, errors, and dependencies with precision. By identifying critical request paths and defining meaningful thresholds, tracing data enables faster root-cause analysis, better performance tuning, and more informed decisions—helping teams resolve issues before they escalate into production incidents.
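The core mechanism behind distributed tracing is a common identifier that is created at the edge and passed along on every downstream call. The sketch below simulates that in a single process. The header name `X-Trace-Id`, the service names, and the in-memory `Collector` are all hypothetical stand-ins; real systems would typically rely on an instrumentation library and a standard propagation format rather than hand-rolled code like this.

```python
import time
import uuid
from dataclasses import dataclass, field

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch


@dataclass
class Span:
    trace_id: str
    service: str
    operation: str
    duration_ms: float


@dataclass
class Collector:
    """Stands in for a tracing backend; it simply keeps spans in memory."""
    spans: list = field(default_factory=list)

    def record(self, span):
        self.spans.append(span)

    def trace(self, trace_id):
        return [span for span in self.spans if span.trace_id == trace_id]


def handle(service, operation, headers, collector, downstream=()):
    """Simulate a service handling a request and calling other services."""
    # Reuse the caller's trace ID, or start a new trace at the edge.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    start = time.perf_counter()
    for next_service, next_operation in downstream:
        # Propagate the same identifier on every outgoing call.
        handle(next_service, next_operation, {TRACE_HEADER: trace_id}, collector)
    elapsed_ms = (time.perf_counter() - start) * 1000
    collector.record(Span(trace_id, service, operation, elapsed_ms))
    return trace_id


if __name__ == "__main__":
    collector = Collector()
    trace_id = handle("gateway", "GET /orders", {}, collector,
                      downstream=[("orders", "list"), ("pricing", "quote")])
    for span in collector.trace(trace_id):
        print(f"{span.trace_id[:8]} {span.service:8} {span.operation:12} {span.duration_ms:.3f} ms")
```

Since every span carries the same `trace_id`, the spans for one request can be pulled together afterwards and the time spent in each service compared.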

Visualisation

Effective visualisation is about turning complex observability data into clear, intuitive views that make system behaviour easy to understand. In the past, teams were often left staring at disconnected graphs and raw data, manually correlating metrics, logs, and traces to piece together what was happening across their systems.

Modern visualisation capabilities—often built into observability and APM platforms—bring this data together in a cohesive way. By presenting metrics, traces, and dependencies through purpose-built dashboards and service maps, teams can quickly identify performance issues, understand relationships between services, and spot emerging trends. When visualisations are designed with the right context and audience in mind, they enable faster insights, better decision-making, and more effective responses before issues impact users.

I’m new to this – what skills do I need?

Observability is a capability that supports your end objectives: gathering business and technical insights, identifying and tracing particular issues, finding bottlenecks, monitoring usage across your systems, and providing different visualisations for different audiences.

What it isn’t is a one-stop shop that anyone can pick up and use; you do need skills to use this capability. Originally, Performance Engineers were the first to transition to Observability, and they did so seamlessly. Next came the advent of SREs, who combined Developer, DevOps, Performance Engineering, and Production Support skills. Today, many Developers and DevOps engineers are also interested in Observability.

Broadly, these are the high-level skills you need to work on Observability:

  • Math and Statistics foundation: Have a clear understanding of average, median, mode, percentiles, etc. Be good with numbers: know when you need percentages versus actual numbers, and when to round to two decimal places. Understand the value of comparing the average, 95th percentile, and 99th percentile, and what the differences say about those metrics.
  • Infrastructure knowledge: Understand how CPU and memory work, how the heap works in a JVM versus the .NET CLR, how the network impacts latency, and how disk reads and writes impact performance.
  • Understanding OpenTelemetry: Have a broad understanding of OpenTelemetry as the foundational standard that most APM and Observability solutions are being built on. With this come tech-stack-specific libraries that help produce metrics these systems can read seamlessly.
  • Intuitive Visualisation Design: A graphical representation is only as good as the information it serves, but showing the right visualisation for the right kind of data is just as important. Plotting throughput and response time as stacked bars on a single axis, for example, doesn’t help, because the two measures have different units.
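To illustrate the statistics point in the list above, the sketch below compares the average, median, 95th percentile, and 99th percentile of one set of response times. The sample data is invented: mostly fast responses with a small slow tail.

```python
from statistics import mean, median, quantiles

# Hypothetical response times in milliseconds: mostly fast, with a slow tail.
response_times_ms = [100] * 90 + [200] * 5 + [1000] * 4 + [5000]


def summarise(samples):
    """Return the summary statistics most often compared for latency."""
    cuts = quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {
        "mean": mean(samples),
        "median": median(samples),
        "p95": cuts[94],
        "p99": cuts[98],
    }


if __name__ == "__main__":
    for name, value in summarise(response_times_ms).items():
        print(f"{name:>6}: {value:.0f} ms")
```

With this data the median is 100 ms, the mean is 190 ms, the 95th percentile is 240 ms, and the 99th percentile is 1040 ms. The gap between them is the signal: the typical request is fast, while the mean is pulled up by a few outliers and the high percentiles show what the slowest users actually experience.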

These skills are a baseline. They sit alongside other technical skills, presentation skills, and a little business acumen, all of which help ensure that the information captured, aggregated, and visualised is meaningful to the audience.

Next?

I am also keen to add more info here on certain tooling and my experience with some of it, such as Splunk, OpenSearch, AppDynamics, Dynatrace, and Grafana: the good, the bad and the ugly!
