How to build an observability stack that cuts to the heart of any issue with no wasted effort

04 Sep 2025 - 6 min read

Once upon a time, I was searching for a tool to SSH into multiple hosts and run commands simultaneously.

I wanted such a tool because I often found myself opening many terminal tabs and firing commands at multiple Linux hosts during troubleshooting. I thought it would be cool if there was a tool that could run commands across all of them simultaneously.

It would save me some effort.

But, I found only a few tools built for that purpose. Most open-source tools were not actively maintained. Commercial tools were unconvincing.

Considering the security implications of such a tool as well, I set out to create one.

And I did. It was just a bunch of Python scripts thrown in one folder. But it worked.

Later, I thought I would make this a fully fledged open-source tool. So I started rewriting it in Go.

But, I found that the UI of this tool was quite messy. (Not because I switched to Go, but because of the inherent nature of the tool.) Also, there was a risk of running a disruptive command on multiple servers at once.

So, I started thinking it all over. What was the real problem I was trying to solve?

Then, I realized that I was working on the wrong problem.

What I needed was not a tool that could log into and run commands on multiple hosts. What I wanted was a mechanism that could quickly identify faults.

That's the job of the observability stack.

What is an observability stack

An observability stack is a collection of tools and technologies used to monitor, analyze, and gain insight into the internal state of a software system.

Unlike traditional monitoring that focuses on known metrics and alerts, observability is about providing a holistic view of the system, allowing you to ask ad-hoc questions and identify problems.

An effective observability stack is built on the three pillars of telemetry data:

Metrics: Numerical measurements collected over time, like CPU usage, request latency, or error rates. They are great for providing a high-level overview of system health and for alerting on anomalies.

Logs: Time-stamped, granular records of events that occur within an application. They provide rich context for debugging and can be used to understand the sequence of events leading to an issue.

Traces: Represent the end-to-end journey of a request as it moves through a distributed system. They are crucial for understanding the performance and dependencies of microservices.
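
To make these pillars concrete, here's a minimal sketch of a single function emitting all three signals. It assumes Go with the Prometheus client library, the standard library's log/slog, and the OpenTelemetry tracing API; the service and metric names are made up for illustration.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel"
)

// Metric: a numeric series collected over time (here, request latency).
var requestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "request_duration_seconds",
	Help: "How long requests take to serve.",
})

var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

func handleRequest(ctx context.Context) {
	// Trace: a span marks one unit of work inside the end-to-end request.
	// (With no tracer provider configured, this is a no-op tracer.)
	ctx, span := otel.Tracer("payment-service").Start(ctx, "handleRequest")
	defer span.End()

	start := time.Now()
	// ... the actual work would happen here ...

	// Metric: record how long this request took.
	requestLatency.Observe(time.Since(start).Seconds())

	// Log: a time-stamped record of the event, with context for debugging.
	logger.InfoContext(ctx, "request handled", "duration", time.Since(start))
}

func main() {
	prometheus.MustRegister(requestLatency)
	handleRequest(context.Background())
}
```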

There is no single tool that can do all of this.

So, you must hook up multiple tools to collect, store, and visualize metrics, logs, and traces - hence the name observability stack.

What are the key characteristics of a good observability stack

A good observability stack allows you to pinpoint problems without manual analysis.

You don't need to SSH into the hosts. You don't need to run commands to see which host has used up all the memory. You don't need to grep each and every log. You can get it all done from your observability stack.

To achieve this, the observability stack must offer a holistic view of the running health of your system. If certain parameters drop, that should be immediately reflected in this holistic view.

Then you are able to drill down to granular metrics, logs, and traces from a single dashboard and identify the faulty part in the system.

This is how a good observability stack works.

The problem with silos

Most observability stacks suffer from a fundamental flaw.

They treat metrics, logs, and traces as isolated datasets. When a problem occurs, you'll see that there is a problem. But, you'll have little clue which part of your system is misbehaving.

For example, a dashboard might show you are getting more 500 responses than usual. But, to find out which Pod, which host, or which cluster is causing this, and why, you'll have to analyze it manually.

It's inefficient. It's frustrating. When your software in prod is not working, time is precious. You need to know which part of your system is responsible so you can recover quickly.

But, if your observability data sets are isolated, you'll waste precious minutes doing manual analysis.

Siloed data sets are the enemy of effective observability.

Correlation is key

The solution to the silo problem is correlation.

A truly effective observability stack is built on the principle that all data points—metrics, logs, and traces—must be linked. When a metric shows an anomaly, you can instantly pivot to the related logs and traces to understand the root cause.

This eliminates the need for manual context switching and SSHing into multiple hosts.

How to correlate data

The key to correlation lies in standardized labels.

When you instrument your code, every metric, log entry, and trace span must be tagged with common labels that uniquely identify a request, service, or component. Think of these labels as shared keys that connect different data types.

A metric like http_requests_total might have labels for service_name, endpoint, and status_code. To correlate effectively, your logs and traces must also carry the same labels.
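
As a rough illustration, here's what sharing those labels between a metric and a log entry can look like. This sketch assumes Go with the Prometheus client and log/slog; the service name user-api and the values are invented for the example.

```go
package main

import (
	"log/slog"
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

// The metric carries service_name, endpoint, and status_code as labels.
var httpRequestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests served.",
	},
	[]string{"service_name", "endpoint", "status_code"},
)

func main() {
	prometheus.MustRegister(httpRequestsTotal)
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Count a request...
	httpRequestsTotal.WithLabelValues("user-api", "/user/profile", "500").Inc()

	// ...and log it with the same label names and values, so a query on
	// endpoint="/user/profile" finds both the metric series and the log lines.
	logger.Error("request failed",
		"service_name", "user-api",
		"endpoint", "/user/profile",
		"status_code", 500,
	)
}
```

With both signals carrying service_name, endpoint, and status_code, a spike in the 500-labelled metric series points you straight at the matching log lines.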

Prometheus labels are an excellent example of this concept. Grafana Loki and Grafana Tempo also have a similar labeling concept for logs and traces, making Prometheus + Loki + Tempo + Grafana a highly efficient observability stack.

Finding the correlating labels

When instrumenting your code for observability, a crucial question you need to ask yourself is how to label your telemetry.

To determine which labels to use, imagine a manual troubleshooting session. What information would you use to drill down into a problem?

  1. You start with a metric that's out of the ordinary, like a sudden drop in success rate for a specific API endpoint. The metric would have a label for endpoint (e.g., /user/profile).
  2. Next, you'd want to look at the logs for that specific endpoint. To do this without a unified system, you might SSH into the server and grep for the endpoint name. Aha! The endpoint name is a must-have label in your logs.
  3. You might also look for a specific request_id from the log to see the entire flow of a single user request. Then the request_id becomes a crucial correlating label in the traces.

By identifying these "mental" links you make during a manual investigation, you can formalize them as required labels for your instrumentation.

This ensures that when a metric alerts you to an issue, you can immediately find all related log entries and traces.
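
As a rough sketch of what that can look like in code, here's an HTTP handler that tags both its trace span and its log lines with the same request_id and endpoint. It assumes Go with log/slog and the OpenTelemetry tracing API; the handler and the way the request_id is generated are purely illustrative.

```go
package main

import (
	"fmt"
	"log/slog"
	"math/rand"
	"net/http"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

func profileHandler(w http.ResponseWriter, r *http.Request) {
	// One request_id per request; in practice it often comes from an
	// upstream header or the trace ID itself.
	requestID := fmt.Sprintf("%016x", rand.Uint64())

	// Tag the trace span with the request_id and endpoint...
	ctx, span := otel.Tracer("user-api").Start(r.Context(), "GET /user/profile")
	defer span.End()
	span.SetAttributes(
		attribute.String("request_id", requestID),
		attribute.String("endpoint", "/user/profile"),
	)

	// ...and every log line for this request with the same keys, so you can
	// pivot from a log entry to the full trace (and back) by request_id.
	reqLogger := logger.With(
		"request_id", requestID,
		"endpoint", "/user/profile",
	)
	reqLogger.InfoContext(ctx, "loading profile")

	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/user/profile", profileHandler)
	logger.Info("listening", "addr", ":8080")
	http.ListenAndServe(":8080", nil)
}
```

Query the logs for that request_id, or the traces for the span carrying it, and you get the full story of one request without touching a host.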

This interconnected data model is what allows you to troubleshoot without manual analysis and cut to the heart of any issue with no wasted effort.
