Introduction to metrics - one of the three pillars in DevOps observability

Observability in DevOps comprises three pillars: metrics, logs, and traces.
We've been talking about logs here, here, and here.
Now, let's focus on metrics—the numerical heartbeat of your applications and infrastructure.
What are metrics (and how do they differ from logs and traces)?
A metric is a numerical measurement of a system's property captured over time. Think of the CPU utilization of a server, the number of active users on a website, or the response time of an API endpoint. Metrics are fundamentally numbers with timestamps, often accompanied by labels (or tags) that add context, like `http_requests_total{method="POST", endpoint="/api/users"}`.
They are lightweight, easy to aggregate, and ideal for building dashboards and setting up alerts. But how do they differ from logs and traces?
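To make that shape concrete, here's a minimal sketch using the official Prometheus Python client (installed with pip install prometheus-client); the metric name and labels mirror the example above:

```python
from prometheus_client import Counter, generate_latest

# A counter with two labels, matching the example metric above.
HTTP_REQUESTS = Counter("http_requests", "Total HTTP requests",
                        ["method", "endpoint"])
HTTP_REQUESTS.labels(method="POST", endpoint="/api/users").inc()

# generate_latest() renders current values in Prometheus' text exposition
# format. Alongside some default process metrics, it prints a line like:
#   http_requests_total{endpoint="/api/users",method="POST"} 1.0
# (the Python client appends _total to counter names).
print(generate_latest().decode())
```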
Let's use a car trip analogy:
* Metrics: These are the readings on your car's dashboard—speed, RPM, fuel level, and engine temperature. They give you a high-level, quantifiable overview of the car's state at any given moment (speed=60km/h, timestamp=14:30:05).
* Logs: This is your trip diary. A log is a detailed, timestamped text record of a specific event, like "Started engine," "Turned left on Main St.," or "Engine warning light came on." They are great for debugging specific, discrete events.
* Traces: This is the GPS route of your entire journey. A trace follows a single request as it travels through all the different microservices in your system. It shows you the path and the time spent in each service, helping you pinpoint bottlenecks in a distributed architecture.
In short, metrics tell you what happened (e.g., memory usage spiked to 90%), logs tell you why it happened (e.g., an error message showing an out-of-memory exception), and traces show you where it happened (e.g., in the authentication service during a database call).
Types of Metrics
Metrics generally fall into four main categories. Understanding them is key to instrumenting your applications effectively; a short code sketch of all four follows the list.
Counter: A cumulative metric that only ever goes up (or resets to zero on restart). It's like your car's odometer.
- Use Case: Counting the total number of HTTP requests, tasks completed, or errors that have occurred since the service started.
Gauge: A single numerical value that can arbitrarily go up and down. It's like your car's speedometer.
- Use Case: Measuring current memory usage, the number of active connections in a pool, or the number of items in a queue.
Histogram: A more complex metric that samples observations (like request durations) and counts them in configurable buckets. It provides a distribution of the data. Histograms allow you to calculate quantiles (e.g., the 95th percentile request latency) on the server side.
- Use Case: Understanding the distribution of API response times. You can answer questions like, "How many requests completed in under 100ms, 200ms, and 500ms?"
Summary: Similar to a histogram, a summary also samples observations. However, it calculates configurable quantiles on the client side and exposes them directly.
- Use Case: Also used for request latencies or response sizes, but it's less flexible for aggregation across many instances compared to a histogram.
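Here's a minimal sketch of all four types using the Prometheus Python client; the metric names and bucket boundaries are illustrative choices, not prescriptions:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: only goes up (the odometer).
TASKS_DONE = Counter("tasks_completed", "Total tasks completed")
TASKS_DONE.inc()

# Gauge: goes up and down (the speedometer).
QUEUE_SIZE = Gauge("queue_items", "Items currently in the queue")
QUEUE_SIZE.inc(5)  # five items enqueued
QUEUE_SIZE.dec(2)  # two items processed

# Histogram: observations counted into configurable buckets;
# quantiles are computed later, on the server side.
LATENCY = Histogram("request_latency_seconds", "Request latency",
                    buckets=(0.1, 0.2, 0.5, 1.0))
LATENCY.observe(0.42)  # lands in the 0.5s bucket

# Summary: tracks the count and sum of observations. (Note: the Python
# client does not implement client-side quantiles for Summary, though
# clients in some other languages do.)
RESPONSE_SIZE = Summary("response_size_bytes", "Response size in bytes")
RESPONSE_SIZE.observe(512)
```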
How to create and collect metrics
Creating and collecting metrics generally involves two main steps:
Instrumentation: You add code to your application using a client library (e.g., Prometheus client for Python, Micrometer for Java). This code creates and updates the metrics (counters, gauges, etc.) as your application runs. For example, you'd increment a counter for every HTTP request your API receives.
Collection (or Scraping): Once your application is instrumented, it needs to expose these metrics so a monitoring system can collect them. There are two primary models for this:
- Pull Model: The monitoring system (like Prometheus) periodically queries an HTTP endpoint (e.g., `/metrics`) on your application to "pull" or "scrape" the latest metric values. This is the most common model in modern cloud-native environments.
- Push Model: The application or an agent actively "pushes" its metrics to a central monitoring system. This is useful for short-lived jobs or systems behind a strict firewall.
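Here's a sketch of both models with the Prometheus Python client. The port and the Pushgateway address are assumptions for a local setup, not requirements:

```python
import time

from prometheus_client import (CollectorRegistry, Counter, Gauge,
                               push_to_gateway, start_http_server)

# Pull model: expose a /metrics endpoint for Prometheus to scrape.
REQUESTS = Counter("http_requests", "Total HTTP requests")
start_http_server(8000)  # serves http://localhost:8000/metrics
REQUESTS.inc()           # instrumented code updates metrics as it runs

# Push model: a short-lived batch job pushes its final state to a
# Pushgateway (assumed to be at localhost:9091), which Prometheus
# then scrapes on the job's behalf.
registry = CollectorRegistry()
last_success = Gauge("job_last_success_unixtime",
                     "Last time the batch job succeeded", registry=registry)
last_success.set(time.time())
push_to_gateway("localhost:9091", job="nightly_batch", registry=registry)
```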
Popular open-source tools for metrics
Once you're collecting metrics, you need tools to store, query, and visualize them. Here are some of the most popular open-source choices in the DevOps world:
Prometheus and Grafana: This is the de facto standard for cloud-native monitoring.
- Prometheus: An open-source monitoring system and time-series database. It uses a pull model to collect metrics, stores them efficiently, and has a powerful query language called PromQL (see the sketch after these bullets).
- Grafana: An open-source visualization tool. It connects to Prometheus (and many other data sources) to build beautiful, insightful dashboards and alerts. The combination is powerful and flexible.
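To give a feel for PromQL, here's a small sketch that runs a query against Prometheus' HTTP API from Python. It assumes a Prometheus server on its default port 9090 and the requests library:

```python
import requests

# PromQL: per-second rate of HTTP requests over the last 5 minutes.
query = 'rate(http_requests_total[5m])'
resp = requests.get("http://localhost:9090/api/v1/query",
                    params={"query": query})
resp.raise_for_status()

# Each result pairs a label set with its latest [timestamp, value] sample.
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```

Grafana panels issue much the same queries against this API when they render a dashboard.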
ELK Stack (Elasticsearch, Logstash, Kibana): While primarily known as a powerful log aggregation platform, the ELK Stack can also ingest and visualize metrics data. It's particularly useful if you want to keep your logs and metrics in one place, though it may not be as efficient for metrics as a purpose-built TSDB.
Zabbix: A mature, all-in-one monitoring solution that offers monitoring of networks, servers, applications, and services. It typically uses an agent-based approach (push model) and provides its own storage and visualization capabilities. It's a robust choice for traditional IT infrastructure monitoring.
VictoriaMetrics: A fast, cost-effective, and scalable open-source time-series database. It's often used as a long-term storage solution for Prometheus or as a complete replacement. It's designed for high performance and can handle massive amounts of metrics data.
Why you should not use SQL databases to store metrics
It might be tempting to just store your metrics in a familiar SQL database like PostgreSQL or MySQL. Don't do it. Here’s why traditional relational databases are a poor fit for metrics data:
High Write Volume: Monitoring systems generate a massive, constant stream of data (e.g., hundreds of metrics from thousands of servers every few seconds). SQL databases, which are often optimized for complex reads and transactional integrity (ACID), struggle with this relentless, high-velocity write load.
Inefficient Indexing: Metrics are time-series data, meaning they are always queried over a time range. Relational database indexes (like B-trees) are not optimized for this access pattern, leading to slow queries as your data grows. Time-Series Databases (TSDBs) like Prometheus use specialized indexes designed for time.
Poor Compression: TSDBs use highly specialized compression algorithms (like delta-of-delta encoding and XOR) that understand the nature of time-series data, resulting in huge storage savings. SQL databases lack this, and your storage costs will explode. (A small sketch after this list illustrates the idea.)
Schema Rigidity: Adding a new metric or a new label in a SQL world might require a schema migration (`ALTER TABLE`). This is cumbersome and impractical in a dynamic environment where new services and metrics are constantly being added. TSDBs take a flexible, schemaless approach that handles new labels and metrics gracefully.
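To make the compression point concrete, here's a simplified sketch of the delta-of-delta idea for timestamps. Real TSDBs bit-pack these values; this only computes them:

```python
# Timestamps from a scrape every ~15 seconds (Unix seconds, illustrative).
timestamps = [1000, 1015, 1030, 1045, 1061, 1075]

# First differences: almost always close to the scrape interval.
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
# -> [15, 15, 15, 16, 14]

# Second differences: mostly zero or near zero, so a bit-packed encoding
# can store most samples in a couple of bits instead of 8 bytes each.
deltas_of_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
# -> [0, 0, 1, -2]
print(deltas, deltas_of_deltas)
```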
In essence, using a SQL database for metrics is like using a sedan to haul commercial cargo—it might work for a very small load, but it's the wrong tool for the job and will quickly break down at scale. Always use a purpose-built Time-Series Database (TSDB).
Tip for aspiring DevOps engineers: How to get started with observability tools
When it comes to building skills with observability tools, nothing beats a home lab.
You can build a home lab with a comprehensive observability stack all with open-source tools. If you are new to these tools, I suggest choosing Prometheus and Grafana for collecting and visualizing metrics in your home lab. This tool combo gives you the best experience at the start. They are also widely used by DevOps teams in production.
Here's what you need to build a home lab with Prometheus and Grafana.
- Build a web application and instrument it to emit metrics using Prometheus client libraries (a minimal sketch follows this list).
- Install Prometheus (preferably on Docker) and collect metrics.
- Use Prometheus query language (PromQL) to visualize metrics within Prometheus itself.
- Install Grafana (again Docker is preferred) and use Prometheus as a data source to build advanced visualization dashboards.
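For step 1, here's a minimal sketch of an instrumented web app. It assumes Flask and the Prometheus Python client (pip install flask prometheus-client); the route and port are my choices:

```python
from flask import Flask, request
from prometheus_client import Counter, make_wsgi_app
from werkzeug.middleware.dispatcher import DispatcherMiddleware

app = Flask(__name__)
REQUESTS = Counter("http_requests", "Total HTTP requests",
                   ["method", "endpoint"])

@app.route("/api/users")
def users():
    # Update the metric as the app serves real traffic.
    REQUESTS.labels(method=request.method, endpoint="/api/users").inc()
    return {"users": []}

# Mount the exposition endpoint at /metrics so the Prometheus
# container from step 2 can scrape this app.
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {"/metrics": make_wsgi_app()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Point Prometheus' scrape config at this app's /metrics endpoint and you can watch http_requests_total climb as you hit /api/users.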
Out of these four steps, I find PromQL is what beginners struggle with most.
But don't worry.
If you get the fundamentals right, you'll have no trouble with PromQL. We will delve into that in an upcoming post.