A guide to logging in DevOps

When an application fails at 2 AM, your first line of defense isn't a guess—it's a log file.
Logging is the practice of recording events that happen inside your software. Think of it as a detailed diary or a flight recorder for your application, capturing everything from routine operations to critical failures.
In the world of DevOps, where speed and reliability are king, mastering logging isn't just a good practice; it's a necessity for survival.
Logging vs. metrics vs. tracing: The three pillars of observability
Logging is often discussed alongside two other concepts: metrics and tracing. Together, they form the three pillars of observability, but they each answer different questions.
Metrics are numerical measurements over time. They tell you what is happening at a high level.
- Analogy: Your car's dashboard showing speed, engine temperature, and fuel level.
- Example: CPU utilization is at 90%.
Tracing follows a single request as it travels through all the different microservices in your system. It tells you where a problem is occurring.
- Analogy: A GPS map showing the exact route your delivery took, including where it got stuck in traffic.
- Example: The payment service is taking 5 seconds to respond to a request from the cart service.
Logging provides a detailed, timestamped record of a specific event. It tells you why something happened.
- Analogy: The detailed mechanic's report explaining that the engine temperature is high because the radiator is leaking.
- Example: A log entry shows `ERROR: Connection to database timed out after 3 retries.`
You need all three for a complete picture, but logs provide the granular context that metrics and traces often lack.
Log levels
You don't need to log everything, especially in production.
Logging everything in production demands enormous amounts of storage, and the business value of the logs diminishes as the noise grows.
That's why DevOps teams have adopted log levels that classify events by severity. The standard levels are:
- DEBUG: Detailed information, useful only for debugging. (e.g., `Variable x = 10`)
- INFO: Routine information about normal operations. (e.g., `Server started on port 8080`)
- WARN: Indicates something unexpected happened that might cause a problem in the future. (e.g., `API key is deprecated`)
- ERROR: A serious issue occurred, and a specific operation failed, but the application can continue running. (e.g., `Failed to connect to the database`)
- FATAL / CRITICAL: A critical error that forces the application to shut down. (e.g., `Cannot bind to required network port`)
In development, you might log at the `DEBUG` level, but in production, you'd typically set the level to `INFO` or `WARN` to reduce log volume.
Logging formats: plain text vs. structured logs 🧱
How you write your logs matters. There are two main approaches:
Plain Text (Unstructured): These are simple, human-readable text strings.
18-08-2025 18:10:00 ERROR: Failed to process payment for user 123.
While easy to read, they are very difficult for machines to parse and query reliably.
Structured Logs (e.g., JSON): These are logs written in a machine-readable format with key-value pairs.
{"timestamp": "2025-08-18T18:10:00Z", "level": "ERROR", "message": "Failed to process payment", "userId": "123", "service": "payment-api"}
Structured logging is the modern standard. It allows you to easily filter, search, and analyze your logs (e.g., "Show me all ERROR logs from the `payment-api` service").
Plain text logging dominated in the past, but more and more DevOps teams are adopting structured logging today.
Structured logging treats logs as machine-readable data, not just text. This makes them significantly easier to search, filter, and analyze automatically, which is essential for troubleshooting modern, complex applications.
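One way to get structured output with no third-party dependencies is a custom formatter on top of Python's standard `logging` module. This is only a sketch; the `service` field and its `payment-api` value are hypothetical examples:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "payment-api",  # hypothetical service name
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("structured-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("Failed to process payment")
```

Because every line is valid JSON, a log pipeline can filter on `level` or `service` without fragile regex parsing.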
Logging in your favorite languages 💻
All major programming languages have built-in support for logging, but this built-in support is often limited to unstructured plain text.
If you want structured logging, you typically reach for third-party libraries and SDKs like these:
| Language | Standard Library Default | Common Practice for Structured Logging |
|---|---|---|
| Python | Plain text | Custom formatter or libraries like python-json-logger |
| Java | Plain text | SLF4J facade with Logback/Log4j2 using a JSON encoder |
| Go | Plain text | Third-party libraries like Zerolog or Zap |
| Node.js | Plain text (`console.log`) | Third-party libraries like Pino or Winston |
The power of centralized logging 🏢
In a modern microservices architecture, you might have dozens or even hundreds of services, each generating its own log files. Trying to SSH into each server to read logs during an outage is a nightmare.
Centralized logging is the solution. It involves collecting logs from all your services and applications and shipping them to a single, central location for storage and analysis. This gives you:
- A Single Source of Truth: Search all your logs from one interface.
- Easier Troubleshooting: Correlate events across different services to find the root cause of a problem.
- Enhanced Visibility: Create dashboards and alerts based on log data from your entire system.
- Improved Security: Aggregate and analyze logs for security audits and threat detection.
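To make "shipping logs to a central location" concrete, here is a rough sketch of what a log shipper builds before pushing to a central store. The payload shape follows Loki's push API (streams with a label set plus `[timestamp, line]` pairs), but treat it as illustrative rather than a drop-in client:

```python
import json
import time

def to_loki_payload(service: str, level: str, message: str) -> dict:
    """Build a Loki-style push payload for a single log line.

    Loki indexes only the labels in "stream" (here: service and level),
    keeping the full log line as cheap, unindexed content.
    """
    ts_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    return {
        "streams": [{
            "stream": {"service": service, "level": level},
            "values": [[ts_ns, json.dumps({"message": message})]],
        }]
    }

payload = to_loki_payload("payment-api", "ERROR", "Connection to database timed out")
print(json.dumps(payload, indent=2))
```

A real shipper (Promtail, Fluentd, an HTTP handler) batches many lines per push, but the data model is the same: labels for indexing, log lines for detail.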
Popular logging tools 🛠️
Several powerful toolsets, both open-source and commercial, can help you achieve centralized logging.
Open-source tools
- ELK Stack (Elasticsearch, Logstash, Kibana): The traditional powerhouse. Elasticsearch is for storing and searching, Logstash is for collecting and processing, and Kibana is for visualization.
- PLG Stack (Promtail, Loki, Grafana): A modern, lightweight, and cost-effective alternative from Grafana Labs. Loki is specifically designed to be efficient by only indexing metadata (labels) instead of the full log text.
- Fluentd: An extremely flexible data collector (often called the "unified logging layer") that can gather logs from hundreds of sources and send them to various destinations.
Commercial observability tools
Platforms like Grafana Cloud, Datadog, New Relic, Splunk, and Dynatrace offer sophisticated, integrated logging solutions as part of their broader observability platforms. They often provide seamless log collection, powerful analytics, and automatic correlation with metrics and traces, but at a subscription cost.
Common Logging Challenges 🚧
Logging isn't without its challenges. Here are a few to keep in mind:
- Cost and Volume: Logs can consume massive amounts of storage and network bandwidth, leading to high costs, especially in the cloud.
- Signal vs. Noise: Logging too much (e.g., `DEBUG` in production) can bury important messages. Logging too little can leave you blind during an incident.
- Handling Sensitive Data: Logs can easily contain Personally Identifiable Information (PII) like names, email addresses, or credit card numbers. You must have a strategy to scrub or mask this data.
- Inconsistent Formatting: If different teams or services use different log formats, it makes centralized analysis nearly impossible.
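Scrubbing sensitive data can be automated with a logging filter that rewrites records before they are emitted. This sketch masks anything that looks like an email address; the regex is deliberately simple and only illustrative, and a real deployment would cover more PII patterns:

```python
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class PiiFilter(logging.Filter):
    """Mask email addresses in log messages before they are written."""

    def filter(self, record):
        record.msg = EMAIL_RE.sub("[REDACTED]", str(record.msg))
        return True  # keep the record, just with PII masked

log = logging.getLogger("pii-demo")
log.addFilter(PiiFilter())
log.warning("Password reset requested for alice@example.com")
# logs: Password reset requested for [REDACTED]
```

Attaching the filter once (on a handler or logger) protects every code path that logs through it, which is far more reliable than asking each developer to remember.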
Troubleshooting with logs is a fundamental skill for all DevOps engineers.
By adopting structured logging, using appropriate log levels, and building a centralized pipeline, you transform logs from simple text files into a rich, queryable dataset that supports efficient troubleshooting and problem resolution.
You can try logging in your home lab to get a feel for troubleshooting.
Create a Python app that generates a continuous stream of dummy logs. Deploy Promtail, Loki, and Grafana and build a logging pipeline. Randomly simulate errors from the Python app and try to identify them in the logs.
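A minimal version of that dummy-log generator might look like this; the event names, error rate, and `dummy-app` logger name are made up for the exercise:

```python
import json
import logging
import random
import time

def emit_dummy_log(log: logging.Logger, error_rate: float = 0.2) -> None:
    """Emit one random log line; roughly error_rate of them are errors."""
    if random.random() < error_rate:
        log.error(json.dumps({"event": "payment_failed",
                              "user_id": random.randint(1, 100)}))
    else:
        log.info(json.dumps({"event": "request_ok", "path": "/checkout"}))

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    app_log = logging.getLogger("dummy-app")
    # Finite burst here; run it in a loop (or under systemd/Docker) for the
    # continuous stream that Promtail will tail.
    for _ in range(10):
        emit_dummy_log(app_log)
        time.sleep(0.1)
```

Point Promtail at the file or container this writes to, and the simulated `payment_failed` errors become something you can hunt down in Grafana.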
Developing troubleshooting skills ultimately requires working on a production setup; home labs like this are the next best thing when you are striving to get your first DevOps job.