☁️ Cloud & DevOps

Mastering OpenTelemetry Observability in Production

📅 May 23, 2026

Marcus Cole

Cloud & DevOps Lead

Platform engineer who's been through every infrastructure era — bare metal, VMs, containers, serverless. Has strong opinions about YAML files and even stronger opinions about over-engineering.

Tags: cloud native telemetry, OTel collector configuration, distributed tracing tutorial, DevOps monitoring

The Reality Check

It is 3:14 AM. Your pager is screaming. You drag yourself out of bed, open your laptop, and stare at a dashboard that tells you CPU usage on the payment service has spiked to 100%. You switch tabs to your logging platform, but the logs for that exact minute are just a wall of generic database timeout errors. You switch tabs again to your tracing tool—assuming you even have one—only to realize the trace context was dropped somewhere between the API gateway and the billing service.

Microservices and complex CI/CD pipelines are fantastic for shipping code quickly. But in production, complexity is a tax you pay in sleep. We have fractured our systems into dozens of tiny, independent pieces, and in doing so, we shattered the single most important thing an operator needs: context.

When a system breaks, and it always will, you do not care about how elegant the architecture is. You care about finding the broken pipe. But right now, you are trying to find a leak in a city's water supply using three different maps drawn by three different vendors, and none of the streets line up.

The Core Problem

The real bottleneck in modern infrastructure is not the speed of deployment; it is tool fragmentation and vendor lock-in. For years, every time we wanted to send metrics to Prometheus, logs to Elastic, and traces to Jaeger, we had to import three different tool-specific client libraries into our application code.

If the business decided to switch vendors to save money, someone had to go back and rewrite the instrumentation in every single repository. This is wasted effort. The best code is code you don't write, and you certainly shouldn't be writing custom glue code just to know if your application is breathing.

This week, the Cloud Native Computing Foundation (CNCF) announced the graduation of OpenTelemetry. This is not just another trendy project; it is a pragmatic standardization of how we observe our systems. OpenTelemetry provides a single, vendor-neutral standard for metrics, logs, and traces.

Under the Hood

Before we rely on the abstraction of an SDK, let's look at what is actually happening underneath.

Think of the OpenTelemetry (OTel) Collector as the logistics hub of a major shipping port.

Applications (the ships) are constantly arriving, unloading raw cargo (telemetry data). In the old days, every ship spoke a different language and used different sized boxes. The port was chaotic.

OpenTelemetry standardizes the shipping containers into a format called OTLP (OpenTelemetry Protocol). When the cargo arrives at the port, it goes through a strict, three-step assembly line:

1. Receivers: The loading docks. They accept data in various formats (OTLP, Prometheus, Jaeger) and translate it into a single internal format.
2. Processors: The sorting facility. Here, data is batched, filtered, or enriched. If a log contains sensitive customer data, the processor scrubs it before it leaves the building.
3. Exporters: The outbound trucks. They take the standardized internal data, translate it into whatever format your specific vendor requires, and ship it off to your storage backends.

By placing this Collector between your applications and your vendors, you decouple your code from your infrastructure. You write the instrumentation once.
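To make the receiver, processor, exporter flow concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how the Collector is implemented: the function names, the dict-based record format, and the card_number attribute are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Record = dict  # hypothetical internal format: one plain dict per telemetry record


def otlp_receiver(raw: Iterable[dict]) -> list:
    """Receiver: accept incoming data and normalise it to the internal format."""
    return [dict(item) for item in raw]


def scrub_processor(records: list) -> list:
    """Processor: redact a sensitive attribute before data leaves the pipeline."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy so the caller's data is left untouched
        if "card_number" in record:
            record["card_number"] = "[REDACTED]"
        cleaned.append(record)
    return cleaned


def console_exporter(records: list) -> list:
    """Exporter: translate internal records into an output format (text lines)."""
    return [", ".join(f"{k}={v}" for k, v in sorted(r.items())) for r in records]


@dataclass
class Pipeline:
    receiver: Callable
    processors: list
    exporter: Callable

    def run(self, raw):
        records = self.receiver(raw)
        for processor in self.processors:  # processors run in the order listed
            records = processor(records)
        return self.exporter(records)


pipeline = Pipeline(otlp_receiver, [scrub_processor], console_exporter)
lines = pipeline.run(
    [{"name": "process_credit_card", "card_number": "4111111111111111"}]
)
for line in lines:
    print(line)
```

Swapping console_exporter for a different function changes where the data ends up without touching the receiver or the code that produced the records, which is the same decoupling the Collector gives you.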

The Pragmatic Solution: Step-by-Step Tutorial

Let's build a stable, fundamental implementation of the OpenTelemetry Collector. We will avoid over-engineering. We are going to spin up a local Collector, configure it to accept standard OTLP data, process it in batches to save network overhead, and export it to the console so you can see exactly what the raw data looks like before adding complex vendor backends.

Prerequisites

To follow along, you will need:

Docker and Docker Compose installed on your machine.

Python 3.9 or higher (for our simple test application).

A basic understanding of YAML and HTTP.

Step 1: Defining the Collector Pipeline

Before we write the configuration file, you need to understand why it is structured this way. The configuration is divided into two main parts: the component definitions and the service pipelines.

First, we define our components:

Receivers: We will open up port 4317 for gRPC and 4318 for HTTP. This is how our app will talk to the Collector.

Processors: We will use the batch processor. Sending a network request for every single trace is a great way to accidentally DDoS your own monitoring system. Batching groups them together.

Exporters: We will use the debug exporter. It simply prints the incoming data to standard output.

After defining them, we must explicitly wire them together in a pipeline. If a component is defined but not added to a pipeline, the Collector ignores it.

Create a file named otel-collector-config.yaml and add the following:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
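Because a component that is defined but never wired into a pipeline is ignored, it can be handy to check for that before deploying. The sketch below does this against a Python dict that mirrors the YAML above; it assumes you have already loaded the file into a dict with a YAML parser of your choice. The extra otlphttp exporter entry is added here on purpose, purely to give the check something to find.

```python
# A dict mirroring the structure of otel-collector-config.yaml.
# "otlphttp" is deliberately defined but not referenced by any pipeline.
config = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}},
    "exporters": {"debug": {}, "otlphttp": {}},
    "service": {
        "pipelines": {
            name: {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["debug"],
            }
            for name in ("traces", "metrics", "logs")
        }
    },
}


def unused_components(config: dict) -> dict:
    """Return components that are defined but not used in any pipeline."""
    pipelines = config.get("service", {}).get("pipelines", {})
    unused = {}
    for kind in ("receivers", "processors", "exporters"):
        defined = set(config.get(kind, {}))
        used = {
            name
            for pipeline in pipelines.values()
            for name in pipeline.get(kind, [])
        }
        leftover = sorted(defined - used)
        if leftover:
            unused[kind] = leftover
    return unused


result = unused_components(config)
print(result)
```

For the dict above this reports the otlphttp exporter, because nothing in the service section refers to it.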

Step 2: Deploying the Infrastructure

Now that we have our blueprint, we need to run the Collector. We will use Docker Compose because it is reproducible and keeps our local environment clean.

We are using the otelcol-contrib image instead of the core image. The contrib version includes a wider array of community-supported receivers and exporters, which you will inevitably need when you move to production.

Create a docker-compose.yml file in the same directory:

version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317" # OTLP gRPC receiver
      - "4318:4318" # OTLP HTTP receiver

Start the infrastructure by running:

docker-compose up -d

Step 3: Generating Telemetry the Hard Way

We could use a massive framework to auto-instrument an application, but that hides the mechanics. To truly understand OpenTelemetry observability, let's write a tiny Python script that manually creates a trace and sends it to our Collector.

Before running the code, you need to install the OpenTelemetry SDK packages.

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

Now, create app.py. Notice how we explicitly define a TracerProvider, attach our OTLP exporter to it, and then wrap a simple unit of work in a start_as_current_span block. This block represents the "context" we talked about earlier.

import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# 1. Define who is sending the data
resource = Resource(attributes={"service.name": "payment-service"})
provider = TracerProvider(resource=resource)

# 2. Configure where the data goes (Our local Collector)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# 3. Create a tracer instance
tracer = trace.get_tracer(__name__)

def process_payment():
    # 4. Wrap our work in a span
    with tracer.start_as_current_span("process_credit_card") as span:
        span.set_attribute("payment.method", "visa")
        print("Processing payment...")
        time.sleep(0.5) # Simulate database call
        span.add_event("Database query completed")
        span.set_attribute("payment.status", "success")

if __name__ == "__main__":
    process_payment()
    # Force flush to ensure data is sent before script exits
    provider.force_flush()

Run the application:

python app.py

Verification

How do we know it worked? We check the logs of our Collector. Since we configured the debug exporter with detailed verbosity, the Collector will print the exact payload it received.

Run the following command to view the Collector's logs:

docker-compose logs otel-collector

You should see a long, structured text dump (the debug exporter prints a human-readable listing rather than JSON). Look closely at it. You will see your service.name (payment-service), the span name (process_credit_card), and the custom attributes you defined, such as payment.method set to visa.

This is the standardized context. Whether you eventually route this to Datadog, New Relic, or an open-source Jaeger instance, the data your application emits stays exactly the same; only the Collector's exporter changes.

Troubleshooting

If you don't see the output, check these common pitfalls:

Port Conflicts: If docker-compose up fails, you might have another service running on port 4317 or 4318. Check your active ports using lsof -i :4317 (Mac/Linux) or netstat -ano | findstr 4317 (Windows) and kill the conflicting process.
Network Routing: If your Python script reports a connection error, ensure you are sending data to http://localhost:4318/v1/traces. When you pass the endpoint to the OTLPSpanExporter constructor, it is used as-is, so the /v1/traces path must be included.
Missing Flush: In short-lived scripts, the program might exit before the BatchSpanProcessor has a chance to send the data over the network. Always call provider.force_flush() before the script terminates.
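To see why the flush matters, here is a toy batcher in plain Python. It is a sketch of the buffering idea only: the ToyBatcher class is hypothetical, it triggers on batch size alone, and it leaves out the background timer a real batching processor would also use.

```python
class ToyBatcher:
    """Buffers items and sends them in groups of max_batch_size."""

    def __init__(self, send, max_batch_size=3):
        self._send = send
        self._max = max_batch_size
        self._buffer = []

    def add(self, item):
        self._buffer.append(item)
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self):
        # Send whatever is buffered, even if the batch is not full.
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


sent = []
batcher = ToyBatcher(sent.append, max_batch_size=3)
for i in range(4):
    batcher.add(f"span-{i}")

before_flush = [list(batch) for batch in sent]
print(before_flush)  # only the one full batch has gone out so far

batcher.flush()
print(sent)  # the leftover span is sent once flush() is called
```

If the script exited right after the loop, the fourth span would never leave the process, which is the situation force_flush() guards against.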

What You Built

You just built a vendor-agnostic observability pipeline. You decoupled your application code from your monitoring backend. If your company decides to switch from a paid tracing vendor to an open-source solution tomorrow, you will not have to touch a single line of Python code. You will simply update the exporters section of your otel-collector-config.yaml and restart the Collector.

Stop letting your tools dictate your architecture. Take control of your telemetry data at the source.

There is no perfect system. There are only recoverable systems.

FAQ

Does the OpenTelemetry Collector add performance overhead?

Yes, but it is minimal and manageable. Running the Collector as a sidecar or a DaemonSet consumes CPU and memory. However, offloading heavier processing, compression, and vendor-specific exporting from your application to the Collector usually improves the performance and stability of your core services.

Can I use OpenTelemetry if I am already locked into a specific vendor?

Absolutely. OpenTelemetry is designed to bridge the gap. You can configure the OTel Collector to receive standard OTLP data from your apps, and use a vendor-specific exporter (like the Datadog or New Relic exporter) to send that data to your current platform. This allows you to standardize your code now, and change vendors later.

Should I prioritize logs, metrics, or traces first?

Start with metrics for alerting (knowing when something is broken). Then implement distributed tracing to map the context (knowing where it is broken). Finally, attach your logs to those traces (knowing why it is broken). Traces act as the skeleton that holds your logs and metrics together.
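"Attaching logs to traces" mostly comes down to stamping every log line with the IDs of the active trace. A minimal standard-library sketch of that idea is shown below. The fake_current_context function and its hard-coded IDs are placeholders; in a real service those values would come from whatever tracing library you use.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Adds trace_id and span_id attributes to every log record."""

    def __init__(self, get_context):
        super().__init__()
        self._get_context = get_context

    def filter(self, record):
        record.trace_id, record.span_id = self._get_context()
        return True  # never drop the record, only annotate it


def fake_current_context():
    # Placeholder IDs for illustration only.
    return "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331"


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    )
)
handler.addFilter(TraceContextFilter(fake_current_context))

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in one place

logger.error("database timeout")
print(stream.getvalue(), end="")
```

With the trace ID on every line, that wall of generic database timeout errors can be filtered down to the single request you are investigating.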

Why use OTLP over HTTP instead of gRPC?

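gRPC is generally preferred for production because it runs over HTTP/2 with long-lived, multiplexed connections, which tends to be more efficient at high volume. Note that OTLP over HTTP also uses Protobuf payloads by default, so the encoding is not the main difference. HTTP is much easier to debug locally, works better through restrictive corporate firewalls and proxies, and is simpler to test with basic tools like curl using OTLP's JSON encoding. Start with HTTP for debugging, move to gRPC for production.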
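As an illustration of how approachable the HTTP flavor is, the sketch below builds a single-span OTLP payload in JSON by hand, using only the standard library. It assumes the Collector from Step 2 is listening on localhost:4318. The scope name manual-otlp-sketch and the helper functions are made up for this example, and the send call is left commented out so the script runs without a Collector.

```python
import json
import secrets
import time
import urllib.request


def build_trace_payload(service_name, span_name, duration_s=0.5):
    """Build a minimal OTLP/JSON trace payload containing one span."""
    end = time.time_ns()
    start = end - int(duration_s * 1e9)
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-otlp-sketch"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # internal span
                    "startTimeUnixNano": str(start),
                    "endTimeUnixNano": str(end),
                    "attributes": [
                        {"key": "payment.method", "value": {"stringValue": "visa"}},
                    ],
                }],
            }],
        }]
    }


def send(payload, endpoint="http://localhost:4318/v1/traces"):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_trace_payload("payment-service", "process_credit_card")
print(json.dumps(payload, indent=2))
# send(payload)  # uncomment once the Collector is running
```

After sending, the span should show up in the Collector's debug output just like the one produced by app.py.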
