☁️ Cloud & DevOps

Pragmatic Observability: Surviving Modern DevOps Sprawl

Marcus Cole
Cloud & DevOps Lead

Platform engineer who's been through every infrastructure era — bare metal, VMs, containers, serverless. Has strong opinions about YAML files and even stronger opinions about over-engineering.

Tags: OpenTelemetry, system reliability, reduce toil, platform engineering, DevOps pipelines

It's 3:14 AM. Your phone vibrates off the nightstand. The PagerDuty alert is screaming something about High Latency on Checkout Service, but by the time you flip open your laptop and authenticate through the VPN, the alert has resolved itself. Ten minutes later, it fires again.

You log into a vendor dashboard that costs your company more than your annual salary. You are greeted by a galaxy of red dots, a web of microservices that looks like a bowl of spaghetti, and absolutely zero actionable answers.

This is the reality of modern DevOps observability. We have built systems so distributed and complex that no single human understands them entirely. In response, the industry's default behavior is to buy more tools, add more layers of abstraction, and hope that some proprietary magic will make the pain go away.

But magic doesn't survive contact with production. The best code is code you don't write, and the best operational strategy is the one that requires the least amount of cognitive load when everything is on fire.

Today, we are looking at the current state of platform engineering, the rise of OpenTelemetry, and how we can manage operations without losing our minds.

The Reality Check: Drowning in Abstractions

As highlighted by recent discussions leading up to the DevOps Experience 2026 conference, the DevOps ecosystem is entering one of its most consequential phases. Teams are confronting massive tool sprawl across pipelines, platform engineering portals, and cloud-native infrastructure.

We are trying to govern an ever-expanding web of tools. We add service meshes to manage network traffic, then we add tools to manage the service mesh, and then we add dashboards to monitor the tools that manage the mesh.

Every time we introduce a new piece of technology to "simplify" our lives, we often just shift the complexity from the application code into the infrastructure configuration. When that infrastructure breaks, the failure modes are spectacular, silent, and deeply hidden beneath layers of vendor-specific agents.

The Core Problem: Tightly Coupled Telemetry

The real bottleneck in our infrastructure isn't the lack of data; it's the lack of context and the tight coupling of our telemetry to specific vendors.

As Martin Thwaites pointed out in his recent talk at GOTO Copenhagen, observability must evolve alongside our architectures. We have spent the last decade breaking apart our monoliths into serverless functions, event-driven architectures, and cell-based deployments. We decoupled our compute, our storage, and our networks.

Yet, inexplicably, we kept our telemetry tightly coupled.

If you want to use Vendor A for metrics, you install Vendor A's proprietary agent. If you want to use Vendor B for distributed tracing, you import Vendor B's proprietary SDK into your application code. You are hardcoding your operational decisions into your business logic.

When you treat observability as a product you buy rather than a fundamental property of the system you build, you end up with fragmented data. You have logs in one system, metrics in another, and traces in a third. When the system crashes at 3 AM, you are forced to be the human join-table, manually correlating a spike in CPU from one dashboard with an error log in another.

Under the Hood: The Plumbing of OpenTelemetry

To fix this, we need to look under the hood at how data actually moves through a system. Let's talk about OpenTelemetry (OTel), which is rapidly becoming the industry standard.

OpenTelemetry is not a backend. It is not a database, and it does not have a fancy UI. OpenTelemetry is the plumbing.

Think of your infrastructure like a city's water system. Your applications are the reservoirs generating water (telemetry data). For years, every vendor forced you to use their specific, patented pipes to get the water to their specific treatment plant. If you wanted to change treatment plants, you had to rip out all the pipes in the city.

OpenTelemetry is standard PVC piping. It provides a unified set of APIs, SDKs, and a Collector to generate, process, and export telemetry data (metrics, logs, and traces) to any destination.

[Diagram: Application Code → OTel Collector → Metrics DB / Tracing Tool / Log Storage]

The Restaurant Kitchen Analogy

Think of the OTel Collector like the ticket rail in a busy restaurant kitchen.

When a waiter takes an order (your application generating telemetry), they don't walk over to the grill and tell the chef how to cook the steak, then walk over to the salad station and explain the dressing. That would be tightly coupled, inefficient, and chaotic.

Instead, the waiter writes the order in a standard format and clips it to the metal rail above the counter.

The rail is the OpenTelemetry Collector. It sits in the middle. The grill chef (your metrics database) looks at the rail and pulls the information they need. The fry cook (your distributed tracing backend) pulls what they need. If management decides to fire the grill chef and hire a new one (switching vendors), the waiter doesn't have to change how they write tickets. The standard remains the same.
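In Collector configuration terms, "firing the grill chef" is a one-block change. A minimal sketch, assuming a hypothetical new vendor's OTLP endpoint:

```yaml
exporters:
  otlp/newvendor:
    endpoint: "api.newvendor.example:443"  # the only line that changes

service:
  pipelines:
    traces:
      receivers: [otlp]           # the "ticket" format stays the same
      processors: [batch]
      exporters: [otlp/newvendor] # only the consumer is swapped
```

The applications writing the tickets are never touched or redeployed; the vendor swap is contained entirely within the Collector's config.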

Understanding the OTel Pipeline

Before you look at any YAML configuration files, you need to understand the 'Why' behind the OTel Collector's design. It operates on a simple pipeline model consisting of three phases:

1. Receivers: How data gets in. This could be accepting data via OTLP (OpenTelemetry Protocol), scraping Prometheus endpoints, or tailing log files.
2. Processors: What happens to the data in transit. This is where the magic of pragmatism happens. You can batch data, filter out noisy health-check logs, or scrub Personally Identifiable Information (PII) before it leaves your network.
3. Exporters: How data gets out. This translates the standard OTel format into whatever proprietary format your chosen backend requires.

Because of this decoupled design, your application only ever talks to the local Collector.

Here is what a basic, pragmatic pipeline configuration looks like. Notice how clearly the responsibilities are separated:

receivers:
  otlp:
    protocols:
      grpc: # Applications send data here

processors:
  batch:
    timeout: 1s # Group data together to save network calls
  attributes/scrub:
    actions:
      - key: "user.email"
        action: delete # Never send PII to third-party vendors

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889" # Send metrics here
  otlp/vendor:
    endpoint: "api.vendor.com:443" # Send traces here

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/scrub, batch]
      exporters: [otlp/vendor]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Managing Operations: Reducing Toil

Having standard pipes is great, but it doesn't automatically fix broken operations. As modern DevOps teams look to manage operations using modern technology, the focus must shift from "automating everything" to "reducing toil."

Toil is the manual, repetitive, tactical work tied to running a production service that scales linearly as the service grows. Restarting a stuck pod is toil. Manually provisioning database credentials for a new developer is toil.

The pragmatic approach to operations isn't to build a hyper-complex, self-healing system that you don't understand. The pragmatic approach is to build self-service paved roads for developers and establish clear boundaries.

Vendor-Driven vs. Pragmatic Operations

| Feature | Vendor-Driven Approach | Pragmatic Standard (OTel + Self-Service) |
| --- | --- | --- |
| Instrumentation | Proprietary agents injected into code | Open standard SDKs, vendor-agnostic |
| Data Ownership | Vendor owns the format and retention | You own the data pipeline and routing |
| Cost Control | Pay for ingestion of all data | Filter and sample at the Collector level |
| Incident Response | Search through 5 different vendor UIs | Query a unified dataset with shared context |
| System Growth | Requires buying more agent licenses | Scales horizontally with standard infrastructure |

When you decouple your telemetry and focus on standardized operations, you regain control. You can route your high-value trace data to an expensive analytics tool, while routing your noisy, low-value debug logs to cheap, cold storage. You make decisions based on engineering needs, not vendor constraints.
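That split is just two pipelines in the Collector config. A sketch, assuming a hypothetical paid analytics endpoint and using the `file` exporter (from the Collector contrib distribution) as a stand-in for cheap storage:

```yaml
exporters:
  otlp/analytics:
    endpoint: "api.analytics.example:443"  # hypothetical: expensive analytics tool
  file/cold:
    path: /var/otel/debug-logs.json        # cheap local or cold storage

service:
  pipelines:
    traces:                          # high-value traces -> paid tool
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/analytics]
    logs:                            # noisy, low-value logs -> cheap storage
      receivers: [otlp]
      processors: [batch]
      exporters: [file/cold]
```

Changing the routing later is an edit to this file, not a re-instrumentation project.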

[Diagram: App (waiter) → Standard Ticket Rail (OTel) → Grill / Fryer]

What You Should Do Next

Stop buying dashboards to solve cultural and architectural problems. If your system is a mess, a more expensive monitoring tool will just give you a higher-resolution picture of your mess.

Instead, take these pragmatic steps:

1. Standardize on OpenTelemetry: Stop importing vendor-specific SDKs into your application code. Instrument your code with the OTel SDK. It is the safest, most future-proof technical decision you can make today.
2. Deploy the OTel Collector as a Gateway: Put an OTel Collector between your applications and the internet. Use it to scrub PII, drop noisy health checks, and control your outbound data costs.
3. Define a Shared Vocabulary: Ensure every service tags its telemetry with standard attributes (e.g., service.name, environment, tenant.id). Good telemetry is about consistent naming, not just volume.
4. Target Toil, Not Uptime: 100% uptime is a myth. Focus your operational efforts on reducing the manual toil required to recover from a failure, rather than trying to build an impossible system that never fails.
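Steps 2 and 3 translate directly into Collector processors. A sketch, assuming a `/healthz` health-check endpoint and an `environment` attribute you want enforced fleet-wide (the attribute names are illustrative):

```yaml
processors:
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["url.path"] == "/healthz"'  # drop health-check spans
  resource/common:
    attributes:
      - key: environment
        value: production
        action: upsert  # add the tag if missing, overwrite if inconsistent
```

Add both to the relevant pipelines (e.g. `processors: [filter/healthchecks, resource/common, batch]`) so the shared vocabulary is enforced centrally rather than re-implemented in every service.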

FAQ

Why shouldn't I just use the agent provided by my monitoring vendor?
Vendor agents tightly couple your infrastructure to their pricing model and feature set. If you ever want to change vendors, or send metrics to one system and traces to another, you have to rip out and replace the agent across your entire fleet. OpenTelemetry gives you control over your own data routing.

Does OpenTelemetry replace Prometheus or Jaeger?
No. OpenTelemetry is the instrumentation and delivery mechanism (the pipes). Prometheus (metrics storage) and Jaeger (trace visualization) are the backends (the treatment plants). OTel collects the data and delivers it to those backends.

How does this approach help with serverless architectures?
In serverless, you don't control the underlying host, making traditional infrastructure monitoring impossible. OpenTelemetry focuses on distributed tracing: following a request as it hops from an API gateway, to a Lambda function, to a database. This gives you visibility into the transaction itself, regardless of where the compute lives.

What exactly is 'toil' in DevOps?
Toil is work that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly as a service grows. Reviewing logs manually to find an error is toil; having the system automatically extract the error and attach it to the alert ticket is a pragmatic operation.


There is no perfect system. There are only recoverable systems.

