
Pragmatic Observability Data Management at Scale

Marcus Cole
Cloud & DevOps Lead
[email protected]
Tags: OpenTelemetry · eBPF telemetry · cloud native observability · Kubernetes monitoring

The Reality Check: Drowning in the Shallow End

You know the feeling. It is 3:14 AM. The pager on your nightstand is screaming. You stumble to your laptop, eyes burning from the sudden screen glare, and open your logging dashboard. The Kubernetes cluster is failing. You type in a query to find the root cause, hit enter, and... wait.

The browser tab spins. And spins.

It spins because your dashboard is choking on 400 gigabytes of INFO: Health check OK messages generated in the last twenty minutes. Your system did not just crash; it drowned in its own telemetry.

As modern rapid-development tools accelerate how fast we can ship code, they are also accelerating how fast we generate logs, metrics, and traces. We are shipping more microservices than ever before. The result is a massive crisis in observability data management. We have reached a point where the infrastructure required to monitor our applications often costs more—and is more complex—than the infrastructure running the actual business.

Looking at the agenda for the upcoming KubeCon + CloudNativeCon Europe 2026 Observability Day, it is obvious that the industry is feeling this pain. The community is no longer obsessed with collecting more data. The focus has entirely shifted to cost efficiency, scale, and surviving the data we already have.

We have built incredibly complex CI/CD pipelines and distributed meshes, but we treat our telemetry like a landfill. We just dump everything into a SaaS vendor's bucket and hope we can search it later. This is not observability. This is hoarding.

The Core Problem: Signal vs. Noise

The real bottleneck in our infrastructure is not our network throughput or our CPU limits. The bottleneck is our capacity to extract signal from noise during an outage.

When you decouple your architecture into dozens of microservices, a single user request might traverse fifteen different containers. If every container logs every step of its journey, a minor traffic spike multiplies into a flood of data: one request now produces fifteen services' worth of logs, traces, and metrics.

Historically, we solved this by asking developers to carefully curate their log levels. But let's be pragmatic: developers are under pressure to ship features. When a bug occurs in production, the immediate reaction is to add more debug logs. Those logs rarely get removed. Over time, the baseline volume of "normal" system chatter grows until it overwhelms the operators tasked with keeping the lights on.

The core problem is that we have tightly coupled data generation with data storage. If an application emits a log, we assume it must be shipped across the internet and stored on an expensive solid-state drive for thirty days.

We need to break that assumption.

Under the Hood: The Harbor Master and the Water Treatment Plant

Before we look at the modern tools promising to fix this, we need to understand what is actually happening underneath the abstractions. Let's look at the two technologies fundamentally changing how we handle this sprawl: eBPF and OpenTelemetry.

eBPF: The Harbor Master

For decades, the "hard way" to get metrics out of an application was to write code. If you wanted to know how long a database query took, you imported a library, wrapped your function in a timer, and pushed the metric to a server. If you had applications written in Java, Go, Python, and Node.js, you had to maintain instrumentation libraries for all four languages.
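To make that "hard way" concrete, here is a minimal sketch of manual instrumentation in Python: a decorator that times a function and hands the duration to a metrics client. The `record_metric` stub is hypothetical — it stands in for whatever client library your vendor ships.

```python
import time

def record_metric(name: str, value: float) -> None:
    """Hypothetical stand-in for a vendor SDK call that ships a metric."""
    print(f"metric {name}={value:.4f}s")

def timed(fn):
    """Manual instrumentation: wrap the function in a timer, push the result."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        record_metric(fn.__name__, elapsed)
        wrapper.last_elapsed = elapsed  # exposed for inspection
        return result
    return wrapper

@timed
def run_query():
    time.sleep(0.01)  # stand-in for a database round trip
    return "rows"

run_query()
```

Now multiply this wrapper by every language in your stack — that is the maintenance burden eBPF is designed to remove.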

Extended Berkeley Packet Filter (eBPF) changes this by moving the observation point to the Linux kernel.

Think of your Linux server as a busy commercial harbor. The applications are the shipping companies, and the network packets are the cargo trucks.

Traditional instrumentation is like forcing every shipping company to hire an inspector to open every box before it leaves the warehouse, fill out a form, and mail it to the harbor master. It is slow, expensive, and requires everyone to speak the same language.

eBPF is different. eBPF allows us to put a security camera right at the harbor's main toll booth (the kernel). Because every network request, file read, and memory allocation must eventually pass through the kernel to interact with the hardware, eBPF can observe the traffic as it flows by. It does not care what language the application is written in. It just watches the trucks pass the gate.

[Diagram: Traditional vs. eBPF telemetry. The traditional user-space path routes app code (Java/Go/Python) through a vendor SDK or agent before it reaches the network; the eBPF kernel-space path leaves the app code untouched and observes it via an eBPF program in the Linux kernel.]

OpenTelemetry: The Water Treatment Plant

If eBPF is how we gather data without touching code, OpenTelemetry (OTel) is how we process it.

Think of your raw telemetry as municipal plumbing. Raw data flows out of your applications like water. In the past, we piped this raw data directly into our storage backend (the ocean).

The OpenTelemetry Collector acts as a water treatment plant. It sits between your applications and your storage vendor. It receives the raw data, filters out the waste, samples the clean water, and routes it to the right destination.

The Pragmatic Solution: Stop Hoarding, Start Filtering

The most stable, fundamentals-focused approach to observability data management is to implement a strict telemetry pipeline at the edge of your infrastructure. The goal is simple: drop useless data before it leaves your network.

1. The "Why" Behind Dropping Data

Before I show you how to configure a filter, let's talk about why we need it.

In a standard Kubernetes cluster, the kubelet performs health checks (liveness and readiness probes) on every pod, usually every 10 seconds. If you have 100 pods, that is 10 health checks a second, or 864,000 requests a day.

If your application logs every HTTP 200 OK response, you are paying your observability vendor to store nearly a million useless logs a day just to prove your app is breathing. You do not need to store health checks unless they fail.

2. Implementing the Filter

Using the OpenTelemetry Collector, we can intercept the logs, look at the HTTP path, and drop the data entirely if it is a successful health check. We do this using a filter processor.

Here is what that looks like in practice. We define a processor that targets our log stream. We tell it to look at the log body. If the log contains the path /healthz and the status code is 200, we drop the record.
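A minimal sketch of that filter in OpenTelemetry Collector configuration (contrib distribution). The OTTL condition assumes the probe path appears in the log body and that your instrumentation records the status under an attribute like `http.status_code`; adjust both to match what your logs actually contain.

```yaml
processors:
  filter/drop-healthchecks:
    error_mode: ignore
    logs:
      log_record:
        # Drop any log record for a successful /healthz probe.
        # The attribute name depends on your instrumentation: older
        # semantic conventions use http.status_code, newer ones use
        # http.response.status_code.
        - IsMatch(body, ".*/healthz.*") and attributes["http.status_code"] == 200

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [filter/drop-healthchecks]
      exporters: [otlphttp]
```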

By placing this logic in the OTel Collector, your application developers do not have to change a single line of code. They can keep logging everything, but the infrastructure acts as a pragmatic gatekeeper, protecting the system from noise and protecting the business from massive storage bills.

3. Comparing the Approaches

Let's look at how this modern, decoupled approach compares to the traditional way we've been doing things for the last decade.

| Feature | Traditional Instrumentation | eBPF + OpenTelemetry Pipeline |
| --- | --- | --- |
| Code changes required | High (must import SDKs per language) | Zero to minimal (kernel-level observation) |
| Data filtering | Done at the vendor (post-ingestion cost) | Done at the edge (pre-ingestion savings) |
| Vendor lock-in | High (proprietary agents) | None (open standards) |
| Performance overhead | Moderate (runs in user space) | Low (highly optimized kernel sandbox) |
| Operator burden | High (managing conflicting agents) | Low (single unified collector pattern) |

[Diagram: The pragmatic telemetry pipeline. Raw data from apps and eBPF flows into the OpenTelemetry Collector (receive → filter → export) and on to actionable storage.]

What You Should Do Next

Do not try to boil the ocean. If you are struggling with observability data management, take these concrete steps this week:

1. Audit Your Top 10 Metrics: Log into your observability platform and look at the ingestion volume. Identify the top 10 most frequent log messages or metrics. I guarantee at least three of them are completely useless (like health checks or debug chatter).
2. Deploy an OpenTelemetry Collector: Put an OTel collector in your cluster. Do not migrate all your apps at once. Just route your highest-volume application through the collector and implement a simple drop filter for the noise.
3. Experiment with eBPF: If you have legacy applications that are "black boxes" because nobody wants to touch the ten-year-old code, drop an eBPF daemonset onto the nodes. You will instantly get network-level metrics (latency, error rates) without a single code change.
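As a sketch of step 3: eBPF agents typically ship as a DaemonSet, so exactly one privileged pod runs per node. The manifest below is a hypothetical skeleton — `example.com/ebpf-agent:latest` is a placeholder image, not a real project; substitute whichever agent you adopt and follow its documented permission requirements.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: ebpf-agent
  template:
    metadata:
      labels:
        app: ebpf-agent
    spec:
      hostPID: true                              # observe processes on the node
      containers:
        - name: agent
          image: example.com/ebpf-agent:latest   # placeholder image
          securityContext:
            privileged: true                     # most eBPF agents need kernel access
          volumeMounts:
            - name: sys-kernel
              mountPath: /sys/kernel
              readOnly: true
      volumes:
        - name: sys-kernel
          hostPath:
            path: /sys/kernel
```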

Frequently Asked Questions

Does eBPF add performance overhead to the kernel?
Any observation adds overhead, but eBPF is incredibly lightweight. The programs are compiled to run safely in a restricted kernel sandbox. Compared to a traditional sidecar proxy or an in-app agent that consumes CPU cycles in user space, eBPF is vastly more efficient.
Do I have to rip out my existing logging vendor to use OpenTelemetry?
No. That is the beauty of the OTel Collector. It can receive data in almost any format (Prometheus, Jaeger, Fluentd) and export it to almost any vendor. You can place it in the middle of your existing pipeline to filter data before it reaches your current vendor.
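As a sketch of that "collector in the middle" pattern: the fragment below receives Jaeger traces and Fluentd-style forwarded logs and passes everything through to an existing vendor over OTLP/HTTP. The endpoint is a placeholder, and the receiver names assume the contrib distribution of the Collector.

```yaml
receivers:
  jaeger:
    protocols:
      grpc:
  fluentforward:
    endpoint: 0.0.0.0:8006

exporters:
  otlphttp:
    endpoint: https://ingest.example-vendor.com   # placeholder vendor endpoint

service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [otlphttp]
    logs:
      receivers: [fluentforward]
      exporters: [otlphttp]
```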
What is tail-based sampling, and should I use it?
Tail-based sampling is a strategy where you wait until a transaction is completely finished before deciding whether to keep its telemetry. If the transaction was fast and successful, you drop it. If it was slow or threw an error, you keep it. It is highly recommended for reducing trace volume, but it requires enough memory in your OTel Collector to buffer the data while the transaction completes.
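A minimal sketch of tail-based sampling with the Collector's `tail_sampling` processor (contrib distribution): buffer each trace for the decision window, keep anything that errored or ran slow, and sample a small share of the rest. The thresholds here are illustrative, not recommendations.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer window; size this to your memory budget
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

A trace is kept if any policy matches, so errors and slow requests always survive while routine traffic is heavily thinned.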
Is traditional instrumentation dead?
Not at all. eBPF is fantastic for baseline metrics (the "what"), but if you need deep business logic context—like knowing why a specific user ID failed a checkout process—you still need developers to write structured logs in the application code. Use eBPF for the baseline, and manual instrumentation for the business logic.

*

There is no perfect system. There are only recoverable systems.
