Migrating from AWS App Runner to ECS Fargate with OpenTelemetry

The Reality Check
If you've been working in cloud infrastructure long enough, you know the feeling. It's a random Tuesday, you're sipping your morning coffee, and an email drops into your inbox: a managed service you rely on is being deprecated.
This week, AWS announced that App Runner is moving to maintenance mode and will stop accepting new customers. WorkMail is being shut down entirely. If you built your deployment pipelines around App Runner's "magic" abstraction layer, you are now on the clock to find a new home for your containers.
I understand the frustration. When you're the operator responsible for keeping the lights on, the last thing you want is forced migration work. Managed Platform-as-a-Service (PaaS) offerings like App Runner promise to hide the complexity of infrastructure. They tell us, "Just give us your code, and we'll handle the rest." But abstraction is essentially a loan. It buys you speed today, but eventually, the bill comes due. When the vendor changes their roadmap, you are the one left holding the pager at 3 AM, trying to figure out how to move production traffic without causing an outage.
The Core Problem
The real bottleneck here isn't just moving a Docker container from Point A to Point B. The core problem is the loss of built-in tooling.
When you use a highly abstracted service, you rely on its native dashboards for logs and metrics. As we migrate away from App Runner to a more fundamental, stable primitive like Amazon Elastic Container Service (ECS) with AWS Fargate, we lose that out-of-the-box visibility. We are taking back control of our infrastructure, but we must also take back ownership of our observability.
Without proper tracing, debugging a distributed system is like trying to diagnose a plumbing leak in a skyscraper without a blueprint. You know water is pooling in the lobby, but you have no idea which pipe on which floor burst.
Fortunately, the open-source community is standardizing. Jaeger, the widely used distributed tracing system, has just adopted OpenTelemetry at its core to solve observability gaps across modern architectures. This is our pragmatic path forward: we move our workloads to stable, boring infrastructure (ECS Fargate) and implement vendor-neutral observability (OpenTelemetry + Jaeger).
Under the Hood: The Restaurant Kitchen
Before we look at configuration files, let's understand how these components interact. Think of your cloud infrastructure as a busy restaurant kitchen.
- The Application Load Balancer (ALB) is the waiter. It takes requests from customers and hands the tickets to the kitchen.
- The ECS Fargate Task is the cooking station. It's where the actual work happens.
- The Application Container is the chef. It processes the request, queries the database, and prepares the response.
- The OpenTelemetry (OTel) Sidecar Container is the expeditor. It stands next to the chef, holding a stopwatch and a clipboard. It records exactly when the order arrived, how long the meat took to cook, and when the plate left the station.
- Jaeger is the restaurant manager's office. The expeditor sends all the clipboard notes here, where they are organized into a timeline so the manager can see exactly why Table 4's order was delayed.
The Pragmatic Solution: Step-by-Step Tutorial
Let's build this. We are going to migrate a standard web application to ECS Fargate and configure an OpenTelemetry sidecar to send traces to a Jaeger backend.
Prerequisites
Before we begin, ensure you have the following ready:
- An AWS account with administrative access.
- The AWS CLI installed and configured on your local machine.
- Docker installed locally.
- A basic understanding of AWS IAM (Identity and Access Management).
- An existing Jaeger instance running (for this tutorial, we assume you have Jaeger accessible via a URL, e.g., http://jaeger-collector.internal:4317).
Step 1: Prepare the OpenTelemetry Collector Configuration
Why we are doing this: The OTel Collector needs instructions. It needs to know where to receive data from our application (receivers), how to process that data (processors), and where to send it (exporters). We define this in a simple YAML file.
Create a file named otel-config.yaml:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: "jaeger-collector.internal:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```
Explanation: We open ports 4317 (gRPC) and 4318 (HTTP) to listen for traces from our app container. The batch processor groups the traces together so we don't overwhelm the network with tiny requests. Finally, the exporter sends the batched data to our Jaeger backend.
Step 2: Push the Custom Collector Image to ECR
Why we are doing this: ECS needs to pull container images from a registry. We need to package our otel-config.yaml into the official OpenTelemetry Collector image and push it to Amazon Elastic Container Registry (ECR).
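If you don't already have an ECR repository for the collector image, create one first (the repository name and region below are assumptions; adjust them to your setup):

```shell
# Create an ECR repository to hold the custom collector image
aws ecr create-repository \
  --repository-name my-otel-collector \
  --region us-east-1
```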
Create a Dockerfile for the collector:
```dockerfile
FROM otel/opentelemetry-collector-contrib:latest
COPY otel-config.yaml /etc/otelcol-contrib/config.yaml
CMD ["--config", "/etc/otelcol-contrib/config.yaml"]
```
Build and push this image to your AWS ECR repository:
```shell
# Authenticate Docker to your ECR registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Build the image
docker build -t my-otel-collector .

# Tag and push
docker tag my-otel-collector:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-otel-collector:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-otel-collector:latest
```
Step 3: Define the ECS Task Definition
Why we are doing this: The Task Definition is the blueprint for our Fargate deployment. It tells AWS which containers to run, how much CPU and memory to allocate, and how the containers should talk to each other. Notice how we define two containers in the containerDefinitions array: our main app and our OTel sidecar.
Create a file named task-def.json:
```json
{
  "family": "my-migrated-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "OTEL_EXPORTER_OTLP_ENDPOINT",
          "value": "http://localhost:4318"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-migrated-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "app"
        }
      }
    },
    {
      "name": "otel-collector",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-otel-collector:latest",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-migrated-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "otel"
        }
      }
    }
  ]
}
```
Explanation: The magic here happens in the environment variables of the my-app container. We set OTEL_EXPORTER_OTLP_ENDPOINT to http://localhost:4318. Because both containers share the same network namespace in a Fargate task, localhost routes directly to the sidecar container. No complex networking required.
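On the application side, most OpenTelemetry SDKs read this endpoint from the environment automatically. As a sketch, here is what that looks like with zero-code instrumentation for a Python app (the `opentelemetry-instrument` launcher comes from the `opentelemetry-distro` package; the app filename is an assumption):

```shell
# The OTel SDK picks these variables up without any code changes
export OTEL_SERVICE_NAME="my-app"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"

# Launch the app with auto-instrumentation (Python shown; other languages
# ship equivalent agents that honor the same environment variables)
opentelemetry-instrument python app.py
```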
Step 4: Register and Deploy
Why we are doing this: Now that we have our blueprint, we register it with AWS and tell ECS to run it as a continuous service.
```shell
# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-def.json

# Create the service (assuming you have a cluster and subnets ready)
aws ecs create-service \
  --cluster my-cluster \
  --service-name my-app-service \
  --task-definition my-migrated-app \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abcde123],securityGroups=[sg-abcde123],assignPublicIp=ENABLED}"
```
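Once the service is created, you can watch the rollout from the CLI (cluster and service names match the command above):

```shell
# Confirm both tasks reach RUNNING state
aws ecs describe-services \
  --cluster my-cluster \
  --services my-app-service \
  --query "services[0].{desired:desiredCount,running:runningCount,status:status}"
```

Note that `create-service` as shown does not attach a load balancer; if you front the service with an ALB, add a `--load-balancers` argument pointing at your target group.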
Verification
How do we know this actually works?
1. Generate some traffic to your application via your Load Balancer's DNS name.
2. Open your Jaeger UI.
3. In the left sidebar, select your service name from the dropdown and click "Find Traces."
4. You should see waterfall graphs showing the exact lifecycle of your HTTP requests.
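For step 1, a quick loop from your terminal is enough (the ALB DNS name below is a placeholder; substitute your own):

```shell
# Fire 20 requests at the load balancer to produce spans
ALB_DNS="my-alb-1234567890.us-east-1.elb.amazonaws.com"
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "request $i -> HTTP %{http_code}\n" "http://$ALB_DNS/"
done
```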
If you see the traces, congratulations. You have successfully migrated away from a deprecated PaaS and established a robust, vendor-neutral observability pipeline.
Troubleshooting
When things break—and they will—here is where you should look first.
Problem: The ECS Task fails to start (it transitions from PENDING straight to STOPPED).
Fix: Check your IAM roles. The executionRoleArn needs permissions to pull images from ECR and write logs to CloudWatch. Ensure the AmazonECSTaskExecutionRolePolicy is attached to the role.
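The stopped reason usually names the exact permission or image pull that failed. You can pull it out of the task metadata (cluster name matches the earlier commands; the task ARN is a placeholder):

```shell
# List recently stopped tasks, then inspect why one of them stopped
aws ecs list-tasks --cluster my-cluster --desired-status STOPPED

aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn> \
  --query "tasks[0].{reason:stoppedReason,containers:containers[*].{name:name,reason:reason}}"
```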
Problem: The application runs, but no traces appear in Jaeger.
Fix: This is almost always a networking issue.
1. Verify that the application is actually sending data to localhost:4318. Check the application logs in CloudWatch.
2. Verify that the OTel Collector sidecar can reach the Jaeger backend. If Jaeger is in another VPC or subnet, check the Security Group attached to the ECS task. It must allow outbound traffic on port 4317 to the Jaeger instance's IP address.
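You can sanity-check the egress rules from the CLI (the security group ID matches the one passed to `create-service`):

```shell
# Dump the outbound rules for the task's security group
aws ec2 describe-security-groups \
  --group-ids sg-abcde123 \
  --query "SecurityGroups[0].IpPermissionsEgress"
```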
Problem: The OTel Collector container exits immediately.
Fix: A typo in otel-config.yaml will cause the collector to crash on startup. Look at the CloudWatch logs for the otel stream. The OpenTelemetry collector is very strict about YAML formatting.
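You can catch most of these before deploying by validating the config locally (recent collector releases ship a `validate` subcommand):

```shell
# Validate otel-config.yaml without starting the pipeline
docker run --rm \
  -v "$(pwd)/otel-config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest \
  validate --config=/etc/otelcol-contrib/config.yaml
```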
What You Built
You replaced a black-box, deprecated managed service with a predictable, standard architecture. You decoupled your application logic from your telemetry routing using the sidecar pattern. Most importantly, you regained control over your operational visibility using OpenTelemetry and Jaeger.
There is no perfect system. There are only recoverable systems.
FAQ
Is moving to ECS Fargate more expensive than App Runner?
Generally, no. App Runner charges a premium for the managed abstraction layer. While Fargate bills you for raw vCPU and memory usage, you have much finer control over scaling policies and resource allocation, which often results in lower overall costs for sustained workloads.
Why use OpenTelemetry instead of AWS X-Ray natively?
To avoid vendor lock-in. OpenTelemetry has become the industry standard. By instrumenting your code with OTel and using a collector, you can send your traces to Jaeger today, AWS X-Ray tomorrow, and Datadog next week, all without changing a single line of application code.
Does the sidecar container consume a lot of resources?
The OpenTelemetry Collector is written in Go and is highly efficient. For most standard web workloads, it consumes less than 50MB of RAM and negligible CPU. However, you should factor this small overhead into your ECS Task CPU and memory allocations.
What happens if the OTel sidecar crashes?
Because we marked the sidecar as "essential": true in the task definition, a sidecar crash causes ECS to stop the entire task, and the service scheduler then launches a fresh, healthy replacement. This ensures you never have "zombie" tasks running without observability.