⚙️ Dev & Engineering

How to Build a Scalable Async Data Pipeline in TypeScript

Chloe Chen
Dev & Engineering Lead

Full-stack engineer obsessed with developer experience. Thinks code should be written for the humans who maintain it, not just the machines that run it.

web scraping at scale · concurrency limits · developer experience · Node.js performance · TypeScript tutorial

We've all stared at a terminal waiting for a script to finish processing 10,000 API calls, downing coffee and wondering if there's a better way.

When we first build an async data pipeline—whether it's for web scraping at scale, migrating databases, or syncing third-party APIs—we usually start with a simple loop. Then, to make it faster, we throw Promise.all() at it. Suddenly, our memory spikes, the API rate-limits us into oblivion, and our app crashes with an intimidating EMFILE: too many open files error.

Scaling from 1,000 to 10 million pages isn't just about writing a script; it's an engineering challenge that stresses network I/O, CPU, and memory. But don't worry! Shall we solve this beautifully together? ✨

Today, we are going to build a resilient, scalable async data pipeline in TypeScript. We will focus equally on raw backend performance and Developer Experience (DX)—because the code you write should be a joy for your fellow developers to read and use.

The Mental Model: The Assembly Line

Before we look at code, let's visualize the problem.

Imagine a busy coffee shop.

Synchronous processing is having exactly one barista. They take an order, grind the beans, brew the espresso, and hand it to the customer before even looking at the next person in line. It's safe, but incredibly slow. 95% of the time is spent just waiting for the water to boil (Network I/O).

Unbounded Asynchronous processing (Promise.all) is taking 10,000 orders simultaneously and trying to brew them all at the exact same millisecond. The espresso machine explodes, the power grid trips, and the coffee shop burns down (Heap Out of Memory / Rate Limits).

A Bounded Async Queue—which we are building today—is a well-oiled assembly line. We have a set number of workers (concurrency limit). As soon as one worker finishes a task, they immediately grab the next one from the queue.

[Diagram: tasks flow from the Task Queue to Worker 1, Worker 2, and Worker 3, which each call the External API]

Let's turn this mental model into elegant code.

Prerequisites

Before we dive in, ensure you have the following ready:

  • Node.js v20+: We want to take advantage of modern native fetch and standard library features.

  • TypeScript: We'll be using strict typing to ensure our pipeline is robust.

  • A basic understanding of Promises: You should be comfortable with async/await.


Step 1: The Naive Approach (And Why It Fails)

To understand the solution, we must deeply understand the problem. When developers first try to optimize a synchronous loop, they usually write something like this:

// ❌ The "Crash My Server" Approach
async function fetchAllData(urls: string[]) {
  // This fires 100,000 requests at the EXACT same time.
  const promises = urls.map(url => fetch(url).then(res => res.json()));
  return await Promise.all(promises);
}

Why is this bad DX and bad UX?
1. OS Limits: Your operating system has a limit on how many open file descriptors (sockets) it allows at once. Firing 100k requests will throw an EMFILE error.
2. Memory Exhaustion: Node.js has to hold all 100,000 pending Promise objects in the V8 heap memory. Your RAM usage will spike vertically until the process crashes.
3. DDoS-ing your target: If you are web scraping at scale, hitting a server with 100k simultaneous requests will either get your IP banned instantly or take down their server.

[Chart: memory usage vs. requests processed — the Promise.all() line spikes until it crashes, while the Bounded Queue line stays flat and stable]

Step 2: Building the Concurrency Limiter

Instead of firing everything at once, we need a mechanism that says: "Only process 20 items at a time. When one finishes, pull the next one."

We can achieve this elegantly using a Semaphore pattern. A semaphore acts like a bouncer at a club—it only lets a new task in when an existing task leaves. Here is a beautifully clean implementation that your fellow developers will love using.

// limiter.ts
export class ConcurrencyLimiter {
  private maxConcurrent: number;
  private activeCount: number = 0;
  private queue: Array<() => void> = [];

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  async acquire(): Promise<void> {
    if (this.activeCount < this.maxConcurrent) {
      this.activeCount++;
      return Promise.resolve();
    }

    // If we are at capacity, return a Promise that resolves 
    // ONLY when the release() method is called.
    return new Promise(resolve => {
      this.queue.push(resolve);
    });
  }

  release(): void {
    this.activeCount--;
    if (this.queue.length > 0) {
      this.activeCount++;
      const nextResolve = this.queue.shift();
      if (nextResolve) nextResolve();
    }
  }
}

Why this code is great (The DX perspective):
Notice how we abstract the complexity of queue management into a simple acquire and release flow. The developer using this class doesn't need to know about the internal array shifting or promise resolution. They just ask for permission to proceed, and release the lock when done.

Step 3: Creating the Resilient Worker

Now that we have our bouncer (ConcurrencyLimiter), let's build the actual worker.

When building an async data pipeline at scale, things will fail. Networks drop packets, APIs return 502 Bad Gateway, and rate limits trigger 429 Too Many Requests. A pipeline without retries is a fragile pipeline.

Let's write a wrapper that includes exponential backoff.

// worker.ts
import { ConcurrencyLimiter } from './limiter';

interface FetchResult<T> {
  data?: T;
  error?: string;
  url: string;
}

export async function fetchWithRetry<T>(
  url: string,
  limiter: ConcurrencyLimiter,
  retries: number = 3
): Promise<FetchResult<T>> {
  
  // 1. Wait for our turn in the queue
  await limiter.acquire();

  try {
    for (let attempt = 1; attempt <= retries; attempt++) {
      try {
        const response = await fetch(url);
        
        if (!response.ok) {
          throw new Error(`HTTP ${response.status}`);
        }

        const data = await response.json();
        return { data, url };

      } catch (err) {
        if (attempt === retries) {
          return { error: (err as Error).message, url };
        }
        // Exponential backoff: 1s, 2s, 4s...
        const backoff = Math.pow(2, attempt - 1) * 1000;
        await new Promise(res => setTimeout(res, backoff));
      }
    }
  } finally {
    // 2. ALWAYS release the lock, even if the try block fails
    limiter.release();
  }
  
  return { error: 'Unknown error', url };
}

The "Why" Behind the Code:
The finally block is the most critical part of this snippet. If a request throws an error and we forget to call limiter.release(), that "slot" in our concurrency pool is lost forever. If this happens 20 times, our pipeline deadlocks and freezes entirely. The finally block guarantees the lock is released, ensuring a flawless developer experience where the pipeline never inexplicably hangs.

Step 4: Assembling the Pipeline

Now, let's tie it all together into a clean, developer-friendly API.

// pipeline.ts
import { ConcurrencyLimiter } from './limiter';
import { fetchWithRetry } from './worker';

export async function runPipeline(urls: string[], concurrency: number = 20) {
  console.log(`🚀 Starting pipeline for ${urls.length} URLs with concurrency ${concurrency}`);
  
  const limiter = new ConcurrencyLimiter(concurrency);
  
  // Map URLs to our resilient fetch workers
  const tasks = urls.map(url => fetchWithRetry(url, limiter));

  // Await all tasks. Because of our limiter, only 20 will run at a time!
  const results = await Promise.all(tasks);

  const successes = results.filter(r => r.data).length;
  const failures = results.filter(r => r.error).length;

  console.log(`✨ Pipeline complete! Success: ${successes}, Failures: ${failures}`);
  return results;
}

Verification

To confirm your async data pipeline works beautifully, create a test script that generates 100 dummy URLs and runs the pipeline:

// test.ts
import { runPipeline } from './pipeline';

const dummyUrls = Array.from(
  { length: 100 }, 
  (_, i) => `https://jsonplaceholder.typicode.com/todos/${i + 1}`
);

runPipeline(dummyUrls, 10).then(() => process.exit(0));

Run this using npx ts-node test.ts. You should see the requests firing off in neat batches of 10, rather than all 100 dumping into your console simultaneously.

Troubleshooting

Even with elegant code, distributed systems can be tricky. Here are common pitfalls:

  • The pipeline hangs and never finishes: This almost always means limiter.release() wasn't called. Double-check your try/catch/finally blocks to ensure every acquired lock is released.
  • Memory is still creeping up: If you are processing 10 million pages, you shouldn't hold all results in an array in memory. Instead of Promise.all(tasks), modify the pipeline to yield results via an AsyncGenerator or stream them directly to a database or file.
  • Getting 403 Forbidden or 429 Too Many Requests: Your concurrency is too high for the target server. Lower the concurrency limit to 5 or 10, and increase the exponential backoff delay.
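To sketch that streaming approach: an AsyncGenerator version holds at most `concurrency` tasks (and results) in memory and hands each result to the caller as soon as it settles. Here, processOne is a stand-in for your own worker, such as fetchWithRetry:

```typescript
// Sketch: a streaming pipeline that yields each result as soon as it is
// ready, instead of collecting everything in one giant array.
async function* streamPipeline<T, R>(
  items: Iterable<T>,
  processOne: (item: T) => Promise<R>,
  concurrency: number
): AsyncGenerator<R> {
  // Tag each in-flight promise with a key so we know which one settled.
  const inFlight = new Map<symbol, Promise<[symbol, R]>>();

  for (const item of items) {
    const key = Symbol();
    inFlight.set(key, processOne(item).then(value => [key, value]));

    // At capacity: wait for any one task to finish before adding more.
    if (inFlight.size >= concurrency) {
      const [doneKey, value] = await Promise.race(inFlight.values());
      inFlight.delete(doneKey);
      yield value;
    }
  }

  // Drain the remaining in-flight tasks.
  while (inFlight.size > 0) {
    const [doneKey, value] = await Promise.race(inFlight.values());
    inFlight.delete(doneKey);
    yield value;
  }
}
```

The caller consumes it with `for await (const result of streamPipeline(urls, myWorker, 20))` and writes each result straight to a database or file stream, so memory stays flat no matter how many items flow through.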

Performance vs DX: The Perfect Balance

Let's evaluate what we just built.

From a Performance perspective, we've solved the EMFILE OS limitation and flattened our V8 heap memory curve. By controlling the exact number of concurrent TCP sockets, our application runs with a predictable, flat memory footprint regardless of whether we pass it 1,000 or 1,000,000 URLs.

From a Developer Experience (DX) perspective, we've completely decoupled the business logic (fetching data) from the infrastructure logic (concurrency and retries). A junior developer on your team can now safely fetch massive amounts of data just by calling runPipeline(urls, 20). They don't need to understand semaphores, exponential backoff math, or event loop mechanics. They just get to go home earlier. 💡

What You Built

You've successfully engineered a resilient, bounded async data pipeline from scratch! You moved away from the dangerous Promise.all memory spikes and implemented a professional-grade Semaphore pattern with exponential backoff retries.

Your pipeline is leaner, safer, and far more predictable now. Happy Coding! ✨


Frequently Asked Questions

Why not just use a library like p-limit or async.queue? Using libraries like p-limit is absolutely a great choice for production! However, building the concurrency limiter yourself (as we did here) is a crucial exercise for understanding how Node.js handles the event loop and microtask queues. Once you understand the "why" behind the code, you can debug third-party libraries much more effectively.
How do I handle massive datasets that don't fit in memory? If your urls array is millions of items long, you shouldn't load the array into memory either. You should read the URLs from a database cursor or a file stream, process them through the concurrency limiter, and write the results out to a file stream immediately. This keeps your memory footprint completely flat.
Is this approach better than Node.js Worker Threads? It depends on the workload. For Network I/O (like web scraping or calling APIs), the async event loop with a concurrency limiter is vastly more efficient than spawning Worker Threads. Worker Threads are best reserved for CPU-heavy tasks, like image processing, encryption, or heavy mathematical computations.
What is exponential backoff and why is it necessary? Exponential backoff is an error-handling strategy where you increase the waiting time between retries exponentially (e.g., 1s, 2s, 4s, 8s). It is essential because if an API server is struggling under load, hitting it with immediate, aggressive retries will only crash it faster. Backoff gives the server time to recover.
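As a concrete sketch, that schedule can be computed by a tiny helper. The jitter option is a common refinement not used in the pipeline above: it randomizes each client's delay so thousands of clients don't retry in lockstep.

```typescript
// Delay (ms) before retry `attempt` (1-based): base, 2*base, 4*base, ...
// capped at maxMs. "Full jitter" picks a random delay in [0, cap).
function backoffDelay(
  attempt: number,
  baseMs = 1000,
  maxMs = 30_000,
  jitter = true
): number {
  const cap = Math.min(baseMs * 2 ** (attempt - 1), maxMs);
  return jitter ? Math.random() * cap : cap;
}

// Without jitter the schedule is deterministic:
// backoffDelay(1, 1000, 30_000, false) → 1000
// backoffDelay(4, 1000, 30_000, false) → 8000
```

The cap matters as much as the curve: without maxMs, attempt 10 would wait over eight minutes, which usually means the caller times out first.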
