Designing Systems That Hold Under Load

Resilience Engineering for Real Traffic

8 min read · Architecture

Why It Matters

Systems rarely fail at average load. They fail at boundaries: sudden queue growth, dependency saturation, noisy tenants, and retry storms. A system that looks healthy at p50 can collapse at p99 when one shared bottleneck gets amplified across services.

Designing for load means choosing explicit limits, predictable degradation, and measurable recovery behavior. Reliability under pressure is not about heroic optimization after incidents. It is the outcome of architecture decisions made before traffic arrives.

Key Principles

  • Define end-to-end execution budgets and split them by stage so one slow dependency cannot consume the full request lifetime.
  • Cap concurrency at every boundary: worker pools, outbound calls, queue consumers, and background processors. Unbounded parallelism is usually a hidden failure multiplier.
  • Design idempotent retry paths with jittered backoff and explicit stop conditions. Retries should improve success rates, not create synchronized traffic spikes.
  • Favor graceful degradation over binary failure. Return partial results, stale-but-safe data, or deferred processing states with clear quality signals.
  • Track load-specific telemetry: queue age, saturation, timeout distribution, retry outcomes, and shed-rate. Without these signals, you are debugging intuition, not system behavior.

Common Failures

  • Global timeout without stage budgets, causing unpredictable tail latency and hard-to-diagnose cancellations.
  • Aggressive retries without idempotency keys, producing duplicate writes and cascading dependency pressure.
  • Unlimited queue growth with no admission control, which converts a temporary spike into prolonged instability.
  • Missing saturation alerts, so teams detect capacity collapse only after user-facing error rates are already high.
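The "unlimited queue growth" failure can be countered with admission control at enqueue time. A minimal sketch, with illustrative names (`BoundedQueue`, `shedCount`):

```typescript
// Sketch: a fixed-capacity queue that sheds work instead of growing without
// bound, converting overload into fast, visible rejection.
export class BoundedQueue<T> {
  private items: T[] = [];
  public shedCount = 0; // load-specific telemetry: how much work was rejected

  constructor(private readonly capacity: number) {}

  // Returns false (and records a shed) instead of queuing past capacity.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) {
      this.shedCount++;
      return false;
    }
    this.items.push(item);
    return true;
  }

  // FIFO dequeue; undefined when empty.
  poll(): T | undefined {
    return this.items.shift();
  }

  get depth(): number {
    return this.items.length;
  }
}
```

Callers that receive `false` can fail fast or defer, and the shed rate becomes an alertable signal instead of silent queue growth.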

Final Takeaway

Load resilience is designed, not discovered. Systems hold when limits, fallbacks, and recovery rules are explicit before peak traffic tests them.

Why This Topic Matters in Production

Architecture decisions become expensive only after the system succeeds. That is why unclear boundaries, implicit contracts, and mixed responsibilities feel acceptable early and painful later.

Most architecture failures are not caused by one bad decision. They are caused by many unowned assumptions that slowly become coupling: implicit contracts, hidden side effects, and unclear module boundaries. Teams feel productive until change frequency increases, then every release carries disproportionate risk.

In production, architecture quality is observed through behavior under stress: whether incidents are diagnosable, whether rollbacks are safe, and whether one subsystem failure is contained or amplified. Good architecture is less about abstract diagrams and more about preserving predictable change as systems and teams grow.

Core Concepts

  • Boundary quality matters more than component count: a small number of explicit boundaries beats many loosely defined layers.
  • Contract-first thinking prevents drift: schema, invariants, and error semantics are first-class artifacts, defined before implementation details.
  • Ownership is an architecture primitive: every boundary needs one clear maintainer, because unowned modules become long-term reliability risks.
  • High-churn logic should be isolated from critical execution paths to limit blast radius.
  • Prefer deterministic behavior over clever abstraction in critical request paths.

Real-World Mistakes

  • Optimizing for local code elegance while ignoring cross-service coupling.
  • Treating architecture docs as static artifacts instead of living decision records.
  • Allowing transport concerns to leak into core domain services.
  • Skipping backward-compatibility planning for internal interfaces.
  • Embedding domain rules in adapters and transport handlers.
  • Using shared utility files as hidden dependency hubs.
  • Relying on convention-only contracts without automated validation.
  • Skipping architecture review for seemingly small service changes.

  • Use architecture decision records with explicit context, alternatives, and rollback conditions for high-impact design trade-offs.
  • Run boundary reviews for high-impact changes before implementation begins.
  • Enforce schema validation at ingress and invariant checks in domain services, so every system edge is guarded.
  • Instrument boundary latency, error classes, and request IDs to make call flow traceable and detect structural degradation early.
  • Use service interfaces for domain operations and keep route handlers thin.

Implementation Checklist

  • Define ownership for every critical module and service boundary.
  • Version and validate contracts at ingress and integration points.
  • Measure p95/p99 latency and error rates by architectural boundary.
  • Document rollback strategies for high-risk structural changes.

Architecture Notes

Boundary-first architecture scales better than framework-first architecture because it keeps design intent stable while implementation details evolve.

Teams should review architecture through incident history: repeated failure patterns usually reveal structural coupling rather than isolated bugs.

A practical litmus test: if rollback decisions require cross-team emergency synchronization, your boundaries are too entangled.

Applied Example

Boundary-Safe Service Contract

type CreateOrderInput = {
  customerId: string;
  items: Array<{ sku: string; quantity: number }>;
};

type CreateOrderResult =
  | { ok: true; orderId: string }
  | { ok: false; code: "VALIDATION" | "FORBIDDEN" | "DEPENDENCY"; message: string };

export async function createOrder(input: CreateOrderInput): Promise<CreateOrderResult> {
  // Transport-level parsing happens before this boundary; the domain
  // still re-checks its own invariants.
  if (!input.customerId || !Array.isArray(input.items) || input.items.length === 0) {
    return { ok: false, code: "VALIDATION", message: "Invalid order payload" };
  }
  if (input.items.some((i) => !i.sku || !Number.isInteger(i.quantity) || i.quantity <= 0)) {
    return { ok: false, code: "VALIDATION", message: "Invalid line item" };
  }

  // domain + dependency orchestration here
  return { ok: true, orderId: crypto.randomUUID() }; // global in Node 19+ and browsers
}

Trade-offs

  • Explicit layering increases initial implementation cost but reduces long-term debugging and regression risk.
  • Strict ownership and boundaries can slow prototyping and ad hoc changes while improving accountability and operational quality.
  • Contract rigor adds ceremony but dramatically lowers integration failure rates between teams.

Production Perspective

  • Reliability improves when dependency failures are classified and routed to explicit recovery paths rather than treated as a generic 500.
  • Security posture improves when auth and policy checks are centralized and separated from business rules.
  • Performance tuning becomes predictable when latency can be attributed to a specific boundary and budgeted per boundary.
  • Maintainability compounds when architecture encodes intent, ownership, and review expectations clearly.
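Per-boundary latency budgets can be derived from a single end-to-end budget, echoing the stage-budget principle from the first section. A minimal sketch; the stage names and weights are illustrative:

```typescript
// Sketch: split an end-to-end budget across stages by weight, so one slow
// dependency cannot consume the whole request lifetime.
export function splitBudget(
  totalMs: number,
  weights: Record<string, number>,
): Record<string, number> {
  const sum = Object.values(weights).reduce((a, b) => a + b, 0);
  const budgets: Record<string, number> = {};
  for (const [stage, w] of Object.entries(weights)) {
    budgets[stage] = Math.floor((totalMs * w) / sum);
  }
  return budgets;
}

// Each stage then gets its own timeout, e.g. via AbortSignal.timeout(budgets.db),
// instead of one global timeout shared by every dependency.
```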

Final Takeaway

Strong architecture is not about complexity. It is about reducing ambiguity under pressure so systems remain understandable, debuggable, and safe to change.

Architecture should optimize for safe change, not only for initial delivery speed.

If your system is easy to reason about during incidents, your architecture is working.

Key Takeaways

  • Saturation metrics are more actionable than average latency
  • Bounded concurrency protects stability better than burst throughput
  • Retries need idempotency and jitter to avoid amplification
  • Graceful degradation preserves trust during pressure events

Future Improvements

  • Add admission-control policies per high-risk endpoint
  • Track queue age SLOs alongside success/error rates
  • Introduce load-shedding drills in staging
  • Expand runbooks for saturation recovery