Introduction
Failure is not an edge case in real systems. It is a normal condition. Designing software that works only when everything succeeds is not enough for production environments.
Failure handling should be treated as a core feature of the system. It defines how the system behaves when things go wrong, which is often when users notice it the most.
The Problem
Most applications are built around the success path. Error handling is often minimal or added late in development, which typically shows up as:
- Unhandled exceptions crash parts of the system
- APIs fail without meaningful responses
- No fallback mechanisms for dependent services
- Users experience broken flows instead of degraded ones
This leads to systems that are fragile and unpredictable under real conditions.
System Design / Approach
Failure handling starts with the assumption that every external dependency can fail. The system should be designed to absorb these failures without collapsing.
- Use retries for transient failures
- Apply timeouts to avoid blocking operations
- Provide fallback responses when services are unavailable
- Log and monitor failures for visibility
The goal is not to eliminate failures, but to control their impact.
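As a rough illustration, the four practices above can be combined in one small wrapper. This is a minimal sketch under stated assumptions, not a definitive implementation: `callDependency`, `Policy`, and the field names are hypothetical, and the timeout here only stops waiting for the operation rather than cancelling it.

```typescript
// Hypothetical policy-driven wrapper; all names are illustrative assumptions.
interface Policy<T> {
  attempts: number;  // total attempts, including the first
  timeoutMs: number; // how long to wait for each attempt
  fallback: T;       // value returned when every attempt fails
}

async function callDependency<T>(
  name: string,
  operation: () => Promise<T>,
  policy: Policy<T>,
): Promise<T> {
  for (let attempt = 1; attempt <= policy.attempts; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      // Timeout: reject if the attempt takes longer than timeoutMs
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error("timeout")), policy.timeoutMs);
      });
      return await Promise.race([operation(), timeout]);
    } catch (error) {
      // Logging: record every failed attempt, then loop to retry
      console.error(`${name} failed`, { attempt, error });
    } finally {
      clearTimeout(timer);
    }
  }
  // Fallback: controlled degradation instead of an unhandled error
  return policy.fallback;
}
```

A caller would pass the dependency call as `operation` and choose the policy per dependency, since a sensible fallback for one service may be meaningless for another.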
Implementation
Step 1: Add Retry Logic
Retry requests that fail due to temporary issues.
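async function fetchWithRetry(url: string, attempts = 3): Promise<Response> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      // fetch only rejects on network errors, so treat 5xx responses as retryable too
      if (res.status < 500) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // network error: try again
    }
    // Exponential backoff between attempts: 200 ms, then 400 ms
    if (i < attempts - 1) await new Promise((r) => setTimeout(r, 200 * 2 ** i));
  }
  // Keep the underlying failure attached for debugging (Error `cause`, ES2022)
  throw new Error("Retry failed", { cause: lastError });
}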
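Retries raise the success rate for transient failures. They only suit operations that are safe to repeat, and a short delay between attempts avoids adding load to a dependency that is already struggling.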
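The retry step can also be generalized to any async operation, not only `fetch`. The sketch below is one possible shape, assuming exponential backoff with full jitter; `retry`, `RetryOptions`, `isRetryable`, and `sleep` are hypothetical names, and `attempts` is assumed to be at least 1.

```typescript
// Hypothetical helper: names and defaults are illustrative assumptions.
interface RetryOptions {
  attempts: number;    // total attempts, including the first (>= 1)
  baseDelayMs: number; // upper bound of the delay before the second attempt
  isRetryable?: (error: unknown) => boolean; // defaults to retrying every error
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retry<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { attempts, baseDelayMs, isRetryable = () => true } = options;
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Give up immediately on errors that a retry cannot fix
      if (!isRetryable(error)) break;
      if (attempt < attempts - 1) {
        // Full jitter: random delay in [0, baseDelayMs * 2^attempt)
        await sleep(Math.random() * baseDelayMs * 2 ** attempt);
      }
    }
  }
  // Surface the last underlying error to the caller
  throw lastError;
}
```

The jitter spreads out retries from many clients so they do not all hit a recovering service at the same moment. The `isRetryable` hook lets a caller skip retries for errors such as validation failures, where repeating the call cannot help.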
Step 2: Use Timeouts
Prevent long-running operations from blocking the system.
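const controller = new AbortController();
// Abort the request if it has not settled within 3 seconds
const timer = setTimeout(() => controller.abort(), 3000);
const response = await fetch(url, { signal: controller.signal })
  .finally(() => clearTimeout(timer)); // clear the timer once the request settles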
Timeouts bound how long a caller waits, so one slow dependency cannot stall the rest of the system.
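The same idea can be wrapped in a reusable helper so that every outbound call gets a deadline. The following is a minimal sketch, assuming the wrapped operation accepts an `AbortSignal`; `withTimeout` and `TimeoutError` are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: wraps any abortable operation with a deadline.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

async function withTimeout<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();           // ask the operation to stop
      reject(new TimeoutError(ms)); // and stop waiting for it either way
    }, ms);
  });
  try {
    // Whichever settles first wins: the operation or the deadline
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // no stray timer after the operation settles
  }
}
```

A call such as `withTimeout((signal) => fetch(url, { signal }), 3000)` would reproduce the 3-second limit from the snippet above. Rejecting with a distinct error name lets callers tell a timeout apart from other failures.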
Step 3: Provide Fallbacks
Return alternative responses when dependencies fail.
try {
  return await fetchData();
} catch {
  // Degrade gracefully: return an empty result, flagged so callers can tell
  return { data: [], fallback: true };
}
Fallbacks keep the flow working in a degraded form. The `fallback` flag lets callers tell a substituted result from a genuinely empty one.
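An empty result is not the only possible fallback. One common variation, sketched below as an assumption rather than something the steps above prescribe, is to serve the last successful value when the dependency fails. `withFallback` and `Result` are hypothetical names.

```typescript
// Hypothetical helper: serves the last successful value when the source fails.
interface Result<T> {
  data: T;
  fallback: boolean; // true when `data` did not come from a fresh call
}

function withFallback<T>(
  source: () => Promise<T>,
  defaultValue: T,
): () => Promise<Result<T>> {
  let lastGood: T | undefined;
  let hasLastGood = false;
  return async () => {
    try {
      lastGood = await source();
      hasLastGood = true;
      return { data: lastGood, fallback: false };
    } catch {
      // Prefer stale-but-real data over the static default
      return {
        data: hasLastGood ? (lastGood as T) : defaultValue,
        fallback: true,
      };
    }
  };
}
```

Wrapping the loader once, for example `const load = withFallback(fetchData, [])`, gives every caller the same degraded behavior. Whether stale data is acceptable depends on the feature, so this choice belongs with the owner of each flow.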
Step 4: Add Logging
Capture failures for debugging and monitoring.
// Inside a catch (error) block: log the failure together with its context
console.error("Request failed", { error });
Visibility is essential for improving system reliability.
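Logs are easier to search and alert on when each failure is recorded as a structured entry rather than free text. The sketch below shows one possible shape; `describeFailure`, `logFailure`, and the `FailureLog` fields are hypothetical and not tied to any particular logging library.

```typescript
// Hypothetical helper: turns an unknown thrown value into a structured entry.
interface FailureLog {
  message: string;
  operation: string;    // which call failed, e.g. "loadProfile"
  errorName: string;
  errorMessage: string;
  timestamp: string;    // ISO 8601
}

function describeFailure(operation: string, error: unknown): FailureLog {
  return {
    message: "Request failed",
    operation,
    // Thrown values are not guaranteed to be Error instances
    errorName: error instanceof Error ? error.name : typeof error,
    errorMessage: error instanceof Error ? error.message : String(error),
    timestamp: new Date().toISOString(),
  };
}

function logFailure(operation: string, error: unknown): void {
  // One JSON object per line is straightforward for log tooling to parse
  console.error(JSON.stringify(describeFailure(operation, error)));
}
```

Calling `logFailure("loadProfile", error)` from a `catch` block keeps the operation name next to the error, which makes it possible to count failures per dependency.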
Trade-offs
| Approach | Benefit | Cost |
|---|---|---|
| Retries | Higher success rate | Increased latency, extra load on the failing service |
| Timeouts | Avoid blocking | Premature termination |
| Fallbacks | Better user experience | Incomplete or stale data |
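The latency cost of combining retries with timeouts can be estimated with simple arithmetic. The sketch below assumes every attempt runs to its full timeout and that the delay between attempts doubles each time without jitter; `worstCaseLatencyMs` and the 200 ms base delay are illustrative assumptions, while the 3 attempts and 3000 ms timeout match the snippets in the Implementation section.

```typescript
// Worst-case time before a caller gets a final error or a fallback.
// Assumes non-jittered exponential backoff: base, 2*base, 4*base, ...
function worstCaseLatencyMs(
  attempts: number,
  timeoutMs: number,
  baseDelayMs: number,
): number {
  let total = 0;
  for (let attempt = 0; attempt < attempts; attempt++) {
    total += timeoutMs; // each attempt may run until its timeout
    if (attempt < attempts - 1) {
      total += baseDelayMs * 2 ** attempt; // wait before the next attempt
    }
  }
  return total;
}

// 3 attempts x 3000 ms, plus 200 ms and 400 ms of backoff
console.log(worstCaseLatencyMs(3, 3000, 200)); // 9600
```

Under these assumptions a user could wait almost 10 seconds before seeing the fallback, which is why retry counts and timeouts are best tuned together rather than separately.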
Real-World Impact
- Reduced system downtime
- Improved user experience during failures
- More predictable system behavior
- Faster debugging and recovery