
DISTRIBUTED SYSTEMS · RESILIENCE · CASCADING FAILURES

One Service Was Slow.
Five Were Down.

Nobody changed anything. No deployment, no config update, no infra event. A Tuesday afternoon, normal traffic — and then checkout stopped working. The root cause wasn't in checkout at all. It was three hops away. This is the story of how one slow service collapsed an entire platform, and the patterns that would have stopped it.

anuragdevon · 15th May 2026 · 12 min read · Part 2: Production Debugging Playbook

The Tuesday Call

It was 3:40 PM on a Tuesday. Normal traffic. No deployments in the last 6 hours. Then the first message landed in the incident channel.

"Checkout is broken. Users can't place orders."

I opened the checkout service dashboard. Everything was green. CPU at 22%. Memory fine. No error spikes. No 5xx responses. All pods running.

But the latency graph told a different story. P99 was climbing — 200ms, 800ms, 2 seconds, 4 seconds — in a near-perfect straight line. Requests weren't failing. They were just... waiting.

Checkout CPU: 22% (normal)

Checkout errors: 0.00%

Checkout pods: 8/8 running

Checkout p99 latency: 4,200ms ↑ and climbing

User-facing errors: Timeouts everywhere

The system wasn't crashing. It was drowning. And the reason wasn't anywhere near checkout.

⚠ THE TRAP

When a cascade hits, you will look at the service that's visibly broken first. That service is almost never the root cause. The root cause is the slow dependency it's waiting on — and that dependency looks completely healthy.

Why Cascades Are Different

A regular outage has a clear shape: something crashes, alerts fire, you fix the thing. Cascading failures are structurally different and significantly harder.

Regular failure: Service A crashes → alert fires → fix A

Cascade: Service D slows → A, B, C, E all exhaust → everything broken, D looks fine

Three things make cascades uniquely brutal:

The broken service isn't the cause. You'll spend the first 20 minutes debugging the wrong thing.
Each service's metrics look healthy in isolation. The failure lives in the relationship between services, not within any single one.
Retry logic turns a degradation into a full collapse. Well-intentioned retries amplify the load on the already-struggling service. Every retry makes the problem worse.

// THE CORE INSIGHT

In a distributed system, your service's health is bounded by the health of every service you call. You can write perfect code and still be completely broken because of a service three hops away you don't own.

System Architecture

Simplified from production. Five services participate in the checkout flow. Each synchronous call is a potential cascade vector.

API Gateway (routes & auth) → Checkout Svc (orchestrates), which calls:
  → Pricing Svc (compute totals) 🐛 slow here
  → Inventory Svc (check stock)
  → User Svc (fetch profile)

Checkout makes three synchronous HTTP calls — pricing, inventory, and user profile — before it can respond. All three must succeed. If any one hangs, checkout hangs. And every goroutine waiting on a hanging checkout call is a goroutine that can't serve anything else.

⚠ SYNCHRONOUS CALLS ARE LOAD TRANSFER

When you make a synchronous HTTP call with no timeout, you're not just calling a service — you're transferring your fate to it. If it degrades, you degrade. If it hangs, you hang. Your goroutine pool becomes a direct mirror of the upstream's problems.

What Started It

Pricing service had a DB query that worked perfectly under normal load. On this Tuesday, a merchant updated their entire catalog in a single batch — roughly 40,000 items — while Tuesday lunch-hour traffic was peaking.

The pricing DB was fine. No OOM. No CPU spike. One long-running read query on an unindexed column started taking 2.8 seconds per call instead of the usual 40ms. That's it. 70x slower, but still completing. No errors, no alerts.

Pricing p50 (baseline): 38ms

Pricing p50 (incident): 2,840ms

Pricing error rate: 0.00%

Pricing pod health: All healthy

Pricing alert fired: None (latency alert threshold was 5s)

The latency alert was set to 5 seconds. 2.8 seconds was painful enough to collapse the whole stack, but not high enough to trigger a single alert in pricing itself. We were flying blind.

⚠ ALERT THRESHOLDS ARE NOT BLAST RADIUSES

An alert threshold of 5s means you get paged when pricing is very broken. But 2.8s is already enough to exhaust your caller's goroutine pool. You need latency alerts at P95 > 500ms for any synchronous dependency — not just when things are catastrophically bad.

The Mechanism — Goroutine Exhaustion

This is the technical core of every cascade. Understanding it will make you a better distributed systems engineer.

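In Go, net/http serves every incoming request on its own goroutine. When that goroutine calls an upstream service with no timeout, it blocks, waiting, for however long the upstream takes. Go will keep spawning goroutines for new requests, but each blocked one pins its request, its connection, and its slot in any bounded pool on its path until the call returns. In-flight work piles up faster than it drains.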

go — the broken caller (no timeout)

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    // This goroutine will block here for however long pricing takes.
    // No timeout. No escape. Just waiting.
    price, err := pricingClient.GetTotal(r.Context(), cart)
    if err != nil {
        http.Error(w, "pricing failed", 500)
        return
    }
    // ... rest of checkout ...
}

When pricing takes 2.8 seconds per call, here's the math at 200 req/s:

Requests per second: 200 req/s

Pricing call duration: 2,800ms (was 38ms)

Goroutines blocked per second: 200 new goroutines stuck waiting

Steady state after ~3 seconds: roughly 560 goroutines blocked on pricing (200 req/s × 2.8s), up from about 8 at baseline

New requests after pool exhaustion: Queued, can't be served

User experience: Timeout. Checkout broken.

The goroutine pool doesn't fail loudly. It fills up silently. CPU stays low because the goroutines aren't doing work — they're waiting. Memory stays low. No errors. Just an ever-growing queue of stuck requests, until the whole service effectively stops.

⚠ WHY CPU LOOKS FINE

A waiting goroutine consumes almost no CPU. It's parked by the scheduler. So you get a service that looks completely idle on CPU while being completely overwhelmed — all its goroutines blocked on a slow upstream. The CPU metric, the one everyone looks at first, tells you nothing.

How Retries Made It 10x Worse

Here's where it got catastrophically bad. Checkout had retry logic. Reasonable, right? If a pricing call fails, retry it.

Except the calls weren't failing. They were just slow. And our retry logic triggered on timeout — with a timeout of 3 seconds.

go — the retry that amplified everything

func getPriceWithRetry(ctx context.Context, cart Cart) (Price, error) {
    for i := 0; i < 3; i++ {
        // 3 second timeout — pricing is taking 2.8s.
        // First call times out at 3s. Retry immediately.
        // Second call times out at 3s. Retry immediately.
        // Third call times out at 3s. Return error.
        // Total time held: 9 seconds per checkout request.
        tCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
        price, err := pricingClient.GetTotal(tCtx, cart)
        cancel()
        if err == nil { return price, nil }
    }
    return Price{}, errors.New("pricing unavailable")
}

Pricing was taking 2.8 seconds. Our timeout was 3 seconds. So most calls would just succeed — until they didn't. When they timed out, we'd immediately retry. Three times. With no delay.

Each checkout request was now making up to 3 calls to an already-overloaded pricing service instead of 1. In the worst case, our 200 req/s became 600 req/s hitting pricing, and each checkout request held its goroutine for up to 9 seconds. Which made pricing slower. Which caused more timeouts. Which caused more retries. Which made pricing slower still.

⚠ RETRY STORM

Retries without backoff and jitter are a positive feedback loop. When an upstream is degraded, retrying immediately hits it again before it can recover — generating more load, slowing it further, causing more timeouts, generating more retries. You turn a degraded service into a completely failed one.

Timeline of Destruction

Here is exactly how eight minutes turned into a P0.

T+0:00

Pricing DB query slows — 38ms → 2,800ms

Merchant batch update triggers an unindexed read. Pricing still responding. No alerts.

DEGRADED

T+0:45

Checkout goroutines start accumulating

Each pricing call holds a goroutine for 2.8s instead of 38ms. Pool filling at 200/s.

FILLING

T+2:00

Retry storm begins

First timeouts fire (3s threshold). Checkout retries immediately — 3x load on pricing. Pricing slows further.

STORM

T+3:30

Checkout goroutine pool exhausted

New requests queue. P99 latency: 8 seconds. User-facing timeouts begin.

DOWN

T+4:15

Order service starts timing out

Order service calls checkout synchronously. Checkout is hanging. Order goroutines start piling up too.

SPREADING

T+5:30

Order history & notification services affected

Two more services calling order service start degrading. The blast radius is now 5 services.

CASCADING

T+7:00

API Gateway starts returning 504s at scale

End users see checkout, order history, and notifications all broken. Incident declared P0.

T+8:00

Root cause finally found — pricing DB

Only after manual tracing through each service. Pricing metrics still look "healthy" to automated tooling.

FOUND

The Debugging Journey

Under incident pressure, the temptation is to fix the most visibly broken service first. Resist that instinct. Cascades require you to trace backward, not fix forward.

Start at the user-facing service, look at its dependencies

Checkout is broken. Not the cause. What does checkout call? Pricing, inventory, user profile. Check each one's latency, not just its error rate.

Check upstream latency, not just errors

Pricing error rate: 0%. But p50 latency: 2,840ms. A slow upstream with no errors is harder to spot than a broken one. Always check latency histograms on your dependencies, not just their health endpoints.

Look at active goroutines / thread count

Go's /debug/pprof/goroutine endpoint shows exactly how many goroutines are blocked and on what. This is the fastest way to confirm goroutine exhaustion — you'll see hundreds waiting on the same HTTP call.

Check outbound call duration, not inbound

Most latency dashboards show inbound request latency. You need outbound call duration — how long your service waits on each dependency. These are different metrics and most teams only instrument one of them.

We found the root cause 8 minutes in by pulling pprof data from the checkout pod. 547 goroutines, all blocking on the same call: pricingClient.GetTotal. That pointed us straight to pricing, where a manual DB query showed a long-running read on an unindexed column.

// DEBUG TOOL

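Add _ "net/http/pprof" to every service and serve it on an internal-only port. The import registers its handlers on http.DefaultServeMux, so they are reachable only if something serves that mux. Under incident, curl 'localhost:6060/debug/pprof/goroutine?debug=1' groups goroutines by identical stack with a count per group. It's the fastest way to confirm a cascade: hundreds of goroutines share one stack ending in the same outbound call. Use debug=2 when you want the full dump of every goroutine, including what each one is waiting on.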

Fix 1 — Timeouts Everywhere

The foundational rule: every outbound call must have a timeout. Not "most calls." Not "the important ones." Every single one. Without a timeout, you have handed your goroutine to the upstream service indefinitely.

go — before: no timeout

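// This HTTP client has no timeout. One slow upstream = goroutine held forever.
// pricingClient, the typed wrapper that exposes GetTotal, is assumed to
// send its requests through this client.
var pricingHTTP = &http.Client{}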

go — after: explicit timeouts at every layer

// Timeout at the HTTP client level (hard ceiling).
// pricingClient, the typed wrapper that exposes GetTotal, is assumed to
// send its requests through this client.
var pricingHTTP = &http.Client{
    Timeout: 800 * time.Millisecond,
}

// AND propagate the request context with a tighter per-call deadline
func getPrice(ctx context.Context, cart Cart) (Price, error) {
    callCtx, cancel := context.WithTimeout(ctx, 600*time.Millisecond)
    defer cancel()

    return pricingClient.GetTotal(callCtx, cart)
}

Two timeouts, not one — the client-level timeout is the hard ceiling that can't be bypassed even if context propagation is wrong. The per-call context timeout is tighter and carries cancellation semantics upstream.

// PICKING THE TIMEOUT VALUE

Set your timeout at 2× the P99 latency of the dependency at normal load. If pricing normally responds in 40ms at p99, your timeout should be 80–100ms. Not 3 seconds. A 3-second timeout lets every stuck call hold a goroutine for 3 seconds, which at 200 req/s means up to 600 blocked goroutines at once. Treat the 600ms and 800ms in the example above as placeholders and derive your own values from measured p99.

Fix 2 — Circuit Breakers

Timeouts protect individual calls. Circuit breakers protect the system from making calls at all when an upstream is degraded. It's the difference between a fuse and a fire suppression system.

A circuit breaker has three states:

CLOSED: Normal. All calls go through.

OPEN: Too many failures/timeouts. Calls short-circuit immediately, with no waiting.

HALF-OPEN: Probing. Let a limited number of calls through to test if upstream recovered.

go — circuit breaker with gobreaker

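import (
    "time"

    "github.com/sony/gobreaker"
)

var pricingBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "pricing",
    MaxRequests: 5,                // max calls allowed through in HALF-OPEN state
    Interval:    30 * time.Second, // in CLOSED state, counts are cleared every interval
    Timeout:     10 * time.Second, // how long to stay OPEN before probing
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        // Open the breaker after 5 consecutive failures.
        // A slow call only counts as a failure if it returns an error,
        // so the wrapped call needs its own timeout.
        return counts.ConsecutiveFailures >= 5
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        // log and metrics are assumed to be the service's own
        // structured logger and metrics client.
        log.Warn("circuit_breaker_state_change",
            "service", name, "from", from.String(), "to", to.String())
        metrics.Increment("circuit_breaker.state_change", "service", name)
    },
})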

func getPrice(ctx context.Context, cart Cart) (Price, error) {
    result, err := pricingBreaker.Execute(func() (interface{}, error) {
        return pricingClient.GetTotal(ctx, cart)
    })
    if err != nil {
        return Price{}, err // includes gobreaker.ErrOpenState
    }
    return result.(Price), nil
}

When the breaker is OPEN, calls fail immediately with ErrOpenState — no goroutine held, no timeout waited, no retry needed. Your service can degrade gracefully (return a cached price, show an error, skip the call) instead of hanging.

// ALWAYS INSTRUMENT YOUR BREAKER

Log every state transition. CLOSED → OPEN means your dependency is degraded right now — that's an immediate page-worthy event. OPEN → HALF-OPEN → CLOSED means recovery — you want to track that too. A circuit breaker that fires silently is almost as bad as not having one.

Fix 3 — Bulkheads & Backoff

Bulkhead — Isolate your dependencies

Named after ship compartments — if one floods, it doesn't sink the whole vessel. In code: each external dependency gets its own goroutine pool (semaphore). When pricing is degraded and its pool fills up, it can't consume goroutines meant for inventory or user-profile calls.

go — semaphore-based bulkhead

// Separate semaphore per dependency — pricing can't steal slots from inventory
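var (
    pricingSem   = make(chan struct{}, 20) // max 20 concurrent pricing calls
    inventorySem = make(chan struct{}, 30) // max 30 concurrent inventory calls

    // Returned when the pricing bulkhead has no free slot.
    ErrPricingUnavailable = errors.New("pricing unavailable: bulkhead full")
)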

func getPriceIsolated(ctx context.Context, cart Cart) (Price, error) {
    select {
    case pricingSem <- struct{}{}:
        defer func() { <-pricingSem }()
    case <-ctx.Done():
        return Price{}, ctx.Err()
    default:
        // Pool full — fast-fail instead of queuing
        metrics.Increment("pricing.bulkhead.rejected")
        return Price{}, ErrPricingUnavailable
    }
    return pricingClient.GetTotal(ctx, cart)
}

Exponential backoff with jitter — don't synchronise your retries

Retries without jitter cause thundering herd: every caller that timed out at the same time retries at the same time, creating a synchronized spike. Adding random jitter spreads the retries out, giving the upstream breathing room to recover.

go — retry with exponential backoff + jitter

func retryWithBackoff(ctx context.Context, fn func() error) error {
    const maxAttempts = 3
    base := 100 * time.Millisecond

    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        if attempt == maxAttempts-1 {
            break // last attempt: no point sleeping
        }

        // Exponential backoff: 100ms, then 200ms (doubling per attempt).
        // Jitter: ±50% randomness to desynchronise retries, so each wait
        // lands in [0.5×backoff, 1.5×backoff). rand is math/rand.
        backoff := base * time.Duration(1<<attempt)
        wait := backoff/2 + time.Duration(rand.Int63n(int64(backoff)))

        select {
        case <-time.After(wait):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return err
}

// WHY JITTER MATTERS

Without jitter, 500 clients that all timed out at T=3s will all retry at T=3.1s, T=3.3s, T=3.7s — synchronized spikes that hammer an already-struggling service. With jitter, those retries spread across a window, giving the upstream a gentler recovery curve.

Production Checklist

Before any service ships a new outbound dependency, every item on this list must be answered:

✅

Every HTTP client has a timeout

Set at the client level and propagated via context. Timeout = 2× dependency's p99 at normal load. Not a round number you guessed.

✅

Circuit breaker on every synchronous upstream

State transitions must be logged and alerted. A circuit breaker opening is an incident signal — treat it as one.

✅

Bulkhead per dependency

Pricing's degradation should not consume goroutines that inventory and user service need. Separate pools, separate failure domains.

✅

Retries use exponential backoff with jitter

No immediate retries. No fixed-interval retries. Exponential + jitter, bounded by the request context deadline.

✅

Latency alerts on dependencies, not just self

Alert when any upstream's p95 latency exceeds 2× normal. Don't wait for your own error rate to spike — that's 3 minutes too late.

✅

Graceful degradation path for every dependency

What does your service do if pricing returns an error? Cache, default, partial response? Define it. Don't let an upstream take you fully down.

✅

pprof / goroutine dump available in prod

Add the pprof endpoint. When a cascade hits and goroutines are piling up, this cuts your debug time from 20 minutes to 30 seconds.

Key Takeaways

The broken service is rarely the root cause

In a cascade, the noisiest failure is downstream of the actual problem. Train yourself to trace backward through dependencies before touching the service that's visibly broken.

A slow upstream is more dangerous than a crashed one

A crashed upstream gives you a fast error. A slow upstream holds your goroutine for seconds, filling your pool silently. Slowness without timeouts is worse than failure.

Retries without backoff turn degradation into collapse

Every immediate retry is another call to an already-struggling service. Backoff + jitter is not optional — it's the difference between a recoverable dip and an unrecoverable storm.

Resilience is designed, not discovered

You don't find out your service has no circuit breaker when you're reviewing code. You find out at 3 PM on a Tuesday when five services are down. Design the failure mode before you ship the feature.

What's Next

Part 1

When Logs Lie — Silent failures & observability gaps

Read it →

Part 2

The Cascade — When one slow service takes down five others

This post.

Part 3

The Phantom Bug — Reproducing production-only failures locally

Coming soon.

END OF PART 2

Have you been hit by a cascade?

What did your blast radius look like? Which pattern finally stopped it? Drop your story — I read every reply.


Part 3 dropping soon — follow on Dev.to to get notified.
