One Service Was Slow.
Five Were Down.
Nobody changed anything. No deployment, no config update, no infra event. A Tuesday afternoon, normal traffic — and then checkout stopped working. The root cause wasn't in checkout at all. It was three hops away. This is the story of how one slow service collapsed an entire platform, and the patterns that would have stopped it.
The Tuesday Call
It was 3:40 PM on a Tuesday. Normal traffic. No deployments in the last 6 hours. Then the first message landed in the incident channel.
"Checkout is broken. Users can't place orders."
I opened the checkout service dashboard. Everything was green. CPU at 22%. Memory fine. No error spikes. No 5xx responses. All pods running.
But the latency graph told a different story. P99 was climbing — 200ms, 800ms, 2 seconds, 4 seconds — in a near-perfect straight line. Requests weren't failing. They were just... waiting.
The system wasn't crashing. It was drowning. And the reason wasn't anywhere near checkout.
When a cascade hits, you will look at the service that's visibly broken first. That service is almost never the root cause. The root cause is the slow dependency it's waiting on — and that dependency looks completely healthy.
Why Cascades Are Different
A regular outage has a clear shape: something crashes, alerts fire, you fix the thing. Cascading failures are structurally different and significantly harder.
Three things make cascades uniquely brutal:
- The broken service isn't the cause. You'll spend the first 20 minutes debugging the wrong thing.
- Each service's metrics look healthy in isolation. The failure lives in the relationship between services, not within any single one.
- Retry logic turns a degradation into a full collapse. Well-intentioned retries amplify the load on the already-struggling service. Every retry makes the problem worse.
In a distributed system, your service's health is bounded by the health of every service you call. You can write perfect code and still be completely broken because of a service three hops away you don't own.
System Architecture
Simplified from production. Five services participate in the checkout flow. Each synchronous call is a potential cascade vector.
Checkout makes three synchronous HTTP calls — pricing, inventory, and user profile — before it can respond. All three must succeed. If any one hangs, checkout hangs. And every goroutine waiting on a hanging checkout call is a goroutine that can't serve anything else.
When you make a synchronous HTTP call with no timeout, you're not just calling a service — you're transferring your fate to it. If it degrades, you degrade. If it hangs, you hang. Your goroutine pool becomes a direct mirror of the upstream's problems.
What Started It
Pricing service had a DB query that worked perfectly under normal load. On this Tuesday, a merchant updated their entire catalog in a single batch — roughly 40,000 items — while Tuesday lunch-hour traffic was peaking.
The pricing DB was fine. No OOM. No CPU spike. One long-running read query on an unindexed column started taking 2.8 seconds per call instead of the usual 40ms. That's it. 70x slower, but still completing. No errors, no alerts.
The latency alert was set to 5 seconds. 2.8 seconds was painful enough to collapse the whole stack, but not high enough to trigger a single alert in pricing itself. We were flying blind.
An alert threshold of 5s means you get paged when pricing is very broken. But 2.8s is already enough to exhaust your caller's goroutine pool. You need latency alerts at P95 > 500ms for any synchronous dependency — not just when things are catastrophically bad.
The Mechanism — Goroutine Exhaustion
This is the technical core of every cascade. Understanding it will make you a better distributed systems engineer.
In Go, every incoming HTTP request is served by a goroutine. When that goroutine calls an upstream service with no timeout, it blocks — waiting — for however long the upstream takes. The goroutine can't be reused until the call returns.
    func handleCheckout(w http.ResponseWriter, r *http.Request) {
        // This goroutine will block here for however long pricing takes.
        // No timeout. No escape. Just waiting.
        price, err := pricingClient.GetTotal(r.Context(), cart)
        if err != nil {
            http.Error(w, "pricing failed", 500)
            return
        }
        // ... rest of checkout ...
    }
When pricing takes 2.8 seconds per call, here's the math at 200 req/s:
The goroutine pool doesn't fail loudly. It fills up silently. CPU stays low because the goroutines aren't doing work — they're waiting. Memory stays low. No errors. Just an ever-growing queue of stuck requests, until the whole service effectively stops.
A waiting goroutine consumes almost no CPU. It's parked by the scheduler. So you get a service that looks completely idle on CPU while being completely overwhelmed — all its goroutines blocked on a slow upstream. The CPU metric, the one everyone looks at first, tells you nothing.
How Retries Made It 10x Worse
Here's where it got catastrophically bad. Checkout had retry logic. Reasonable, right? If a pricing call fails, retry it.
Except the calls weren't failing. They were just slow. And our retry logic triggered on timeout — with a timeout of 3 seconds.
    func getPriceWithRetry(ctx context.Context, cart Cart) (Price, error) {
        for i := 0; i < 3; i++ {
            // 3 second timeout, and pricing is taking 2.8s.
            // First call times out at 3s. Retry immediately.
            // Second call times out at 3s. Retry immediately.
            // Third call times out at 3s. Return error.
            // Total time held: 9 seconds per checkout request.
            tCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
            price, err := pricingClient.GetTotal(tCtx, cart)
            cancel()
            if err == nil {
                return price, nil
            }
        }
        return Price{}, errors.New("pricing unavailable")
    }
Pricing was taking 2.8 seconds. Our timeout was 3 seconds. So most calls would just succeed — until they didn't. When they timed out, we'd immediately retry. Three times. With no delay.
Each checkout request was now making up to 3 calls to an already-overloaded pricing service instead of 1. Our 200 req/s became 600 req/s hitting pricing. Which made pricing slower. Which caused more timeouts. Which caused more retries. Which made pricing slower still.
Retries without backoff and jitter are a positive feedback loop. When an upstream is degraded, retrying immediately hits it again before it can recover — generating more load, slowing it further, causing more timeouts, generating more retries. You turn a degraded service into a completely failed one.
Timeline of Destruction
Here is exactly how eight minutes turned into a P0.
The Debugging Journey
Under incident pressure, the temptation is to fix the most visibly broken service first. Resist that instinct. Cascades require you to trace backward, not fix forward.
The /debug/pprof/goroutine endpoint shows exactly how many goroutines are blocked and on what. This is the fastest way to confirm goroutine exhaustion: you'll see hundreds waiting on the same HTTP call.

We found the root cause 8 minutes in by pulling pprof data from the checkout pod. 547 goroutines, all blocking on the same call: pricingClient.GetTotal. That pointed us straight to pricing, where a manual DB query showed a long-running read on an unindexed column.
Add _ "net/http/pprof" to every service. Under incident, curl localhost:6060/debug/pprof/goroutine?debug=1 gives you a full goroutine dump in seconds. It's the fastest way to confirm a cascade — you'll see exactly what every goroutine is blocked waiting for.
Fix 1 — Timeouts Everywhere
The foundational rule: every outbound call must have a timeout. Not "most calls." Not "the important ones." Every single one. Without a timeout, you have handed your goroutine to the upstream service indefinitely.
    // This HTTP client has no timeout. One slow upstream = goroutine held forever.
    // (pricingClient is assumed to be a typed wrapper that sends requests through it.)
    var pricingHTTP = &http.Client{}

The fix sets a timeout at both levels:

    // Timeout at the HTTP client level (hard ceiling)
    var pricingHTTP = &http.Client{
        Timeout: 800 * time.Millisecond,
    }

    // AND propagate the request context with a per-call deadline
    func getPrice(ctx context.Context, cart Cart) (Price, error) {
        callCtx, cancel := context.WithTimeout(ctx, 600*time.Millisecond)
        defer cancel()
        return pricingClient.GetTotal(callCtx, cart)
    }
Two timeouts, not one — the client-level timeout is the hard ceiling that can't be bypassed even if context propagation is wrong. The per-call context timeout is tighter and carries cancellation semantics upstream.
Set your timeout at 2× the P99 latency of the dependency at normal load. If pricing normally responds in 40ms at p99, your timeout should be 80–100ms. Not 3 seconds. A 3-second timeout means you accumulate 3 seconds of blocked goroutines per failure, which at 200 req/s fills your pool in under 2 seconds.
Fix 2 — Circuit Breakers
Timeouts protect individual calls. Circuit breakers protect the system from making calls at all when an upstream is degraded. It's the difference between a fuse and a fire suppression system.
A circuit breaker has three states:
    import "github.com/sony/gobreaker"

    var pricingBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
        Name:        "pricing",
        MaxRequests: 5,                // max requests allowed through in HALF-OPEN state
        Interval:    30 * time.Second, // while CLOSED, counts are cleared every interval
        Timeout:     10 * time.Second, // how long to stay OPEN before trying HALF-OPEN
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            // Open the breaker after 5 consecutive failures
            return counts.ConsecutiveFailures >= 5
        },
        OnStateChange: func(name string, from, to gobreaker.State) {
            log.Warn("circuit_breaker_state_change", "service", name, "from", from, "to", to)
            metrics.Increment("circuit_breaker.state_change", "service", name)
        },
    })

    func getPrice(ctx context.Context, cart Cart) (Price, error) {
        result, err := pricingBreaker.Execute(func() (interface{}, error) {
            return pricingClient.GetTotal(ctx, cart)
        })
        if err != nil {
            return Price{}, err // includes gobreaker.ErrOpenState
        }
        return result.(Price), nil
    }
When the breaker is OPEN, calls fail immediately with ErrOpenState — no goroutine held, no timeout waited, no retry needed. Your service can degrade gracefully (return a cached price, show an error, skip the call) instead of hanging.
Log every state transition. CLOSED → OPEN means your dependency is degraded right now — that's an immediate page-worthy event. OPEN → HALF-OPEN → CLOSED means recovery — you want to track that too. A circuit breaker that fires silently is almost as bad as not having one.
Fix 3 — Bulkheads & Backoff
Bulkhead — Isolate your dependencies
Named after ship compartments — if one floods, it doesn't sink the whole vessel. In code: each external dependency gets its own goroutine pool (semaphore). When pricing is degraded and its pool fills up, it can't consume goroutines meant for inventory or user-profile calls.
    // Separate semaphore per dependency: pricing can't steal slots from inventory.
    var (
        pricingSem   = make(chan struct{}, 20) // max 20 concurrent pricing calls
        inventorySem = make(chan struct{}, 30) // max 30 concurrent inventory calls
    )

    func getPriceIsolated(ctx context.Context, cart Cart) (Price, error) {
        select {
        case pricingSem <- struct{}{}:
            defer func() { <-pricingSem }()
        case <-ctx.Done():
            return Price{}, ctx.Err()
        default:
            // Pool full: fast-fail instead of queuing
            metrics.Increment("pricing.bulkhead.rejected")
            return Price{}, ErrPricingUnavailable
        }
        return pricingClient.GetTotal(ctx, cart)
    }
Exponential backoff with jitter — don't synchronise your retries
Retries without jitter cause a thundering herd: every caller that timed out at the same time retries at the same time, creating a synchronized spike. Adding random jitter spreads the retries out, giving the upstream breathing room to recover.
    func retryWithBackoff(ctx context.Context, fn func() error) error {
        const maxAttempts = 3
        base := 100 * time.Millisecond
        var err error
        for attempt := 0; attempt < maxAttempts; attempt++ {
            if err = fn(); err == nil {
                return nil
            }
            if attempt == maxAttempts-1 {
                break // last attempt: no point waiting again
            }
            // Exponential backoff: 100ms, then 200ms (doubling per attempt).
            // Jitter: adds a random 0-50% on top to desynchronise retries.
            backoff := base * time.Duration(1<<attempt)
            jitter := time.Duration(rand.Int63n(int64(backoff) / 2))
            wait := backoff + jitter
            select {
            case <-time.After(wait):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return err // the error from the final attempt
    }
Without jitter, 500 clients that all timed out at T=3s will all retry at T=3.1s, T=3.3s, T=3.7s — synchronized spikes that hammer an already-struggling service. With jitter, those retries spread across a window, giving the upstream a gentler recovery curve.
Production Checklist
Before any service ships a new outbound dependency, every item on this list must be answered:
Key Takeaways
What's Next
Have you been hit by a cascade?
What did your blast radius look like? Which pattern finally stopped it? Drop your story — I read every reply.
Part 3 dropping soon — follow on Dev.to to get notified.