DISTRIBUTED SYSTEMS · OBSERVABILITY · BACKEND ENGINEERING

Everything Was Green…
But Production Was Broken.

0 errors. 0 alerts. 100% failure. At 2 AM, every dashboard said healthy — but orders were failing, inventory was stuck, and the business impact was real. This is what happened, and what it taught me about building production-grade systems.

anuragdevon · 15th May 2026 · 9 min read · Part 1: Production Debugging Playbook

01

The 2 AM Incident

The on-call ping came at 2 AM. Orders were failing. Inventory wasn't updating. Business teams were escalating.

I opened the dashboards. Every metric was in the green.

Error rate: 0.00%
Latency p99: Normal
Consumer lag: 0
Infra health: All pods running
Alerts fired: None

No spikes. No errors. No alerts. And yet — orders were failing, inventory was stuck, and the business impact was very real.

⚠ THE PARADOX

A system can be perfectly "healthy" by every metric you track while silently failing to do its actual job. This is one of the most dangerous failure modes in distributed systems.

02

Why This Matters

As a backend engineer on a P0 (business-critical) system, your job isn't just to write working code. It's to answer three questions under pressure at 2 AM:

  • What happens when things go wrong?
  • How will you know it went wrong?
  • Can you debug it fast enough to matter?

This incident exposed the exact gap between two very different system states that look identical from the outside:

System is running: Pods up. No errors. Metrics normal.
System is working: Business logic executing correctly end-to-end.

Most monitoring covers the first. This bug lived entirely in the gap between them.
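
One way to watch the second state rather than the first is to compare a business-level input with its expected effect. The sketch below is a minimal illustration, assuming each placed order should produce one inventory update within the same evaluation window. `Window`, `shouldAlert`, and the 0.95 threshold are hypothetical names and values, not taken from the production system.

```go
package main

import "fmt"

// Window holds counts for one evaluation interval (say, five minutes).
// Both fields are assumed to come from counters you already emit.
type Window struct {
	OrdersPlaced     int
	InventoryUpdates int
}

// shouldAlert reports whether too few orders in the window led to an
// inventory update. An empty window is treated as "nothing to judge".
func shouldAlert(w Window, minRatio float64) bool {
	if w.OrdersPlaced == 0 {
		return false
	}
	ratio := float64(w.InventoryUpdates) / float64(w.OrdersPlaced)
	return ratio < minRatio
}

func main() {
	healthy := Window{OrdersPlaced: 200, InventoryUpdates: 200}
	incident := Window{OrdersPlaced: 200, InventoryUpdates: 0}
	fmt.Println(shouldAlert(healthy, 0.95), shouldAlert(incident, 0.95))
}
```

In an incident like this one, pods, error rate, and consumer lag would all look the same in both windows. Only the ratio changes.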

03

System Architecture

Simplified from production. The inventory update pipeline looked like this:

🛒 Order Service (publishes event)
  → 📨 Event Bus (Kafka / SQS)
  → ⚙️ Inventory Consumer (processes & updates) 🐛 bug here
  → 🗄️ Database (inventory store)

Each piece of this pipeline was independently healthy. The bug wasn't in the infrastructure — it was in the logic of the consumer, in a path that was never instrumented.

04

Expected vs Reality

Expected Flow

1. Event published to bus
2. Consumer picks up event
3. Validation passes
4. Inventory DB updated

What Actually Happened

1. Event published to bus
2. Consumer running, logs clean, metrics normal
3. Validation failed silently; the handler returned nil
4. Inventory never updated

// KEY INSIGHT

Steps 1 and 2 were measured. Steps 3 and 4 were not. The system reported success for what it measured, and had no knowledge of what it skipped.

05

The Debugging Journey

This is the real one — not the clean retrospective version.

Step 1: Check logs
Nothing. No errors, no warnings, no anomalies. Clean as production should never be during an incident.
Step 2: Check metrics
Normal. Throughput fine, latency fine, consumer lag zero. Every number looked exactly as it should.
Step 3: Check infrastructure
Healthy. All pods running. No OOMs, no crashloops, no throttling.
Step 4: Reproduce locally
Couldn't. The event that triggered the silent skip wasn't in our test fixtures; it was a production edge case we'd never seen before.

At this point, you hit the wall every backend engineer knows:

⚠ THE WALL

"If nothing is wrong… why is everything broken?"

This is where most debugging sessions stall. You've checked everything your tooling can see — and found nothing, because the failure is in the blind spot your tooling doesn't cover.

The breakthrough came from manually tracing a specific failing order ID end-to-end, not from dashboards. We followed the event through each hop by hand, querying each system directly. That's when we found it.
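
The manual trace can be thought of as asking each hop, in pipeline order, whether it has a record of the order, and stopping at the first one that does not. The sketch below is only an illustration of that idea. The `Hop` type, the hop names, and the in-memory lookups are hypothetical stand-ins for the direct queries we ran against each real system.

```go
package main

import "fmt"

// Hop is one system in the pipeline that can be asked whether it has
// a record of a given order. Seen is a stand-in for a real query.
type Hop struct {
	Name string
	Seen func(orderID string) bool
}

// firstMissingHop walks the hops in pipeline order and returns the name
// of the first one with no record of the order, or "" if all have it.
func firstMissingHop(hops []Hop, orderID string) string {
	for _, h := range hops {
		if !h.Seen(orderID) {
			return h.Name
		}
	}
	return ""
}

// inSet builds an in-memory lookup so the sketch runs on its own.
func inSet(ids ...string) func(string) bool {
	set := make(map[string]bool, len(ids))
	for _, id := range ids {
		set[id] = true
	}
	return func(id string) bool { return set[id] }
}

// examplePipeline uses made-up data: ord-2 reaches the consumer but
// never lands in the inventory store.
func examplePipeline() []Hop {
	return []Hop{
		{Name: "order-service", Seen: inSet("ord-1", "ord-2")},
		{Name: "event-bus", Seen: inSet("ord-1", "ord-2")},
		{Name: "inventory-consumer", Seen: inSet("ord-1", "ord-2")},
		{Name: "inventory-db", Seen: inSet("ord-1")},
	}
}

func main() {
	fmt.Println(firstMissingHop(examplePipeline(), "ord-2"))
}
```

The value of the exercise is that it narrows the search to the boundary between two hops, which is something an aggregate dashboard cannot do for a single order.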

06

The Hidden Bug

Buried deep inside the inventory consumer — a perfectly reasonable-looking early return:

go — consumer.go
func processEvent(event InventoryEvent) error {
    if !isValid(event) {
        return nil  // ← the entire problem
    }

    // ... update inventory ...
    return updateInventory(event)
}

That's it. One line. return nil.

It wasn't a crash. It wasn't a timeout. It wasn't a network error. It was a deliberate, silent skip — and it left absolutely no trace of itself anywhere in the system.

⚠ WHY return nil IS WRONG HERE

Returning nil (no error) tells the message broker: "I processed this successfully, acknowledge and remove it." The event was gone from the queue, the consumer reported success, and the inventory was never updated. From every angle it looked like success.
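
To make the ack contract concrete, here is a small in-memory simulation. It is a sketch under assumptions: `consume` is a simplified stand-in for a broker client loop (real Kafka and SQS clients differ in detail), and the validation rule and event fields are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// InventoryEvent is a pared-down, hypothetical event shape.
type InventoryEvent struct {
	ID  string
	SKU string
	Qty int
}

// Result records what the simulated broker did with each message.
type Result struct {
	Acked    int // handler returned nil: message removed from the queue
	Requeued int // handler returned an error: message kept for redelivery
}

// consume applies the ack contract described above: a nil return acks
// the message, a non-nil return keeps it for another attempt.
func consume(events []InventoryEvent, handle func(InventoryEvent) error) Result {
	var r Result
	for _, e := range events {
		if err := handle(e); err != nil {
			r.Requeued++
			continue
		}
		r.Acked++
	}
	return r
}

// isValid is an assumed rule: an event needs a SKU and a positive quantity.
func isValid(e InventoryEvent) bool { return e.SKU != "" && e.Qty > 0 }

var errInvalid = errors.New("invalid inventory event")

// silentHandler mirrors the buggy consumer: invalid events return nil.
func silentHandler(stock map[string]int) func(InventoryEvent) error {
	return func(e InventoryEvent) error {
		if !isValid(e) {
			return nil
		}
		stock[e.SKU] += e.Qty
		return nil
	}
}

// strictHandler surfaces the same condition as an error instead.
func strictHandler(stock map[string]int) func(InventoryEvent) error {
	return func(e InventoryEvent) error {
		if !isValid(e) {
			return fmt.Errorf("event %s: %w", e.ID, errInvalid)
		}
		stock[e.SKU] += e.Qty
		return nil
	}
}

func main() {
	events := []InventoryEvent{
		{ID: "e1", SKU: "sku-1", Qty: 2},
		{ID: "e2", SKU: "", Qty: 5}, // fails validation
	}
	fmt.Printf("silent: %+v\n", consume(events, silentHandler(map[string]int{})))
	fmt.Printf("strict: %+v\n", consume(events, strictHandler(map[string]int{})))
}
```

With the silent handler both messages are acknowledged, even though only one changed any stock. The strict handler is not automatically the right fix: a permanently invalid event would be redelivered over and over. That is why the fix later in this post keeps the acknowledgment but makes the skip visible and recoverable.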

07

Why It Was Dangerous

This single return nil caused a complete failure cascade across every layer of observability:

📋 No logs: The skip path had zero log statements. It was invisible.
📊 No metrics: No counter incremented. No histogram recorded. The branch simply didn't exist to monitoring.
🔁 No retries: Returning nil acknowledged the message. No retry was ever attempted.
📥 No DLQ: No dead-letter queue entry. The event was acknowledged and dropped forever.
🚨 No alerts: Alerts fire on what metrics track. With no metric, no alert was possible.

This is one of the worst failure modes in distributed systems. A crash is loud: it pages you immediately. This was silent. It killed orders one by one, invisibly, for hours.

⚠ REAL PROBLEM

This wasn't a "bug" in the traditional sense. The logic was intentional. The problem was a design failure in observability. The business logic was correct. The infrastructure was stable. But the visibility into decision points was completely missing.

08

The Fix

Simple, but it required a shift in thinking. The fix wasn't complicated — it was just making the invisible visible.

1 — Make the failure visible

go — after
if !isValid(event) {
    log.Warn("event_validation_failed",
        "event_id", event.ID,
        "reason", validationReason(event),
    )
    return nil
}

2 — Increment a metric for every drop

go — after
metrics.Increment("inventory.event.validation.failure",
    "event_type", event.Type,
    "reason",     validationReason(event),
)

3 — Send to DLQ for debugging and recovery

go — after
// DLQ lets you inspect failed events and redrive after a fix
if err := sendToDLQ(event); err != nil {
    log.Error("dlq_publish_failed", "err", err)
}

// FULL FIXED VERSION

Log the skip. Count it. Send it to DLQ. Alert on the metric. Now the silent failure screams at you.
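
Put together, the three steps might look like the sketch below. It is a runnable approximation, not the production consumer: it uses the standard `log/slog` package (Go 1.21+) in place of our logger, and the `Metrics` and `DLQ` interfaces, the in-memory fakes, and the validation rules are all assumptions made for the example. It also makes one choice the snippets above do not: if the DLQ publish fails, it returns an error so the message is not acknowledged.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
)

type InventoryEvent struct {
	ID, Type, SKU string
	Qty           int
}

// Metrics and DLQ are hypothetical interfaces standing in for real clients.
type Metrics interface{ Increment(name string, tags ...string) }
type DLQ interface{ Send(e InventoryEvent) error }

type Consumer struct {
	Log     *slog.Logger
	Metrics Metrics
	DLQ     DLQ
	Update  func(InventoryEvent) error
}

// validationReason returns "" for a valid event, else a short reason code.
// The rules here are assumptions made for this sketch.
func validationReason(e InventoryEvent) string {
	switch {
	case e.SKU == "":
		return "missing_sku"
	case e.Qty <= 0:
		return "non_positive_qty"
	}
	return ""
}

func (c *Consumer) processEvent(e InventoryEvent) error {
	reason := validationReason(e)
	if reason == "" {
		return c.Update(e)
	}
	// Log the skip and count it.
	c.Log.Warn("event_validation_failed", "event_id", e.ID, "reason", reason)
	c.Metrics.Increment("inventory.event.validation.failure",
		"event_type", e.Type, "reason", reason)
	// Park the event. If that fails, do not ack (a choice made in this sketch).
	if err := c.DLQ.Send(e); err != nil {
		c.Log.Error("dlq_publish_failed", "event_id", e.ID, "err", err)
		return fmt.Errorf("send event %s to DLQ: %w", e.ID, err)
	}
	return nil // acknowledged, but logged, counted, and recoverable
}

// In-memory fakes so the sketch runs without external services.
type memMetrics struct{ counts map[string]int }

func (m *memMetrics) Increment(name string, tags ...string) { m.counts[name]++ }

type memDLQ struct {
	events []InventoryEvent
	fail   bool
}

func (d *memDLQ) Send(e InventoryEvent) error {
	if d.fail {
		return errors.New("dlq unavailable")
	}
	d.events = append(d.events, e)
	return nil
}

func newConsumer(m *memMetrics, d *memDLQ, stock map[string]int) *Consumer {
	return &Consumer{
		Log:     slog.New(slog.NewTextHandler(os.Stderr, nil)),
		Metrics: m,
		DLQ:     d,
		Update: func(e InventoryEvent) error {
			stock[e.SKU] += e.Qty
			return nil
		},
	}
}

func main() {
	m := &memMetrics{counts: map[string]int{}}
	d := &memDLQ{}
	stock := map[string]int{}
	c := newConsumer(m, d, stock)
	_ = c.processEvent(InventoryEvent{ID: "e1", Type: "order_placed", SKU: "sku-1", Qty: 2})
	_ = c.processEvent(InventoryEvent{ID: "e2", Type: "order_placed", Qty: 5})
	fmt.Println(stock["sku-1"], m.counts["inventory.event.validation.failure"], len(d.events))
}
```

The early return is still there. The difference is that taking it now leaves a log line, a counter increment, and a parked copy of the event behind.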

09

The Mindset Shift

Before: "If it fails, it will show up."
After: "If I don't explicitly track it, it doesn't exist."

This is the hardest shift to make because it requires you to think about what you don't see — not just what you do. It requires designing for failure visibility as a first-class concern, not an afterthought.

The question to ask at every decision branch in your consumer is not "what happens on the happy path?" but "what happens when this branch is taken, and will I know it was taken?"
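
One way to make that question hard to forget is to have every branch return a named outcome and to record outcomes in exactly one place. This is a sketch of that pattern, not the code we shipped. The `Outcome` values, the duplicate check, and the validation rule are illustrative assumptions.

```go
package main

import "fmt"

// Outcome names every way handling an event can end. The values are
// illustrative; a real consumer would define its own set.
type Outcome string

const (
	Applied          Outcome = "applied"
	SkippedInvalid   Outcome = "skipped_invalid"
	SkippedDuplicate Outcome = "skipped_duplicate"
)

type Event struct {
	ID, SKU string
	Qty     int
}

// Consumer keeps its state in memory so the sketch runs on its own.
type Consumer struct {
	seen   map[string]bool
	stock  map[string]int
	counts map[Outcome]int
}

func NewConsumer() *Consumer {
	return &Consumer{
		seen:   map[string]bool{},
		stock:  map[string]int{},
		counts: map[Outcome]int{},
	}
}

// decide holds the branching logic. Every path returns a named Outcome,
// so no branch can be taken without saying which one it was.
func (c *Consumer) decide(e Event) Outcome {
	switch {
	case e.SKU == "" || e.Qty <= 0:
		return SkippedInvalid
	case c.seen[e.ID]:
		return SkippedDuplicate
	}
	c.seen[e.ID] = true
	c.stock[e.SKU] += e.Qty
	return Applied
}

// Handle is the single place where outcomes are recorded.
func (c *Consumer) Handle(e Event) Outcome {
	o := c.decide(e)
	c.counts[o]++
	return o
}

func main() {
	c := NewConsumer()
	for _, e := range []Event{
		{ID: "e1", SKU: "sku-1", Qty: 2},
		{ID: "e1", SKU: "sku-1", Qty: 2}, // redelivery of e1
		{ID: "e2", Qty: 5},               // missing SKU
	} {
		fmt.Println(e.ID, c.Handle(e))
	}
}
```

Adding a new branch then means adding a new `Outcome`, which shows up in the counts without anyone having to remember to instrument it.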

// THE PRINCIPLE

Your system is not reliable because it doesn't crash.
It's reliable because it tells you when something is wrong.

10

Production Checklist

Every time I design a consumer now, I run through this before the PR goes up:

  • Log every decision branch. If an event is skipped, dropped, or re-routed, there must be a log line. No silent paths.
  • Add metrics for drops, skips, and retries. Every non-happy-path outcome gets its own counter. You can't alert on what you don't measure.
  • Never return nil silently. A silent success acknowledgment is a lie to the message broker. Make it an explicit, logged, measured decision.
  • Add DLQ for all debugging paths. Dead-letter queues are not just for retries; they're your forensic tool for understanding production edge cases.
  • Think in failure scenarios first. Before writing the happy path, enumerate how this code can fail silently. Design the observability for those paths before writing the logic.
  • Alert on business metrics, not just infra metrics. "Orders updated per minute" is a better signal than "consumer throughput." Alert on what the business cares about.
11

Key Takeaways

1. Logs tell a story you choose
If you don't log it, it didn't happen, as far as your future self at 2 AM is concerned. Every skip is a story that needs to be told.
2. Metrics only measure what you track
No metric equals no failure signal, even if the failure is actively happening. Unmeasured paths are blind spots by definition.
3. Silent failures are worse than crashes
Crashes alert you. A crash is honest. Silence kills you slowly, one unprocessed event at a time, for hours before anyone notices.
12

What's Next

This is Part 1 of a series based on real production learnings at scale:

Part 1
When Logs Lie — Silent failures & observability gaps
This post.
Part 2
Live now. Goroutine exhaustion, retry storms, circuit breakers.
Part 3
The Phantom Bug — Reproducing production-only failures locally
Coming soon.
END OF PART 1

Have you hit this wall?

Drop your story — a bug where everything looked fine but production was broken. I read every reply.

