You have a dashboard. Green lights everywhere. 99.99% uptime. Zero incidents this quarter. Your manager loves it. But you have this nagging feeling: the system is held together with duct tape and silent retries. That is the vanity data trap. It rewards what is easy to measure, not what is actually resilient.
Resilience engineering demands a different kind of metric. Not uptime, but recovery time. Not error budgets, but error shapes. Not mean time between failures, but mean time to acknowledge a symptom. This article is for the person who has to pick what their team tracks, whether you are a staff engineer, a platform lead, or a site reliability manager. We will walk through the decision, compare the options, and show you how to avoid the metrics that just make you look good.
Who Decides and Why the Clock Is Ticking
The decision maker: platform lead vs. staff-level SRE
Someone has to own the choice. In most orgs I have worked with, it is either the platform lead (the person who signs off on infrastructure spend) or a senior SRE who lives inside the incident rotation. The platform lead cares about surface-level stability: uptime, error budgets, dashboards that stay green for the quarterly review. The SRE cares about how the system actually bends under pressure. These two agendas rarely align on day one. The platform lead wants a number that tells a clean story. The SRE wants a number that catches the next outage before it hits pager duty. That tension is productive, but only if someone resolves it fast. Otherwise, the metric defaults to whatever is easiest to scrape from a Prometheus endpoint. Easy is not resilient.
The deadline: before the next quarterly review or incident postmortem
The clock isn't theoretical. You have maybe two sprints before your next review asks for a resilience number, or worse, before a postmortem forces you to defend why you measured the wrong thing. I have watched a team pick 'deployment frequency' as their resilience metric simply because they needed something for a Q3 all-hands slide. No one pushed back. The result? They optimized for faster rollouts, not safer ones. Deploy frequency climbed, but recovery time after a bad deploy stayed flat. That is vanity dressed up as rigor. The catch is: if you don't pick a metric before the deadline, the deadline picks one for you, usually the one that makes the dashboard look good rather than the one that surfaces fragility.
'We chose mean time to recovery because it was already in our existing monitoring stack. We didn't ask what it was missing until the second outage.'
— Staff SRE, e-commerce platform (from a postmortem I reviewed last year)
The stakes: misaligned metrics can kill innovation
Here's the part nobody puts in the charter. A bad resilience metric doesn't just waste dashboard space; it changes what engineers build. If you track 'number of incidents per week' as your north star, teams will stop reporting incidents. They paper over the small ones. They merge risky PRs on Friday afternoon because the incident count is already zero for the week, so who's counting? That's not resilience; that's risk deferred. The real stakes are invisible: a misaligned metric quietly kills the experimentation your platform needs. Teams stop touching anything outside their comfort zone. Deployments slow. The SLOs drift, but nobody rewrites them because the metric says things are fine. That hurts. And once the quarterly review passes with a green dashboard, it is twice as hard to argue for a better metric next quarter.
So who decides, and why is there a deadline? You do, and because the next review or postmortem is closer than you think. Procrastinate, and you hand the choice to someone who doesn't run the pager — or worse, to the default chart in your observability vendor. Quick reality check: a default chart has never saved an incident response.
Three Ways to Measure Resilience (and One Trap)
Error budgets: pros and cons beyond Google's playbook
Error budgets give you a hard ceiling on failure: spend it on downtime or latency, and you stop shipping. That clarity is addictive. Teams I have worked with love the binary decision: green light or red light. The catch is subtle. Error budgets only measure what you already decided to measure. If your SLO tracks availability at the HTTP layer but your real pain lives in database connection storms, the budget stays green while users smash refresh. Quick reality check: error budgets reward the appearance of reliability, not necessarily its guts. They also assume you can agree on an SLO target. That sounds fine until two product managers and one infrastructure lead can't decide between 99.9% and 99.95% for three sprints. The budget goes unused, and nobody learns a thing.
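To make the budget arithmetic concrete, here is a minimal sketch assuming a request-based availability SLO. The function name and the sample numbers are illustrative assumptions, not taken from any particular tool.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative when overspent).

    Assumes a request-based availability SLO, e.g. slo_target=0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests against a 99.9% SLO tolerates about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
overspent = error_budget_remaining(0.999, 1_000_000, 2_000)
print(f"remaining budget: {remaining:.2f}, after a bad month: {overspent:.2f}")
```

Note that `failed_requests` only contains what the SLI observes. A connection storm that never surfaces as a failed HTTP request leaves this number untouched, which is exactly the blind spot described above.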
Latency distributions (p50, p95, p99): what they hide
I once watched a team celebrate shaving 20ms off their p99. Meanwhile, their p999 had doubled: they tuned the wrong tail. That is the trap baked into three-nines thinking: you pick the percentiles that match your dashboard sliders, not the ones that match user pain. P50 tells you what the median user feels; p99 tells you what the unlucky one percent feel. But what about the requests that time out entirely and never appear in the distribution? They vanish. Most latency dashboards drop errors from the histogram, so a 30-second timeout becomes invisible until the p999 spikes or the customer complains. The honest limitation: percentiles compress reality into a handful of numbers, and compression loses edges. Use them, sure, but pair them with a raw error rate and a timeout count. Otherwise you are optimizing a map that cut off the border.
'We cut our p95 by half and still failed the audit because nobody looked at the requests that died before they could even measure.'
— Site reliability lead, post-incident review
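The effect of dropping timed-out requests from the histogram can be shown with a small sketch. It assumes a nearest-rank percentile and counts each timeout at the timeout ceiling; `latency_summary` and the sample data are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile; sorted_values must be ascending and non-empty, p an int in 1..100."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms, timeout_count, timeout_ms):
    """Compare p99 over completed requests with p99 when timeouts are counted too."""
    if not latencies_ms:
        raise ValueError("need at least one completed request")
    completed = sorted(latencies_ms)
    # Treat every timed-out request as if it took the full timeout.
    with_timeouts = sorted(list(latencies_ms) + [timeout_ms] * timeout_count)
    return {
        "p99_completed_only": percentile(completed, 99),
        "p99_with_timeouts": percentile(with_timeouts, 99),
        "timeout_rate": timeout_count / len(with_timeouts),
    }


# Hypothetical sample: 980 fast requests plus 20 that hit a 30-second timeout.
summary = latency_summary([50.0] * 980, timeout_count=20, timeout_ms=30_000.0)
print(summary)
```

With the timeouts excluded, the p99 looks healthy; with them included, it sits at the timeout ceiling. Reporting the timeout rate alongside the percentiles keeps that edge visible.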
Failure injection experiments: direct but painful
Injecting failures is the most honest resilience metric you can run: you literally break something and watch what happens. The pain is real. Not everyone has the stomach to kill a production database replica on a Tuesday afternoon. The experiments take engineering time, scare stakeholders, and require rollback plans that your runbook might not have. Yet the signal-to-noise ratio beats any dashboard. A five-minute chaos experiment reveals cascading timeouts that no latency percentile would show you for weeks. The downside? It's a snapshot, not a trend. You test a specific failure path on a specific Tuesday at 3pm. What about the midnight pattern when your DB patching window overlaps with a traffic spike? Experiment results decay fast. Repeat them, or the metric becomes a museum piece.
Then there is the vanity trap: counting incidents instead of learning. I see teams report 'incident count' as their primary resilience metric; flat or down, they declare victory. That is hollow. One major systemic collapse and twelve minor procedural mistakes both count as one incident each. You learn nothing about depth. Worse, it incentivizes teams to merge small outages into one big ticket or call every alert an incident, inflating the number. Stop counting. Start categorizing by root cause type, by blast radius, by recurrence. The number alone is a lie wrapped in a bar chart.
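Categorizing instead of counting can be as simple as the sketch below. The `Incident` fields and the category labels are assumptions for illustration; use whatever taxonomy your postmortems already produce.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    root_cause: str    # hypothetical labels, e.g. "config", "capacity", "dependency"
    blast_radius: str  # hypothetical labels, e.g. "single-user", "one-region", "global"


def categorize(incidents):
    """Break a flat incident list down by root cause, blast radius, and recurrence."""
    by_cause = Counter(i.root_cause for i in incidents)
    by_radius = Counter(i.blast_radius for i in incidents)
    return {
        "by_root_cause": dict(by_cause),
        "by_blast_radius": dict(by_radius),
        # A root cause seen more than once is a recurrence worth a closer look.
        "recurring_causes": sorted(c for c, n in by_cause.items() if n > 1),
    }


incidents = [
    Incident("config", "global"),
    Incident("config", "one-region"),
    Incident("capacity", "one-region"),
]
report = categorize(incidents)
print(report)
```

The raw count here is three, which says little. The breakdown shows one global incident and a root cause that keeps coming back.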
How to Judge a Metric: Five Criteria
Actionability: can you flip the switch today?
A metric you cannot move is a vanity number dressed up as insight. I have watched teams track 'mean time to innocence' (a clever measure of how fast ops blames dev) and do nothing with it. The trap is comfort: the number looks precise but nobody owns the lever. Ask yourself: if this metric turns red at 3 PM on a Friday, what single action do you take? If you have no answer, the data sits dead on a dashboard. Actionability demands a concrete feedback loop: pager duty responds, code freeze kicks in, or a toggled feature flag shifts the load. That's the difference between a lead indicator and a museum piece.
Sensitivity: does it twitch before customers start shouting?
Robustness to Goodhart's Law: can it be gamed?
Cost of collection vs. value of insight
Distributed tracing gives you gorgeous waterfall charts and a monthly bill that rivals a junior engineer's salary. Is that worth it? Depends. If a single trace saves you two hours of debugging per incident, the math flips. But many teams collect exhaustively because 'data is good', which is an expensive fallacy. The trade-off is real: granular metrics drain compute, storage, and cognitive overhead. Judge every candidate by the ratio: how many decisions per dollar? If you cannot name three actions that depend on a specific metric, drop it. The cheapest metric that moves the needle wins.
Trade-Offs at a Glance: A Comparison Table
Error Budget vs. Latency vs. Failure Injection — Three Flavors, One Reality
Let's put them on the table. An error budget tells you how much downtime your users will tolerate, expressed as a Service Level Objective and monitored by Service Level Indicators. It's elegant when your system is stable, but it goes silent during cascading failures. Latency metrics (p50, p99, tail latency) give you a real-time feel for user experience. They're your first warning when a database query suddenly triples. The catch? They don't tell you if that slow request is actually breaking anything.
Failure injection is different. You purposely break things (kill a pod, throttle a network) and measure the system's response. It reveals seams that no dashboard can predict. But it's noisy. Teams run it weekly and then ignore the results because 'production is too fragile to test right now.' That hurts. Each approach has a blind spot. Error budgets go blind during black-swan events. Latency metrics miss silent data corruption. Failure injection misses the boring, slow drift that kills reliability over months. The trick is knowing which blind spot hits your team hardest, and that changes with every deployment.
When Each Approach Breaks Down — Concrete Scenarios
Error budgets break first when your customers don't complain. A 99.9% SLO feels fine until you realize your users are gone, not because the system is down, but because it's so slow they gave up. Latency metrics? They fail you during a queue buildup that creates perfect responses for old data while new requests pile up in the background. I have seen a team celebrate a flat p50 while their workers silently dropped 30% of jobs. Failure injection breaks when your test environment doesn't match production: a script kills a container, everything looks fine, but in real traffic the circuit breaker would have triggered cascading timeouts. The test passed; the actual system didn't.
The worst pitfall? Using one metric exclusively. Teams that rely solely on error budgets tend to fill their SLO with 'acceptable' failures until the seam blows out. Pure latency shops chase sub-millisecond p99s while users can't complete purchases. Failure-injection zealots break things joyfully but never check whether those broken things actually affect user stories. Pick one lens and you're blind in one eye.
'A single metric is a weapon; a pair is a compass. Three make a map — but only if you actually look at all three.'
— SRE lead, after her staff's sixth postmortem in a quarter
Hybrid Strategies That Real Teams Use, Not Theory
Most resilient teams don't choose. They layer. Error budget as the contract: you promise 99.9% uptime, you measure it daily. Latency as the canary: if p99 creeps above 200ms, you investigate before the error budget burns. Failure injection as the stress test: run it weekly against a shadow copy of production traffic, and compare results to both budgets and latencies. That triad gives you three views of the same system: contractual, experiential, and adversarial.
The trade-off? Complexity. You now monitor three times as many dashboards. Alert fatigue rises. Someone has to maintain the injection harness. But what usually breaks first is not the complexity; it is the team's will to triage conflicting signals. An error budget says 'fine,' latency says 'slow,' injection says 'vulnerable.' Which do you act on? My recommendation: use the comparison as your triage table. If two of three metrics are red, stop the release. If one is red, investigate. If zero are red, you have no data yet. That's the trap you avoid by layering, not choosing.
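The triage rule above is small enough to write down as code. This is a sketch of that rule only; the signal names and return strings are hypothetical, and the all-green case is labelled the way the text treats it, as an absence of evidence rather than an all-clear.

```python
def release_decision(signals):
    """Apply the two-of-three triage rule.

    signals maps a signal name to True when that signal is red.
    """
    red = sum(1 for is_red in signals.values() if is_red)
    if red >= 2:
        return "stop-release"
    if red == 1:
        return "investigate"
    # Zero red signals: treated as "no data yet", not as proof of health.
    return "no-signal-yet"


decision = release_decision(
    {"error_budget": False, "latency": True, "failure_injection": True}
)
print(decision)
```

Keeping the rule in one function means the argument about which signal wins happens once, in review, instead of during every release.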
From Choice to Practice: Implementing Your Metrics
Tooling: OpenTelemetry vs. Datadog vs. Custom
You have your metrics chosen: resilience scores, error budgets, whatever survived the filter. Now comes the part that trips most teams: picking the tool that actually surfaces them without drowning you in new toil. OpenTelemetry is the default for a reason: it's vendor-neutral, open, and you can wire it into almost anything. But neutral doesn't mean simple. The setup curve bites; expect two to three days just to get spans flowing correctly across a single Kubernetes namespace. Datadog, by contrast, hands you beautiful dashboards in an afternoon. The catch? You pay per host, per metric, per query, and those resilience dashboards can double your monthly observability bill overnight if you're not careful with cardinality. Custom tooling? I have seen exactly one team pull it off without regret, and they had three dedicated platform engineers. For the rest of us, the honest advice is: start with OpenTelemetry plus a thin visualization layer (Grafana, Honeycomb, whatever your ops team already tolerates), and defer the fancy custom stuff until your metrics prove they're worth the investment. Wrong order? You burn budget before you have evidence.
Process: How to Introduce New Metrics Without Revolt
Drop a new dashboard on the team Monday morning and watch the Slack channel go quiet. Not the good quiet. The 'we're ignoring this until it goes away' quiet. Most teams skip the onboarding step entirely: they announce the metrics, assume everyone understands the definitions, and wonder why nobody looks at them. Don't do that. Instead, run a single thirty-minute brown-bag per squad: show them exactly which metric maps to which failure mode, what a healthy reading looks like versus an ugly one, and (this is critical) what action they should take when the number turns red. If the metric doesn't tell you to roll back, page someone, or ignore it, you haven't finished defining it. I fixed this once by pairing each resilience metric with a specific runbook link. Suddenly the dashboard wasn't abstract; it was a triage tool. The revolt evaporated.
'The metric that requires a meeting to interpret is the metric that will be ignored by 4 pm Friday.'
— Staff engineer, post-incident retro on a platform crew
Pilot Phase: One Service, Two Weeks, Three Questions
Never roll resilience metrics across the entire org at once. That's how you get fifteen competing dashboards and zero adoption. Pick one service (ideally a middle-weight one, not the critical path, not the abandoned side project) and run a two-week pilot. Set three questions to answer at the end: (1) Did the metric catch something the team's existing alerts missed? (2) Did anyone actually look at it during an incident, or was it only checked during the standup review? (3) Did the data collection cost more (time, money, alert fatigue) than the insight it produced? If the answer to question one is 'no' twice in a row, kill the metric or redesign it. If question two comes back negative, your onboarding failed; go back to the brown-bag step. If question three stings, trim cardinality: fewer tag combinations, longer sampling intervals. The pilot phase is your pressure test. Treat it like one. What usually breaks first is the latency between the metric moving and the team caring. Close that gap before you expand.
What Happens When You Pick the Wrong Metric
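The three pilot questions map directly onto follow-up actions, which the sketch below encodes. The function and argument names are hypothetical, and the default of expanding the pilot when nothing is flagged is an assumption rather than something the questions themselves dictate.

```python
def pilot_next_steps(caught_new_issue_history, used_during_incident, cost_exceeded_insight):
    """Turn the three end-of-pilot answers into follow-up actions.

    caught_new_issue_history: answers to question 1 across pilots, oldest first.
    used_during_incident: answer to question 2 for the latest pilot.
    cost_exceeded_insight: answer to question 3 for the latest pilot.
    """
    steps = []
    # Question 1 answered 'no' twice in a row: the metric is not earning its place.
    if len(caught_new_issue_history) >= 2 and not any(caught_new_issue_history[-2:]):
        steps.append("kill or redesign the metric")
    # Question 2 negative: nobody used it when it mattered, so onboarding failed.
    if not used_during_incident:
        steps.append("redo the brown-bag onboarding")
    # Question 3 stings: collection costs more than the insight is worth.
    if cost_exceeded_insight:
        steps.append("trim cardinality")
    # Assumption: with nothing flagged, the pilot is ready to widen.
    return steps or ["expand the pilot"]


steps = pilot_next_steps([True, False, False], used_during_incident=True, cost_exceeded_insight=True)
print(steps)
```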
Alert Fatigue from High-Signal, Low-Value Alerts
You pick a metric that feels operational, say, p99 latency under 200ms. Great. Your monitoring tools light up every time a single outlier request crosses that line. Within a week, your on-call rotation has burned through its goodwill. The catch: most of those alerts are benign. A mobile client on a ferry, a DNS hiccup in one edge region: none of them affect real user workflows. But the alerting system treats each like a house fire. I have seen teams start ignoring the dashboard entirely after two weeks of this. That's worse.
High-signal, low-value alerts poison trust. You stop chasing the metric because it cries wolf hourly. Meanwhile, actual degradation (a gradual memory leak, a database connection pool shrinking) slips past because everyone is numb. Wrong metric, wrong threshold, wrong outcome.
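One common way to stop a single outlier from paging anyone is to alert on the share of requests over the threshold within a window, rather than on any individual request. The sketch below shows that idea; it is an illustrative alternative, not a rule prescribed here, and the 1% tolerance is an assumed value.

```python
def should_page(latencies_ms, threshold_ms=200.0, max_breach_fraction=0.01):
    """Page only when more than max_breach_fraction of requests in the window exceed the threshold."""
    if not latencies_ms:
        return False
    breaches = sum(1 for value in latencies_ms if value > threshold_ms)
    return breaches / len(latencies_ms) > max_breach_fraction


# Hypothetical windows: one slow request out of 1,000 versus 50 slow requests out of 1,000.
lone_outlier = [80.0] * 999 + [4_000.0]
real_degradation = [80.0] * 950 + [450.0] * 50

print(should_page(lone_outlier), should_page(real_degradation))
```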
False Confidence from Flatlined Uptime
Uptime at 99.9% for six straight months. Sounds invincible. But here's the trap: uptime measures whether the service responds, not whether it responds usefully. A login page that returns HTTP 200 but takes seven seconds to render? Still 'up.' A checkout endpoint that drops every third user into an infinite spinner? Also 'up.' The vanity of flatlined availability masks silent degradation.
'We thought we were invincible. Then the abandonment rate hit 12% and nobody looked at uptime.'
— Lead SRE at a mid-market e-commerce platform, after they discovered their perfect uptime hid a 12% abandonment rate
That gap, availability without quality, is where customer complaints pile up while engineering high-fives. When you measure the wrong thing, you celebrate the wrong success. Then the hard conversation starts: 'But dashboards say everything is green.'
Missed Near-Misses and Silent Degradation
Teams obsessed with aggregate metrics (mean time between failures, average request duration) miss the cracks. A near-miss is a cascading failure that self-healed before anyone noticed. No page, no ticket, no postmortem. Wrong-metric teams never even register it happened. Silent degradation is worse: you degrade throughput by 15% over three months, and your average-based metric stays flat because you happened to provision extra headroom in week two. The metric doesn't lie, but it doesn't tell the truth either.
What usually breaks first is the team's intuition. They stop trusting their gut because the numbers say fine. I have fixed exactly this by switching from 'average latency' to 'worst-user latency at tail' and suddenly the same system looked terrifying. One metric swap, and the near-misses turned into urgent items.
Team Burnout from Chasing the Wrong Number
Wrong metrics create perverse incentives. Example: you target 'mean time to resolve (MTTR) under 30 minutes.' Your engineers begin closing incidents fast: rebooting boxes, hot-patching without root cause, merging quick rollbacks that break later. MTTR drops. But stability does not improve. Next quarter, the same incidents recur. The team is sprinting on a hamster wheel, and the metric gives them no feedback loop for prevention. Burnout follows.
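The swap from average latency to worst-user latency can be illustrated with a short sketch. It assumes 'worst-user latency' means the mean of the slowest 1% of requests; that definition, the function name, and the sample data are all assumptions.

```python
import statistics


def mean_and_worst(latencies_ms, worst_percent=1):
    """Return (overall mean, mean of the slowest worst_percent of requests)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Always keep at least one request in the worst bucket.
    k = max(1, len(ordered) * worst_percent // 100)
    return statistics.mean(ordered), statistics.mean(ordered[-k:])


# Hypothetical window: 990 requests at 100 ms and 10 requests at 5 seconds.
overall, worst = mean_and_worst([100.0] * 990 + [5_000.0] * 10)
print(f"average: {overall:.0f} ms, worst 1%: {worst:.0f} ms")
```

The average barely moves when ten users wait five seconds. The tail figure is where those users show up.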
Think about it: you can hit 29 minutes MTTR for a year and still have the worst SLO-driven reliability on your team. That hurts. The metric says you succeed. Your people say they're exhausted. Pick the wrong number, and you optimize for exhaustion, not resilience.
Mini-FAQ: Four Questions Practitioners Ask
Should we stop tracking uptime?
Not entirely — but you should stop rewarding it. Uptime is a lagging output, not a resilience input. A system at 99.999% uptime can still be brittle: one cascading failure and the whole thing folds because nobody exercises failure modes. I have seen teams celebrate a perfect uptime quarter while their pager rotates through the same two engineers every night — that's not resilience, that's deferred debt. Use uptime as a sanity check, not a steering wheel. The trap is when the board demands a single number and uptime is the easiest to manufacture.
How many metrics is too many?
If your team can't recite the top three from memory, you're in dashboard clutter hell. The pitfall is metric accumulation — teams add one per incident, then defend each like a sacred artifact. Usually the answer is four to seven actionable metrics, plus maybe two leading signals you're still learning to trust. More than ten and you're statistically guaranteed to find a red number that means nothing. Quick reality check — if a metric goes red and nobody changes their next action, it's wallpaper.
What about SLOs? Aren't they the answer?
SLOs are a contract, not a resilience strategy. They tell you when the user experience degrades, but they won't tell you why your recovery paths are slow or why your team hesitates during rollback. The catch is that teams often measure SLO attainment to hit a target, then game the burn rate — slowing deploys, freezing features — instead of improving the system's ability to absorb surprise. One SLO is never enough; you need a companion metric about response effectiveness. Otherwise you're grading the exam, not teaching the class.
'We hit our SLO. Then the DB failed and our runbook said 'contact the DBA team.' The DBA team was asleep.'
— Platform engineer, post-mortem retrospective
That is not an SLO problem. It's a metric gap: nothing measured the time to first competent action during an unfamiliar failure.
How often should we revisit our metric set?
Every quarter minimum. Every major incident — immediately. Metrics that survive six months without scrutiny become organizational habits, and habits hide decay. The best teams I've seen do a thirty-minute metric audit after each significant outage: what did we look at, what did we miss, what led us astray. That practice is cheap compared to the alternative — chasing a vanity number while the real seam blows out. Set a recurring calendar invite. Miss it twice and your metric set is stale.
The Only Recommendation You Need (No Hype)
Start with two metrics: time to detect and time to mitigate
Forget the dashboard full of shiny graphs. If your team can only track two numbers, make them time to detect (TTD) and time to mitigate (TTM). That's it. TTD tells you how long a problem sits unnoticed — dead minutes where users suffer and you're still sipping coffee. TTM measures how fast you stop the bleeding once you know. I've seen teams obsess over 'uptime percentage' while a partial outage ran for forty minutes undetected. Uptime showed 99.9%. Users saw a broken checkout. The catch is that both numbers feel uncomfortable at first — they reveal blind spots your org has learned to live with. That's the point. Don't polish them. Track them raw, track them honestly, and let the discomfort drive improvement.
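Both numbers fall out of three timestamps per incident. The sketch below assumes you can reconstruct when user impact started, when it was detected, and when it was mitigated; the `IncidentTimeline` type and the sample incidents are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class IncidentTimeline:
    started: datetime    # when user impact began (often reconstructed afterwards)
    detected: datetime   # first alert or human acknowledgement
    mitigated: datetime  # when user impact stopped


def ttd_minutes(incident):
    """Time to detect: how long the problem sat unnoticed."""
    return (incident.detected - incident.started).total_seconds() / 60


def ttm_minutes(incident):
    """Time to mitigate: how long it took to stop the bleeding once known."""
    return (incident.mitigated - incident.detected).total_seconds() / 60


def summarize(timelines):
    """Report raw medians; no smoothing, no exclusions."""
    return {
        "median_ttd_min": median([ttd_minutes(t) for t in timelines]),
        "median_ttm_min": median([ttm_minutes(t) for t in timelines]),
    }


timelines = [
    IncidentTimeline(datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 10, 40), datetime(2024, 3, 4, 10, 55)),
    IncidentTimeline(datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 4), datetime(2024, 3, 9, 14, 30)),
]
result = summarize(timelines)
print(result)
```

The hard part is not the arithmetic but the `started` timestamp: it usually has to be backfilled from logs or user reports, and that backfilling is where the uncomfortable forty-minute gaps surface.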
Add one context metric: change velocity
Two metrics alone create a dangerous incentive: you can game TTD and TTM by never deploying. Ship nothing, detect nothing, mitigate nothing. Perfect score — zero resilience. That's a trap. Add change velocity — simple count of deployments per week, or better yet, number of production changes that touched critical paths. Now you see the trade-off live. Faster deploys usually spike TTD/TTM at first. Good. That means you're actually measuring reality. The pitfall here is treating change velocity as a target instead of a context signal — do not celebrate 50 deploys if your detection time is climbing. They're a pair. One without the other is vanity.
'We cut our detection time in half. Then we stopped deploying because everyone was too scared to touch the system. That's when I learned we were measuring the wrong thing.'
— Platform engineer, after a postmortem that hurt more than the outage
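Reading change velocity as context rather than as a target might look like the sketch below. It compares each week with the previous one; the flag names and the weekly sample data are hypothetical.

```python
def velocity_context(weeks):
    """Flag week-over-week patterns in (deploys, median_ttd_minutes) pairs, oldest first."""
    flags = []
    for (prev_deploys, prev_ttd), (deploys, ttd) in zip(weeks, weeks[1:]):
        if deploys == 0:
            # No changes shipped: a good TTD here says nothing about resilience.
            flags.append("no-deploys")
        elif deploys > prev_deploys and ttd > prev_ttd:
            # More deploys and slower detection: do not celebrate the deploy count.
            flags.append("velocity-up-detection-worse")
        else:
            flags.append("ok")
    return flags


# Hypothetical four weeks of (deploys, median TTD in minutes).
weekly = [(10, 12.0), (14, 18.0), (0, 18.0), (5, 9.0)]
flags = velocity_context(weekly)
print(flags)
```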
Review quarterly, drop vanity, keep what hurts
Every three months, sit down with a single question: What would we stop tracking if no one asked for it? Most teams accumulate metrics like old code: once useful, now clutter. Cut them. Keep the ones that make your team wince when they look at the numbers. Those are the ones that matter. Quick reality check: if a metric hasn't changed your behavior in two quarters, it's decoration, not resilience. You don't add more metrics to get clarity; you strip away everything until only the painful truths remain. That core (TTD, TTM, change velocity, reviewed quarterly) won't scale to every team forever. But it's honest. It's minimal. And it's the only starting point where you won't fool yourself.
Start today. Pick one service. Track TTD and TTM for two weeks. See what you find. That alone will surface more truth than any dashboard you've ever built.