Service level objectives and error budgets: turning reliability into a product decision
How SLIs, SLOs, and error budgets connect user-visible reliability to engineering trade-offs: choosing indicators, setting targets, and using burn alerts without drowning in metrics.
A team ships a caching layer to cut database load. Latency improves for the median request, but a subtle bug causes 1% of checkout calls to fail with 500 for two hours before anyone notices—because the dashboard still shows “green” CPU and healthy pods, and synthetic uptime checks only hit the happy path. The incident is not only a code defect; it is a signal design defect. The organization never agreed on what “good” means for real users, so nobody knew how fast unacceptable was accumulating.
Service level indicators (SLIs) describe measurable aspects of service health. Service level objectives (SLOs) attach targets to those indicators over a window of time. Error budgets are the complement of an SLO: the amount of unreliability you are willing to spend before reliability work must take priority over new features. Together, they turn reliability from a vague virtue into something you can budget, alert on, and discuss with product the same way you discuss capacity or cost.
This article walks through how to choose SLIs that match user journeys, how to set SLOs without cargo-culting “five nines,” and how error budgets change behavior in teams building production backends—drawing on patterns that show up repeatedly in product and consulting engagements where scale and clarity both matter.
From vague uptime to user-centric SLIs
Why “percentage uptime” is often the wrong SLI
Classic availability metrics—binary probes that hit /health every minute—measure whether something responds. They do not measure whether users accomplish their goals: authenticated reads, writes with correct consistency, payments that settle, search results that return within an acceptable tail latency. A service can be “up” while systematically failing a critical cohort.
An SLI should therefore be grounded in user value (or in a proxy very close to it), not only in infrastructure liveness. That does not mean every SLI must be a business KPI; it means the event set and success criteria should map to what “working” means for a slice of traffic you care about.
Good SLI shapes
Most SLIs fall into a few families:
- Request-based availability: proportion of valid requests that complete successfully (defined per service—for example HTTP `2xx`/`3xx` excluding client errors you do not own, or gRPC status semantics).
- Latency: proportion of requests below a threshold, or a set of thresholds (for example “99% under 300 ms” as a separate objective from the median).
- Freshness or correctness (common in data pipelines): proportion of jobs or partitions within an acceptable lag bound.
- Durability (storage): probability of not losing committed data over a period—often harder to measure directly and therefore approximated with replication health, checksums, and audit tooling.
Whatever shape you pick, the SLI needs a clear numerator and denominator. Ambiguity breeds gaming: if “errors” are undefined, teams will exclude inconvenient status codes or blame “client misuse” inconsistently.
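The numerator/denominator discipline can be made executable. A minimal sketch, assuming hypothetical event records with a `status` code and a `client_fault` flag for the errors you have explicitly excluded:

```python
from dataclasses import dataclass

@dataclass
class RequestEvent:
    status: int         # HTTP status code (hypothetical field name)
    client_fault: bool  # a documented exclusion, e.g. validated bad input

def availability_sli(events):
    """Proportion of valid requests that completed successfully.

    Denominator (valid events): everything except documented client-fault
    exclusions. Numerator (good events): valid events returning 2xx/3xx.
    """
    valid = [e for e in events if not e.client_fault]
    if not valid:
        return 1.0  # no valid traffic: vacuously within the SLI
    good = [e for e in valid if 200 <= e.status < 400]
    return len(good) / len(valid)
```

The explicit filter is the exclusion list in executable form: anything not filtered here counts against the budget, which is exactly the ambiguity-killing property a written SLI definition should have.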
Multi-window and multi-burn-rate ideas (conceptually)
Google’s SRE material popularized multi-window, multi-burn-rate alerts: short windows catch sudden budget consumption; long windows catch slow leaks. You do not need a specific vendor to apply the idea. The point is that SLO violations are not binary day-of events; they are rates at which you consume your error budget. Alerting on burn rate—how fast you are spending relative to the budget implied by your target—surfaces problems while there is still time to react.
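The mechanics can be sketched without any vendor tooling. Window lengths and the 14.4x threshold below are illustrative, loosely following the commonly cited fast-burn pairing of a short and a long lookback:

```python
def burn_rate(error_ratio, slo_target):
    """How fast the budget is spent relative to plan: 1.0 means the
    budget lasts exactly the SLO window; higher means early exhaustion."""
    return error_ratio / (1.0 - slo_target)

def fast_burn_page(errors_5m, errors_1h, slo_target=0.999, threshold=14.4):
    """Page only when BOTH a short and a long window burn fast: the long
    window proves sustained impact, the short window proves the problem
    is still happening, so a past blip does not page anyone."""
    return (burn_rate(errors_5m, slo_target) >= threshold
            and burn_rate(errors_1h, slo_target) >= threshold)
```

A slow-leak alert is the same shape with longer windows and a lower threshold; the pair together covers sudden consumption and gradual erosion.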
SLOs: targets that stakeholders can understand
Choosing a target is a product decision
An SLO is a statement of the form: “Over a rolling or calendar window, we aim for SLI ≥ X.” For example: “99.9% of checkout API requests (excluding 4xx from bad input) return 2xx within 200 ms at the server boundary.” The number 99.9% is not magic; it implies an error budget of 0.1% of requests in the window. For one million requests, that is 1,000 failing requests—enough to matter, small enough that you must prioritize which failures count.
Setting 99.99% without understanding cost often produces false confidence: tighter targets demand more redundant architecture, more operational rigor, and more expensive observability. In freelance and advisory contexts, the useful conversation is rarely “which nine do we want?”; it is which user journeys justify which spend, and what happens when the budget is exhausted.
Rolling windows versus calendar windows
- Rolling windows (for example 30 days) keep the team continuously honest: yesterday’s outage still affects the budget tomorrow.
- Calendar windows (for example per quarter) align with planning cycles and external contracts but can encourage end-of-quarter panic or quiet neglect early in the period.
Many teams expose a rolling SLO internally while reporting calendar summaries to customers. The important part is to pick one source of truth for “are we in budget?” to avoid two conflicting stories.
What to exclude from the denominator
Excluding traffic is politically sensitive but technically necessary when certain classes of requests are not promises you make—for example synthetic canaries (count them consistently, either in or out), admin routes, or deprecated endpoints on their way out. Document exclusions in the same place engineers and PMs read the SLO; otherwise on-call will debate numbers during an incident.
Error budgets: the governance mechanism
Defining the budget
If your availability SLO is 99.9% success over 30 days, your error budget for that SLI is 0.1% failures—roughly 43.2 minutes of bad availability if failures were concentrated and the SLI were pure uptime (the familiar napkin math: 0.001 × 30 × 24 × 60 ≈ 43.2 minutes). For request-based SLIs, translate the percentage into allowed bad events in the window using traffic volume.
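That napkin math, generalized into a small helper (naming is illustrative):

```python
def error_budget(slo_target, window_days=30, requests_in_window=None):
    """Translate an SLO target into budget terms: minutes of total outage
    the window tolerates, and (if traffic volume is known) the number of
    bad requests allowed."""
    budget_fraction = 1.0 - slo_target
    outage_minutes = budget_fraction * window_days * 24 * 60
    allowed_bad = (None if requests_in_window is None
                   else budget_fraction * requests_in_window)
    return outage_minutes, allowed_bad
```

For 99.9% over 30 days this gives roughly 43.2 minutes; with a million requests in the window, about a thousand allowed failures, matching the earlier arithmetic.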
The budget is shared between feature velocity and reliability work:
- Budget healthy: aggressive launches, refactors, and dependency upgrades are appropriate within normal change management.
- Budget burning fast: freeze risky changes, focus on mitigation, root cause, and guardrails.
- Budget exhausted or negative: policy should say what happens—feature freeze, mandatory reliability sprint, or executive escalation—before the crisis, not during it.
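Written as a pre-agreed policy function (the thresholds here are illustrative; the value is that they exist before the incident, not that these exact numbers are right for your team):

```python
def change_policy(budget_remaining_fraction):
    """Map remaining error budget to a pre-agreed change policy."""
    if budget_remaining_fraction <= 0.0:
        return "freeze"      # features blocked, reliability work only
    if budget_remaining_fraction < 0.25:
        return "restricted"  # low-risk changes only, daily review
    return "normal"          # standard change management
```

A function this small is still worth writing down: it turns the governance question from "should we pause?" into "what does the policy say?".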
Policy without theater
Error budgets fail when they are metrics on a wall but product still ships arbitrary deadlines. The fix is lightweight governance: a recurring reliability review that includes the SLO chart, burn rate, top incidents, and explicit trade-offs for the next cycle. That is how SLOs connect to roadmaps instead of living only in the SRE folder.
Practical example: defining and monitoring a checkout API SLO
Suppose a checkout HTTP API sits behind a gateway. You want an SLO aligned with user-perceived success, not only TCP reachability.
1. Choose the SLI
- Good events: `POST /v1/checkout` returns `2xx` within 250 ms server-side (from your service’s metrics, not the user’s Wi-Fi), excluding `4xx` caused by invalid carts (validated by a stable error contract).
- Valid events: all `POST /v1/checkout` attempts that passed authentication and rate limiting (so abusive traffic does not silently consume the budget without engineering visibility—handle abuse as a separate objective if needed).
2. Set the target
Start with historical data: if the 30-day success ratio has been ~99.2%, jumping to 99.95% in one quarter is likely fiction. A first SLO might be 99.5% with a commitment to tighten once known defects are fixed—honest targets beat aspirational ones nobody defends.
3. Implement measurement
In Prometheus-style systems, you might track counters `checkout_requests_total` and `checkout_requests_failed_total` with consistent labels, or derive latency histograms and count requests under the SLO threshold from buckets. The exact query language matters less than consistency with how you define good vs bad events.
Example recording intent (illustrative PromQL-style expressions—adapt to your stack):
```promql
# Availability over a 30d window (pseudo-PromQL; names illustrative)
1 - (
  sum(rate(checkout_requests_failed_total[30d]))
  /
  sum(rate(checkout_requests_total[30d]))
)
```
For the latency SLO “99% < 250 ms,” you would use histogram `le` buckets rather than averaging averages—average latency hides tail violations that users feel.
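The bucket arithmetic is simple enough to sketch outside any query language. Assume cumulative counts keyed by upper bound, mirroring Prometheus-style `le` labels:

```python
def fraction_within(buckets, threshold_ms):
    """Fraction of requests at or under threshold_ms.

    `buckets` maps an upper bound in ms to the cumulative count of
    requests at or below that bound; float("inf") holds the total.
    Uses the largest bucket bound not exceeding the threshold, so the
    threshold should align with a bucket boundary to avoid undercounting.
    """
    total = buckets[float("inf")]
    eligible = [bound for bound in buckets if bound <= threshold_ms]
    if not eligible:
        return 0.0
    return buckets[max(eligible)] / total
```

The alignment caveat is the practical reason teams choose bucket boundaries to match their latency objectives up front rather than retrofitting them.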
4. Alert on burn rate, not on every blip
Configure alerts when budget burn over a short window implies you will exhaust the 30-day budget early—for example if the last hour’s failure rate, if sustained, would consume more than 14 days of budget in a few hours. That is the essence of multi-burn-rate alerting: compare observed bad rate to budget allowance.
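The arithmetic behind "this burn will exhaust the budget early" is a one-liner (window and target here are illustrative defaults):

```python
def hours_to_exhaustion(observed_error_ratio, slo_target=0.999, window_days=30):
    """Hours until the whole window's budget is gone if the current error
    ratio is sustained. A burn rate of 1.0 means exactly window_days."""
    burn = observed_error_ratio / (1.0 - slo_target)
    if burn <= 0:
        return float("inf")
    return window_days * 24 / burn
```

Sustaining a 1.44% failure rate against a 99.9% target, for example, burns a 30-day budget in about 50 hours, which is the kind of projection worth paging on.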
5. Run a prewritten policy
If burn persists for >24 hours at elevated rate, the team defaults to: no non-critical deploys, daily incident review until burn returns to normal. Writing that down removes improvisation and blame.
This pattern scales from a single Node service behind a load balancer to larger systems: the mechanics change, but SLI clarity, honest targets, and budget policy stay constant.
Trade-offs, limitations, and best practices
SLIs are only as honest as your instrumentation. If timeouts are misclassified as client errors, or if partial failures return 200 with an error payload, the SLO will lie. Contract tests and canonical HTTP semantics across gateways matter.
Composite “one number” SLOs can hide failure modes. Prefer a small set of objectives (availability, tail latency, freshness) over a single opaque score unless the score is documented and debuggable.
Dependencies: your SLO may be conditional on a vendor. That is fine—document shared fate and consider fallbacks or degraded modes as part of the error budget discussion.
People cost: maintaining SLO dashboards and quarterly reviews takes time. The return is fewer subjective arguments about whether to pause feature work; the data defines the trade-off.
Common mistakes and pitfalls
Optimizing the metric instead of the experience. If the SLI ignores mobile tail latency, teams will tune for the desktop median and still lose users.
Setting SLOs nobody would defend in an outage. If leadership would never accept 43 minutes of checkout downtime per month, do not pretend a 99.9% target exists without architectural investment.
Alert fatigue from static thresholds. Page humans on budget burn and user impact, not on every small deviation from an arbitrary graph.
Ignoring the customer-facing story. Internal SLOs should be stricter than or equal to external SLAs (legal promises), not the reverse—otherwise you can breach the contract and pay penalties while dashboards look “green” internally.
Forgetting data quality SLIs for ML-backed or async flows. Traditional uptime misses wrong answers; pair latency and availability with evaluation or business outcome checks where appropriate.
Conclusion
SLIs make reliability measurable in terms aligned with real work: successful requests, tolerable latency tails, acceptable lag. SLOs choose how good you intend to be for a defined period. Error budgets make that intention actionable: they tell you when reliability is no longer a background task but the main task.
Key takeaways:
- Ground SLIs in user journeys, not only in health checks
- Set SLOs from data and organizational willingness to invest, not from nine-counting folklore
- Use error budgets to sequence work between features and stability, and alert on burn rather than cosmetic blips
Teams that treat reliability as a product parameter—negotiable, measurable, and budgeted—ship with fewer surprises and recover faster when the world misbehaves. For background on engineering focus areas or to discuss collaboration on scalable systems, see About; for direct inquiries, Contact.