Server-side feature flags in distributed backends: evaluation, consistency, and kill switches
How to use feature flags beyond the frontend: where to evaluate rules, how to keep behavior consistent across services, and operational patterns for safe rollouts and instant rollback.
Introduction
You finish a risky change to checkout pricing logic on a Friday. Staging looks fine. Ten minutes after deploy, support reports mismatched totals between the cart service and the payment service. You need to stop the bleeding without waiting for another full release pipeline. If the only lever you have is “revert the commit and redeploy,” you are one incident away from learning why server-side feature flags belong in the same toolbox as circuit breakers and idempotency keys.
This article is about using flags to control behavior in APIs, workers, and data paths—not just toggling UI components. The focus is on evaluation semantics, consistency across processes, and operational hygiene so flags stay a reliability tool instead of a distributed state bug factory. These patterns show up constantly when helping teams harden production systems: the goal is progressive exposure with a credible kill switch, not a second configuration system nobody trusts.
What “server-side” changes about feature flags
Client-side flags (bundled into a web or mobile app) optimize for delivery speed and experiments on presentation. They are weak at:
- Authoritative decisions — anything involving money, entitlements, or compliance should not rely on a client that can be tampered with.
- Uniform enforcement — multiple clients and internal jobs must see the same rules.
- Instant reaction — you cannot rely on users refreshing to pick up a new bundle when you need to halt a bad path now.
Server-side flags move the decision to a trusted boundary: your API gateway, application services, background workers, or database access layer. The trade-off is engineering and operational cost: you must define who evaluates, with what inputs, how often configuration updates, and how you observe the outcome.
Core concepts: flag types and when to use them
Not every toggle deserves the same machinery. A useful split:
Release toggles
Release toggles hide incomplete work behind a default-off path. They exist to integrate continuously while keeping trunk deployable. Lifetime should be short: merge, validate, remove the branch in code, delete the flag. If release toggles accumulate, every code path becomes a combinatorial test nightmare.
Ops toggles and kill switches
Ops toggles (often long-lived) gate capacity-sensitive or dependency-sensitive features: a new cache layer, a call to an external scorer, an experimental serializer. A kill switch is the extreme case: default-on in normal times, flipped off in seconds when an upstream fails or a bug is confirmed. These belong in runtime configuration with clear ownership and audit trails.
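A kill switch only works if the fallback behavior is explicit. A minimal sketch of that contract, assuming a generic runtime-config map (the `RuntimeConfig` shape, `isEnabled` helper, and flag name are illustrative, not from any particular vendor):

```typescript
// Hypothetical runtime config: a key/value map refreshed out-of-band.
type RuntimeConfig = Map<string, boolean>;

// Default-on in normal operation; if the key is missing or config never
// loaded, fall back to the conservative value the caller chose.
export function isEnabled(
  config: RuntimeConfig | undefined,
  key: string,
  fallback: boolean,
): boolean {
  if (!config || !config.has(key)) return fallback;
  return config.get(key)!;
}

// Usage: the external fraud scorer runs by default, but ops can flip it
// off in seconds without a deploy.
const config: RuntimeConfig = new Map([["external_fraud_scorer", false]]);
if (isEnabled(config, "external_fraud_scorer", true)) {
  // call the scorer
}
```

The important design choice is that every call site states its own conservative fallback, so a missing key never silently enables a risky path.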
Experiment flags
Experiment flags assign users to variants for measurement. They need stable assignment (the same subject should not flip variants mid-session unless you explicitly allow it), and they need telemetry tied to the variant key. Running experiments across microservices without a shared assignment function is a common source of “the dashboard says variant B won, but revenue did not move” confusion.
Where to evaluate: gateway, service, or library
You can evaluate flags in more than one place; the choice is about latency, blast radius, and coupling.
API gateway or edge
Evaluating at the gateway keeps services simpler and centralizes policy (“this route is disabled for tenant X”). Downsides: the gateway needs rich context (tenant, plan, region) and can become a god object if every product rule lives there.
Inside each service
Evaluating inside the service that owns the behavior keeps domain rules near domain code. The downside is duplication risk: two services might interpret the flag `new_pricing_v2` differently unless you share a library or contract.
Shared evaluation library
A small evaluation SDK (same semver’d package everywhere) with a typed flag catalog reduces drift. In consulting engagements, the teams that skip this step almost always ship a subtle bug where service A checks a string name with a typo and service B checks the correct spelling—both compile, both run, both disagree.
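One way to make that drift a compile error instead of a runtime disagreement is a typed flag catalog exported from the shared package. A sketch, assuming a hypothetical catalog (the flag names, defaults, and owners here are illustrative):

```typescript
// Hypothetical typed flag catalog, published as a semver'd package that
// every service imports.
export const FLAGS = {
  newPricingV2: { default: false, owner: "checkout-team" },
  externalFraudScorer: { default: true, owner: "risk-team" },
} as const;

// The only legal flag names are the catalog keys.
export type FlagKey = keyof typeof FLAGS;

// A typo like flagDefault("newPricngV2") fails type-checking in every
// consuming service instead of silently evaluating to false in one of them.
export function flagDefault(key: FlagKey): boolean {
  return FLAGS[key].default;
}
```

The `as const` assertion freezes the catalog into literal types, so adding or removing a flag is a visible API change for every consumer.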
The hard part: consistency across distributed calls
The painful incidents rarely come from “flag on vs off” in a monolith. They come from two hops in one request seeing different values because:
- Propagation delay — one pod refreshed config 30 seconds before another.
- Different inputs — one service passes `user_id`, another passes `account_id`, and the targeting rule keys off the wrong identifier.
- Caching layers — a CDN or API cache serves a response computed under an old flag snapshot.
Same request, same snapshot
For a single logical operation that spans services (checkout, provisioning, fraud review), prefer a single evaluation snapshot carried on the context:
- Generate a `flags_revision` or `flags_payload` at the entry point (BFF or first service).
- Propagate it through internal headers or a request-scoped context object.
- Downstream services must not re-fetch live flag state independently for decisions that must align.
This is analogous to read-your-writes discipline: you are choosing coherent staleness over fresh inconsistency.
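As a minimal sketch of carrying the snapshot on an internal header — the header name, encoding, and `FlagSnapshot` shape are assumptions for illustration, not a standard:

```typescript
type FlagSnapshot = Record<string, boolean>;

// Hypothetical internal header; pick one name and enforce it in the shared SDK.
const SNAPSHOT_HEADER = "x-internal-flag-snapshot";

// Entry point: serialize the snapshot exactly once per logical request.
export function toHeader(snapshot: FlagSnapshot): string {
  return Buffer.from(JSON.stringify(snapshot)).toString("base64url");
}

export function attachSnapshot(
  headers: Record<string, string>,
  snapshot: FlagSnapshot,
): void {
  headers[SNAPSHOT_HEADER] = toHeader(snapshot);
}

// Downstream services: decode the carried snapshot instead of re-fetching
// live flag state for decisions that must align.
export function fromHeader(value: string | undefined): FlagSnapshot | null {
  if (!value) return null;
  return JSON.parse(Buffer.from(value, "base64url").toString("utf8"));
}
```

In practice you might sign or size-limit the payload, or carry only a revision ID and resolve it from a local cache, but the invariant is the same: one evaluation per logical operation.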
When independent evaluation is acceptable
Independent refresh is fine when services are loosely coupled and inconsistency is product-tolerable—for example, a recommendations service can lag a marketing banner service without corrupting data.
Storage and delivery: how flags reach your processes
Common patterns, from simplest to most dynamic:
- Environment variables / static config — good for kill switches with rare changes and clear deploy coupling. Bad for frequent experiments; every flip requires a rollout.
- File or sidecar sync — periodic pull from object storage or a config repo. Predictable, but watch propagation time across regions.
- Hosted flag service with streaming or polling — best for product and growth teams that change rules often. You inherit availability dependencies: if the vendor is down, decide whether you fail closed (conservative defaults) or keep serving the last-known-good values.
Whatever you pick, document the failure mode: “If flag resolution fails, do we disable the risky path or keep serving the last known state?” Wrong defaults turn a vendor blip into a security or revenue event.
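That documented failure mode can live in code rather than a wiki. A sketch, where `fetchFlags` stands in for a vendor SDK or config-store call (an assumption, not a real API):

```typescript
type FlagSnapshot = Record<string, boolean>;

// Last successfully fetched snapshot, kept in process memory.
let lastKnownGood: FlagSnapshot | null = null;

export function resolveFlags(
  fetchFlags: () => FlagSnapshot, // hypothetical vendor/config-store call
  conservativeDefaults: FlagSnapshot,
): FlagSnapshot {
  try {
    // Happy path: refresh and remember the result.
    lastKnownGood = fetchFlags();
    return lastKnownGood;
  } catch {
    // Vendor down: prefer coherent staleness (last known good), and only
    // fail closed into the conservative defaults if we never fetched at all.
    return lastKnownGood ?? conservativeDefaults;
  }
}
```

Encoding the policy this way means an outage exercises a path you wrote and tested on purpose, rather than whatever the SDK happens to default to.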
Practical example: typed flags, stable bucketing, and a request snapshot
The following sketch shows a minimal server-side pattern in TypeScript: a deterministic hash bucket for percentage rollouts, a snapshot attached to the request, and a single function used by multiple modules so naming cannot drift.
```typescript
import { createHash } from "node:crypto";

export type FlagSnapshot = Record<string, boolean>;

const FLAG_VERSION = "2026-04-18T00:00:00Z"; // bump when rule semantics change

// Deterministic bucket in [0, 1]: the same subject and salt always land in
// the same place, so raising a percentage only adds subjects, never reshuffles.
function stableBucket(subjectId: string, salt: string): number {
  const h = createHash("sha256").update(`${salt}:${subjectId}`).digest();
  return h.readUInt32BE(0) / 0xffffffff;
}

export function buildFlagSnapshot(input: {
  tenantId: string;
  userId: string;
  /** 0–1 fraction of tenants that should see `newCheckoutFlow` */
  newCheckoutRollout: number;
}): FlagSnapshot {
  const tenantBucket = stableBucket(input.tenantId, `newCheckoutFlow:${FLAG_VERSION}`);
  return {
    newCheckoutFlow: tenantBucket < input.newCheckoutRollout,
  };
}

export function priceCart(
  snapshot: FlagSnapshot,
  lineItems: { sku: string; qty: number }[],
): { total: number; engine: "legacy" | "v2" } {
  if (snapshot.newCheckoutFlow) {
    return { total: computeV2(lineItems), engine: "v2" };
  }
  return { total: computeLegacy(lineItems), engine: "legacy" };
}

// Stubs standing in for the real pricing engines.
function computeLegacy(_items: { sku: string; qty: number }[]): number {
  return 0;
}

function computeV2(_items: { sku: string; qty: number }[]): number {
  return 0;
}
```
In a real service mesh, `buildFlagSnapshot` would run once at the edge; internal calls would pass the snapshot (or a signed token encoding it) so payment and cart never disagree for the same user action. Rollout parameters would come from your flag store; the hashing contract stays in code so assignment remains stable when you adjust percentages slowly.
Observability and accountability
Flags without telemetry are blind toggles. At minimum:
- Log `flag_key`, `variant`, and `evaluation_reason` on critical paths (PII-safe fields only).
- Metric counters per variant for latency, errors, and business KPIs you are protecting.
- Audit who changed a flag, when, and in which environment—especially for financial or data export gates.
When partnering with teams on launch readiness, the difference between a stressful Friday and a boring one is often a dashboard that answers “what percentage is on v2, and is v2 erroring more than v1?” in one glance.
Common mistakes and pitfalls
Permanent “temporary” flags
If every shortcut becomes a long-lived branch in production, test matrices explode and engineers stop trusting CI. Enforce TTLs or backlog items to delete flags after the window you promised.
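One way to enforce those TTLs mechanically is a removal-date registry checked in CI, so an overdue flag fails the build instead of quietly aging. A sketch under the assumption that you keep such a registry in the shared flag package (the names and dates are illustrative):

```typescript
// Hypothetical registry: every release toggle declares its promised
// removal date when it is created.
const FLAG_TTLS: Record<string, string> = {
  newCheckoutFlow: "2026-07-01",
};

// Returns the flags whose removal deadline has passed; a CI test asserts
// this list is empty, turning "we'll delete it later" into a failing build.
export function expiredFlags(now: Date): string[] {
  return Object.entries(FLAG_TTLS)
    .filter(([, deadline]) => now > new Date(deadline))
    .map(([key]) => key);
}
```

The deadline can be renegotiated, but renewing it becomes an explicit code change with an owner, not silent drift.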
Inconsistent subjects for targeting
Using `email` in one service and `user_id` in another breaks cohort analysis and can split the same person across variants. Pick stable primary keys and treat them as part of the flag contract.
Percentage rollouts without sticky assignment
Re-evaluating a random draw on every request jitters users between experiences and corrupts experiments. Hash bucketing (as in the example) or server-side sticky assignment tables fix that.
Cache poisoning
If you cache HTTP responses that depend on flags, the cache key must include the flag dimensions that influenced the body. Otherwise you will serve user A’s variant to user B.
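A minimal sketch of a flag-aware cache key, assuming each route declares which flags influence its response body (the key format and helper names are illustrative):

```typescript
import { createHash } from "node:crypto";

type FlagSnapshot = Record<string, boolean>;

// Include only the flags that actually shaped the response: hashing the
// whole snapshot would needlessly fragment the cache when unrelated flags flip.
export function cacheKey(
  route: string,
  snapshot: FlagSnapshot,
  influencingFlags: string[],
): string {
  // Copy before sorting so the caller's array is not mutated, and sort so
  // the key does not depend on declaration order.
  const dims = [...influencingFlags]
    .sort()
    .map((f) => `${f}=${snapshot[f] ?? false}`)
    .join("&");
  return `${route}#${createHash("sha256").update(dims).digest("hex").slice(0, 16)}`;
}
```

With this shape, two users in different variants can never collide on one cached entry, while a flag that does not affect the route leaves its hit rate untouched.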
Silent defaults on provider failure
Failing open into the new path during an outage of your flag vendor is how you accidentally enable untested behavior under load. Prefer last-known-good with bounds, or fail closed into the conservative path when risk is asymmetric.
Conclusion
Server-side feature flags are control-plane tooling for production backends: they let you separate deploy from exposure, drain risk gradually, and cut off bad behavior without a full redeploy—when evaluation is consistent, typed, and observable. The implementation details (gateway vs service, push vs pull, vendor vs in-house) matter less than the invariants: stable assignment for experiments, shared snapshots for multi-step operations, and discipline about deleting toggles once they have done their job.
If you are designing a platform layer or unifying how several services launch risky changes, the about page outlines the kind of backend and architecture work I take on. For a concrete conversation about your stack, use the contact form.