Version skew in rolling deployments: backward compatibility beyond the database

Rolling deploys run old and new code side by side. Harden APIs, queues, caches, and flags—not only SQL—with additive contracts, phased payload changes, and skew-aware tests.

Author: Matheus Palma · 8 min read
Software engineering, Backend, Kubernetes, Reliability, API design, Architecture

You finish a careful expand/contract migration, run the suite green, and ship. Ten minutes into the rollout, support pastes a stack trace: version N+1 writes a JSON field your mobile client on version N never heard of, and the app crashes on decode. Another pod, still on N, consumes a message produced by N+1 and drops it as “invalid.” Neither failure is visible in migration logs—the schema is fine. The failure is version skew: two builds of your system legitimately coexist during a rolling deploy, and their implicit contracts drifted.

This article treats version skew as a first-class production problem: not an edge case you “should not have,” but the normal state of any continuously delivered backend. The goal is predictable behavior when traffic, queue consumers, cron jobs, and admin scripts span multiple builds. The patterns below show up repeatedly when helping teams harden APIs and worker fleets for scale; they complement database tactics with application-level compatibility you can test before merge.

What version skew actually is

Version skew is the overlap window where different binaries (or different configuration revisions) handle related work: HTTP requests, WebSocket sessions, queue messages, scheduled jobs, or cache entries. Skew is guaranteed in:

  • Kubernetes rolling updates — old pods drain while new ones take traffic.
  • Blue/green or canary — a slice of users or traffic hits the new build first.
  • Long-lived clients — browsers and native apps ship on their own cadence; your API must tolerate older parsers.
  • Asynchronous pipelines — a producer on N+1 enqueues work consumed seconds later by N, or the reverse.

Skew is not the same as “eventual consistency” in storage, though the two interact. It is specifically about code and configuration disagreeing on message shapes, defaults, error semantics, or feature behavior while both are authoritative for some slice of traffic.

Why migrations alone are insufficient

Schema migrations govern what the database accepts. They do not automatically govern:

  • JSON field presence and strict client decoders.
  • Protobuf or Avro evolution rules across language boundaries.
  • Side effects (emails sent twice, webhooks fired from both versions).
  • In-memory caches populated by one version and read by another.

Production systems need compatibility contracts at each boundary, not only at the SQL layer.

Compatibility layers: where contracts live

Think in terms of edges—anything that crosses a process or time boundary:

| Edge | Typical skew symptom | Contract lever |
|---|---|---|
| HTTP/JSON public API | Unknown fields crash strict models | Tolerant readers, additive changes, documented unknown handling |
| Internal gRPC/Protobuf | Decode errors on new oneofs | Field numbers, optional, deprecation policy |
| Message queues | Poison messages after deploy | Schema versioning, optional payloads, DLQ + replay |
| Shared cache keys | Stale shape interpreted wrong | Versioned key prefixes, TTL, explicit invalidation |
| Feature flags | Divergent behavior per pod | Consistent evaluation inputs, server-side defaults |
| Background jobs | Job type unknown on old worker | Forward-compatible job envelopes, delayed rollout of consumers |

Each row is a place where freelance and consulting engagements often surface “mystery” incidents: metrics look healthy, databases are clean, yet a fraction of requests fail until the rollout completes.

Forward and backward compatibility: precise meanings

These terms are often used loosely. For rolling deploys, use this shorthand:

  • Backward compatible (new code reading old data/messages): N+1 must accept everything N produced. Example: N+1’s deserializer supplies defaults for fields N never wrote; N+1’s consumer handles old job schema revision v1.
  • Forward compatible (old code reading new data/messages): N must tolerate artifacts from N+1 that already leaked into the system. Example: N ignores unknown JSON keys; N’s worker treats an unrecognized job type as a no-op or defers it rather than crashing (often paired with routing only new types to N+1; see pitfalls).

Symmetric compatibility (both directions) is ideal for queues and shared caches; backward-only is often enough for request/response APIs if only the server upgrades during a single request’s lifetime.

Practical patterns for HTTP and JSON APIs

Tolerant readers and additive changes

Servers and clients should apply Postel’s robustness principle selectively: accept unknown fields, emit a stable core. In TypeScript with Zod, z.object({ ... }).passthrough() or an explicit .catchall(z.unknown()) encodes “we may add keys later”; avoid .strict() on wire-facing schemas, because it rejects unknown keys. Mobile and web bundles shipped last week must not assume a closed world.
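A tolerant reader does not need a validation library. The following is a minimal hand-rolled sketch, assuming a hypothetical UserProfile payload with id and displayName fields (neither name comes from a real API): it validates only the fields this client depends on and drops everything else.

```typescript
// Hypothetical payload shape: only the fields this client depends on.
interface UserProfile {
  id: string;
  displayName: string;
}

// Tolerant reader: validate the stable core, ignore keys we do not know.
function parseUserProfile(raw: unknown): UserProfile {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("profile must be a JSON object");
  }
  const obj = raw as Record<string, unknown>;
  const id = obj["id"];
  const displayName = obj["displayName"];
  if (typeof id !== "string" || typeof displayName !== "string") {
    throw new Error("profile is missing required core fields");
  }
  // Unknown keys (for example a field added by N+1) are dropped, not rejected.
  return { id, displayName };
}

// A response from N+1 carrying a field this build has never seen.
const fromNewServer: unknown = JSON.parse(
  '{"id":"u1","displayName":"Ada","pronouns":"she/her"}'
);
const parsedProfile = parseUserProfile(fromNewServer);
```

The reader still fails on a missing core field, so tolerance toward additions does not weaken the checks the client actually relies on.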

For responses, additive changes are safe: new optional fields, new enum values documented as “clients must ignore unknown values.” Semantic changes to existing fields are breaking even if the key name stays the same—treat them like new resources or versioned media types.
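The "ignore unknown enum values" rule can be made concrete on the client side. This is a sketch under assumptions: the order statuses below are hypothetical, and mapping to an "unknown" bucket is one possible policy, not the only one.

```typescript
// Hypothetical statuses known to this client build.
const KNOWN_STATUSES = ["pending", "paid", "refunded"] as const;
type KnownStatus = (typeof KNOWN_STATUSES)[number];
type OrderStatus = KnownStatus | "unknown";

// Map any value this build does not recognize to an explicit fallback
// instead of throwing, so a status added by N+1 cannot crash N.
function parseOrderStatus(raw: unknown): OrderStatus {
  const match = KNOWN_STATUSES.find((status) => status === raw);
  return match ?? "unknown";
}
```

UI code can then render a neutral label for "unknown" while the rest of the order still displays.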

Defaults and nullability

A common regression: N+1 starts sending an explicit null for a field the server used to omit when falsy; clients interpreted “missing” as false and now break on the null. Document defaulting rules in your OpenAPI or JSON Schema and test three states: missing, null, and set.
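The three states can be pinned down in a small reader. This sketch assumes a hypothetical boolean field marketingOptIn and an assumed defaulting rule (missing and null both mean false); substitute whatever rule your schema documents.

```typescript
type FieldState = "missing" | "null" | "set";

// Distinguish the three wire states explicitly so tests can cover each one.
function classifyOptIn(payload: Record<string, unknown>): FieldState {
  if (!("marketingOptIn" in payload)) return "missing";
  return payload["marketingOptIn"] === null ? "null" : "set";
}

// Assumed rule for this sketch: missing and null both default to false.
function readOptIn(payload: Record<string, unknown>): boolean {
  const value = payload["marketingOptIn"];
  if (value === undefined || value === null) return false;
  if (typeof value !== "boolean") {
    throw new Error("marketingOptIn must be a boolean when present");
  }
  return value;
}
```

A client written this way behaves the same whether the server omits the field or sends null, which is exactly the change that tends to slip through during a rollout.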

Contract tests at the boundary

Consumer-driven contract tests (for example Pact) catch “N+1 response no longer satisfies N mobile” before production. They do not replace skew thinking—they encode it as executable expectations.

Practical patterns for asynchronous work

Envelope with schema revision

Every message carries a revision or schemaVersion integer. Handlers switch on it:

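// One variant per schema revision; v2 adds an optional locale.
type JobEnvelope =
  | { schemaVersion: 1; type: "send-email"; to: string; templateId: string }
  | { schemaVersion: 2; type: "send-email"; to: string; templateId: string; locale?: string };

// Assumed to exist elsewhere; declared here only so the snippet type-checks.
declare function dispatchEmail(email: { to: string; templateId: string; locale?: string }): void;

function handleSendEmail(job: JobEnvelope): void {
  switch (job.schemaVersion) {
    case 1:
      // v1 has no locale field, so fall back to "en".
      dispatchEmail({ ...job, locale: "en" });
      break;
    case 2:
      dispatchEmail(job);
      break;
    default:
      // A revision this build does not know: fail loudly rather than guess.
      throw new Error(`Unsupported send-email schemaVersion: ${(job as JobEnvelope).schemaVersion}`);
  }
}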

Deploy consumers that understand v2 before producers emit v2, or use a two-phase flag: enable v2 parsing everywhere, then enable v2 emission. Order matters.
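The two-phase flag on the producer side might look like the following sketch. The flag name emitSendEmailV2 and the EmissionFlags shape are hypothetical; the point is only that the binary able to emit v2 ships first and the emission itself is switched on later.

```typescript
type SendEmailJobV1 = { schemaVersion: 1; type: "send-email"; to: string; templateId: string };
type SendEmailJobV2 = {
  schemaVersion: 2;
  type: "send-email";
  to: string;
  templateId: string;
  locale?: string;
};

// Hypothetical flag, flipped only after every consumer can parse v2.
interface EmissionFlags {
  emitSendEmailV2: boolean;
}

function buildSendEmailJob(
  flags: EmissionFlags,
  to: string,
  templateId: string,
  locale?: string
): SendEmailJobV1 | SendEmailJobV2 {
  if (flags.emitSendEmailV2) {
    const job: SendEmailJobV2 = { schemaVersion: 2, type: "send-email", to, templateId };
    if (locale !== undefined) job.locale = locale;
    return job;
  }
  // Phase 1: the new build is deployed but still emits the shape old consumers know.
  return { schemaVersion: 1, type: "send-email", to, templateId };
}
```

With the flag off, the locale argument is dropped on purpose: old consumers would default it anyway, and the payload stays byte-compatible with what N produces.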

Unknown job types on old workers

When N receives a job type introduced in N+1, crashing poisons the queue. Safer interim behaviors:

  • Negative acknowledgment with delay if your broker supports it—gives time for N pods to finish draining.
  • Explicit “defer” queue visible in dashboards—requires operational runbook.
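Either interim behavior can be modeled as an explicit handler outcome. The sketch below is broker-agnostic: the JobOutcome type, the 30-second delay, and the in-memory counter are assumptions for illustration, and mapping the outcome to your broker's requeue or delay call is left to the adapter.

```typescript
// Hypothetical outcome type; an adapter maps it to the broker's ack/requeue calls.
type JobOutcome =
  | { kind: "ack" }
  | { kind: "defer"; delayMs: number; reason: string };

const KNOWN_JOB_TYPES = new Set<string>(["send-email"]);

// In-memory stand-in for a real metrics counter.
const unknownTypeCounts = new Map<string, number>();

function handleJob(job: { type: string }, deferDelayMs = 30_000): JobOutcome {
  if (!KNOWN_JOB_TYPES.has(job.type)) {
    // Count and defer instead of crashing or silently dropping the message.
    unknownTypeCounts.set(job.type, (unknownTypeCounts.get(job.type) ?? 0) + 1);
    return { kind: "defer", delayMs: deferDelayMs, reason: `unknown job type: ${job.type}` };
  }
  // Known types would be processed here.
  return { kind: "ack" };
}
```

Counting unknown types gives the dashboard signal the runbook needs: a nonzero count during a rollout is expected, a nonzero count afterwards is a bug.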

The best fix is routing: new job types go to a queue only N+1 workers consume. That needs infra discipline but removes ambiguity.
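On the producer side, routing can be as small as a lookup. The queue names and the job type below are hypothetical, and the sketch assumes that only N+1 workers subscribe to the second queue.

```typescript
// Hypothetical queue names; only N+1 workers are assumed to consume NEXT_QUEUE.
const LEGACY_QUEUE = "jobs";
const NEXT_QUEUE = "jobs-next";

// Job types that did not exist in version N.
const TYPES_INTRODUCED_IN_NEXT = new Set<string>(["generate-invoice-pdf"]);

function queueFor(jobType: string): string {
  return TYPES_INTRODUCED_IN_NEXT.has(jobType) ? NEXT_QUEUE : LEGACY_QUEUE;
}
```

Once the rollout completes and no N workers remain, the set can be emptied and the second queue retired.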

Feature flags and configuration skew

Server-side feature flags can diverge across pods if evaluation depends on local cache refreshed at different times, or if the flag SDK initializes before the latest config sync. Under skew, two replicas may disagree on whether a code path is active—worse than all-old or all-new.

Mitigations:

  • Centralized evaluation with short TTL and stamped config version in logs.
  • Defaults that favor safe behavior when the flag service is unavailable (often “off” for risky paths—document the choice).
  • Kill switches implemented as configuration, not only as code paths, with the same consistency concerns.
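The first two mitigations can be combined in one evaluation function. This is a sketch, not a real flag SDK: FlagSnapshot, SAFE_DEFAULTS, and the flag name are hypothetical, and the TTL is passed in by the caller.

```typescript
interface FlagSnapshot {
  configVersion: number; // revision of the flag configuration
  fetchedAtMs: number;   // when this replica last synced
  flags: Record<string, boolean>;
}

interface FlagDecision {
  enabled: boolean;
  configVersion: number | null;
  source: "snapshot" | "safe-default";
}

// Documented fallbacks used when the snapshot is missing or stale.
const SAFE_DEFAULTS: Record<string, boolean> = { "new-checkout-flow": false };

function evaluateFlag(
  flagName: string,
  snapshot: FlagSnapshot | null,
  nowMs: number,
  ttlMs: number
): FlagDecision {
  const configVersion = snapshot === null ? null : snapshot.configVersion;
  if (snapshot !== null && nowMs - snapshot.fetchedAtMs <= ttlMs) {
    const value = snapshot.flags[flagName];
    if (typeof value === "boolean") {
      return { enabled: value, configVersion, source: "snapshot" };
    }
  }
  // Stale or missing config: fall back to the safe default ("off" if undocumented).
  return { enabled: SAFE_DEFAULTS[flagName] === true, configVersion, source: "safe-default" };
}

// Stamp the config version into logs so divergent replicas are easy to spot.
function formatFlagLog(flagName: string, decision: FlagDecision): string {
  return `flag=${flagName} enabled=${decision.enabled} configVersion=${decision.configVersion} source=${decision.source}`;
}
```

Grouping log lines by configVersion during a rollout shows directly whether two replicas disagreed because of code or because of config age.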

Practical example: safe rollout of a renamed payment field

You rename amountCents to amount_minor in an internal JSON event consumed by a reporting worker.

Phase 1 — dual write in producer (N+1):

interface PaymentCapturedEventV1 {
  event: "payment_captured";
  amountCents: number;
}

interface PaymentCapturedEventV2 extends PaymentCapturedEventV1 {
  amount_minor: number;
}

function emitPaymentCaptured(amountMinor: number): PaymentCapturedEventV2 {
  return {
    event: "payment_captured",
    amountCents: amountMinor,
    amount_minor: amountMinor,
  };
}

Phase 2 — workers accept both (deploy to all replicas):

function amountMinorFromEvent(raw: Record<string, unknown>): number {
  if (typeof raw.amount_minor === "number") return raw.amount_minor;
  if (typeof raw.amountCents === "number") return raw.amountCents;
  throw new Error("payment_captured missing amount field");
}

Phase 3 — stop emitting amountCents after all workers and downstream archives are upgraded.

This is the same expand/contract rhythm as SQL, applied to event payloads. Skew windows are harmless because in every phase each in-flight event carries a field that every running replica can read.

Verification: how to test skew before it hits users

  1. Dual-version integration tests — run the current artifact against fixtures produced by main and against fixtures from your feature branch; assert parsers and handlers.
  2. Traffic replay — sample production traffic (sanitized) through both builds in a staging mesh.
  3. Synthetic canaries — after deploy, exercise critical paths from outside the cluster before widening traffic (pairs naturally with SLO/error budget practice).
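A dual-version test for the payment event can be a handful of fixtures, one per producer phase, run through the parser of the build under test. The fixture labels and the amount are made up for illustration; amountMinorFromEvent mirrors the Phase 2 worker function from the example above.

```typescript
// Same logic as the Phase 2 worker parser in the payment example.
function amountMinorFromEvent(raw: Record<string, unknown>): number {
  const amountMinor = raw["amount_minor"];
  const amountCents = raw["amountCents"];
  if (typeof amountMinor === "number") return amountMinor;
  if (typeof amountCents === "number") return amountCents;
  throw new Error("payment_captured missing amount field");
}

// One recorded payload per producer phase (hypothetical fixtures).
const paymentFixtures = [
  { producedBy: "N (before rename)", json: '{"event":"payment_captured","amountCents":1999}' },
  { producedBy: "N+1 (dual write)", json: '{"event":"payment_captured","amountCents":1999,"amount_minor":1999}' },
  { producedBy: "after contract", json: '{"event":"payment_captured","amount_minor":1999}' },
];

// The build under test must read the same amount from every shape.
const parsedAmounts = paymentFixtures.map((fixture) =>
  amountMinorFromEvent(JSON.parse(fixture.json) as Record<string, unknown>)
);
```

In a real suite the fixtures would be generated by the other branch's serializer rather than written by hand, so that the test fails when the producer changes shape.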

For teams building production-ready platforms, these checks are cheaper than incident bridges and customer apologies.

Common mistakes and pitfalls

Assuming “deploy is fast enough”

Even a thirty-second skew at high QPS means thousands of cross-version interactions. Job runtime can extend skew to hours if a worker started on N processes a message enqueued under N+1.

Forward compatibility without routing

Letting N silently ignore unknown job types feels safe but can drop work. Prefer explicit deferral, metrics on unknown types, and queue separation.

Strict deserialization everywhere

Fail-fast is good for internal invariants, but at the wire boundary it becomes an outage multiplier. Reserve strictness for authenticated internal modules after normalization.

Coupling feature launch to binary rollout

A flag that flips globally the moment N+1 starts rolling creates half-on/half-off clusters. Prefer staged flag changes orthogonal to pod replacement when the behavior is user-visible.

Shared cache without key versioning

If N+1 writes a richer object into Redis and N reads it with an older parser, you get subtle corruption or exceptions. Version the key (user:123:v2) or use a shared schema revision in the payload.
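Both options fit in a few lines. In this sketch the constant CACHE_SCHEMA_VERSION and the CachedUser envelope are hypothetical, and the raw string stands in for whatever your cache client returns; a revision mismatch is treated as a cache miss so the caller reloads from the source of truth.

```typescript
// Bumped whenever the cached shape changes in a way older readers cannot parse.
const CACHE_SCHEMA_VERSION = 2;

// Option 1: the revision is part of the key, so N and N+1 never share entries.
function userCacheKey(userId: string): string {
  return `user:${userId}:v${CACHE_SCHEMA_VERSION}`;
}

// Option 2: the revision travels inside the payload.
interface CachedUser {
  schemaVersion: number;
  data: unknown;
}

// Returns the cached data, or null when the entry is absent or from another revision.
function readCachedUser(rawJson: string | null): unknown {
  if (rawJson === null) return null;
  const parsed: unknown = JSON.parse(rawJson);
  if (typeof parsed !== "object" || parsed === null) return null;
  const entry = parsed as Partial<CachedUser>;
  if (entry.schemaVersion !== CACHE_SCHEMA_VERSION) return null; // treat as a miss
  return entry.data ?? null;
}
```

Key versioning costs extra memory during the rollout because both revisions are cached side by side; the in-payload revision avoids that but makes the two builds overwrite each other's entries.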

Conclusion

Version skew is the default during continuous delivery: multiple builds share traffic, queues, and caches. Treating compatibility as schema plus every serialized boundary—HTTP, events, cache entries, and flags—turns rolling deploys from a recurring source of surprises into a boring operational routine. Pair additive API evolution with explicit two-phase rollouts for payloads, centralize flag evaluation where it matters, and verify with dual-version tests and canaries.

