Read-your-writes consistency: replicas, routing, and session tokens
After a write, users expect to see their change immediately. Async replication breaks that illusion—here is how sticky routing, monotonic tokens, and cache discipline restore read-your-writes without giving up scale.
You ship a profile update. The API returns 200 OK. The user hits refresh—and the old email flashes back for a second before the new one appears. Support tickets say the product “does not save.” Your traces show healthy writes to the primary and reads served from a replica that is milliseconds behind. That gap is long enough to violate read-your-writes consistency: the guarantee that a session observes its own prior writes in order.
This is not an academic concern. It shows up in multi-AZ relational databases, managed Postgres read replicas, global DynamoDB or Cassandra-style deployments, and any stack that separates write and read paths for throughput. The fix is rarely “turn off replicas”; it is routing, tokens, and cache rules that align the user’s read with the data their write just touched.
The patterns below are the ones I reach for when helping teams harden production APIs where UX expectations are stronger than what “eventually consistent” quietly promises.
What “read-your-writes” means in practice
Read-your-writes (RYW) is a session-centric guarantee: for a given client session (however you define it), after a successful write W, any subsequent read R in that session should reflect W and earlier writes from the same session—not necessarily every write in the world, but this actor’s causal history from their perspective.
It sits between linearizability (strong but expensive, typically single-leader or synchronous replication) and pure eventual consistency (cheap, but users see ghosts). Most product backends target something in the middle: replicas for scale, primary for truth, and explicit rules for which reads must go where after a mutation.
Formal models (e.g., in distributed systems literature) often discuss session guarantees alongside monotonic reads (you never see time go backward within a session) and monotonic writes (your writes appear in order). RYW is the subset users feel first when it breaks.
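To make the guarantee concrete, here is a small illustrative checker (not from any library) that scans a single session's operation log and flags reads that fail to reflect that session's own latest write to the same key. It deliberately ignores concurrent writers from other sessions, which can legitimately overwrite a value.

```typescript
// Hypothetical RYW checker: a read violates the guarantee if this session
// previously wrote the key and the read observed something else.
// Simplification: interleaved writes by other sessions are not modeled.
type Op =
  | { kind: "write"; key: string; value: string }
  | { kind: "read"; key: string; observed: string | null };

function findRywViolations(log: Op[]): number[] {
  const lastWritten = new Map<string, string>(); // this session's own writes
  const violations: number[] = [];
  log.forEach((op, i) => {
    if (op.kind === "write") {
      lastWritten.set(op.key, op.value);
    } else if (lastWritten.has(op.key) && op.observed !== lastWritten.get(op.key)) {
      violations.push(i); // read did not reflect the session's prior write
    }
  });
  return violations;
}
```

The stale-email scenario from the introduction is exactly a log of the form write(email, new) followed by read(email, observed: old) — index 1 is a violation.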
Why replicas violate RYW by default
Typical cloud database setups:
- Asynchronous read replicas apply the write-ahead log after the primary commits. Lag is usually small but non-zero and spiky under load or failover.
- Multi-region replication adds network RTT; “global secondary indexes” or regional caches amplify the window.
- Application-level caches (Redis, CDN, browser) can serve older representations even when the database is already consistent.
So the bug is not “replication is broken”; it is routing a read to a stale copy without telling the stack that this read is causally dependent on a write that just landed elsewhere.
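A toy model makes the failure mode visible: the replica applies the primary's log some time after commit, so a read routed to the replica in that window sees the old state. Everything below is illustrative, not a real replication protocol.

```typescript
// Toy async replication: the primary commits locally and appends to a log;
// replicas apply that log later. Between commit and apply, replica reads
// are stale -- exactly the RYW window this article is about.
class ToyPrimary {
  data = new Map<string, string>();
  log: Array<[string, string]> = [];
  write(key: string, value: string): void {
    this.data.set(key, value);        // committed on the primary
    this.log.push([key, value]);      // not yet applied on replicas
  }
}

class ToyReplica {
  data = new Map<string, string>();
  applied = 0; // log position this replica has replayed up to
  applyFrom(primary: ToyPrimary, upTo: number): void {
    while (this.applied < upTo) {
      const [k, v] = primary.log[this.applied++];
      this.data.set(k, v);
    }
  }
}
```

A write followed immediately by a replica read returns the pre-write value (here: nothing at all); only after applyFrom catches up does the replica agree with the primary.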
Strategy 1: Route session reads to the primary after writes
The blunt instrument: after any mutating request, pin subsequent reads for that user to the writer (primary) for a short window.
Mechanics:
- Sticky sessions at the load balancer are rarely enough by themselves—they tie a user to an app instance, not to a database role.
- What you want is read/write split awareness in the data layer: a router that sends SELECT to a replica by default, but to the primary when the request carries a flag (cookie, header, or server-side session state) meaning “this session recently wrote.”
Sketch (conceptual):

```typescript
type Session = { readYourWritesUntil: number };
type DbRole = "primary" | "replica";

// After a successful mutation, mark the session (server-side store or signed cookie).
function afterWrite(session: Session) {
  session.readYourWritesUntil = Date.now() + 5_000; // tune per observed replica lag p99
}

function pickReadTarget(session: Session, operation: "default" | "strong"): DbRole {
  if (operation === "strong" || Date.now() < session.readYourWritesUntil) {
    return "primary";
  }
  return "replica";
}
```
Trade-offs:
- Pros: Simple mental model; works with any SQL primary/replica topology; no schema changes.
- Cons: Bursts of writes push more read load to the primary—you must watch primary CPU and connection limits. The window must cover p99 replication lag, not just the median, or you still lose RYW under stress.
In consulting engagements, teams often start with a 3–10 second server-side window and tighten it using observed lag metrics from the provider (e.g., replica lag in milliseconds).
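One way to replace a hand-tuned constant with measurement: derive the pin window from recent replica-lag samples. This sketch (illustrative names, naive percentile math) targets the p99 plus a safety margin, matching the advice above that the window must cover tail lag, not the median.

```typescript
// Sketch: derive the read-your-writes pin window (ms) from observed
// replica-lag samples. Falls back to a conservative default with no data.
function pinWindowMs(lagSamplesMs: number[], marginMs = 500): number {
  if (lagSamplesMs.length === 0) return 5_000; // no metrics yet: be safe
  const sorted = [...lagSamplesMs].sort((a, b) => a - b);
  // Naive p99: index at 99% of the sample count, clamped to the last element.
  const p99 = sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99))];
  return p99 + marginMs;
}
```

In production you would feed this from the provider's lag metric (e.g., CloudWatch ReplicaLag) on a rolling window rather than a static array.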
Strategy 2: Monotonic read tokens (version vectors, LSNs, “read after seq”)
Instead of sending all traffic to the primary, you pass a minimum position the read must satisfy: “do not answer from a replica that has not applied at least sequence N.”
Examples in the wild:
- Postgres: compare Log Sequence Numbers (LSNs): use pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on replicas to ensure the replica has replayed past the write’s commit LSN before serving the read.
- MySQL: SHOW REPLICA STATUS (SHOW SLAVE STATUS on older versions) fields such as Exec_Master_Log_Pos (details vary by binlog format)—same idea: wait until caught up or fall back to the primary.
- DynamoDB global tables / DAX: model-dependent; often you combine strongly consistent reads on the same key immediately after a write, or use version attributes in application logic.
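Postgres prints LSNs in the form high/low hex (e.g., 0/1A2B3C), which compares correctly only after parsing. A minimal sketch of client-side comparison, assuming you fetch the replica position yourself (e.g., SELECT pg_last_wal_replay_lsn() on the replica); Postgres can also do the arithmetic server-side with pg_wal_lsn_diff():

```typescript
// Parse a Postgres-style LSN "high/low" (hex) into a single 64-bit value
// so a router can check whether a replica's replay position has passed
// the position recorded at write time.
function parseLsn(lsn: string): bigint {
  const [hi, lo] = lsn.split("/");
  return (BigInt(`0x${hi}`) << 32n) | BigInt(`0x${lo}`);
}

function replicaHasCaughtUp(replicaReplayLsn: string, minReadLsn: string): boolean {
  return parseLsn(replicaReplayLsn) >= parseLsn(minReadLsn);
}
```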
Flow:
- Write commits on the primary; the response includes x-min-read-position: <opaque-token> (or a session store holds it).
- The read path sends that token to a router or connection pool that selects a replica only if replica_lag <= acceptable and replica_applied_position >= token. Otherwise it falls back to the primary or blocks briefly (WAIT FOR-style semantics).
Sketch:
```
POST /profile  → 200 + header X-Read-After: lsn:0/1A2B3C
GET  /profile  + header X-Read-After: lsn:0/1A2B3C
  → router waits for replica R1 until R1 >= 0/1A2B3C OR timeout → then serve
```
Trade-offs:
- Pros: Keeps most read traffic on replicas; scales better than “primary only” after writes.
- Cons: Infrastructure-specific; you need metrics and hooks from your DB layer. Blocking reads can become tail latency problems if replicas fall behind—timeouts and degradation to primary must be explicit.
This is where observability pays off: dashboards for replication lag and token wait time tell you whether your RYW implementation is cheap or accidentally synchronous.
Strategy 3: Invalidate or version the application cache
Even with perfect database routing, a Redis cache keyed by user:123:profile can return the pre-write blob. RYW requires write-through or immediate invalidation:
- On successful
PATCH /profile, delete or bump version onuser:123:profilebefore returning200. - For CDN edge caches of personalized JSON—avoid caching authenticated mutating resources aggressively; if you must, use short TTLs plus cache keys that include a profile version column updated in the same transaction as the write.
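The version-bump variant can be sketched with a Map standing in for Redis (in production the version would live in Redis or in the same transaction as the row update; all names here are illustrative):

```typescript
// Versioned cache keys: the write path bumps the version counter, so the
// next reader computes a new key and can never address the pre-write blob.
const cache = new Map<string, string>();      // stand-in for Redis GET/SET
const versions = new Map<string, number>();   // stand-in for a Redis INCR per user

function cacheKey(userId: string): string {
  const v = versions.get(userId) ?? 0;
  return `user:${userId}:profile:v${v}`;
}

function afterProfileWrite(userId: string): void {
  // Bump BEFORE returning 200, so no response races the invalidation.
  versions.set(userId, (versions.get(userId) ?? 0) + 1);
}
```

The old entry is left to expire via TTL rather than deleted, which avoids a delete/repopulate race at the cost of a little memory.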
Teams building scalable, production-ready APIs often codify this as: the cache is part of the consistency boundary, not an optimization bolted on later.
Practical example: Express + Postgres-style router
Below is a realistic skeleton (not a drop-in library): an API that records the last committed LSN for a session after a write and ensures the next read either uses a caught-up replica or the primary. Replace LSN plumbing with your driver’s primitives.
```typescript
import express from "express";

type Session = { lastWriteLsn: string | null; expiresAt: number };
const sessions = new Map<string, Session>(); // production: Redis + TTL

const app = express();
app.use(express.json()); // required so req.body is parsed

app.patch("/me", async (req, res) => {
  // 1) Transactional update on primary; capture the WAL position with the write
  const { rows } = await dbPrimary.query(
    "UPDATE profiles SET email = $1 WHERE id = $2 RETURNING pg_current_wal_lsn() AS lsn",
    [req.body.email, req.user.id]
  );
  const lsn = rows[0].lsn;
  const sid = req.sessionId;
  sessions.set(sid, { lastWriteLsn: lsn, expiresAt: Date.now() + 10_000 });
  res.setHeader("X-Min-Read-LSN", lsn);
  res.status(204).end();
});

app.get("/me", async (req, res) => {
  const sid = req.sessionId;
  const s = sessions.get(sid);
  if (!s || !s.lastWriteLsn || s.expiresAt < Date.now()) {
    return res.json(await readFromReplica(req.user.id)); // no recent write: any replica
  }
  const target = await pickNodeForLsn(s.lastWriteLsn);
  // pickNodeForLsn: choose a replica with replay >= lastWriteLsn, else primary
  return res.json(await readProfile(target, req.user.id));
});
```
Operational notes:
- Expire session hints quickly so you do not permanently bias traffic to the primary.
- Fall back on timeout: if no replica satisfies the LSN within, say, 100–300 ms, read from the primary.
- Do not trust client clocks for ordering; server-generated LSN or commit token only.
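One way to implement a pickNodeForLsn-style helper with the bounded-wait behavior described above. This is a sketch: it assumes LSNs are already parsed into comparable integers, and the replica-position getter is injected (in real code, a query against the replica) so the polling logic stays testable.

```typescript
type DbRole = "primary" | "replica";

// Poll the replica's replay position until it passes the write's position,
// then route there; past the deadline, degrade to the primary. Never block
// unboundedly -- a sick replica must not turn into unbounded read latency.
async function pickNodeForLsn(
  minLsn: bigint,
  getReplicaLsn: () => Promise<bigint>, // injected: e.g. queries the replica
  timeoutMs = 200,
  pollMs = 20
): Promise<DbRole> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if ((await getReplicaLsn()) >= minLsn) return "replica";
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return "primary"; // explicit fallback, per the operational notes above
}
```

Instrument both outcomes: the rate of primary fallbacks is your early-warning signal that replicas are falling behind.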
Common mistakes and pitfalls
Treating “read replica” as interchangeable without a rule
Pointing the ORM’s read pool at replicas without post-write routing is the most common cause of flaky UI after mutations. Tests often pass because lag is zero in CI.
Forgetting mobile and multi-tab clients
Optimistic UI may show the new state while a background refetch hits a stale replica. Either delay the refetch until the write response includes a version, or carry the X-Min-Read-* token across client retries.
Caching GET responses in the browser “for performance”
Cache-Control: private, no-store on user-specific reads is boring and correct. ETags without coupling to server-side version columns can still serve stale bodies.
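The boring-and-correct header can be enforced centrally. A minimal Express-style middleware sketch (the function name is mine, not from any framework):

```typescript
// Mark user-specific responses as uncacheable so neither the browser nor a
// shared cache can replay a stale body after a write.
function noStoreForPrivateReads(
  _req: unknown,
  res: { setHeader: (name: string, value: string) => void },
  next: () => void
): void {
  res.setHeader("Cache-Control", "private, no-store");
  next();
}
```

Mount it on the authenticated router (e.g., app.use("/me", noStoreForPrivateReads)) rather than globally, so public static assets keep their long TTLs.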
Unbounded “wait for replica”
Without a timeout and fallback, a sick replica turns read latency into unbounded waits—worse than showing slightly elevated primary load.
Ignoring cross-service writes
If the profile service writes and a search index updates asynchronously, the user may read from Postgres correctly but see stale results in search. RYW is per data product; define which surfaces must be synchronous for your UX.
Conclusion
Read-your-writes is the bridge between scalable replication and credible UX. You get it by choosing read targets intentionally after mutations—whether through a short primary bias, LSN-style tokens, or cache versioning—and by measuring lag, timeouts, and primary load as first-class metrics.
The underlying theme in freelance and consulting work is consistent: consistency is not a database checkbox, it is a whole-path contract from the write response through routing, caches, and the next read. Making that contract explicit is how teams ship reliable systems without pretending every replica is magically instantaneous.