HTTP health checks in production: liveness, readiness, and honest dependency probes

Separate liveness from readiness, model dependency health without lying to the orchestrator, and wire Kubernetes HTTP probes so Node.js services restart and route traffic predictably.

By Matheus Palma · about 6 min read
Software engineering · Backend · Kubernetes · Node.js · Site reliability engineering · HTTP

Your service returns 200 OK from /health, yet users see timeouts and the dashboard shows cascading failures. In the worst case, Kubernetes keeps the pod running because liveness passes while the load balancer keeps sending traffic because readiness also passes—both checks are green while the app is effectively down. That mismatch usually comes from treating “health” as a single boolean instead of two different contracts with the scheduler and with traffic.

This article breaks down liveness versus readiness, how to structure HTTP probes for Node-style HTTP servers, and why “deep” checks belong in the right place—or not at all. These patterns show up in almost every engagement where teams move from a single VM to orchestrated workloads: the goal is predictable restarts, routing, and operator signals, not a vanity metric endpoint.

Why two probes exist

Kubernetes (and similar platforms) ask different questions at different times:

  • Liveness: “Should this process be killed and restarted?” If liveness fails, the container is restarted. Use this only for unrecoverable states where a restart is cheaper than limping along (deadlocks, corrupted in-process state).
  • Readiness: “Should this instance receive traffic?” If readiness fails, endpoints are removed from Service load balancing. The process keeps running so it can warm caches, drain work, or wait for dependencies.

If you merge both into one endpoint that fails whenever Redis blips, you risk restart loops (liveness) or unnecessary traffic flaps (readiness), depending on which probe you wired incorrectly.

Startup probes matter too

Slow-starting JVMs and large Node bundles can take longer to boot than the liveness probe's default settings allow. A startup probe suspends liveness and readiness checks until it first succeeds, which gives the process time to finish bootstrapping. Without it, Kubernetes may restart a pod that was simply still initializing.
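
As a rough worked example of the startup budget: the kubelet allows a container about failureThreshold × periodSeconds (plus any initialDelaySeconds) to pass its startup probe before restarting it. The helper below is a hypothetical illustration, not a Kubernetes API; only the field names mirror the probe spec.

```typescript
// Hypothetical helper: field names mirror the Kubernetes probe spec,
// but this type and function are illustrative only.
type ProbeTiming = {
  initialDelaySeconds?: number;
  periodSeconds: number;
  failureThreshold: number;
};

// Approximate time a container has to pass its startup probe
// before the kubelet gives up and restarts it.
function startupBudgetSeconds(p: ProbeTiming): number {
  return (p.initialDelaySeconds ?? 0) + p.periodSeconds * p.failureThreshold;
}

const budget = startupBudgetSeconds({ periodSeconds: 2, failureThreshold: 30 });
console.log(`startup budget: ${budget}s`); // 0 + 2 * 30 = 60
```

If the slowest observed cold start is close to that number, raising failureThreshold is usually safer than stretching the liveness probe itself.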

Designing the HTTP surface

A minimal, honest layout:

Route            Purpose
/health/live     Cheap: event loop responsive, process not wedged
/health/ready    Can this replica safely take traffic right now?

Some teams add /health/startup or reuse /health/live with a startup probe configuration that allows longer initialDelaySeconds and higher failureThreshold.

What belongs on liveness

Liveness must be fast and local. Typical checks:

  • Process is up and the HTTP server accepts connections.
  • A trivial in-process counter or watchdog proves the event loop is not starved (optional but useful under CPU pressure).

Do not put downstream dependency checks on liveness. If PostgreSQL is slow, restarting your API pod does not fix the database; it adds churn and can amplify outages.
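
The optional watchdog from the list above can be sketched as a heartbeat timer plus a staleness check. This is a minimal sketch: the class name, the 500 ms interval, and the 2 s lag allowance are all assumptions, and the clock is injectable only to make the logic easy to test.

```typescript
// Illustrative watchdog: names and thresholds are assumptions, not a library API.
class LoopWatchdog {
  private lastBeatMs: number;
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly intervalMs: number;
  private readonly maxLagMs: number;
  private readonly now: () => number;

  constructor(intervalMs = 500, maxLagMs = 2_000, now: () => number = Date.now) {
    this.intervalMs = intervalMs;
    this.maxLagMs = maxLagMs;
    this.now = now;
    this.lastBeatMs = now();
  }

  // Schedule heartbeats; a starved event loop delays these callbacks.
  start(): void {
    this.timer = setInterval(() => this.beat(), this.intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }

  // Record that the event loop got around to running a timer callback.
  beat(): void {
    this.lastBeatMs = this.now();
  }

  // Healthy while the last heartbeat is no older than one interval plus the allowed lag.
  isHealthy(): boolean {
    return this.now() - this.lastBeatMs <= this.intervalMs + this.maxLagMs;
  }
}
```

The liveness handler would then answer 200 or 503 based on isHealthy(). A fully blocked loop never reaches the handler at all, so the probe simply times out; the watchdog mainly catches a loop that still answers but is lagging badly. Node's perf_hooks.monitorEventLoopDelay is an alternative source for the same signal.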

What belongs on readiness

Readiness answers: “If we send a user request here, can we expect a reasonable outcome?” That often includes:

  • Mandatory dependencies for the request path you serve: primary DB connectivity, auth JWKS fetch cache warmed, feature-flag SDK initialized—whatever your SLO assumes is available on the hot path.
  • Overload signals: optional but powerful—if your queue depth or in-flight request count exceeds a threshold, fail readiness to shed load upstream while keeping the process alive to recover.

Treat readiness as a traffic valve, not a full observability stack. It should return quickly (tens of milliseconds to low hundreds), with timeouts aligned to your probe timeoutSeconds.
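
The overload signal can be sketched as an in-flight counter with two thresholds. The class name and the high/low watermark idea are assumptions for illustration; the lower watermark is there so readiness does not flap when load hovers around a single limit.

```typescript
// Illustrative load gate: the class name and thresholds are assumptions.
class InFlightGate {
  private inFlight = 0;
  private shedding = false;
  private readonly high: number;
  private readonly low: number;

  // `high` trips shedding; `low` (below `high`) clears it again.
  constructor(high: number, low: number) {
    if (low >= high) throw new Error("low must be below high");
    this.high = high;
    this.low = low;
  }

  // Call when a request starts.
  enter(): void {
    this.inFlight += 1;
    if (this.inFlight >= this.high) this.shedding = true;
  }

  // Call when a request finishes, including on errors.
  exit(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
    if (this.inFlight <= this.low) this.shedding = false;
  }

  // The readiness handler returns 503 while this is false.
  isReady(): boolean {
    return !this.shedding;
  }
}
```

Wrap request handling so that exit() runs in a finally block; a leaked count would keep the replica out of rotation indefinitely.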

Trade-offs: “deep” checks vs lying endpoints

Deep checks (run a SELECT 1, ping Redis) on readiness are valuable when failure accurately means “do not route here.” The trade-offs:

  • False negatives: transient network blips remove you from rotation; that is usually acceptable if you have more than one replica.
  • False positives: the classic “we only check TCP to Postgres” while queries are failing—traffic still hits a broken instance. Prefer a check that mirrors minimal real work (pool checkout + simple query) over a socket connect.

Shallow “always 200” endpoints are worse than they look: they train operators to ignore real signals and delay incident detection. If you are not ready to fail readiness on a dependency, document that decision explicitly and surface dependency state in metrics and traces instead.
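
One way to make that decision explicit in code is to tag each dependency as critical or not, fail readiness only on critical failures, and still report the rest. The types and function below are a hypothetical sketch; which dependencies count as critical is a per-service choice.

```typescript
// Illustrative types: names are assumptions, not part of any framework.
type DependencyResult = { name: string; ok: boolean; critical: boolean };

type ReadinessReport = {
  httpStatus: 200 | 503;
  state: "ready" | "not_ready";
  failed: string[];   // critical dependencies that are down
  degraded: string[]; // non-critical dependencies that are down
};

// Only critical failures flip readiness; non-critical ones are reported
// so they can be exported to metrics without removing the replica.
function evaluateReadiness(results: DependencyResult[]): ReadinessReport {
  const failed = results.filter((r) => !r.ok && r.critical).map((r) => r.name);
  const degraded = results.filter((r) => !r.ok && !r.critical).map((r) => r.name);
  const ready = failed.length === 0;
  return {
    httpStatus: ready ? 200 : 503,
    state: ready ? "ready" : "not_ready",
    failed,
    degraded,
  };
}
```

The critical flag doubles as documentation: anyone reading the dependency list can see which outages are expected to pull the replica from rotation.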

Practical example: node:http with bounded dependency checks

Below is a compact pattern using separate routes, Promise.race deadlines, and short-lived caching on readiness so concurrent kube-probes do not stampede your database. The same structure maps directly onto Express, Fastify, or Hono—you only swap the router glue.

import http from "node:http";

/** Minimal stand-in for `pg.Pool` / `mysql2` pool — inject your real pool here */
export type SqlPool = { query: (text: string) => Promise<unknown> };

type ReadyCache = { ok: boolean; checkedAt: number; detail?: string };
let readyCache: ReadyCache | null = null;
const READY_CACHE_MS = 2_000;

function sendJson(res: http.ServerResponse, status: number, body: unknown) {
  res.writeHead(status, { "content-type": "application/json; charset=utf-8" });
  res.end(JSON.stringify(body));
}

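async function checkPrimaryDb(pool: SqlPool, deadlineMs: number): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("db readiness timeout")), deadlineMs);
  });
  try {
    // Whichever settles first wins; the race stops waiting but does not cancel the query.
    await Promise.race([pool.query("SELECT 1"), deadline]);
  } finally {
    clearTimeout(timer); // avoid leaving a pending timer behind after every probe
  }
}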

export function createHealthServer(pool: SqlPool) {
  return http.createServer(async (req, res) => {
    const path = req.url?.split("?")[0] ?? "/";

    if (req.method === "GET" && path === "/health/live") {
      sendJson(res, 200, { status: "live" });
      return;
    }

    if (req.method === "GET" && path === "/health/ready") {
      const now = Date.now();
      if (readyCache && now - readyCache.checkedAt < READY_CACHE_MS) {
        sendJson(res, readyCache.ok ? 200 : 503, {
          status: readyCache.ok ? "ready" : "not_ready",
          detail: readyCache.detail,
        });
        return;
      }

      try {
        await checkPrimaryDb(pool, 400);
        readyCache = { ok: true, checkedAt: now };
        sendJson(res, 200, { status: "ready" });
      } catch (e) {
        const detail = e instanceof Error ? e.message : "unknown";
        readyCache = { ok: false, checkedAt: now, detail };
        sendJson(res, 503, { status: "not_ready", detail });
      }
      return;
    }

    res.writeHead(404);
    res.end();
  });
}

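Wire pool to your real client (pg, mysql2, etc.) and set driver- or server-side statement timeouts in addition to the race above: the race only stops waiting, it does not cancel the query, so a wedged query can still hold a connection. The 2 s cache coalesces probe traffic when several callers (kubelet, an external load balancer, monitoring) poll the endpoint; a single prober that polls less often than the cache lifetime will rarely hit it. Tune the cache lifetime so stale readiness is acceptable relative to how quickly endpoints are removed (periodSeconds × failureThreshold).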
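
A time-based cache alone leaves one gap: once the entry expires, every request that arrives before the refresh finishes runs its own query. A small single-flight wrapper closes it by sharing one in-flight promise. This is a sketch under assumptions; singleFlight is a hypothetical helper name, not part of Node or any driver.

```typescript
// Hypothetical helper: callers that arrive while a check is running
// share its promise instead of starting another one.
function singleFlight<T>(fn: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight) return inFlight;
    // Clear the slot once the check settles (success or failure)
    // so the next caller triggers a fresh check.
    inFlight = fn().finally(() => {
      inFlight = null;
    });
    return inFlight;
  };
}

// Usage sketch: a fake 10 ms dependency check stands in for the real query.
let checksStarted = 0;
const sharedCheck = singleFlight(async () => {
  checksStarted += 1;
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
});

const first = sharedCheck();
const second = sharedCheck();
console.log(first === second, checksStarted); // true 1
```

In the readiness handler this would wrap the call to checkPrimaryDb, with the cache still deciding whether a check is needed at all.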

Example probe YAML sketch:

startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  failureThreshold: 30
  periodSeconds: 2
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2

Align readinessProbe.timeoutSeconds with your handler’s worst-case dependency check plus margin.
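
To make that alignment concrete, the arithmetic can be written down as two small checks. These helpers are hypothetical, assume dependency checks run one after another, and give only a rough bound: they ignore probe latency and the time it takes for endpoint changes to propagate.

```typescript
// Illustrative helpers; parameter names are assumptions that mirror the probe fields.

// True when the handler's sequential dependency deadlines plus a safety
// margin fit inside the probe's timeoutSeconds.
function fitsProbeTimeout(deadlinesMs: number[], marginMs: number, timeoutSeconds: number): boolean {
  const worstCaseMs = deadlinesMs.reduce((sum, d) => sum + d, 0) + marginMs;
  return worstCaseMs <= timeoutSeconds * 1000;
}

// Rough upper bound on how long a broken replica keeps receiving traffic:
// the stale-cache window plus failureThreshold probe periods.
function worstCaseUnreadySeconds(cacheMs: number, periodSeconds: number, failureThreshold: number): number {
  return cacheMs / 1000 + periodSeconds * failureThreshold;
}

console.log(fitsProbeTimeout([400], 200, 2)); // 400 + 200 <= 2000, so true
console.log(worstCaseUnreadySeconds(2_000, 5, 2)); // 2 + 5 * 2 = 12
```

With the values used in this article, the 400 ms database deadline fits comfortably inside the 2 s probe timeout, and a failing replica leaves rotation after roughly 12 seconds.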

Common mistakes and pitfalls

  1. Using one /health for everything — couples restart policy to traffic policy; almost always wrong at scale.
  2. Liveness hits the database — restarts during DB incidents create flapping pods and slower recovery.
  3. Readiness never fails — users hit instances that return 500 for every request while Kubernetes still shows “healthy.”
  4. Probe storms — every replica runs deep checks every few seconds without caching; this can saturate small databases or shared Redis instances used only for sessions.
  5. Ignoring graceful shutdown — readiness should fail before you stop accepting new connections; combine with preStop hooks and sufficient terminationGracePeriodSeconds (see graceful shutdown patterns for your stack).
  6. Misleading HTTP codes — returning 200 with { "status": "down" } breaks load balancers and kube; use 503 for not ready.
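
For the graceful shutdown point in item 5, the ordering can be sketched as: flip readiness first, wait long enough for the orchestrator to notice, then close the server. The controller below is a minimal sketch; createShutdownController, Closable, and the drain delay are assumptions, and the right delay depends on your probe period and platform.

```typescript
// Anything with a Node-style close(callback), such as http.Server.
type Closable = { close: (cb?: (err?: Error) => void) => unknown };

// Hypothetical controller: readiness goes false immediately, the server
// closes only after drainDelayMs so in-rotation traffic can move elsewhere.
function createShutdownController(server: Closable, drainDelayMs: number) {
  let shuttingDown = false;
  return {
    // The readiness handler returns 503 once this is false.
    isReady: (): boolean => !shuttingDown,

    async shutdown(): Promise<void> {
      if (shuttingDown) return; // repeated signals do not restart the sequence
      shuttingDown = true;
      await new Promise<void>((resolve) => setTimeout(resolve, drainDelayMs));
      await new Promise<void>((resolve, reject) => {
        server.close((err) => (err ? reject(err) : resolve()));
      });
    },
  };
}
```

A typical wiring is to call shutdown() from a SIGTERM handler and keep terminationGracePeriodSeconds larger than the drain delay plus the time your longest request needs to finish.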

Conclusion

Treat liveness, readiness, and startup probes as API contracts with your orchestrator: liveness stays local and cheap, readiness reflects whether traffic should land, and deep checks are bounded with timeouts and light caching to avoid self-inflicted load. Getting this right is some of the highest leverage work when hardening APIs for multi-replica, production deployments—whether you are shipping internally or helping a client build a platform that survives dependency blips without mystery restarts.

The operational payoff is simple: fewer false-stable states, faster incident detection, and behavior that matches what operators and users already assume “healthy” should mean.
