Distributed sagas: choreography vs orchestration in production systems

Compare event-driven choreography and orchestrator-led sagas for multi-step workflows: coordination models, failure handling, observability, and when to choose each.

Author: Matheus Palma · 8 min read
Software engineering · Architecture · Distributed systems · Microservices · Backend · Reliability

A customer completes checkout: inventory must decrement, payment must capture, and a fulfillment message must reach the warehouse. Step two fails after step one already committed. In a monolith, you wrap the whole flow in one database transaction. Across services, there is no single transaction boundary—only eventual consistency, compensating actions, and a design choice about who owns the story of the workflow.

That design choice is usually framed as choreography (each service reacts to events and emits the next) versus orchestration (a dedicated coordinator drives steps in order). Both implement saga-style processes: a logical long-running transaction split into local transactions with defined compensation paths. In consulting and product work, picking the wrong model shows up late—as duplicated logic, impossible debugging, or “lost” workflows when messages reorder or retry.

This article compares the two models at the level you need to implement or review them: message flows, failure semantics, operational trade-offs, and a concrete sketch of each style.

Why sagas exist

Distributed systems rarely offer ACID across service boundaries. Patterns like the transactional outbox help publish intent reliably after a local commit, but they do not by themselves define how multiple services agree on a global outcome. Sagas fill that gap:

  • Each step is a local transaction with a clear commit or rollback story.
  • Cross-service failure is handled by compensating transactions (undo credit, release stock) or by forward recovery (retry, alternate path), depending on business rules.

The saga is not the same as “use Kafka.” It is a protocol for splitting work and recovering from partial completion. Choreography and orchestration are two ways to implement that protocol.
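As a minimal sketch of that protocol, independent of any broker or framework: each step pairs a local action with a compensation, and a failure triggers the compensations of already-committed steps in reverse order. The names SagaStep, SagaOutcome, and runSaga are hypothetical and exist only for illustration; a real implementation would also persist progress and retry failed compensations.

```typescript
// Hypothetical types for illustration; not taken from any library.
interface SagaStep {
  label: string;
  action: () => Promise<void>;     // local transaction inside one service
  compensate: () => Promise<void>; // semantic undo of that local transaction
}

type SagaOutcome =
  | { ok: true }
  | { ok: false; failedStep: string; compensated: string[] };

async function runSaga(steps: SagaStep[]): Promise<SagaOutcome> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch {
      // The failed step never committed, so only earlier steps are undone,
      // newest first.
      const compensated: string[] = [];
      for (const done of completed.slice().reverse()) {
        await done.compensate();
        compensated.push(done.label);
      }
      return { ok: false, failedStep: step.label, compensated };
    }
  }
  return { ok: true };
}
```

This sketch assumes compensations succeed; in practice they need to be idempotent and retried, because a compensation that fails halfway leaves the saga in exactly the partial state it was meant to clean up.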

Choreography: implicit coordination

In choreography, no central component owns the end-to-end script. Service A completes its work and publishes a domain event. Service B subscribes, performs its step, and publishes another event. The “workflow” exists only as causal chains of messages and the collective behavior of handlers.

Strengths

Loose coupling. Producers do not know the full list of consumers; new reactions can be added by subscribing to existing events (within bounded contexts and versioning discipline).

Natural fit for domain events. When the business already speaks in terms of “OrderPlaced,” “PaymentCaptured,” choreography aligns with how teams model aggregates and boundaries.

Horizontal scaling. Handlers scale with partitions and consumer groups; there is no single coordinator process to size for peak.

Weaknesses

Implicit global behavior. Understanding “what happens when an order is placed” requires reading every subscriber and every emitted follow-up event. Onboarding and audits suffer.

Ordering and idempotency pressure. If “ShipOrder” must run only after “PaymentCaptured,” that invariant must be enforced in handlers (state machines, deduplication) or in topic design—not by a single ordered script.

Compensation chains are hard to visualize. Who sends StockReleased when payment fails—inventory service reacting to PaymentFailed, or a dedicated process? Without a map, compensations duplicate or miss edge cases.

Orchestration: explicit coordination

In orchestration, a workflow engine or saga coordinator (sometimes a dedicated service, sometimes a worker in an existing app) executes a defined sequence: call inventory, then billing, then shipping; on specific failures, invoke compensations in reverse order or branch to human review.

The orchestrator holds state for each saga instance: current step, correlation id, payloads, timeouts. It typically persists that state durably so crashes can resume.

Strengths

Clear workflow definition. The saga is readable as code, BPMN, or state-machine configuration. That helps compliance, support, and incident response (“instance saga-4821 is stuck on CapturePayment”).

Centralized failure policy. Retry intervals, circuit breaking, and “call support after N failures” live in one place.
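One way to picture "policy in one place" is a small retry helper that the coordinator applies to every downstream call. The sketch below is an assumption-laden illustration: RetryPolicy, backoffDelayMs, and callWithPolicy are hypothetical names, the backoff is plain exponential with a cap, and "escalate" stands in for whatever "call support after N failures" means in a given system.

```typescript
// Hypothetical policy shape; field names are illustrative.
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first
  baseDelayMs: number; // wait after the first failure
  maxDelayMs: number;  // cap on exponential growth
}

// Delay after the given failed attempt (1-based): base * 2^(attempt - 1), capped.
function backoffDelayMs(policy: RetryPolicy, attempt: number): number {
  return Math.min(policy.maxDelayMs, policy.baseDelayMs * Math.pow(2, attempt - 1));
}

type StepResult<T> =
  | { kind: "ok"; value: T }
  | { kind: "escalate"; attempts: number; lastError: unknown };

async function callWithPolicy<T>(
  policy: RetryPolicy,
  call: () => Promise<T>,
  // Injectable so tests can skip real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<StepResult<T>> {
  let lastError: unknown = undefined;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return { kind: "ok", value: await call() };
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt; the caller escalates instead.
      if (attempt < policy.maxAttempts) {
        await sleep(backoffDelayMs(policy, attempt));
      }
    }
  }
  return { kind: "escalate", attempts: policy.maxAttempts, lastError };
}
```

Because every step goes through the same helper, changing the retry budget or adding jitter is a single edit in the coordinator rather than a change in each service.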

Easier reasoning about ordering. The coordinator explicitly does not advance until the previous step succeeds (or takes a documented failure path).

Weaknesses

Availability coupling. If the orchestrator is down, new sagas may not start and in-flight sagas stall until it recovers; mitigations include active-active deployment, durable queues feeding the orchestrator, and health monitoring.

Risk of a “smart god service.” The orchestrator can accumulate domain logic that belongs inside bounded contexts, leading to change coupling and deployment bottlenecks.

Scaling model. Throughput is often limited by orchestrator capacity and downstream call patterns; sharding by tenant or workflow type is usually required at high volume.

Comparison at a glance

| Concern                  | Choreography                                  | Orchestration                          |
|--------------------------|-----------------------------------------------|----------------------------------------|
| Workflow visibility      | Emergent (event graph)                        | Explicit (definition + instance state) |
| Adding a new step        | Subscribe / emit (watch compatibility)        | Change coordinator and deploy          |
| Debugging a stuck flow   | Trace across many services                    | Inspect orchestrator state + logs      |
| Coupling                 | Low direct coupling, high conceptual coupling | Orchestrator depends on service APIs   |
| Typical failure handling | Distributed compensations                     | Central policy + compensations         |

Neither is universally “better.” Many production systems use both: choreography inside a bounded context, orchestration for cross-context business processes that need a single accountable narrative.

Practical example: same business flow, two shapes

Consider reserve inventory → charge card → create shipment. Idempotency and outbox are assumed at each service boundary.

Orchestrated sketch (TypeScript)

A single worker runs instances keyed by orderId. State is stored so restarts resume safely.

// inventory, billing, and shipping are service clients assumed to be in scope.
type SagaState =
  | { phase: "reserve"; orderId: string }
  | { phase: "charge"; orderId: string; reservationId: string }
  // "ship" keeps reservationId so a shipping failure can still release stock.
  | { phase: "ship"; orderId: string; reservationId: string; paymentId: string }
  | { phase: "done"; orderId: string }
  | { phase: "compensating"; orderId: string; reason: string };

async function runCheckoutSaga(initial: { orderId: string }): Promise<void> {
  let state: SagaState = { phase: "reserve", orderId: initial.orderId };

  for (;;) {
    switch (state.phase) {
      case "reserve": {
        const r = await inventory.reserve(state.orderId);
        if (!r.ok) {
          // Nothing has committed yet, so there is nothing to compensate.
          await persistTerminal(state.orderId, "inventory_unavailable");
          return;
        }
        state = { phase: "charge", orderId: state.orderId, reservationId: r.reservationId };
        await saveCheckpoint(state);
        break;
      }
      case "charge": {
        const p = await billing.capture(state.orderId, state.reservationId);
        if (!p.ok) {
          // Compensate the one committed step: release the reservation.
          await inventory.release(state.reservationId);
          await persistTerminal(state.orderId, "payment_failed");
          return;
        }
        state = {
          phase: "ship",
          orderId: state.orderId,
          reservationId: state.reservationId,
          paymentId: p.paymentId,
        };
        await saveCheckpoint(state);
        break;
      }
      case "ship": {
        const s = await shipping.createLabel(state.orderId, state.paymentId);
        if (!s.ok) {
          // Policy: refund, then release (reverse order); exact steps depend on SLAs.
          await billing.refund(state.paymentId);
          await inventory.release(state.reservationId);
          await persistTerminal(state.orderId, "shipping_failed");
          return;
        }
        state = { phase: "done", orderId: state.orderId };
        await saveCheckpoint(state);
        return;
      }
      default:
        // "done" and "compensating" are terminal for this loop.
        return;
    }
  }
}

async function saveCheckpoint(_s: SagaState): Promise<void> {
  /* durable store: Postgres, DynamoDB, etc. */
}
async function persistTerminal(_orderId: string, _status: string): Promise<void> {}

The important properties: checkpointing after each transition, explicit compensation when a step fails after prior commits, and a single place to attach metrics (saga_step_latency, saga_failed_total).

Choreographed sketch (event handlers)

Here, each service reacts to an event and publishes the next. Global behavior is distributed.

// Inventory service
on("OrderCheckoutRequested", async (ev) => {
  const r = await reserve(ev.orderId);
  if (r.ok) await publish("InventoryReserved", { orderId: ev.orderId, reservationId: r.id });
  else await publish("InventoryUnavailable", { orderId: ev.orderId });
});

// Billing service
on("InventoryReserved", async (ev) => {
  const p = await capture(ev.orderId, ev.reservationId);
  if (p.ok) await publish("PaymentCaptured", { orderId: ev.orderId, paymentId: p.id });
  else {
    await publish("PaymentFailed", { orderId: ev.orderId, reservationId: ev.reservationId });
  }
});

// Inventory listens for payment failure
on("PaymentFailed", async (ev) => {
  await release(ev.reservationId);
});

// Shipping service
on("PaymentCaptured", async (ev) => {
  await createShipment(ev.orderId, ev.paymentId);
});

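This works, but invariants are implicit: every team must agree that PaymentFailed always triggers release, that handlers are idempotent, and that redelivery or reordering (e.g. a duplicate InventoryReserved) does not double-charge. Note also that this sketch has no compensation path if createShipment fails, a case the orchestrated version handles explicitly. Documentation and contract tests become essential.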
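To make the idempotency requirement concrete, here is a hedged sketch of a deduplicating wrapper that a handler such as the billing subscriber could use. DeliveredEvent, ProcessedEventStore, and idempotent are hypothetical names, the eventId field is an assumption (the handlers above do not show one), and the in-memory Set stands in for a durable table of processed event ids.

```typescript
// Assumes every event carries a unique eventId assigned by the publisher.
interface DeliveredEvent {
  eventId: string;
  orderId: string;
}

// In-memory stand-in for a durable "processed events" table.
class ProcessedEventStore {
  private readonly seen = new Set<string>();

  // Returns true the first time an id is recorded, false on redelivery.
  markIfNew(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }
}

// Wraps a handler so that a redelivered event produces no second side effect.
function idempotent<E extends DeliveredEvent>(
  store: ProcessedEventStore,
  handler: (ev: E) => void,
): (ev: E) => void {
  return (ev: E): void => {
    if (!store.markIfNew(ev.eventId)) return; // duplicate delivery: skip
    handler(ev);
  };
}
```

In a real service the "mark as processed" write and the handler's own state change would need to commit in the same local transaction; otherwise a crash between the two either loses the event or repeats the side effect.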

Trade-offs that matter in production

Observability. Orchestrated sagas give a natural place to attach correlation and trace ids per instance. Choreography requires every service to propagate those ids consistently through every event and to index logs by them, typically with OpenTelemetry context propagation and baggage carried in message headers.
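The propagation discipline can be illustrated with a hand-rolled event envelope. This is not the OpenTelemetry API; EventEnvelope, startWorkflow, causedBy, and logLine are hypothetical helpers that only show the idea: every follow-up event copies the correlation id of the event that caused it and records that event's id as its cause.

```typescript
// Hypothetical envelope; field names are assumptions for this sketch.
interface EventEnvelope<P> {
  eventId: string;
  type: string;
  correlationId: string;      // identical for every event in one workflow
  causationId: string | null; // eventId of the triggering event, null at the root
  payload: P;
}

// Deterministic ids keep the sketch simple; real systems would use UUIDs.
let envelopeCounter = 0;
function nextEventId(): string {
  envelopeCounter += 1;
  return `evt-${envelopeCounter}`;
}

function startWorkflow<P>(type: string, correlationId: string, payload: P): EventEnvelope<P> {
  return { eventId: nextEventId(), type, correlationId, causationId: null, payload };
}

// Every follow-up event inherits the correlation id from its cause.
function causedBy<P, Q>(cause: EventEnvelope<P>, type: string, payload: Q): EventEnvelope<Q> {
  return {
    eventId: nextEventId(),
    type,
    correlationId: cause.correlationId,
    causationId: cause.eventId,
    payload,
  };
}

// A log line that can be indexed and searched by correlation id.
function logLine<P>(ev: EventEnvelope<P>): string {
  return `[${ev.correlationId}] ${ev.type} (${ev.eventId} <- ${ev.causationId ?? "root"})`;
}
```

With this in place, filtering logs by one correlation id reconstructs the causal chain of a single checkout, which is the closest choreography gets to the orchestrator's per-instance view.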

Versioning. Adding a step in orchestration is a coordinated deploy of the coordinator. In choreography, new consumers can subscribe to old events if the payload remains compatible; incompatible changes need upversioned event types or dual publishing.
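One common way to absorb an incompatible payload change on the consumer side is an upcaster that converts old versions to the newest shape before the handler runs. The two PaymentCaptured shapes below are invented for illustration (the article does not define event schemas), as is the rule that version 1 amounts are interpreted in a caller-supplied default currency.

```typescript
// Hypothetical schemas: v1 has a bare amount, v2 adds an explicit currency.
interface PaymentCapturedV1 {
  version: 1;
  orderId: string;
  paymentId: string;
  amountCents: number;
}

interface PaymentCapturedV2 {
  version: 2;
  orderId: string;
  paymentId: string;
  amount: { cents: number; currency: string };
}

type AnyPaymentCaptured = PaymentCapturedV1 | PaymentCapturedV2;

// Normalizes any known version to v2 so handlers only deal with one shape.
function upcastPaymentCaptured(
  ev: AnyPaymentCaptured,
  defaultCurrency: string, // assumption: v1 events were all in one currency
): PaymentCapturedV2 {
  if (ev.version === 2) return ev;
  return {
    version: 2,
    orderId: ev.orderId,
    paymentId: ev.paymentId,
    amount: { cents: ev.amountCents, currency: defaultCurrency },
  };
}
```

Upcasting keeps old producers working while consumers migrate; dual publishing achieves the same goal from the producer side, at the cost of emitting both versions for a transition period.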

Human workflows. Escalation (“approve refund over $10k”) fits orchestration or a specialized rules engine; pure choreography can model it but often sprinkles approval state across services.

Regulatory and audit. A single persisted saga history simplifies “show me everything that happened to order X.” Choreography can achieve the same with an audit log projection fed by all events, but that takes extra engineering and requires strong immutability guarantees on the log.
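A rough sketch of such a projection, under the assumption that every service's events reach one subscriber and carry an ISO-8601 UTC timestamp: AuditedEvent and AuditProjection are hypothetical names, and the in-memory Map stands in for an append-only store.

```typescript
// Hypothetical event shape; occurredAt is assumed to be ISO-8601 in UTC.
interface AuditedEvent {
  orderId: string;
  type: string;
  source: string; // emitting service
  occurredAt: string;
}

class AuditProjection {
  private readonly byOrder = new Map<string, AuditedEvent[]>();

  // Append-only: entries are copied and frozen, never updated in place.
  append(ev: AuditedEvent): void {
    const list = this.byOrder.get(ev.orderId) ?? [];
    list.push(Object.freeze({ ...ev }));
    this.byOrder.set(ev.orderId, list);
  }

  // "Show me everything that happened to order X", oldest first.
  historyFor(orderId: string): readonly AuditedEvent[] {
    const list = this.byOrder.get(orderId) ?? [];
    // Same-format ISO-8601 UTC strings sort chronologically as plain strings.
    return list
      .slice()
      .sort((a, b) => (a.occurredAt < b.occurredAt ? -1 : a.occurredAt > b.occurredAt ? 1 : 0));
  }
}
```

Sorting at read time matters because events from different services can arrive out of order; the projection records what it receives and orders by the timestamp each producer assigned.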

Common mistakes and pitfalls

Confusing choreography with “no design.” Event chains without explicit saga invariants become spaghetti; define allowed states and transitions per aggregate, even if there is no orchestrator.
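Defining "allowed states and transitions" can be as small as a lookup table that each handler consults before acting. The sketch below is illustrative: the state names are invented, and ShipmentCreated is a hypothetical event that the choreographed example above does not publish.

```typescript
// Hypothetical order lifecycle for the checkout flow.
type OrderState = "requested" | "reserved" | "paid" | "shipped" | "cancelled";
type OrderEvent =
  | "InventoryReserved"
  | "InventoryUnavailable"
  | "PaymentCaptured"
  | "PaymentFailed"
  | "ShipmentCreated";

// Every legal transition is listed; anything absent is rejected.
const transitions: Record<OrderState, Partial<Record<OrderEvent, OrderState>>> = {
  requested: { InventoryReserved: "reserved", InventoryUnavailable: "cancelled" },
  reserved: { PaymentCaptured: "paid", PaymentFailed: "cancelled" },
  paid: { ShipmentCreated: "shipped" },
  shipped: {},
  cancelled: {},
};

// Returns the next state, or leaves the state unchanged when the event is
// out of order or duplicated.
function applyEvent(
  current: OrderState,
  ev: OrderEvent,
): { next: OrderState; accepted: boolean } {
  const next = transitions[current][ev];
  return next === undefined
    ? { next: current, accepted: false }
    : { next, accepted: true };
}
```

A rejected event is a signal worth logging: it means a message arrived late, twice, or from a flow nobody designed, which is exactly the spaghetti this pitfall describes.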

Double execution under redelivery. Without idempotency keys and deduplication, both models duplicate side effects when brokers redeliver.

Orchestrator as a second domain model. If the coordinator re-implements pricing or inventory rules, teams will fight over every release. Keep orchestration thin: sequence calls and policies, not business rules that belong inside services.

Ignoring poison messages. A handler that always throws can block partition progress; use dead-letter queues, alerting, and replay tooling for both styles.
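The dead-letter idea can be sketched with an in-memory queue: a message that keeps failing is retried a bounded number of times and then parked, so messages behind it still get processed. QueuedMessage, DrainResult, and drainWithDeadLetter are hypothetical names; real brokers expose this through their own retry and dead-letter configuration, and requeueing at the back (as done here) changes ordering, which may not be acceptable on an ordered partition.

```typescript
interface QueuedMessage {
  id: string;
  attempts: number; // failed attempts so far
}

interface DrainResult {
  processed: string[];           // ids handled successfully, in order
  deadLettered: QueuedMessage[]; // parked for inspection and replay
}

function drainWithDeadLetter(
  pending: QueuedMessage[],
  handler: (m: QueuedMessage) => void,
  maxAttempts: number,
): DrainResult {
  const queue = pending.map((m) => ({ ...m })); // copy; caller's array is untouched
  const result: DrainResult = { processed: [], deadLettered: [] };

  while (queue.length > 0) {
    const msg = queue.shift() as QueuedMessage;
    try {
      handler(msg);
      result.processed.push(msg.id);
    } catch {
      msg.attempts += 1;
      if (msg.attempts >= maxAttempts) {
        result.deadLettered.push(msg); // stop retrying; alert and replay later
      } else {
        queue.push(msg); // retry after the messages currently waiting
      }
    }
  }
  return result;
}
```

Whatever the mechanism, the dead-letter destination needs alerting and a replay path; a queue that silently accumulates parked messages is another form of lost workflow.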

Uniform choice. Forcing orchestration for every micro-interaction adds latency and coupling; forcing choreography for every cross-team regulatory workflow creates invisible state machines. Match the tool to the process.

Conclusion

Choreography and orchestration are coordination strategies for the same underlying problem: multi-step work across boundaries without a global lock. Choreography optimizes for evolution and decoupling at the cost of implicit global behavior. Orchestration optimizes for clarity, ordering, and operability at the cost of a critical component and careful boundary design.

Key takeaways:

  • Treat sagas as protocols with local commits, compensation or forward recovery, and explicit handling of partial failure
  • Prefer orchestration when the business needs a single narrative, strong ordering, or heavy operational visibility for a workflow
  • Prefer choreography when domains are stable, events are first-class, and you can invest in contracts, idempotency, and tracing
  • Hybrid approaches are normal: orchestrate the cross-context business process, choreograph within a bounded context

Teams building scalable, production-ready distributed systems benefit from deciding this before message schemas proliferate—revisiting coordination after years of organic events is expensive. For architecture discussions or hands-on help aligning workflows with infrastructure, the contact page is the right place to reach out.
