Compensating actions in LLM tool pipelines: when generation succeeds but the world does not

LLM tool calls often chain side effects. This article covers saga-style compensations, idempotency, outbox-style durability, and operator UX when only some tools succeed, drawing on patterns from production API work.

Author: Matheus Palma · May 18, 2026 · 7 min read

Software engineering, Backend, TypeScript, Artificial intelligence, API design, Reliability, Distributed systems

Your model returns a confident plan: reserve inventory, charge the card, create the shipment label, then email the customer. In the happy path, four tools run and the user sees a single coherent outcome. In production, the third call fails because the carrier API returns 503 after the payment already cleared. Now you owe the customer an explanation, finance wants a clean ledger, and your logs show a half-finished saga with no obvious “undo” button.

That failure mode is not exotic. In client projects where LLM features sit on top of real commerce or operations APIs, tool pipelines behave like distributed workflows: partial success is normal, retries are unsafe without design, and the UI must not pretend that “the model said yes” equals “the business state is correct.” This article explains how to structure tool execution so you can recover, compensate, and observe partial runs—without turning every prompt into a bespoke microservice saga.

Why tool pipelines are workflows, not function calls

A classical in-process function either returns or throws; when it throws, the stack unwinds, its local state is discarded, and the caller sees an error. Tool calls over HTTP, queues, or SDKs break that assumption:

Side effects survive the exception. A charge intent may finalize even if your handler crashes on the next line.
Latency hides races. Two retries from different workers can double-charge if keys are wrong.
The LLM is not a transaction manager. It proposes steps; your runtime must enforce invariants.

Thinking in workflow terms (steps, compensations, durable state) is the mental shift that keeps LLM features trustworthy at scale.

Sagas, compensations, and what “undo” really means

A saga splits a business operation into local transactions, each with a defined compensating action that semantically reverses the forward step—void the authorization, release the hold, cancel the draft order. Not every step is strictly reversible (emails sent, webhooks fired), so you classify steps:

Class	Example	Compensation strategy
Reversible	Payment authorization hold	Void the authorization or release the hold; refund if already captured
Irreversible but idempotent	Idempotent “ensure label” with same key	Safe retry; no duplicate side effect
Irreversible and noisy	Customer email	Follow-up correction email or support ticket

The goal is not mathematical rollback (often impossible) but a defined terminal state operators can reason about: “payment captured, shipment not created, customer notified, inventory released.”
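One way to make that terminal state explicit is to encode the classification as data and derive the backward-recovery actions from it. The sketch below is illustrative only: the type names, the `backwardRecoveryPlan` function, and the step names are assumptions made for this example, not part of any library.

```typescript
// Hypothetical types for illustration; none of these names come from a library.
type StepClass = "reversible" | "irreversible-idempotent" | "irreversible-noisy";

type CompletedStep = { name: string; stepClass: StepClass };

type RecoveryAction =
  | { kind: "compensate"; step: string } // run the registered undo (void, release)
  | { kind: "leave"; step: string } // nothing to undo; a retry with the same key is safe
  | { kind: "follow-up"; step: string }; // cannot be undone; send a correction or open a ticket

// Given the steps that already succeeded (in forward order), list what
// backward recovery should do, most recent step first.
function backwardRecoveryPlan(completed: CompletedStep[]): RecoveryAction[] {
  return [...completed].reverse().map((step): RecoveryAction => {
    switch (step.stepClass) {
      case "reversible":
        return { kind: "compensate", step: step.name };
      case "irreversible-idempotent":
        return { kind: "leave", step: step.name };
      case "irreversible-noisy":
        return { kind: "follow-up", step: step.name };
    }
  });
}

const examplePlan = backwardRecoveryPlan([
  { name: "authorize_payment", stepClass: "reversible" },
  { name: "ensure_label", stepClass: "irreversible-idempotent" },
  { name: "email_customer", stepClass: "irreversible-noisy" },
]);
console.log(examplePlan.map((a) => `${a.kind}:${a.step}`).join(", "));
```

Keeping the class on the step definition, rather than deciding at failure time, means the list of recovery actions can be logged and shown to operators before anything is executed.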

Forward vs backward recovery

Backward recovery runs compensations for completed steps in reverse order. Clean when compensations exist and commute reasonably.
Forward recovery retries or completes missing forward steps from a checkpoint (for example, payment succeeded, label pending—retry label with backoff).

LLM-driven plans often mix both: the model may suggest an order that is not optimal for compensation ordering. The runtime, not the model, should enforce a canonical step order and allowed transitions.
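A minimal sketch of that enforcement, assuming the four steps from the opening example and a hypothetical `normalizePlan` helper: the model may list steps in any order, but the runtime rejects unknown or repeated steps and always executes in the canonical order.

```typescript
// Canonical order is owned by the server. Step names are assumptions for this sketch.
const CANONICAL_ORDER = [
  "reserve_inventory",
  "charge_card",
  "create_label",
  "send_email",
] as const;

type StepName = (typeof CANONICAL_ORDER)[number];

function isStepName(value: string): value is StepName {
  return (CANONICAL_ORDER as readonly string[]).includes(value);
}

// Validate a model-proposed plan and return it in canonical order.
function normalizePlan(proposed: string[]): StepName[] {
  const seen = new Set<StepName>();
  for (const name of proposed) {
    if (!isStepName(name)) throw new Error(`unknown step: ${name}`);
    if (seen.has(name)) throw new Error(`duplicate step: ${name}`);
    seen.add(name);
  }
  // Filtering the canonical list discards whatever order the model suggested.
  return CANONICAL_ORDER.filter((name) => seen.has(name));
}

// The model proposed emailing first; the runtime reorders before executing.
console.log(normalizePlan(["send_email", "charge_card", "reserve_inventory"]));
```

A richer version would also check allowed transitions (for example, refusing `create_label` without `charge_card`), but even this small gate keeps ordering decisions out of the prompt.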

Idempotency keys and command identity

Before compensations matter, retries must be safe. Every tool invocation should carry a stable idempotency key derived from workflow identity plus step index (or step name), not from the raw model output text (which can drift between retries).

Why this matters for LLMs specifically:

Streaming clients and gateways may retry POSTs after timeouts.
Your own worker may re-run a step after a partial write to your workflow table.

If the payment provider deduplicates on (merchant, idempotency_key), reusing the same key on a retry prevents a double capture; minting a fresh key for each retry invites duplicates. Treat the key as part of your API contract to downstream systems.
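To see the effect of key choice, the sketch below uses an in-memory stand-in for a provider that deduplicates on (merchant, idempotency_key). `FakePaymentProvider` and `stepKey` are hypothetical names invented for this example; a real provider's replay semantics may differ, so check its documentation.

```typescript
// Hypothetical in-memory stand-in for a provider that deduplicates on
// (merchant, idempotency_key). Not a real SDK.
type Charge = { chargeId: string; amount: number };

class FakePaymentProvider {
  private readonly charges = new Map<string, Charge>();
  private nextId = 1;

  charge(merchant: string, idempotencyKey: string, amount: number): Charge {
    const dedupeKey = `${merchant}|${idempotencyKey}`;
    const existing = this.charges.get(dedupeKey);
    if (existing) return existing; // replay: same result, no new side effect
    const created: Charge = { chargeId: `ch_${this.nextId++}`, amount };
    this.charges.set(dedupeKey, created);
    return created;
  }

  get chargeCount(): number {
    return this.charges.size;
  }
}

// Key derived from server-side workflow identity, never from model output.
function stepKey(workflowId: string, stepIndex: number, stepName: string): string {
  return `${workflowId}:${stepIndex}:${stepName}`;
}

const demoProvider = new FakePaymentProvider();
const demoKey = stepKey("wf-123", 1, "charge");
demoProvider.charge("merchant-a", demoKey, 4200); // first attempt
demoProvider.charge("merchant-a", demoKey, 4200); // retry after a timeout
console.log(demoProvider.chargeCount);
```

Because the key depends only on the workflow ID, step index, and step name, a retry from a different worker produces the same key as long as it loads the same workflow record.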

Durable state between model turns

In multi-turn agents, it is tempting to keep workflow state only in the conversation transcript. That is fragile: transcripts truncate, sessions migrate between pods, and operators need a queryable record.

A small workflow record (relational row or document) should store:

workflow_id, status (running, compensating, failed, completed)
ordered steps with forward_status, compensation_status, last error, idempotency keys
correlation IDs for logs and traces

This mirrors patterns used in transactional outbox designs: your domain state and outbound calls decouple, and you can re-drive incomplete work from the database rather than from model memory. If you are already using an outbox for events, reusing its infrastructure for tool pipelines often costs less than inventing a second reliability layer.
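As a rough illustration of re-driving work from the record rather than from the transcript, the sketch below models the fields listed above and a hypothetical `nextAction` function that a worker could call after a restart. The field names and status values are assumptions chosen to match the list; persistence is left out.

```typescript
// Hypothetical record shape; field names follow the list above but are assumptions.
type WorkflowStatus = "running" | "compensating" | "failed" | "completed";
type StepStatus = "pending" | "succeeded" | "failed";

type StepRecord = {
  name: string;
  idempotencyKey: string;
  forwardStatus: StepStatus;
  compensationStatus: StepStatus | "not_required";
  lastError?: string;
};

type WorkflowRecord = {
  workflowId: string;
  status: WorkflowStatus;
  correlationId: string;
  steps: StepRecord[]; // stored in canonical forward order
};

type NextAction =
  | { kind: "forward"; step: string; idempotencyKey: string }
  | { kind: "compensate"; step: string; idempotencyKey: string }
  | { kind: "none" };

// Decide what a worker should do next using only the persisted record.
function nextAction(record: WorkflowRecord): NextAction {
  if (record.status === "running") {
    // Forward recovery: resume at the first step that has not succeeded.
    const next = record.steps.find((s) => s.forwardStatus !== "succeeded");
    if (next) return { kind: "forward", step: next.name, idempotencyKey: next.idempotencyKey };
  }
  if (record.status === "compensating") {
    // Backward recovery: undo the most recent succeeded step still awaiting compensation.
    const toUndo = [...record.steps]
      .reverse()
      .find(
        (s) =>
          s.forwardStatus === "succeeded" &&
          (s.compensationStatus === "pending" || s.compensationStatus === "failed"),
      );
    if (toUndo) return { kind: "compensate", step: toUndo.name, idempotencyKey: toUndo.idempotencyKey };
  }
  return { kind: "none" };
}
```

Because the stored idempotency key travels with the action, a re-driven step reuses the key from the original attempt instead of minting a new one.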

Designing compensations operators can trust

Compensations should be boring code, not model-generated strings. The model proposes intent (“charge customer”); your server maps intent to registered handlers with typed inputs, timeouts, and compensation hooks.

Practical rules:

Register tools in code, including max concurrency per side-effect class (payments vs emails).
Validate arguments with a schema before any network I/O (see existing patterns around JSON Schema for tool calls).
Emit structured domain events after each successful forward step, not only at the end—partial traces help support and reconciliation.

When consulting on greenfield APIs, specifying this registry early avoids the “stringly typed tool name in production” anti-pattern that becomes expensive to retrofit.
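A small registry along these lines might look like the sketch below. Everything here is hypothetical: the `charge_customer` tool, its fields, and the limits are invented for illustration, and the hand-written `parseChargeArgs` stands in for whatever schema validator you already use.

```typescript
// Hypothetical registry sketch; tool names, fields, and limits are illustrative.
type SideEffectClass = "payment" | "email";

type ToolDefinition<TArgs> = {
  name: string;
  sideEffectClass: SideEffectClass;
  maxConcurrency: number;
  timeoutMs: number;
  parseArgs: (raw: unknown) => TArgs; // must throw before any network I/O happens
};

type ChargeArgs = { orderId: string; amountCents: number };

// Hand-rolled validation standing in for a JSON Schema validator.
function parseChargeArgs(raw: unknown): ChargeArgs {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("charge_customer: arguments must be an object");
  }
  const { orderId, amountCents } = raw as Record<string, unknown>;
  if (typeof orderId !== "string" || orderId.length === 0) {
    throw new Error("charge_customer: orderId must be a non-empty string");
  }
  if (typeof amountCents !== "number" || !Number.isInteger(amountCents) || amountCents <= 0) {
    throw new Error("charge_customer: amountCents must be a positive integer");
  }
  return { orderId, amountCents };
}

const toolRegistry = new Map<string, ToolDefinition<unknown>>([
  [
    "charge_customer",
    {
      name: "charge_customer",
      sideEffectClass: "payment",
      maxConcurrency: 1,
      timeoutMs: 10_000,
      parseArgs: parseChargeArgs,
    },
  ],
]);

// Map a model-proposed tool call to a registered handler with validated arguments.
function resolveToolCall(toolName: string, rawArgs: unknown): { tool: ToolDefinition<unknown>; args: unknown } {
  const tool = toolRegistry.get(toolName);
  if (!tool) throw new Error(`unregistered tool: ${toolName}`);
  return { tool, args: tool.parseArgs(rawArgs) };
}
```

The model can only name tools that exist in the map, and malformed arguments fail at `parseArgs` rather than at the payment provider.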

Practical example: a minimal saga runner for tools

Below is a simplified TypeScript sketch of a linear saga executor with forward steps, compensations on failure, and idempotent keys. It is not a full workflow engine; it shows how to keep orchestration deterministic while the LLM only selects a plan name.

import { randomUUID } from "node:crypto";

type StepContext = {
  workflowId: string;
  stepIndex: number;
  signal: AbortSignal;
};

type Step<TIn, TOut> = {
  name: string;
  forward: (input: TIn, ctx: StepContext) => Promise<TOut>;
  compensate?: (input: TIn, ctx: StepContext) => Promise<void>;
};

type SagaDefinition<TIn> = {
  name: string;
  steps: Step<TIn, unknown>[];
};

async function runSaga<TIn>(def: SagaDefinition<TIn>, input: TIn, signal: AbortSignal): Promise<void> {
  const workflowId = randomUUID();
  const completed: { step: Step<TIn, unknown>; input: TIn }[] = [];

  for (let i = 0; i < def.steps.length; i++) {
    const step = def.steps[i];
    const ctx: StepContext = { workflowId, stepIndex: i, signal };
    try {
      await step.forward(input, ctx);
      completed.push({ step, input });
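    } catch (err) {
      // Backward recovery: compensate completed steps in reverse order.
      for (let j = completed.length - 1; j >= 0; j--) {
        const { step: s, input: inRef } = completed[j];
        if (!s.compensate) continue;
        try {
          await s.compensate(inRef, { ...ctx, stepIndex: j });
        } catch (compErr) {
          // A failed compensation must not skip the remaining ones or mask the
          // original error; a real runner would persist it for retry or a DLQ.
          console.error(`compensation failed for step "${s.name}"`, compErr);
        }
      }
      throw err;
    }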
  }
}

/** Example: idempotency key helper — never trust model text for identity */
function idempotencyKey(workflowId: string, stepIndex: number, stepName: string) {
  return `${workflowId}:${stepIndex}:${stepName}`;
}

// --- Placeholder external calls (replace with real clients) ---
declare function chargeCard(amount: number, key: string): Promise<void>;
declare function voidCharge(key: string): Promise<void>;
declare function createLabel(orderId: string, key: string): Promise<void>;
declare function voidLabel(orderId: string, key: string): Promise<void>;

const checkoutSaga: SagaDefinition<{ orderId: string; amount: number }> = {
  name: "checkout",
  steps: [
    {
      name: "charge",
      async forward(input, ctx) {
        const key = idempotencyKey(ctx.workflowId, ctx.stepIndex, "charge");
        await chargeCard(input.amount, key);
      },
      async compensate(input, ctx) {
        const key = idempotencyKey(ctx.workflowId, ctx.stepIndex, "charge");
        await voidCharge(key);
      },
    },
    {
      name: "label",
      async forward(input, ctx) {
        const key = idempotencyKey(ctx.workflowId, ctx.stepIndex, "label");
        await createLabel(input.orderId, key);
      },
      async compensate(input, ctx) {
        const key = idempotencyKey(ctx.workflowId, ctx.stepIndex, "label");
        await voidLabel(input.orderId, key);
      },
    },
  ],
};

In a real service, runSaga would persist workflowId and step outcomes, honor deadlines and cancellation through signal, and map errors to user-visible states. The LLM’s job ends at choosing checkoutSaga (or a validated parameter object); execution stays in typed code.

Trade-offs and limitations

Compensation ordering is not always the exact reverse of forward order when steps have non-linear dependencies (for example, holds that depend on inventory checks). Graph-based sagas need explicit dependency edges; do not let the model invent arbitrary orderings for irreversible steps.
Long-running compensations (refunds pending for days) do not fit a single HTTP request; model them as async workflows with polling and human review queues.
Exactly-once is still a myth. You aim for idempotent effects and observable at-least-once processing.
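For the graph-based case, one possible approach is to declare dependency edges in code and derive the compensation order from them: a step is compensated only after every step that depends on it. The sketch below assumes hypothetical step names and a simple layered topological sort; it is an illustration, not a workflow engine.

```typescript
// Hypothetical saga graph; step names and edges are assumptions for illustration.
type SagaNode = { name: string; dependsOn: string[] };

// Topological order: a step appears only after all of its dependencies.
function forwardOrder(nodes: SagaNode[]): string[] {
  const known = new Set(nodes.map((n) => n.name));
  for (const node of nodes) {
    for (const dep of node.dependsOn) {
      if (!known.has(dep)) throw new Error(`unknown dependency: ${dep}`);
    }
  }
  const placed: string[] = [];
  while (placed.length < nodes.length) {
    // Every node whose dependencies are already placed is ready in this round.
    const ready = nodes.filter(
      (n) => !placed.includes(n.name) && n.dependsOn.every((d) => placed.includes(d)),
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    placed.push(...ready.map((n) => n.name));
  }
  return placed;
}

// Compensate dependents before the steps they depend on, skipping steps that never ran.
function compensationOrder(nodes: SagaNode[], completed: ReadonlySet<string>): string[] {
  return forwardOrder(nodes)
    .reverse()
    .filter((name) => completed.has(name));
}

const checkoutGraph: SagaNode[] = [
  { name: "reserve_inventory", dependsOn: [] },
  { name: "authorize_payment", dependsOn: ["reserve_inventory"] },
  { name: "create_label", dependsOn: ["reserve_inventory"] },
  { name: "send_email", dependsOn: ["authorize_payment", "create_label"] },
];
console.log(compensationOrder(checkoutGraph, new Set(["reserve_inventory", "authorize_payment"])));
```

The order comes entirely from the declared edges, so a model-proposed plan cannot change which compensations run first.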

Common mistakes and pitfalls

Trusting the model to emit idempotency keys. Keys must be derived from server-side workflow identity.
Compensations that throw unhandled errors, leaving the workflow stuck between states. Compensations need their own retry and DLQ policy.
Charging before all validations complete because the narrative “reads better” that way. Order steps so the cheapest, safest validations run first when business rules allow.
Surfacing raw provider errors to end users during partial failure. Map to stable error codes; log provider payloads internally.
Omitting metrics on partial sagas. You want dashboards for “payment without shipment” rates—the earliest warning of integration drift.
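Two of these pitfalls lend themselves to a short sketch: mapping provider failures to stable codes, and deriving a metric label for partial runs. The error codes, provider names, HTTP status mapping, and label format below are all assumptions made for illustration; adapt them to your own providers and metrics system.

```typescript
// Hypothetical stable error codes exposed to end users.
type StableErrorCode = "PAYMENT_DECLINED" | "CARRIER_UNAVAILABLE" | "INTERNAL_ERROR";

// Raw provider failure; the body stays in internal logs only.
type ProviderError = { provider: string; httpStatus: number; body: string };

// Map provider failures to stable codes. The status mapping is an assumption.
function toUserError(err: ProviderError): { code: StableErrorCode; retryable: boolean } {
  if (err.provider === "carrier" && err.httpStatus >= 500) {
    return { code: "CARRIER_UNAVAILABLE", retryable: true };
  }
  if (err.provider === "payments" && err.httpStatus === 402) {
    return { code: "PAYMENT_DECLINED", retryable: false };
  }
  return { code: "INTERNAL_ERROR", retryable: false };
}

// Build a metric label such as "charge_without_label" for a partial run.
// Returns null when the run fully succeeded or nothing succeeded at all.
function partialStateLabel(steps: { name: string; succeeded: boolean }[]): string | null {
  const done = steps.filter((s) => s.succeeded).map((s) => s.name);
  const missing = steps.filter((s) => !s.succeeded).map((s) => s.name);
  if (done.length === 0 || missing.length === 0) return null;
  return `${done.join("+")}_without_${missing.join("+")}`;
}

const userError = toUserError({ provider: "carrier", httpStatus: 503, body: "upstream timeout" });
const metricLabel = partialStateLabel([
  { name: "charge", succeeded: true },
  { name: "label", succeeded: false },
]);
console.log(userError.code, metricLabel);
```

Counting occurrences of each label gives the "payment without shipment" rate directly, without parsing logs after the fact.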

Conclusion

LLM tool pipelines fail like distributed systems: partial success, duplicate delivery, and ambiguous user-visible state. Treat tool execution as durable workflows with explicit forward steps, idempotent external calls, and compensations or forward completion policies defined in code—not in prose. Persist enough state to re-drive work without relying on chat history, and instrument partial outcomes before users open support tickets.

If you are hardening an agentic backend or designing APIs that will sit behind model-driven clients, getting this layer right early pays down operational debt quickly. For background on engineering focus areas, see About; for collaboration or questions about building production-ready systems, use Contact.
