Designing batch HTTP APIs: partial success, limits, and database-safe writes

How to design bulk create/update endpoints with clear contracts, bounded work, idempotent retries, and PostgreSQL-friendly patterns—without turning one request into an unbounded incident.

Author: Matheus PalmaMay 31, 202610 min read

Software engineeringArchitectureBackendAPI designPostgreSQLReliability

A partner integration ships a nightly job that POSTs ten thousand line items in a single JSON array. Your handler opens one transaction, locks hot rows, and runs for ninety seconds before the load balancer returns 504 Gateway Timeout. The client retries the same payload. Half the rows were inserted; the other half never ran—but the client has no per-item outcomes and no stable idempotency scope. In consulting work on B2B and internal APIs, this pattern appears constantly: teams add a “batch endpoint” for convenience and inherit ambiguous semantics, unbounded latency, and duplicate side effects on retry.

Batch APIs are not “a loop around the single-resource handler.” They are a separate contract: how much work you accept, what happens when item 417 fails validation, whether partial progress is visible, and how retries interact with idempotency. This article walks through that contract, implementation patterns that keep databases healthy, and mistakes that turn a bulk shortcut into an outage multiplier.

Why batch endpoints exist—and what they optimize for

Clients batch for predictable reasons:

Round-trip amortization — mobile apps and ETL jobs pay TLS and HTTP overhead per call; sending 50 records in one request cuts chatter.
Atomic business intent — “apply this price list to these SKUs” may need all-or-nothing semantics for accounting, or deliberate partial application for operations dashboards.
Throughput ceilings — upstream systems cap requests per second; larger payloads stay under the cap while moving more rows.

Your API should state which goal you serve. A batch optimized for throughput (independent items, partial success OK) looks different from one optimized for atomicity (single transaction, fail entire batch on any error). Mixing both without documentation is how integrators build the wrong retry logic.

Choose a response model before writing handlers

All-or-nothing (atomic batch)

One validation failure or database constraint violation rolls back the entire batch. Return 400 or 422 with a single problem description, or 409 when concurrency conflicts.

When it fits: financial adjustments, entitlement grants where partial state is illegal, small batches (tens of items) with strict invariants.

Trade-offs: long transactions, larger lock footprints, and total failure when one bad row blocks thousands of good ones. Clients must fix the whole payload and retry.

Partial success (per-item outcomes)

Each item gets its own status: created, updated, skipped, or failed with a reason. HTTP status is often 207 Multi-Status (WebDAV) or 200/422 with a structured body—pick one and document it consistently.

When it fits: catalog sync, CRM imports, inventory deltas where operators fix failed rows and resubmit only those.

Trade-offs: clients must merge outcomes into local state; you need stable client-supplied ids or idempotency per item so retries do not duplicate successes.

Asynchronous batch (accepted, processed later)

202 Accepted with a batchId, webhook or poll for completion. The HTTP request only enqueues work.

When it fits: large imports, CPU-heavy validation, virus scanning, or anything that exceeds gateway timeouts (often 30–60 s).

Trade-offs: more moving parts (job store, status API, deduplication on enqueue). This is the right default once batch size or per-item work makes synchronous handling fragile.

Model	Typical max items	Retry story	DB pressure
Atomic	Low (10–100)	Retry whole batch with idempotency key	One long TX
Partial sync	Medium (100–500)	Retry failed ids only; per-item keys	Short TX per chunk or SAVEPOINT
Async	High (1k+)	Idempotent enqueue by batch key	Worker chunks

Request design: bounds, identity, and shape

Hard limits

Publish and enforce:

Maximum items per request (e.g. 100 sync, 500 async enqueue).
Maximum request body size at the edge (nginx, API gateway) aligned with that limit.
Per-item payload size where nested blobs matter.

Return 413 Payload Too Large or 422 with a clear code when limits are exceeded—do not start processing and fail halfway.

Stable identity per item

Require a client-supplied externalId, clientMutationId, or similar on every element. The server maps these to internal primary keys. Retries then mean: “for externalId sku-9, return the outcome I already stored,” not “insert another row.”

Pair batch-level Idempotency-Key (for the HTTP request as a whole) with per-item keys when items can be retried independently in later calls.

Flat array vs envelope

Prefer an explicit envelope:

{
  "items": [ { "externalId": "a1", "sku": "WIDGET-1", "qty": 3 } ],
  "options": { "continueOnError": true }
}

continueOnError (or separate endpoints) avoids overloading one route with incompatible semantics.

Database-safe implementation patterns

Never one giant transaction for unbounded N

For partial-success batches, holding a transaction open across thousands of rows blocks vacuum, bloats replication lag, and couples unrelated tenants if you share tables. Patterns that scale:

Chunked commits — process items in sub-batches of 25–100 inside separate transactions; return aggregated results. Atomicity is per chunk, not global unless the product requires it.

SAVEPOINT per item — within one transaction, SAVEPOINT item_417; on failure, ROLLBACK TO SAVEPOINT item_417 and record the error; commit once at the end. Useful for medium batches where you want a single commit but isolated item failures. Watch transaction duration and lock time.

Staging table — COPY or bulk INSERT into import_staging, validate in SQL, then INSERT ... SELECT into production tables in one controlled migration step. Excellent for large imports and auditability.

Upsert instead of check-then-insert

Race-prone pattern: SELECT then INSERT. Under concurrency, two batch workers duplicate rows. Use INSERT ... ON CONFLICT (PostgreSQL) or equivalent with a unique constraint on (tenant_id, external_id).

INSERT INTO catalog_items (tenant_id, external_id, sku, qty)
VALUES ($1, $2, $3, $4)
ON CONFLICT (tenant_id, external_id)
DO UPDATE SET sku = EXCLUDED.sku, qty = EXCLUDED.qty, updated_at = now()
RETURNING id, (xmax = 0) AS inserted;

RETURNING plus (xmax = 0) (or a trigger) lets you report created vs updated in partial-success responses.

Ordering and foreign keys

If items reference each other (parent category before child SKU), either require topological order in the payload, sort server-side with explicit dependency errors, or use a two-pass import (nodes first, edges second). Undocumented ordering guarantees produce flaky partial batches.

HTTP semantics and client ergonomics

Status codes

200 OK — synchronous batch completed; body describes per-item results (even if some items failed, if your contract treats that as “completed processing”).
207 Multi-Status — uncommon in JSON APIs but valid; some clients lack native support—document whether you use it.
422 Unprocessable Content — entire request rejected (schema, auth, or policy).
202 Accepted — async batch queued; include Location or batchId.
409 Conflict — idempotency key reused with different body, or version conflict on the batch resource.

Use Problem Details (RFC 9457) for top-level failures; nest per-item problems in an array with externalId and a stable code.

Response shape sketch

{
  "batchId": "b_01HYX...",
  "summary": { "total": 3, "succeeded": 2, "failed": 1 },
  "results": [
    { "externalId": "a1", "status": "created", "resourceId": "ci_9" },
    { "externalId": "a2", "status": "updated", "resourceId": "ci_10" },
    {
      "externalId": "a3",
      "status": "failed",
      "error": {
        "type": "https://api.example.com/problems/validation",
        "title": "Invalid quantity",
        "detail": "qty must be positive",
        "field": "qty"
      }
    }
  ]
}

Clients can persist results and retry only failed entries in a follow-up call.

Practical example: Node.js handler with per-item SAVEPOINTs

The following sketch is illustrative—production code adds auth, metrics, rate limits, and structured logging. It processes a bounded batch with partial success inside one transaction using savepoints.

import type { Pool, PoolClient } from "pg";

type ItemInput = { externalId: string; sku: string; qty: number };
type ItemResult =
  | { externalId: string; status: "created" | "updated"; resourceId: string }
  | { externalId: string; status: "failed"; error: { code: string; detail: string } };

const MAX_ITEMS = 100;

export async function postCatalogBatch(
  pool: Pool,
  tenantId: string,
  items: ItemInput[],
): Promise<{ summary: { total: number; succeeded: number; failed: number }; results: ItemResult[] }> {
  if (items.length === 0 || items.length > MAX_ITEMS) {
    throw new BatchValidationError("items length must be 1..100");
  }

  const client = await pool.connect();
  const results: ItemResult[] = [];

  try {
    await client.query("BEGIN");

    for (const item of items) {
      const sp = `sp_${item.externalId.replace(/[^a-zA-Z0-9]/g, "_")}`;
      await client.query(`SAVEPOINT ${sp}`);

      try {
        if (item.qty <= 0) {
          throw new Error("qty must be positive");
        }

        const { rows } = await client.query<{ id: string; inserted: boolean }>(
          `
          INSERT INTO catalog_items (tenant_id, external_id, sku, qty)
          VALUES ($1, $2, $3, $4)
          ON CONFLICT (tenant_id, external_id)
          DO UPDATE SET sku = EXCLUDED.sku, qty = EXCLUDED.qty, updated_at = now()
          RETURNING id::text, (xmax = 0) AS inserted
          `,
          [tenantId, item.externalId, item.sku, item.qty],
        );

        await client.query(`RELEASE SAVEPOINT ${sp}`);
        results.push({
          externalId: item.externalId,
          status: rows[0].inserted ? "created" : "updated",
          resourceId: rows[0].id,
        });
      } catch (err) {
        await client.query(`ROLLBACK TO SAVEPOINT ${sp}`);
        results.push({
          externalId: item.externalId,
          status: "failed",
          error: {
            code: "item_rejected",
            detail: err instanceof Error ? err.message : "unknown error",
          },
        });
      }
    }

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }

  const succeeded = results.filter((r) => r.status !== "failed").length;
  return {
    summary: { total: items.length, succeeded, failed: items.length - succeeded },
    results,
  };
}

class BatchValidationError extends Error {
  readonly status = 422;
}

For larger batches, replace the single transaction with chunked transactions or an async worker; savepoints do not shrink total work—they only isolate failures.

Idempotency store for the whole request

When the client sends Idempotency-Key, persist the entire response body keyed by (tenantId, idempotencyKey) before returning. On replay, return the stored body without re-executing. Scope idempotency to the batch envelope; per-item externalId upserts still protect against duplicate rows if someone retries without the header.

Common mistakes and pitfalls

Unbounded items arrays. Without limits, one request becomes a denial-of-service vector against your API and database.

Silent partial failure. Returning 200 with only a log line for failed rows forces clients to guess. Always return machine-readable per-item status.

Retrying the full batch after partial success. Without per-item idempotency, retries duplicate created rows or double-charge. Document: retry failed externalIds only, or reuse the same batch idempotency key to get the same response.

Long transactions on hot tables. Even with savepoints, holding locks on a popular parent row (e.g. orders) while processing hundreds of line items blocks other traffic. Short transactions or staging tables are usually safer.

Omitting rate limits on batch routes. One batch call might equal hundreds of single-resource calls—weight it in your rate limiting and admission control policies.

Using batch for read fan-out. Clients sometimes POST ids to “batch fetch.” Prefer GET with careful caching, GraphQL DataLoader patterns on the server, or a dedicated query endpoint—POST reads complicate caching and CDNs.

Conclusion

Batch HTTP APIs are a productivity tool for integrators and a concentrated risk for operators. Success means choosing atomic vs partial vs async semantics up front, bounding work, giving every item a stable identity, and implementing writes so retries and concurrency do not corrupt data.

Key takeaways:

Publish limits and response models; do not overload one endpoint with incompatible guarantees.
Prefer upserts and staging tables over ad hoc loops that race under load.
Return per-item outcomes clients can merge; pair batch idempotency keys with per-item externalId discipline.
Move large or slow batches to async processing before the gateway timeout chooses your architecture for you.

Teams building partner-facing or internal platforms often underestimate how much batch design affects incident frequency. Getting the contract right early—before the first ten-thousand-row payload in production—saves weeks of reconciliation scripts and angry integrator threads. If you are shaping APIs or data pipelines for production scale, the contact page is the right place to start a conversation.

Get an email when new articles are published. No spam — only new posts from this blog.