Multi-turn LLM backends: session state, retrieval, and tool-call loops without losing the thread
Session-backed LLM APIs: durable turns, bounded context, RAG outside the transcript, tool-call round-trips, and per-session isolation—patterns from production assistants.
You expose a copilot-style endpoint: the client sends the latest user message, you append it to “the conversation,” call the model, and return the reply. Week two, someone opens two browser tabs; week three, a mobile client reconnects mid-stream; week four, you add document search and tool calls. Suddenly the same user sees answers that reference another tab’s topic, token limits explode, or the model loops on a malformed tool payload. The product feels haunted, but the cause is mundane: the server never defined what “a conversation” is, how it is stored, or how it evolves across turns.
This article walks through stateful multi-turn LLM backends the way you would design any long-lived workflow: explicit identity for sessions, bounded context windows, deterministic tool round-trips, and isolation under concurrency. The ideas apply whether you host models yourself or call a vendor API; the hard parts are data modeling and control flow, not the HTTP wrapper.
Why “just send the full history” stops working
Chat APIs are stateless. Every completion is a function of the messages you pass on this request. That convenience hides three costs:
- Unbounded growth — Storing and re-sending every token from day one hits context limits, increases latency, and raises spend linearly with tenure, not with usefulness.
- Weak boundaries — Without a stable session_id, "history" is whatever the client concatenates, which is easy to spoof or corrupt.
- Nonlinear control flow — Retrieval and tools introduce branches: the model may need another hop before answering the user. Your HTTP handler is no longer "one request → one model call."
Production assistants need a session record you own: durable, queryable, and small enough to fit the model’s window after compaction (trimming, summarizing, or both).
Session model: what to persist
At minimum, persist structured turns rather than a single markdown blob. A turn typically includes:
- Role — system, user, assistant, and often tool for function results.
- Content — Text and/or structured tool-call payloads, depending on provider format.
- Metadata — Model id, token counts, retrieval citations, idempotency key, correlation id for tracing.
Why structured storage matters: when you implement summarization or redaction, you need to rewrite parts of the transcript without regex surgery. In freelance and consulting work on assistant backends, the teams that skipped this step paid for it when they had to add compliance features (PII stripping, export) months later.
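As a concrete sketch, a persisted turn might look like the shape below. The field names (seq, citations, clientMessageId) are illustrative, not a fixed schema; the point is that structured rows make later rewrites, such as redacting user content, a targeted transform rather than regex surgery:

```typescript
// Illustrative shape for a persisted turn; adapt fields to your own schema.
type StoredTurn = {
  sessionId: string;
  seq: number; // monotonic order within the session
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  metadata: {
    modelId?: string;
    tokenCount?: number;
    citations?: string[];
    clientMessageId?: string; // idempotency key from the client
  };
};

// Redaction becomes a targeted rewrite over user rows, not a blob edit.
function redactUserTurns(
  turns: StoredTurn[],
  redact: (text: string) => string,
): StoredTurn[] {
  return turns.map((t) =>
    t.role === "user" ? { ...t, content: redact(t.content) } : t,
  );
}
```

The same structure makes export and PII-stripping jobs straightforward: they filter by role and rewrite content without touching assistant or tool rows.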
System prompts: versioned, not embedded per row
Treat the system message as configuration, not chat history. Store a prompt_version or hash on the session so you can:
- Reproduce old conversations for debugging.
- Migrate users to new policies without rewriting every row.
If you mutate the system prompt every request, you lose reproducibility and make A/B tests meaningless.
Context windows: trimming vs summarizing
Two complementary strategies:
Trimming (cheap, lossy)
Drop oldest turns until estimated tokens fit under a budget, always keeping the latest user goal and recent tool results. Fast and predictable, but forgets early constraints—fine for FAQ bots, risky for legal or coding assistants where requirements accumulate.
Summarization (expensive, smoother)
Periodically replace the oldest N turns with a short rolling summary produced by the same or a smaller model. Better long-horizon coherence, but introduces summary drift (details vanish) and another failure mode if summarization itself fails.
Best practice: combine them—keep the last K verbatim turns, summarize everything older, and cap tool outputs (store full outputs in object storage or a table; pass truncated excerpts to the model).
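A sketch of that hybrid: keep the last K turns verbatim and replace everything older with a single summary row. The summarize callback stands in for a model call here (an assumption; in production it would be a smaller model or a background job):

```typescript
type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Keep system rows and the last `keepVerbatim` turns as-is; replace
// everything older with one synthetic summary row.
async function compact(
  messages: Msg[],
  keepVerbatim: number,
  summarize: (older: Msg[]) => Promise<string>,
): Promise<Msg[]> {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  if (rest.length <= keepVerbatim) return messages;
  const older = rest.slice(0, rest.length - keepVerbatim);
  const recent = rest.slice(rest.length - keepVerbatim);
  const summary = await summarize(older);
  return [
    ...system,
    { role: "assistant", content: `Summary of earlier turns: ${summary}` },
    ...recent,
  ];
}
```

Because the summary row replaces the older turns rather than appending to them, repeated compaction keeps the transcript near a constant size instead of growing a chain of summaries.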
Retrieval (RAG) inside a session
When you inject retrieved chunks, attach them as ephemeral context for that turn, not as fake user history. Reasons:
- You do not want retrieval noise permanently written into the transcript.
- Citations should be tied to this answer, not replayed verbatim on unrelated follow-ups.
A practical pattern:
- Load session messages (post-compaction).
- Run retrieval using the latest user text + optional session summary.
- Build a single augmented user message or a dedicated context block your prompt template documents.
- After the assistant responds, persist citations as metadata on the assistant row, not as additional fake turns—unless your product explicitly shows "sources" as part of the UX transcript.
Trade-off: more moving parts versus cleaner logs. Cleaner logs win when you need audits or support escalation.
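The ephemeral-context step above can be sketched as follows, with a made-up Chunk shape: the retrieved text rides along for this model call only, while citations come back separately for metadata storage on the assistant row:

```typescript
type Chunk = { docId: string; text: string };

// Build the augmented message sent to the model for THIS turn only;
// nothing here is persisted as transcript history.
function buildAugmentedTurn(
  userText: string,
  chunks: Chunk[],
): { modelMessage: string; citations: string[] } {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.docId}) ${c.text}`)
    .join("\n");
  const modelMessage = [
    "Context (may be irrelevant; cite by [n] when used):",
    context,
    "",
    `User question: ${userText}`,
  ].join("\n");
  return { modelMessage, citations: chunks.map((c) => c.docId) };
}
```

On the next turn you rebuild this message from fresh retrieval results; only the plain user text and the assistant's answer (plus citation metadata) live in the durable transcript.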
Tool calls: close the loop deterministically
Function calling introduces a state machine:
- Model returns a tool_calls payload (or equivalent).
- Your server executes tools synchronously or via a job, never trusting the model to call your network directly.
- You append tool role messages with normalized JSON results.
- You call the model again until it returns a final assistant message without pending tools, or you hit a max-rounds guard.
Why max rounds matters: Models can oscillate (“call search again with a trivially different query”). A hard cap plus structured logging prevents runaway cost.
Idempotency: If the client retries the HTTP request, you must not double-append user turns or double-charge tool side effects. Use a client_message_id (UUID from the client) or idempotency key stored with a unique constraint so retries are a no-op at the persistence layer.
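An in-memory sketch of that persistence-layer guard; a real system would back this with a database unique constraint on (session_id, client_message_id) rather than a Map, which also closes the race this single-process version has between concurrent retries:

```typescript
type TurnStore = Map<string, string>; // sessionId:clientMessageId → cached reply

// If this clientMessageId was already processed, return the cached reply;
// otherwise run the handler once and record its result. Retries become no-ops.
async function appendIdempotent(
  store: TurnStore,
  sessionId: string,
  clientMessageId: string,
  handle: () => Promise<string>,
): Promise<{ reply: string; cached: boolean }> {
  const key = `${sessionId}:${clientMessageId}`;
  const cached = store.get(key);
  if (cached !== undefined) return { reply: cached, cached: true };
  const reply = await handle();
  store.set(key, reply);
  return { reply, cached: false };
}
```

The cached-reply return matters for UX as well as billing: a retried request gets the same assistant answer instead of a second, possibly different, generation.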
Concurrency and isolation
Two tabs means two logical sessions or explicit branches—never share one session id across unrelated threads unless your product definition says so.
Within one session, serialize writes per session key (row lock, advisory lock, or single-threaded actor) so interleaved requests cannot reorder turns. LLMs are sensitive to message order; a race that inserts an assistant message before its tool results will produce nonsense and hard-to-debug 400s from the provider.
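A minimal in-process serializer for that per-session write ordering, built on promise chaining (a sketch; a multi-node deployment would need a database row lock or advisory lock instead):

```typescript
// Per-key mutex via promise chaining: tasks for one session run in order,
// while different sessions proceed in parallel.
const chains = new Map<string, Promise<unknown>>();

function withSessionLock<T>(sessionId: string, fn: () => Promise<T>): Promise<T> {
  const prev = chains.get(sessionId) ?? Promise.resolve();
  const next = prev.then(fn, fn); // run even if the previous task failed
  chains.set(sessionId, next.catch(() => undefined)); // keep the chain alive
  return next;
}
```

Wrapping every append in withSessionLock(sessionId, ...) guarantees a tool-result row can never land before the assistant row that requested it, even when two requests for the same session interleave.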
Practical example: session append with tool round-trip
The following sketch shows a session-scoped append, a simple token budget, and a two-phase model loop with tools. It omits vendor-specific SDK details so you can map shapes to OpenAI, Anthropic, or open-weight stacks.
type Role = "system" | "user" | "assistant" | "tool";
type ChatMessage = {
role: Role;
content: string;
toolCallId?: string;
};
type ToolDef = {
name: string;
description: string;
jsonSchema: Record<string, unknown>;
};
type ToolExecutor = (name: string, args: unknown) => Promise<unknown>;
const MAX_TOOL_ROUNDS = 4;
const MAX_CONTEXT_CHARS = 24_000;
function roughCharBudget(messages: ChatMessage[]): number {
return messages.reduce((n, m) => n + m.content.length, 0);
}
function trimForBudget(messages: ChatMessage[], budget: number): ChatMessage[] {
const sys = messages.filter((m) => m.role === "system");
const rest = messages.filter((m) => m.role !== "system");
while (rest.length > 2 && roughCharBudget([...sys, ...rest]) > budget) {
rest.shift(); // drop oldest non-system; replace with summarization in production
}
return [...sys, ...rest];
}
/** Replace with your provider: returns assistant text and/or tool calls. */
async function callModel(_input: {
messages: ChatMessage[];
tools: ToolDef[];
}): Promise<{ assistantText?: string; toolCalls?: { id: string; name: string; args: unknown }[] }> {
throw new Error("wire to provider");
}
export async function handleUserTurn(input: {
sessionId: string;
clientMessageId: string;
userText: string;
systemPrompt: string;
tools: ToolDef[];
loadMessages: (sessionId: string) => Promise<ChatMessage[]>;
saveMessages: (sessionId: string, append: ChatMessage[]) => Promise<void>;
executeTool: ToolExecutor;
}): Promise<string> {
const {
sessionId,
clientMessageId,
userText,
systemPrompt,
tools,
loadMessages,
saveMessages,
executeTool,
} = input;
// Idempotency: if clientMessageId already stored, return cached assistant reply (not shown).
const prior = await loadMessages(sessionId);
const userRow: ChatMessage = { role: "user", content: userText };
await saveMessages(sessionId, [userRow]);
let transcript = trimForBudget(
[{ role: "system", content: systemPrompt }, ...prior, userRow],
MAX_CONTEXT_CHARS,
);
for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
const { assistantText, toolCalls } = await callModel({ messages: transcript, tools });
if (toolCalls?.length) {
const assistantToolRow: ChatMessage = {
role: "assistant",
content: JSON.stringify({ toolCalls }),
};
await saveMessages(sessionId, [assistantToolRow]);
transcript = [...transcript, assistantToolRow];
for (const call of toolCalls) {
const result = await executeTool(call.name, call.args);
const toolRow: ChatMessage = {
role: "tool",
content: JSON.stringify(result),
toolCallId: call.id,
};
await saveMessages(sessionId, [toolRow]);
transcript = [...transcript, toolRow];
}
continue;
}
const final = assistantText ?? "";
await saveMessages(sessionId, [{ role: "assistant", content: final }]);
return final;
}
throw new Error("tool round budget exceeded");
}
In a real deployment you would add: precise token counting, summarization jobs, retrieval injection, tracing spans per round, and provider-specific validation of tool-call shapes.
Common mistakes and pitfalls
- Trusting client-supplied history — Always rebuild authoritative history from your database; treat client payloads as hints at most.
- Writing retrieval blobs into permanent transcript — Pollutes future turns and confuses debugging; keep RAG inputs ephemeral or metadata-linked.
- Unbounded tool output — Paste a whole PDF into the next model call once and you learn why output caps exist.
- Missing tool-result ordering — Providers require tool messages adjacent to the assistant call that requested them; races break that invariant.
- No backoff on provider errors — Retries are not free; combine with circuit breaking as covered in other articles on resilience patterns for LLM APIs.
Conclusion
Stateful LLM features succeed when the backend owns the conversation graph: versioned system prompts, bounded context with explicit compaction, deterministic tool loops with round limits, and session isolation under concurrency. Those choices are what turn a demo endpoint into something you can operate, audit, and evolve—whether you are extending an existing product or helping a team ship a production-ready assistant from scratch.
If you are comparing approaches for your stack, see the articles on production LLM API integration and RAG pipelines for adjacent depth. For collaboration or architecture reviews, the contact page is the best place to reach out.