Privacy-conscious observability for LLM backends: logs, metrics, and traces without raw prompts
Instrument LLM-backed APIs without storing raw prompts: redaction, stable ids, token and latency metrics, and trace attributes that keep sensitive text out of logs and APM.
An incident starts with a spike in 5xx responses on your /v1/assist route. You open the log aggregator, search for error, and the first hit is a full user message pasted into a stack trace—because a developer logged input while debugging retrieval. Next page: a system prompt with internal product names. A third line contains a session cookie accidentally concatenated into the log context. You fixed the outage, but you may have just created a data-retention and compliance problem that outlasts the deploy.
LLM backends make this worse almost by default: prompts are long, richly structured, and full of PII, credentials, and proprietary context. Traditional “log everything and grep later” observability collides with minimization, least privilege for log readers, and the reality that log platforms are replicated, indexed, and rarely deleted with the same rigor as application databases.
This article explains how to observe LLM pipelines the way you would observe payment or health endpoints: measure outcomes and dependencies aggressively, capture identifiers and shapes, and treat raw language as sensitive data unless you have an explicit, reviewed retention policy.
Why raw prompt logging fails in production
Volume and cost. A single “debug” log line per request can be kilobytes of text. At thousands of QPS, log ingest bills and index latency become their own incident—and teams respond by sampling away the very signal they thought they were buying.
Security and prompt injection. Logs are another exfiltration channel. If an attacker can steer model output toward a format your logger prints, you have turned log shipping into a side channel. That is not hypothetical; it is the same trust-boundary problem as returning model text unsanitized to clients, only with longer retention and broader read access.
Regulatory and contractual ambiguity. Even when no specific law applies, enterprise customers routinely ask where prompts land, who can read them, and how long they persist. “We might log prompts in Datadog” is a hard sentence to walk back in a security review.
The goal is not zero visibility. It is visibility with an explicit threat model: operators can answer “what broke, for whom, under which configuration?” without needing the literal words a user typed.
A layered model: what to record at each tier
Think in four planes—metrics, structured logs, traces, and optional secure replay—each with different privacy properties.
Metrics: cardinality-aware, content-free
Counters and histograms should describe behavior, not content:
- llm_requests_total with labels such as route, model_id, tenant_tier, outcome (success, provider_error, validation_error, client_cancel).
- Latency histograms split by provider and streaming, not by user text.
- Token usage (prompt, completion, total) as numbers aggregated per minute or per tenant, not as strings.
High-cardinality labels (user_id on every histogram bucket) explode cost and can re-identify individuals when combined with other datasets. Prefer aggregated tenant metrics, or sampled exemplars tied to trace IDs where your backend supports it with governance.
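One way to keep labels bounded is to pass every label set through an allowlist before it reaches the metrics SDK. The sketch below is a minimal illustration, not a prescribed API: the helper name boundedLabels, the allowlist, and the fallback value "other" are assumptions made for this example.

```typescript
// Hypothetical allowlist: only these label keys may reach the metrics backend.
const ALLOWED_LABELS = ["route", "model_id", "tenant_tier", "outcome"] as const;
type LabelKey = (typeof ALLOWED_LABELS)[number];

// Outcome values taken from the list above; anything else is collapsed.
const OUTCOMES = new Set(["success", "provider_error", "validation_error", "client_cancel"]);

export function boundedLabels(raw: Record<string, unknown>): Partial<Record<LabelKey, string>> {
  const out: Partial<Record<LabelKey, string>> = {};
  for (const key of ALLOWED_LABELS) {
    const value = raw[key];
    if (typeof value !== "string") continue; // drop missing or non-string values
    // Collapse unexpected outcome values so a bug cannot mint new time series.
    out[key] = key === "outcome" && !OUTCOMES.has(value) ? "other" : value;
  }
  return out;
}
```

Keys outside the allowlist, such as user_id, are silently dropped, so a call site that passes too much context cannot widen cardinality by accident.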
Structured logs: identifiers, hashes, and bounded previews
Each request should log a stable correlation id (in Node services, typically propagated through AsyncLocalStorage), plus:
- Model and version (gpt-4.1-mini-2025-04-14 style identifiers, not “latest”).
- Route and feature flag snapshot (hash of flag state is enough if the map is large).
- Outcome and machine-readable error code after you map provider errors (see production LLM integration).
- Token counts and truncation flags (whether the prompt was clipped server-side).
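For the feature flag snapshot, a short digest of the flag map is often enough to tell two configurations apart. The following is a sketch under assumptions: the function name flagSnapshotHash, the flat flag-map shape, and the 16-character truncation are illustrative choices, not part of any flag library.

```typescript
import { createHash } from "node:crypto";

// Hash a feature-flag map so logs carry one short, comparable value
// instead of the whole map. Keys are sorted so insertion order does not matter.
export function flagSnapshotHash(flags: Record<string, boolean | string | number>): string {
  const canonical = Object.keys(flags)
    .sort()
    .map((k) => [k, flags[k]]);
  return createHash("sha256")
    .update(JSON.stringify(canonical), "utf8")
    .digest("hex")
    .slice(0, 16); // assumed length: short enough for a log field, long enough to compare
}
```

Two requests with the same digest ran under the same flag state; a changed digest across a failure boundary points at a configuration change rather than at user input.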
When you genuinely need semantic debugging signal without storing raw text, common patterns are:
- Cryptographic hash of the normalized prompt (sha256 of UTF-8 bytes after whitespace normalization), stored only if your security team accepts that as a pseudonymous identifier under your policy, not as a substitute for deleting PII at the source.
- Length and schema summary: prompt_chars=842, tool_count=2, retrieval_hits=4, citation_ids=[…].
- Redacted excerpt of at most N characters with fixed patterns for emails, credit cards, and JWT-like substrings, knowing that regex redaction is best-effort, not a guarantee.
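The hashing pattern can be sketched with Node's built-in crypto module. The function names are hypothetical, and the normalization rule (collapse whitespace runs, then trim) is one assumed reading of "whitespace normalization"; pick the rule once and keep it stable, or hashes stop being comparable across releases.

```typescript
import { createHash } from "node:crypto";

// Collapse runs of whitespace and trim, so cosmetic differences do not change the hash.
export function normalizePrompt(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// SHA-256 over the UTF-8 bytes of the normalized prompt, hex-encoded.
export function promptFingerprint(text: string): string {
  return createHash("sha256").update(normalizePrompt(text), "utf8").digest("hex");
}
```

A plain hash of a short or guessable prompt can be recovered by trying candidates, which is one reason the fingerprint is treated as pseudonymous rather than anonymous.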
Traces: spans that mirror dependency calls, not document bodies
Distributed tracing (W3C Trace Context and OpenTelemetry) excels at latency attribution: embedding call vs. vector search vs. completion vs. post-validation. Good practice for LLM routes:
- One span per outbound dependency (embed, vector_query, completion, cache_lookup).
- Attributes: latency, status, retry count, chunk count, HTTP status from the provider, not payloads.
- Events for lifecycle (first_token_received, stream_aborted) without attaching text.
If you must capture prompt fragments for a tightly controlled debugging cohort, use span processors that drop or hash attributes in the collector, and separate backends with stricter ACLs—not the default project where every engineer has read access.
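Collector-side processors are the backstop; the same idea can also be applied in the application before attributes are set on a span. The sketch below is a plain helper with no tracing SDK dependency. The name safeSpanAttributes, the key allowlist, and the 64-character bound are assumptions for illustration only.

```typescript
type AttrValue = string | number | boolean;

// Hypothetical allowlist of span attribute keys considered content-free.
const SAFE_ATTRIBUTE_KEYS = new Set([
  "llm.model",
  "llm.message_count",
  "llm.retry_count",
  "llm.chunk_count",
  "http.status_code",
]);
const MAX_STRING_LENGTH = 64; // assumed bound: identifiers are short, prose is not

export function safeSpanAttributes(raw: Record<string, unknown>): Record<string, AttrValue> {
  const safe: Record<string, AttrValue> = {};
  for (const [key, value] of Object.entries(raw)) {
    if (!SAFE_ATTRIBUTE_KEYS.has(key)) continue; // unknown keys are dropped
    if (typeof value === "number" || typeof value === "boolean") {
      safe[key] = value;
    } else if (typeof value === "string" && value.length <= MAX_STRING_LENGTH) {
      safe[key] = value;
    }
    // Objects, arrays, and over-long strings fall through and are dropped.
  }
  return safe;
}
```

An allowlist fails closed: a new attribute added by a well-meaning developer stays out of the trace backend until someone reviews and lists it.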
Secure replay: when operators truly need content
Sometimes only re-running a request reproduces a bug (nondeterministic model behavior, rare tool-call shapes). Instead of logging prompts:
- Persist encrypted, short-lived artifacts in an application-controlled store with explicit TTL, access logging, and break-glass approval; or
- Rely on staging reproduction with synthetic fixtures; or
- Use customer-approved support channels where they re-send minimal context.
That is heavier engineering than console.log, which is exactly why it belongs behind rare workflows—not the default path.
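As a rough illustration of the first option, the sketch below seals a payload with AES-256-GCM and refuses to open it after a TTL. Everything here is an assumption for the example: the ReplayArtifact shape, the function names, and the idea that the 32-byte key comes from a KMS or secret manager. Access logging, break-glass approval, and physical deletion of expired artifacts are left to the surrounding system.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

type ReplayArtifact = { iv: Buffer; tag: Buffer; ciphertext: Buffer; expiresAt: number };

// Encrypt a request payload with AES-256-GCM and stamp it with an expiry time.
export function sealReplay(payload: string, key: Buffer, ttlMs: number, now = Date.now()): ReplayArtifact {
  const iv = randomBytes(12); // 96-bit nonce, generated fresh for each artifact
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(payload, "utf8"), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), ciphertext, expiresAt: now + ttlMs };
}

// Refuse to decrypt after the TTL. The caller is expected to record who asked and why.
export function openReplay(artifact: ReplayArtifact, key: Buffer, now = Date.now()): string {
  if (now >= artifact.expiresAt) throw new Error("replay_artifact_expired");
  const decipher = createDecipheriv("aes-256-gcm", key, artifact.iv);
  decipher.setAuthTag(artifact.tag);
  return Buffer.concat([decipher.update(artifact.ciphertext), decipher.final()]).toString("utf8");
}
```

The expiry check only blocks reads through this code path; a scheduled job or store-level TTL still has to delete the ciphertext for the retention promise to hold.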
Designing redaction that survives refactors
Redaction logic scattered across handlers rots quickly. Centralize one outbound logging adapter used by LLM middleware:
// Central place: all LLM-related logs go through here.
// Stand-in logger for illustration only; swap in your own structured logger.
const logger = {
  info: (event: string, fields: object) => console.log(JSON.stringify({ event, ...fields })),
};

export type LlmLogContext = {
  correlationId: string;
  tenantId: string;
  route: string;
  model: string;
  /** Counts and flags only */
  promptStats: { chars: number; toolDefinitions: number; retrievalDocs: number };
  outcome: "success" | "validation_error" | "provider_error" | "canceled";
  providerStatus?: number;
  tokens?: { prompt?: number; completion?: number };
};

export function logLlmCompletion(ctx: LlmLogContext): void {
  // Sends structured JSON to the logger. The type has no field that can carry raw prompts.
  logger.info("llm_request_completed", ctx);
}
Call sites pass already summarized fields. The LLM client wrapper computes sizes and outcomes once, right at the boundary where you also validate tool calls (JSON Schema for tool calls).
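A minimal sketch of that boundary computation, producing the promptStats shape from LlmLogContext. The function name summarizePrompt and the loosely typed tool and document arrays are assumptions; a real wrapper would use the provider SDK's own types.

```typescript
type Message = { role: string; content: string };

type PromptStats = { chars: number; toolDefinitions: number; retrievalDocs: number };

// Reduce a request to counts only. No message text leaves this function.
export function summarizePrompt(
  messages: Message[],
  toolDefinitions: unknown[] = [],
  retrievalDocs: unknown[] = []
): PromptStats {
  return {
    chars: messages.reduce((total, m) => total + m.content.length, 0),
    toolDefinitions: toolDefinitions.length,
    retrievalDocs: retrievalDocs.length,
  };
}
```

Because the return type holds only numbers, passing its result into the logging adapter cannot leak text even if a call site hands it the full message array.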
Trade-offs
Hash-only debugging lets you confirm “the same shape of prompt failed twice” without exposing text, but it does not help you understand why unless you maintain offline fixtures keyed by hash.
Redacted excerpts help on-call engineers but carry residual risk; keep N small, block structured secrets entirely, and review patterns quarterly as new token formats appear.
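To make the redacted-excerpt option concrete, here is a best-effort sketch. The three patterns and the default of 80 characters are assumptions chosen for illustration; they will miss formats they were not written for, which is the residual risk described above.

```typescript
// Best-effort patterns, assumed for illustration and not exhaustive.
const REDACTIONS: [RegExp, string][] = [
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[email]"],
  // Three base64url segments starting with "eyJ", the usual shape of a JWT.
  [/\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g, "[jwt]"],
  // 13 to 19 digits, optionally separated by single spaces or hyphens.
  [/\b\d(?:[ -]?\d){12,18}\b/g, "[card]"],
];

// Redact first, then truncate, so a secret is never cut in half and left unmatched.
export function redactedExcerpt(text: string, maxChars = 80): string {
  let out = text;
  for (const [pattern, label] of REDACTIONS) out = out.replace(pattern, label);
  return out.length > maxChars ? out.slice(0, maxChars) + "…" : out;
}
```

The order of operations matters: truncating before redacting can slice a token so that the remaining prefix no longer matches any pattern and is logged as-is.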
Practical example: completion wrapper with metrics + trace-safe attributes
The following sketch wires metrics and span attributes around a completion call. It deliberately avoids logging message bodies; adjust to your metrics and tracing SDK.
import { trace, metrics, SpanStatusCode } from "@opentelemetry/api";

const meter = metrics.getMeter("assist-api");
const completionLatency = meter.createHistogram("llm.completion.latency_ms");
const completionTokens = meter.createHistogram("llm.completion.tokens_total");
const completionErrors = meter.createCounter("llm.completion.errors_total");

type CompletionInput = {
  messages: { role: string; content: string }[];
  model: string;
};

export async function completeInstrumented(
  input: CompletionInput,
  doCompletion: (input: CompletionInput) => Promise<{ text: string; usage?: { total_tokens?: number } }>
) {
  const tracer = trace.getTracer("assist-api");
  // Span attributes describe the shape of the request, never its text.
  const span = tracer.startSpan("llm.completion", {
    attributes: {
      "llm.model": input.model,
      "llm.message_count": input.messages.length,
      "llm.approx_prompt_chars": input.messages.reduce((n, m) => n + m.content.length, 0),
    },
  });
  const start = Date.now();
  try {
    const result = await doCompletion(input);
    // Latency is recorded for successful calls only; failures are counted in the catch block.
    const ms = Date.now() - start;
    completionLatency.record(ms, { model: input.model });
    if (result.usage?.total_tokens != null) {
      completionTokens.record(result.usage.total_tokens, { model: input.model });
    }
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (e) {
    completionErrors.add(1, { model: input.model, type: "provider" });
    span.setStatus({ code: SpanStatusCode.ERROR });
    span.recordException(e instanceof Error ? e : new Error("completion_failed"));
    throw e;
  } finally {
    // Always end the span so failed requests still show up in traces.
    span.end();
  }
}
Notice what is absent: input.messages[*].content never touches telemetry. If you need text for local development, guard it with NODE_ENV === "development" and stderr, not the shared pipeline used in staging and production—better still, use fixtures and record/replay against a mock provider.
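The development-only guard might look like the sketch below. The helper name debugDumpPrompt is hypothetical, and the env parameter exists only so the guard can be exercised in tests without mutating process.env.

```typescript
// Write prompt text to stderr only when NODE_ENV is exactly "development".
// Returns true if the guard allowed output, which makes the check easy to test.
export function debugDumpPrompt(
  messages: { role: string; content: string }[],
  env: string | undefined = process.env.NODE_ENV
): boolean {
  if (env !== "development") return false; // staging and production never print
  for (const m of messages) {
    process.stderr.write(`[dev-only] ${m.role}: ${m.content}\n`);
  }
  return true;
}
```

The check is an exact match rather than "not production", so an unset or misspelled NODE_ENV fails closed.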
Connecting to broader operational practice
None of this replaces SLOs or error budgets (practical SRE with SLOs); it feeds them with honest signals. Token and validation-error metrics become leading indicators for cost and user-visible quality drift before star ratings collapse.
In client engagements on scalable, production-ready AI features, the teams that sleep well are not those with the most verbose logs—they are the ones where on-call runbooks reference dashboards and trace filters that do not require reading user prose to localize a fault.
Common mistakes and pitfalls
Logging “just the first 200 characters” of the prompt — PII and secrets are not uniformly distributed; prefixes often include names, emails, and pasted URLs with tokens in query strings.
Dumping retrieval chunks into logs — RAG pipelines (chunking and evaluation) often lift entire documents into context. Logging “for debugging retrieval” duplicates copyrighted or regulated text into a new system of record.
Storing model outputs unredacted by default — completions may repeat inputs or leak tool results containing database rows. Treat completion text like user-generated content for logging purposes unless policy says otherwise.
Giving log pipelines the same IAM role as the database — Observability stacks rarely need read access to primary data stores; over-broad roles encourage “log the row we already had” shortcuts.
Assuming sampling fixes privacy — Sampling reduces volume; it does not make retained lines less sensitive. If anything, arbitrary sampling complicates DSAR workflows when you cannot guarantee complete erasure.
Conclusion
Privacy-conscious observability for LLM backends is not about seeing less—it is about choosing representations that answer operational questions without turning your log vendor into a second database of customer language. Metrics carry trends; structured logs carry ids, versions, sizes, and outcomes; traces carry timing and dependency behavior; raw text belongs in short-lived, access-controlled channels when it belongs anywhere at all.
Key takeaways:
- Treat prompts, retrieved documents, and completions as sensitive payloads in the logging threat model.
- Standardize one instrumentation path at the LLM client boundary so redaction cannot be bypassed accidentally.
- Favor hashes, counts, and stable codes over prose; pair them with offline evaluation and fixtures for deep debugging.
If you are designing production-ready AI systems and want observability aligned with security review expectations, the engineering background and focus areas on About may be a useful reference—Contact is the right place for architecture questions or collaboration on hardened backends.