From LLM demo to production: what actually breaks first

Most teams do not fail because the model is “not smart enough.” They fail because the shape of the problem changes the moment a prototype touches real traffic, real documents, and real incentives.

This post is a practical map of the failure modes we see most often when LLM-powered experiences move from a convincing demo to something customers rely on. It also covers what to build alongside the model so shipping stays boring (in a good way).

The demo is not the product

Demos optimize for a narrow path: a curated prompt, a friendly document set, and a reviewer who wants it to work. Production optimizes for everything else: ambiguous inputs, stale knowledge, edge cases nobody rehearsed, and users who will stress whatever shortcut you left open.

If your roadmap treats “integrate the API” as the hard part, you will discover late that behavior under uncertainty was the hard part all along.

Failure mode 1: correctness without definition

In classical software, “correct” usually maps to tests and typed contracts. With LLMs, outputs vary from run to run, so “correct” stays a matter of opinion until you define what good means for your domain.

Teams without an explicit bar ship features that feel magical on day one and flaky by week three, not because the base model regressed, but because nobody agreed how to measure drift.

What helps

Write down task-level success criteria (what must be true for an answer to count as acceptable).
Maintain golden sets: small, evolving suites of representative prompts and expected outcomes, not just single screenshots.
Track regressions when you change prompts, retrieval settings, tools, or models.

This is not academic overhead; it is how you keep iteration safe once more than one engineer touches the system.
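To make the golden-set idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the GoldenCase shape, the substring-based grader, the 0.02 tolerance, and the two stub functions standing in for real model calls. A real suite would likely use richer graders, but the loop has the same shape: score a baseline, score a candidate, and flag a drop.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One representative prompt plus the criteria an acceptable answer must meet."""
    prompt: str
    must_include: tuple = ()      # phrases an acceptable answer contains
    must_not_include: tuple = ()  # phrases that make an answer unacceptable


def passes(case: GoldenCase, answer: str) -> bool:
    """Case-insensitive substring check; a crude stand-in for a real grader."""
    text = answer.lower()
    has_required = all(p.lower() in text for p in case.must_include)
    has_forbidden = any(p.lower() in text for p in case.must_not_include)
    return has_required and not has_forbidden


def pass_rate(cases: Sequence[GoldenCase], generate: Callable[[str], str]) -> float:
    """Fraction of golden cases whose generated answer passes."""
    if not cases:
        raise ValueError("golden set is empty")
    return sum(passes(c, generate(c.prompt)) for c in cases) / len(cases)


def regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True when the candidate pass rate falls below baseline by more than tolerance."""
    return candidate < baseline - tolerance


# Hypothetical golden set and stub "models" so the sketch runs without an API.
GOLDEN_SET = [
    GoldenCase("What is the refund window?", ("30 days",), ("60 days",)),
    GoldenCase("Can I get a refund on gift cards?", ("not refundable",)),
]


def current_prompt(prompt: str) -> str:
    if "gift" in prompt:
        return "Gift cards are not refundable."
    return "Refunds are accepted within 30 days."


def candidate_prompt(prompt: str) -> str:
    return "Refunds are accepted within 60 days."


if __name__ == "__main__":
    baseline = pass_rate(GOLDEN_SET, current_prompt)
    candidate = pass_rate(GOLDEN_SET, candidate_prompt)
    print("regressed:", regressed(baseline, candidate))
```

Running the same comparison on every prompt, retrieval, tool, or model change is what turns "it feels worse" into a number two engineers can argue about.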

Failure mode 2: retrieval that looks fine until it is not

Retrieval-augmented generation (RAG) is sold as “grounded answers,” but grounding depends on chunking, indexing freshness, permissions, and ranking. Many incidents start as subtle contamination: the model cites the wrong section, merges two policies, or pulls an outdated doc that still matches keywords.

What helps

Treat retrieval as a data product: ownership, refresh cadence, and deletion matter as much as embeddings.
Log which chunks were retrieved for each answer and, where answers are required to stay within them, check that they did.
Separate “we don’t know” from confident-but-wrong; your UX and policies should reflect that distinction.
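One way to sketch the instrumentation described above, assuming a hypothetical output contract in which the model returns the IDs of the chunks it cites and can explicitly abstain. The Chunk and Answer types and the status labels are illustrative, not a standard. Note that this only checks citation bookkeeping; it does not verify that the answer text is actually supported by the cited chunks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    text: str


@dataclass
class Answer:
    text: str
    cited_chunk_ids: List[str] = field(default_factory=list)
    abstained: bool = False  # the model said "we don't know"


def check_attribution(answer: Answer, retrieved: List[Chunk]) -> dict:
    """Classify an answer by how its citations relate to the retrieved chunks."""
    retrieved_ids = {c.chunk_id for c in retrieved}
    cited = set(answer.cited_chunk_ids)
    unknown = sorted(cited - retrieved_ids)

    if answer.abstained:
        status = "abstained"          # honest "don't know": route to a fallback UX
    elif not cited:
        status = "ungrounded"         # answered without citing anything
    elif unknown:
        status = "cites_unretrieved"  # cited something retrieval never returned
    else:
        status = "grounded"

    return {
        "status": status,
        "retrieved": sorted(retrieved_ids),
        "cited": sorted(cited),
        "unknown_citations": unknown,
    }
```

Keeping "abstained" as its own status, rather than folding it into a generic failure, is what lets the UX treat "we don't know" differently from a confident answer that cannot be traced to a source.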

Failure mode 3: tool use becomes an accidental attack surface

Agents and tool-calling unlock real workflows and real risk. If permissions are coarse, prompts can chain into unintended actions. If errors are vague, the model may retry aggressively or hide failures behind plausible language.

What helps

Enforce least privilege at the tool boundary; prefer narrow tools over “do anything” endpoints.
Require human-readable audit trails for consequential actions.
Make failures structured so the model (and your UI) can recover without improvisation.
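The three points above can be sketched as a small tool registry. This is a minimal illustration under stated assumptions, not a reference design: the scope strings, error codes, and ToolRegistry class are all hypothetical names. The idea is that every call passes a permission check at the boundary, every outcome lands in a readable audit line, and every failure comes back as data the model or UI can branch on.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    data: Any = None
    error_code: Optional[str] = None  # machine-readable, e.g. "forbidden"
    retryable: bool = False           # tells the caller whether a retry makes sense
    message: str = ""


@dataclass(frozen=True)
class Tool:
    name: str
    required_scope: str               # one narrow scope per tool
    handler: Callable[..., Any]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}
        self.audit_log: List[str] = []

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, caller_scopes: Set[str], **kwargs: Any) -> ToolResult:
        tool = self._tools.get(name)
        if tool is None:
            result = ToolResult(ok=False, error_code="unknown_tool",
                                message=f"no tool named {name!r}")
        elif tool.required_scope not in caller_scopes:
            # Least privilege: refuse before the handler ever runs.
            result = ToolResult(ok=False, error_code="forbidden",
                                message=f"missing scope {tool.required_scope!r}")
        else:
            try:
                result = ToolResult(ok=True, data=tool.handler(**kwargs))
            except TimeoutError as exc:
                result = ToolResult(ok=False, error_code="timeout",
                                    retryable=True, message=str(exc))
            except Exception as exc:  # anything else is reported, not retried
                result = ToolResult(ok=False, error_code="tool_error",
                                    message=str(exc))
        # Human-readable audit line for every attempt, allowed or not.
        self.audit_log.append(
            f"tool={name} scopes={sorted(caller_scopes)} "
            f"ok={result.ok} code={result.error_code}"
        )
        return result
```

With a retryable flag on the result, the orchestration layer can cap retries in code instead of hoping the model decides to stop on its own.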

Failure mode 4: latency and cost cliffs

Interactive experiences collapse when p95 latency spikes. Batch-heavy architectures collapse when usage grows faster than budgets.

What helps

Design for streaming UX where partial results reduce perceived wait.
Cache stable retrieval contexts and repeated sub-queries where safe.
Budget per-request ceilings and degrade gracefully instead of silently burning margin.
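As a rough sketch of a per-request ceiling with graceful degradation: the tier names and token estimates below are hypothetical placeholders, and real systems would likely budget in currency or latency as well as tokens. The point is only that the fallback order is decided in code, ahead of time, rather than discovered on the invoice.

```python
from dataclasses import dataclass


@dataclass
class RequestBudget:
    """Token ceiling for a single user request, shared across all its model calls."""
    max_tokens: int
    spent_tokens: int = 0

    def remaining(self) -> int:
        return max(0, self.max_tokens - self.spent_tokens)

    def charge(self, tokens: int) -> None:
        self.spent_tokens += tokens


# Ordered most capable first. Names and token estimates are hypothetical.
TIERS = (
    ("large_model", 4000),
    ("small_model", 1000),
    ("canned_fallback", 0),
)


def choose_tier(budget: RequestBudget) -> str:
    """Pick the most capable tier whose estimated cost still fits the budget."""
    for name, estimated_tokens in TIERS:
        if estimated_tokens <= budget.remaining():
            return name
    return TIERS[-1][0]  # the zero-cost fallback always fits


if __name__ == "__main__":
    budget = RequestBudget(max_tokens=5000)
    print(choose_tier(budget))  # enough headroom for the large model
    budget.charge(4000)
    print(choose_tier(budget))  # degrades to the small model
```

Logging which tier served each request also makes the degradation visible, so it shows up in a dashboard rather than in customer complaints.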

Failure mode 5: “policy in the prompt”

Prompt instructions are necessary but not sufficient. Production systems need enforceable guardrails outside the model: schema validation for outputs, blocklists where appropriate, rate limits, content policies tied to real enforcement points, and escalation paths for sensitive domains.

Write prompts for behavior; put non-negotiables in code.
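For instance, an output contract can be checked in plain code before anything reaches the user. The sketch below assumes a hypothetical contract in which the model must return a JSON object with an "action" from a fixed set and a non-empty "text"; the field names, allowed actions, and reason codes are illustrative only.

```python
import json
from typing import Tuple

# Hypothetical contract: the only actions downstream code is willing to execute.
ALLOWED_ACTIONS = {"answer", "escalate", "refuse"}


def validate_output(raw: str) -> Tuple[bool, str]:
    """Return (is_valid, reason). Rejects anything outside the contract."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not_json"
    if not isinstance(payload, dict):
        return False, "not_object"

    action = payload.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return False, "bad_action"

    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        return False, "missing_text"

    return True, "ok"
```

A rejected output can then be retried, replaced with a safe default, or escalated, and that decision lives in code where it can be tested, not in a sentence of the prompt.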

A sane sequence from pilot to production

You do not need a heavyweight program on day one. You need a sequence that prevents surprises:

Define the job-to-be-done in user terms and failure terms (what must never happen).
Baseline quality on real-ish inputs; resist judging only on demo prompts.
Add observability early: traces, retrieval attribution, and outcome signals you can review weekly.
Harden boundaries (tools, authz, output contracts) before widening scope.
Iterate with evals, not vibes, especially across model upgrades.
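The observability step does not need a platform to get started. As a minimal sketch, one structured record per request, written as a JSON line, is enough to support a weekly review. The field names and outcome labels here are assumptions for illustration, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TraceRecord:
    request_id: str
    prompt_version: str          # which prompt template produced this answer
    model: str                   # which model served the request
    latency_ms: float
    retrieved_chunk_ids: List[str] = field(default_factory=list)
    outcome: str = "unknown"     # filled in later, e.g. "accepted", "edited", "escalated"


def to_log_line(record: TraceRecord) -> str:
    """Serialize one trace as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Recording the prompt version and model alongside the outcome is what later lets an eval comparison across a model upgrade be sliced by real traffic instead of demo prompts.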

Closing thought

Shipping AI that holds up is less about chasing the newest model and more about engineering discipline at the seams: evaluation, retrieval hygiene, permissions, latency, and operational honesty when confidence is low.

At RePizel we partner with teams to turn those seams into a repeatable delivery loop, so your roadmap advances without trading away trust.

If you are charting a path from prototype to production and want a second pair of eyes on architecture or eval strategy, reach out. We like practical constraints more than buzzwords.