
RAG in Production: The Lessons Nobody Tells You

5 min read
RAG · LLM · Production · Architecture

Everyone knows how to build a RAG prototype in an afternoon. The real question is what happens when you put that prototype in front of real users, real documents, and real cost constraints.

Fictional product. Real engineering patterns.

The gap between demo and production

On AuditLens — a regulatory compliance analysis platform — the initial prototype was compelling. A simple synchronous pipeline: the user uploads a document, Elasticsearch retrieves relevant regulatory passages, the LLM produces a structured analysis. In demos, it was magic. Compliance officers watched the system dissect a 200-page policy document and return structured findings with regulatory citations. We had buy-in within weeks.

In production, reality hit fast. Regulatory documents sometimes span hundreds of pages. HTTP timeouts became the norm — not the exception. An ML service crash mid-analysis meant silent data loss, with no way to recover the partial work. And when multiple analysts ran concurrent checks before a regulatory deadline, load spikes were completely unpredictable. The pipeline that looked bulletproof in demos was failing multiple times a day.

The retrieval layer had its own problems. A naive keyword search against 50,000+ indexed regulatory provisions returned too much noise. The LLM was spending tokens analyzing irrelevant passages, which inflated costs and degraded the quality of findings. We needed a smarter retrieval strategy before scaling anything else.

From synchronous to asynchronous — the migration that changed everything

The fix was not incremental. We rebuilt the execution model from synchronous HTTP calls to a fully asynchronous queue-based architecture.


SQS queues with exponential-backoff retries handled transient failures gracefully. Dead Letter Queues captured messages that failed repeatedly — this alone achieved a 99.7% recovery rate on edge cases that would have been silently lost in the synchronous model. Job status was tracked in PostgreSQL so the frontend could poll every few seconds and render a progress bar.
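The retry-and-DLQ behavior can be sketched in a few lines. This is a hypothetical stand-in, not the actual worker code: in production the delay is enforced by the SQS visibility timeout rather than computed in-process, and `handler`, `MAX_ATTEMPTS`, and `BASE_DELAY_S` are illustrative names.

```python
import random

MAX_ATTEMPTS = 5   # illustrative; tuned per queue in practice
BASE_DELAY_S = 2.0

def backoff_delay(attempt: int) -> float:
    """Delay before retry `attempt` (1-indexed): 2s, 4s, 8s... plus jitter."""
    return BASE_DELAY_S * (2 ** (attempt - 1)) + random.uniform(0, 1)

def process_with_retries(message: dict, handler, dead_letter_queue: list) -> bool:
    """Try `handler` up to MAX_ATTEMPTS times; park the message in the DLQ on failure."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(message)
            return True
        except Exception:
            if attempt == MAX_ATTEMPTS:
                # Recoverable later by replaying the DLQ — never silently lost.
                dead_letter_queue.append(message)
                return False
            # Computed for illustration; SQS applies this as a visibility timeout.
            _ = backoff_delay(attempt)
    return False
```

The key property is the last branch: a message that exhausts its retries lands somewhere inspectable instead of vanishing, which is what the synchronous model could not guarantee.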

The migration was not trivial. Every API contract between the frontend and backend changed. The frontend went from "submit and wait for response" to "submit, receive a job ID, poll for status." We had to redesign the analyst workflow around this new mental model.

The surprising outcome: users actually preferred the async model. They could launch multiple analyses in parallel and keep working while waiting — something the synchronous system never allowed. An analyst preparing for an audit could queue five documents at once, then review results as they came in. The P95 latency for a full analysis settled at under 45 seconds, with 200+ concurrent analyses supported without degradation.

A four-stage retrieval pipeline

The retrieval quality problem required a multi-stage approach, not a single-query fix. Early on, we tried tuning Elasticsearch relevance scoring alone — adjusting boost factors, adding synonym dictionaries for regulatory terminology. It helped marginally, but the core issue remained: keyword matching cannot distinguish between a provision that defines a requirement and one that merely references it in passing.

First, Elasticsearch performed keyword retrieval across the regulatory corpus, casting a wide net. Second, an LLM filtering pass eliminated noise — passages that matched keywords but were contextually irrelevant. This step alone cut the candidate set by roughly 60%. Third, semantic reranking using embeddings and cosine similarity surfaced the most relevant passages. Fourth, fine-grained chunk selection extracted the exact passages needed for the analysis.

This four-stage pipeline meant the LLM only processed pre-validated, highly relevant content. The reduction in irrelevant token consumption was dramatic — and it compounded with the cost tiering strategy, because fewer input tokens meant cheaper models could handle more of the workload.

Controlling costs without sacrificing quality

A RAG pipeline in production is also a cost problem. Regulatory documents are long, and a single compliance analysis could burn through thousands of tokens across multiple LLM calls. Left unchecked, costs would scale linearly with usage.

The approach was multi-layered. Not every task needed the most powerful model. Simple consistency checks — verifying definitions match across document sections, checking cross-references — used lighter, cheaper models. Complex regulatory reasoning — interpreting ambiguous provisions, identifying subtle non-conformity risks — called for more capable ones.

Elasticsearch pre-filtering meant the LLM never saw the full regulatory corpus, only the relevant slice. Smart chunking ensured documents were never sent whole — each chunk was sized to fit the model's context window with room for the system prompt and structured output format. A hash-based caching system caught repeated analyses of similar document sections — common when multiple versions of the same policy were uploaded during an audit cycle.

Datadog dashboards tracked cost per feature and per client tier in real time, with alerts on spending anomalies. The combination of tiering, pre-filtering, and caching brought the per-analysis cost down to a level where usage-based pricing became viable — something that was not even on the table with the original synchronous architecture.

What I take away

RAG in production is not an AI problem — it is a systems engineering problem. Retrieval quality matters more than model power: a four-stage pipeline with cheap pre-filtering outperforms throwing expensive tokens at noisy input. Async architecture is not a compromise, it is an advantage that users actively prefer. And LLM costs are controlled by design — tiered models, aggressive pre-filtering, caching — not by after-the-fact optimization.

The prototype gets you the meeting. The production system gets you the contract.