Retrieval-Augmented Generation (RAG) looks simple in a demo: embed docs, retrieve top‑k, and ask an LLM to answer. In production, the failure modes multiply: stale indexes, missing permissions, low-quality chunks, hallucinations, latency spikes, and silent regressions after content changes.
Treat RAG as a system of pipelines with explicit measurement. If you can’t evaluate it, you can’t ship it.
Architecture: think in stages
- Source ingestion (docs, tickets, wikis, PDFs)
- Preprocessing (cleaning, de-duplication, language detection)
- Chunking + metadata (owners, ACLs, timestamps)
- Embedding + indexing
- Retrieval (hybrid search, reranking)
- Prompting + generation
- Post-processing (citations, redaction)
- Evaluation + monitoring
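To make the "chunking + metadata" stage concrete, here is a minimal sketch of a chunker that copies owner, ACL, and timestamp fields onto every chunk so later stages can filter and audit on them. The `SourceDoc` and `Chunk` shapes, the field names, and the fixed-size character window are illustrative assumptions, not a recommended chunking strategy.

```typescript
// Hypothetical shapes; field names are assumptions for illustration.
type SourceDoc = {
  id: string;
  text: string;
  owner: string;
  allowedGroups: string[]; // ACL: groups allowed to read this doc
  updatedAt: string; // ISO 8601 timestamp
};

type Chunk = {
  chunkId: string;
  docId: string;
  text: string;
  owner: string;
  allowedGroups: string[];
  updatedAt: string;
};

// Fixed-size character windows with overlap: a simple stand-in for
// whatever chunking strategy you actually use.
function chunkDocument(doc: SourceDoc, size = 200, overlap = 50): Chunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const chunks: Chunk[] = [];
  const step = size - overlap;
  for (let start = 0; start < doc.text.length; start += step) {
    chunks.push({
      chunkId: `${doc.id}#${chunks.length}`,
      docId: doc.id,
      text: doc.text.slice(start, start + size),
      // Metadata is copied onto every chunk so retrieval can enforce ACLs
      // and freshness checks without a second lookup.
      owner: doc.owner,
      allowedGroups: [...doc.allowedGroups],
      updatedAt: doc.updatedAt,
    });
    // Stop once this window has reached the end of the text.
    if (start + size >= doc.text.length) break;
  }
  return chunks;
}
```

Carrying the ACL on the chunk itself is a design choice: it costs some storage, but it keeps permission checks possible at query time even if the source system is unavailable.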
Evals: the minimum viable evaluation suite
You don’t need 10,000 gold labels to start. You need a small, high-signal set of questions that represent real user tasks and cover the breadth of your content. For each question, measure at least:
- Answer quality: judged by humans or model-as-judge with spot checks.
- Citation correctness: does the cited source actually support the claim?
- Retrieval recall: is the necessary passage present in top‑k results?
- Refusal behavior: does the system decline to answer when the answer is not in the docs?
- Latency: p50/p95 end-to-end and retrieval-only latencies.
```ts
// Conceptual eval record
export type EvalCase = {
  question: string;
  expected_sources: string[];
  must_refuse?: boolean;
};
```
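As a sketch of how such records could be scored, the snippet below computes retrieval recall at k and refusal correctness for each case, then averages them. `EvalCase` is repeated so the block stands alone; `RagResult`, `scoreCase`, and `summarize` are hypothetical names, and answer quality and citation correctness are deliberately left out because they need a human or model judge.

```typescript
type EvalCase = {
  question: string;
  expected_sources: string[];
  must_refuse?: boolean;
};

// Hypothetical shape of what the system under test returns for one question.
type RagResult = {
  retrieved_ids: string[]; // ranked, best first
  refused: boolean;
};

type CaseScore = {
  recallAtK: number | null; // null when the case has no expected sources
  refusalCorrect: boolean;
};

function scoreCase(c: EvalCase, r: RagResult, k: number): CaseScore {
  const topK = new Set(r.retrieved_ids.slice(0, k));
  const hits = c.expected_sources.filter((id) => topK.has(id)).length;
  const recallAtK =
    c.expected_sources.length === 0 ? null : hits / c.expected_sources.length;
  // A case without must_refuse is treated as "should answer".
  const refusalCorrect = r.refused === (c.must_refuse ?? false);
  return { recallAtK, refusalCorrect };
}

function summarize(scores: CaseScore[]): { meanRecallAtK: number; refusalAccuracy: number } {
  // Returns 0 for an empty list; callers may prefer to treat that as "no data".
  const mean = (xs: number[]): number =>
    xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
  const recalls = scores
    .map((s) => s.recallAtK)
    .filter((x): x is number => x !== null);
  return {
    meanRecallAtK: mean(recalls),
    refusalAccuracy: mean(scores.map((s) => (s.refusalCorrect ? 1 : 0))),
  };
}
```

Cases with no expected sources (typically the must-refuse ones) are excluded from the recall average rather than counted as zero, so refusal cases do not drag the retrieval number down.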
Guardrails: reduce risk without killing usefulness
- Grounding requirement: model must cite sources (and you verify citations).
- Tooling boundaries: restrict what the model can call, with allow-lists.
- Safety filters: PII detection + redaction before prompts.
- Permission checks: retrieval must enforce ACLs at query time.
- Fallback policies: if retrieval confidence is low, show ‘I don’t know’ + links.
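The permission check and the fallback policy can be sketched together as below. This assumes each retrieved chunk carries its own ACL groups and a similarity score in [0, 1]; the type names, the `minScore` threshold of 0.5, and the post-retrieval filter are illustrative assumptions. In a real deployment you would usually push the ACL filter into the index query itself so that unauthorized chunks are never returned.

```typescript
// Hypothetical chunk shape returned by the retriever.
type RetrievedChunk = {
  id: string;
  score: number; // assumed similarity in [0, 1], higher is better
  allowedGroups: string[];
  url: string;
};

type Decision =
  | { kind: "answer"; context: RetrievedChunk[] }
  | { kind: "fallback"; message: string; links: string[] };

// Keep only chunks the user is allowed to read.
function filterByAcl(chunks: RetrievedChunk[], userGroups: string[]): RetrievedChunk[] {
  const groups = new Set(userGroups);
  return chunks.filter((c) => c.allowedGroups.some((g) => groups.has(g)));
}

// Answer only when at least one permitted chunk clears the score threshold;
// otherwise fall back to "I don't know" plus links the user may open.
function decide(chunks: RetrievedChunk[], userGroups: string[], minScore = 0.5): Decision {
  const visible = filterByAcl(chunks, userGroups);
  const confident = visible.filter((c) => c.score >= minScore);
  if (confident.length === 0) {
    return {
      kind: "fallback",
      message: "I don't know",
      links: visible.map((c) => c.url), // never link to chunks the user cannot read
    };
  }
  return { kind: "answer", context: confident };
}
```

Filtering by ACL before the confidence check matters: a high-scoring chunk the user cannot read must not count as evidence that the question is answerable.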
Monitoring: what to log in production
Capture enough to debug issues, but avoid logging sensitive content. Prefer structured telemetry with careful redaction.
- Query id, user segment (not user id), timestamp
- Retriever: index version, top‑k ids, scores
- Generator: model version, prompt template version
- Latency breakdown (retrieval, rerank, generation)
- Safety signals: redaction count, refusal flag, policy triggers
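One way to keep telemetry structured and free of sensitive content is to define the event type so that it has no field for raw query text, document text, or user id. The sketch below follows the fields listed above; the `RagTelemetry` shape and the helper names are assumptions, and `total` is simply the sum of the stage timings, not a separately measured end-to-end latency.

```typescript
type StageTimingsMs = { retrieval: number; rerank: number; generation: number };

// Hypothetical event shape. Note what is absent: no query text,
// no retrieved passages, no user id.
type RagTelemetry = {
  queryId: string;
  userSegment: string; // coarse segment, never a user id
  timestamp: string; // ISO 8601
  retriever: { indexVersion: string; topK: { id: string; score: number }[] };
  generator: { modelVersion: string; promptTemplateVersion: string };
  latencyMs: StageTimingsMs & { total: number };
  safety: { redactionCount: number; refused: boolean; policyTriggers: string[] };
};

type TelemetryInput = Omit<RagTelemetry, "latencyMs"> & { timings: StageTimingsMs };

function buildTelemetry(input: TelemetryInput): RagTelemetry {
  const { timings, ...rest } = input;
  // Sum of measured stages; measure true end-to-end latency separately if needed.
  const total = timings.retrieval + timings.rerank + timings.generation;
  return { ...rest, latencyMs: { ...timings, total } };
}

// One JSON object per line is easy to ship to most log pipelines.
function toLogLine(event: RagTelemetry): string {
  return JSON.stringify(event);
}
```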
Operational rule
Version everything: the index, the embedding model, the chunking algorithm, the prompt template, and the LLM. Otherwise you can’t reproduce bugs.
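A small version manifest attached to every request makes this rule checkable. The sketch below is one possible shape, with hypothetical field and function names: it builds a stable key for grouping logs and reports which components differ between two runs.

```typescript
// Hypothetical manifest; one entry per versioned component named above.
type VersionManifest = {
  indexVersion: string;
  embeddingModel: string;
  chunkingAlgorithm: string;
  promptTemplate: string;
  llm: string;
};

// Fixed key order so the manifest key is deterministic.
const MANIFEST_KEYS: (keyof VersionManifest)[] = [
  "indexVersion",
  "embeddingModel",
  "chunkingAlgorithm",
  "promptTemplate",
  "llm",
];

// Stable string for grouping logs and eval runs by exact configuration.
function manifestKey(m: VersionManifest): string {
  return MANIFEST_KEYS.map((k) => `${k}=${m[k]}`).join("|");
}

// Which components changed between a known-good run and a failing one?
function changedComponents(a: VersionManifest, b: VersionManifest): (keyof VersionManifest)[] {
  return MANIFEST_KEYS.filter((k) => a[k] !== b[k]);
}
```

When a regression appears, comparing the manifest of the last good run with the failing one narrows the search to the components that actually changed.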
Index freshness and content changes
One of the most common production issues is stale context: documents change, but the index doesn’t. Implement incremental indexing and set an SLA for freshness.
- Incremental ingestion keyed by doc updated_at.
- Hard deletes for removed documents.
- Index rebuild playbook (full re-embed) for model upgrades.
- Freshness dashboard: % docs indexed within SLA.
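The incremental-sync and freshness ideas above can be sketched as a pure planning step. The snippet assumes the source system exposes an `updatedAt` per document and the index records which `updatedAt` it last ingested; the type names are hypothetical. The freshness definition used here (a document is fresh if the index has its current version, or if its latest change is still younger than the SLA window) is one reasonable choice, not the only one.

```typescript
type SourceDocMeta = { id: string; updatedAt: string }; // ISO 8601
type IndexedDocMeta = { id: string; indexedUpdatedAt: string }; // source version last ingested

// Decide which docs to (re)index and which to hard-delete from the index.
function planSync(
  source: SourceDocMeta[],
  indexed: IndexedDocMeta[]
): { upsert: string[]; remove: string[] } {
  const indexedById = new Map<string, string>();
  indexed.forEach((d) => indexedById.set(d.id, d.indexedUpdatedAt));
  const sourceIds = new Set(source.map((d) => d.id));

  const upsert = source
    .filter((d) => {
      const seen = indexedById.get(d.id);
      return seen === undefined || Date.parse(d.updatedAt) > Date.parse(seen);
    })
    .map((d) => d.id);
  // Anything still indexed but gone from the source is a hard delete.
  const remove = indexed.filter((d) => !sourceIds.has(d.id)).map((d) => d.id);
  return { upsert, remove };
}

// Percentage of source docs that are fresh under the definition above.
function freshnessPercent(
  source: SourceDocMeta[],
  indexed: IndexedDocMeta[],
  nowMs: number,
  slaMs: number
): number {
  if (source.length === 0) return 100;
  const indexedById = new Map<string, string>();
  indexed.forEach((d) => indexedById.set(d.id, d.indexedUpdatedAt));
  const fresh = source.filter((d) => {
    const updated = Date.parse(d.updatedAt);
    const seen = indexedById.get(d.id);
    const upToDate = seen !== undefined && Date.parse(seen) >= updated;
    const withinSla = nowMs - updated <= slaMs;
    return upToDate || withinSla;
  }).length;
  return (fresh / source.length) * 100;
}
```

Keeping the plan separate from its execution makes the sync easy to dry-run and to test against fixtures before it touches a live index.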
Deployment strategy
Treat changes like any other production system: canary releases, regression gates, and rollback. Run your eval suite on every index rebuild and prompt/model change.
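A regression gate can be as simple as comparing the candidate's eval metrics against a stored baseline with explicit tolerances. The sketch below assumes four summary metrics and default tolerances (a 0.02 absolute drop in quality, a 10% increase in p95 latency) that are placeholders to tune for your own system; all names are hypothetical.

```typescript
type Metrics = {
  recallAtK: number; // 0..1, higher is better
  citationAccuracy: number; // 0..1, higher is better
  refusalAccuracy: number; // 0..1, higher is better
  p95LatencyMs: number; // lower is better
};

type Tolerances = { maxQualityDrop: number; maxLatencyIncreaseRatio: number };
type GateResult = { pass: boolean; failures: string[] };

function regressionGate(
  baseline: Metrics,
  candidate: Metrics,
  tol: Tolerances = { maxQualityDrop: 0.02, maxLatencyIncreaseRatio: 0.1 }
): GateResult {
  const failures: string[] = [];
  const qualityKeys: ("recallAtK" | "citationAccuracy" | "refusalAccuracy")[] = [
    "recallAtK",
    "citationAccuracy",
    "refusalAccuracy",
  ];
  qualityKeys.forEach((key) => {
    // Fail when quality drops by more than the allowed absolute amount.
    if (candidate[key] < baseline[key] - tol.maxQualityDrop) {
      failures.push(`${key}: ${baseline[key]} -> ${candidate[key]}`);
    }
  });
  // Fail when p95 latency grows by more than the allowed ratio.
  if (candidate.p95LatencyMs > baseline.p95LatencyMs * (1 + tol.maxLatencyIncreaseRatio)) {
    failures.push(`p95LatencyMs: ${baseline.p95LatencyMs} -> ${candidate.p95LatencyMs}`);
  }
  return { pass: failures.length === 0, failures };
}
```

In a CI job, a failed gate would block the canary from being promoted and the `failures` list would go into the build log; how you wire that up depends on your deployment tooling.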
“The goal isn’t to never hallucinate. The goal is to detect, reduce, and bound failure — and to make it measurable.”
If you want, I can convert this into a concrete implementation blueprint for your stack (vector store, orchestration tool, warehouse, and monitoring).
