RAG in Production: Evals, Monitoring, and Guardrails

Retrieval-Augmented Generation (RAG) looks simple in a demo: embed docs, retrieve top‑k, and ask an LLM to answer. In production, the failure modes multiply: stale indexes, missing permissions, low-quality chunks, hallucinations, latency spikes, and silent regressions after content changes.

Treat RAG as a system of pipelines with explicit measurement. If you can’t evaluate it, you can’t ship it.

Architecture: think in stages

  1. Source ingestion (docs, tickets, wikis, PDFs)
  2. Preprocessing (cleaning, de-duplication, language detection)
  3. Chunking + metadata (owners, ACLs, timestamps)
  4. Embedding + indexing
  5. Retrieval (hybrid search, reranking)
  6. Prompting + generation
  7. Post-processing (citations, redaction)
  8. Evaluation + monitoring
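
As a rough illustration of stage 3, the sketch below attaches owner, ACL, and timestamp metadata to every chunk. The `Chunk` and `SourceDoc` shapes, the `chunkDoc` helper, and the fixed-size window with overlap are all assumptions made for the example; real pipelines often split on headings or sentences instead.

```typescript
// Hypothetical chunk record carrying the metadata named in stage 3.
export type Chunk = {
  docId: string;
  text: string;
  owner: string;
  allowedGroups: string[]; // ACL: groups permitted to read the source doc
  updatedAt: string; // ISO-8601 timestamp of the source doc
};

// A source document is the same metadata plus the full body text.
export type SourceDoc = Omit<Chunk, "text"> & { body: string };

// Naive fixed-size character chunker with overlap (an assumed strategy).
// Every chunk inherits the metadata of its source document.
export function chunkDoc(doc: SourceDoc, size = 200, overlap = 50): Chunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const { body, ...meta } = doc;
  const chunks: Chunk[] = [];
  for (let start = 0; start < body.length; start += size - overlap) {
    chunks.push({ ...meta, text: body.slice(start, start + size) });
    if (start + size >= body.length) break; // last window reached the end
  }
  return chunks;
}
```

Carrying the ACL and timestamp on the chunk itself means later stages (permission checks, freshness tracking) do not need a second lookup against the source system.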

Evals: the minimum viable evaluation suite

You don’t need 10,000 gold labels to start. You need a small, high-signal set of questions that represent real user tasks and cover your content breadth.

  • Answer quality: judged by humans, or by a model-as-judge backed by human spot checks.
  • Citation correctness: does the cited source actually support the claim?
  • Retrieval recall: is the necessary passage present in top‑k results?
  • Refusal behavior: does the system refuse when the answer is not in the docs?
  • Latency: p50/p95 end-to-end and retrieval-only latencies.
```ts
// Conceptual eval record
export type EvalCase = {
  question: string;
  expected_sources: string[];
  must_refuse?: boolean;
};
```
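
To make two of the checks above concrete, here is a minimal sketch that scores retrieval recall at k and refusal behavior over a set of eval cases. The `RagResult` shape and the `recallAtK` and `summarize` helpers are hypothetical names introduced for illustration; answer quality and citation correctness still need a human or model judge and are not covered here.

```typescript
export type EvalCase = {
  question: string;
  expected_sources: string[];
  must_refuse?: boolean;
};

// Hypothetical shape of what the system under test returns for one question.
export type RagResult = {
  retrieved_ids: string[]; // ranked, best first
  refused: boolean;
};

// Fraction of the expected sources that appear in the top-k retrieved ids.
export function recallAtK(c: EvalCase, r: RagResult, k: number): number {
  if (c.expected_sources.length === 0) return 1;
  const topK = new Set(r.retrieved_ids.slice(0, k));
  const hits = c.expected_sources.filter((id) => topK.has(id)).length;
  return hits / c.expected_sources.length;
}

// Mean recall over answerable cases, plus how often refusal matched expectation.
export function summarize(cases: EvalCase[], results: RagResult[], k: number) {
  if (cases.length !== results.length) throw new Error("length mismatch");
  let recallSum = 0;
  let answerable = 0;
  let refusalOk = 0;
  cases.forEach((c, i) => {
    const r = results[i];
    if ((c.must_refuse ?? false) === r.refused) refusalOk++;
    if (!c.must_refuse) {
      recallSum += recallAtK(c, r, k);
      answerable++;
    }
  });
  return {
    meanRecallAtK: answerable ? recallSum / answerable : 0,
    refusalAccuracy: cases.length ? refusalOk / cases.length : 0,
  };
}
```

Cases marked `must_refuse` are excluded from the recall average, since there is no passage the retriever was supposed to find.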

Guardrails: reduce risk without killing usefulness

  • Grounding requirement: model must cite sources (and you verify citations).
  • Tooling boundaries: restrict what the model can call, with allow-lists.
  • Safety filters: PII detection + redaction before text reaches the prompt.
  • Permission checks: retrieval must enforce ACLs at query time.
  • Fallback policies: if retrieval confidence is low, show ‘I don’t know’ + links.
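
The permission check and the fallback policy can be sketched together as below. The `Hit` shape, the group-based ACL model, and the `minScore` threshold of 0.5 are assumptions for the example, not values from any particular vector store.

```typescript
// Hypothetical retrieval hit; field names are illustrative.
export type Hit = {
  id: string;
  score: number; // assumed to be a similarity in [0, 1]
  allowedGroups: string[];
  url: string;
};

// Keep only hits the user is allowed to read, checked at query time.
export function enforceAcl(hits: Hit[], userGroups: string[]): Hit[] {
  const groups = new Set(userGroups);
  return hits.filter((h) => h.allowedGroups.some((g) => groups.has(g)));
}

export type Decision =
  | { kind: "answer"; context: Hit[] }
  | { kind: "fallback"; message: string; links: string[] };

// If no permitted hit clears the (assumed) threshold, skip generation
// and return "I don't know" plus links to whatever the user may read.
export function decide(hits: Hit[], userGroups: string[], minScore = 0.5): Decision {
  const permitted = enforceAcl(hits, userGroups);
  const confident = permitted.filter((h) => h.score >= minScore);
  if (confident.length === 0) {
    return {
      kind: "fallback",
      message: "I don't know",
      links: permitted.map((h) => h.url),
    };
  }
  return { kind: "answer", context: confident };
}
```

This version filters after retrieval to keep the example small. Where the search backend supports metadata filters, pushing the ACL condition into the query itself is usually preferable, so restricted chunks never leave the index.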

Monitoring: what to log in production

Capture enough to debug issues, but avoid logging sensitive content. Prefer structured telemetry with careful redaction.

  • Query id, user segment (not user id), timestamp
  • Retriever: index version, top‑k ids, scores
  • Generator: model version, prompt template version
  • Latency breakdown (retrieval, rerank, generation)
  • Safety signals: redaction count, refusal flag, policy triggers
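
One possible shape for such a telemetry record is sketched below, along with a deliberately simple redaction helper. The field names and the email-only regex are assumptions; production PII detection needs far broader coverage than a single pattern.

```typescript
// Hypothetical structured telemetry record mirroring the list above.
export type QueryTelemetry = {
  queryId: string;
  userSegment: string; // a coarse segment, never a user id
  timestamp: string; // ISO-8601
  retriever: { indexVersion: string; topKIds: string[]; scores: number[] };
  generator: { modelVersion: string; promptTemplateVersion: string };
  latencyMs: { retrieval: number; rerank: number; generation: number };
  safety: { redactionCount: number; refused: boolean; policyTriggers: string[] };
};

// Deliberately simple email pattern, used here only as a stand-in for PII detection.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Replace each email with a placeholder and report how many were redacted,
// so the count can be logged without logging the content.
export function redactEmails(text: string): { text: string; count: number } {
  let count = 0;
  const redacted = text.replace(EMAIL, () => {
    count++;
    return "[EMAIL]";
  });
  return { text: redacted, count };
}

// Sum of the per-stage latencies recorded for one query.
export function totalLatencyMs(t: QueryTelemetry): number {
  const { retrieval, rerank, generation } = t.latencyMs;
  return retrieval + rerank + generation;
}
```

Logging document ids and scores rather than chunk text keeps the record useful for debugging while leaving the sensitive content in the index.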

Operational rule

Version everything: index version, embedding model, chunking algorithm, prompt template, and LLM. Otherwise you can’t reproduce bugs.
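
A minimal way to apply this rule is to attach one version manifest to every response and log line. The `VersionManifest` fields and both helper names below are hypothetical, chosen to match the components listed above.

```typescript
// Hypothetical manifest: one entry per component that can change behavior.
export type VersionManifest = {
  indexVersion: string;
  embeddingModel: string;
  chunkingAlgorithm: string;
  promptTemplate: string;
  llm: string;
};

// Deterministic key for a manifest: fields sorted by name, then joined.
// Two requests with the same key ran against the same configuration.
export function manifestKey(m: VersionManifest): string {
  return Object.entries(m)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([k, v]) => `${k}=${v}`)
    .join(";");
}

// Names of the fields that differ between two manifests.
export function diffManifests(a: VersionManifest, b: VersionManifest): string[] {
  return (Object.keys(a) as (keyof VersionManifest)[]).filter((k) => a[k] !== b[k]);
}
```

When a regression shows up, diffing the manifest of a good response against a bad one narrows the search to the components that actually changed.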

Index freshness and content changes

The most common production issue is stale context: docs change, but the index doesn’t. Implement incremental indexing and an SLA for freshness.

  • Incremental ingestion keyed by doc updated_at.
  • Hard deletes for removed documents.
  • Index rebuild playbook (full re-embed) for model upgrades.
  • Freshness dashboard: % docs indexed within SLA.
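
The list above can be sketched as three small functions. The `DocState` shape is hypothetical, and the freshness definition used here (a doc counts as within SLA if the index has caught up with its latest change, or the pending change is younger than the SLA window) is one reasonable reading, not the only one.

```typescript
// Hypothetical per-document bookkeeping; timestamps are epoch milliseconds.
export type DocState = {
  id: string;
  updatedAt: number; // last change in the source system
  indexedAt?: number; // last successful indexing; undefined if never indexed
};

// Incremental ingestion: docs whose latest change is not yet in the index.
export function needsIndexing(docs: DocState[]): DocState[] {
  return docs.filter((d) => d.indexedAt === undefined || d.indexedAt < d.updatedAt);
}

// Hard deletes: ids still in the index whose source document no longer exists.
export function idsToDelete(indexedIds: string[], sourceIds: string[]): string[] {
  const live = new Set(sourceIds);
  return indexedIds.filter((id) => !live.has(id));
}

// Percentage of docs within the freshness SLA (assumed definition, see above).
export function freshnessPct(docs: DocState[], slaMs: number, now: number): number {
  if (docs.length === 0) return 100;
  const ok = docs.filter((d) => {
    const stale = d.indexedAt === undefined || d.indexedAt < d.updatedAt;
    return !stale || now - d.updatedAt <= slaMs;
  }).length;
  return (100 * ok) / docs.length;
}
```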

Deployment strategy

Treat changes like any other production system: canary releases, regression gates, and rollback. Run your eval suite on every index rebuild and prompt/model change.
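
A regression gate can be as simple as comparing the candidate's eval metrics against a stored baseline. In the sketch below, the metric names, the `GateRule` shape, and the tolerances are illustrative assumptions; the right thresholds depend on how noisy your eval suite is.

```typescript
export type Metrics = Record<string, number>;

// Hypothetical rule: which metric to check, its direction, and allowed slack.
export type GateRule = {
  metric: string;
  higherIsBetter: boolean;
  tolerance: number; // maximum acceptable regression, in the metric's own units
};

// Fail the gate if any metric regresses by more than its tolerance,
// or if a metric is missing from either run.
export function regressionGate(baseline: Metrics, candidate: Metrics, rules: GateRule[]) {
  const failures: string[] = [];
  for (const r of rules) {
    const base = baseline[r.metric];
    const cand = candidate[r.metric];
    if (base === undefined || cand === undefined) {
      failures.push(`${r.metric}: missing`);
      continue;
    }
    const regression = r.higherIsBetter ? base - cand : cand - base;
    if (regression > r.tolerance) {
      failures.push(`${r.metric}: ${base} -> ${cand}`);
    }
  }
  return { pass: failures.length === 0, failures };
}
```

Run in CI after each index rebuild or prompt/model change, a failing gate blocks the rollout and the previous version stays live.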

The goal isn’t to never hallucinate. The goal is to detect, reduce, and bound failure, and to make it measurable.
