ChloryX

Retrieval-Augmented Generation (RAG) looks simple in a demo: embed docs, retrieve top‑k, and ask an LLM to answer. In production, the failure modes multiply: stale indexes, missing permissions, low-quality chunks, hallucinations, latency spikes, and silent regressions after content changes.

Treat RAG as a system of pipelines with explicit measurement. If you can’t evaluate it, you can’t ship it.

Architecture: think in stages

Source ingestion (docs, tickets, wikis, PDFs)
Preprocessing (cleaning, de-duplication, language detection)
Chunking + metadata (owners, ACLs, timestamps)
Embedding + indexing
Retrieval (hybrid search, reranking)
Prompting + generation
Post-processing (citations, redaction)
Evaluation + monitoring
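
To make the "Chunking + metadata" stage concrete, here is a minimal sketch, assuming documents arrive as plain text with blank lines between paragraphs. The `SourceDoc` and `Chunk` shapes, the `chunkDocument` name, and the 500-character default are all hypothetical choices for illustration, not a recommendation from any particular library.

```typescript
// Hypothetical shapes: field names are illustrative, not from a specific library.
type SourceDoc = {
  id: string;
  text: string;
  owner: string;
  acl: string[];      // groups allowed to read the document
  updatedAt: string;  // ISO timestamp from the source system
};

type Chunk = {
  id: string;
  docId: string;
  text: string;
  owner: string;
  acl: string[];
  updatedAt: string;
};

// Greedily pack whole paragraphs into chunks of at most maxChars characters.
// A single paragraph longer than maxChars is kept as one oversize chunk
// rather than being cut mid-sentence.
function chunkDocument(doc: SourceDoc, maxChars = 500): Chunk[] {
  const paragraphs = doc.text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const pieces: string[] = [];
  let buffer = "";
  for (const p of paragraphs) {
    const candidate = buffer ? `${buffer}\n\n${p}` : p;
    if (candidate.length <= maxChars || buffer === "") {
      buffer = candidate;
    } else {
      pieces.push(buffer);
      buffer = p;
    }
  }
  if (buffer) pieces.push(buffer);

  // Every chunk inherits the parent document's metadata, so ACL checks and
  // freshness tracking can operate on chunks directly.
  return pieces.map((text, i) => ({
    id: `${doc.id}#${i}`,
    docId: doc.id,
    text,
    owner: doc.owner,
    acl: [...doc.acl],
    updatedAt: doc.updatedAt,
  }));
}
```

Copying the ACL and timestamp onto each chunk costs some storage, but it lets the retrieval stage filter by permission without a second lookup.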

Evals: the minimum viable evaluation suite

You don’t need 10,000 gold labels to start. You need a small, high-signal set of questions that represent real user tasks and cover your content breadth.

Answer quality: judged by humans, or by a model-as-judge with human spot checks.
Citation correctness: does the cited source actually support the claim?
Retrieval recall: is the necessary passage present in top‑k results?
Refusal behavior: does it refuse when the answer is not in the docs?
Latency: p50/p95 end-to-end and retrieval-only latencies.

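```ts
// Conceptual eval record
export type EvalCase = {
  question: string;            // a real user task, phrased the way a user would ask it
  expected_sources: string[];  // sources that should be retrieved and cited
  must_refuse?: boolean;       // true when the answer is not in the docs
};
```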
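
Two of the checks above, retrieval recall and refusal behavior, can be scored mechanically. The sketch below repeats the `EvalCase` shape so it runs on its own. The `EvalResult` type and the `recallAtK` and `summarize` helpers are hypothetical names, and the sketch assumes your pipeline can report which source ids it retrieved and whether it refused.

```typescript
type EvalCase = {
  question: string;
  expected_sources: string[];
  must_refuse?: boolean;
};

// Hypothetical: what the pipeline reports back for one eval case.
type EvalResult = {
  retrieved_sources: string[]; // ranked, best first
  refused: boolean;
};

// Fraction of expected sources that appear in the top-k retrieved results.
function recallAtK(expected: string[], retrieved: string[], k: number): number {
  if (expected.length === 0) return 1; // nothing was required
  const topK = new Set(retrieved.slice(0, k));
  const hits = expected.filter((s) => topK.has(s)).length;
  return hits / expected.length;
}

// Mean recall@k over answerable cases, plus refusal accuracy over all cases.
function summarize(cases: EvalCase[], results: EvalResult[], k: number) {
  if (cases.length !== results.length) {
    throw new Error("cases and results must have the same length");
  }
  let recallSum = 0;
  let recallCount = 0;
  let refusalCorrect = 0;
  cases.forEach((c, i) => {
    const r = results[i];
    const shouldRefuse = c.must_refuse === true;
    if (r.refused === shouldRefuse) refusalCorrect++;
    if (!shouldRefuse) {
      recallSum += recallAtK(c.expected_sources, r.retrieved_sources, k);
      recallCount++;
    }
  });
  return {
    meanRecallAtK: recallCount > 0 ? recallSum / recallCount : 0,
    refusalAccuracy: cases.length > 0 ? refusalCorrect / cases.length : 0,
  };
}
```

Answer quality and citation correctness still need a human or model judge. These two numbers only tell you whether the right passages were available and whether the system declined when it should have.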

Guardrails: reduce risk without killing usefulness

Grounding requirement: model must cite sources (and you verify citations).
Tooling boundaries: restrict what the model can call, with allow-lists.
Safety filters: detect and redact PII before any text is placed in a prompt.
Permission checks: retrieval must enforce ACLs at query time.
Fallback policies: if retrieval confidence is low, show ‘I don’t know’ + links.
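
The permission check and the fallback policy can be sketched together. This is a minimal illustration, assuming each retrieved chunk carries an `acl` list of group names and a similarity `score`. The `ScoredChunk` and `Decision` types, the function names, and the 0.5 threshold are hypothetical. Real retrievers use different score scales, and in practice you would usually push the ACL filter into the index query itself rather than filter afterwards.

```typescript
// Hypothetical shape of a retrieved chunk.
type ScoredChunk = {
  id: string;
  text: string;
  url: string;
  acl: string[];  // groups allowed to read this chunk
  score: number;  // retriever-specific relevance score, higher is better
};

type Decision =
  | { kind: "answer"; context: ScoredChunk[] }
  | { kind: "fallback"; message: string; links: string[] };

// Keep only chunks the user is allowed to see, checked at query time.
function filterByAcl(chunks: ScoredChunk[], userGroups: string[]): ScoredChunk[] {
  const groups = new Set(userGroups);
  return chunks.filter((c) => c.acl.some((g) => groups.has(g)));
}

// If nothing permitted clears the threshold, return "I don't know" plus links
// instead of sending weak context to the model.
function decide(chunks: ScoredChunk[], userGroups: string[], minScore = 0.5): Decision {
  // filterByAcl returns a new array, so sorting here does not mutate the input.
  const allowed = filterByAcl(chunks, userGroups).sort((a, b) => b.score - a.score);
  const confident = allowed.filter((c) => c.score >= minScore);
  if (confident.length === 0) {
    return {
      kind: "fallback",
      message: "I don't know",
      links: allowed.slice(0, 3).map((c) => c.url),
    };
  }
  return { kind: "answer", context: confident };
}
```

Note that the fallback links are drawn only from the permitted chunks, so a low-confidence response cannot leak the existence of documents the user may not read.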

Monitoring: what to log in production

Capture enough to debug issues, but avoid logging sensitive content. Prefer structured telemetry with careful redaction.

Query id, user segment (not user id), timestamp
Retriever: index version, top‑k ids, scores
Generator: model version, prompt template version
Latency breakdown (retrieval, rerank, generation)
Safety signals: redaction count, refusal flag, policy triggers
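
One way to shape that telemetry is sketched below, under assumptions: the `QueryTelemetry` field names are hypothetical, `redactEmails` only catches email addresses (real PII detection needs far more than one regex), and `percentile` uses the simple nearest-rank method to turn logged latencies into p50/p95 figures.

```typescript
// Hypothetical structured log record mirroring the fields listed above.
type QueryTelemetry = {
  queryId: string;
  userSegment: string; // a coarse segment, deliberately not a user id
  timestamp: string;
  retriever: { indexVersion: string; topKIds: string[]; scores: number[] };
  generator: { modelVersion: string; promptTemplateVersion: string };
  latencyMs: { retrieval: number; rerank: number; generation: number };
  safety: { redactionCount: number; refused: boolean; policyTriggers: string[] };
};

// Illustrative only: matches common email shapes, nothing else.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Replace email addresses and report how many were redacted, so the count can
// be logged as a safety signal without logging the content itself.
function redactEmails(text: string): { text: string; count: number } {
  let count = 0;
  const redacted = text.replace(EMAIL, () => {
    count++;
    return "[REDACTED_EMAIL]";
  });
  return { text: redacted, count };
}

// End-to-end latency for one query, from the per-stage breakdown.
function totalLatencyMs(t: QueryTelemetry): number {
  return t.latencyMs.retrieval + t.latencyMs.rerank + t.latencyMs.generation;
}

// Nearest-rank percentile: p is in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Logging ids and scores rather than chunk text keeps the record useful for debugging retrieval while leaving sensitive content out of the log stream.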

Operational rule

Version everything: index, embedding model, chunking algorithm, prompt template, and LLM. If any one of these changes without a recorded version, you can’t reproduce the bug a user reported last week.
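
A lightweight way to apply this rule is to attach a single version manifest to every response and log line. The sketch below is one possible shape. `RagVersions`, `versionKey`, and `diffVersions` are hypothetical names, and the version strings you put in them are whatever your own release process produces.

```typescript
// Hypothetical manifest: one entry per component named in the rule above.
type RagVersions = {
  indexVersion: string;
  embeddingModel: string;
  chunkingAlgorithm: string;
  promptTemplate: string;
  llm: string;
};

// A stable, human-readable key to attach to logs and eval runs.
function versionKey(v: RagVersions): string {
  return [
    `index=${v.indexVersion}`,
    `embed=${v.embeddingModel}`,
    `chunk=${v.chunkingAlgorithm}`,
    `prompt=${v.promptTemplate}`,
    `llm=${v.llm}`,
  ].join("|");
}

// Names of the components that differ between two manifests.
function diffVersions(a: RagVersions, b: RagVersions): string[] {
  const keys = Object.keys(a) as (keyof RagVersions)[];
  return keys.filter((k) => a[k] !== b[k]);
}
```

When a regression shows up, diffing the manifest from the last good eval run against the current one narrows the search to the components that actually changed.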

Index freshness and content changes

The most common production issue is stale context: docs change, but the index doesn’t. Implement incremental indexing and an SLA for freshness.

Incremental ingestion keyed by doc updated_at.
Hard deletes for removed documents.
Index rebuild playbook (full re-embed) for model upgrades.
Freshness dashboard: % docs indexed within SLA.
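
The incremental-ingestion, hard-delete, and freshness items can be sketched as two small functions. This assumes the source system exposes an `updatedAt` per document and that the index remembers which `updatedAt` it last embedded. All type and function names here are hypothetical, timestamps are epoch milliseconds, and the freshness definition (a document counts as fresh if its latest version is indexed or its latest change is still inside the SLA window) is one reasonable choice among several.

```typescript
// Hypothetical view of a document in the source system.
type DocState = { id: string; updatedAt: number; deleted?: boolean };

// Hypothetical record of what the index currently holds for a document.
type IndexEntry = { id: string; indexedUpdatedAt: number };

// Decide which documents to (re-)embed and which to hard-delete.
function planSync(source: DocState[], index: Map<string, IndexEntry>) {
  const upsert: string[] = [];
  const remove: string[] = [];
  const live = new Set<string>();

  for (const d of source) {
    if (d.deleted) continue;
    live.add(d.id);
    const entry = index.get(d.id);
    // New document, or the source changed after the version we indexed.
    if (entry === undefined || entry.indexedUpdatedAt < d.updatedAt) {
      upsert.push(d.id);
    }
  }

  // Anything indexed that is no longer live in the source must be removed.
  index.forEach((_entry, id) => {
    if (!live.has(id)) remove.push(id);
  });

  return { upsert, remove };
}

// Percentage of live documents that meet the freshness SLA at time `now`.
function freshnessPercent(
  source: DocState[],
  index: Map<string, IndexEntry>,
  slaMs: number,
  now: number
): number {
  const liveDocs = source.filter((d) => !d.deleted);
  if (liveDocs.length === 0) return 100;
  const fresh = liveDocs.filter((d) => {
    const entry = index.get(d.id);
    const upToDate = entry !== undefined && entry.indexedUpdatedAt >= d.updatedAt;
    const withinSla = now - d.updatedAt <= slaMs;
    return upToDate || withinSla;
  }).length;
  return (fresh / liveDocs.length) * 100;
}
```

Running `planSync` on a schedule gives you the incremental work list, and charting `freshnessPercent` over time gives you the dashboard number.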

Deployment strategy

Treat RAG changes as you would changes to any other production system: use canary releases, regression gates, and rollback. Run your eval suite on every index rebuild and on every prompt or model change.
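
A regression gate can be as simple as comparing the candidate’s eval metrics against the current baseline. The sketch below is a hedged example: the `Metrics` fields, the 0.02 quality tolerance, and the 10% latency budget are hypothetical defaults you would tune for your own suite.

```typescript
// Hypothetical summary produced by one run of the eval suite.
type Metrics = {
  meanRecallAtK: number;
  citationAccuracy: number;
  refusalAccuracy: number;
  p95LatencyMs: number;
};

type GateResult = { pass: boolean; failures: string[] };

// Fail the release if any quality metric drops by more than `tolerance`,
// or if p95 latency grows beyond `latencyBudget` times the baseline.
function regressionGate(
  baseline: Metrics,
  candidate: Metrics,
  tolerance = 0.02,
  latencyBudget = 1.1
): GateResult {
  const failures: string[] = [];
  const quality: (keyof Metrics)[] = [
    "meanRecallAtK",
    "citationAccuracy",
    "refusalAccuracy",
  ];
  for (const k of quality) {
    if (candidate[k] < baseline[k] - tolerance) {
      failures.push(`${k} dropped from ${baseline[k]} to ${candidate[k]}`);
    }
  }
  if (candidate.p95LatencyMs > baseline.p95LatencyMs * latencyBudget) {
    failures.push(
      `p95 latency rose from ${baseline.p95LatencyMs}ms to ${candidate.p95LatencyMs}ms`
    );
  }
  return { pass: failures.length === 0, failures };
}
```

Wire the result into CI so that a failed gate blocks the rollout, and keep the failure messages in the build log so the cause is visible without rerunning the suite.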

“The goal isn’t to never hallucinate. The goal is to detect, reduce, and bound failure — and to make it measurable.”
