Explain RAG architecture in detail

Here’s a clear explanation of Retrieval-Augmented Generation (RAG) architecture, broken down by components, data flow, and key design choices.

RAG Architecture

What is RAG?

RAG combines information retrieval with a generative LLM so that answers are grounded in retrieved source text rather than only in what the model memorized during training. It mitigates two main LLM problems:

  • Hallucination (making up facts)
  • Stale knowledge (training data cut-off date)

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                           QUERY PIPELINE                            │
└─────────────────────────────────────────────────────────────────────┘

   User Query ──► [Embedding Model] ──► [Vector DB]
        │                                    │ Top-K Chunks
        │                                    ▼
        └────────────────────► [Prompt Constructor] ◄── [System Prompt]
                                      │
                                      ▼
                               [LLM Generation]
                                      │
                                      ▼
                               Final Answer

┌─────────────────────────────────────────────────────────────────────┐
│                         INDEXING PIPELINE (Offline)                 │
└─────────────────────────────────────────────────────────────────────┘

   Documents ──► [Chunking] ──► [Embedding Model] ──► [Vector DB]

Component Details

1. Indexing Pipeline (Pre-processing, run once or periodically)

Component       | Function                                                | Example
----------------+---------------------------------------------------------+---------------------------------------------------
Document Loader | Ingests PDFs, HTML, Confluence, Zendesk, etc.           | UnstructuredIO, LangChain loaders
Chunking        | Splits documents into smaller pieces (e.g., 512 tokens) | RecursiveCharacterTextSplitter, semantic chunking
Embedding Model | Converts text chunks into dense vectors                 | text-embedding-3-small, BAAI/bge-large-en
Vector Database | Stores + indexes embeddings for similarity search       | Pinecone, Weaviate, pgvector, FAISS
Metadata Store  | Stores source, date, page number for each chunk         | JSON field in vector DB or separate DB
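
The indexing steps in the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: chunking is done on whitespace-separated words rather than model tokens, `toy_embed` is a hypothetical stand-in for a real embedding model (a hashed bag-of-words vector), and the "vector DB" is a plain Python list. The names `IndexedChunk`, `chunk_words`, `toy_embed`, and `build_index` are made up for this example.

```python
import hashlib
import math
from dataclasses import dataclass, field


@dataclass
class IndexedChunk:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """ASSUMPTION: placeholder for a real embedding model (hashed bag-of-words, L2-normalized)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs: dict[str, str], size: int = 50, overlap: int = 10) -> list[IndexedChunk]:
    """Chunk every document, embed each chunk, and keep source metadata next to the vector."""
    index = []
    for source, text in docs.items():
        for i, chunk in enumerate(chunk_words(text, size, overlap)):
            index.append(IndexedChunk(chunk, toy_embed(chunk), {"source": source, "chunk_id": i}))
    return index
```

In a real system, `toy_embed` would be replaced by a call to the chosen embedding model and the list by a vector database; the overlap between chunks is one common way to avoid cutting a sentence's context at a chunk boundary.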

2. Query Pipeline (Runtime)

Component             | Function
----------------------+--------------------------------------------------------
Query Embedding       | Convert user query to vector (same model as indexing)
Similarity Search     | Find top-K chunks (cosine / dot product / Euclidean)
Re-ranking (optional) | Reorder chunks with cross-encoder for higher precision
Prompt Construction   | Inject retrieved chunks into prompt template
LLM Generation        | Generate answer conditioned on query + chunks
Citation/Attribution  | Return source references with answer
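
The similarity search step can be illustrated with a brute-force cosine search. This is a sketch, assuming the index is a small in-memory list of `(chunk_text, vector)` pairs; a real vector database would typically use an approximate nearest-neighbor index instead of scanning every vector. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3):
    """Score every chunk against the query and return the k best as (score, text), best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Usage with made-up 2-dimensional vectors (real embeddings have hundreds of dimensions).
index = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 1.0])]
best = top_k([1.0, 0.1], index, k=2)
```

When all vectors are L2-normalized, cosine similarity and dot product give the same ranking, which is why many setups normalize once at indexing time and use the cheaper dot product at query time.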

Data Flow Example

User query: “What is your refund policy for electronics?”

Step 1 – Retrieve:

  • Embed query → search vector DB → top 3 chunks:
    1. “Electronics can be returned within 30 days if unopened.” (source: policy.pdf, p.4)
    2. “Opened electronics are subject to 15% restocking fee.” (source: policy.pdf, p.5)
    3. “Defective electronics: free replacement within 1 year.” (source: warranty.pdf, p.2)

Step 2 – Generate prompt:

System: Answer using only the context below. Cite sources.

Context:
[1] Electronics can be returned within 30 days if unopened. (policy.pdf p.4)
[2] Opened electronics are subject to 15% restocking fee. (policy.pdf p.5)
[3] Defective electronics: free replacement within 1 year. (warranty.pdf p.2)

User: What is your refund policy for electronics?

Step 3 – LLM output:
“Our refund policy for electronics depends on condition: unopened items can be returned within 30 days (policy.pdf p.4). Opened items have a 15% restocking fee (policy.pdf p.5). Defective items are replaced for free within 1 year (warranty.pdf p.2).”
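
The prompt from Step 2 can be assembled with a small helper. This is a sketch only: `RetrievedChunk` and `build_prompt` are hypothetical names, and the template simply mirrors the layout shown in the example above. The resulting string would then be sent to whichever LLM API is in use.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    source: str
    page: int


SYSTEM_PROMPT = "Answer using only the context below. Cite sources."


def build_prompt(query: str, chunks: list[RetrievedChunk]) -> str:
    """Number each chunk, append its source, and place the user query after the context."""
    context_lines = [
        f"[{i}] {c.text} ({c.source} p.{c.page})"
        for i, c in enumerate(chunks, start=1)
    ]
    parts = [f"System: {SYSTEM_PROMPT}", "", "Context:", *context_lines, "", f"User: {query}"]
    return "\n".join(parts)


chunks = [
    RetrievedChunk("Electronics can be returned within 30 days if unopened.", "policy.pdf", 4),
    RetrievedChunk("Opened electronics are subject to 15% restocking fee.", "policy.pdf", 5),
    RetrievedChunk("Defective electronics: free replacement within 1 year.", "warranty.pdf", 2),
]
prompt = build_prompt("What is your refund policy for electronics?", chunks)
```

Numbering the chunks and keeping the source next to each one gives the model something concrete to cite, and lets the application map citations in the answer back to documents.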

Key Design Variations

Variation       | Description                                             | Use Case
----------------+---------------------------------------------------------+-------------------------------
Naive RAG       | Retrieve → generate once                                | Simple Q&A
RAG with Fusion | Query expansion + multiple retrieval strategies         | High recall needs
Self-RAG        | LLM decides whether to retrieve or not                  | Reducing unnecessary retrieval
Corrective RAG  | Check retrieval quality; re-retrieve if low             | Hallucination-critical apps
Agentic RAG     | LLM uses retrieval as a tool, can search multiple times | Complex multi-step questions
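
For the fusion variation, the ranked lists produced by several queries or retrievers have to be merged into one ranking. The text above does not say which merge method is meant; one widely used option is reciprocal rank fusion (RRF), sketched here as an assumption. Each document gets a score of 1 / (k + rank) from every list it appears in, and the scores are summed. The constant `k = 60` is a commonly used default, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three hypothetical result lists, e.g. from vector search, BM25, and a rewritten query.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

RRF only uses rank positions, so it can combine retrievers whose raw scores are on different scales without any normalization.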

Advanced Components (Optional)

  • HyDE (Hypothetical Document Embeddings) – Generate a hypothetical answer first, embed it, then retrieve documents similar to that embedding
  • Re-ranking – Cross-encoder (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) to reorder chunks
  • Context Compression – Summarize or extract only relevant sentences from long chunks
  • Hybrid Search – Combine vector search + keyword (BM25) for better precision on rare terms
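
Hybrid search can be sketched as a weighted blend of a keyword score and a vector score. The example below is illustrative, with several assumptions: it uses a compact BM25 implementation with whitespace tokenization and common default parameters (`k1 = 1.5`, `b = 0.75`), the vector scores are made-up numbers standing in for cosine similarities, and min-max normalization plus a weight `alpha` is just one of several ways to combine the two signals. It also assumes a non-empty document list.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25 (whitespace tokens, lowercased)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def min_max(values: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(vector_scores: list[float], keyword_scores: list[float], alpha: float = 0.5) -> list[float]:
    """alpha weights the vector score; (1 - alpha) weights the keyword score."""
    v, kw = min_max(vector_scores), min_max(keyword_scores)
    return [alpha * a + (1 - alpha) * k for a, k in zip(v, kw)]


docs = ["error code E42 on startup", "general startup guide", "battery care tips"]
keyword = bm25_scores("E42", docs)
vector = [0.2, 0.9, 0.1]  # ASSUMPTION: hypothetical cosine similarities for the same query
combined = hybrid_scores(vector, keyword, alpha=0.5)
```

In this made-up case the vector scores alone would rank the generic startup guide first, while the exact match on the rare token "E42" pulls the correct document to the top of the combined ranking.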

Common Failure Modes & Mitigations

Problem               | Mitigation
----------------------+-----------------------------------------------------
Low relevance chunks  | Re-ranking, better chunk sizing, hybrid search
Missing context       | Query rewriting, multi-step retrieval, feedback loop
LLM ignores context   | Prompt constraints, instruction fine-tuning
High latency          | Cache embeddings, smaller LLM, semantic caching
Outdated indexed docs | Incremental indexing, freshness monitoring
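
Incremental indexing can be approximated by comparing content hashes, so that only new or changed documents are re-chunked and re-embedded. The sketch below is one possible approach, not a prescribed one: `plan_index_update` is a hypothetical helper, and it assumes the indexer stores a hash per document ID alongside the vectors.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Return (ids to re-embed and upsert, ids to delete from the index)."""
    to_upsert = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_upsert, to_delete


# Hypothetical state: "a" changed, "b" is unchanged, "c" was removed, "d" is new.
indexed = {"a": content_hash("old a"), "b": content_hash("same b"), "c": content_hash("gone")}
current = {"a": "new a", "b": "same b", "d": "brand new"}
to_upsert, to_delete = plan_index_update(current, indexed)
```

Deleting the chunks of removed documents matters as much as adding new ones, since stale chunks left in the index can still be retrieved and cited.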

Where RAG Fits in Customer Support

In your GenAI support solution, RAG would be used to retrieve from:

  • Product manuals – Troubleshooting steps
  • Policy documents – Refund, shipping, warranty
  • Past tickets – Similar resolved cases
  • Internal wikis – Agent-only knowledge
