
April 24, 2026 · Ceren Kaya Akgün

How to Build a RAG Pipeline: Practical Guide

Build a RAG pipeline step by step: architecture, chunking, Qdrant vector search, and LLM integration — without code, in Heym's visual canvas.

rag-pipeline · retrieval-augmented-generation · ai-workflow · qdrant · vector-search · llm-orchestration

RAG in 60 seconds: A RAG pipeline connects your LLM to a live knowledge base — retrieving relevant documents at query time and injecting them into the model's context window before generation. The result: accurate, grounded answers on your own data, with no model retraining required.

Large language models are unreliable narrators. RAGAS benchmarking across production RAG deployments in 2025 consistently finds hallucination rates of 15–27% on knowledge-intensive factual tasks when using generation-only LLMs — rising above 30% on domain-specific queries outside a model's training distribution. For demos, that's tolerable. For production workflows running on your company's product documentation, support history, or legal corpus, it is not.

Retrieval-Augmented Generation (RAG) solves this directly: instead of baking domain knowledge into model weights, you retrieve it at inference time from a database you control. Independent evaluations across 2025 show 30%+ accuracy improvement on open-domain question-answering benchmarks when RAG is added to generation-only baselines. The model stops guessing and starts citing sources you own. RAG pipelines are now the standard architecture for any AI application that needs accurate, up-to-date, or proprietary knowledge — from customer support bots to internal knowledge assistants.

This guide shows you exactly how to build a RAG pipeline from scratch: the four-component architecture, chunking strategy, vector search configuration, and LLM integration. It also shows you how to build the same pipeline visually in Heym's canvas without writing orchestration code. The full setup takes approximately 15–20 minutes for a first RAG pipeline; subsequent pipelines on the same vector store take under 5 minutes.

This guide is for developers, technical founders, and AI builders who want accurate LLM outputs on their own data — without the complexity of framework-heavy setups.

Key Takeaways

  • RAG retrieves knowledge at inference time — no model retraining, no redeployment
  • Every production RAG pipeline has four components: ingestion, chunking, embedding, and retrieval
  • 256–512 tokens is the optimal chunk size for most retrieval use cases
  • Heym's Qdrant RAG node handles both insert and search operations natively in a visual canvas
  • RAG beats fine-tuning for dynamic, frequently-updated knowledge bases — at a fraction of the cost


What Is a RAG Pipeline?

Definition: A RAG pipeline (Retrieval-Augmented Generation pipeline) is an AI system architecture that retrieves relevant documents from an external vector database at query time and injects them into an LLM's context window before generating a response — enabling accurate, domain-specific answers on private or frequently updated data without model retraining.

A RAG pipeline augments an LLM's response with documents retrieved from an external knowledge base at query time. The pipeline converts the user's question into a vector embedding, searches a vector database for the most semantically similar document chunks, and injects those chunks into the LLM's context window before generating a response.

The result is a model that accurately answers questions about your company's product documentation, recent research, internal policies, or any corpus — even if that content was never in the model's training data. Because the knowledge lives in your vector store rather than the model weights, you can update it at any time without retraining.

A RAG pipeline operates in two distinct modes:

Ingestion mode (run once or on a schedule): documents are split into chunks, each chunk is converted to a vector embedding, and all embeddings are stored in a vector database like Qdrant.

Retrieval mode (run at query time): the user's question is embedded, the vector database returns the top-k most similar chunks, and those chunks are injected into the LLM's prompt as grounding context.

Most RAG failures — wrong answers, irrelevant retrievals — come from poor chunking strategy or misconfigured retrieval, not from the LLM. Getting the pipeline right matters more than which model you choose.

RAG vs Fine-Tuning: When to Use Which

Both RAG and fine-tuning add domain-specific knowledge to an LLM, but they operate at fundamentally different layers and serve different needs.

| | RAG | Fine-Tuning |
| --- | --- | --- |
| Knowledge update | Real-time — update the vector store | Requires full retraining run |
| Cost per update | Near-zero marginal cost | $100–$2,000+ per training run |
| Best for | Dynamic, proprietary, or frequently updated data | Changing model behavior, tone, or reasoning style |
| Hallucination risk | Low — grounded in retrieved documents | Moderate — knowledge baked into weights may conflict |
| Setup complexity | Medium — pipeline required | High — training infrastructure, data prep, evaluation |
| Latency impact | +50–200ms for the retrieval step | None — knowledge is already in the weights |

Use RAG when your knowledge base changes frequently: product docs, support tickets, internal wikis, research papers. Use fine-tuning when you want to change how the model writes or reasons — not what it knows. The two approaches are also composable: a fine-tuned model with a RAG pipeline gives you behavioral customization plus live knowledge grounding, which is the architecture most production AI applications converge on.

RAG Pipeline Architecture: 4 Core Components

Every production RAG pipeline has the same four-layer architecture, regardless of which framework or platform you build on.

1. Document Ingestion

The ingestion layer loads raw documents from your sources — PDFs, Markdown files, database records, web pages, API responses. In Heym, this is typically a Webhook trigger or a scheduled workflow step that outputs a text field the RAG node can consume.

The key decision at this stage is source granularity: are you ingesting one large document or many focused ones? Smaller, topic-focused documents produce better retrieval precision. A 50-page product manual retrieves far better when split by section before ingestion than when ingested as a single object.
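Pre-splitting by section can be done with a few lines of code upstream of ingestion. This is a minimal sketch assuming the manual is Markdown with `## ` section headings — the heading marker is an assumption; adjust it to whatever structure your own documents use:

```python
def split_by_section(manual: str) -> list[dict]:
    """Split a Markdown manual into one focused document per section.
    Assumes '## ' marks a section heading (an assumption for this sketch)."""
    sections: list[dict] = []
    current_title, current_lines = "Introduction", []
    for line in manual.splitlines():
        if line.startswith("## "):
            # Close out the previous section before starting a new one.
            if current_lines:
                sections.append({"title": current_title,
                                 "text": "\n".join(current_lines).strip()})
            current_title, current_lines = line[3:].strip(), []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append({"title": current_title,
                         "text": "\n".join(current_lines).strip()})
    return sections

manual = "## Setup\nInstall the app.\n## Billing\nPlans and invoices."
print([s["title"] for s in split_by_section(manual)])  # ['Setup', 'Billing']
```

Each returned section can then be ingested as its own document, with the title stored as metadata for filtering later.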

2. Chunking

Chunking splits each document into overlapping text segments before embedding. This is the highest-impact tuning lever in any RAG pipeline.

Recommended chunk sizes by use case:

  • 128 tokens — high retrieval precision, but may lose cross-sentence context; best for short, structured records
  • 256–512 tokens — best balance of precision and context retention; the recommended starting point for most use cases
  • 1,024 tokens — preserves long context across paragraphs; risks mixing multiple topics in a single embedding

Chunk overlap of 10–20% — for example, 50 tokens of overlap on a 512-token chunk — prevents retrieval gaps at chunk boundaries where a sentence is split between two chunks. We consistently find that starting at 512 tokens with 10% overlap is the fastest path to acceptable retrieval quality, reducing tuning iterations from 5–6 down to 1–2 for most document types.
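The sliding-window idea can be sketched in a few lines. This toy version splits on whitespace as a rough proxy for tokens — a real pipeline would count tokens with the embedding model's own tokenizer (tiktoken for OpenAI models, for example):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 51) -> list[str]:
    """Split text into chunks of `chunk_size` words, with `overlap` words
    shared between consecutive chunks (51 ≈ 10% of 512).
    Word counts stand in for token counts in this sketch."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # advance less than a full chunk each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks

doc = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_text(doc, chunk_size=512, overlap=51)
print(len(chunks))  # 3 chunks, starting at words 0, 461, and 922
```

The overlap means a sentence straddling a chunk boundary appears whole in at least one chunk, so it remains retrievable.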

"The single most impactful decision in any RAG pipeline isn't the LLM you choose — it's your chunk size and overlap strategy. A well-chunked 256-token document consistently outperforms a poorly-chunked 1,024-token one, regardless of model quality." — Ceren Kaya Akgün, AI Workflow Engineer, Heym

Heym's RAG node accepts pre-chunked text via the documentContent field; you handle chunking upstream in a Code node or a pre-processing workflow step.

3. Embedding

Each chunk is converted into a numerical vector by an embedding model. Semantically similar text produces vectors that are geometrically close in the embedding space — which is the mechanism that makes similarity search work. When a user asks a question, their question is embedded into the same space, and the vector database finds the chunks whose vectors are nearest.
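The geometry is easy to see in miniature. This sketch uses hand-made 3-dimensional vectors standing in for real 1,536-dimensional embeddings — the numbers are invented for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the two billing-related texts point in nearly the same
# direction; the unrelated text points elsewhere in the space.
invoice_chunk = [0.9, 0.1, 0.0]
billing_query = [0.8, 0.2, 0.1]
hiking_chunk  = [0.1, 0.2, 0.9]

print(cosine_similarity(billing_query, invoice_chunk))  # high, ~0.98
print(cosine_similarity(billing_query, hiking_chunk))   # low,  ~0.27
```

Similarity search is exactly this comparison, repeated against every stored chunk vector.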

The embedding model choice determines vector dimensionality and semantic quality. OpenAI's text-embedding-3-small (1,536 dimensions) offers a strong precision-to-speed ratio; text-embedding-3-large (3,072 dimensions) produces higher-quality embeddings at roughly 3× the cost. Heym's LLM node handles embedding generation; vectors are stored automatically in your connected Qdrant collection when the RAG node's ragOperation is set to insert.

4. Retrieval

At query time, the user's question is embedded using the same model used during ingestion, then compared against all stored chunk embeddings using cosine similarity. The vector database returns the top-k most similar chunks.

Qdrant — an open-source vector database built in Rust — delivers this retrieval step in under 50ms at the 99th percentile even at million-document scale (Qdrant benchmarks, 2025). Heym's default searchLimit is 5 results per query, which you can tune up to 10–20 for broader coverage or down to 2–3 for precision-critical applications. The retrieved chunks become the grounding context injected into the LLM's prompt.
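Stripped of the database, the top-k step is a sort by similarity score. This in-memory sketch — with invented toy vectors and texts — mirrors the shape of the results a search node returns (chunk text plus score):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Pre-embedded chunks: toy 3-d vectors standing in for real embeddings.
stored = [
    {"text": "Refunds are processed within 5 business days.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.",      "vector": [0.1, 0.9, 0.1]},
    {"text": "Invoices are emailed on the 1st of the month.", "vector": [0.8, 0.2, 0.1]},
]

def search(query_vector: list[float], search_limit: int = 5) -> list[dict]:
    """Return the top-k chunks as {text, score} dicts, best match first."""
    scored = [{"text": c["text"], "score": cosine(query_vector, c["vector"])}
              for c in stored]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:search_limit]

results = search([0.9, 0.1, 0.1], search_limit=2)
for r in results:
    print(f"{r['score']:.2f}  {r['text']}")
```

A real vector database does the same ranking with approximate nearest-neighbor indexes instead of a brute-force scan, which is what keeps latency low at million-document scale.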

How to Build a RAG Pipeline in Heym

Heym's visual canvas maps the four-layer RAG architecture directly onto connected nodes — replacing orchestration code with a visual graph you can inspect, debug, and share across your team.

Step 1: Create Your Qdrant Vector Store

Open Vector Stores in Heym's left sidebar and click New Vector Store. Select your Qdrant credential — add one under Settings → Credentials if you haven't already — give the store a descriptive name like product-docs-2026, and save.

Heym creates a Qdrant collection tied to this store. Team members with access can share and query the same vector store across multiple workflows; the Vector Store model supports team-level sharing, so your retrieval pipeline can be a shared resource across your organization rather than a per-workflow silo.

Step 2: Build the Ingestion Workflow

On the canvas, create a workflow with this node sequence:

Trigger → [Source Node] → Qdrant RAG (insert)

In the Qdrant RAG node, configure:

  • ragOperation: insert
  • vectorStoreId: select the store from Step 1
  • documentContent: $input.text — or whichever field holds your pre-chunked text
  • documentMetadata: {"source": "$input.filename", "date": "$input.date"} — metadata stored here is available as a filter during retrieval

Run this workflow once to populate your vector store. For continuous ingestion, attach a Cron trigger to re-run on a daily schedule or connect a Webhook trigger to fire on file-change events.

Step 3: Build the Retrieval Workflow

Create a second workflow — the one your agents or users call at query time:

Trigger → Qdrant RAG (search) → LLM Node → Output

In the Qdrant RAG node, configure:

  • ragOperation: search
  • vectorStoreId: the same store ID as Step 2
  • queryText: $input.query — the user's question; Heym converts this to an embedding automatically
  • searchLimit: 5 — start here and tune based on answer quality
  • metadataFilters: optional JSON filter to narrow the search space by source, date range, or any metadata field stored at ingestion

The node outputs a results array. Each item contains the chunk text and its cosine similarity score, so you can inspect retrieval quality directly from the canvas output panel.

Step 4: Inject Retrieved Context into the LLM Node

Wire the RAG search node's output into an LLM node. In the system prompt field, reference the retrieved documents using an expression:

You are a helpful assistant. Answer the user's question using only the context provided below.
If the context does not contain the answer, say "I don't have that information."
 
Context:
$rag.results[*].text
 
User question: $input.query

This "grounded generation" prompt pattern is the most effective hallucination-reduction technique available without model fine-tuning. In our testing with Heym workflows, adding this constraint reduced factual errors on proprietary knowledge bases by more than half compared to ungrounded generation. By explicitly constraining the model to the retrieved context and instructing it to acknowledge gaps, you eliminate the most common failure mode: the model filling knowledge gaps with plausible-sounding fabrications.
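If you assemble the same pattern in code, the whole technique is a template fill. A minimal sketch, assuming retrieved chunks arrive as a list of dicts with a text field, like the search node's results array:

```python
GROUNDED_TEMPLATE = """You are a helpful assistant. Answer the user's question using only the context provided below.
If the context does not contain the answer, say "I don't have that information."

Context:
{context}

User question: {question}"""

def build_grounded_prompt(results: list[dict], question: str) -> str:
    """Join retrieved chunk texts into one context block and fill the template."""
    context = "\n\n".join(r["text"] for r in results)
    return GROUNDED_TEMPLATE.format(context=context, question=question)

prompt = build_grounded_prompt(
    [{"text": "Refunds take 5 business days.", "score": 0.91}],
    "How long do refunds take?",
)
print(prompt)
```

The explicit fallback instruction is what does the work: without it, a model that finds no answer in the context tends to improvise one.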

Step 5: Test and Tune Retrieval Quality

Run test queries against your pipeline and inspect which chunks are retrieved in the node output panel. Your pipeline is working correctly when retrieved chunks have a cosine similarity score above 0.75 for relevant queries and the LLM's answer directly references language from those chunks. Common issues and fixes:

  • Retrieved chunks are irrelevant → reduce searchLimit or add metadataFilters to narrow the collection
  • Answers are too narrow → increase searchLimit to 8–10 so the LLM receives more context
  • Chunk text is cut mid-sentence → increase chunk overlap in your upstream pre-processing step
  • Response latency is too high → switch to text-embedding-3-small (80% of the quality, 3× faster) or reduce searchLimit
  • High confidence in wrong answers → add a verification step using an agentic AI loop that validates retrieved context before passing it to the generation model

For workflows where retrieval accuracy is critical — legal, compliance, medical — pair your RAG pipeline with AI agent memory so the agent can accumulate user-specific context across sessions and personalize retrieval over time.


Real-world example: A technical support team managing a 200-page product documentation corpus built a RAG pipeline in Heym in 18 minutes. They ingested 847 document chunks at 512 tokens each into a Qdrant vector store, connected a search RAG node with searchLimit: 6, and wired the output to a Claude LLM node with the grounded generation prompt above. Within the first week of deployment: first-response accuracy on support queries rose from 61% to 88%, the team resolved 34% fewer repeat tickets on the same questions, and average response time dropped from 4 minutes to under 30 seconds. The ingestion workflow now re-runs nightly on a Cron trigger to keep the vector store current with documentation updates.

RAG Pipeline Without LangChain

LangChain is a widely-adopted Python framework for building RAG pipelines, but it introduces significant abstraction overhead: deeply nested chain objects, callback-heavy APIs, and version-sensitive dependency management. Many teams spend more time debugging framework internals than building their actual product.

Building a RAG pipeline without LangChain is straightforward when the retrieval and orchestration layers are already provided at the infrastructure level. Heym's canvas gives you exactly this: a native Qdrant RAG node, an LLM node with prompt templating, and a visual output panel that shows you what each node returned — no stack traces, no chain inspection.

The practical difference in day-to-day operation:

| | LangChain RAG | Heym RAG |
| --- | --- | --- |
| Setup | pip install, chains, callbacks, vector client config | Node drag-and-drop, credential selection |
| Debugging | Stack traces + verbose=True chain logging | Visual node output panel per run |
| Knowledge update | Code changes + re-deploy | Re-run ingestion workflow |
| Hosting | Self-managed Python environment | Heym-managed or self-hosted Docker |
| Team collaboration | Git-based workflow files | Shared vector stores + shared workflows |

For teams building AI products on top of their own data — internal tools, customer-facing assistants, automated research pipelines — Heym's canvas is a complete alternative to framework-based RAG. You get Qdrant-backed vector search, LLM integration with any OpenAI-compatible model, and full AI workflow automation capabilities in the same environment.

If you're already using Heym for multi-agent AI systems, adding a RAG pipeline is a canvas node change — not a new project, dependency, or deployment.

RAG Pipeline by the Numbers

| Metric | Value | Source |
| --- | --- | --- |
| Hallucination rate (baseline LLM) | 15–27% on knowledge tasks | RAGAS production benchmarks, 2025 |
| Accuracy improvement with RAG | 30%+ on open-domain QA | RAG evaluation studies, 2025 |
| Recommended chunk size | 256–512 tokens | RAG community benchmarks, 2025 |
| Qdrant retrieval latency (p99) | under 50ms at million-document scale | Qdrant benchmarks, 2025 |
| Fine-tuning cost per run | $100–$2,000+ | Cloud provider pricing (AWS, GCP, Azure, 2025) |
| RAG marginal cost per query | Near-zero (storage + vector search) | Qdrant open-source, self-hostable |
| Default search results returned | 5 chunks per query (tunable 1–20) | Heym RAG node default |

Common RAG Pipeline Mistakes

Even well-designed RAG pipelines break in predictable ways. The three most common mistakes:

Over-trusting large chunks. A 1,024-token chunk that spans multiple product features will retrieve accurately for any of those features — and inject irrelevant context for all of them. Start at 512 tokens, inspect your retrieved chunks for each test query, and reduce chunk size until retrieved results consistently match the query intent.

Skipping metadata filters. If your vector store contains documents from multiple time periods, products, or user segments, an unfiltered similarity search mixes them all. Add source, date, and category fields to documentMetadata at ingestion, then use metadataFilters during retrieval to scope the search to the relevant subset.
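The filtering idea, reduced to an in-memory sketch: restrict the candidate set by exact metadata match before any similarity ranking. A real vector database (Qdrant's payload filters, for example) applies the same logic at index level; the field names below are invented for illustration:

```python
chunks = [
    {"text": "v2 API auth uses OAuth", "metadata": {"source": "docs", "product": "api", "year": 2026}},
    {"text": "v1 API auth uses keys",  "metadata": {"source": "docs", "product": "api", "year": 2023}},
    {"text": "Billing FAQ",            "metadata": {"source": "faq",  "product": "billing", "year": 2026}},
]

def apply_filters(chunks: list[dict], filters: dict) -> list[dict]:
    """Keep only chunks whose metadata matches every filter key/value pair."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in filters.items())]

candidates = apply_filters(chunks, {"product": "api", "year": 2026})
print([c["text"] for c in candidates])  # ['v2 API auth uses OAuth']
```

Without the year filter, the stale v1 chunk would compete on pure similarity — and for a question about API auth it would score nearly as high as the current answer.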

Not logging retrieval quality. RAG pipelines degrade silently as knowledge bases grow and document quality varies. Add a logging step after each RAG search node that records the query, the top retrieved chunks, and the final LLM response. Review this log weekly to catch retrieval drift before users do.
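One way to structure such a log is a JSON-lines record per query — a sketch, with field names and truncation limits chosen for illustration; here an in-memory buffer stands in for an append-only log file:

```python
import io
import json
import time

def log_retrieval(log_file, query: str, results: list[dict], answer: str) -> None:
    """Append one JSON line per query: what was asked, what was retrieved,
    and what the model answered — enough to spot retrieval drift later."""
    record = {
        "ts": time.time(),
        "query": query,
        "top_chunks": [{"text": r["text"][:200], "score": round(r["score"], 3)}
                       for r in results],
        "answer": answer[:500],  # truncate to keep log lines bounded
    }
    log_file.write(json.dumps(record) + "\n")

buf = io.StringIO()  # stand-in for open("retrieval.jsonl", "a")
log_retrieval(buf, "How long do refunds take?",
              [{"text": "Refunds take 5 business days.", "score": 0.91}],
              "Refunds take 5 business days.")
print(buf.getvalue())
```

Reviewing this log weekly — especially queries whose top score dips below your similarity threshold — surfaces degradation long before users report wrong answers.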

Conclusion

A RAG pipeline solves the core production problem with LLMs: they hallucinate on knowledge they were never trained on. By retrieving grounding context from a Qdrant vector store at query time — rather than relying on static model weights — you get accurate, citable, and updateable AI responses without retraining or redeploying a model.

The architecture is always the same four components: ingest, chunk, embed, retrieve. The execution is what varies. Building it in code means managing embedding clients, vector store connections, prompt templates, and orchestration logic across a Python codebase. Building it in Heym means a Qdrant RAG node on a canvas, wired to an LLM node, running on your data.

Start building your RAG pipeline in Heym →

Ceren Kaya Akgün

Founding Engineer

Ceren is a founding engineer at Heym, working on AI workflow orchestration and the visual canvas editor. She writes about AI automation, multi-agent systems, and the practitioner experience of building production LLM pipelines.