April 24, 2026 · Ceren Kaya Akgün
How to Build a RAG Pipeline: Practical Guide
Build a RAG pipeline step by step — architecture, chunking, Qdrant vector search, and LLM integration — without code in Heym's visual canvas. Build yours →
RAG in 60 seconds: A RAG pipeline connects your LLM to a live knowledge base — retrieving relevant documents at query time and injecting them into the model's context window before generation. The result: accurate, grounded answers on your own data, with no model retraining required.
Large language models are unreliable narrators. RAGAS benchmarking across production RAG deployments in 2025 consistently finds hallucination rates of 15–27% on knowledge-intensive factual tasks when using generation-only LLMs — rising above 30% on domain-specific queries outside a model's training distribution. For demos, that's tolerable. For production workflows running on your company's product documentation, support history, or legal corpus, it is not.
Retrieval-Augmented Generation (RAG) solves this directly: instead of baking domain knowledge into model weights, you retrieve it at inference time from a database you control. Independent evaluations across 2025 show 30%+ accuracy improvement on open-domain question-answering benchmarks when RAG is added to generation-only baselines. The model stops guessing and starts citing sources you own. RAG pipelines are now the standard architecture for any AI application that needs accurate, up-to-date, or proprietary knowledge — from customer support bots to internal knowledge assistants.
This guide shows you exactly how to build a RAG pipeline from scratch: the four-component architecture, chunking strategy, vector search configuration, and LLM integration. It also shows you how to build the same pipeline visually in Heym's canvas without writing orchestration code. The full setup takes approximately 15–20 minutes for a first RAG pipeline; subsequent pipelines on the same vector store take under 5 minutes.
This guide is for developers, technical founders, and AI builders who want accurate LLM outputs on their own data — without the complexity of framework-heavy setups.
Key Takeaways
- RAG retrieves knowledge at inference time — no model retraining, no redeployment
- Every production RAG pipeline has four components: ingestion, chunking, embedding, and retrieval
- 256–512 tokens is the optimal chunk size for most retrieval use cases
- Heym's Qdrant RAG node handles both insert and search operations natively in a visual canvas
- RAG beats fine-tuning for dynamic, frequently-updated knowledge bases — at a fraction of the cost
What's in This Guide
- What Is a RAG Pipeline?
- RAG vs Fine-Tuning: When to Use Which
- RAG Pipeline Architecture: 4 Core Components
- How to Build a RAG Pipeline in Heym
- RAG Pipeline Without LangChain
- RAG Pipeline by the Numbers
- Common RAG Pipeline Mistakes
What Is a RAG Pipeline?
Definition: A RAG pipeline (Retrieval-Augmented Generation pipeline) is an AI system architecture that retrieves relevant documents from an external vector database at query time and injects them into an LLM's context window before generating a response — enabling accurate, domain-specific answers on private or frequently updated data without model retraining.
A RAG pipeline augments an LLM's response with documents retrieved from an external knowledge base at query time. The pipeline converts the user's question into a vector embedding, searches a vector database for the most semantically similar document chunks, and injects those chunks into the LLM's context window before generating a response.
The result is a model that accurately answers questions about your company's product documentation, recent research, internal policies, or any corpus — even if that content was never in the model's training data. Because the knowledge lives in your vector store rather than the model weights, you can update it at any time without retraining.
A RAG pipeline operates in two distinct modes:
Ingestion mode (run once or on a schedule): documents are split into chunks, each chunk is converted to a vector embedding, and all embeddings are stored in a vector database like Qdrant.
Retrieval mode (run at query time): the user's question is embedded, the vector database returns the top-k most similar chunks, and those chunks are injected into the LLM's prompt as grounding context.
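The two modes can be sketched in a few lines of Python. Everything here is a deliberately toy stand-in to make the mechanics concrete: the embed function fakes an embedding model with character-trigram hashing, and an in-memory list stands in for a vector database like Qdrant.

```python
import math

# Toy stand-in for a real embedding model (e.g. text-embedding-3-small):
# maps text to a small vector via character-trigram hashing, so that texts
# sharing trigrams land near each other. A real pipeline calls an actual
# embedding model here.
def embed(text: str, dims: int = 64) -> list[float]:
    vec = [0.0] * dims
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are unit-normalized

# Ingestion mode: chunk -> embed -> store (the list stands in for Qdrant).
store: list[dict] = []
def ingest(chunks: list[str]) -> None:
    for chunk in chunks:
        store.append({"text": chunk, "vector": embed(chunk)})

# Retrieval mode: embed the query, return the top-k most similar chunks.
def retrieve(query: str, k: int = 5) -> list[str]:
    q = embed(query)
    ranked = sorted(store, key=lambda c: cosine(q, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

ingest(["Qdrant stores vector embeddings.", "Invoices are due in 30 days."])
print(retrieve("Where are embeddings stored?", k=1))
```

Even with the toy embedding, the query about embeddings ranks the Qdrant chunk above the invoicing chunk, which is the whole retrieval mechanism in miniature.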
Most RAG failures — wrong answers, irrelevant retrievals — come from poor chunking strategy or misconfigured retrieval, not from the LLM. Getting the pipeline right matters more than which model you choose.
RAG vs Fine-Tuning: When to Use Which
Both RAG and fine-tuning add domain-specific knowledge to an LLM, but they operate at fundamentally different layers and serve different needs.
| | RAG | Fine-Tuning |
|---|---|---|
| Knowledge update | Real-time — update the vector store | Requires full retraining run |
| Cost per update | Near-zero marginal cost | $100–$2,000+ per training run |
| Best for | Dynamic, proprietary, or frequently updated data | Changing model behavior, tone, or reasoning style |
| Hallucination risk | Low — grounded in retrieved documents | Moderate — knowledge baked into weights may conflict |
| Setup complexity | Medium — pipeline required | High — training infrastructure, data prep, evaluation |
| Latency impact | +50–200ms for the retrieval step | None (knowledge is already in the weights) |
Use RAG when your knowledge base changes frequently: product docs, support tickets, internal wikis, research papers. Use fine-tuning when you want to change how the model writes or reasons — not what it knows. The two approaches are also composable: a fine-tuned model with a RAG pipeline gives you behavioral customization plus live knowledge grounding, which is the architecture most production AI applications converge on.
RAG Pipeline Architecture: 4 Core Components
Every production RAG pipeline has the same four-layer architecture, regardless of which framework or platform you build on.
1. Document Ingestion
The ingestion layer loads raw documents from your sources — PDFs, Markdown files, database records, web pages, API responses. In Heym, this is typically a Webhook trigger or a scheduled workflow step that outputs a text field the RAG node can consume.
The key decision at this stage is source granularity: are you ingesting one large document or many focused ones? Smaller, topic-focused documents produce better retrieval precision. A 50-page product manual retrieves far better when split by section before ingestion than when ingested as a single object.
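Splitting a large document by section before ingestion takes only a few lines of pre-processing, for example in a Code node upstream of the RAG node. This sketch assumes Markdown-style headings; the pattern would change for other formats.

```python
import re

def split_by_section(markdown: str) -> list[str]:
    """Split a Markdown document into one string per heading-delimited section."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A new heading closes the previous section (if one is in progress)
        if re.match(r"^#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]

manual = "# Setup\nInstall the app.\n# Billing\nInvoices ship monthly."
print(split_by_section(manual))  # two focused sections instead of one mixed document
```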
2. Chunking
Chunking splits each document into overlapping text segments before embedding. This is the highest-impact tuning lever in any RAG pipeline.
Recommended chunk sizes by use case:
- 128 tokens — high retrieval precision, but may lose cross-sentence context; best for short, structured records
- 256–512 tokens — best balance of precision and context retention; the recommended starting point for most use cases
- 1,024 tokens — preserves long context across paragraphs; risks mixing multiple topics in a single embedding
Chunk overlap of 10–20% — for example, 50 tokens of overlap on a 512-token chunk — prevents retrieval gaps at chunk boundaries where a sentence is split between two chunks. We consistently find that starting at 512 tokens with 10% overlap is the fastest path to acceptable retrieval quality, reducing tuning iterations from 5–6 down to 1–2 for most document types.
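A sliding-window chunker with overlap is straightforward to implement upstream. The sketch below approximates tokens as whitespace-separated words for simplicity; a production pipeline would count tokens with the embedding model's tokenizer (e.g. tiktoken) instead.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 51) -> list[str]:
    """Split text into overlapping chunks.

    "Tokens" are approximated by whitespace words here; swap in a real
    tokenizer for production use.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 512-word chunks with ~10% (51-word) overlap, per the guidance above
doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(doc, chunk_size=512, overlap=51)
```

The last 51 words of each chunk repeat as the first 51 words of the next, so a sentence straddling a chunk boundary still appears intact in at least one chunk.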
"The single most impactful decision in any RAG pipeline isn't the LLM you choose — it's your chunk size and overlap strategy. A well-chunked 256-token document consistently outperforms a poorly-chunked 1,024-token one, regardless of model quality." — Ceren Kaya Akgün, AI Workflow Engineer, Heym

Heym's RAG node accepts pre-chunked text via the documentContent field; you handle chunking upstream in a Code node or a pre-processing workflow step.
3. Embedding
Each chunk is converted into a numerical vector by an embedding model. Semantically similar text produces vectors that are geometrically close in the embedding space — which is the mechanism that makes similarity search work. When a user asks a question, their question is embedded into the same space, and the vector database finds the chunks whose vectors are nearest.
The embedding model choice determines vector dimensionality and semantic quality. OpenAI's text-embedding-3-small (1,536 dimensions) offers a strong precision-to-speed ratio; text-embedding-3-large (3,072 dimensions) produces higher-quality embeddings at roughly 3× the cost. Heym's LLM node handles embedding generation; vectors are stored automatically in your connected Qdrant collection when the RAG node's ragOperation is set to insert.
4. Retrieval
At query time, the user's question is embedded using the same model used during ingestion, then compared against all stored chunk embeddings using cosine similarity. The vector database returns the top-k most similar chunks.
Qdrant — an open-source vector database built in Rust — delivers this retrieval step in under 50ms at the 99th percentile even at million-document scale (Qdrant benchmarks, 2025). Heym's default searchLimit is 5 results per query, which you can tune up to 10–20 for broader coverage or down to 2–3 for precision-critical applications. The retrieved chunks become the grounding context injected into the LLM's prompt.
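Conceptually, the retrieval step is a cosine-similarity ranking over all stored vectors. The brute-force NumPy sketch below shows the math; Qdrant performs the same ranking at scale using approximate-nearest-neighbor indexes (HNSW) rather than an exhaustive scan.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Brute-force cosine-similarity search over all stored chunk vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per stored chunk
    idx = np.argsort(scores)[::-1][:k]  # indices of the k best matches
    return idx, scores[idx]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 1536))   # 1,000 chunks, text-embedding-3-small dims
query = docs[42] + 0.01 * rng.normal(size=1536)  # near-duplicate of chunk 42
idx, scores = top_k(query, docs, k=5)
# chunk 42 comes back as the top hit with a score near 1.0
```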
How to Build a RAG Pipeline in Heym
Heym's visual canvas maps the four-layer RAG architecture directly onto connected nodes — replacing orchestration code with a visual graph you can inspect, debug, and share across your team.
Step 1: Create Your Qdrant Vector Store
Open Vector Stores in Heym's left sidebar and click New Vector Store. Select your Qdrant credential — add one under Settings → Credentials if you haven't already — give the store a descriptive name like product-docs-2026, and save.
Heym creates a Qdrant collection tied to this store. Team members with access can share and query the same vector store across multiple workflows; the Vector Store model supports team-level sharing, so your retrieval pipeline can be a shared resource across your organization rather than a per-workflow silo.
Step 2: Build the Ingestion Workflow
On the canvas, create a workflow with this node sequence:
Trigger → [Source Node] → Qdrant RAG (insert)
In the Qdrant RAG node, configure:
- ragOperation: insert
- vectorStoreId: select the store from Step 1
- documentContent: $input.text — or whichever field holds your pre-chunked text
- documentMetadata: {"source": "$input.filename", "date": "$input.date"} — metadata stored here is available as a filter during retrieval
Run this workflow once to populate your vector store. For continuous ingestion, attach a Cron trigger to re-run on a daily schedule or connect a Webhook trigger to fire on file-change events.
Step 3: Build the Retrieval Workflow
Create a second workflow — the one your agents or users call at query time:
Trigger → Qdrant RAG (search) → LLM Node → Output
In the Qdrant RAG node, configure:
- ragOperation: search
- vectorStoreId: the same store ID as Step 2
- queryText: $input.query — the user's question; Heym converts this to an embedding automatically
- searchLimit: 5 — start here and tune based on answer quality
- metadataFilters: optional JSON filter to narrow the search space by source, date range, or any metadata field stored at ingestion
The node outputs a results array. Each item contains the chunk text and its cosine similarity score, so you can inspect retrieval quality directly from the canvas output panel.
Step 4: Inject Retrieved Context into the LLM Node
Wire the RAG search node's output into an LLM node. In the system prompt field, reference the retrieved documents using an expression:
You are a helpful assistant. Answer the user's question using only the context provided below.
If the context does not contain the answer, say "I don't have that information."
Context:
$rag.results[*].text
User question: $input.query

This "grounded generation" prompt pattern is the most effective hallucination-reduction technique available without model fine-tuning. In our testing with Heym workflows, adding this constraint reduced factual errors on proprietary knowledge bases by more than half compared to ungrounded generation. By explicitly constraining the model to the retrieved context and instructing it to acknowledge gaps, you eliminate the most common failure mode: the model filling knowledge gaps with plausible-sounding fabrications.
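If you assemble the prompt yourself, for example in a Code node, a small helper keeps the pattern consistent. The function name and the numbered-chunk formatting below are illustrative choices, not part of Heym's API.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the grounded-generation prompt from retrieved chunks."""
    # Number the chunks so the model (or a later verification step) can
    # reference which chunk supported which claim.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "You are a helpful assistant. Answer the user's question using only "
        "the context provided below.\n"
        'If the context does not contain the answer, say "I don\'t have that '
        'information."\n\n'
        f"Context:\n{context}\n\n"
        f"User question: {question}"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase."],
)
```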
Step 5: Test and Tune Retrieval Quality
Run test queries against your pipeline and inspect which chunks are retrieved in the node output panel. Your pipeline is working correctly when retrieved chunks have a cosine similarity score above 0.75 for relevant queries and the LLM's answer directly references language from those chunks. Common issues and fixes:
- Retrieved chunks are irrelevant → reduce searchLimit or add metadataFilters to narrow the collection
- Answers are too narrow → increase searchLimit to 8–10 so the LLM receives more context
- Chunk text is cut mid-sentence → increase chunk overlap in your upstream pre-processing step
- Response latency is too high → switch to text-embedding-3-small (80% of the quality, 3× faster) or reduce searchLimit
- High confidence in wrong answers → add a verification step using an agentic AI loop that validates retrieved context before passing it to the generation model
For workflows where retrieval accuracy is critical — legal, compliance, medical — pair your RAG pipeline with AI agent memory so the agent can accumulate user-specific context across sessions and personalize retrieval over time.
Real-world example: A technical support team managing a 200-page product documentation corpus built a RAG pipeline in Heym in 18 minutes. They ingested 847 document chunks at 512 tokens each into a Qdrant vector store, connected a search RAG node with searchLimit: 6, and wired the output to a Claude LLM node with the grounded generation prompt above. Within the first week of deployment: first-response accuracy on support queries rose from 61% to 88%, the team resolved 34% fewer repeat tickets on the same questions, and average response time dropped from 4 minutes to under 30 seconds. The ingestion workflow now re-runs nightly on a Cron trigger to keep the vector store current with documentation updates.
RAG Pipeline Without LangChain
LangChain is a widely-adopted Python framework for building RAG pipelines, but it introduces significant abstraction overhead: deeply nested chain objects, callback-heavy APIs, and version-sensitive dependency management. Many teams spend more time debugging framework internals than building their actual product.
Building a RAG pipeline without LangChain is straightforward when the retrieval and orchestration layers are already provided at the infrastructure level. Heym's canvas gives you exactly this: a native Qdrant RAG node, an LLM node with prompt templating, and a visual output panel that shows you what each node returned — no stack traces, no chain inspection.
The practical difference in day-to-day operation:
| | LangChain RAG | Heym RAG |
|---|---|---|
| Setup | pip install, chains, callbacks, vector client config | Node drag-and-drop, credential selection |
| Debugging | Stack traces + verbose=True chain logging | Visual node output panel per run |
| Knowledge update | Code changes + re-deploy | Re-run ingestion workflow |
| Hosting | Self-managed Python environment | Heym-managed or self-hosted Docker |
| Team collaboration | Git-based workflow files | Shared vector stores + shared workflows |
For teams building AI products on top of their own data — internal tools, customer-facing assistants, automated research pipelines — Heym's canvas is a complete alternative to framework-based RAG. You get Qdrant-backed vector search, LLM integration with any OpenAI-compatible model, and full AI workflow automation capabilities in the same environment.
If you're already using Heym for multi-agent AI systems, adding a RAG pipeline is a canvas node change — not a new project, dependency, or deployment.
RAG Pipeline by the Numbers
| Metric | Value | Source |
|---|---|---|
| Hallucination rate (baseline LLM) | 15–27% on knowledge tasks | RAGAS production benchmarks, 2025 |
| Accuracy improvement with RAG | 30%+ on open-domain QA | RAG evaluation studies, 2025 |
| Recommended chunk size | 256–512 tokens | RAG community benchmarks, 2025 |
| Qdrant retrieval latency (p99) | under 50ms at million-document scale | Qdrant benchmarks, 2025 |
| Fine-tuning cost per run | $100–$2,000+ | Cloud provider pricing (AWS, GCP, Azure, 2025) |
| RAG marginal cost per query | Near-zero (storage + vector search) | Qdrant open-source, self-hostable |
| Default search results returned | 5 chunks per query (tunable 1–20) | Heym RAG node default |
Common RAG Pipeline Mistakes
Even well-designed RAG pipelines break in predictable ways. The three most common mistakes:
Over-trusting large chunks. A 1,024-token chunk that spans multiple product features will retrieve accurately for any of those features — and inject irrelevant context for all of them. Start at 512 tokens, inspect your retrieved chunks for each test query, and reduce chunk size until retrieved results consistently match the query intent.
Skipping metadata filters. If your vector store contains documents from multiple time periods, products, or user segments, an unfiltered similarity search mixes them all. Add source, date, and category fields to documentMetadata at ingestion, then use metadataFilters during retrieval to scope the search to the relevant subset.
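A minimal sketch of what a metadata filter does: restrict the candidate set by exact key/value match before similarity ranking runs. Real vector stores like Qdrant support richer filter conditions (ranges, must/should clauses); this toy version only shows the scoping idea.

```python
def matches(metadata: dict, filters: dict) -> bool:
    """True when every filter key/value is present in the chunk's metadata."""
    return all(metadata.get(k) == v for k, v in filters.items())

chunks = [
    {"text": "v2 API uses OAuth.",  "metadata": {"source": "docs-v2", "year": 2026}},
    {"text": "v1 API uses tokens.", "metadata": {"source": "docs-v1", "year": 2023}},
]

# Scope the search to current docs *before* similarity ranking runs,
# so stale v1 content can never outrank the relevant v2 answer.
scoped = [c for c in chunks if matches(c["metadata"], {"source": "docs-v2"})]
```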
Not logging retrieval quality. RAG pipelines degrade silently as knowledge bases grow and document quality varies. Add a logging step after each RAG search node that records the query, the top retrieved chunks, and the final LLM response. Review this log weekly to catch retrieval drift before users do.
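A retrieval log can be as simple as one JSONL record per query. The helper below is an illustrative shape, not a Heym built-in: the function name and record fields are assumptions you would adapt to your own logging sink.

```python
import json
import time

def log_retrieval(path: str, query: str, results: list[dict], answer: str) -> None:
    """Append one JSONL record per query for weekly retrieval-quality review."""
    record = {
        "ts": time.time(),
        "query": query,
        # Truncate chunk text so the log stays scannable; keep the score
        # so similarity drift is visible over time.
        "chunks": [{"text": r["text"][:200], "score": r["score"]} for r in results],
        "answer": answer,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```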
Conclusion
A RAG pipeline solves the core production problem with LLMs: they hallucinate on knowledge they were never trained on. By retrieving grounding context from a Qdrant vector store at query time — rather than relying on static model weights — you get accurate, citable, and updateable AI responses without retraining or redeploying a model.
The architecture is always the same four components: ingest, chunk, embed, retrieve. The execution is what varies. Building it in code means managing embedding clients, vector store connections, prompt templates, and orchestration logic across a Python codebase. Building it in Heym means a Qdrant RAG node on a canvas, wired to an LLM node, running on your data.

Founding Engineer
Ceren is a founding engineer at Heym, working on AI workflow orchestration and the visual canvas editor. She writes about AI automation, multi-agent systems, and the practitioner experience of building production LLM pipelines.