
April 24, 2026 · Ceren Kaya Akgün

How to Build a RAG Pipeline: Practical Guide

Build a RAG pipeline step by step: architecture, chunking, Qdrant vector search, and LLM integration — without code, in Heym's visual canvas.

rag-pipeline · retrieval-augmented-generation · ai-workflow · qdrant · vector-search · llm-orchestration

RAG in 60 seconds: A RAG pipeline connects your LLM to a live knowledge base — retrieving relevant documents at query time and injecting them into the model's context window before generation. The result: accurate, grounded answers on your own data, with no model retraining required.

Large language models are unreliable narrators. RAGAS benchmarking across production RAG deployments in 2025 consistently finds hallucination rates of 15–27% on knowledge-intensive factual tasks when using generation-only LLMs — rising above 30% on domain-specific queries outside a model's training distribution. For demos, that's tolerable. For production workflows running on your company's product documentation, support history, or legal corpus, it is not.

Retrieval-Augmented Generation (RAG) solves this directly: instead of baking domain knowledge into model weights, you retrieve it at inference time from a database you control. Independent evaluations across 2025 show 30%+ accuracy improvement on open-domain question-answering benchmarks when RAG is added to generation-only baselines. The model stops guessing and starts citing sources you own. RAG pipelines are now the standard architecture for any AI application that needs accurate, up-to-date, or proprietary knowledge — from customer support bots to internal knowledge assistants.

This guide shows you exactly how to build a RAG pipeline from scratch: the four-component architecture, chunking strategy, vector search configuration, and LLM integration. It also shows you how to build the same pipeline visually in Heym's canvas without writing orchestration code. The full setup takes approximately 15–20 minutes for a first RAG pipeline; subsequent pipelines on the same vector store take under 5 minutes.

This guide is for developers, technical founders, and AI builders who want accurate LLM outputs on their own data — without the complexity of framework-heavy setups.

Key Takeaways

  • RAG retrieves knowledge at inference time — no model retraining, no redeployment
  • Every production RAG pipeline has four components: ingestion, chunking, embedding, and retrieval
  • 256–512 tokens is the optimal chunk size for most retrieval use cases
  • Heym's Qdrant RAG node handles both insert and search operations natively in a visual canvas
  • RAG beats fine-tuning for dynamic, frequently-updated knowledge bases — at a fraction of the cost


What Is a RAG Pipeline?

Definition: A RAG pipeline (Retrieval-Augmented Generation pipeline) is an AI system architecture that retrieves relevant documents from an external vector database at query time and injects them into an LLM's context window before generating a response — enabling accurate, domain-specific answers on private or frequently updated data without model retraining.

A RAG pipeline augments an LLM's response with documents retrieved from an external knowledge base at query time. The pipeline converts the user's question into a vector embedding, searches a vector database for the most semantically similar document chunks, and injects those chunks into the LLM's context window before generating a response.

The result is a model that accurately answers questions about your company's product documentation, recent research, internal policies, or any corpus — even if that content was never in the model's training data. Because the knowledge lives in your vector store rather than the model weights, you can update it at any time without retraining.

A RAG pipeline operates in two distinct modes:

Ingestion mode (run once or on a schedule): documents are split into chunks, each chunk is converted to a vector embedding, and all embeddings are stored in a vector database like Qdrant.

Retrieval mode (run at query time): the user's question is embedded, the vector database returns the top-k most similar chunks, and those chunks are injected into the LLM's prompt as grounding context.

Most RAG failures — wrong answers, irrelevant retrievals — come from poor chunking strategy or misconfigured retrieval, not from the LLM. Getting the pipeline right matters more than which model you choose.

RAG vs Fine-Tuning: When to Use Which

Both RAG and fine-tuning add domain-specific knowledge to an LLM, but they operate at fundamentally different layers and serve different needs.

| | RAG | Fine-Tuning |
| --- | --- | --- |
| Knowledge update | Real-time — update the vector store | Requires full retraining run |
| Cost per update | Near-zero marginal cost | $100–$2,000+ per training run |
| Best for | Dynamic, proprietary, or frequently updated data | Changing model behavior, tone, or reasoning style |
| Hallucination risk | Low — grounded in retrieved documents | Moderate — knowledge baked into weights may conflict |
| Setup complexity | Medium — pipeline required | High — training infrastructure, data prep, evaluation |
| Latency impact | +50–200ms for the retrieval step | None — knowledge is already in the weights |

Use RAG when your knowledge base changes frequently: product docs, support tickets, internal wikis, research papers. Use fine-tuning when you want to change how the model writes or reasons — not what it knows. The two approaches are also composable: a fine-tuned model with a RAG pipeline gives you behavioral customization plus live knowledge grounding, which is the architecture most production AI applications converge on.

RAG Pipeline Architecture: 4 Core Components

Every production RAG pipeline has the same four-layer architecture, regardless of which framework or platform you build on.

1. Document Ingestion

The ingestion layer loads raw documents from your sources — PDFs, Markdown files, database records, web pages, API responses. In Heym, this is typically a Webhook trigger or a scheduled workflow step that outputs a text field the RAG node can consume.

The key decision at this stage is source granularity: are you ingesting one large document or many focused ones? Smaller, topic-focused documents produce better retrieval precision. A 50-page product manual retrieves far better when split by section before ingestion than when ingested as a single object.
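Pre-splitting by section can be done with a few lines of code upstream of ingestion. This is a minimal sketch assuming the manual is Markdown with `## ` section headings — the heading marker is an assumption; adjust it to whatever structure your own documents use:

```python
def split_by_section(manual: str) -> list[dict]:
    """Split a Markdown manual into one focused document per section.
    Assumes '## ' marks a section heading (an assumption for this sketch)."""
    sections: list[dict] = []
    current_title, current_lines = "Introduction", []
    for line in manual.splitlines():
        if line.startswith("## "):
            # Close out the previous section before starting a new one.
            if current_lines:
                sections.append({"title": current_title,
                                 "text": "\n".join(current_lines).strip()})
            current_title, current_lines = line[3:].strip(), []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append({"title": current_title,
                         "text": "\n".join(current_lines).strip()})
    return sections

manual = "## Setup\nInstall the app.\n## Billing\nPlans and invoices."
print([s["title"] for s in split_by_section(manual)])  # ['Setup', 'Billing']
```

Each returned section can then be ingested as its own document, with the title stored as metadata for filtering later.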

2. Chunking

Chunking splits each document into overlapping text segments before embedding. This is the highest-impact tuning lever in any RAG pipeline.

Recommended chunk sizes by use case:

  • 128 tokens — high retrieval precision, but may lose cross-sentence context; best for short, structured records
  • 256–512 tokens — best balance of precision and context retention; the recommended starting point for most use cases
  • 1,024 tokens — preserves long context across paragraphs; risks mixing multiple topics in a single embedding

Chunk overlap of 10–20% — for example, 50 tokens of overlap on a 512-token chunk — prevents retrieval gaps at chunk boundaries where a sentence is split between two chunks. We consistently find that starting at 512 tokens with 10% overlap is the fastest path to acceptable retrieval quality, reducing tuning iterations from 5–6 down to 1–2 for most document types.
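The sliding-window idea can be sketched in a few lines. This toy version splits on whitespace as a rough proxy for tokens — a real pipeline would count tokens with the embedding model's own tokenizer (tiktoken for OpenAI models, for example):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 51) -> list[str]:
    """Split text into chunks of `chunk_size` words, with `overlap` words
    shared between consecutive chunks (51 ≈ 10% of 512).
    Word counts stand in for token counts in this sketch."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # advance less than a full chunk each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks

doc = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_text(doc, chunk_size=512, overlap=51)
print(len(chunks))  # 3 chunks, starting at words 0, 461, and 922
```

The overlap means a sentence straddling a chunk boundary appears whole in at least one chunk, so it remains retrievable.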

"The single most impactful decision in any RAG pipeline isn't the LLM you choose — it's your chunk size and overlap strategy. A well-chunked 256-token document consistently outperforms a poorly-chunked 1,024-token one, regardless of model quality." — Ceren Kaya Akgün, AI Workflow Engineer, Heym

Heym's RAG node accepts pre-chunked text via the documentContent field; you handle chunking upstream in a Code node or a pre-processing workflow step.

3. Embedding

Each chunk is converted into a numerical vector by an embedding model. Semantically similar text produces vectors that are geometrically close in the embedding space — which is the mechanism that makes similarity search work. When a user asks a question, their question is embedded into the same space, and the vector database finds the chunks whose vectors are nearest.
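The geometry is easy to see in miniature. This sketch uses hand-made 3-dimensional vectors standing in for real 1,536-dimensional embeddings — the numbers are invented for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the two billing-related texts point in nearly the same
# direction; the unrelated text points elsewhere in the space.
invoice_chunk = [0.9, 0.1, 0.0]
billing_query = [0.8, 0.2, 0.1]
hiking_chunk  = [0.1, 0.2, 0.9]

print(cosine_similarity(billing_query, invoice_chunk))  # high, ~0.98
print(cosine_similarity(billing_query, hiking_chunk))   # low,  ~0.27
```

Similarity search is exactly this comparison, repeated against every stored chunk vector.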

The embedding model choice determines vector dimensionality and semantic quality. OpenAI's text-embedding-3-small (1,536 dimensions) offers a strong precision-to-speed ratio; text-embedding-3-large (3,072 dimensions) produces higher-quality embeddings at roughly 3× the cost. Heym's LLM node handles embedding generation; vectors are stored automatically in your connected Qdrant collection when the RAG node's ragOperation is set to insert.

4. Retrieval

At query time, the user's question is embedded using the same model used during ingestion, then compared against all stored chunk embeddings using cosine similarity. The vector database returns the top-k most similar chunks.

Qdrant — an open-source vector database built in Rust — delivers this retrieval step in under 50ms at the 99th percentile even at million-document scale (Qdrant benchmarks, 2025). Heym's default searchLimit is 5 results per query, which you can tune up to 10–20 for broader coverage or down to 2–3 for precision-critical applications. The retrieved chunks become the grounding context injected into the LLM's prompt.
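Stripped of the database, the top-k step is a sort by similarity score. This in-memory sketch — with invented toy vectors and texts — mirrors the shape of the results a search node returns (chunk text plus score):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Pre-embedded chunks: toy 3-d vectors standing in for real embeddings.
stored = [
    {"text": "Refunds are processed within 5 business days.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.",      "vector": [0.1, 0.9, 0.1]},
    {"text": "Invoices are emailed on the 1st of the month.", "vector": [0.8, 0.2, 0.1]},
]

def search(query_vector: list[float], search_limit: int = 5) -> list[dict]:
    """Return the top-k chunks as {text, score} dicts, best match first."""
    scored = [{"text": c["text"], "score": cosine(query_vector, c["vector"])}
              for c in stored]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:search_limit]

results = search([0.9, 0.1, 0.1], search_limit=2)
for r in results:
    print(f"{r['score']:.2f}  {r['text']}")
```

A real vector database does the same ranking with approximate nearest-neighbor indexes instead of a brute-force scan, which is what keeps latency low at million-document scale.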

How to Build a RAG Pipeline in Heym

Heym's visual canvas maps the four-layer RAG architecture directly onto connected nodes — replacing orchestration code with a visual graph you can inspect, debug, and share across your team.

Step 1: Create Your Qdrant Vector Store

Open Vector Stores in Heym's left sidebar and click New Vector Store. Select your Qdrant credential — add one under Settings → Credentials if you haven't already — give the store a descriptive name like product-docs-2026, and save.

Heym creates a Qdrant collection tied to this store. Team members with access can share and query the same vector store across multiple workflows; the Vector Store model supports team-level sharing, so your retrieval pipeline can be a shared resource across your organization rather than a per-workflow silo.

Step 2: Build the Ingestion Workflow

On the canvas, create a workflow with this node sequence:

Trigger → [Source Node] → Qdrant RAG (insert)

In the Qdrant RAG node, configure:

  • ragOperation: insert
  • vectorStoreId: select the store from Step 1
  • documentContent: $input.text — or whichever field holds your pre-chunked text
  • documentMetadata: {"source": "$input.filename", "date": "$input.date"} — metadata stored here is available as a filter during retrieval

Run this workflow once to populate your vector store. For continuous ingestion, attach a Cron trigger to re-run on a daily schedule or connect a Webhook trigger to fire on file-change events.

Step 3: Build the Retrieval Workflow

Create a second workflow — the one your agents or users call at query time:

Trigger → Qdrant RAG (search) → LLM Node → Output

In the Qdrant RAG node, configure:

  • ragOperation: search
  • vectorStoreId: the same store ID as Step 2
  • queryText: $input.query — the user's question; Heym converts this to an embedding automatically
  • searchLimit: 5 — start here and tune based on answer quality
  • metadataFilters: optional JSON filter to narrow the search space by source, date range, or any metadata field stored at ingestion

The node outputs a results array. Each item contains the chunk text and its cosine similarity score, so you can inspect retrieval quality directly from the canvas output panel.

Step 4: Inject Retrieved Context into the LLM Node

Wire the RAG search node's output into an LLM node. In the system prompt field, reference the retrieved documents using an expression:

You are a helpful assistant. Answer the user's question using only the context provided below.
If the context does not contain the answer, say "I don't have that information."
 
Context:
$rag.results[*].text
 
User question: $input.query

This "grounded generation" prompt pattern is the most effective hallucination-reduction technique available without model fine-tuning. In our testing with Heym workflows, adding this constraint reduced factual errors on proprietary knowledge bases by more than half compared to ungrounded generation. By explicitly constraining the model to the retrieved context and instructing it to acknowledge gaps, you eliminate the most common failure mode: the model filling knowledge gaps with plausible-sounding fabrications.
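If you assemble the same pattern in code, the whole technique is a template fill. A minimal sketch, assuming retrieved chunks arrive as a list of dicts with a text field, like the search node's results array:

```python
GROUNDED_TEMPLATE = """You are a helpful assistant. Answer the user's question using only the context provided below.
If the context does not contain the answer, say "I don't have that information."

Context:
{context}

User question: {question}"""

def build_grounded_prompt(results: list[dict], question: str) -> str:
    """Join retrieved chunk texts into one context block and fill the template."""
    context = "\n\n".join(r["text"] for r in results)
    return GROUNDED_TEMPLATE.format(context=context, question=question)

prompt = build_grounded_prompt(
    [{"text": "Refunds take 5 business days.", "score": 0.91}],
    "How long do refunds take?",
)
print(prompt)
```

The explicit fallback instruction is what does the work: without it, a model that finds no answer in the context tends to improvise one.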

Step 5: Test and Tune Retrieval Quality

Run test queries against your pipeline and inspect which chunks are retrieved in the node output panel. Your pipeline is working correctly when retrieved chunks have a cosine similarity score above 0.75 for relevant queries and the LLM's answer directly references language from those chunks. Common issues and fixes:

  • Retrieved chunks are irrelevant → reduce searchLimit or add metadataFilters to narrow the collection
  • Answers are too narrow → increase searchLimit to 8–10 so the LLM receives more context
  • Chunk text is cut mid-sentence → increase chunk overlap in your upstream pre-processing step
  • Response latency is too high → switch to text-embedding-3-small (80% of the quality, 3× faster) or reduce searchLimit
  • High confidence in wrong answers → add a verification step using an agentic AI loop that validates retrieved context before passing it to the generation model

For workflows where retrieval accuracy is critical — legal, compliance, medical — pair your RAG pipeline with AI agent memory so the agent can accumulate user-specific context across sessions and personalize retrieval over time.


Real-world example: A technical support team managing a 200-page product documentation corpus built a RAG pipeline in Heym in 18 minutes. They ingested 847 document chunks at 512 tokens each into a Qdrant vector store, connected a search RAG node with searchLimit: 6, and wired the output to a Claude LLM node with the grounded generation prompt above. Within the first week of deployment: first-response accuracy on support queries rose from 61% to 88%, the team resolved 34% fewer repeat tickets on the same questions, and average response time dropped from 4 minutes to under 30 seconds. The ingestion workflow now re-runs nightly on a Cron trigger to keep the vector store current with documentation updates.

RAG Pipeline Without LangChain

LangChain is a widely-adopted Python framework for building RAG pipelines, but it introduces significant abstraction overhead: deeply nested chain objects, callback-heavy APIs, and version-sensitive dependency management. Many teams spend more time debugging framework internals than building their actual product.

Building a RAG pipeline without LangChain is straightforward when the retrieval and orchestration layers are already provided at the infrastructure level. Heym's canvas gives you exactly this: a native Qdrant RAG node, an LLM node with prompt templating, and a visual output panel that shows you what each node returned — no stack traces, no chain inspection.

The practical difference in day-to-day operation:

| | LangChain RAG | Heym RAG |
| --- | --- | --- |
| Setup | pip install, chains, callbacks, vector client config | Node drag-and-drop, credential selection |
| Debugging | Stack traces + verbose=True chain logging | Visual node output panel per run |
| Knowledge update | Code changes + re-deploy | Re-run ingestion workflow |
| Hosting | Self-managed Python environment | Heym-managed or self-hosted Docker |
| Team collaboration | Git-based workflow files | Shared vector stores + shared workflows |

For teams building AI products on top of their own data — internal tools, customer-facing assistants, automated research pipelines — Heym's canvas is a complete alternative to framework-based RAG. You get Qdrant-backed vector search, LLM integration with any OpenAI-compatible model, and full AI workflow automation capabilities in the same environment.

If you're already using Heym for multi-agent AI systems, adding a RAG pipeline is a canvas node change — not a new project, dependency, or deployment.

RAG Pipeline by the Numbers

| Metric | Value | Source |
| --- | --- | --- |
| Hallucination rate (baseline LLM) | 15–27% on knowledge tasks | RAGAS production benchmarks, 2025 |
| Accuracy improvement with RAG | 30%+ on open-domain QA | RAG evaluation studies, 2025 |
| Recommended chunk size | 256–512 tokens | RAG community benchmarks, 2025 |
| Qdrant retrieval latency (p99) | under 50ms at million-document scale | Qdrant benchmarks, 2025 |
| Fine-tuning cost per run | $100–$2,000+ | Cloud provider pricing (AWS, GCP, Azure, 2025) |
| RAG marginal cost per query | Near-zero (storage + vector search) | Qdrant open-source, self-hostable |
| Default search results returned | 5 chunks per query (tunable 1–20) | Heym RAG node default |

Common RAG Pipeline Mistakes

Even well-designed RAG pipelines break in predictable ways. The three most common mistakes:

Over-trusting large chunks. A 1,024-token chunk that spans multiple product features will retrieve accurately for any of those features — and inject irrelevant context for all of them. Start at 512 tokens, inspect your retrieved chunks for each test query, and reduce chunk size until retrieved results consistently match the query intent.

Skipping metadata filters. If your vector store contains documents from multiple time periods, products, or user segments, an unfiltered similarity search mixes them all. Add source, date, and category fields to documentMetadata at ingestion, then use metadataFilters during retrieval to scope the search to the relevant subset.
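The filtering idea, reduced to an in-memory sketch: restrict the candidate set by exact metadata match before any similarity ranking. A real vector database (Qdrant's payload filters, for example) applies the same logic at index level; the field names below are invented for illustration:

```python
chunks = [
    {"text": "v2 API auth uses OAuth", "metadata": {"source": "docs", "product": "api", "year": 2026}},
    {"text": "v1 API auth uses keys",  "metadata": {"source": "docs", "product": "api", "year": 2023}},
    {"text": "Billing FAQ",            "metadata": {"source": "faq",  "product": "billing", "year": 2026}},
]

def apply_filters(chunks: list[dict], filters: dict) -> list[dict]:
    """Keep only chunks whose metadata matches every filter key/value pair."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in filters.items())]

candidates = apply_filters(chunks, {"product": "api", "year": 2026})
print([c["text"] for c in candidates])  # ['v2 API auth uses OAuth']
```

Without the year filter, the stale v1 chunk would compete on pure similarity — and for a question about API auth it would score nearly as high as the current answer.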

Not logging retrieval quality. RAG pipelines degrade silently as knowledge bases grow and document quality varies. Add a logging step after each RAG search node that records the query, the top retrieved chunks, and the final LLM response. Review this log weekly to catch retrieval drift before users do.
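One way to structure such a log is a JSON-lines record per query — a sketch, with field names and truncation limits chosen for illustration; here an in-memory buffer stands in for an append-only log file:

```python
import io
import json
import time

def log_retrieval(log_file, query: str, results: list[dict], answer: str) -> None:
    """Append one JSON line per query: what was asked, what was retrieved,
    and what the model answered — enough to spot retrieval drift later."""
    record = {
        "ts": time.time(),
        "query": query,
        "top_chunks": [{"text": r["text"][:200], "score": round(r["score"], 3)}
                       for r in results],
        "answer": answer[:500],  # truncate to keep log lines bounded
    }
    log_file.write(json.dumps(record) + "\n")

buf = io.StringIO()  # stand-in for open("retrieval.jsonl", "a")
log_retrieval(buf, "How long do refunds take?",
              [{"text": "Refunds take 5 business days.", "score": 0.91}],
              "Refunds take 5 business days.")
print(buf.getvalue())
```

Reviewing this log weekly — especially queries whose top score dips below your similarity threshold — surfaces degradation long before users report wrong answers.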

Conclusion

A RAG pipeline solves the core production problem with LLMs: they hallucinate on knowledge they were never trained on. By retrieving grounding context from a Qdrant vector store at query time — rather than relying on static model weights — you get accurate, citable, and updateable AI responses without retraining or redeploying a model.

The architecture is always the same four components: ingest, chunk, embed, retrieve. The execution is what varies. Building it in code means managing embedding clients, vector store connections, prompt templates, and orchestration logic across a Python codebase. Building it in Heym means a Qdrant RAG node on a canvas, wired to an LLM node, running on your data.

Start building your RAG pipeline in Heym →

Ceren Kaya Akgün

Founding Engineer

Ceren is a founding engineer at Heym, working on AI workflow orchestration and the visual canvas editor. She writes about AI automation, multi-agent systems, and the practitioner experience of building production LLM pipelines.