June 19, 2026Mehmet Burak Akgün
LLM Guardrails: The 4 Layers Every AI Agent Needs
What LLM guardrails are, the four layers every AI agent needs (input, output, action, human review), and how to build each one visually in Heym. No code. →
An AI chatbot that says something wrong is an awkward message. An AI agent that does something wrong is a wrong action: an email sent, a record updated, a refund issued, a tool called with the wrong arguments. The moment you give a language model the ability to act, the cost of an unchecked output stops being reputational and starts being operational.
That gap is what LLM guardrails close. This guide explains what guardrails actually are, why agents need them far more than chatbots do, and how to build each layer as a concrete step in a workflow rather than a vague best practice. Every pattern below maps to a node or setting you can wire up in Heym without writing orchestration code.
What are LLM guardrails?
LLM guardrails are the validation and control layers that sit around a language model to keep its inputs and outputs inside safe, accurate, and on-policy boundaries. A guardrail is a separate check, independent of the model, that can block, rewrite, or escalate a response no matter what the model generated. There are four layers: input guardrails, output guardrails, action guardrails, and human-in-the-loop guardrails.
The key word is separate. Prompting a model to "be safe and never reveal private data" is not a guardrail, because the same model that you are trusting to follow the instruction is the one under attack. A real guardrail is enforced outside the model's own judgment: a classifier that inspects the text, a schema that the output must match, a permission boundary on the tools, a human who has to click approve.
Guardrails operate in four places, and a production agent usually needs all four.
Why AI agents need guardrails more than chatbots
A chatbot and an agent fail in very different ways. A chatbot produces text and stops. The worst case is a bad sentence. An agent reasons in a loop, calls tools, and feeds each step's output into the next, so the worst case is a bad sequence of real actions.
Three properties of agents make guardrails non-optional:
- They act. Tool calls touch databases, payment systems, email, and external APIs. An unvalidated output is no longer just shown to a user, it is executed.
- They chain. In a multi-agent or multi-step system, an early error propagates. A bad query rewrite or a hallucinated value flows downstream and compounds.
- They read untrusted content. Agents browse pages, read emails, and parse documents. Any of that text can carry an injected instruction, which turns "summarize this email" into an attack surface.
This is why guardrails for AI agents have to cover more than content moderation. They have to bound what the agent is allowed to do, not only what it is allowed to say. The NIST AI Risk Management Framework's Generative AI Profile (2024) makes the same point at the policy level: safe deployment depends on controls that govern actions and access, not output filtering alone.
The guardrail decision table
Before the implementation details, here is the map. Each common failure mode has a matching guardrail layer and a concrete way to build it. This is the part most guides leave out: they tell you the categories exist but not where each one goes.
| Failure mode | Guardrail layer | How to build it in Heym |
|---|---|---|
| Prompt injection from user or fetched content | Input | Built-in Guardrails (Prompt Injection category) on the LLM/Agent node |
| Leaked or solicited personal data (PII) | Input + Output | Built-in Guardrails (Personal Data category) |
| Toxic, hateful, or NSFW content | Input + Output | Built-in Guardrails (11 content categories, severity-tuned) |
| Malformed or unparseable response | Output | JSON schema output + Condition node validation |
| Answer not grounded in source data | Output | RAG context + Condition grader |
| Agent calls the wrong tool or too many steps | Action | Least-privilege tool list + capped Loop iterations |
| Irreversible or high-value action | Human | Human-in-the-loop approval gate |
| Silent guardrail drift over time | Observability | OpenTelemetry tracing + evals |
The rest of this guide walks each layer in order.
Layer 1: Input guardrails
Input guardrails inspect the request before the model ever runs, while output guardrails inspect the response after it. Input is your first and cheapest line of defense, because blocking a bad request costs nothing downstream.
In Heym, every LLM and Agent node ships with built-in Guardrails you toggle on directly in the node config. They classify the incoming text against eleven categories and block anything that violates the ones you select:
- Violence, hate speech, sexual content, NSFW and profanity, self-harm, harassment
- Illegal activity, political extremism, spam and phishing
- Personal data requests (PII)
- Prompt injection
Each category has a severity threshold (low, medium, high) so you decide how aggressive the filter is. When a message violates a selected category, the node raises a violation and stops. It is never silently retried, because retrying a blocked prompt is just asking the same dangerous question twice.
The two categories that matter most for agents are prompt injection and personal data. Turn those on for any agent that reads user input or external content, then add the content-safety categories your domain requires.
Prompt injection is ranked LLM01, the single highest risk in the OWASP Top 10 for LLM Applications (2025). If you only enable one input guardrail, make it this one.
For structured validation that goes beyond content classification, such as "the input must contain a valid order ID," add a Condition node right after your trigger. It branches the workflow on any rule you can express, rejecting malformed requests before they reach the model and burn tokens.
Layer 2: Output guardrails
A clean request can still produce a bad response. Output guardrails inspect what the model generated before it reaches the user or, more importantly, before it triggers an action.
Enforce structure with a schema. The most common production failure is not toxicity, it is a response your code cannot parse. Switch the node to structured output and define a JSON schema. The model must then return valid, typed fields, which removes an entire class of "the agent returned prose where I expected JSON" bugs. Heym validates the structured output before passing it on.
Validate the content of the structure. Schema-valid is not the same as business-valid. Wire the model output into a Condition node that checks your rules: the discount is within policy, the status is one of the allowed values, the confidence score clears a threshold. The true handle continues; the false handle routes to a fallback or a human.
Check grounding. For anything fact-based, pair the model with a RAG pipeline so it answers from retrieved context, then grade whether the answer is actually supported before you trust it. Re-running the content guardrails on the output catches a model that was talked into producing unsafe text despite a clean prompt.
Layer 3: Action guardrails
This is the layer that separates agent safety from chatbot safety, and the one most generic guides skip entirely. Action guardrails bound what the agent is permitted to do.
- Least-privilege tools. Give the Agent node only the tools it needs for the task. An agent that never has a "delete" or "send payment" tool attached cannot be tricked into using one. Tool scope is a guardrail.
- Iteration caps. An agent loop can run away, retrying and re-reasoning until it burns a fortune in tokens. Cap the maximum iterations on the Loop node so a vague task cannot spin forever. This is both a cost control and a safety control.
- Branch gating. Route risky paths through a Switch or Condition node so unsupported or out-of-policy actions are blocked structurally, not left to the model's discretion. The agentic design pattern here is to make the dangerous path unreachable rather than hoping the model avoids it.
Designing these limits up front is part of basic agent architecture, not an afterthought you bolt on after an incident.
Layer 4: Human-in-the-loop guardrails
Some decisions should never be fully automated. For irreversible, high-value, or legally sensitive actions, the right guardrail is a person.
Heym has a native human-in-the-loop step that pauses the workflow and waits for a human to approve, edit, or reject before it continues. The run holds at that point (with a time-to-live so it does not wait forever), sends a notification, and resumes only on a real decision. A refund above a threshold, an outbound email to a customer, a database write to production: route these through an approval gate and the agent proposes while a human disposes.
Human-in-the-loop is the guardrail the model can never argue its way past, which is exactly why it belongs on your highest-stakes paths. It is also the layer the competing guides most often forget.
Production tradeoffs: latency, false positives, and cost
Guardrails are not free, and over-guarding is its own failure mode. A wall of strict filters that blocks legitimate users gets switched off by frustrated teams, which is worse than no guardrail at all. Tune them.
- Latency. A guardrail that calls a classifier adds one round trip. Reserve the heaviest checks for high-risk paths, and use cheap rule-based Condition checks where a full classification is overkill.
- False positives. Run any new guardrail in shadow mode first: log what it would have blocked without actually blocking, review the would-block rate for a week or two, then turn on enforcement once you trust it. This is the single most effective way to avoid blocking real users.
- Severity tuning. Use the severity thresholds to block high-confidence violations while only logging borderline ones, instead of treating every category as an all-or-nothing switch.
- Cost. Iteration caps and input guardrails that reject bad requests early are direct cost controls. The cheapest tokens are the ones you never spend on a request you were going to block anyway.
Observability and evals: guardrails you can see and test
A guardrail you cannot observe is a guardrail you cannot trust. Two capabilities turn a static filter into a system you can actually run in production.
Tracing. Enable OpenTelemetry tracing so every guardrail decision, tool call, and model step is recorded. When a user complains that a legitimate request was blocked, or you want to know how often the injection filter fires, the trace is where you find out. Heym records guardrail checks as part of the execution trace.
Evaluation. Guardrails drift. As prompts, models, and your knowledge base change, a threshold that was right last month starts blocking too much or too little. Run an AI agent evaluation workflow as a regression test: a fixed set of inputs, including known-bad ones, scored automatically so you catch a guardrail regression before your users do.
Together these answer the question every team eventually asks: not "do we have guardrails," but "are our guardrails still working."
How to choose your guardrail approach
The Reddit threads on this topic almost all collapse into one question: which guardrail tool should I use? The honest answer is that the tool matters less than the coverage. Before comparing vendors, decide:
- Do you have all four layers? A tool that does excellent input moderation but leaves your agent with unrestricted tools and no human gate is not a complete answer.
- Can you tune severity and run shadow mode? Without these, you will either over-block and get switched off, or under-block and ship incidents.
- Is it observable and testable? If you cannot trace a decision and write a regression eval, the guardrail is a black box.
- Does it live where the logic lives? Guardrails bolted on as a separate proxy drift out of sync with the workflow. Guardrails that are part of the same canvas as the agent stay aligned with it.
Heym takes the last approach: the guardrail and the agent are the same workflow, so the check never falls out of step with the logic it protects.
"The teams that get burned are not the ones without a content filter. They are the ones who added input moderation, called it done, and left the agent with a live database tool and no approval gate. Safety is the four layers together, not the one that was easiest to turn on." — Mehmet Burak Akgün, Founding Engineer, Heym
Frequently Asked Questions
What are LLM guardrails? LLM guardrails are the validation and control layers that sit around a language model to keep its inputs and outputs inside safe, accurate, on-policy boundaries. They run in four places: on the input before the model sees it (blocking prompt injection, PII, and abuse), on the output before the user sees it (checking format, safety, and grounding), on the actions an agent is allowed to take (which tools, how many steps), and at human review gates for high-stakes decisions. A guardrail is not the model being polite. It is a separate check that can block, rewrite, or escalate a response regardless of what the model produced.
What is the difference between input and output guardrails? Input guardrails inspect the user message or upstream data before it reaches the model and block or sanitize anything dangerous: prompt injection attempts, personal data, or prohibited topics. Output guardrails inspect what the model produced before it reaches the user or a downstream tool, checking that the response is valid (correct JSON schema), safe (no toxic or off-policy content), and grounded (supported by retrieved context). You need both. Input guardrails stop bad requests; output guardrails stop bad responses, including ones a clean request can still trigger.
Do AI agents need stronger guardrails than chatbots? Yes. A chatbot only produces text, so a bad output is an embarrassing message. An agent takes actions: it calls tools, writes to databases, sends emails, and chains multiple model calls where one wrong step feeds the next. That widens the blast radius from a bad sentence to a bad transaction. Agents also run with less human supervision, so the guardrails have to enforce action scope (which tools, how many iterations) and human-in-the-loop approval, not just content filtering.
How do you prevent prompt injection in an LLM application? Prompt injection is the top LLM risk in the OWASP Top 10 for LLM Applications (2025), listed as LLM01. There is no single fix, so you layer defenses: run an input guardrail that classifies and blocks injection attempts before the model sees them, keep tool permissions least-privilege so a hijacked agent cannot do much, validate outputs before they trigger any action, and require human approval for irreversible steps. Treat all external content the agent reads (web pages, emails, documents) as untrusted input, not as instructions.
Do guardrails add latency and false positives? They can, which is why you tune them rather than enabling everything at maximum strictness. A guardrail that calls a classifier model adds one extra round trip, so reserve the heaviest checks for high-risk paths and use cheap rule-based checks elsewhere. To control false positives, run a new guardrail in shadow mode first (log what it would have blocked without blocking it), review the would-block rate, then raise enforcement once the rate is acceptable. Severity thresholds let you block high-confidence violations while only logging borderline ones.
Can you add guardrails to an LLM without writing code? Yes. A guardrail is a check placed at a point in the flow, which maps directly onto a visual canvas. In Heym you toggle the built-in Guardrails on an LLM or Agent node to filter eleven content categories including prompt injection and PII, add a Condition node to validate output, enable JSON schema output to enforce structure, and drop a human-in-the-loop step to pause for approval. Each guardrail is a node setting or a node on the canvas, not orchestration code you have to maintain.
Conclusion
LLM guardrails are not one filter you switch on. They are four checkpoints around the model: the input that gets in, the output that gets out, the actions the agent can take, and the moments a human has to decide. Content moderation alone is the layer that is easiest to add and the one that leaves you most exposed, because it does nothing about an agent with a live tool and no approval gate.
Build all four. Validate the request, constrain the agent, check the response, and escalate the high-stakes calls to a person. Then make the whole thing observable and testable so you can prove the guardrails still work as your prompts and models change. In Heym every one of those layers is a node setting or a node on the canvas, sitting in the same workflow as the agent it protects, so the guardrail never drifts out of step with the logic.

Founding Engineer
Burak is a founding engineer at Heym, focused on backend infrastructure, the execution engine, and self-hosted deployment. He builds the systems that make Heym's AI workflows run reliably in production.
Enjoyed this post? Get the next one in your inbox.
A monthly note with practical ideas for building AI workflows that hold up in production. No noise, and you can unsubscribe anytime.