June 9, 2026
Mehmet Burak Akgün
OpenTelemetry for AI Agents: Trace Every Step
Trace AI agents and workflows with OpenTelemetry: spans, GenAI semantic conventions, context propagation, and native OTLP export with no instrumentation code.
TL;DR: OpenTelemetry lets you trace AI agents and workflows with a vendor-neutral standard instead of a proprietary SDK. Each run becomes a root span, each step becomes a child span, and spans carry attributes like model and token usage that follow OpenTelemetry's GenAI semantic conventions. Trace context propagates across webhooks, outbound HTTP calls, and sub-workflows so one request stays one trace. You can instrument your own agent code by hand, or run on a platform like Heym that emits these spans natively and exports them over OTLP to any backend with no instrumentation code.
Table of Contents
- What Is OpenTelemetry for AI Agents?
- Why AI Agents Need OpenTelemetry
- How an Agent Run Looks as a Trace
- What to Capture: Span Attributes for AI Workflows
- Trace Context Propagation Across Steps
- Instrument Your Own Code vs Native Platform Tracing
- How to Trace AI Workflows With OpenTelemetry in Heym
- Choosing an OTLP Backend
- Overhead, Sampling, and PII
- OpenTelemetry vs Built-In Observability
- Common Mistakes
- Key Takeaways
- FAQ
- References
This guide is for engineers who already have an observability stack for their services and now want their AI agents to show up in it. I build the execution engine at Heym, the part that runs every workflow and records what each step did, so the question of how to represent an agent run as a trace is something I work on directly.
The instinct most teams have is correct: do not invent a new telemetry format for agents. You already have one. OpenTelemetry is the open standard the rest of your stack speaks, and in 2025 the project began defining how AI agents should map onto it. The job is to make an agent run look like any other distributed trace, so a wrong tool call shows up next to a slow database query in the same view.
The hard part is not the model call. It is everything around it: the multi-step plan, the tools, the sub-workflows, and the chain of services that triggered the agent in the first place. A trace that captures only the model request misses where agents actually break. This article covers how agent runs map onto OpenTelemetry, which attributes matter, how context propagates across steps, and how to get all of it without writing instrumentation code.
What Is OpenTelemetry for AI Agents?
Definition: OpenTelemetry for AI agents is the use of the vendor-neutral OpenTelemetry standard to trace an agent at run time. Each run becomes a root span, each step inside it becomes a child span with timing, status, and attributes, and the whole trace is exported in a standard format that any compatible backend can read.
OpenTelemetry is the Cloud Native Computing Foundation project for collecting telemetry: the logs, metrics, and traces that describe what a system did. It defines both the data shape and the wire protocol, OTLP, so producers and backends agree without a custom integration per vendor.
For AI agents the valuable signal is the trace. A single agent run is a small distributed system: it touches a model, several tools, sometimes a memory store, and sometimes other agents. A trace stitches those scattered events into one ordered, timed story. OpenTelemetry is now defining GenAI semantic conventions so that story looks the same across frameworks, with shared attribute names for the model, the operation, and token usage (OpenTelemetry, 2025). Standard shapes are what stop you from being locked into one vendor's trace format.
Why AI Agents Need OpenTelemetry
Traditional monitoring assumes a deterministic system, where the same input runs the same code path, so a green dashboard means healthy. Agents break that assumption. A language model picks the path at run time, so the same input can produce a different plan, different tool calls, and a different cost on every run. Uptime and CPU graphs stay green while the actual work fails silently, a problem we cover in depth in our guide to AI agent observability.
OpenTelemetry helps for three specific reasons:
It is vendor neutral. You export once over OTLP and choose the backend later. Swapping Jaeger for Honeycomb does not mean reinstrumenting your agents.
It is built for distributed systems. Agents are distributed by nature. A trace that follows one request across every tool and sub-agent is exactly the model OpenTelemetry was designed around, which is why multi-agent systems are so hard to debug without it.
It unifies your view. Your agents do not run in isolation. They call APIs, databases, and queues that already emit OpenTelemetry spans. Putting agent runs in the same trace store means one query spans the model call and the database call that the model triggered.
Key Principle: The question for an agent is not "is the system up?" but "what did this specific run decide, and where did it go wrong?" A standardized trace is the only artifact that answers the second question across every step and every service the run touched.
How an Agent Run Looks as a Trace
A trace is a tree of spans. The root span covers the whole run, and child spans cover the steps, nested by causality and ordered by time. For an AI workflow the shape is intuitive once you see it.
| Span | Represents | Typical attributes |
|---|---|---|
| Root span | The full workflow or agent run | Workflow id, run status, node count, total duration |
| Node span | One step: a model call, tool, or condition | Node type, label, status, duration, model and token usage |
| Sub-workflow span | A nested workflow invoked by a step | Parented to the calling node span, with its own node children |
| Outbound HTTP span | A call the workflow makes to an external API | URL, method, status code, injected trace context |
In Heym the root span is named heym.workflow.execute and each step is a child span named heym.node.execute. The node span carries the node type and label, its status, and its duration, and for model and agent nodes it adds the model name and the prompt, completion, and total token counts. A failed step sets its span status to error, so it stands out in red in any trace viewer without you reading a single log line.
This nesting is what makes traces readable. Instead of a flat log stream, you get a waterfall: the run on top, every step underneath in order, parallel branches side by side, and sub-workflows as their own subtrees. The slowest or most expensive step is the widest bar, and you find it by looking, not by grepping.
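To make the tree shape concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK and not Heym's implementation: the `Span` class, the attribute keys (`label`, `model`, `total_tokens`), and the timings are all hypothetical, and only the two span names come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A simplified span: a name, timing, a status, attributes, and children."""
    name: str
    start_ms: int
    end_ms: int
    status: str = "ok"  # "ok" or "error"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def render_waterfall(span, depth=0):
    """Return one indented line per span, children ordered by start time."""
    label = span.attributes.get("label")
    title = f"{span.name} ({label})" if label else span.name
    marker = " [error]" if span.status == "error" else ""
    lines = [f"{'  ' * depth}{title} {span.duration_ms}ms{marker}"]
    for child in sorted(span.children, key=lambda s: s.start_ms):
        lines.extend(render_waterfall(child, depth + 1))
    return lines


def slowest_leaf(span):
    """Find the widest bar among leaf spans (steps with no children)."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children), key=lambda s: s.duration_ms)


# A hypothetical three-step run: trigger, model call, failing tool call.
root = Span("heym.workflow.execute", 0, 1900, children=[
    Span("heym.node.execute", 0, 20, attributes={"label": "Webhook trigger"}),
    Span("heym.node.execute", 20, 1400,
         attributes={"label": "Draft reply", "model": "example-model", "total_tokens": 812}),
    Span("heym.node.execute", 1400, 1900, status="error",
         attributes={"label": "Send email"}),
])

print("\n".join(render_waterfall(root)))
print("slowest step:", slowest_leaf(root).attributes["label"])
```

A real trace viewer draws the same structure as horizontal bars, but the two questions it answers are the ones this sketch answers: which step failed, and which step took the longest.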
What to Capture: Span Attributes for AI Workflows
Attributes are the key-value pairs that make a span useful. OpenTelemetry's GenAI semantic conventions define shared names for the AI-specific ones so traces are portable across tools, including gen_ai.system (superseded by gen_ai.provider.name in newer releases of the conventions), gen_ai.request.model, and gen_ai.usage.input_tokens (OpenTelemetry GenAI semantic conventions, 2026). You do not need dozens of attributes. You need the few that answer real questions.
| Attribute | What it tells you | Why it matters |
|---|---|---|
| Model name | Which model the step called | Routing and cost attribution start here |
| Prompt and completion tokens | The volume of model work, split by direction | The raw driver of every model bill |
| Total tokens | The combined token count for the step | Maps directly to spend per model call |
| Node type and label | Which step ran and what it is | Where a multi-step task actually broke |
| Status | Whether the step succeeded or failed | The first signal of a broken integration |
| Duration | How long the step took | Finds the bottleneck in a slow run |
Notable fact: Token usage is the one observability attribute that maps straight to money, because model providers bill by the token. Recording prompt and completion tokens per span, attributed to the model, is what turns a trace into a cost breakdown. Seeing that breakdown is the first step toward AI agent cost optimization, where you right-size the model on each step and route cheap work to cheap models.
A deliberate choice sits inside this table: it captures metadata, not content. Model name and token counts are safe to record on every run. The raw prompt and the raw response are a different matter, which the overhead and PII section covers.
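As a rough illustration of how token attributes become a cost breakdown, the sketch below sums cost per model over the spans of one run. The model names, the per-token prices, and the dictionary keys are all hypothetical placeholders; substitute your provider's real rates and whatever attribute names your spans actually carry.

```python
from collections import defaultdict

# Hypothetical prices in USD per one million tokens.
PRICES_PER_MILLION = {
    "small-model": {"prompt": 0.50, "completion": 1.50},
    "large-model": {"prompt": 5.00, "completion": 15.00},
}


def cost_by_model(spans):
    """Sum token cost per model across the spans of one run."""
    totals = defaultdict(float)
    for span in spans:
        model = span.get("model")
        if model is None:
            continue  # tool and condition steps carry no token usage
        price = PRICES_PER_MILLION[model]
        totals[model] += (
            span["prompt_tokens"] * price["prompt"]
            + span["completion_tokens"] * price["completion"]
        ) / 1_000_000
    return dict(totals)


# Span metadata only: no prompt or response text is needed for this.
run_spans = [
    {"label": "Classify ticket", "model": "small-model",
     "prompt_tokens": 2_000, "completion_tokens": 100},
    {"label": "Look up order"},  # a tool call: no model, no tokens
    {"label": "Draft reply", "model": "large-model",
     "prompt_tokens": 4_000, "completion_tokens": 1_000},
]

breakdown = cost_by_model(run_spans)
for model, usd in sorted(breakdown.items()):
    print(f"{model}: ${usd:.5f}")
```

Note that the calculation uses nothing but metadata, which is why the metadata-only default costs you nothing in cost visibility.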
Trace Context Propagation Across Steps
A trace is only as good as its connections. If each step starts a fresh trace, you get a pile of disconnected spans instead of one story. The fix is context propagation, the passing of a trace identifier from one unit of work to the next, and the standard for it is W3C Trace Context, carried in a traceparent header (W3C, 2025).
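The traceparent header is small enough to handle by hand, which helps when you are debugging a broken trace. The sketch below parses and builds version 00 headers as laid out in the W3C recommendation (version, 32 hex character trace id, 16 hex character parent span id, 2 hex character flags). The function names are my own, and the sample value is the one used as an example in the recommendation itself.

```python
import re
import secrets

# Version 00 only: 00-<trace id>-<parent span id>-<flags>, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero ids are defined as invalid
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def new_span_id():
    """8 random bytes rendered as 16 hex characters."""
    return secrets.token_hex(8)


def build_traceparent(trace_id, span_id, sampled):
    """Build the header to inject into an outbound request."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id, parent_span_id, sampled = parse_traceparent(incoming)

# Continue the same trace: keep the trace id, mint a new span id for this hop.
outbound = build_traceparent(trace_id, new_span_id(), sampled)
print(outbound)
```

The trace id stays constant across every hop while the span id changes, and that single shared id is what lets a backend join the pieces into one trace.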
For AI workflows propagation matters at three boundaries:
Inbound triggers. When a webhook calls your workflow and carries a traceparent header, the workflow run should attach to that upstream trace rather than starting a new one. Now the agent run is visible as a child of the service that triggered it, and you can follow a request from the API gateway all the way into the agent's reasoning.
Outbound calls. When a step calls an external API, it should inject traceparent into that request, so the downstream service continues the same trace. The model call and the API call it triggered live in one timeline.
Sub-workflows. When a step invokes a nested workflow, the sub-workflow's root span should parent to the calling step, preserving the call hierarchy. In Heym this happens automatically, and node spans nest correctly even when steps run in parallel worker threads, because the run's trace context is carried into each thread.
Key Principle: Propagation is what turns scattered spans into one trace. Without it you can see that a model call was slow but not that it was slow because of the API it called, which lived in a separate, disconnected trace.
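The parallel-thread case is the one that most often breaks parenting in hand-written code, because worker threads generally do not see the caller's context variables unless you pass them along. Here is a minimal sketch of the idea using only the standard library. It illustrates the technique, not Heym's internals: `current_span`, `run_node`, and `run_workflow` are hypothetical names, and a string stands in for a real span object.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Holds the name of the span that is "current" for the running code.
current_span = contextvars.ContextVar("current_span", default=None)
recorded = []  # (node_name, parent_span_name) pairs


def run_node(name):
    """Record a node span whose parent is whatever span is current in this context."""
    recorded.append((name, current_span.get()))
    return name


def run_workflow(node_names):
    """Run nodes in parallel threads while keeping them parented to the run span."""
    token = current_span.set("heym.workflow.execute")
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = []
            for name in node_names:
                # Snapshot the caller's context and run the node inside that
                # copy, so the worker thread sees the same current span.
                ctx = contextvars.copy_context()
                futures.append(pool.submit(ctx.run, run_node, name))
            return [f.result() for f in futures]
    finally:
        current_span.reset(token)


run_workflow(["classify", "search", "summarize"])
print(sorted(recorded))
```

Each task gets its own copy of the context, so parallel branches cannot overwrite each other's current span, and every node still reports the run span as its parent.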
Instrument Your Own Code vs Native Platform Tracing
There are two ways to get OpenTelemetry traces out of an AI agent, and the right one depends on where your agents live.
Instrument your own code. If your agents are bespoke Python or TypeScript, you add the OpenTelemetry SDK, create spans around each model and tool call, set the GenAI attributes by hand, and wire up context propagation across every boundary. This is flexible and it is the right path when your logic is custom. It is also ongoing work: every new tool, every refactor, and every framework upgrade is a chance for the instrumentation to drift from the code.
Native platform tracing. If your agents run on a workflow platform, the platform already knows the shape of every run, so it can emit the spans for you. There is no SDK to install and no per-step instrumentation to maintain. You set an endpoint and the traces appear. This is how observability should work for AI workflow automation: the system that runs the work also describes it.
The trade-off is control versus maintenance. Hand instrumentation gives you total control and total responsibility. Native tracing gives you correct spans for free, at the cost of the platform deciding the span shape. For most teams running agents as workflows, the second path is the one that actually stays accurate over time, because the spans come from the same engine that executes the steps, so there is no separate instrumentation layer to drift out of date.
How to Trace AI Workflows With OpenTelemetry in Heym
Heym takes the native path. The execution engine that runs your workflows also emits the spans, so tracing is a configuration step, not a coding project. It is disabled by default, so there is no overhead until you opt in.
- Turn tracing on. Set HEYM_OTEL_ENABLED=true and HEYM_OTEL_EXPORTER_OTLP_ENDPOINT to your collector, for example http://collector:4318. If your backend needs auth, pass it through HEYM_OTEL_EXPORTER_OTLP_HEADERS as a key=value pair, read from the environment and never stored in the database.
- Point it at a backend. The endpoint is any OTLP/HTTP target: a local OpenTelemetry Collector, Jaeger, Grafana Tempo, Honeycomb, or Datadog. Heym posts spans to the /v1/traces path of that endpoint.
- Restart and run a workflow. A heym.workflow.execute root span is created for the run, with a heym.node.execute child span per node. Model and agent nodes carry model and token attributes, and failed steps are marked with an error status.
- Confirm in Settings then Observability. Open the gear icon in the header and select the Observability tab. It shows a read-only summary: whether tracing is on, the endpoint, the service name, the sampler ratio, and which spans are emitted. Secrets are never displayed.
- Inspect the trace. Open your backend and find the run as a single trace, with node spans nested in execution order, parallel branches included, and sub-workflows as subtrees.
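If you want to sanity-check the settings before restarting, a small preflight script can catch a malformed endpoint or header early. The sketch below is my own helper, not part of Heym. It assumes that only the literal value true enables tracing and that multiple headers are comma-separated the way the standard OTEL_EXPORTER_OTLP_HEADERS variable is; check the Heym documentation for the exact parsing rules.

```python
from urllib.parse import urlparse


def load_tracing_config(env):
    """Read the three tracing settings from an environment-style mapping."""
    enabled = env.get("HEYM_OTEL_ENABLED", "false").strip().lower() == "true"
    endpoint = env.get("HEYM_OTEL_EXPORTER_OTLP_ENDPOINT", "").strip().rstrip("/")

    headers = {}
    # Assumption: several headers are comma-separated key=value pairs.
    raw_headers = env.get("HEYM_OTEL_EXPORTER_OTLP_HEADERS", "")
    for pair in filter(None, raw_headers.split(",")):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            # Do not echo the pair itself: it may contain a secret.
            raise ValueError("malformed header entry, expected key=value")
        headers[key.strip()] = value.strip()

    if enabled:
        parsed = urlparse(endpoint)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError("endpoint must be an http(s) URL when tracing is enabled")

    return {
        "enabled": enabled,
        # Spans are posted to the /v1/traces path of the endpoint.
        "traces_url": f"{endpoint}/v1/traces" if endpoint else None,
        "headers": headers,
    }


example_env = {
    "HEYM_OTEL_ENABLED": "true",
    "HEYM_OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318",
    "HEYM_OTEL_EXPORTER_OTLP_HEADERS": "x-api-key=example-secret",
}
config = load_tracing_config(example_env)
print(config["enabled"], config["traces_url"], sorted(config["headers"]))
```

In a real check you would pass `os.environ` instead of the example mapping, and print only the header names, never their values.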
This pairs with Heym's built-in traces and cost view, so you can get a fast answer inside the platform and a unified, vendor-neutral trace in your own backend at the same time.
Choosing an OTLP Backend
Because Heym exports standard OTLP, the backend is an operational choice, not a compatibility one. The common options split into self-hosted and managed.
| Backend | Model | Good fit when |
|---|---|---|
| Jaeger | Open source, self-hosted | You want a simple, free trace store you run yourself |
| Grafana Tempo | Open source, self-hosted | You already run Grafana and want traces beside metrics and logs |
| Honeycomb | Managed | You want fast, high-cardinality querying without running infrastructure |
| Datadog or New Relic | Managed | Your team already lives in one of these for the rest of the stack |
| OpenTelemetry Collector | Pipeline | You want to fan out to several backends or process spans in transit |
A practical pattern is to send everything to an OpenTelemetry Collector first, then let the Collector route to one or more backends. That keeps the workflow configuration pointed at a single stable endpoint while you change backends behind it.
Overhead, Sampling, and PII
Three concerns come up every time a team turns on tracing. Each has a clean answer.
Overhead. Creating a span is a cheap in-memory operation, and a good exporter batches spans and ships them in the background, so a slow or unreachable collector never blocks a run. The dominant cost in an agent run is the network round trip to the model, measured in hundreds of milliseconds. The microseconds spent recording a span do not move that total.
Sampling. At high volume you do not need every run. Head sampling records a fixed fraction, for example one run in ten, decided at the start of the trace and applied consistently so a sampled trace is complete rather than half-recorded. Heym exposes this as a single ratio, so you keep full traces for a representative sample instead of turning tracing off to save space.
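One common way to get a consistent head-sampling decision is to derive it from the trace id itself, so every service that sees the same trace id reaches the same answer. The sketch below shows that idea in a form similar in spirit to OpenTelemetry's trace-id ratio sampler. It is an illustration, not the SDK's or Heym's exact algorithm, and `should_sample` is a hypothetical name.

```python
import random


def should_sample(trace_id_hex, ratio):
    """Head-sampling decision derived from the trace id alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace id as a uniform random number
    # and keep the trace if it falls below the ratio's share of that range.
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < int(ratio * 2**64)


# Simulate 10,000 runs with random trace ids and a one-in-ten ratio.
rng = random.Random(0)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace id, it is made once at the start and then holds for every span in the run, which is what keeps a sampled trace complete.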
PII. Prompts and responses often carry user data and secrets, and a trace backend is not the place to store that by default. The safe default is to record metadata (the model, the tokens, the latency, and the status) on every span, and to gate raw prompt and response payloads behind an explicit opt-in that is off in production. You keep the debugging value of full traces in development without leaking sensitive content into your observability pipeline.
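In hand-instrumented code, the metadata-by-default rule can be as simple as one function that builds span attributes and only includes content when a flag says so. This is a minimal sketch with hypothetical attribute keys and a hypothetical `capture_content` flag; in practice the flag would come from configuration that is off in production.

```python
def build_span_attributes(model, usage, latency_ms, status,
                          prompt=None, completion=None, capture_content=False):
    """Return span attributes: metadata always, raw content only on opt-in."""
    attributes = {
        "model": model,
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["prompt_tokens"] + usage["completion_tokens"],
        "latency_ms": latency_ms,
        "status": status,
    }
    if capture_content:
        # Explicit opt-in only: these values may contain user data or secrets.
        attributes["prompt"] = prompt
        attributes["completion"] = completion
    return attributes


usage = {"prompt_tokens": 412, "completion_tokens": 96}

production = build_span_attributes(
    "example-model", usage, 840, "ok",
    prompt="My card number is ...", completion="Thanks, I have updated it.",
)
development = build_span_attributes(
    "example-model", usage, 840, "ok",
    prompt="My card number is ...", completion="Thanks, I have updated it.",
    capture_content=True,
)
print(sorted(production))
print(sorted(development))
```

Putting the gate in one place means a new step cannot accidentally log content just because someone forgot the rule.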
OpenTelemetry vs Built-In Observability
These are not competitors. They answer different questions and they work best together.
| Aspect | Built-in observability | OpenTelemetry export |
|---|---|---|
| Setup | None, it is already on | Set an endpoint and opt in |
| Best for | Fast answers on cost, latency, and errors | A unified trace store across all services |
| Data location | Inside the platform | A backend you operate |
| Vendor lock-in | Tied to the platform's view | Open standard, swap backends freely |
| Strongest at | Token cost attribution per model | Cross-service distributed tracing |
Built-in observability is the quickest way to see what an agent cost and where it failed, with nothing to configure. OpenTelemetry export is how you put that run in the same backend as the rest of your infrastructure. The combination, a built-in view for speed and OTLP for reach, is what we recommend, and it is why Heym ships both rather than forcing a choice.
Common Mistakes
- Tracing only the model call. The model is rarely where agents break. Trace the whole run, including tools, conditions, and sub-workflows, or you will miss the actual failure.
- Letting each step start its own trace. Without context propagation you get disconnected spans. Make sure inbound, outbound, and sub-workflow boundaries carry the trace id.
- Logging full prompts in production. Convenient in development, a data-leak risk in production. Default to metadata and gate raw content behind an opt-in.
- Picking a backend before exporting. With OTLP you do not have to. Export first, choose the backend on operational fit, and keep the option to change it.
- Treating tracing as a replacement for evaluation. Tracing tells you what happened, not whether the output was correct. Pair it with AI agent evaluation to catch silent quality regressions.
Key Takeaways
- OpenTelemetry is the vendor-neutral way to trace AI agents, so you reuse the standard the rest of your stack already speaks.
- An agent run maps to a trace: a root span for the run and child spans for each step, with model and token attributes following the GenAI semantic conventions.
- Context propagation across webhooks, outbound HTTP, and sub-workflows is what keeps one request as one trace.
- You can hand-instrument your own code or run on a platform that emits the spans natively. Native tracing stays accurate because the engine that runs each step also emits its span.
- Overhead is negligible, sampling handles volume, and metadata-by-default handles PII.
- Built-in observability and OTLP export are complementary, and the strongest setup uses both.
FAQ
What is OpenTelemetry for AI agents?
OpenTelemetry for AI agents is the use of the vendor-neutral OpenTelemetry standard to trace what an agent does at run time. Each agent or workflow run becomes a root span, and every step inside it, such as a model call, a tool call, or a sub-workflow, becomes a child span with timing, status, and attributes like the model name and token usage. Because the format is standardized through OpenTelemetry's GenAI semantic conventions, you can send those traces to any compatible backend such as Jaeger, Grafana Tempo, Honeycomb, or Datadog instead of being locked into one vendor's SDK.
What is the overhead of OpenTelemetry tracing for AI agents?
The overhead is small and usually irrelevant next to the cost of the model calls themselves. Span creation is a cheap in-memory operation, and a well-built exporter batches spans and ships them in the background so a slow or unreachable collector never blocks a run. The dominant cost in an agent run is the network round trip to the model, which is measured in hundreds of milliseconds, so the microseconds spent recording a span do not move the total. If you are worried about volume at scale, use head sampling to record a fraction of runs rather than turning tracing off.
Which observability backend works best for AI agents?
Any backend that speaks OTLP works, which is the point of using OpenTelemetry. Jaeger and Grafana Tempo are common open-source, self-hosted choices, while Honeycomb, Datadog, New Relic, and Grafana Cloud are managed options. Pick on operational fit, not on whether it can read agent traces, because a standard OTLP exporter decouples the data from the tool. Self-hosted teams that want to keep trace data on their own infrastructure usually run Jaeger or Tempo.
Should you log full LLM prompts and responses in traces?
Be careful. Prompts and responses often contain user data, secrets, or PII, and a trace backend is not the place to store that by default. Capture metadata that is safe and useful, such as model name, token counts, latency, and status, on every span, and gate raw prompt and response payloads behind an explicit opt-in that is off by default. That way you get the debugging value of full traces in development without leaking sensitive content into your observability pipeline in production.
What is the difference between OpenTelemetry and built-in AI agent observability?
OpenTelemetry is an open standard for exporting traces to an external system you operate. Built-in observability is a view inside the platform that ran the agent, such as a traces and cost dashboard. They are complementary. Built-in observability is the fastest way to see token cost, latency, and errors without setting anything up, while OpenTelemetry export lets you stitch agent traces into the same backend as the rest of your services. A good platform offers both: a built-in view for quick answers and OTLP export for a unified, vendor-neutral trace store.
References
- OpenTelemetry, 2025. AI Agent Observability: Evolving Standards and Best Practices
- OpenTelemetry, 2026. GenAI semantic conventions
- W3C, 2025. Trace Context recommendation
- Heym, 2026. OpenTelemetry Tracing documentation

Founding Engineer
Burak is a founding engineer at Heym, focused on backend infrastructure, the execution engine, and self-hosted deployment. He builds the systems that make Heym's AI workflows run reliably in production.