Featured · Integration · Multi-Agent · Evaluation · Ejentum · Cross-Lab · Self-Eval · MCP

Blind Eval Trio

Three cross-lab agents evaluate any decision blind: steelman defends, stress_test attacks, gap_finder finds what's missing. No synthesizer — you integrate.


5 nodes · Free & source-available


Pre-commitment self-evaluation for agent runtimes. Three cross-lab agents (OpenAI, Anthropic, Zhipu) give independent evaluations of any plan or method — without seeing each other's output.

Why this works

Models cannot reliably self-evaluate. Asking the same model to critique its own plan reproduces the original blind spots. The structural fix is cross-lab blind evaluation: three different model labs (different RLHF priors, different training distributions) playing structured adversarial roles, returning three independent perspectives that the calling agent integrates.

Architecture

chatInput
   │
   ├── steelmanAgent   (OpenAI gpt-5-nano   + harness_reasoning)
   ├── stresstestAgent (Anthropic claude-4  + harness_anti_deception)
   └── gapfinderAgent  (Zhipu GLM-4.7      + harness_memory)
   │
   ▼
setFields → { steelman, stress_test, gap_finder, usage_note }

Three agents run in parallel. Each is locked to one role and one Ejentum cognitive harness. No synthesizer agent — the three evaluations are returned raw. The integration tension between voices IS the value.
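The fan-out/merge shape above can be sketched in plain Python. The three evaluators are stubbed here; in the real canvas each is a model call to a different lab through its provider's API, and `setFields` does the final merge:

```python
from concurrent.futures import ThreadPoolExecutor

# Stubs -- in the real workflow each is a cross-lab model call
# (OpenAI, Anthropic, Zhipu). None ever sees the others' output.
def steelman(plan):    return f"Strongest case FOR: {plan}"
def stress_test(plan): return f"Where it BREAKS: {plan}"
def gap_finder(plan):  return f"What's MISSING from: {plan}"

def blind_eval_trio(plan: str) -> dict:
    # All three receive only the original input text -- blind evaluation.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "steelman": pool.submit(steelman, plan),
            "stress_test": pool.submit(stress_test, plan),
            "gap_finder": pool.submit(gap_finder, plan),
        }
        out = {name: fut.result() for name, fut in futures.items()}
    # setFields step: return the raw evaluations, no synthesizer pass.
    out["usage_note"] = ("Three independent evaluations, no synthesis. "
                         "Integrate into your decision; do not score-and-aggregate.")
    return out
```

The key structural property is that the merge happens only after all three calls complete, so no evaluator can condition on another's verdict.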

Roles

  • steelmanAgent — builds the strongest case FOR the submitted method. Pure advocacy, zero smuggled critique.
  • stresstestAgent — finds where the method BREAKS. Failure modes with severity tags, concrete breaking scenarios. Loaded with the Chaos Engineering skill.
  • gapfinderAgent — finds what's MISSING: steps, alternatives, and names three deeper implicit assumptions.
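The role locks above amount to one fixed system prompt per agent. A minimal sketch follows; the prompt wording here is illustrative only and does not reproduce the actual Ejentum harness prompts:

```python
# Illustrative role prompts -- the real harness prompts differ.
ROLE_PROMPTS = {
    "steelmanAgent": (
        "Build the strongest possible case FOR the submitted method. "
        "Pure advocacy: no criticism, no hedging, no smuggled caveats."
    ),
    "stresstestAgent": (
        "Find where the method BREAKS. List failure modes with severity "
        "tags and a concrete breaking scenario for each."
    ),
    "gapfinderAgent": (
        "Find what is MISSING: omitted steps, unexplored alternatives, "
        "and name three deeper implicit assumptions."
    ),
}

def system_prompt(agent: str) -> str:
    # Each agent is locked to exactly one role; no shared context.
    return ROLE_PROMPTS[agent]
```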

Setup

  1. Get an Ejentum API key at ejentum.com and set it in each agent's MCP env field
  2. Add your OpenAI and Anthropic credentials to the agent nodes
  3. Submit any plan, method, or decision as the input text
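Since the workflow needs three separate credentials, it can be worth failing fast before any agent node runs. A small sketch, assuming conventional environment variable names (check your node config panels for the exact keys):

```python
import os

# Hypothetical variable names -- your node config may use different keys.
REQUIRED = ["EJENTUM_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

def missing_credentials(env=os.environ) -> list:
    """Return the credential names that are unset or empty,
    so a run can be aborted before any agent call is made."""
    return [name for name in REQUIRED if not env.get(name)]
```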

Output

{
  "steelman": "...",
  "stress_test": "...",
  "gap_finder": "...",
  "usage_note": "Three independent evaluations, no synthesis. Integrate into your decision; do not score-and-aggregate."
}
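A caller might consume this payload as follows. The field values below are made up for illustration; real runs return full evaluations:

```python
import json

payload = json.loads("""{
  "steelman": "Clear rollback path; cheap to pilot.",
  "stress_test": "HIGH: breaks if the API rate-limits mid-run.",
  "gap_finder": "No monitoring step; assumes stable traffic.",
  "usage_note": "Three independent evaluations, no synthesis."
}""")

# Read each voice on its own terms -- no scoring, no averaging.
for role in ("steelman", "stress_test", "gap_finder"):
    print(f"--- {role} ---\n{payload[role]}\n")
```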

Built by Ejentum · agent-teams repo

How to import this template

  1. Click Import → Copy JSON on this page.
  2. Open your Heym and navigate to a workflow canvas.
  3. Press Cmd+V / Ctrl+V — nodes appear instantly.
  4. Add your API keys in the node config panels and click Run.