#Policy graphs (DAG)
When a workflow's steps have a dependency structure — some can run
in parallel, some depend on multiple parents — chains are too
linear. Graphs let you declare each step's depends_on
explicitly. The engine topologically sorts the graph, runs each
level in parallel via asyncio.gather, and feeds every step the
outcomes of its direct dependencies.
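The level-by-level execution described above can be sketched in a few lines of Python. This is an illustration only, not the engine's code: `topological_levels`, `run_graph`, and the `run_step` coroutine are hypothetical names. It assumes `depends_on` holds plain string ids and that the graph has already been validated, and it leaves out halting and error handling.

```python
import asyncio


def topological_levels(graph):
    """Group node ids into levels; a node lands in the first level after all its parents."""
    deps = {node["id"]: set(node.get("depends_on", [])) for node in graph}
    levels, done = [], set()
    while len(done) < len(deps):
        # A node is ready once every parent has been placed in an earlier level.
        ready = [nid for nid, parents in deps.items()
                 if nid not in done and parents <= done]
        if not ready:
            raise ValueError("cycle detected in graph")
        levels.append(ready)
        done.update(ready)
    return levels


async def run_graph(graph, inputs, run_step):
    """run_step(node, step_inputs) is a stand-in coroutine for one policy evaluation."""
    by_id = {node["id"]: node for node in graph}
    outcomes = {}

    async def run_one(node_id):
        node = by_id[node_id]
        # Only direct parents are injected, keyed by their step id.
        parents = {p: outcomes[p] for p in node.get("depends_on", [])}
        return await run_step(node, {**inputs, **parents})

    for level in topological_levels(graph):
        # Every step in a level runs concurrently; levels run one after another.
        results = await asyncio.gather(*(run_one(nid) for nid in level))
        outcomes.update(zip(level, results))
    return outcomes
```

Two root nodes feeding a third produce two levels: the roots run together, then the dependent step sees both outcomes alongside the base inputs.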
#Chain vs Graph — when to use which
| | Chain (/chain) | Graph (/graph) |
|---|---|---|
| Shape | Linear array, each step sees previous + chain[] | Node list with explicit depends_on edges |
| Execution | Strictly sequential | Steps within a level run in parallel; levels run in order; at most 8 concurrent LLM calls |
| Latency | Σ(step latencies) | Σ over levels of max(step latency within the level) |
| Use when | Each step needs the previous step's outcome | Some steps are independent (e.g. classify + score both feed decide) |
The execution model derives from which endpoint you pick — there's no engine toggle. A workflow with no fanout is a chain. A workflow with fanout is a graph.
#Request body
The body specifies the graph in one of two ways — inline
(graph field, the original shape) or by slug reference
(graph_slug field, references a stored graph authored in the
admin). Exactly one is required; passing both is a 422.
Integrations should prefer graph_slug so a graph definition
change in admin doesn't require a code deploy on the caller side.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| graph | array<GraphStep> | No | — | The nodes of the DAG. 1–20 nodes. Each node carries an operator-chosen `id` (must be unique within the graph), the policy slug, the requested mode, and a `depends_on` list of node ids it consumes. Mutually exclusive with `graph_slug`. |
| graph_slug | string | No | — | Reference to a stored, published graph (authored in the admin under /policies/graphs). The engine loads the current published version and evaluates that. Mutually exclusive with `graph`. Returns `404` if the slug doesn't exist or has no published version. |
| inputs | object | No | {} | Base inputs available to every step. Each step also receives its parents' outcomes injected under their step ids — see Dependency context. |
| context | object | No | {} | Side-channel metadata. Same semantic as on /evaluate and /chain. |
| halt_on | "error" \| "never" | No | "error" | "error" halts after the failed step's level completes: siblings at that level run to completion (they are already firing in parallel), but downstream levels get status: "skipped". "never" runs every level regardless; downstream steps still receive their successful parents' outcomes. |
GraphStep object:
{
"id": "decide_route",
"policy": "itsm/incident-route",
"mode": "decide",
"version": null,
"depends_on": ["classify", "score"]
}
id is the operator-chosen handle for this node within the graph
(lowercase, kebab/snake-case). policy is the slug (the same policy
can appear twice in a graph with different ids). depends_on lists
the parent node ids — empty means a root node that runs at level 0.
#Dependency context
When a step runs, its inputs bag is:
inputs = {
...original_inputs,
"classify": { ...classify_step_outcome },
"score": { ...score_step_outcome }
}
Each direct parent's outcome is injected under the parent's
step id (not policy slug). Policies reference these in their
English text:
"Given classify.primary_label and score.score, pick the appropriate routing action..."
Collision rule: if a parent's id matches an existing input
key, the parent's outcome wins (overwrites). Pick step ids that
don't collide with your base inputs. Failed/skipped parents are
NOT injected — the step's inputs simply omit them.
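A minimal sketch of how such an inputs bag could be assembled, covering the collision and omission rules. The function name and the `results` shape (`{"status": ..., "outcome": ...}` per step id) are assumptions made for illustration, not the engine's internals.

```python
def build_step_inputs(base_inputs, parent_ids, results):
    """Merge base inputs with the outcomes of successful direct parents.

    results maps step id -> {"status": ..., "outcome": ...} (hypothetical shape).
    """
    step_inputs = dict(base_inputs)  # copy so the caller's dict is not mutated
    for parent_id in parent_ids:
        result = results.get(parent_id)
        if result is None or result["status"] != "ok":
            continue  # failed or skipped parents are simply omitted
        # On a key collision the parent's outcome overwrites the base input.
        step_inputs[parent_id] = result["outcome"]
    return step_inputs
```

With a base input named `score` and a parent step also called `score`, the step sees the parent's outcome under that key, which is why distinct step ids are the safer choice.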
#Response
{
"request_id": "uuid",
"steps": [
{ "id": "classify", "policy": "...", "mode": "classify",
"status": "ok", "outcome": {...}, "latency_ms": 1500 },
{ "id": "score", "policy": "...", "mode": "score",
"status": "ok", "outcome": {...}, "latency_ms": 1500 },
{ "id": "decide", "policy": "...", "mode": "decide",
"status": "ok", "outcome": {...}, "latency_ms": 1200 }
],
"leaves": {
"decide": { "action_id": "page_oncall", "payload": {...}, "reason": "..." }
},
"halted_at_level": null,
"levels_executed": 2,
"latency_ms_total": 2700,
"aggregate": { ... } // see Score aggregation
}
steps is in declaration order (matches the request's graph
array) for easy lookup. leaves maps leaf-node id → outcome for
graphs with obvious sink nodes (a final "draft response" step,
for example). halted_at_level is 0-indexed; null when the graph
completed cleanly.
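To anticipate which ids will appear under leaves for a given request, a caller could compute the sink nodes locally. The sketch below assumes a leaf is any node that no other node lists in depends_on; `leaf_ids` is a hypothetical helper, not part of the API.

```python
def leaf_ids(graph):
    """Return ids of nodes that no other node depends on, in declaration order."""
    referenced = set()
    for node in graph:
        for dep in node.get("depends_on", []):
            # An edge is either a plain id or an object with a "step" key.
            referenced.add(dep if isinstance(dep, str) else dep["step"])
    return [node["id"] for node in graph if node["id"] not in referenced]
```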
Notice the latency: classify and score ran in parallel
(level 0, ~1.5s), then decide (level 1, ~1.2s). Total = 2.7s
vs ~4.2s if it had been a chain.
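The arithmetic behind those two figures, as a small sketch. It uses the step latencies from the response above and ignores any orchestration overhead, which is an assumption; the function names are illustrative.

```python
def graph_latency_ms(levels):
    """Each level costs as much as its slowest step; levels add up."""
    return sum(max(level) for level in levels)


def chain_latency_ms(levels):
    """In a chain every step runs on its own, so all latencies add up."""
    return sum(sum(level) for level in levels)


# Level 0: classify + score (1500 ms each); level 1: decide (1200 ms).
levels = [[1500, 1500], [1200]]
graph_total = graph_latency_ms(levels)   # 2700
chain_total = chain_latency_ms(levels)   # 4200
```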
The aggregate block is documented separately in
Score aggregation.
#Example: ITSM incident triage as a graph
The same workflow as the
chain example,
but with classify and score running in parallel:
curl -X POST https://aiengine.velgent.com/api/v1/policies/graph \
-H "Authorization: Bearer velgent_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
-H "Content-Type: application/json" \
-d '{
"graph": [
{ "id": "classify", "policy": "itsm/incident-classify", "mode": "classify" },
{ "id": "score", "policy": "itsm/incident-score", "mode": "score" },
{ "id": "route", "policy": "itsm/incident-route", "mode": "decide",
"depends_on": ["classify", "score"] },
{ "id": "comms", "policy": "itsm/customer-comms", "mode": "generate",
"depends_on": ["classify", "score", "route"] }
],
"inputs": {
"short_description": "Payment service returning 500 errors",
"description": "~5% of transactions failing since 10:15am",
"affected_ci": "payment-svc-prod-01"
}
}'
Execution order:
- Level 0: classify + score (parallel, ~1.5s each → 1.5s)
- Level 1: route (1.2s)
- Level 2: comms (2s)
- Total: ~4.7s vs ~6.2s for the equivalent chain.
#Save and reuse a graph (graph_slug)
Authoring a graph once in the admin and referencing it by slug keeps integrations stable across graph edits. Operators iterate in the admin; integration code never changes.
# After saving the graph at https://admin.velgent.com/policies/graphs
# with slug "itsm/incident-triage" and publishing v1:
curl -X POST https://aiengine.velgent.com/api/v1/policies/graph \
-H "Authorization: Bearer velgent_live_..." \
-H "Content-Type: application/json" \
-d '{
"graph_slug": "itsm/incident-triage",
"inputs": {
"short_description": "Payment service returning 500 errors",
"description": "~5% of transactions failing since 10:15am",
"affected_ci": "payment-svc-prod-01"
}
}'
Same response shape as the inline form. The engine loads the graph's current published version and runs it through the same orchestrator path.
Lifecycle:
- Admin authors the graph in the designer (/policies/graphs/designer)
- Click "Save as graph…" → modal asks for slug + name + description
- First version saved as ready (not published)
- Admin reviews + clicks Publish on the version → atomically flips current_version_id
- Integration call with graph_slug resolves to whichever version is currently published
Atomic publish + immutable versions: in-flight evaluations finish on whichever version they started against; new ones use the new version. Roll back by publishing an older version — one click, one second.
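For Python callers, the slug-based call might be wrapped roughly as follows. This is a sketch using only the standard library: `build_graph_request` is a hypothetical helper, and the client-side exactly-one-of check simply mirrors the server rule so mistakes surface before a request is sent.

```python
import json
import urllib.request

GRAPH_URL = "https://aiengine.velgent.com/api/v1/policies/graph"


def build_graph_request(api_key, inputs, graph=None, graph_slug=None):
    """Build (but do not send) a POST request for an inline or stored graph."""
    if (graph is None) == (graph_slug is None):
        raise ValueError("provide exactly one of 'graph' or 'graph_slug'")
    body = {"inputs": inputs}
    if graph_slug is not None:
        body["graph_slug"] = graph_slug  # stored graph, resolved server-side
    else:
        body["graph"] = graph            # inline node list
    return urllib.request.Request(
        GRAPH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Because only the slug appears in caller code, publishing a new graph version needs no change here.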
#Conditional edges (routing)
Each entry in depends_on can be a plain string (unconditional)
or a ConditionalEdge object that gates the edge behind a
condition evaluated against the parent's outcome:
{
"id": "escalate",
"policy": "ops/escalate",
"mode": "decide",
"depends_on": [
{ "step": "risk", "if": "outcome.score >= 0.7" }
]
}
If the condition evaluates false, the edge doesn't fire — and if
all of a node's incoming edges fail to fire, the node is
naturally skipped (status: "skipped", distinct from
halt-skipped via the detail field). The graph continues; this is
normal branching, not an error.
Common pattern — branching by score:
"graph": [
{ "id": "risk", "policy": "compliance/risk-score", "mode": "score" },
{ "id": "auto_approve", "policy": "ops/auto-approve", "mode": "decide",
"depends_on": [{ "step": "risk", "if": "outcome.score < 0.3" }] },
{ "id": "escalate", "policy": "ops/escalate", "mode": "decide",
"depends_on": [{ "step": "risk", "if": "outcome.score >= 0.3" }] }
]
Exactly one of auto_approve / escalate runs depending on the
risk score. The other gets status: "skipped" with detail
"skipped: condition not met on edge from 'risk'".
#Condition source: string or AST
if accepts either form:
- String: "outcome.score >= 0.7". The engine parses it on receipt. Convenient for raw-API callers and CLI scripts.
- AST object: the pre-compiled JSON shape the admin UI emits after the condition builder finishes. Same shape both directions:
"if": { "type": "compare", "op": ">=", "left": { "type": "path", "parts": ["outcome", "score"] }, "right": { "type": "literal", "value": 0.7 } }
Both compile to the same internal AST and behave identically. Operators never write the AST by hand — the admin UI's condition builder emits it automatically.
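To make the AST shape concrete, here is a small evaluator for the three node types shown above (compare, path, literal). It is a sketch under assumptions, not the engine's implementation: other node types are not covered, the null handling follows the null-safe behaviour described in the Condition language section, and treating type mismatches as false is a guess.

```python
import operator

_OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
        "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def _is_null_literal(node):
    return node["type"] == "literal" and node["value"] is None


def resolve_path(parts, scope):
    """Walk nested dicts; any missing segment yields None instead of raising."""
    current = scope
    for part in parts:
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def evaluate(node, scope):
    kind = node["type"]
    if kind == "literal":
        return node["value"]
    if kind == "path":
        return resolve_path(node["parts"], scope)
    if kind == "compare":
        left, right = evaluate(node["left"], scope), evaluate(node["right"], scope)
        op = node["op"]
        if _is_null_literal(node["left"]) or _is_null_literal(node["right"]):
            # Explicit null check: only == and != are meaningful here.
            both_null = left is None and right is None
            return both_null if op == "==" else (not both_null if op == "!=" else False)
        if left is None or right is None:
            return False  # missing field: the condition is false, not an error
        try:
            return bool(_OPS[op](left, right))
        except TypeError:
            return False  # assumption: incomparable types count as false
    raise ValueError(f"unsupported node type: {kind!r}")
```

The scope passed in would be `{"outcome": parent_outcome}`, so the path `["outcome", "score"]` reads the parent's score.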
#Condition language
The expression language is intentionally narrow — comparisons, boolean logic, membership, null-checks, path access. No function calls, no arithmetic, no string concat. "Test a thing," never "compute a thing."
| Construct | Example | Notes |
|---|---|---|
| Numeric comparison | outcome.score >= 0.7 | ==, !=, >, <, >=, <= |
| String equality | outcome.action_id == "escalate" | Double-quoted strings only |
| Null check | outcome.action_id == null | Special-cased: x == null is true when x is missing/None |
| Boolean literal | outcome.passed == false | true, false |
| List/string membership | "security" in outcome.labels | Python-like; works on lists and strings |
| Boolean AND/OR/NOT | a && b, a \|\| b, !a | && has higher precedence than \|\| |
| Parens | (a \|\| b) && c | Override default precedence |
| Nested path | outcome.payload.amount > 100 | .-separated; missing path returns null (false) |
Null-safe by construction. A condition that references a
field that doesn't exist on the parent's outcome (typo, refactored
outcome shape, parent ran in a different mode) silently evaluates
to false, so the node is naturally skipped rather than failing with a 500.
Any comparison against a null value returns false; the exceptions are
x == null and x != null, which are the explicit checks for missing fields.
#Multi-parent AND semantics
When a node has multiple incoming edges, all of them must fire (AND). Multi-parent OR is deferred — operators expressing OR today restructure the graph or put the OR inside a single condition that references multiple parents' outcomes via the shared inputs bag.
#Halt vs conditional skip
A node can end up with status: "skipped" for four different reasons, each identified by its detail string:
| Reason | When | Detail format |
|---|---|---|
| Halt-skipped | halt_on='error' triggered after a parent failed | "skipped: halt_on='error' triggered at level N" |
| Condition not met | The parent succeeded but the edge condition was false | "skipped: condition not met on edge from 'parent_id'" |
| Cascading | An unconditional parent was itself skipped | "skipped: cascading from 'parent_id' (which was skipped)" |
| Condition unevaluable | A conditional edge had a non-ok parent (no outcome to test) | "skipped: condition on edge from 'parent_id' cannot be evaluated; parent status='not_found'" |
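The per-edge rules in this table, combined with the all-edges-must-fire rule, can be sketched as a single decision function. All names and the `results` shape are illustrative assumptions. Halt-skipping is left out because it is decided per level rather than per edge, and `condition_holds` stands in for whatever evaluates an edge condition.

```python
def decide_node(edges, results, condition_holds):
    """Return ("run", None) or ("skipped", detail) for one node.

    edges: list of {"step": parent_id, "if": condition or None}
    results: parent_id -> {"status": ..., "outcome": ...} (hypothetical shape)
    condition_holds(condition, outcome) -> bool (hypothetical evaluator)
    """
    for edge in edges:
        parent_id, condition = edge["step"], edge.get("if")
        parent = results[parent_id]
        status = parent["status"]
        if condition is None:
            # Unconditional edge: only a skipped parent cascades.
            if status == "skipped":
                return "skipped", (f"skipped: cascading from '{parent_id}' "
                                   "(which was skipped)")
            continue
        if status != "ok":
            # There is no outcome to test the condition against.
            return "skipped", (f"skipped: condition on edge from '{parent_id}' "
                               f"cannot be evaluated; parent status='{status}'")
        if not condition_holds(condition, parent["outcome"]):
            return "skipped", f"skipped: condition not met on edge from '{parent_id}'"
    return "run", None  # every incoming edge fired (or the node is a root)
```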
#Validation
Pre-execution checks reject the request with a 422 and a structured
message (an unresolvable graph_slug returns 404 instead):
| When | Detail |
|---|---|
| Two nodes share the same id | duplicate step ids in graph: [...] |
| A depends_on ref doesn't exist | step 'x' depends_on unknown step 'y'; declared ids are [...] |
| The graph contains a cycle | cycle detected in graph; unresolvable nodes: [...] |
| A node depends on itself | step 'x' depends on itself |
| A condition source has a syntax error | step 'x': invalid condition source on edge from 'y': ... |
| A condition AST is malformed | step 'x': invalid AST on edge from 'y': ... |
| Both graph and graph_slug set | provide EITHER 'graph' (inline) OR 'graph_slug' (stored), not both |
| Neither graph nor graph_slug set | provide 'graph' (inline) or 'graph_slug' (stored graph reference) |
| graph_slug doesn't exist or is unpublished | 404: graph 'x' not found or has no published version |
These run before any LLM call — misconfigured graphs reject cheaply, including bad conditions.
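Running the structural checks locally before sending a request can catch the same mistakes earlier. The sketch below covers duplicate ids, unknown references, self-dependencies, and cycles. `validate_graph` is a hypothetical client-side helper whose messages only imitate the formats in the table; condition syntax is not checked.

```python
def validate_graph(graph):
    """Raise ValueError on the structural problems listed in the table above."""
    ids = [node["id"] for node in graph]
    dupes = sorted({i for i in ids if ids.count(i) > 1})
    if dupes:
        raise ValueError(f"duplicate step ids in graph: {dupes}")

    declared = set(ids)
    parents = {}
    for node in graph:
        nid = node["id"]
        # An edge is either a plain id or an object with a "step" key.
        refs = [d if isinstance(d, str) else d["step"]
                for d in node.get("depends_on", [])]
        for ref in refs:
            if ref == nid:
                raise ValueError(f"step '{nid}' depends on itself")
            if ref not in declared:
                raise ValueError(f"step '{nid}' depends_on unknown step '{ref}'; "
                                 f"declared ids are {ids}")
        parents[nid] = set(refs)

    # Repeatedly resolve nodes whose parents are all resolved; leftovers form a cycle.
    resolved = set()
    while True:
        ready = {nid for nid, ps in parents.items()
                 if nid not in resolved and ps <= resolved}
        if not ready:
            break
        resolved |= ready
    unresolved = [i for i in ids if i not in resolved]
    if unresolved:
        raise ValueError(f"cycle detected in graph; unresolvable nodes: {unresolved}")
```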
The engine caps parallel LLM calls at 8 per graph evaluation. A wide level (12 independent steps) runs in two waves of 8 + 4 rather than firing all at once. No requests are rejected — they queue on an internal semaphore.
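The queueing behaviour described here matches the usual asyncio.Semaphore pattern, sketched below. `run_level` and `call_llm` are hypothetical names. The semaphore is created per level for brevity, which is equivalent to one per evaluation as long as levels run one after another.

```python
import asyncio

MAX_CONCURRENT_LLM_CALLS = 8  # the cap stated above


async def run_level(steps, call_llm, limit=MAX_CONCURRENT_LLM_CALLS):
    """Run every step of a level, letting at most `limit` calls be in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(step):
        # Steps beyond the limit wait here; nothing is rejected.
        async with semaphore:
            return await call_llm(step)

    # gather preserves input order in its results regardless of finish order.
    return await asyncio.gather(*(guarded(step) for step in steps))
```

A slot frees up as soon as any call finishes, so a 12-step level starts 8 calls immediately and the remaining 4 as earlier ones complete.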
Next: Score aggregation, the weighted aggregate block returned on every multi-step response.