#Policy graphs (DAG)

When a workflow's steps have a dependency structure — some can run in parallel, some depend on multiple parents — chains are too linear. Graphs let you declare each step's depends_on explicitly. The engine topologically sorts the graph, runs each level in parallel via asyncio.gather, and feeds every step the outcomes of its direct dependencies.
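
The level-by-level model described above can be illustrated with a small sketch. It groups nodes into levels and awaits each level with asyncio.gather. The names topo_levels, run_graph, and run_step are hypothetical and are not part of the engine's API; the real orchestrator may be structured differently.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical step runner: receives a node id and its parents' outcomes.
StepRunner = Callable[[str, dict[str, Any]], Awaitable[Any]]


def topo_levels(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group node ids into levels; a node is ready once all its parents ran."""
    remaining = {node: set(parents) for node, parents in deps.items()}
    done: set[str] = set()
    levels: list[list[str]] = []
    while remaining:
        # Dict order is preserved, so each level keeps declaration order.
        ready = [n for n, parents in remaining.items() if parents <= done]
        if not ready:
            raise ValueError(f"cycle detected; unresolvable nodes: {sorted(remaining)}")
        levels.append(ready)
        done.update(ready)
        for node in ready:
            del remaining[node]
    return levels


async def run_graph(deps: dict[str, list[str]], run_step: StepRunner) -> dict[str, Any]:
    """Run levels in order; steps inside one level run concurrently."""
    outcomes: dict[str, Any] = {}
    for level in topo_levels(deps):
        results = await asyncio.gather(
            *(run_step(node, {p: outcomes[p] for p in deps[node]}) for node in level)
        )
        outcomes.update(zip(level, results))
    return outcomes
```

For a graph where route depends on classify and score, topo_levels returns classify and score together in the first level and route in the second.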

POST https://aiengine.velgent.com/api/v1/policies/graph

#Chain vs Graph — when to use which

	Chain (`/chain`)	Graph (`/graph`)
Shape	Linear array, each step sees `previous` + `chain[]`	Node list with explicit `depends_on` edges
Execution	Strictly sequential	Levels run in order; steps within a level run in parallel, capped at 8 concurrent LLM calls
Latency	Σ(step latencies)	Σ(level latencies), where a level's latency is the max of its parallel steps
Use when	Each step needs the previous step's outcome	Some steps are independent (e.g. classify + score both feed decide)

The execution model derives from which endpoint you pick — there's no engine toggle. A workflow with no fanout is a chain. A workflow with fanout is a graph.

#Request body

The body specifies the graph in one of two ways: inline (graph field, the original shape) or by slug reference (graph_slug field, which references a stored graph authored in the admin). Exactly one is required; passing both, or neither, is a 422.

Integrations should prefer graph_slug so a graph definition change in admin doesn't require a code deploy on the caller side.

Field	Type	Required	Default	Description
graph	array<GraphStep>	No	—	The nodes of the DAG. 1–20 nodes. Each node carries an operator-chosen `id` (must be unique within the graph), the policy slug, the requested mode, and a `depends_on` list of node ids it consumes. Mutually exclusive with `graph_slug`.
graph_slug	string	No	—	Reference to a stored, published graph (authored in the admin under /policies/graphs). The engine loads the current published version and evaluates that. Mutually exclusive with `graph`. Returns `404` if the slug doesn't exist or has no published version.
inputs	object	No	{}	Base inputs available to every step. Each step also receives its parents' outcomes injected under their step ids — see Dependency context.
context	object	No	{}	Side-channel metadata. Same semantic as on /evaluate and /chain.
halt_on	"error" \| "never"	No	"error"	`"error"` halts after the failed step's level completes — siblings at that level run to completion (already firing in parallel) but downstream levels get `status: "skipped"`. `"never"` runs every level regardless; downstream steps still receive their successful parents' outcomes.

GraphStep object:

{
  "id":         "decide_route",
  "policy":     "itsm/incident-route",
  "mode":       "decide",
  "version":    null,
  "depends_on": ["classify", "score"]
}

id is the operator-chosen handle for this node within the graph (lowercase, kebab/snake-case). policy is the slug (the same policy can appear twice in a graph with different ids). depends_on lists the parent node ids — empty means a root node that runs at level 0.
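
A caller can enforce the documented body rules before sending anything. The helper below is a minimal client-side sketch, not engine code: build_graph_body is a hypothetical name, and the checks only mirror the constraints stated above (exactly one of graph / graph_slug, 1 to 20 nodes, unique ids, halt_on values).

```python
from typing import Any, Optional


def build_graph_body(
    inputs: dict[str, Any],
    graph: Optional[list[dict[str, Any]]] = None,
    graph_slug: Optional[str] = None,
    halt_on: str = "error",
) -> dict[str, Any]:
    """Assemble a request body, failing early on rules the API would reject."""
    # Exactly one of the two graph sources must be present.
    if (graph is None) == (graph_slug is None):
        raise ValueError("provide exactly one of 'graph' or 'graph_slug'")
    if halt_on not in ("error", "never"):
        raise ValueError("halt_on must be 'error' or 'never'")

    body: dict[str, Any] = {"inputs": inputs, "halt_on": halt_on}
    if graph is not None:
        if not 1 <= len(graph) <= 20:
            raise ValueError("an inline graph must contain 1 to 20 nodes")
        ids = [step["id"] for step in graph]
        if len(ids) != len(set(ids)):
            raise ValueError("step ids must be unique within the graph")
        body["graph"] = graph
    else:
        body["graph_slug"] = graph_slug
    return body
```

The engine still performs its own validation; this only saves a round trip for obviously malformed requests.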

#Dependency context

When a step runs, its inputs bag is:

inputs = {
  ...original_inputs,
  "classify": { ...classify_step_outcome },
  "score":    { ...score_step_outcome }
}

Each direct parent's outcome is injected under the parent's step id (not policy slug). Policies reference these in their English text:

"Given classify.primary_label and score.score, pick the appropriate routing action..."

Collision rule: if a parent's id matches an existing input key, the parent's outcome wins (overwrites). Pick step ids that don't collide with your base inputs. Failed/skipped parents are NOT injected — the step's inputs simply omit them.
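
The injection and collision rules can be expressed in a few lines. This is a sketch under the assumption that each step result is a dict with status and outcome keys, as in the response shape; build_step_inputs is a hypothetical name.

```python
from typing import Any


def build_step_inputs(
    base_inputs: dict[str, Any],
    parent_ids: list[str],
    results: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    """Merge base inputs with the outcomes of direct parents that succeeded."""
    inputs = dict(base_inputs)  # copy so the caller's dict is not mutated
    for parent_id in parent_ids:
        result = results.get(parent_id)
        # Failed or skipped parents are omitted rather than injected as null.
        if result is not None and result.get("status") == "ok":
            # On a key collision the parent's outcome overwrites the base input.
            inputs[parent_id] = result["outcome"]
    return inputs
```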

#Response

{
  "request_id": "uuid",
  "steps": [
    { "id": "classify", "policy": "...", "mode": "classify",
      "status": "ok", "outcome": {...}, "latency_ms": 1500 },
    { "id": "score",    "policy": "...", "mode": "score",
      "status": "ok", "outcome": {...}, "latency_ms": 1500 },
    { "id": "decide",   "policy": "...", "mode": "decide",
      "status": "ok", "outcome": {...}, "latency_ms": 1200 }
  ],
  "leaves": {
    "decide": { "action_id": "page_oncall", "payload": {...}, "reason": "..." }
  },
  "halted_at_level":  null,
  "levels_executed":  2,
  "latency_ms_total": 2700,
  "aggregate":        { ... }    // see Score aggregation
}

steps is in declaration order (matches the request's graph array) for easy lookup. leaves maps leaf-node id → outcome for graphs with obvious sink nodes (a final "draft response" step, for example). halted_at_level is 0-indexed; null when the graph completed cleanly.

Notice the latency: classify and score ran in parallel (level 0, ~1.5s), then decide (level 1, ~1.2s). Total = 2.7s vs ~4.2s if it had been a chain.

The aggregate block is documented separately in Score aggregation.

#Example: ITSM incident triage as a graph

The same workflow as the chain example, but with classify and score running in parallel:

curl -X POST https://aiengine.velgent.com/api/v1/policies/graph \
  -H "Authorization: Bearer velgent_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "graph": [
      { "id": "classify", "policy": "itsm/incident-classify", "mode": "classify" },
      { "id": "score",    "policy": "itsm/incident-score",    "mode": "score" },
      { "id": "route",    "policy": "itsm/incident-route",    "mode": "decide",
        "depends_on": ["classify", "score"] },
      { "id": "comms",    "policy": "itsm/customer-comms",    "mode": "generate",
        "depends_on": ["classify", "score", "route"] }
    ],
    "inputs": {
      "short_description": "Payment service returning 500 errors",
      "description":       "~5% of transactions failing since 10:15am",
      "affected_ci":       "payment-svc-prod-01"
    }
  }'

Execution order:

Level 0: classify + score (parallel, ~1.5s each → 1.5s)
Level 1: route (1.2s)
Level 2: comms (2s)
Total: ~4.7s vs ~6.2s for the equivalent chain.
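
The totals above follow directly from the latency rule: a graph pays the slowest step of each level, a chain pays every step. The arithmetic, using the approximate per-step latencies from this example in milliseconds (function names are illustrative only):

```python
def graph_latency_ms(levels: list[list[int]]) -> int:
    """Sum over levels of the slowest step in each level."""
    return sum(max(level) for level in levels)


def chain_latency_ms(levels: list[list[int]]) -> int:
    """Sum of every step, as if all steps ran one after another."""
    return sum(sum(level) for level in levels)


# Level 0: classify + score, level 1: route, level 2: comms.
levels = [[1500, 1500], [1200], [2000]]
print(graph_latency_ms(levels))  # 4700
print(chain_latency_ms(levels))  # 6200
```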

#Save and reuse a graph (`graph_slug`)

Authoring a graph once in the admin and referencing it by slug keeps integrations stable across graph edits. Operators iterate in the admin; integration code never changes.

# After saving the graph at https://admin.velgent.com/policies/graphs
# with slug "itsm/incident-triage" and publishing v1:

curl -X POST https://aiengine.velgent.com/api/v1/policies/graph \
  -H "Authorization: Bearer velgent_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "graph_slug": "itsm/incident-triage",
    "inputs": {
      "short_description": "Payment service returning 500 errors",
      "description":       "~5% of transactions failing since 10:15am",
      "affected_ci":       "payment-svc-prod-01"
    }
  }'

Same response shape as the inline form. The engine loads the graph's current published version and runs it through the same orchestrator path.
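
The same call from Python, as a sketch using only the standard library. The function name build_slug_request is hypothetical; the URL, headers, and body fields are taken from the curl example above. The request is built but not sent, so the sending step is shown as a comment.

```python
import json
import urllib.request
from typing import Any

GRAPH_URL = "https://aiengine.velgent.com/api/v1/policies/graph"


def build_slug_request(
    api_key: str, graph_slug: str, inputs: dict[str, Any]
) -> urllib.request.Request:
    """Build a POST request that evaluates a stored graph by slug."""
    body = json.dumps({"graph_slug": graph_slug, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        GRAPH_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it (requires network access and a valid key):
#     with urllib.request.urlopen(request) as response:
#         result = json.load(response)
```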

Lifecycle:

1. Admin authors the graph in the designer (/policies/graphs/designer).
2. Admin clicks "Save as graph…" → a modal asks for slug + name + description.
3. The first version is saved as ready (not published).
4. Admin reviews + clicks Publish on the version → atomically flips current_version_id.
5. An integration call with graph_slug resolves to whichever version is currently published.

Atomic publish + immutable versions: in-flight evaluations finish on whichever version they started against; new ones use the new version. Roll back by publishing an older version — one click, one second.

#Conditional edges (routing)

Each entry in depends_on can be a plain string (unconditional) or a ConditionalEdge object that gates the edge behind a condition evaluated against the parent's outcome:

{
  "id": "escalate",
  "policy": "ops/escalate",
  "mode": "decide",
  "depends_on": [
    { "step": "risk", "if": "outcome.score >= 0.7" }
  ]
}

If the condition evaluates false, the edge doesn't fire — and if all of a node's incoming edges fail to fire, the node is naturally skipped (status: "skipped", distinct from halt-skipped via the detail field). The graph continues; this is normal branching, not an error.

Common pattern — branching by score:

"graph": [
  { "id": "risk", "policy": "compliance/risk-score", "mode": "score" },

  { "id": "auto_approve", "policy": "ops/auto-approve", "mode": "decide",
    "depends_on": [{ "step": "risk", "if": "outcome.score < 0.3" }] },

  { "id": "escalate", "policy": "ops/escalate", "mode": "decide",
    "depends_on": [{ "step": "risk", "if": "outcome.score >= 0.3" }] }
]

Exactly one of auto_approve / escalate runs depending on the risk score, provided the risk step returns a score. The other gets status: "skipped" with detail "skipped: condition not met on edge from 'risk'".

#Condition source: string or AST

if accepts either form:

String — "outcome.score >= 0.7". Engine parses on receipt. Convenient for raw-API callers and CLI scripts.

AST object — the pre-compiled JSON shape the admin UI emits after the condition builder finishes. Same shape both directions:

"if": {
  "type": "compare",
  "op":   ">=",
  "left":  { "type": "path", "parts": ["outcome", "score"] },
  "right": { "type": "literal", "value": 0.7 }
}

Both compile to the same internal AST and behave identically. Operators never write the AST by hand — the admin UI's condition builder emits it automatically.
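
To make the string/AST equivalence concrete, the sketch below converts the simplest string form (one path, one comparison operator, one literal) into the AST shape shown above. It is an illustration only: parse_simple_condition is a hypothetical helper, it does not handle &&, ||, !, in, or parentheses, and the engine's actual parser is not documented here.

```python
import json
import re

# path, comparison operator, literal; two-character operators are tried first
_COMPARE = re.compile(
    r"\s*([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)\s*(==|!=|>=|<=|>|<)\s*(.+?)\s*"
)


def parse_simple_condition(source: str) -> dict:
    """Parse '<path> <op> <literal>' into a compare AST node."""
    match = _COMPARE.fullmatch(source)
    if match is None:
        raise ValueError(f"unsupported condition: {source!r}")
    path, op, literal = match.groups()
    try:
        # JSON literal syntax covers numbers, double-quoted strings,
        # true, false, and null.
        value = json.loads(literal)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid literal: {literal!r}") from exc
    if isinstance(value, (list, dict)):
        raise ValueError("only scalar literals are supported in this sketch")
    return {
        "type": "compare",
        "op": op,
        "left": {"type": "path", "parts": path.split(".")},
        "right": {"type": "literal", "value": value},
    }
```

Calling parse_simple_condition("outcome.score >= 0.7") yields the same structure as the AST example above.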

#Condition language

The expression language is intentionally narrow — comparisons, boolean logic, membership, null-checks, path access. No function calls, no arithmetic, no string concat. "Test a thing," never "compute a thing."

Construct	Example	Notes
Numeric comparison	`outcome.score >= 0.7`	`==`, `!=`, `>`, `<`, `>=`, `<=`
String equality	`outcome.action_id == "escalate"`	Double-quoted strings only
Null check	`outcome.action_id == null`	Special-cased: `x == null` is true when `x` is missing/None
Boolean literal	`outcome.passed == false`	`true`, `false`
List/string membership	`"security" in outcome.labels`	Python-like; works on lists and strings
Boolean AND/OR/NOT	`a && b`, `a \|\| b`, `!a`	`&&` has higher precedence than `\|\|`
Parens	`(a \|\| b) && c`	Override default precedence
Nested path	`outcome.payload.amount > 100`	`.`-separated; missing path returns null (false)

Null-safe by construction. A condition that references a field that doesn't exist on the parent's outcome (typo, refactored outcome shape, parent ran in a different mode) silently evaluates to false, so the node is naturally skipped rather than failing with a 500. Any comparison whose operand resolves to null returns false; x == null and x != null are the explicit checks for missing fields.
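
The null-safe behaviour can be sketched as an evaluator over the compare / path / literal AST nodes shown earlier. This is one reading of the rules above, not the engine's implementation: evaluate_compare and its helpers are hypothetical names, and treating a type mismatch (for example a string compared with a number) as false is an assumption.

```python
import operator
from typing import Any

_OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
}


def _resolve(node: dict, scope: dict) -> Any:
    """Return a literal's value, or walk a path; a missing path yields None."""
    if node["type"] == "literal":
        return node["value"]
    if node["type"] == "path":
        current: Any = scope
        for part in node["parts"]:
            if not isinstance(current, dict) or part not in current:
                return None
            current = current[part]
        return current
    raise ValueError(f"unsupported node type: {node['type']!r}")


def _is_null_literal(node: dict) -> bool:
    return node["type"] == "literal" and node["value"] is None


def evaluate_compare(ast: dict, scope: dict) -> bool:
    """Evaluate a compare node against a scope such as {'outcome': {...}}."""
    left, right = _resolve(ast["left"], scope), _resolve(ast["right"], scope)
    op = ast["op"]
    # Explicit null checks: `x == null` / `x != null`.
    if op in ("==", "!=") and (_is_null_literal(ast["left"]) or _is_null_literal(ast["right"])):
        both_null = left is None and right is None
        return both_null if op == "==" else not both_null
    # Any other comparison involving a missing value is false.
    if left is None or right is None:
        return False
    try:
        return bool(_OPS[op](left, right))
    except TypeError:
        return False  # assumption: incomparable types count as "not met"
```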

#Multi-parent AND semantics

When a node has multiple incoming edges, all of them must fire (AND). Multi-parent OR is deferred — operators expressing OR today restructure the graph or put the OR inside a single condition that references multiple parents' outcomes via the shared inputs bag.

#Halt vs conditional skip

A node can end up with status: "skipped" for four different reasons, distinguishable by the detail field:

Reason	When	Detail format
Halt-skipped	`halt_on='error'` triggered after a parent failed	`"skipped: halt_on='error' triggered at level N"`
Condition not met	The parent succeeded but the edge condition was false	`"skipped: condition not met on edge from 'parent_id'"`
Cascading	An unconditional parent was itself skipped	`"skipped: cascading from 'parent_id' (which was skipped)"`
Condition unevaluable	A conditional edge had a non-ok parent (no outcome to test)	`"skipped: condition on edge from 'parent_id' cannot be evaluated; parent status='not_found'"`
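
The three edge-driven rows of the table (everything except halt-skipped, which is decided per level) can be read as a single pass over a node's incoming edges, which also reflects the AND semantics: the first edge that does not fire skips the node. The sketch below is an interpretation of the table, not engine code. skip_detail is a hypothetical name, conditions are modelled as plain Python predicates over the parent's outcome, and letting an unconditional edge from a failed (non-skipped) parent still fire is an assumption based on the halt_on "never" description.

```python
from typing import Callable, Optional, Union

Condition = Callable[[dict], bool]
# An edge is either a parent id (unconditional) or (parent id, predicate).
Edge = Union[str, tuple[str, Condition]]


def skip_detail(edges: list[Edge], results: dict[str, dict]) -> Optional[str]:
    """Return the skip detail for a node, or None if every edge fires."""
    for edge in edges:
        parent, condition = (edge, None) if isinstance(edge, str) else edge
        status = results[parent]["status"]
        if condition is None:
            if status == "skipped":
                return f"skipped: cascading from '{parent}' (which was skipped)"
            continue  # unconditional edge fires
        if status != "ok":
            return (
                f"skipped: condition on edge from '{parent}' cannot be "
                f"evaluated; parent status='{status}'"
            )
        if not condition(results[parent]["outcome"]):
            return f"skipped: condition not met on edge from '{parent}'"
    return None
```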

#Validation

Pre-execution checks reject the request with a 422 (or a 404 for an unresolvable graph_slug) and a structured message:

When	Detail
Two nodes share the same `id`	`duplicate step ids in graph: [...]`
A `depends_on` ref doesn't exist	`step 'x' depends_on unknown step 'y'; declared ids are [...]`
The graph contains a cycle	`cycle detected in graph; unresolvable nodes: [...]`
A node depends on itself	`step 'x' depends on itself`
A condition source has a syntax error	`step 'x': invalid condition source on edge from 'y': ...`
A condition AST is malformed	`step 'x': invalid AST on edge from 'y': ...`
Both `graph` and `graph_slug` set	`provide EITHER 'graph' (inline) OR 'graph_slug' (stored), not both`
Neither `graph` nor `graph_slug` set	`provide 'graph' (inline) or 'graph_slug' (stored graph reference)`
`graph_slug` doesn't exist or is unpublished	`404` — `graph 'x' not found or has no published version`

These run before any LLM call, so misconfigured graphs, including ones with bad conditions, are rejected cheaply.
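
The structural checks in the table (duplicate ids, unknown references, self-dependency, cycles) can be reproduced client-side. The sketch below is an approximation for pre-flight use, with the hypothetical name validate_graph; message wording follows the table but the exact list formatting is a guess, and condition syntax is not checked here.

```python
from collections import Counter


def _parent_ids(step: dict) -> list[str]:
    """depends_on entries are plain ids or {'step': id, 'if': ...} objects."""
    return [d if isinstance(d, str) else d["step"] for d in step.get("depends_on", [])]


def validate_graph(graph: list[dict]) -> None:
    """Raise ValueError for the first structural problem found."""
    ids = [step["id"] for step in graph]
    dupes = sorted(i for i, count in Counter(ids).items() if count > 1)
    if dupes:
        raise ValueError(f"duplicate step ids in graph: {dupes}")

    for step in graph:
        for parent in _parent_ids(step):
            if parent == step["id"]:
                raise ValueError(f"step '{step['id']}' depends on itself")
            if parent not in ids:
                raise ValueError(
                    f"step '{step['id']}' depends_on unknown step '{parent}'; "
                    f"declared ids are {ids}"
                )

    # Repeatedly peel off nodes whose parents are resolved; leftovers form a cycle.
    remaining = {step["id"]: set(_parent_ids(step)) for step in graph}
    done: set[str] = set()
    while remaining:
        ready = [n for n, parents in remaining.items() if parents <= done]
        if not ready:
            raise ValueError(
                f"cycle detected in graph; unresolvable nodes: {sorted(remaining)}"
            )
        done.update(ready)
        for node in ready:
            del remaining[node]
```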

#Concurrency cap

The engine caps parallel LLM calls at 8 per graph evaluation. A wide level (12 independent steps) runs in two waves of 8 + 4 rather than firing all at once. No requests are rejected — they queue on an internal semaphore.
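
A semaphore-bounded gather is one common way to get this behaviour. The sketch below assumes that pattern; run_level and run_step are hypothetical names, and the engine's internals may differ. Note that with a semaphore a waiting step starts as soon as any slot frees up, so the "waves" are approximate rather than strict batches.

```python
import asyncio
from typing import Any, Awaitable, Callable

MAX_CONCURRENT_LLM_CALLS = 8  # the documented cap


async def run_level(
    step_ids: list[str],
    run_step: Callable[[str], Awaitable[Any]],
    limit: int = MAX_CONCURRENT_LLM_CALLS,
) -> list[Any]:
    """Run every step of one level, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(step_id: str) -> Any:
        # Steps beyond the limit wait here instead of being rejected.
        async with semaphore:
            return await run_step(step_id)

    # gather returns results in the same order as step_ids.
    return await asyncio.gather(*(guarded(step_id) for step_id in step_ids))
```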

Next: Score aggregation, the weighted aggregate block returned on every multi-step response.