#Policy graphs (DAG)

When a workflow's steps have a dependency structure — some can run in parallel, some depend on multiple parents — chains are too linear. Graphs let you declare each step's depends_on explicitly. The engine topologically sorts the graph, runs each level in parallel via asyncio.gather, and feeds every step the outcomes of its direct dependencies.

POST https://aiengine.velgent.com/api/v1/policies/graph
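The level grouping described above can be sketched as follows. This is an illustrative reconstruction, not the engine's code: `compute_levels` and its error wording are hypothetical, and the only assumption taken from the text is that a node runs one level after its deepest parent.

```python
from typing import Dict, List


def compute_levels(depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Group node ids into levels: a node lands one level after its deepest parent.

    `depends_on` maps each node id to the ids of its direct parents.
    Raises ValueError on an unknown parent or a cycle.
    """
    for node, parents in depends_on.items():
        for parent in parents:
            if parent not in depends_on:
                raise ValueError(f"step {node!r} depends_on unknown step {parent!r}")

    placed: Dict[str, int] = {}
    levels: List[List[str]] = []
    remaining = dict(depends_on)
    while remaining:
        # A node is ready once every parent sits in an earlier level.
        ready = [n for n, ps in remaining.items() if all(p in placed for p in ps)]
        if not ready:
            raise ValueError(f"cycle detected; unresolvable nodes: {sorted(remaining)}")
        for n in ready:
            placed[n] = len(levels)
            del remaining[n]
        levels.append(ready)
    return levels


print(compute_levels({
    "classify": [],
    "score": [],
    "route": ["classify", "score"],
    "comms": ["classify", "score", "route"],
}))
# [['classify', 'score'], ['route'], ['comms']]
```

Every id inside one inner list has no dependency on its siblings, which is what makes it safe to run a level concurrently.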

#Chain vs Graph — when to use which

| | Chain (/chain) | Graph (/graph) |
|---|---|---|
| Shape | Linear array, each step sees previous + chain[] | Node list with explicit depends_on edges |
| Execution | Strictly sequential | Levels parallel; cap 8 LLM calls concurrent |
| Latency | Σ(step latencies) | Σ(level latencies); max-of-level for parallel steps |
| Use when | Each step needs the previous step's outcome | Some steps are independent (e.g. classify + score both feed decide) |

The execution model derives from which endpoint you pick — there's no engine toggle. A workflow with no fanout is a chain. A workflow with fanout is a graph.

#Request body

The body specifies the graph in one of two ways: inline (the graph field, the original shape) or by slug reference (the graph_slug field, which references a stored graph authored in the admin). Exactly one is required; passing both, or neither, returns a 422.

Integrations should prefer graph_slug so a graph definition change in admin doesn't require a code deploy on the caller side.

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| graph | array<GraphStep> | No | | The nodes of the DAG. 1–20 nodes. Each node carries an operator-chosen `id` (must be unique within the graph), the policy slug, the requested mode, and a `depends_on` list of node ids it consumes. Mutually exclusive with `graph_slug`. |
| graph_slug | string | No | | Reference to a stored, published graph (authored in the admin under /policies/graphs). The engine loads the current published version and evaluates that. Mutually exclusive with `graph`. Returns `404` if the slug doesn't exist or has no published version. |
| inputs | object | No | {} | Base inputs available to every step. Each step also receives its parents' outcomes injected under their step ids; see Dependency context. |
| context | object | No | {} | Side-channel metadata. Same semantic as on /evaluate and /chain. |
| halt_on | "error" \| "never" | No | "error" | "error" halts after the failed step's level completes: siblings at that level run to completion (already firing in parallel) but downstream levels get status: "skipped". "never" runs every level regardless; downstream steps still receive their successful parents' outcomes. |

GraphStep object:

{
  "id":         "decide_route",
  "policy":     "itsm/incident-route",
  "mode":       "decide",
  "version":    null,
  "depends_on": ["classify", "score"]
}

id is the operator-chosen handle for this node within the graph (lowercase, kebab/snake-case). policy is the slug (the same policy can appear twice in a graph with different ids). depends_on lists the parent node ids — empty means a root node that runs at level 0.
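A caller can enforce the "exactly one of graph / graph_slug" rule before sending anything. The helper below is a minimal sketch under that assumption; `build_graph_request` is a hypothetical name and the error wording is illustrative, not the engine's.

```python
from typing import Any, Dict, List, Optional


def build_graph_request(
    inputs: Dict[str, Any],
    graph: Optional[List[Dict[str, Any]]] = None,
    graph_slug: Optional[str] = None,
    halt_on: str = "error",
) -> Dict[str, Any]:
    """Build a /graph request body with exactly one of graph / graph_slug."""
    # Both set or both missing would be rejected by the engine with a 422.
    if (graph is None) == (graph_slug is None):
        raise ValueError("provide exactly one of 'graph' or 'graph_slug'")
    if halt_on not in ("error", "never"):
        raise ValueError("halt_on must be 'error' or 'never'")

    body: Dict[str, Any] = {"inputs": inputs, "halt_on": halt_on}
    if graph is not None:
        body["graph"] = graph
    else:
        body["graph_slug"] = graph_slug
    return body


body = build_graph_request(
    inputs={"short_description": "Payment service returning 500 errors"},
    graph=[
        {"id": "classify", "policy": "itsm/incident-classify", "mode": "classify"},
        {"id": "route", "policy": "itsm/incident-route", "mode": "decide",
         "depends_on": ["classify"]},
    ],
)
```

The resulting dictionary can be serialized with `json.dumps` and posted with any HTTP client.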

#Dependency context

When a step runs, its inputs bag is:

inputs = {
  ...original_inputs,
  "classify": { ...classify_step_outcome },
  "score":    { ...score_step_outcome }
}

Each direct parent's outcome is injected under the parent's step id (not policy slug). Policies reference these in their English text:

"Given classify.primary_label and score.score, pick the appropriate routing action..."

Collision rule: if a parent's id matches an existing input key, the parent's outcome wins (overwrites). Pick step ids that don't collide with your base inputs. Failed/skipped parents are NOT injected — the step's inputs simply omit them.
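The merge rules above (parent outcome wins on collision, failed or skipped parents are omitted) can be mirrored in a few lines. This is a sketch for reasoning about what a policy will see; `build_step_inputs` and the shape of `results` are assumptions, not the engine's internals.

```python
from typing import Any, Dict, Iterable


def build_step_inputs(
    base_inputs: Dict[str, Any],
    parent_ids: Iterable[str],
    results: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge base inputs with the outcomes of direct parents that finished ok."""
    merged = dict(base_inputs)  # copy so the caller's dict is not mutated
    for parent_id in parent_ids:
        result = results.get(parent_id)
        if result is None or result.get("status") != "ok":
            continue  # failed or skipped parents are omitted entirely
        merged[parent_id] = result["outcome"]  # parent outcome overwrites a base key
    return merged


inputs = build_step_inputs(
    base_inputs={"description": "500 errors", "score": "legacy value"},
    parent_ids=["classify", "score", "risk"],
    results={
        "classify": {"status": "ok", "outcome": {"primary_label": "outage"}},
        "score": {"status": "ok", "outcome": {"score": 0.9}},
        "risk": {"status": "skipped"},
    },
)
print(inputs["score"])   # {'score': 0.9}
print("risk" in inputs)  # False
```

The base key `score` is overwritten here, which is the collision the text recommends avoiding by choosing distinct step ids.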

#Response

{
  "request_id": "uuid",
  "steps": [
    { "id": "classify", "policy": "...", "mode": "classify",
      "status": "ok", "outcome": {...}, "latency_ms": 1500 },
    { "id": "score",    "policy": "...", "mode": "score",
      "status": "ok", "outcome": {...}, "latency_ms": 1500 },
    { "id": "decide",   "policy": "...", "mode": "decide",
      "status": "ok", "outcome": {...}, "latency_ms": 1200 }
  ],
  "leaves": {
    "decide": { "action_id": "page_oncall", "payload": {...}, "reason": "..." }
  },
  "halted_at_level":  null,
  "levels_executed":  2,
  "latency_ms_total": 2700,
  "aggregate":        { ... }    // see Score aggregation
}

steps is in declaration order (matches the request's graph array) for easy lookup. leaves maps leaf-node id → outcome for graphs with obvious sink nodes (a final "draft response" step, for example). halted_at_level is 0-indexed; null when the graph completed cleanly.

Notice the latency: classify and score ran in parallel (level 0, ~1.5s), then decide (level 1, ~1.2s). Total = 2.7s vs ~4.2s if it had been a chain.

The aggregate block is documented separately in Score aggregation.
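On the client side, a response of this shape is usually easiest to handle after indexing steps by id. The function below is a minimal sketch using only the fields shown above; `summarize_graph_response` is a hypothetical helper name.

```python
from typing import Any, Dict


def summarize_graph_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Index steps by id and collect the ones that did not finish ok."""
    steps_by_id = {step["id"]: step for step in response["steps"]}
    not_ok = {
        step_id: step["status"]
        for step_id, step in steps_by_id.items()
        if step["status"] != "ok"
    }
    return {
        "steps_by_id": steps_by_id,
        "not_ok": not_ok,
        # halted_at_level is null (None) when the graph completed cleanly.
        "halted": response.get("halted_at_level") is not None,
        "leaves": response.get("leaves", {}),
    }


sample = {
    "steps": [
        {"id": "classify", "status": "ok", "outcome": {"primary_label": "outage"}},
        {"id": "decide", "status": "ok", "outcome": {"action_id": "page_oncall"}},
    ],
    "leaves": {"decide": {"action_id": "page_oncall"}},
    "halted_at_level": None,
}
summary = summarize_graph_response(sample)
print(summary["halted"])                         # False
print(summary["leaves"]["decide"]["action_id"])  # page_oncall
```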

#Example: ITSM incident triage as a graph

The same workflow as the chain example, but with classify and score running in parallel:

curl -X POST https://aiengine.velgent.com/api/v1/policies/graph \
  -H "Authorization: Bearer velgent_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "graph": [
      { "id": "classify", "policy": "itsm/incident-classify", "mode": "classify" },
      { "id": "score",    "policy": "itsm/incident-score",    "mode": "score" },
      { "id": "route",    "policy": "itsm/incident-route",    "mode": "decide",
        "depends_on": ["classify", "score"] },
      { "id": "comms",    "policy": "itsm/customer-comms",    "mode": "generate",
        "depends_on": ["classify", "score", "route"] }
    ],
    "inputs": {
      "short_description": "Payment service returning 500 errors",
      "description":       "~5% of transactions failing since 10:15am",
      "affected_ci":       "payment-svc-prod-01"
    }
  }'

Execution order:

  • Level 0: classify + score (parallel, ~1.5s each → 1.5s)
  • Level 1: route (1.2s)
  • Level 2: comms (2s)
  • Total: ~4.7s vs ~6.2s for the equivalent chain.
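The totals above follow from summing the slowest step of each level. The sketch below reproduces the arithmetic; it ignores network overhead and the concurrency cap, and the function names are illustrative.

```python
from typing import List


def chain_latency(step_seconds: List[float]) -> float:
    """A chain runs steps one after another, so latencies add up."""
    return sum(step_seconds)


def graph_latency(levels: List[List[float]]) -> float:
    """Levels run in sequence; within a level the slowest step sets the pace."""
    return sum(max(level) for level in levels)


levels = [[1.5, 1.5], [1.2], [2.0]]  # classify + score, route, comms
print(round(graph_latency(levels), 1))                                        # 4.7
print(round(chain_latency([s for level in levels for s in level]), 1))        # 6.2
```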

#Save and reuse a graph (graph_slug)

Authoring a graph once in the admin and referencing it by slug keeps integrations stable across graph edits. Operators iterate in the admin; integration code never changes.

# After saving the graph at https://admin.velgent.com/policies/graphs
# with slug "itsm/incident-triage" and publishing v1:

curl -X POST https://aiengine.velgent.com/api/v1/policies/graph \
  -H "Authorization: Bearer velgent_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "graph_slug": "itsm/incident-triage",
    "inputs": {
      "short_description": "Payment service returning 500 errors",
      "description":       "~5% of transactions failing since 10:15am",
      "affected_ci":       "payment-svc-prod-01"
    }
  }'

Same response shape as the inline form. The engine loads the graph's current published version and runs it through the same orchestrator path.

Lifecycle:

  • Admin authors the graph in the designer (/policies/graphs/designer)
  • Click "Save as graph…" → modal asks for slug + name + description
  • First version saved as ready (not published)
  • Admin reviews + clicks Publish on the version → atomically flips current_version_id
  • Integration call with graph_slug resolves to whichever version is currently published

Atomic publish + immutable versions: in-flight evaluations finish on whichever version they started against; new ones use the new version. Roll back by publishing an older version — one click, one second.

#Conditional edges (routing)

Each entry in depends_on can be a plain string (unconditional) or a ConditionalEdge object that gates the edge behind a condition evaluated against the parent's outcome:

{
  "id": "escalate",
  "policy": "ops/escalate",
  "mode": "decide",
  "depends_on": [
    { "step": "risk", "if": "outcome.score >= 0.7" }
  ]
}

If the condition evaluates false, the edge doesn't fire — and if all of a node's incoming edges fail to fire, the node is naturally skipped (status: "skipped", distinct from halt-skipped via the detail field). The graph continues; this is normal branching, not an error.

Common pattern — branching by score:

"graph": [
  { "id": "risk", "policy": "compliance/risk-score", "mode": "score" },

  { "id": "auto_approve", "policy": "ops/auto-approve", "mode": "decide",
    "depends_on": [{ "step": "risk", "if": "outcome.score < 0.3" }] },

  { "id": "escalate", "policy": "ops/escalate", "mode": "decide",
    "depends_on": [{ "step": "risk", "if": "outcome.score >= 0.3" }] }
]

Exactly one of auto_approve / escalate runs depending on the risk score. The other gets status: "skipped" with detail "condition not met on edge from 'risk'".

#Condition source: string or AST

if accepts either form:

  • String: "outcome.score >= 0.7". The engine parses it on receipt. Convenient for raw-API callers and CLI scripts.

  • AST object: the pre-compiled JSON shape the admin UI emits after the condition builder finishes. Same shape in both directions:

    "if": {
      "type": "compare",
      "op":   ">=",
      "left":  { "type": "path", "parts": ["outcome", "score"] },
      "right": { "type": "literal", "value": 0.7 }
    }
    

Both compile to the same internal AST and behave identically. Operators never write the AST by hand — the admin UI's condition builder emits it automatically.

#Condition language

The expression language is intentionally narrow — comparisons, boolean logic, membership, null-checks, path access. No function calls, no arithmetic, no string concat. "Test a thing," never "compute a thing."

| Construct | Example | Notes |
|---|---|---|
| Numeric comparison | `outcome.score >= 0.7` | `==`, `!=`, `>`, `<`, `>=`, `<=` |
| String equality | `outcome.action_id == "escalate"` | Double-quoted strings only |
| Null check | `outcome.action_id == null` | Special-cased: `x == null` is true when x is missing/None |
| Boolean literal | `outcome.passed == false` | `true`, `false` |
| List/string membership | `"security" in outcome.labels` | Python-like; works on lists and strings |
| Boolean AND/OR/NOT | `a && b`, `a \|\| b`, `!a` | `&&` has higher precedence than `\|\|` |
| Parens | `(a \|\| b) && c` | Override default precedence |
| Nested path | `outcome.payload.amount > 100` | `.`-separated; missing path returns null (false) |

Null-safe by construction. A condition that references a field missing from the parent's outcome (a typo, a refactored outcome shape, a parent that ran in a different mode) silently evaluates to false, so the node is skipped rather than the request failing with a 500. A comparison in which either side resolves to null returns false; x == null and x != null are the explicit checks for missing fields.
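The null-safe behaviour can be illustrated with a small evaluator for the `compare`, `path`, and `literal` node types shown in the AST example. This is one reading of the rules above, not the engine's implementation: the function names are hypothetical, a JSON null literal is assumed to arrive as `{"type": "literal", "value": null}`, and treating type mismatches as false is an assumption.

```python
import operator
from typing import Any, Dict

_OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
}


def resolve(node: Dict[str, Any], scope: Dict[str, Any]) -> Any:
    """Resolve a `path` or `literal` node; a missing path yields None."""
    if node["type"] == "literal":
        return node["value"]
    if node["type"] == "path":
        current: Any = scope
        for part in node["parts"]:
            if not isinstance(current, dict) or part not in current:
                return None
            current = current[part]
        return current
    raise ValueError(f"unsupported node type: {node['type']!r}")


def _is_null_literal(node: Dict[str, Any]) -> bool:
    return node["type"] == "literal" and node["value"] is None


def evaluate_compare(node: Dict[str, Any], scope: Dict[str, Any]) -> bool:
    """Evaluate a `compare` node without ever raising on missing data."""
    if node["type"] != "compare" or node["op"] not in _OPS:
        raise ValueError("expected a compare node with a known operator")
    op = node["op"]
    left, right = resolve(node["left"], scope), resolve(node["right"], scope)

    # Explicit null checks: x == null / x != null.
    if op in ("==", "!=") and (_is_null_literal(node["left"]) or _is_null_literal(node["right"])):
        both_null = left is None and right is None
        return both_null if op == "==" else not both_null
    # Any other comparison touching null is false.
    if left is None or right is None:
        return False
    try:
        return bool(_OPS[op](left, right))
    except TypeError:  # e.g. string vs number; treated as false (assumption)
        return False


condition = {
    "type": "compare", "op": ">=",
    "left": {"type": "path", "parts": ["outcome", "score"]},
    "right": {"type": "literal", "value": 0.7},
}
print(evaluate_compare(condition, {"outcome": {"score": 0.8}}))  # True
print(evaluate_compare(condition, {"outcome": {}}))              # False
```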

#Multi-parent AND semantics

When a node has multiple incoming edges, all of them must fire (AND). Multi-parent OR is deferred — operators expressing OR today restructure the graph or put the OR inside a single condition that references multiple parents' outcomes via the shared inputs bag.

#Halt vs conditional skip

A node can end up with status: "skipped" for any of four reasons, distinguished by the detail field:

| Reason | When | Detail format |
|---|---|---|
| Halt-skipped | halt_on='error' triggered after a parent failed | `"skipped: halt_on='error' triggered at level N"` |
| Condition not met | The parent succeeded but the edge condition was false | `"skipped: condition not met on edge from 'parent_id'"` |
| Cascading | An unconditional parent was itself skipped | `"skipped: cascading from 'parent_id' (which was skipped)"` |
| Condition unevaluable | A conditional edge had a non-ok parent (no outcome to test) | `"skipped: condition on edge from 'parent_id' cannot be evaluated; parent status='not_found'"` |
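A client that needs to tell these cases apart has only the detail string to go on. The sketch below maps the documented formats to coarse labels; `classify_skip` and the label names are hypothetical, and matching on free text is brittle, so treat an "unknown" result as a normal possibility.

```python
_PREFIX = "skipped: "


def classify_skip(detail: str) -> str:
    """Map a skipped step's detail string to a coarse reason label."""
    # Tolerate details with or without the leading "skipped: " prefix.
    text = detail[len(_PREFIX):] if detail.startswith(_PREFIX) else detail
    if text.startswith("halt_on="):
        return "halt"
    if text.startswith("condition not met"):
        return "condition_not_met"
    if text.startswith("cascading from"):
        return "cascading"
    if text.startswith("condition on edge") and "cannot be evaluated" in text:
        return "condition_unevaluable"
    return "unknown"


print(classify_skip("skipped: condition not met on edge from 'risk'"))
# condition_not_met
```

Only `condition_not_met` represents ordinary branching; the other labels trace back to a failure or skip upstream.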

#Validation

Pre-execution checks reject the request with a structured message. The status is 422 for every check except an unknown or unpublished graph_slug, which returns 404:

| When | Status | Detail |
|---|---|---|
| Two nodes share the same id | 422 | `duplicate step ids in graph: [...]` |
| A depends_on ref doesn't exist | 422 | `step 'x' depends_on unknown step 'y'; declared ids are [...]` |
| The graph contains a cycle | 422 | `cycle detected in graph; unresolvable nodes: [...]` |
| A node depends on itself | 422 | `step 'x' depends on itself` |
| A condition source has a syntax error | 422 | `step 'x': invalid condition source on edge from 'y': ...` |
| A condition AST is malformed | 422 | `step 'x': invalid AST on edge from 'y': ...` |
| Both graph and graph_slug set | 422 | `provide EITHER 'graph' (inline) OR 'graph_slug' (stored), not both` |
| Neither graph nor graph_slug set | 422 | `provide 'graph' (inline) or 'graph_slug' (stored graph reference)` |
| graph_slug doesn't exist or is unpublished | 404 | `graph 'x' not found or has no published version` |

These run before any LLM call — misconfigured graphs reject cheaply, including bad conditions.
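The structural checks can also be run client-side before a request is sent. The function below is a sketch that covers duplicate ids, unknown and self references, cycles, and the 1–20 node limit; `preflight` is a hypothetical name, the messages only approximate the engine's wording, and condition syntax is not checked.

```python
from typing import Any, Dict, List, Set


def _parent_id(edge: Any) -> str:
    """An edge is a plain id string or a {"step": ..., "if": ...} object."""
    return edge if isinstance(edge, str) else edge["step"]


def preflight(graph: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the structure looks valid."""
    if not 1 <= len(graph) <= 20:
        return ["graph must contain 1-20 nodes"]  # own wording, not the engine's

    errors: List[str] = []
    ids = [node["id"] for node in graph]
    dupes = sorted({i for i in ids if ids.count(i) > 1})
    if dupes:
        errors.append(f"duplicate step ids in graph: {dupes}")

    declared = set(ids)
    parents: Dict[str, List[str]] = {}
    for node in graph:
        deps = [_parent_id(edge) for edge in node.get("depends_on", [])]
        parents[node["id"]] = deps
        for dep in deps:
            if dep == node["id"]:
                errors.append(f"step {dep!r} depends on itself")
            elif dep not in declared:
                errors.append(f"step {node['id']!r} depends_on unknown step {dep!r}")
    if errors:
        return errors

    # Cycle check: repeatedly resolve nodes whose parents are all resolved.
    resolved: Set[str] = set()
    pending = set(parents)
    while pending:
        ready = {n for n in pending if all(p in resolved for p in parents[n])}
        if not ready:
            errors.append(f"cycle detected in graph; unresolvable nodes: {sorted(pending)}")
            break
        resolved |= ready
        pending -= ready
    return errors


print(preflight([
    {"id": "a", "policy": "p", "mode": "score", "depends_on": ["b"]},
    {"id": "b", "policy": "p", "mode": "score", "depends_on": ["a"]},
]))
# ["cycle detected in graph; unresolvable nodes: ['a', 'b']"]
```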

#Concurrency cap

The engine caps parallel LLM calls at 8 per graph evaluation. A wide level (12 independent steps) runs as roughly two waves rather than firing all at once: 8 calls start immediately and the remaining 4 start as slots free up. No requests are rejected; the excess calls wait on an internal semaphore.
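The pattern described here, `asyncio.gather` over a level with a semaphore limiting how many calls are in flight, looks roughly like the sketch below. It uses fake steps in place of LLM calls; `run_level`, `demo`, and `MAX_CONCURRENT` are illustrative names, not the engine's.

```python
import asyncio
from typing import Awaitable, Callable, List

MAX_CONCURRENT = 8


async def run_level(
    steps: List[Callable[[], Awaitable[str]]],
    semaphore: asyncio.Semaphore,
) -> List[str]:
    """Run every step of a level concurrently, bounded by the shared semaphore."""

    async def guarded(step: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:  # waits here while all slots are taken
            return await step()

    return list(await asyncio.gather(*(guarded(step) for step in steps)))


async def demo(n: int = 12) -> int:
    """Run n fake steps and return the peak number in flight at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)  # one per graph evaluation
    running = 0
    peak = 0

    async def fake_step() -> str:
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stands in for an LLM call
        running -= 1
        return "ok"

    results = await run_level([fake_step] * n, semaphore)
    assert len(results) == n  # nothing is rejected, only delayed
    return peak


print(asyncio.run(demo(12)))  # 8
```

Passing the same semaphore to every level keeps the cap scoped to the whole evaluation rather than to a single level.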


Next: Score aggregation, the weighted aggregate block returned on every multi-step response.