#Data Extractor — examples

End-to-end requests and example responses for each input mode. For the request body shape and response field reference, see the Data Extractor overview.

#Text input + inline schema

The simplest path. Useful for testing or when the source is already text (parsed email, scraped page).
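A minimal sketch of a request that could produce a response like the one below, written in Python so the payload can be built and inspected without being sent. The `text` and `extraction_schema.fields` keys follow the request shape used elsewhere on this page. The sample text, the field descriptions, the choice of `"string"` for `currency` and `invoice_date`, and the helper name `build_text_request` are assumptions for illustration only.

```python
import json
import os
import urllib.request

EXTRACT_URL = "https://aiengine.velgent.com/api/v1/extract"


def build_text_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a text-input extract request with an inline schema."""
    payload = {
        "text": text,
        "extraction_schema": {
            "fields": {
                "invoice_number": {"type": "string", "description": "Invoice ID"},
                "vendor": {"type": "string", "description": "Vendor name"},
                "total": {"type": "number", "description": "Grand total"},
                # Assumption: plain strings; the overview may define richer types.
                "currency": {"type": "string", "description": "Currency code"},
                "invoice_date": {"type": "string", "description": "Invoice date"},
            }
        },
    }
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_text_request(
    "Invoice INV-2026-001 from Acme Corp dated 2026-05-28. Total: $1,250.00 USD",
    os.environ.get("VELGENT_API_KEY", "test-key"),  # placeholder key when unset
)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```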

#Example response

{
  "request_id": "01HZQ7K8YV9X3R2M5N6P7QABCD",
  "extracted": {
    "invoice_number": "INV-2026-001",
    "vendor":         "Acme Corp",
    "total":          1250.00,
    "currency":       "USD",
    "invoice_date":   "2026-05-28"
  },
  "confidence": {
    "invoice_number": 0.98,
    "vendor":         0.95,
    "total":          0.95,
    "currency":       0.99,
    "invoice_date":   0.97
  },
  "anchors": {
    "invoice_number": { "text": "Invoice INV-2026-001",                       "page": null },
    "vendor":         { "text": "from Acme Corp",                             "page": null },
    "total":          { "text": "Total: $1,250.00",                           "page": null },
    "currency":       { "text": "Total: $1,250.00 USD",                       "page": null },
    "invoice_date":   { "text": "dated 2026-05-28",                           "page": null }
  },
  "components": {
    "invoice_number": { "llm_self_conf": 0.99, "anchor_conf": 1.0, "source_quality": 1.0, "schema_validates": true, "n_samples": 1, "agreement_rate": null },
    "total":          { "llm_self_conf": 0.95, "anchor_conf": 1.0, "source_quality": 1.0, "schema_validates": true, "n_samples": 1, "agreement_rate": null }
  },
  "metadata": {
    "input_kind":      "text",
    "pages_processed": 0,
    "ocr_used":        false,
    "model_used":      "claude-sonnet-4-6",
    "tokens_in":       342,
    "tokens_out":      287,
    "latency_ms":      1894,
    "redaction_count": 0,
    "n_samples":       1
  },
  "schema_drift": [],
  "warnings":     []
}
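
One way a caller might use the `confidence` map is to list the fields that fall under a review threshold. The sketch below assumes the response has already been parsed into a dict; the helper name `fields_below` and the 0.96 threshold are illustrative, not part of the API.

```python
from typing import Any, Dict, List


def fields_below(response: Dict[str, Any], threshold: float) -> List[str]:
    """Return the names of extracted fields whose confidence is under threshold."""
    confidence = response.get("confidence", {})
    return sorted(name for name, score in confidence.items() if score < threshold)


# Excerpt of the example response above.
response = {
    "extracted": {"invoice_number": "INV-2026-001", "vendor": "Acme Corp", "total": 1250.00},
    "confidence": {"invoice_number": 0.98, "vendor": 0.95, "total": 0.95},
}

needs_review = fields_below(response, threshold=0.96)
print(needs_review)  # ['total', 'vendor']
```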

#Image input — vendor invoice

Vision LLM path. Send the image as base64; the engine reads it directly, with no OCR-first pass, which preserves the spatial information that vision models use for tables and forms.

curl -X POST https://aiengine.velgent.com/api/v1/extract \
  -H "Authorization: Bearer $VELGENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "kind":      "image",
      "mime_type": "image/png",
      "data":      "iVBORw0KGgoAAAANSUhEUg…"
    },
    "extraction_schema": {
      "fields": {
        "invoice_number": { "type": "string", "description": "Invoice ID" },
        "total":          { "type": "number", "description": "Grand total" },
        "line_items": {
          "type": "array_of_object",
          "description": "Each row of the invoice",
          "item_schema": {
            "type": "object",
            "properties": {
              "description": { "type": "string", "description": "Item description" },
              "quantity":    { "type": "number", "description": "Quantity" },
              "unit_price":  { "type": "number", "description": "Unit price" }
            }
          }
        }
      }
    }
  }'

For image inputs, anchors.<field>.page is null: the engine returns one page-less anchor per extracted leaf.
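
Building the `file` object from raw bytes might look like the sketch below. It assumes the API expects standard (not URL-safe) base64 with no `data:` URI prefix, which matches the truncated samples on this page but is not stated outright; `file_part` is a hypothetical helper name.

```python
import base64
from typing import Dict


def file_part(kind: str, mime_type: str, raw: bytes) -> Dict[str, str]:
    """Wrap raw file bytes in the `file` object used by the extract request."""
    return {
        "kind": kind,
        "mime_type": mime_type,
        "data": base64.b64encode(raw).decode("ascii"),
    }


# In practice: raw = pathlib.Path("invoice.png").read_bytes()  (hypothetical path)
# Only the 8-byte PNG signature is used here to keep the sketch self-contained.
png_signature = b"\x89PNG\r\n\x1a\n"
part = file_part("image", "image/png", png_signature)
print(part["data"])  # iVBORw0KGgo=
```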

#PDF input — multi-page

The engine tries the text layer first (pdfminer). If the PDF is scanned or has no text layer, it rasterises each page and uses the vision LLM. You don't pick the mode; the response's metadata.input_kind tells you which path ran (pdf_text_layer or pdf_rasterised).
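
Because the path is chosen server-side, a caller that wants to log or monitor it has to read it back from the response. A small sketch, assuming the response is already parsed into a dict; `pdf_path` and its return labels are hypothetical.

```python
from typing import Any, Dict


def pdf_path(response: Dict[str, Any]) -> str:
    """Report which PDF path ran, based on metadata.input_kind."""
    kind = response["metadata"]["input_kind"]
    if kind == "pdf_text_layer":
        return "text layer"
    if kind == "pdf_rasterised":
        return "vision (rasterised pages)"
    # Any other value (for example "text") means this was not a PDF extraction.
    raise ValueError(f"not a PDF extraction: input_kind={kind!r}")


print(pdf_path({"metadata": {"input_kind": "pdf_rasterised", "pages_processed": 1}}))
```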

curl -X POST https://aiengine.velgent.com/api/v1/extract \
  -H "Authorization: Bearer $VELGENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "kind":      "pdf",
      "mime_type": "application/pdf",
      "data":      "JVBERi0xLjQK…"
    },
    "options":  { "page_range": "1-3" },
    "extraction_schema": { "fields": { … } }
  }'

Per doc 14 §"PDF — two-mode", the hard cap is 10 pages per call. For larger documents, pass page_range and split the document across multiple calls. A 413 response includes this hint.
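
Slicing a long document into calls that respect the cap can be sketched as follows. This assumes the caller already knows the page count (for example from a PDF library), that page_range is 1-based and inclusive as the "1-3" example suggests, and that a single page can be written as "21-21"; none of these is confirmed here.

```python
from typing import List

PAGE_CAP = 10  # hard cap per call, as stated above


def page_ranges(total_pages: int, cap: int = PAGE_CAP) -> List[str]:
    """Split pages 1..total_pages into consecutive page_range strings of at most cap pages."""
    if total_pages < 0 or cap < 1:
        raise ValueError("total_pages must be >= 0 and cap must be >= 1")
    ranges = []
    for start in range(1, total_pages + 1, cap):
        end = min(start + cap - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


print(page_ranges(23))  # ['1-10', '11-20', '21-23']
```

Each string would go into `options.page_range` of a separate request, and the caller merges the per-call results.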

#Template path

Once an admin has published a template at admin.velgent.com → Extract templates, callers reference it by slug. The schema lives server-side; the caller just sends the content.

curl -X POST https://aiengine.velgent.com/api/v1/extract \
  -H "Authorization: Bearer $VELGENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "kind":      "pdf",
      "mime_type": "application/pdf",
      "data":      "JVBERi0xLjQK…"
    },
    "template": "invoice_v1"
  }'

Pin a historical version for replay / canary / rollback:

{ "template": "invoice_v1", "template_version": 3, ... }
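
A caller that switches between the latest and a pinned version might build the body like this. The sketch assumes that omitting `template_version` selects the currently published version, which the text above implies but does not state; `template_request` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def template_request(
    file: Dict[str, str],
    template: str,
    template_version: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a template-path request body, pinning a version only when one is given."""
    body: Dict[str, Any] = {"file": file, "template": template}
    if template_version is not None:
        body["template_version"] = template_version
    return body


pdf = {"kind": "pdf", "mime_type": "application/pdf", "data": "JVBERi0xLjQK"}
latest = template_request(pdf, "invoice_v1")
pinned = template_request(pdf, "invoice_v1", template_version=3)
```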

#Multi-sample for higher confidence

Mission-critical extraction can opt into multi-sample mode — the engine runs the extraction n_samples times and folds the agreement rate into the per-field confidence. Each sample is one billable evaluation, so a 3-sample run charges 3 evaluations.

curl -X POST https://aiengine.velgent.com/api/v1/extract \
  -H "Authorization: Bearer $VELGENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "…",
    "extraction_schema": { … },
    "options": { "n_samples": 3 }
  }'

When n_samples > 1, components.<field>.agreement_rate is populated and contributes to the final score:

"components": {
  "total": {
    "llm_self_conf":   0.95,
    "anchor_conf":     1.0,
    "source_quality":  0.95,
    "schema_validates": true,
    "n_samples":       3,
    "agreement_rate":  0.67    // 2 of 3 samples agreed → final = min(…, 0.67)
  }
}
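
The arithmetic behind the 0.67 can be sketched as below. It assumes the agreement rate is the share of samples that returned the most common value, which is consistent with "2 of 3 samples agreed" but is not a description of the engine's actual implementation. The sample values and the 0.95 pre-agreement score are hypothetical; the other terms inside the min(…) are left out because this page does not list them.

```python
from collections import Counter
from typing import Any, Sequence


def agreement_rate(samples: Sequence[Any]) -> float:
    """Share of samples that returned the most common value."""
    if not samples:
        raise ValueError("at least one sample is required")
    _, count = Counter(samples).most_common(1)[0]
    return count / len(samples)


def cap_by_agreement(score_before_agreement: float, rate: float) -> float:
    """Apply the agreement rate as an upper bound on the field's score."""
    return min(score_before_agreement, rate)


samples = [1250.00, 1250.00, 1200.00]  # hypothetical per-sample values for `total`
rate = agreement_rate(samples)
print(round(rate, 2))                          # 0.67
print(round(cap_by_agreement(0.95, rate), 2))  # 0.67
billable_evaluations = len(samples)            # 3, one per sample
```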

#HTML output modes

Same endpoint, three shapes. Pick via output_format.

#`html_fields` — JSON plus a ready-to-embed HTML render

Add "output_format": "html_fields" to any schema-driven extract call. The response includes the usual extracted / confidence / anchors and an extracted_html string with a deterministic <dl> / <table> render of the extracted fields — useful for ticket comments, customer-facing emails, or audit notes that need to be human-readable without a frontend render step.

POST /api/v1/extract
{
  "text": "Invoice INV-2026-001 dated 2026-05-28, total $1,250.00 USD",
  "extraction_schema": { "fields": { "invoice_number": { "type": "string" }, "total": { "type": "number" } } },
  "output_format": "html_fields"
}

{
  "extracted":      { "invoice_number": "INV-2026-001", "total": 1250.00 },
  "confidence":     { "invoice_number": 0.98, "total": 0.95 },
  "extracted_html": "<section class=\"velgent-extract velgent-extract-fields\">\n  <dl class=\"velgent-extract-dl\">\n    <dt class=\"velgent-extract-key\">Invoice number</dt>\n    <dd class=\"velgent-extract-value\">INV-2026-001</dd>\n    <dt class=\"velgent-extract-key\">Total</dt>\n    <dd class=\"velgent-extract-value\">1250.0</dd>\n  </dl>\n</section>\n"
}

The HTML is CSS-less — style it via your own stylesheet keyed off the velgent-extract-* class names.

#`html_document` — whole-document semantic HTML

No schema. Image or PDF only. Velgent re-flows the source into semantic HTML (<h1>–<h6>, <p>, <ul>, <table>, …). Useful for making scanned documents searchable, or for re-displaying them in modern interfaces.

POST /api/v1/extract
{
  "file": { "kind": "pdf", "mime_type": "application/pdf", "data": "<base64>" },
  "output_format": "html_document"
}

{
  "extracted":      {},
  "confidence":     {},
  "anchors":        {},
  "extracted_html": "<section class=\"velgent-extract velgent-extract-document\">\n<h1>Vendor Invoice</h1>\n<section data-page=\"1\">\n  <p><strong>Bill to:</strong> Acme Ltd, Sydney</p>\n  <table>…</table>\n</section>\n</section>\n",
  "metadata":       { "input_kind": "pdf_rasterised", "pages_processed": 1, ... }
}

The HTML is sanitised server-side (no <script> / <style> / event handlers / javascript: URLs), but you should still run DOMPurify (or an equivalent) before injecting it into a privileged context.
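
For callers that store or forward the HTML from a Python backend, a defensive check along these lines can flag markup that should never have survived sanitisation. It is a detector built on the standard library's `html.parser`, not a sanitiser and not a substitute for DOMPurify; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List

FORBIDDEN_TAGS = {"script", "style"}


class UnsafeMarkupFinder(HTMLParser):
    """Collect descriptions of constructs the sanitised HTML should not contain."""

    def __init__(self) -> None:
        super().__init__()
        self.problems: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in FORBIDDEN_TAGS:
            self.problems.append(f"<{tag}> element")
        for name, value in attrs:
            if name.startswith("on"):
                self.problems.append(f"event handler attribute {name}")
            if value and value.strip().lower().startswith("javascript:"):
                self.problems.append(f"javascript: URL in {name}")


def find_unsafe_markup(html: str) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    finder = UnsafeMarkupFinder()
    finder.feed(html)
    finder.close()
    return finder.problems


clean = '<section class="velgent-extract"><dl><dt>Total</dt><dd>1250.0</dd></dl></section>'
dirty = '<a href="javascript:alert(1)" onclick="x()">go</a><script>1</script>'
print(find_unsafe_markup(clean))  # []
print(len(find_unsafe_markup(dirty)))  # 3
```

A non-empty result is a reason to reject and log the payload rather than to try repairing it.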

#Composing with Policy Engine

The natural downstream is the Policy Engine. Extract gives you typed JSON with per-field confidence; Policy Engine validate mode encodes the routing logic ("auto-approve if every critical field has confidence > 0.9 AND total matches PO; else route to human review"). One product extracts, one product decides.

A typical AP automation flow:

1. Email lands with a PDF invoice attachment.
2. POST /api/v1/extract with template: "invoice_v1" → structured invoice JSON + per-field confidence.
3. POST /api/v1/policies/evaluate with the extracted JSON as inputs → auto_approve or route_to_review.
4. Caller fires the chosen action.
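
The flow above can be sketched with the HTTP transport injected as a callable, so the sequence is visible without a live API. The two paths come from the steps above. The policy request body (`inputs` with `extracted` and `confidence`) and the `action` key in the policy response are assumptions; check the Policy Engine documentation for the real shapes.

```python
from typing import Any, Callable, Dict, List

Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def process_invoice(pdf_b64: str, post: Post) -> str:
    """Extract an invoice via the template path, then ask the Policy Engine what to do."""
    extraction = post("/api/v1/extract", {
        "file": {"kind": "pdf", "mime_type": "application/pdf", "data": pdf_b64},
        "template": "invoice_v1",
    })
    decision = post("/api/v1/policies/evaluate", {
        # ASSUMPTION: key names in this body are illustrative.
        "inputs": {
            "extracted": extraction["extracted"],
            "confidence": extraction["confidence"],
        },
    })
    return decision["action"]  # ASSUMPTION: "auto_approve" or "route_to_review"


# Stub transport standing in for real HTTP calls.
calls: List[str] = []


def fake_post(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
    calls.append(path)
    if path == "/api/v1/extract":
        return {"extracted": {"total": 1250.0}, "confidence": {"total": 0.95}}
    return {"action": "route_to_review"}


action = process_invoice("JVBERi0xLjQK", fake_post)
print(action)  # route_to_review
```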
