#Data Extractor — examples
End-to-end requests and example responses for each input mode. For the request body shape and response field reference, see the Data Extractor overview.
#Text input + inline schema
The simplest path. Useful for testing or when the source is already text (parsed email, scraped page).
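A minimal sketch of a request for this mode, in Python with only the standard library. The endpoint, bearer header, and the text / extraction_schema.fields keys are taken from the other examples on this page. The helper names, the field descriptions, and the use of "string" for currency and invoice_date are assumptions.

```python
import json
import urllib.request

EXTRACT_URL = "https://aiengine.velgent.com/api/v1/extract"  # endpoint used throughout this page


def build_text_request(text: str) -> dict:
    """Build a text + inline-schema request body (field types are assumptions)."""
    return {
        "text": text,
        "extraction_schema": {
            "fields": {
                "invoice_number": {"type": "string", "description": "Invoice ID"},
                "vendor": {"type": "string", "description": "Vendor name"},
                "total": {"type": "number", "description": "Grand total"},
                "currency": {"type": "string", "description": "Currency code"},
                "invoice_date": {"type": "string", "description": "Invoice date"},
            }
        },
    }


def post_extract(body: dict, api_key: str) -> dict:
    """POST the body to the extract endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling post_extract(build_text_request(source_text), api_key) with the key from VELGENT_API_KEY would be expected to return a body shaped like the example response.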
#Example response
{
"request_id": "01HZQ7K8YV9X3R2M5N6P7QABCD",
"extracted": {
"invoice_number": "INV-2026-001",
"vendor": "Acme Corp",
"total": 1250.00,
"currency": "USD",
"invoice_date": "2026-05-28"
},
"confidence": {
"invoice_number": 0.98,
"vendor": 0.95,
"total": 0.95,
"currency": 0.99,
"invoice_date": 0.97
},
"anchors": {
"invoice_number": { "text": "Invoice INV-2026-001", "page": null },
"vendor": { "text": "from Acme Corp", "page": null },
"total": { "text": "Total: $1,250.00", "page": null },
"currency": { "text": "Total: $1,250.00 USD", "page": null },
"invoice_date": { "text": "dated 2026-05-28", "page": null }
},
"components": {
"invoice_number": { "llm_self_conf": 0.99, "anchor_conf": 1.0, "source_quality": 1.0, "schema_validates": true, "n_samples": 1, "agreement_rate": null },
"total": { "llm_self_conf": 0.95, "anchor_conf": 1.0, "source_quality": 1.0, "schema_validates": true, "n_samples": 1, "agreement_rate": null }
},
"metadata": {
"input_kind": "text",
"pages_processed": 0,
"ocr_used": false,
"model_used": "claude-sonnet-4-6",
"tokens_in": 342,
"tokens_out": 287,
"latency_ms": 1894,
"redaction_count": 0,
"n_samples": 1
},
"schema_drift": [],
"warnings": []
}
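One way a caller might consume the per-field confidence map is to list the fields that fall below a review threshold. The sketch below uses the confidence values from the response above; the helper name and the 0.96 threshold are illustrative assumptions, not part of the API.

```python
def fields_below(response: dict, threshold: float) -> list[str]:
    """Return field names whose confidence is below the threshold, lowest first."""
    confidence = response.get("confidence", {})
    low = [(score, name) for name, score in confidence.items() if score < threshold]
    return [name for score, name in sorted(low)]


response = {
    "confidence": {
        "invoice_number": 0.98,
        "vendor": 0.95,
        "total": 0.95,
        "currency": 0.99,
        "invoice_date": 0.97,
    },
    "schema_drift": [],
    "warnings": [],
}

print(fields_below(response, 0.96))  # ['total', 'vendor']
```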
#Image input — vendor invoice
Vision LLM path. Send the image as base64; the engine reads it directly, with no OCR-first step, which preserves the spatial information that vision models use for tables and forms.
curl -X POST https://aiengine.velgent.com/api/v1/extract \
-H "Authorization: Bearer $VELGENT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"file": {
"kind": "image",
"mime_type": "image/png",
"data": "iVBORw0KGgoAAAANSUhEUg…"
},
"extraction_schema": {
"fields": {
"invoice_number": { "type": "string", "description": "Invoice ID" },
"total": { "type": "number", "description": "Grand total" },
"line_items": {
"type": "array_of_object",
"description": "Each row of the invoice",
"item_schema": {
"type": "object",
"properties": {
"description": { "type": "string", "description": "Item description" },
"quantity": { "type": "number", "description": "Quantity" },
"unit_price": { "type": "number", "description": "Unit price" }
}
}
}
}
}
}'
For image inputs, anchors.<field>.page is null: the engine returns one
page-less anchor per extracted leaf.
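Producing the file object from raw image bytes could look like the sketch below. The kind / mime_type / data keys mirror the request above; the helper name is an assumption, and the 8-byte PNG signature stands in for a real file's contents.

```python
import base64


def image_file_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes in the `file` object shape used in the request above."""
    return {
        "kind": "image",
        "mime_type": mime_type,
        "data": base64.b64encode(image_bytes).decode("ascii"),
    }


# The PNG signature alone, used here instead of reading a real image from disk.
part = image_file_part(b"\x89PNG\r\n\x1a\n")
print(part["data"])  # iVBORw0KGgo=
```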
#PDF input — multi-page
The engine tries the text-layer first (pdfminer); if the PDF is
scanned / has no text layer, it rasterises each page and uses the
vision LLM. You don't pick the mode — the response's
metadata.input_kind tells you which path ran (pdf_text_layer
vs pdf_rasterised).
curl -X POST https://aiengine.velgent.com/api/v1/extract \
-H "Authorization: Bearer $VELGENT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"file": {
"kind": "pdf",
"mime_type": "application/pdf",
"data": "JVBERi0xLjQK…"
},
"options": { "page_range": "1-3" },
"extraction_schema": { "fields": { … } }
}'
Per doc 14 §"PDF — two-mode", the hard cap is 10 pages per call. For
larger documents, pass page_range to split the work across multiple calls.
A request over the cap returns a 413 whose body includes this hint.
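Splitting a longer document into calls that respect the 10-page cap might be sketched as follows. The "start-end" string format follows the page_range value shown above. The helper name is an assumption, as is the single-page form "21-21", and the caller is assumed to know the total page count already.

```python
def page_ranges(total_pages: int, cap: int = 10) -> list[str]:
    """Split pages 1..total_pages into "start-end" ranges of at most `cap` pages."""
    if total_pages < 1 or cap < 1:
        raise ValueError("total_pages and cap must be positive")
    ranges = []
    for start in range(1, total_pages + 1, cap):
        end = min(start + cap - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


print(page_ranges(23))  # ['1-10', '11-20', '21-23']
```

Each returned string would go into options.page_range of a separate request carrying the same PDF.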
#Template path
Once an admin has published a template at admin.velgent.com → Extract templates, callers reference it by slug. The schema lives server-side; the caller just sends the content.
curl -X POST https://aiengine.velgent.com/api/v1/extract \
-H "Authorization: Bearer $VELGENT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"file": {
"kind": "pdf",
"mime_type": "application/pdf",
"data": "JVBERi0xLjQK…"
},
"template": "invoice_v1"
}'
Pin a historical version for replay / canary / rollback:
{ "template": "invoice_v1", "template_version": 3, ... }
#Multi-sample for higher confidence
Mission-critical extraction can opt into multi-sample mode — the
engine runs the extraction n_samples times and folds the
agreement rate into the per-field confidence. Each sample is one
billable evaluation, so a 3-sample run charges 3 evaluations.
curl -X POST https://aiengine.velgent.com/api/v1/extract \
-H "Authorization: Bearer $VELGENT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"text": "…",
"extraction_schema": { … },
"options": { "n_samples": 3 }
}'
When n_samples > 1, components.<field>.agreement_rate is
populated and contributes to the final score:
"components": {
"total": {
"llm_self_conf": 0.95,
"anchor_conf": 1.0,
"source_quality": 0.95,
"schema_validates": true,
"n_samples": 3,
"agreement_rate": 0.67 // 2 of 3 samples agreed → final = min(…, 0.67)
}
}
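To make the 0.67 figure concrete, the sketch below computes an agreement rate as the share of samples that match the most common value. This is an illustration only: how the engine actually compares or normalises sample values is not described on this page, and the sample values are hypothetical.

```python
from collections import Counter


def agreement_rate(sample_values: list) -> float:
    """Fraction of samples that match the most common value."""
    if not sample_values:
        raise ValueError("need at least one sample")
    _, top_count = Counter(sample_values).most_common(1)[0]
    return top_count / len(sample_values)


samples = [1250.0, 1250.0, 1205.0]  # hypothetical `total` values from 3 samples
print(round(agreement_rate(samples), 2))  # 0.67
print(len(samples))  # 3, the number of billable evaluations for this call
```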
#HTML output modes
Same endpoint, three shapes. Pick via output_format.
#html_fields — JSON plus a ready-to-embed HTML render
Add "output_format": "html_fields" to any schema-driven extract
call. The response includes the usual extracted / confidence /
anchors and an extracted_html string with a deterministic
<dl> / <table> render of the extracted fields — useful for ticket
comments, customer-facing emails, or audit notes that need to be
human-readable without a frontend render step.
POST /api/v1/extract
{
"text": "Invoice INV-2026-001 dated 2026-05-28, total $1,250.00 USD",
"extraction_schema": { "fields": { "invoice_number": { "type": "string" }, "total": { "type": "number" } } },
"output_format": "html_fields"
}
Example response:
{
"extracted": { "invoice_number": "INV-2026-001", "total": 1250.00 },
"confidence": { "invoice_number": 0.98, "total": 0.95 },
"extracted_html": "<section class=\"velgent-extract velgent-extract-fields\">\n <dl class=\"velgent-extract-dl\">\n <dt class=\"velgent-extract-key\">Invoice number</dt>\n <dd class=\"velgent-extract-value\">INV-2026-001</dd>\n <dt class=\"velgent-extract-key\">Total</dt>\n <dd class=\"velgent-extract-value\">1250.0</dd>\n </dl>\n</section>\n"
}
The HTML is CSS-less — style it via your own stylesheet keyed off the
velgent-extract-* class names.
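A sketch of pairing the fragment with caller-owned CSS, for example when writing an audit note to a standalone HTML file. The class names come from the response above; the CSS rules, the helper name, and the page wrapper are assumptions.

```python
import html

# Caller-owned styles keyed off the class names in the returned fragment.
STYLESHEET = """
.velgent-extract-dl { display: grid; grid-template-columns: max-content 1fr; gap: 4px 16px; }
.velgent-extract-key { font-weight: 600; }
.velgent-extract-value { margin: 0; }
"""


def render_page(extracted_html: str, title: str = "Extraction result") -> str:
    """Embed the returned fragment in a minimal HTML page with caller-owned CSS."""
    return (
        "<!doctype html>\n"
        f"<html><head><meta charset=\"utf-8\"><title>{html.escape(title)}</title>\n"
        f"<style>{STYLESHEET}</style></head>\n"
        f"<body>\n{extracted_html}</body></html>\n"
    )
```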
#html_document — whole-document semantic HTML
No schema. Image or PDF only. Velgent re-flows the source into
semantic HTML (<h1>–<h6>, <p>, <ul>, <table>, …). Useful for
making scanned documents searchable or re-displayable in modern
surfaces.
POST /api/v1/extract
{
"file": { "kind": "pdf", "mime_type": "application/pdf", "data": "<base64>" },
"output_format": "html_document"
}
Example response:
{
"extracted": {},
"confidence": {},
"anchors": {},
"extracted_html": "<section class=\"velgent-extract velgent-extract-document\">\n<h1>Vendor Invoice</h1>\n<section data-page=\"1\">\n <p><strong>Bill to:</strong> Acme Ltd, Sydney</p>\n <table>…</table>\n</section>\n</section>\n",
"metadata": { "input_kind": "pdf_rasterised", "pages_processed": 1, ... }
}
The HTML is sanitised server-side (no <script> / <style> / event handlers /
javascript: URLs), but still run DOMPurify (or an equivalent) before
injecting it into a privileged context.
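For server-side callers, a lightweight check that the returned HTML contains none of the constructs listed above could look like this sketch. It uses only the standard library and is a detection aid for tests or logging, not a sanitiser and not a substitute for DOMPurify or an equivalent. The class and function names are assumptions.

```python
from html.parser import HTMLParser


class UnsafeMarkupFinder(HTMLParser):
    """Collect markup the server is documented to strip; a check, not a sanitiser."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("script", "style"):
            self.findings.append(f"<{tag}> element")
        for name, value in attrs:
            if name.startswith("on"):
                self.findings.append(f"event handler {name} on <{tag}>")
            elif (value or "").strip().lower().startswith("javascript:"):
                self.findings.append(f"javascript: URL in {name} on <{tag}>")


def find_unsafe_markup(html_text: str) -> list[str]:
    """Return a description of each suspicious construct found in html_text."""
    finder = UnsafeMarkupFinder()
    finder.feed(html_text)
    finder.close()
    return finder.findings
```

An empty list means the check found nothing; it does not prove the HTML is safe for every context.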
#Composing with Policy Engine
The natural downstream is the Policy Engine.
Extract gives you typed JSON with per-field confidence; Policy
Engine validate mode encodes the routing logic
("auto-approve if every critical field has confidence > 0.9 AND
total matches PO; else route to human review"). One product
extracts, one product decides.
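The quoted rule can be written out in plain Python to show what the policy decides. This is only an illustration of the logic: in practice the Policy Engine encodes it, and the function name, the critical_fields argument, and the exact-equality comparison against the PO total are assumptions.

```python
def route(extracted: dict, confidence: dict, critical_fields: list[str], po_total: float) -> str:
    """Auto-approve only if every critical field clears 0.9 and the total matches the PO."""
    confident = all(confidence.get(name, 0.0) > 0.9 for name in critical_fields)
    total_matches = extracted.get("total") == po_total
    return "auto_approve" if confident and total_matches else "route_to_review"


extracted = {"invoice_number": "INV-2026-001", "total": 1250.00}
confidence = {"invoice_number": 0.98, "total": 0.95}
critical = ["invoice_number", "total"]

print(route(extracted, confidence, critical, 1250.00))  # auto_approve
print(route(extracted, confidence, critical, 1300.00))  # route_to_review
```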
A typical AP automation flow:
- Email lands with a PDF invoice attachment.
- POST /api/v1/extract with template: "invoice_v1" → structured invoice JSON + per-field confidence.
- POST /api/v1/policies/evaluate with the extracted JSON as inputs → auto_approve or route_to_review.
- Caller fires the chosen action.
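The two API calls in this flow can be sketched as a single function. The HTTP transport is injected as a post callable so the sketch stays self-contained. The paths and the invoice_v1 template come from the steps above, while the function name, the "inputs" key of the evaluate body, and the shape of the returned dict are assumptions. The Policy Engine request may need further fields that are not covered on this page.

```python
from typing import Callable

PostJson = Callable[[str, dict], dict]  # (path, body) -> parsed JSON response


def process_invoice(post: PostJson, pdf_base64: str) -> dict:
    """Extract with the invoice_v1 template, then hand the result to Policy Engine."""
    extraction = post(
        "/api/v1/extract",
        {
            "file": {"kind": "pdf", "mime_type": "application/pdf", "data": pdf_base64},
            "template": "invoice_v1",
        },
    )
    # Assumption: the evaluate body carries the extracted JSON under "inputs".
    # Any other fields the Policy Engine requires are omitted here.
    decision = post("/api/v1/policies/evaluate", {"inputs": extraction["extracted"]})
    return {"extraction": extraction, "decision": decision}
```

The caller would then inspect the decision and fire the chosen action.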