#Data Extractor

Send text, an image, or a PDF; declare the fields you want; get back structured JSON with per-field confidence and citation anchors (where each value was found). Built so callers can route between auto-approve and human-review programmatically — no re-parsing the LLM's prose.

POST https://aiengine.velgent.com/api/v1/extract

For worked requests + example responses across text, image, and PDF inputs, see Examples.

#What you get

Every field-extraction response (output_format "json" or "html_fields") carries three things:

extracted — the values, keyed by the field names in your schema.
confidence — a 0.0–1.0 score per leaf field. Routes between auto-approve and review.
anchors — for each value, the exact source phrase the LLM used. Mandatory in V0 — it's the audit trail and the hallucination guard.

Plus a components block so quality teams can see which signal drove a low confidence score (the LLM's own self-rating, anchor verification, source quality, schema validation), and schema_drift listing fields the model emitted that you didn't ask for.

#Two ways to declare the schema

You can pass the schema inline on every call, or publish a named template in the admin console and reference it by slug. Pick one per request.

Mode	Use when
`extraction_schema` (inline)	One-off shapes, schema is dynamic per request, or testing.
`template` (slug)	Repeated shapes (invoices, receipts, KYC forms). Operator publishes once in admin → `/extract-templates`; callers pass `{"template": "invoice_v1"}` from then on.

#Request

Header	Required	Description
`Authorization`	Yes	`Bearer velgent_live_…` — see Authentication.
`Content-Type`	Yes	`application/json`.
`Idempotency-Key`	No	Any opaque string ≤ 128 chars. Repeated calls within 24h return the original response.
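
The headers above can be assembled with the Python standard library alone. The sketch below is illustrative only: the helper name `build_extract_request` and the `"YOUR_API_KEY"` placeholder are assumptions, and the request is built but never sent.

```python
import json
import urllib.request

EXTRACT_URL = "https://aiengine.velgent.com/api/v1/extract"

def build_extract_request(api_key, body, idempotency_key=None):
    """Assemble (but do not send) a POST request with the documented headers."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if idempotency_key is not None:
        # The header table caps the key at 128 characters.
        if len(idempotency_key) > 128:
            raise ValueError("Idempotency-Key must be at most 128 characters")
        headers["Idempotency-Key"] = idempotency_key
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )

# "YOUR_API_KEY" is a placeholder, not a real credential.
request = build_extract_request(
    "YOUR_API_KEY",
    {"text": "Invoice INV-1042", "template": "invoice_v1"},
    idempotency_key="invoice-1042-attempt-1",
)
```

Passing `request` to `urllib.request.urlopen` would send it. Reusing the same idempotency key within 24h should then return the original response, as described in the table.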

#Body

Field	Type	Required	Description
`text`	string	One of	Raw text input. Max ~200,000 chars.
`file`	ExtractFile	One of	Image or PDF, base64-encoded.
`extraction_schema`	ExtractionSchema	One of	Inline schema declaring fields to extract.
`template`	string	One of	Slug of a published template (admin → `/extract-templates`).
`template_version`	integer	No	Pin to a specific version. Default = current published version.
`output_format`	`"json"` \| `"html_fields"` \| `"html_document"`	No	Default `"json"`. See Output formats.
`options`	ExtractOptions	No	Per-call knobs.
`pii_action_override`	`"REDACT"` \| `"SUPPRESS"` \| `"ALLOW"` \| null	No	Per-call override of the tenant's PII action policy. Applies to text inputs only — images go directly to the vision LLM.

You must provide exactly one of text / file (which input). For output_format "json" and "html_fields" you must also provide exactly one of extraction_schema / template (which target). "html_document" is whole-document HTML reconstruction with no schema — image or PDF input only. The engine rejects requests that violate either constraint with HTTP 422.
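
A client can check these combinations before sending and so avoid a round trip that ends in a 422. The sketch below is one possible pre-flight check, not the engine's own validation; the function name `validate_body` and the wording of the messages are assumptions.

```python
def validate_body(body):
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    output_format = body.get("output_format", "json")
    has_text = body.get("text") is not None
    has_file = body.get("file") is not None
    has_schema = body.get("extraction_schema") is not None
    has_template = body.get("template") is not None

    # Input: exactly one of text / file.
    if has_text == has_file:
        problems.append("provide exactly one of text / file")

    if output_format == "html_document":
        # Whole-document reconstruction takes no schema and no text input.
        if has_schema or has_template:
            problems.append("html_document takes no extraction_schema or template")
        if has_text:
            problems.append("html_document needs image or PDF input")
    elif has_schema == has_template:
        # Target: exactly one of extraction_schema / template.
        problems.append("provide exactly one of extraction_schema / template")
    return problems
```

For example, `validate_body({"text": "hi"})` reports the missing target, while `validate_body({"text": "hi", "template": "invoice_v1"})` returns an empty list.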

#Output formats

Mode	Returns	Use when
`json` (default)	Structured JSON with per-field confidence + anchors.	Downstream automation — invoice posting, ticket routing, ETL.
`html_fields`	Same JSON plus `extracted_html`, a deterministic HTML render (`<dl>`/`<table>`) of the extracted fields.	Embedding the extracted data in an email, ticket comment, or notification.
`html_document`	`extracted_html` only — semantic HTML reconstruction of the whole source document. No `extracted` / `confidence` / `anchors`.	Re-flowing scanned PDFs or document images into searchable / editable HTML. Image or PDF only.

html_document mode does not accept extraction_schema or template: it produces a faithful HTML rendering of the source, not field extraction. The HTML is server-sanitised (no <script>, <style>, or event handlers) and wrapped in a section carrying the classes velgent-extract and velgent-extract-document so you can scope CSS to it. We still recommend running DOMPurify (or an equivalent sanitiser) before injecting the HTML into a privileged context.

#`ExtractFile`

Field	Type	Required	Description
`kind`	`"image"` \| `"pdf"`	Yes	The kind of file.
`mime_type`	string	Yes	`application/pdf`, `image/jpeg`, `image/png`, `image/webp`, or `image/tiff`.
`data`	string (base64)	Yes	Base64-encoded file bytes. Sanity-capped at ~37 MB decoded.
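
As an illustration, an ExtractFile object could be built from raw bytes as follows. The helper name `make_extract_file` is hypothetical, and reading "~37 MB" as 37 * 1024 * 1024 bytes is an assumption; the engine's exact cap may differ.

```python
import base64

# MIME types from the table above, mapped to the matching "kind".
ALLOWED_MIME_TYPES = {
    "application/pdf": "pdf",
    "image/jpeg": "image",
    "image/png": "image",
    "image/webp": "image",
    "image/tiff": "image",
}
# Assumption: "~37 MB" is interpreted as 37 MiB of decoded bytes.
MAX_DECODED_BYTES = 37 * 1024 * 1024

def make_extract_file(raw_bytes, mime_type):
    """Build an ExtractFile dict, checking MIME type and size first."""
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError(f"unsupported mime_type: {mime_type}")
    if len(raw_bytes) > MAX_DECODED_BYTES:
        raise ValueError("file exceeds the approximate 37 MB cap")
    return {
        "kind": ALLOWED_MIME_TYPES[mime_type],
        "mime_type": mime_type,
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }
```

The returned dict can be placed directly under the `file` key of the request body.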

#`ExtractOptions`

Field	Type	Default	Description
`language_hint`	string	`null`	Hint for OCR / vision (`"en"`, `"de"`, …).
`n_samples`	integer (1–5)	`1`	When > 1, the engine runs the extraction `n_samples` times and folds the agreement rate into the confidence score. Each sample is one billable evaluation.
`extract_tables`	boolean	`true`	Tells the prompt builder to preserve table row/column structure.
`page_range`	string	`null`	PDF only — `"1-3"` or `"2,5,7"` to restrict processing.
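
To show what the two `page_range` forms select, here is a small client-side parser. It is a sketch under assumptions: pages are taken to be 1-indexed, and mixed input such as "1-3,7" is accepted here even though this page only documents the two forms separately. The function name `parse_page_range` is hypothetical.

```python
def parse_page_range(page_range):
    """Expand "1-3" or "2,5,7" into a sorted list of 1-indexed page numbers."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page: {part!r}")
            pages.add(page)
    return sorted(pages)
```

A caller could also compare `len(parse_page_range(...))` against the 10-page hard cap listed under Errors before sending a large PDF.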

#`ExtractionSchema`

A typed declaration of the fields you want. Two top-level keys:

{
  "description": "Vendor invoice",
  "fields": {
    "invoice_number": { "type": "string",  "description": "Invoice ID", "pattern": "^INV-\\d+" },
    "total":          { "type": "number",  "description": "Grand total", "min": 0 },
    "currency":       { "type": "enum",    "description": "ISO code", "values": ["USD","EUR","GBP"] },
    "invoice_date":   { "type": "date",    "format": "iso8601_date" },
    "paid":           { "type": "boolean", "description": "Has it been paid" },
    "vendor": {
      "type": "object",
      "properties": {
        "name":    { "type": "string", "description": "Vendor company name" },
        "country": { "type": "string", "description": "Country", "nullable": true }
      }
    },
    "line_items": {
      "type": "array_of_object",
      "description": "Each row of the invoice",
      "item_schema": {
        "type": "object",
        "properties": {
          "description": { "type": "string", "description": "Item description" },
          "quantity":    { "type": "number", "description": "Quantity", "min": 0 },
          "unit_price":  { "type": "number", "description": "Unit price", "min": 0 }
        }
      }
    }
  }
}

#Field types

Type	Constraints	Example
`boolean`	—	`{ "type": "boolean", "description": "Approved?" }`
`string`	`max_length`, `pattern`, `nullable`	`{ "type": "string", "pattern": "^INV-\\d+", "description": "Invoice ID" }`
`integer`	`min`, `max`	`{ "type": "integer", "min": 1, "max": 10, "description": "Priority" }`
`number`	`min`, `max`	`{ "type": "number", "min": 0, "description": "Total" }`
`date`	`format` (`iso8601_date` \| `iso8601_datetime`)	`{ "type": "date", "format": "iso8601_date" }`
`enum`	`values` (array)	`{ "type": "enum", "values": ["USD","EUR"] }`
`string_array`	`min_items`, `max_items`	`{ "type": "string_array", "max_items": 5 }`
`object`	`properties`, `required_fields`	`{ "type": "object", "properties": { … } }`
`array_of_object`	`item_schema`, `min_items`, `max_items`	`{ "type": "array_of_object", "item_schema": { … } }`

Every field also carries a description the LLM sees verbatim — write it like instructions to a junior employee.
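
The constraints in the table can be mirrored client-side, for instance to re-check values before posting them downstream. The sketch below covers only the scalar types and is not the engine's validator. Two assumptions are made: `pattern` is matched from the start of the string, and `nullable` is honoured on any type.

```python
import re

def check_value(spec, value):
    """Return True if value satisfies a field spec (scalar types only)."""
    if value is None:
        return bool(spec.get("nullable", False))
    kind = spec["type"]
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "string":
        if not isinstance(value, str):
            return False
        if "max_length" in spec and len(value) > spec["max_length"]:
            return False
        # Assumption: patterns are matched from the start of the string.
        return "pattern" not in spec or re.match(spec["pattern"], value) is not None
    if kind in ("integer", "number"):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if kind == "integer" and not isinstance(value, int):
            return False
        if "min" in spec and value < spec["min"]:
            return False
        return "max" not in spec or value <= spec["max"]
    if kind == "enum":
        return value in spec["values"]
    raise ValueError(f"type not covered by this sketch: {kind}")
```

For example, `check_value({"type": "number", "min": 0}, -1)` is False, which corresponds to the case where the engine would short-circuit the field's confidence to 0.0.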

#Response

Field	Type	Description
`request_id`	string (UUID)	Surfaces in your Activity logs.
`extracted`	object	Values keyed by the schema field names.
`confidence`	object	`0.0`–`1.0` per leaf field, mirrors `extracted` shape.
`anchors`	object	`{ text, page }` per leaf field, mirrors `extracted` shape. `page` is 1-indexed for PDFs / multi-image inputs; `null` for text or single-image.
`components`	object	Per-field `ConfidenceComponents` — see Confidence.
`metadata`	object	Cost / latency / provenance. See Metadata.
`schema_drift`	array of string	Field paths the model emitted that weren't in your schema.
`warnings`	array of string	Soft-fail signals (truncation, fallback paths, unparseable JSON).
`extracted_html`	string \| null	HTML rendering of the extracted content. Populated when `output_format` is `"html_fields"` (deterministic field render) or `"html_document"` (whole-document HTML). `null` for the default `"json"` mode.
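
Because `confidence` mirrors the shape of `extracted`, routing between auto-approve and human review comes down to walking its leaves. The sketch below shows one way to do that. The 0.85 threshold, the path notation, and the function names are assumptions chosen for illustration; arrays are assumed to appear as JSON lists.

```python
def leaf_confidences(confidence, prefix=""):
    """Yield (path, score) for every leaf in a nested confidence object."""
    if isinstance(confidence, dict):
        for key, child in confidence.items():
            yield from leaf_confidences(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(confidence, list):
        for index, child in enumerate(confidence):
            yield from leaf_confidences(child, f"{prefix}[{index}]")
    else:
        yield prefix, confidence

def route(response, threshold=0.85):
    """Return ("auto_approve", []) or ("human_review", [low-confidence paths])."""
    low = [
        path
        for path, score in leaf_confidences(response["confidence"])
        if score is None or score < threshold
    ]
    return ("human_review", low) if low else ("auto_approve", low)
```

The list of low-confidence paths gives a reviewer the specific fields to look at instead of the whole document.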

#Metadata

Field	Type	Description
`input_kind`	`"text"` \| `"image"` \| `"pdf_text_layer"` \| `"pdf_rasterised"`	Which code path the engine took.
`pages_processed`	integer	Pages actually processed (PDFs).
`ocr_used`	boolean	True when a scanned PDF was rasterised to images.
`model_used`	string	Provider/model that handled the call.
`tokens_in`, `tokens_out`	integer	Provider-reported counts.
`latency_ms`	integer	End-to-end engine latency.
`redaction_count`	integer	PII redactions on text input.
`n_samples`	integer	Echo of the request option.

#Confidence

Velgent's per-field confidence is multi-signal, not just the LLM's self-rating. The final score is:

final = min(
  llm_self_conf,         # what the LLM said
  anchor_conf,           # does the value actually appear in the anchor text?
  source_quality,        # how trustworthy is the input itself?
)                        # with a hard 0.0 short-circuit if the value
                         # fails schema validation

For n_samples > 1, the agreement rate across samples is folded in as a fourth signal, and the final score is still the minimum of all signals.

Why min — a chain is as strong as its weakest link. A hallucination (anchor doesn't actually contain the value) should dominate over LLM optimism. An over-compressed blurry image should cap confidence regardless of how clean the LLM output looks.
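
The combination rule can be written as a few lines of runnable Python. This is a sketch of the formula as described on this page, not the engine's source; the function name and argument defaults are assumptions.

```python
def final_confidence(llm_self_conf, anchor_conf, source_quality,
                     schema_validates=True, agreement_rate=None):
    """Combine the per-field signals with min(), as described above."""
    if not schema_validates:
        return 0.0  # hard short-circuit on schema validation failure
    signals = [llm_self_conf, anchor_conf, source_quality]
    if agreement_rate is not None:  # only present when n_samples > 1
        signals.append(agreement_rate)
    return min(signals)
```

With an LLM self-rating of 0.95, an anchor score of 0.20, and a source quality of 0.95, the result is 0.20: the failed anchor check dominates.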

The components block on the response surfaces each signal so quality teams can explain a low score:

"components": {
  "total": {
    "llm_self_conf":    0.95,
    "anchor_conf":      0.20,    // anchor verification failed — likely hallucination
    "source_quality":   0.95,
    "schema_validates": true,
    "n_samples":        1,
    "agreement_rate":   null
  }
}
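
Given one field's entry from the components block, a quality team could name the limiting signal with a helper along these lines. The function name `explain_low_score` is hypothetical, and the key names are taken from the sample above.

```python
SIGNALS = ("llm_self_conf", "anchor_conf", "source_quality", "agreement_rate")

def explain_low_score(field_components):
    """Name the signal that is limiting a field's confidence."""
    if not field_components.get("schema_validates", True):
        return "schema_validates", 0.0
    # Skip signals that are absent or null (agreement_rate when n_samples is 1).
    present = {
        name: field_components[name]
        for name in SIGNALS
        if field_components.get(name) is not None
    }
    weakest = min(present, key=present.get)
    return weakest, present[weakest]
```

Applied to the `total` entry above, this names `anchor_conf` at 0.2 as the limiting signal.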

#Source quality bands

Computed automatically from input metadata, with no model calls. V0 uses resolution and compression for images; sharpness (Laplacian variance) is deferred to Phase 2.

Input	Source quality
Raw text	`1.0`
Text-layer PDF	`0.95`
Born-digital image (PNG/lossless)	up to `0.95`
2 MP+ JPEG	up to `0.9`
0.5–2 MP JPEG	up to `0.8`
< 0.5 MP image	`≤ 0.5`
Heavily compressed JPEG (< 0.10 bytes/pixel)	`≤ 0.6`
Rasterised PDF	min across per-page image scores
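
To make the image bands concrete, the sketch below computes the upper bound the table implies for a single image. It is an approximation under stated assumptions: a megapixel is taken as 1,000,000 pixels, the order in which overlapping bands apply is a guess, and any lossy format is treated like JPEG. The engine's actual scoring may differ.

```python
def image_quality_cap(width, height, size_bytes, lossless=False):
    """Upper bound on source quality for one image, following the bands above."""
    pixels = width * height
    if pixels <= 0:
        raise ValueError("image dimensions must be positive")
    megapixels = pixels / 1_000_000
    if megapixels < 0.5:
        cap = 0.5
    elif lossless:
        cap = 0.95
    elif megapixels >= 2:
        cap = 0.9
    else:
        cap = 0.8
    # Heavily compressed lossy images (< 0.10 bytes/pixel) are capped further.
    if not lossless and size_bytes / pixels < 0.10:
        cap = min(cap, 0.6)
    return cap

def rasterised_pdf_quality(page_caps):
    """A rasterised PDF takes the minimum across its per-page image scores."""
    return min(page_caps)
```

For example, a 2000x1500 JPEG of 150,000 bytes works out to 0.05 bytes/pixel, so the sketch caps it at 0.6 despite its 3 MP resolution.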

#Anchors

For every extracted leaf value, the engine returns the exact source phrase it was read from. This is:

A hallucination guard. Post-LLM the engine verifies the value appears in the anchor (with currency / formatting variants). Mismatch drops the confidence to 0.2.
An audit trail. When a customer questions "where did that total come from", the answer is one field-lookup — concrete and citable.
UI-ready. Apps building on top of /extract highlight the anchor text on the source document for the user.

Cost: ~20-30% extra output tokens per call. Mandatory in V0 — there is no opt-out flag.
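
Callers who want a second check on their side can repeat a simplified anchor verification. The sketch below is not the engine's implementation: it handles only US-style number formatting ("1,250.00") and case-insensitive substring matches, and the function names are hypothetical.

```python
import re

def normalise_number(text):
    """Drop everything except digits, the decimal point, and a minus sign."""
    return re.sub(r"[^\d.\-]", "", text)

def value_in_anchor(value, anchor_text):
    """Check that an extracted value can be found in its anchor phrase."""
    if isinstance(value, bool):
        return True  # booleans rarely appear literally; not checked in this sketch
    if isinstance(value, (int, float)):
        # Find number-like tokens such as "1,250.00" and compare numerically.
        candidates = re.findall(r"-?\d[\d,]*(?:\.\d+)?", anchor_text)
        return any(float(normalise_number(c)) == float(value) for c in candidates)
    return str(value).casefold() in anchor_text.casefold()
```

A value of 1250.0 is found in the anchor "Total due: $1,250.00 USD", while 1520.0 is not. The second case is the kind of mismatch that drops the engine's anchor confidence to 0.2.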

#Templates

Pre-published schemas managed in the admin console:

admin.velgent.com → Extract templates → New template — author the schema by hand, or paste a sample document and let Velgent propose one for you (see Templates).
Call with {"template": "invoice_v1"} instead of inlining the schema. Faster requests, central control, atomic version swaps for breaking schema changes.

Pin a historical version with template_version (e.g. for replay / canary / rollback).

#Provider + residency

The Data Extractor uses the same per-tenant LLM router as Triage and Policy Engine. BYOK is fully supported: bring your own Anthropic key and point the Data Extractor row at it under admin → Engine Settings → Model routing. New tenants default to Anthropic Claude Sonnet 4.6 (vision-capable).

Input	Provider support (V0)
Text	Anthropic, Groq, Custom, BYOK
PDF text-layer	Anthropic, Groq, Custom, BYOK
Image	Anthropic only (vision-capable models)
PDF rasterised	Anthropic only

Non-Anthropic vision-capable providers (OpenAI GPT-4o, etc.) can be wired in when a tenant brings the requirement — V0 returns a clean HTTP 400 naming the fix path if a non-vision provider is pinned and the input is an image.

#Errors

Status	Code	When
`400`	`invalid_request`	Bad MIME type, undecodable base64, malformed schema, etc.
`400`	—	Non-vision provider pinned on `extract` routing + image/PDF-rasterised input. Error names the fix path.
`404`	`not_found`	`template` slug doesn't resolve for your tenant, or the pinned `template_version` doesn't exist.
`413`	`payload_too_large`	PDF beyond the 10-page hard cap. Response includes a hint to use `page_range`.
`422`	`invalid_request`	Pydantic validation — exactly one of `text`/`file` and `extraction_schema`/`template` must be set.
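
A client can turn the table above into a simple dispatch on status and code. The sketch below is illustrative only: the function name and the suggested actions are assumptions, and because the error response body is not specified on this page, the status and code are taken as plain arguments.

```python
def next_step(status, error_code=None):
    """Map a documented error status to a suggested client-side action."""
    if status == 413:
        return "retry with options.page_range covering at most 10 pages"
    if status == 404:
        return "check the template slug and template_version"
    if status == 422:
        return "send exactly one of text/file and one of extraction_schema/template"
    if status == 400 and error_code == "invalid_request":
        return "fix the mime type, base64 payload, or schema"
    if status == 400:
        # The code-less 400 row: non-vision provider with image input.
        return "route extract to a vision-capable provider"
    return "unhandled status; inspect the response body"
```

None of these errors is likely to succeed on a blind retry, so each suggested action changes the request or the tenant configuration first.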

#Security

V0 ships with these baked-in guarantees (engine doc 14 §"Security + confidentiality"):

No persistent storage of the file. Bytes held in memory only; cleared in a finally block before the response returns.
No file content in logs. Audit captures mime_type, pages, size_bytes, SHA-256 content_hash, and a redacted preview of extracted text only (capped at 2 KB).
PII redaction before the LLM for text inputs (existing Presidio path).
Tenant isolation + BYOK + residency-aware LLM routing inherited from the existing per-tenant LLM router.
TLS in transit + per-request memory hygiene.

Persistent file storage / re-extraction workflows are opt-in tenant config in V1+.
