#Data Extractor
Send text, an image, or a PDF; declare the fields you want; get back structured JSON with per-field confidence and citation anchors (where each value was found). Built so callers can route between auto-approve and human-review programmatically — no re-parsing the LLM's prose.
For worked requests + example responses across text, image, and PDF inputs, see Examples.
#What you get
Three things on every response, in addition to the structured output:
- `extracted`: the values, keyed by the field names in your schema.
- `confidence`: a `0.0–1.0` score per leaf field. Routes between auto-approve and review.
- `anchors`: for each value, the exact source phrase the LLM used. Mandatory in V0; it's the audit trail and the hallucination guard.
Plus a components block so quality teams can see which signal
drove a low confidence score (the LLM's own self-rating, anchor
verification, source quality, schema validation), and schema_drift
listing fields the model emitted that you didn't ask for.
#Two ways to declare the schema
You can pass the schema inline on every call, or publish a named template in the admin console and reference it by slug. Pick one per request.
| Mode | Use when |
|---|---|
| extraction_schema (inline) | One-off shapes, schema is dynamic per request, or testing. |
| template (slug) | Repeated shapes (invoices, receipts, KYC forms). Operator publishes once in admin → /extract-templates; callers pass {"template": "invoice_v1"} from then on. |
#Request
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer velgent_live_… (see Authentication). |
| Content-Type | Yes | application/json. |
| Idempotency-Key | No | Any opaque string ≤ 128 chars. Repeated calls within 24h return the original response. |
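
As a rough illustration, the headers above could be assembled as follows. `build_headers` is a hypothetical helper, not part of any Velgent SDK; the API key is supplied by the caller, and a random UUID is used as the idempotency key only because it comfortably fits the 128-character limit.

```python
import uuid
from typing import Optional


def build_headers(api_key: str, idempotency_key: Optional[str] = None) -> dict:
    """Assemble the three request headers listed in the table above."""
    if idempotency_key is None:
        # Assumption: any opaque string works, so a random UUID is good enough.
        idempotency_key = str(uuid.uuid4())
    if len(idempotency_key) > 128:
        raise ValueError("Idempotency-Key must be at most 128 characters")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Idempotency-Key": idempotency_key,
    }
```

Reusing the same key when retrying a timed-out call should return the original response rather than trigger a second extraction, provided the retry lands inside the 24h window.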
#Body
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | One of | Raw text input. Max ~200,000 chars. |
| file | ExtractFile | One of | Image or PDF, base64-encoded. |
| extraction_schema | ExtractionSchema | One of | Inline schema declaring fields to extract. |
| template | string | One of | Slug of a published template (admin → /extract-templates). |
| template_version | integer | No | Pin to a specific version. Default = current published version. |
| output_format | "json" \| "html_fields" \| "html_document" | No | Default "json". See Output formats. |
| options | ExtractOptions | No | Per-call knobs. |
| pii_action_override | "REDACT" \| "SUPPRESS" \| "ALLOW" \| null | No | Per-call override of the tenant's PII action policy. Applies to text inputs only; images go directly to the vision LLM. |
You must provide exactly one of text / file (which input). For output_format "json" and "html_fields" you must also provide exactly one of extraction_schema / template (which target). "html_document" is whole-document HTML reconstruction with no schema — image or PDF input only. The engine rejects requests that violate either constraint with HTTP 422.
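
The two "exactly one of" rules can be checked client-side before a request is sent. The sketch below is one possible pre-flight check; `check_extract_body` is a hypothetical helper and the messages are illustrative, not the engine's actual error text.

```python
def check_extract_body(body: dict) -> list:
    """Return a list of constraint violations; an empty list means the body looks valid."""
    problems = []
    has_text = body.get("text") is not None
    has_file = body.get("file") is not None
    if has_text == has_file:  # both set, or neither set
        problems.append("provide exactly one of text / file")

    fmt = body.get("output_format", "json")
    has_schema = body.get("extraction_schema") is not None
    has_template = body.get("template") is not None
    if fmt == "html_document":
        if has_schema or has_template:
            problems.append("html_document takes no extraction_schema or template")
        if has_text:
            problems.append("html_document needs image or PDF input")
    elif fmt in ("json", "html_fields"):
        if has_schema == has_template:
            problems.append("provide exactly one of extraction_schema / template")
    else:
        problems.append(f"unknown output_format: {fmt!r}")
    return problems
```

A body such as `{"text": "...", "template": "invoice_v1"}` passes, while a body with only `text` is flagged for the missing target.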
#Output formats
| Mode | Returns | Use when |
|---|---|---|
| json (default) | Structured JSON with per-field confidence + anchors. | Downstream automation: invoice posting, ticket routing, ETL. |
| html_fields | Same JSON plus extracted_html, a deterministic HTML render (`<dl>`/`<table>`) of the extracted fields. | Embedding the extracted data in an email, ticket comment, or notification. |
| html_document | extracted_html only: semantic HTML reconstruction of the whole source document. No extracted / confidence / anchors. | Re-flowing scanned PDFs or document images into searchable / editable HTML. Image or PDF only. |
html_document mode does not accept extraction_schema or template — it produces a faithful HTML rendering of the source, not field extraction. The HTML is server-sanitised (no <script>, <style>, or event handlers) and wrapped in a velgent-extract velgent-extract-document section so you can scope CSS to it. We still recommend running DOMPurify (or equivalent) before injecting the HTML into a privileged context.
#ExtractFile
| Field | Type | Required | Description |
|---|---|---|---|
| kind | "image" \| "pdf" | Yes | The kind of file. |
| mime_type | string | Yes | application/pdf, image/jpeg, image/png, image/webp, or image/tiff. |
| data | string (base64) | Yes | Base64-encoded file bytes. Sanity-capped at ~37 MB decoded. |
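
A minimal sketch of building an `ExtractFile` object from raw bytes, assuming the caller already knows the MIME type. `make_extract_file` and `ALLOWED_MIME_TYPES` are hypothetical names introduced for illustration; only the field names and accepted MIME types come from the table above.

```python
import base64

# MIME types from the ExtractFile table, mapped to the matching "kind" value.
ALLOWED_MIME_TYPES = {
    "application/pdf": "pdf",
    "image/jpeg": "image",
    "image/png": "image",
    "image/webp": "image",
    "image/tiff": "image",
}


def make_extract_file(raw: bytes, mime_type: str) -> dict:
    """Wrap raw file bytes in the ExtractFile shape."""
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError(f"unsupported mime_type: {mime_type}")
    return {
        "kind": ALLOWED_MIME_TYPES[mime_type],
        "mime_type": mime_type,
        "data": base64.b64encode(raw).decode("ascii"),
    }
```

Base64 inflates the payload by roughly a third, so the encoded string is noticeably larger than the decoded size the cap refers to.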
#ExtractOptions
| Field | Type | Default | Description |
|---|---|---|---|
| language_hint | string | null | Hint for OCR / vision ("en", "de", …). |
| n_samples | integer (1–5) | 1 | When > 1, the engine runs the extraction n_samples times and folds the agreement rate into the confidence score. Each sample is one billable evaluation. |
| extract_tables | boolean | true | Tells the prompt builder to preserve table row/column structure. |
| page_range | string | null | PDF only: "1-3" or "2,5,7" to restrict processing. |
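
To sanity-check a `page_range` value before sending it, a client could expand it into page numbers. The sketch below handles the two documented forms; accepting a mix such as `"1-3,7"` is an assumption of this sketch, since the engine's behaviour for mixed input is not stated here. `expand_page_range` is a hypothetical helper.

```python
def expand_page_range(page_range: str) -> list:
    """Expand "1-3" or "2,5,7" into a sorted list of 1-indexed page numbers."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start < 1 or end < start:
                raise ValueError(f"bad range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"bad page: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Counting the expanded pages is a cheap way to see how much of a long PDF a given selection would actually send for processing.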
#ExtractionSchema
A typed declaration of the fields you want. Two top-level keys:

    {
      "description": "Vendor invoice",
      "fields": {
        "invoice_number": { "type": "string", "description": "Invoice ID", "pattern": "^INV-\\d+" },
        "total": { "type": "number", "description": "Grand total", "min": 0 },
        "currency": { "type": "enum", "description": "ISO code", "values": ["USD","EUR","GBP"] },
        "invoice_date": { "type": "date", "format": "iso8601_date" },
        "paid": { "type": "boolean", "description": "Has it been paid" },
        "vendor": {
          "type": "object",
          "properties": {
            "name": { "type": "string", "description": "Vendor company name" },
            "country": { "type": "string", "description": "Country", "nullable": true }
          }
        },
        "line_items": {
          "type": "array_of_object",
          "description": "Each row of the invoice",
          "item_schema": {
            "type": "object",
            "properties": {
              "description": { "type": "string", "description": "Item description" },
              "quantity": { "type": "number", "description": "Quantity", "min": 0 },
              "unit_price": { "type": "number", "description": "Unit price", "min": 0 }
            }
          }
        }
      }
    }

#Field types
| Type | Constraints | Example |
|---|---|---|
| boolean | - | { "type": "boolean", "description": "Approved?" } |
| string | max_length, pattern, nullable | { "type": "string", "pattern": "^INV-\\d+", "description": "Invoice ID" } |
| integer | min, max | { "type": "integer", "min": 1, "max": 10, "description": "Priority" } |
| number | min, max | { "type": "number", "min": 0, "description": "Total" } |
| date | format (iso8601_date \| iso8601_datetime) | { "type": "date", "format": "iso8601_date" } |
| enum | values (array) | { "type": "enum", "values": ["USD","EUR"] } |
| string_array | min_items, max_items | { "type": "string_array", "max_items": 5 } |
| object | properties, required_fields | { "type": "object", "properties": { … } } |
| array_of_object | item_schema, min_items, max_items | { "type": "array_of_object", "item_schema": { … } } |
Every field also carries a description the LLM sees verbatim —
write it like instructions to a junior employee.
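
To make the constraints concrete, here is a rough local check of a single leaf value against its field spec. It is an illustration only: the engine's own validation rules are not documented here, so details such as using `re.search` for `pattern` and honouring `nullable` on every type are assumptions. `check_leaf` is a hypothetical helper and covers only the scalar types.

```python
import re


def check_leaf(value, spec: dict) -> bool:
    """Return True if value satisfies the constraints declared in spec."""
    if value is None:
        return bool(spec.get("nullable", False))
    kind = spec["type"]
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "string":
        if not isinstance(value, str):
            return False
        if "max_length" in spec and len(value) > spec["max_length"]:
            return False
        # Assumption: the pattern only needs to match somewhere in the value.
        if "pattern" in spec and re.search(spec["pattern"], value) is None:
            return False
        return True
    if kind in ("integer", "number"):
        if isinstance(value, bool):  # bool is a subclass of int in Python
            return False
        allowed = int if kind == "integer" else (int, float)
        if not isinstance(value, allowed):
            return False
        if "min" in spec and value < spec["min"]:
            return False
        if "max" in spec and value > spec["max"]:
            return False
        return True
    if kind == "enum":
        return value in spec["values"]
    raise ValueError(f"type not covered by this sketch: {kind}")
```

Running a few known-good and known-bad values through a check like this is a quick way to catch an over-tight `pattern` or `min` before publishing a schema.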
#Response
| Field | Type | Description |
|---|---|---|
| request_id | string (UUID) | Surfaces in your Activity logs. |
| extracted | object | Values keyed by the schema field names. |
| confidence | object | 0.0–1.0 per leaf field, mirrors extracted shape. |
| anchors | object | { text, page } per leaf field, mirrors extracted shape. page is 1-indexed for PDFs / multi-image inputs; null for text or single-image. |
| components | object | Per-field ConfidenceComponents (see Confidence). |
| metadata | object | Cost / latency / provenance. See Metadata. |
| schema_drift | array of string | Field paths the model emitted that weren't in your schema. |
| warnings | array of string | Soft-fail signals (truncation, fallback paths, unparseable JSON). |
| extracted_html | string \| null | HTML rendering of the extracted content. Populated when output_format is "html_fields" (deterministic field render) or "html_document" (whole-document HTML). null for the default "json" mode. |
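
Because `confidence` mirrors the shape of `extracted`, routing between auto-approve and human review comes down to walking that object. The sketch below is one way to do it; the `0.85` threshold, the path notation, and the function names are assumptions chosen for illustration, not values the engine prescribes.

```python
def leaf_confidences(confidence, prefix=""):
    """Yield (path, score) for every leaf of a nested confidence object."""
    if isinstance(confidence, dict):
        for key, child in confidence.items():
            yield from leaf_confidences(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(confidence, list):
        for i, child in enumerate(confidence):
            yield from leaf_confidences(child, f"{prefix}[{i}]")
    else:
        yield prefix, confidence


def route(response: dict, threshold: float = 0.85) -> dict:
    """Send the document to review if any leaf score is missing or below threshold."""
    low = sorted(
        path
        for path, score in leaf_confidences(response["confidence"])
        if score is None or score < threshold
    )
    return {
        "decision": "human_review" if low else "auto_approve",
        "low_fields": low,
    }
```

Returning the low-scoring paths alongside the decision lets a review UI jump straight to the fields that need attention.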
#Metadata
| Field | Type | Description |
|---|---|---|
| input_kind | "text" \| "image" \| "pdf_text_layer" \| "pdf_rasterised" | Which code path the engine took. |
| pages_processed | integer | Pages actually processed (PDFs). |
| ocr_used | boolean | True when a scanned PDF was rasterised to images. |
| model_used | string | Provider/model that handled the call. |
| tokens_in, tokens_out | integer | Provider-reported counts. |
| latency_ms | integer | End-to-end engine latency. |
| redaction_count | integer | PII redactions on text input. |
| n_samples | integer | Echo of the request option. |
#Confidence
Velgent's per-field confidence is multi-signal, not just the LLM's self-rating. The final score is:

    final = min(
        llm_self_conf,    # what the LLM said
        anchor_conf,      # does the value actually appear in the anchor text?
        source_quality,   # how trustworthy is the input itself?
    )
    # with a hard 0.0 short-circuit if the value fails schema validation

For n_samples > 1, the agreement rate across samples is folded in
as a fourth signal (still min).
Why min — a chain is as strong as its weakest link. A
hallucination (anchor doesn't actually contain the value) should
dominate over LLM optimism. An over-compressed blurry image should
cap confidence regardless of how clean the LLM output looks.
The components block on the response surfaces each signal so
quality teams can explain a low score:

    "components": {
      "total": {
        "llm_self_conf": 0.95,
        "anchor_conf": 0.20,       // anchor verification failed, likely hallucination
        "source_quality": 0.95,
        "schema_validates": true,
        "n_samples": 1,
        "agreement_rate": null
      }
    }

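
The `min` rule can be reproduced from a `components` entry, which is handy when explaining a score to a reviewer. This is a sketch of the rule as described in this section, not the engine's source; `final_confidence` is a hypothetical helper.

```python
def final_confidence(c: dict) -> float:
    """Recompute the final score for one field from its components entry."""
    if not c["schema_validates"]:
        return 0.0  # hard short-circuit on schema validation failure
    signals = [c["llm_self_conf"], c["anchor_conf"], c["source_quality"]]
    if c.get("agreement_rate") is not None:
        signals.append(c["agreement_rate"])  # fourth signal when n_samples > 1
    return min(signals)


total = {
    "llm_self_conf": 0.95,
    "anchor_conf": 0.20,
    "source_quality": 0.95,
    "schema_validates": True,
    "n_samples": 1,
    "agreement_rate": None,
}
print(final_confidence(total))  # 0.2, the failed anchor check dominates
```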
#Source quality bands
Computed automatically from input metadata. No model calls. V0 uses resolution + compression for images; sharpness (laplacian variance) is deferred to Phase 2.
| Input | Source quality |
|---|---|
| Raw text | 1.0 |
| Text-layer PDF | 0.95 |
| Born-digital image (PNG/lossless) | up to 0.95 |
| 2 MP+ JPEG | up to 0.9 |
| 0.5–2 MP JPEG | up to 0.8 |
| < 0.5 MP image | ≤ 0.5 |
| Heavily compressed JPEG (< 0.10 bytes/pixel) | ≤ 0.6 |
| Rasterised PDF | min across per-page image scores |
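
Read as upper bounds, the image rows of the table could be approximated like this. The sketch is a loose interpretation: how the engine combines the resolution and compression bands is not documented here, so taking the lower of the two caps is an assumption, as is the function name `source_quality_cap`.

```python
def source_quality_cap(width: int, height: int, size_bytes: int, lossless: bool) -> float:
    """Approximate the source-quality ceiling for a single image."""
    pixels = width * height
    if pixels <= 0:
        raise ValueError("image dimensions must be positive")
    megapixels = pixels / 1_000_000
    bytes_per_pixel = size_bytes / pixels

    if megapixels < 0.5:
        cap = 0.5          # any image under 0.5 MP
    elif lossless:
        cap = 0.95         # born-digital PNG / lossless
    elif megapixels >= 2:
        cap = 0.9          # 2 MP+ JPEG
    else:
        cap = 0.8          # 0.5 to 2 MP JPEG

    # Assumption: heavy JPEG compression lowers the cap further.
    if not lossless and bytes_per_pixel < 0.10:
        cap = min(cap, 0.6)
    return cap
```

For a rasterised PDF, the table implies taking the minimum of the per-page values.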
#Anchors
For every extracted leaf value, the engine returns the exact source phrase it was read from. This is:
- A hallucination guard. Post-LLM the engine verifies the value appears in the anchor (with currency / formatting variants). Mismatch drops the confidence to 0.2.
- An audit trail. When a customer questions "where did that total come from", the answer is one field-lookup, concrete and citable.
- UI-ready. Apps building on top of `/extract` highlight the anchor text on the source document for the user.
Cost: ~20-30% extra output tokens per call. Mandatory in V0 — there is no opt-out flag.
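
The anchor check can be pictured as a normalised substring match. The engine's actual matching rules are not documented here, so the sketch below is a deliberately simple stand-in: it strips whitespace, commas, and a few currency symbols, and tries a handful of numeric renderings. `normalise` and `value_in_anchor` are hypothetical names.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and drop whitespace, commas, and common currency symbols."""
    return re.sub(r"[\s,$€£]", "", text.lower())


def value_in_anchor(value, anchor_text: str) -> bool:
    """Return True if some rendering of value appears in the anchor text."""
    candidates = {str(value)}
    if isinstance(value, float):
        candidates.add(f"{value:.2f}")       # 1250.0 -> "1250.00"
        if value.is_integer():
            candidates.add(str(int(value)))  # 1250.0 -> "1250"
    haystack = normalise(anchor_text)
    return any(normalise(c) in haystack for c in candidates)
```

With these rules, `value_in_anchor(1250.0, "Total due: $1,250.00")` matches, while a value of `980.0` against the same anchor does not. A production check would need more care: decimal-comma locales and partial-number matches (`250` inside `1250`) are not handled here.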
#Templates
Pre-published schemas managed in the admin console:
- admin.velgent.com → Extract templates → New template — author the schema by hand, or paste a sample document and let Velgent propose one for you (see Templates).
- Call with `{"template": "invoice_v1"}` instead of inlining the schema. Faster requests, central control, atomic version swaps for breaking schema changes.
Pin a historical version with template_version (e.g. for replay /
canary / rollback).
#Provider + residency
The Data Extractor uses the same per-tenant LLM router as Summariser
and Policy Engine. BYOK is fully supported — bring your own
Anthropic key and point the Data Extractor row at it under
admin → Engine Settings → Model routing. Defaults to Anthropic
Claude Sonnet 4.6 (vision-capable) on tenant onboarding.
| Input | Provider support (V0) |
|---|---|
| Text | Anthropic, Groq, Custom, BYOK |
| PDF text-layer | Anthropic, Groq, Custom, BYOK |
| Image | Anthropic only (vision-capable models) |
| PDF rasterised | Anthropic only |
Non-Anthropic vision-capable providers (OpenAI GPT-4o, etc.) can be wired in when a tenant brings the requirement — V0 returns a clean HTTP 400 naming the fix path if a non-vision provider is pinned and the input is an image.
#Errors
| Status | Code | When |
|---|---|---|
| 400 | invalid_request | Bad mime type, undecodable base64, malformed schema, both text and file provided, etc. |
| 400 | - | Non-vision provider pinned on extract routing + image/PDF-rasterised input. Error names the fix path. |
| 404 | not_found | template slug doesn't resolve for your tenant, or the pinned template_version doesn't exist. |
| 413 | payload_too_large | PDF beyond the 10-page hard cap. Response includes a hint to use page_range. |
| 422 | invalid_request | Pydantic validation: exactly one of text/file and extraction_schema/template must be set. |
#Security
V0 ships with these baked-in guarantees (engine doc 14 §"Security + confidentiality"):
- No persistent storage of the file. Bytes held in memory only; cleared in a `finally` block before the response returns.
- No file content in logs. Audit captures `mime_type`, `pages`, `size_bytes`, SHA-256 `content_hash`, and a redacted preview of extracted text only (capped at 2 KB).
- PII redaction before the LLM for text inputs (existing Presidio path).
- Tenant isolation + BYOK + residency-aware LLM routing inherited from the existing per-tenant LLM router.
- TLS in transit + per-request memory hygiene.
Persistent file storage / re-extraction workflows are opt-in tenant config in V1+.