#Data Extractor

Send text, an image, or a PDF; declare the fields you want; get back structured JSON with per-field confidence and citation anchors (where each value was found). Built so callers can route between auto-approve and human-review programmatically — no re-parsing the LLM's prose.

POST https://aiengine.velgent.com/api/v1/extract

For worked requests + example responses across text, image, and PDF inputs, see Examples.

#What you get

Every response carries three things:

  • extracted: the values, keyed by the field names in your schema.
  • confidence: a 0.0–1.0 score per leaf field. Routes between auto-approve and review.
  • anchors: for each value, the exact source phrase the LLM used. Mandatory in V0; it's the audit trail and the hallucination guard.

Plus a components block so quality teams can see which signal drove a low confidence score (the LLM's own self-rating, anchor verification, source quality, schema validation), and schema_drift listing fields the model emitted that you didn't ask for.
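A minimal sketch of the auto-approve / human-review routing described above. It assumes the confidence object mirrors the extracted shape as nested dicts and lists; the 0.9 threshold, the function names, and the choice to send any schema_drift to review are illustrative assumptions, not part of the API.

```python
from typing import Any, Iterator

AUTO_APPROVE_THRESHOLD = 0.9  # hypothetical cut-off; tune per workflow


def leaf_scores(node: Any, path: str = "") -> Iterator[tuple[str, float]]:
    """Yield (field_path, score) for every numeric leaf in a confidence tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_scores(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from leaf_scores(child, f"{path}[{index}]")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield path, float(node)


def route(response: dict, threshold: float = AUTO_APPROVE_THRESHOLD) -> dict:
    """Return a routing decision plus the fields that need a human look."""
    low = [p for p, s in leaf_scores(response.get("confidence", {})) if s < threshold]
    drift = response.get("schema_drift", [])
    # Assumption: unexpected fields are treated as a reason for review.
    decision = "auto_approve" if not low and not drift else "human_review"
    return {"decision": decision, "low_confidence_fields": low, "schema_drift": drift}
```

The returned low_confidence_fields list can be shown to the reviewer so they only check the fields that fell below the threshold.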

#Two ways to declare the schema

You can pass the schema inline on every call, or publish a named template in the admin console and reference it by slug. Pick one per request.

| Mode | Use when |
| --- | --- |
| extraction_schema (inline) | One-off shapes, schema is dynamic per request, or testing. |
| template (slug) | Repeated shapes (invoices, receipts, KYC forms). Operator publishes once in admin → /extract-templates; callers pass {"template": "invoice_v1"} from then on. |

#Request

| Header | Required | Description |
| --- | --- | --- |
| Authorization | Yes | Bearer velgent_live_… (see Authentication). |
| Content-Type | Yes | application/json. |
| Idempotency-Key | No | Any opaque string ≤ 128 chars. Repeated calls within 24h return the original response. |
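As a sketch of how the headers above fit together, the helper below prepares (but does not send) the POST using only the Python standard library. The function names are hypothetical, and using a UUID as the idempotency key is one reasonable choice rather than a requirement.

```python
import json
import urllib.request
import uuid
from typing import Optional

EXTRACT_URL = "https://aiengine.velgent.com/api/v1/extract"


def new_idempotency_key() -> str:
    """A random UUID string (36 chars), well under the 128-char limit."""
    return str(uuid.uuid4())


def build_request(
    api_key: str, body: dict, idempotency_key: Optional[str] = None
) -> urllib.request.Request:
    """Prepare a POST to /extract with the documented headers."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if idempotency_key is not None:
        if len(idempotency_key) > 128:
            raise ValueError("Idempotency-Key must be at most 128 characters")
        headers["Idempotency-Key"] = idempotency_key
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Sending is left to the caller, for example:
# with urllib.request.urlopen(build_request(API_KEY, body, new_idempotency_key())) as resp:
#     result = json.load(resp)
```

Reusing the same key when retrying a timed-out call should return the original response instead of triggering a second extraction, per the 24h window in the table.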

#Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | One of | Raw text input. Max ~200,000 chars. |
| file | ExtractFile | One of | Image or PDF, base64-encoded. |
| extraction_schema | ExtractionSchema | One of | Inline schema declaring fields to extract. |
| template | string | One of | Slug of a published template (admin → /extract-templates). |
| template_version | integer | No | Pin to a specific version. Default = current published version. |
| output_format | "json" \| "html_fields" \| "html_document" | No | Default "json". See Output formats. |
| options | ExtractOptions | No | Per-call knobs. |
| pii_action_override | "REDACT" \| "SUPPRESS" \| "ALLOW" \| null | No | Per-call override of the tenant's PII action policy. Applies to text inputs only; images go directly to the vision LLM. |

You must provide exactly one of text / file (which input). For output_format "json" and "html_fields" you must also provide exactly one of extraction_schema / template (which target). "html_document" is whole-document HTML reconstruction with no schema — image or PDF input only. The engine rejects requests that violate either constraint with HTTP 422.
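The one-of rules above can be checked client-side before the request is sent. The following is a sketch under that reading of the rules; build_body is a hypothetical helper, and the engine remains the authority on what it accepts.

```python
from typing import Any, Optional


def build_body(
    *,
    text: Optional[str] = None,
    file: Optional[dict] = None,
    extraction_schema: Optional[dict] = None,
    template: Optional[str] = None,
    output_format: str = "json",
) -> dict[str, Any]:
    """Assemble an /extract body, enforcing the documented one-of rules locally."""
    # Input: exactly one of text / file.
    if (text is None) == (file is None):
        raise ValueError("provide exactly one of text / file")
    if output_format == "html_document":
        # Whole-document reconstruction: file input only, no schema target.
        if file is None:
            raise ValueError("html_document needs an image or PDF input")
        if extraction_schema is not None or template is not None:
            raise ValueError("html_document takes no extraction_schema or template")
    elif output_format in ("json", "html_fields"):
        # Target: exactly one of extraction_schema / template.
        if (extraction_schema is None) == (template is None):
            raise ValueError("provide exactly one of extraction_schema / template")
    else:
        raise ValueError(f"unknown output_format: {output_format!r}")
    candidates = {
        "text": text,
        "file": file,
        "extraction_schema": extraction_schema,
        "template": template,
        "output_format": output_format,
    }
    return {key: value for key, value in candidates.items() if value is not None}
```

For example, build_body(text="Invoice INV-42 ...", template="invoice_v1") yields a body with text, template, and the default output_format, while passing both text and file raises ValueError locally.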

#Output formats

| Mode | Returns | Use when |
| --- | --- | --- |
| json (default) | Structured JSON with per-field confidence + anchors. | Downstream automation: invoice posting, ticket routing, ETL. |
| html_fields | Same JSON plus extracted_html, a deterministic HTML render (<dl>/<table>) of the extracted fields. | Embedding the extracted data in an email, ticket comment, or notification. |
| html_document | extracted_html only: semantic HTML reconstruction of the whole source document. No extracted / confidence / anchors. | Re-flowing scanned PDFs or document images into searchable / editable HTML. Image or PDF only. |

html_document mode does not accept extraction_schema or template — it produces a faithful HTML rendering of the source, not field extraction. The HTML is server-sanitised (no <script>, <style>, or event handlers) and wrapped in a velgent-extract velgent-extract-document section so you can scope CSS to it. We still recommend running DOMPurify (or equivalent) before injecting the HTML into a privileged context.

#ExtractFile

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| kind | "image" \| "pdf" | Yes | The kind of file. |
| mime_type | string | Yes | application/pdf, image/jpeg, image/png, image/webp, or image/tiff. |
| data | string (base64) | Yes | Base64-encoded file bytes. Sanity-capped at ~37 MB decoded. |
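A short sketch of building an ExtractFile object from raw bytes. The helper name is hypothetical, and reading "~37 MB" as 37 MiB is an assumption; the engine's exact cap may differ.

```python
import base64

# MIME types from the table above, mapped to the matching "kind".
ALLOWED_MIME = {
    "application/pdf": "pdf",
    "image/jpeg": "image",
    "image/png": "image",
    "image/webp": "image",
    "image/tiff": "image",
}
MAX_DECODED_BYTES = 37 * 1024 * 1024  # assumption: "~37 MB" read as 37 MiB


def make_extract_file(raw: bytes, mime_type: str) -> dict:
    """Return an ExtractFile dict, rejecting unsupported or oversized input early."""
    if mime_type not in ALLOWED_MIME:
        raise ValueError(f"unsupported mime_type: {mime_type!r}")
    if len(raw) > MAX_DECODED_BYTES:
        raise ValueError("file exceeds the decoded size cap")
    return {
        "kind": ALLOWED_MIME[mime_type],
        "mime_type": mime_type,
        "data": base64.b64encode(raw).decode("ascii"),
    }
```

Note that base64 inflates the payload by roughly a third, so the JSON body is larger than the decoded file.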

#ExtractOptions

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| language_hint | string | null | Hint for OCR / vision ("en", "de", …). |
| n_samples | integer (1–5) | 1 | When > 1, the engine runs the extraction n_samples times and folds the agreement rate into the confidence score. Each sample is one billable evaluation. |
| extract_tables | boolean | true | Tells the prompt builder to preserve table row/column structure. |
| page_range | string | null | PDF only: "1-3" or "2,5,7" to restrict processing. |

#ExtractionSchema

A typed declaration of the fields you want. Two top-level keys:

{
  "description": "Vendor invoice",
  "fields": {
    "invoice_number": { "type": "string",  "description": "Invoice ID", "pattern": "^INV-\\d+" },
    "total":          { "type": "number",  "description": "Grand total", "min": 0 },
    "currency":       { "type": "enum",    "description": "ISO code", "values": ["USD","EUR","GBP"] },
    "invoice_date":   { "type": "date",    "format": "iso8601_date" },
    "paid":           { "type": "boolean", "description": "Has it been paid" },
    "vendor": {
      "type": "object",
      "properties": {
        "name":    { "type": "string", "description": "Vendor company name" },
        "country": { "type": "string", "description": "Country", "nullable": true }
      }
    },
    "line_items": {
      "type": "array_of_object",
      "description": "Each row of the invoice",
      "item_schema": {
        "type": "object",
        "properties": {
          "description": { "type": "string", "description": "Item description" },
          "quantity":    { "type": "number", "description": "Quantity", "min": 0 },
          "unit_price":  { "type": "number", "description": "Unit price", "min": 0 }
        }
      }
    }
  }
}

#Field types

| Type | Constraints | Example |
| --- | --- | --- |
| boolean | | { "type": "boolean", "description": "Approved?" } |
| string | max_length, pattern, nullable | { "type": "string", "pattern": "^INV-\\d+", "description": "Invoice ID" } |
| integer | min, max | { "type": "integer", "min": 1, "max": 10, "description": "Priority" } |
| number | min, max | { "type": "number", "min": 0, "description": "Total" } |
| date | format (iso8601_date \| iso8601_datetime) | { "type": "date", "format": "iso8601_date" } |
| enum | values (array) | { "type": "enum", "values": ["USD","EUR"] } |
| string_array | min_items, max_items | { "type": "string_array", "max_items": 5 } |
| object | properties, required_fields | { "type": "object", "properties": { … } } |
| array_of_object | item_schema, min_items, max_items | { "type": "array_of_object", "item_schema": { … } } |

Every field also carries a description the LLM sees verbatim — write it like instructions to a junior employee.
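To make the constraint semantics concrete, here is a rough client-side checker for a few scalar types. It is a sketch, not the engine's validator: treating pattern as a regex search (not a full match) and honouring nullable on every type are assumptions, and check_field is a hypothetical name.

```python
import re
from typing import Any


def check_field(value: Any, spec: dict) -> bool:
    """Return True if value satisfies the constraints declared in spec."""
    if value is None:
        # Assumption: a missing value is only acceptable when nullable is set.
        return bool(spec.get("nullable", False))
    kind = spec["type"]
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "string":
        if not isinstance(value, str):
            return False
        if "max_length" in spec and len(value) > spec["max_length"]:
            return False
        return "pattern" not in spec or re.search(spec["pattern"], value) is not None
    if kind in ("integer", "number"):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if kind == "integer" and not isinstance(value, int):
            return False
        return spec.get("min", value) <= value <= spec.get("max", value)
    if kind == "enum":
        return value in spec["values"]
    raise NotImplementedError(f"type {kind!r} is not covered by this sketch")
```

A check like this is useful in tests for your own schemas, for instance to confirm that a pattern such as "^INV-\\d+" accepts the invoice numbers you expect before publishing a template.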

#Response

| Field | Type | Description |
| --- | --- | --- |
| request_id | string (UUID) | Surfaces in your Activity logs. |
| extracted | object | Values keyed by the schema field names. |
| confidence | object | 0.0–1.0 per leaf field, mirrors extracted shape. |
| anchors | object | { text, page } per leaf field, mirrors extracted shape. page is 1-indexed for PDFs / multi-image inputs; null for text or single-image. |
| components | object | Per-field ConfidenceComponents (see Confidence). |
| metadata | object | Cost / latency / provenance. See Metadata. |
| schema_drift | array of string | Field paths the model emitted that weren't in your schema. |
| warnings | array of string | Soft-fail signals (truncation, fallback paths, unparseable JSON). |
| extracted_html | string \| null | HTML rendering of the extracted content. Populated when output_format is "html_fields" (deterministic field render) or "html_document" (whole-document HTML). null for the default "json" mode. |

#Metadata

| Field | Type | Description |
| --- | --- | --- |
| input_kind | "text" \| "image" \| "pdf_text_layer" \| "pdf_rasterised" | Which code path the engine took. |
| pages_processed | integer | Pages actually processed (PDFs). |
| ocr_used | boolean | True when a scanned PDF was rasterised to images. |
| model_used | string | Provider/model that handled the call. |
| tokens_in, tokens_out | integer | Provider-reported counts. |
| latency_ms | integer | End-to-end engine latency. |
| redaction_count | integer | PII redactions on text input. |
| n_samples | integer | Echo of the request option. |

#Confidence

Velgent's per-field confidence is multi-signal, not just the LLM's self-rating. The final score is:

final = min(
  llm_self_conf,         # what the LLM said
  anchor_conf,           # does the value actually appear in the anchor text?
  source_quality,        # how trustworthy is the input itself?
)                        # with a hard 0.0 short-circuit if the value
                         # fails schema validation

For n_samples > 1, the agreement rate across samples is folded in as a fourth signal (still min).

Why min — a chain is as strong as its weakest link. A hallucination (anchor doesn't actually contain the value) should dominate over LLM optimism. An over-compressed blurry image should cap confidence regardless of how clean the LLM output looks.
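The min rule can be written out as a small function. This is a sketch of the formula as documented, not the engine's code; the function name and argument order are illustrative.

```python
from typing import Optional


def final_confidence(
    llm_self_conf: float,
    anchor_conf: float,
    source_quality: float,
    schema_validates: bool,
    agreement_rate: Optional[float] = None,
) -> float:
    """Combine the per-field signals by taking the weakest one."""
    if not schema_validates:
        return 0.0  # hard short-circuit on schema validation failure
    signals = [llm_self_conf, anchor_conf, source_quality]
    if agreement_rate is not None:
        signals.append(agreement_rate)  # fourth signal when n_samples > 1
    return min(signals)
```

With an LLM self-rating of 0.95, a failed anchor check at 0.20, and a source quality of 0.95, the result is 0.20: the anchor failure dominates, as intended.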

The components block on the response surfaces each signal so quality teams can explain a low score:

"components": {
  "total": {
    "llm_self_conf":    0.95,
    "anchor_conf":      0.20,    // anchor verification failed — likely hallucination
    "source_quality":   0.95,
    "schema_validates": true,
    "n_samples":        1,
    "agreement_rate":   null
  }
}
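Because the final score is a min, explaining a low score amounts to finding the smallest signal in the components entry. A possible helper, with a hypothetical name and the key names taken from the example above:

```python
SIGNALS = ("llm_self_conf", "anchor_conf", "source_quality", "agreement_rate")


def limiting_signal(component: dict) -> str:
    """Name the signal that determined a field's final confidence."""
    if not component.get("schema_validates", True):
        return "schema_validates"  # validation failure short-circuits to 0.0
    # Skip signals that are absent or null (agreement_rate when n_samples is 1).
    present = {name: component[name] for name in SIGNALS if component.get(name) is not None}
    return min(present, key=present.get)
```

Applied to the "total" entry above it returns "anchor_conf", which points the reviewer at the anchor text instead of the image quality or the model's self-rating.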

#Source quality bands

Computed automatically from input metadata. No model calls. V0 uses resolution + compression for images; sharpness (laplacian variance) is deferred to Phase 2.

| Input | Source quality |
| --- | --- |
| Raw text | 1.0 |
| Text-layer PDF | 0.95 |
| Born-digital image (PNG/lossless) | up to 0.95 |
| 2 MP+ JPEG | up to 0.9 |
| 0.5–2 MP JPEG | up to 0.8 |
| < 0.5 MP image | ≤ 0.5 |
| Heavily compressed JPEG (< 0.10 bytes/pixel) | ≤ 0.6 |
| Rasterised PDF | min across per-page image scores |
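For a rough pre-flight estimate, the JPEG rows of the table can be approximated locally. This sketch returns only the upper bound for a band; treating 1 MP as 1,000,000 pixels and combining the resolution and compression caps with min are assumptions, and the engine's actual scoring may differ.

```python
def jpeg_source_quality_cap(width: int, height: int, size_bytes: int) -> float:
    """Estimate the upper bound on source quality for a JPEG."""
    pixels = width * height
    if pixels <= 0:
        raise ValueError("width and height must be positive")
    megapixels = pixels / 1_000_000
    if megapixels >= 2.0:
        cap = 0.9
    elif megapixels >= 0.5:
        cap = 0.8
    else:
        cap = 0.5
    # Heavy compression: fewer than 0.10 bytes per pixel.
    if size_bytes / pixels < 0.10:
        cap = min(cap, 0.6)
    return cap
```

A check like this can warn a user to rescan or re-photograph a document before spending a billable call on an image whose confidence would be capped low anyway.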

#Anchors

For every extracted leaf value, the engine returns the exact source phrase it was read from. This is:

  1. A hallucination guard. Post-LLM the engine verifies the value appears in the anchor (with currency / formatting variants). Mismatch drops the confidence to 0.2.
  2. An audit trail. When a customer questions "where did that total come from", the answer is one field-lookup — concrete and citable.
  3. UI-ready. Apps building on top of /extract highlight the anchor text on the source document for the user.

Cost: ~20-30% extra output tokens per call. Mandatory in V0 — there is no opt-out flag.
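The anchor check in point 1 can be illustrated with a simplified stand-in. The engine's real matching rules are not documented here, so everything below is an assumption: numbers are compared after stripping comma thousands separators (US-style formatting only), and strings are compared case-insensitively with whitespace collapsed.

```python
import re
from decimal import Decimal

# A run of digits with optional comma separators and an optional decimal part.
NUMBER_TOKEN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _normalise(text: str) -> str:
    """Collapse whitespace and ignore case."""
    return " ".join(text.split()).casefold()


def value_in_anchor(value, anchor_text: str) -> bool:
    """Return True if the extracted value is literally supported by the anchor."""
    if isinstance(value, bool):
        raise TypeError("booleans have no literal form to look up")
    if isinstance(value, (int, float)):
        target = Decimal(str(value))
        for token in NUMBER_TOKEN.findall(anchor_text):
            if Decimal(token.replace(",", "")) == target:
                return True
        return False
    return _normalise(str(value)) in _normalise(anchor_text)
```

Under these assumptions, a total of 1234.5 is supported by the anchor "Total due: $1,234.50", while a total of 999 against the same anchor is not and would be the kind of mismatch that drops confidence to 0.2.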

#Templates

Pre-published schemas managed in the admin console:

  1. admin.velgent.com → Extract templates → New template — author the schema by hand, or paste a sample document and let Velgent propose one for you (see Templates).
  2. Call with {"template": "invoice_v1"} instead of inlining the schema. Faster requests, central control, atomic version swaps for breaking schema changes.

Pin a historical version with template_version (e.g. for replay / canary / rollback).

#Provider + residency

The Data Extractor uses the same per-tenant LLM router as Summariser and Policy Engine. BYOK is fully supported — bring your own Anthropic key and point the Data Extractor row at it under admin → Engine Settings → Model routing. Defaults to Anthropic Claude Sonnet 4.6 (vision-capable) on tenant onboarding.

| Input | Provider support (V0) |
| --- | --- |
| Text | Anthropic, Groq, Custom, BYOK |
| PDF text-layer | Anthropic, Groq, Custom, BYOK |
| Image | Anthropic only (vision-capable models) |
| PDF rasterised | Anthropic only |

Non-Anthropic vision-capable providers (OpenAI GPT-4o, etc.) can be wired in when a tenant brings the requirement — V0 returns a clean HTTP 400 naming the fix path if a non-vision provider is pinned and the input is an image.

#Errors

| Status | Code | When |
| --- | --- | --- |
| 400 | invalid_request | Bad mime type, undecodable base64, malformed schema, both text and file provided, etc. |
| 400 | | Non-vision provider pinned on extract routing + image/PDF-rasterised input. Error names the fix path. |
| 404 | not_found | template slug doesn't resolve for your tenant, or the pinned template_version doesn't exist. |
| 413 | payload_too_large | PDF beyond the 10-page hard cap. Response includes a hint to use page_range. |
| 422 | invalid_request | Pydantic validation: exactly one of text/file and exactly one of extraction_schema/template must be set. |
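One way to recover from a 413 is to split a long PDF into several calls, each restricted with options.page_range. The helper below is a sketch with a hypothetical name; it assumes page_range is 1-indexed and inclusive, and that a bare page number is accepted for a single page, neither of which is stated explicitly above.

```python
def page_ranges(total_pages: int, cap: int = 10) -> list[str]:
    """Split 1..total_pages into page_range strings of at most cap pages each."""
    if total_pages < 1 or cap < 1:
        raise ValueError("total_pages and cap must be positive")
    ranges = []
    for start in range(1, total_pages + 1, cap):
        end = min(start + cap - 1, total_pages)
        # Assumption: a single page is written as a bare number, e.g. "21".
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges
```

For a 23-page PDF this produces "1-10", "11-20", and "21-23", so the document can be processed in three requests whose results are merged by the caller.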

#Security

V0 ships with these baked-in guarantees (engine doc 14 §"Security + confidentiality"):

  • No persistent storage of the file. Bytes held in memory only; cleared in a finally block before the response returns.
  • No file content in logs. Audit captures mime_type, pages, size_bytes, SHA-256 content_hash, and a redacted preview of extracted text only (capped at 2KB).
  • PII redaction before the LLM for text inputs (existing Presidio path).
  • Tenant isolation + BYOK + residency-aware LLM routing inherited from the existing per-tenant LLM router.
  • TLS in transit + per-request memory hygiene.

Persistent file storage / re-extraction workflows are opt-in tenant config in V1+.

