# Rate limits

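Each API key is rate-limited independently: limits are enforced per key, not per organisation. Creating separate keys for distinct workloads (for example, production traffic and a batch backfill) is therefore a reasonable pattern, because one workload cannot use up the other's per-minute budget.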

## Default limits

| Tier | Requests per minute | Tokens per minute |
| --- | --- | --- |
| Trial | 30 | 30,000 |
| Starter | 120 | 200,000 |
| Growth | 600 | 1,000,000 |
| Enterprise | Negotiated | Negotiated |

If you need a higher limit, email us.
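Because limits apply per key, a client can pace itself per key before the server has to reject anything. The following is a minimal client-side sketch of a sliding-window limiter. It is not part of the API: `TIER_RPM`, `SlidingWindowLimiter`, and `limiterFor` are hypothetical names, and the caps are copied from the table. It tracks only requests per minute, not tokens per minute, so it reduces 429s rather than ruling them out.

```javascript
// Requests-per-minute caps copied from the table above.
// Enterprise is negotiated, so pass your agreed number to limiterFor directly.
const TIER_RPM = { trial: 30, starter: 120, growth: 600 }

// Client-side sliding-window limiter: allows at most maxRequests in any windowMs span.
class SlidingWindowLimiter {
  constructor(maxRequests, windowMs = 60_000) {
    if (!Number.isInteger(maxRequests) || maxRequests < 1) {
      throw new RangeError('maxRequests must be a positive integer')
    }
    this.maxRequests = maxRequests
    this.windowMs = windowMs
    this.sent = [] // timestamps (ms) of requests still inside the window
  }

  // Returns 0 if a request may be sent at `now`, otherwise the ms to wait first.
  delayFor(now = Date.now()) {
    // Drop timestamps that have slid out of the window.
    while (this.sent.length > 0 && now - this.sent[0] >= this.windowMs) this.sent.shift()
    if (this.sent.length < this.maxRequests) return 0
    return this.windowMs - (now - this.sent[0])
  }

  // Records that a request was sent at `now`.
  record(now = Date.now()) {
    this.sent.push(now)
  }
}

// One limiter per API key, mirroring the server's per-key enforcement.
const limiters = new Map()
function limiterFor(apiKey, requestsPerMinute) {
  let limiter = limiters.get(apiKey)
  if (!limiter) {
    limiter = new SlidingWindowLimiter(requestsPerMinute)
    limiters.set(apiKey, limiter)
  }
  return limiter
}
```

A caller would check `delayFor()`, sleep for that many milliseconds if the result is non-zero, then call `record()` and send. A sliding window never allows more than the cap in any 60-second span, so it also stays under whatever 60-second window the server uses (clock skew aside).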

## Rate-limit headers

Every response includes:

| Header | Description |
| --- | --- |
| `X-RateLimit-Limit` | Your per-minute request cap. |
| `X-RateLimit-Remaining` | Requests left in the current 60-second window. |
| `X-RateLimit-Reset` | When the window resets, as UTC epoch seconds. |
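Reading these headers lets a client slow down before it is rejected. A minimal sketch, assuming a fetch `Headers` object or anything else with a `get(name)` method; `readRateLimit` is a hypothetical helper that converts the reset time from epoch seconds into a millisecond delay.

```javascript
// Reads the documented rate-limit headers. Missing headers come back as null.
function readRateLimit(headers, nowMs = Date.now()) {
  const num = name => {
    const raw = headers.get(name) // fetch Headers returns null when absent
    return raw === null || raw === undefined || raw === '' ? null : Number(raw)
  }
  const limit = num('X-RateLimit-Limit')
  const remaining = num('X-RateLimit-Remaining')
  const resetEpochSec = num('X-RateLimit-Reset') // UTC epoch seconds
  // Milliseconds until the window resets, never negative.
  const msUntilReset = resetEpochSec === null ? null : Math.max(0, resetEpochSec * 1000 - nowMs)
  return { limit, remaining, resetEpochSec, msUntilReset }
}
```

For example, when `remaining` is `0`, waiting `msUntilReset` before the next request avoids an almost certain 429.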

When you exceed the limit, the API returns `429 Too Many Requests` with a `Retry-After` header giving the number of seconds to wait:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 18
Content-Type: application/json

{ "error": { "code": "rate_limited", "message": "Rate limit exceeded. Retry after 18s." } }
```
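A small helper can turn that header into a delay. This page documents `Retry-After` as whole seconds; the general HTTP definition of the header (RFC 9110) also allows an HTTP-date, which this sketch accepts defensively. `retryAfterMs` is a hypothetical name.

```javascript
// Converts a Retry-After value into milliseconds to wait, or null if absent/unparseable.
function retryAfterMs(value, nowMs = Date.now()) {
  if (value === null || value === undefined || value.trim() === '') return null
  const trimmed = value.trim()
  // Documented form: delay in whole seconds, e.g. "18".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000
  // Defensive: HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed)
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs)
}
```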

## Back-off strategy

Use exponential back-off with jitter when you see 429 or 503:

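```js
async function callWithBackoff(fn, attempts = 5) {
  // fn must start a fresh request each time it is called.
  for (let i = 0; i < attempts; i++) {
    const res = await fn()
    if (res.status !== 429 && res.status !== 503) return res
    if (i === attempts - 1) break // no point sleeping after the final attempt

    // Prefer the server's Retry-After (seconds); otherwise back off 1s, 2s, 4s, ...
    const retryAfter = Number(res.headers.get('Retry-After')) || 2 ** i
    // Add up to 30% random jitter so many clients don't retry in lockstep.
    const jitter = Math.random() * 0.3 * retryAfter
    await new Promise(r => setTimeout(r, (retryAfter + jitter) * 1000))
  }
  throw new Error('Exhausted retries')
}
```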

**Idempotency-Key and retries**

When retrying, send the same `Idempotency-Key` header on every attempt, so that a request that succeeded on the server but whose response was lost in transit is not applied twice.
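Since `callWithBackoff` calls `fn()` once per attempt, the key must be created outside `fn`; otherwise every retry would carry a fresh key and defeat the purpose. A minimal sketch: `idempotentRequest` is a hypothetical helper, `send` stands in for your own request code, and the key format (a random UUID from Web Crypto's `crypto.randomUUID()`) is an assumption, since this page does not specify one.

```javascript
// Hypothetical helper: fixes one Idempotency-Key for a single logical request and
// returns a zero-argument function that reuses it on every call (i.e. every retry).
// `send` receives the extra headers to merge into the real request.
function idempotentRequest(send, makeKey = () => globalThis.crypto.randomUUID()) {
  const key = makeKey() // generated once, before any attempt is made
  return () => send({ 'Idempotency-Key': key })
}
```

Passing the returned function to `callWithBackoff` makes every retry of that one logical request carry the same key. The next logical request should call `idempotentRequest` again to get a new key.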

## Plan quotas (`quota_exceeded`)

The rate limits above are short-window protections: they govern requests and tokens per minute. Plan quotas are the monthly caps you subscribed to, namely the workflow count and the token ceiling on your tier (see the pricing page for how these are defined).

When your tenant reaches its effective workflow cap or its token ceiling for the calendar month, the engine returns:

```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json

{
  "error":  "quota_exceeded",
  "detail": "workflow_cap reached",
  "used":   20000,
  "cap":    20000,
  "kind":   "workflow"
}
```

The `kind` field distinguishes the two cases:

| `kind` | Meaning | What to do |
| --- | --- | --- |
| `workflow` | You have used every billable workflow that your tier and active expansions allow this month. | Buy a workflow expansion (+25%, +50%, or +100%) in the admin console, upgrade to the next tier, or wait for the next calendar month. |
| `token` | Workflows are still available, but the abuse-limit token ceiling was reached. | The same options apply: buy a token expansion, upgrade, or wait. The token ceiling only fires for unusually heavy workflows, so consider whether your prompts can be made more concise. |

`Retry-After` is not set for `quota_exceeded`, because the quota resets at the calendar-month boundary, not within seconds. Rate-limit 429s, by contrast, always carry `Retry-After`.
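The `callWithBackoff` loop retries every 429, but a `quota_exceeded` 429 keeps failing until the month rolls over or you buy an expansion. A sketch of telling the two apart from the documented bodies; `classify429` is a hypothetical helper, and it treats any 429 that is not `quota_exceeded` as a short-window rate limit. Note that the two bodies differ in shape: `error` is an object for rate limits and a string for quota errors.

```javascript
// Classifies a 429 using the two body shapes documented on this page:
//   rate limit: { "error": { "code": "rate_limited", "message": ... } } plus Retry-After
//   plan quota: { "error": "quota_exceeded", "kind": "workflow" | "token", "used": ..., "cap": ... }
function classify429(body, retryAfterHeader) {
  if (body && body.error === 'quota_exceeded') {
    // Monthly cap reached: retrying before the month rolls over will not succeed.
    return { retry: false, reason: 'quota_exceeded', kind: body.kind, used: body.used, cap: body.cap }
  }
  // Anything else is treated as a short-window rate limit.
  const seconds = Number(retryAfterHeader)
  return {
    retry: true,
    reason: 'rate_limited',
    // null when Retry-After is missing or unparseable; fall back to exponential back-off then.
    waitMs: Number.isFinite(seconds) && seconds > 0 ? seconds * 1000 : null,
  }
}
```

Inside a retry loop, reading the body with `await res.clone().json()` leaves the original response consumable, and the loop can stop as soon as `retry` is `false`.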

## Soft warnings at 80%

When current usage reaches 80% of either effective cap, successful responses include warning headers, so you can react before you hit the cap:

```http
X-Usage-Warning: approaching_workflow_cap
X-Usage-Used:    16003
X-Usage-Cap:     20000
```

Watch for these headers in your client, and either surface the warning to your operators or auto-purchase an expansion (the admin API supports both).
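A sketch of reading the soft-warning headers; `readUsageWarning` is a hypothetical helper that accepts a fetch `Headers` object or anything with a `get(name)` method. This page shows only the `approaching_workflow_cap` value, so the sketch passes the warning string through unchanged instead of assuming other values. What to do with the result is left to the caller.

```javascript
// Reads the 80% soft-warning headers from a successful response.
// Returns null when no X-Usage-Warning header is present.
function readUsageWarning(headers) {
  const warning = headers.get('X-Usage-Warning')
  if (!warning) return null
  const used = Number(headers.get('X-Usage-Used'))
  const cap = Number(headers.get('X-Usage-Cap'))
  // Fraction of the cap consumed, e.g. 16003 / 20000 = 0.80015.
  return { warning, used, cap, fraction: cap > 0 ? used / cap : null }
}
```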

## Discovering current usage

The public `GET /api/v1/usage/me` endpoint returns your tenant's current usage and effective caps, so you don't need to open the admin console:

```bash
curl https://aiengine.velgent.com/api/v1/usage/me \
  -H "Authorization: Bearer $VELGENT_API_KEY"
```

The response includes per-day buckets, totals, and the effective workflow and token caps after expansions. It has the same shape as the admin `/api/admin/usage/orgs/{me}` endpoint.
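The same call from JavaScript, as a sketch: `usageRequest` and `fetchUsage` are hypothetical helpers, the API key is passed in rather than read from `$VELGENT_API_KEY`, and the parsed JSON is returned as-is because this page does not list the response's field names. It relies on the global `fetch` available in browsers and Node 18+.

```javascript
const USAGE_URL = 'https://aiengine.velgent.com/api/v1/usage/me'

// Builds the fetch arguments equivalent to the curl command above.
function usageRequest(apiKey) {
  return [USAGE_URL, { method: 'GET', headers: { Authorization: `Bearer ${apiKey}` } }]
}

// Fetches current usage and effective caps; fetchImpl is injectable for testing.
async function fetchUsage(apiKey, fetchImpl = fetch) {
  const res = await fetchImpl(...usageRequest(apiKey))
  if (!res.ok) throw new Error(`usage request failed: HTTP ${res.status}`)
  return res.json() // field names are not assumed here
}
```

For example, `await fetchUsage(process.env.VELGENT_API_KEY)` would return the same JSON that the curl command prints.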

**Quota errors do not charge for the rejected request**

When the engine returns `quota_exceeded`, no workflow row is written and no LLM call is made, so you are not billed for the rejected call. The rejection itself is fast (a single database lookup), so the rate-limit headers from this endpoint are unaffected.