# Rate limits
Each API key is rate-limited independently: limits are enforced per key, not per organisation. Creating separate keys for distinct workloads (for example, production traffic and a batch backfill) is therefore a fine pattern.
# Default limits
| Tier | Requests per minute | Tokens per minute |
|---|---|---|
| Trial | 30 | 30,000 |
| Starter | 120 | 200,000 |
| Growth | 600 | 1,000,000 |
| Enterprise | Negotiated | Negotiated |
If you need a higher limit, email us.
# Rate-limit headers
Every response includes:
| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Your per-minute request cap. |
| `X-RateLimit-Remaining` | Requests left in the current 60-second window. |
| `X-RateLimit-Reset` | UTC epoch seconds when the window resets. |
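
One way to use these headers is to pause proactively once the remaining budget reaches zero, instead of waiting for a 429. The sketch below is illustrative only: `msUntilNextRequest` is a hypothetical helper name, and it assumes a fetch-style `headers` object whose `get()` returns the header value as a string.

```javascript
// Hypothetical helper (not part of any SDK): how long to wait before the
// next request, based on the rate-limit headers of the previous response.
// `now` is milliseconds since the epoch and is injectable for testing.
function msUntilNextRequest(headers, now = Date.now()) {
  const rawRemaining = headers.get('X-RateLimit-Remaining')
  const rawReset = headers.get('X-RateLimit-Reset') // UTC epoch seconds

  // Headers missing: nothing to go on, so do not delay.
  if (rawRemaining == null || rawReset == null) return 0

  const remaining = Number(rawRemaining)
  const resetAt = Number(rawReset)
  // Malformed values: do not delay.
  if (!Number.isFinite(remaining) || !Number.isFinite(resetAt)) return 0

  // Budget left in this window: send immediately.
  if (remaining > 0) return 0

  // Budget exhausted: wait until the window resets (never a negative delay).
  return Math.max(0, resetAt * 1000 - now)
}
```

A caller could `await` a timer for the returned number of milliseconds before issuing the next request. This complements, rather than replaces, handling of the 429 response itself.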

When you exceed the limit, you'll get a `429` response with a `Retry-After` header (seconds to wait):

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 18
Content-Type: application/json

{ "error": { "code": "rate_limited", "message": "Rate limit exceeded. Retry after 18s." } }
```
# Back-off strategy
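Use exponential back-off with jitter when you see `429` or `503`:

```javascript
// Retry `fn` on 429/503 with exponential back-off and jitter.
// `fn` must return a fetch-style Response.
async function callWithBackoff(fn, attempts = 5) {
  for (let i = 0; i < attempts; i++) {
    const res = await fn()
    if (res.status !== 429 && res.status !== 503) return res
    // Prefer the server's Retry-After (seconds); otherwise wait 1, 2, 4, ... seconds.
    const retryAfter = Number(res.headers.get('Retry-After')) || 2 ** i
    // Add up to 30% random jitter so clients don't retry in lockstep.
    const jitter = Math.random() * 0.3 * retryAfter
    await new Promise(r => setTimeout(r, (retryAfter + jitter) * 1000))
  }
  throw new Error('Exhausted retries')
}
```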
When retrying, send the same `Idempotency-Key` header so a request that succeeded on the server but failed in transit doesn't get applied twice.
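
A minimal sketch of reusing one key across retries, built on the same back-off idea as above. Several things here are assumptions rather than documented behaviour: `postWithIdempotency`, `fetchImpl`, and `makeKey` are hypothetical names, the key format is not specified by this page (a UUID is used as a plausible default), and `crypto.randomUUID()` is assumed to be available globally, as it is in modern browsers and recent Node.js releases.

```javascript
// Hypothetical helper: POST with one Idempotency-Key shared by every attempt.
// `fetchImpl` is any fetch-compatible function; `makeKey` produces the key.
async function postWithIdempotency(
  fetchImpl,
  url,
  body,
  attempts = 5,
  makeKey = () => globalThis.crypto.randomUUID() // assumed key format
) {
  // Generate the key once, outside the loop, so every retry is recognisable
  // as the same logical request.
  const idempotencyKey = makeKey()

  for (let i = 0; i < attempts; i++) {
    let res = null
    try {
      res = await fetchImpl(url, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey,
        },
        body: JSON.stringify(body),
      })
    } catch (err) {
      // Network failure: the server may or may not have applied the request,
      // which is exactly the case the shared key protects against.
    }
    if (res && res.status !== 429 && res.status !== 503) return res

    // Honour Retry-After when present, otherwise back off exponentially.
    const retryAfter = (res && Number(res.headers.get('Retry-After'))) || 2 ** i
    await new Promise(r => setTimeout(r, retryAfter * 1000))
  }
  throw new Error('Exhausted retries')
}
```

The point is where the key is created: outside the retry loop. Creating a fresh key per attempt would defeat the purpose.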
# Plan quota: `quota_exceeded`
Rate limits (above) are short-window protections that govern requests per minute. Plan quotas are the monthly caps you subscribed to: the workflow count and the token ceiling on your tier (see pricing for the model).
When your tenant crosses its effective workflow cap or its token ceiling for the calendar month, the engine returns:

```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json

{
  "error": "quota_exceeded",
  "detail": "workflow_cap reached",
  "used": 20000,
  "cap": 20000,
  "kind": "workflow"
}
```
The `kind` field distinguishes the two cases:

| `kind` | Meaning | What to do |
|---|---|---|
| `workflow` | You used every billable workflow your tier and active expansions allow this month. | Buy a workflow expansion (+25%, +50%, or +100%) in the admin, upgrade to the next tier, or wait for the next calendar month. |
| `token` | Workflows are still available, but the abuse-limit token ceiling fired. | Same options: buy a token expansion, upgrade, or wait. The token ceiling only fires for unusually heavy workflows; consider whether prompts can be made more concise. |

`Retry-After` is not set for `quota_exceeded`, because the calendar-month boundary is not a few seconds away. Compare with the rate-limit 429s above, which always carry `Retry-After`.
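
Because both conditions arrive as a 429, a retry loop that keys only on the status code would keep retrying a quota rejection to no effect. The sketch below shows one way a client might tell them apart, using only the two response bodies shown on this page. `classify429` and the shape of its return value are hypothetical.

```javascript
// Hypothetical helper: decide what to do with a 429.
// `body` is the parsed JSON response; `retryAfterHeader` is the raw
// Retry-After header value (a string, or null/undefined when absent).
function classify429(body, retryAfterHeader) {
  // Plan quota: "error" is the string "quota_exceeded" and retrying
  // within the same month will not help.
  if (body && body.error === 'quota_exceeded') {
    return { retry: false, reason: 'quota_exceeded', kind: body.kind, used: body.used, cap: body.cap }
  }

  // Rate limit: wait for the advertised number of seconds, then retry.
  const seconds = Number(retryAfterHeader)
  if (retryAfterHeader != null && Number.isFinite(seconds)) {
    return { retry: true, reason: 'rate_limited', waitMs: seconds * 1000 }
  }

  // Unrecognised shape: retryable, but the caller picks the delay.
  return { retry: true, reason: 'unknown', waitMs: null }
}
```

A back-off wrapper could call this before sleeping and stop immediately when `retry` is `false`, surfacing `kind`, `used`, and `cap` to an operator instead.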
# Soft warnings at 80%
When current usage reaches 80% of either effective cap, successful responses include a warning header so you can react before the wall:

```http
X-Usage-Warning: approaching_workflow_cap
X-Usage-Used: 16003
X-Usage-Cap: 20000
```

Watch for this in your client and surface it to your operators or auto-purchase an expansion (the admin API supports both).
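
A minimal sketch of watching for the warning in a client. `readUsageWarning` is a hypothetical helper, and it assumes a fetch-style `headers` object; what to do with the result (alerting, purchasing an expansion) is left to the caller.

```javascript
// Hypothetical helper: extract the soft-warning headers from a response.
// Returns null when no warning header is present.
function readUsageWarning(headers) {
  const warning = headers.get('X-Usage-Warning')
  if (warning == null) return null

  const used = Number(headers.get('X-Usage-Used'))
  const cap = Number(headers.get('X-Usage-Cap'))
  return {
    warning,
    used,
    cap,
    // Fraction of the cap consumed, or null if the cap is missing or zero.
    fraction: cap > 0 ? used / cap : null,
  }
}
```

With the example headers above, `fraction` comes out just over 0.8, which matches the 80% threshold described in this section.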
# Discovering current usage
The public `GET /api/v1/usage/me` endpoint returns the tenant's current usage and effective caps without going to the admin console:

```shell
curl https://aiengine.velgent.com/api/v1/usage/me \
  -H "Authorization: Bearer $VELGENT_API_KEY"
```

The response includes per-day buckets, totals, and the effective workflow and token caps after expansions. It has the same shape as the admin `/api/admin/usage/orgs/{me}` endpoint.

When the engine returns `quota_exceeded`, no workflow row is written and no LLM call is made, so you aren't billed for the rejected call. The rejection itself is fast (a single DB lookup), so rate-limit headers from this endpoint are unaffected.