A fallback chain says: “if this model fails or rate-limits, try this one next, then this one.” The gateway handles it transparently — your code just sees a successful response. For the conceptual overview see Fallback chains. This guide walks through configuring one end-to-end.

1. Pick a source model

The source is the model your application calls. Common choices:
  • gpt-4o — your primary high-quality model
  • text-embedding-3-large — your primary embeddings model
  • gemini-2-5-flash — your latency-sensitive model
You can attach one chain per source slug per workspace.

2. Pick fallback targets

Order them by your preference — typically same family / same tier first, then a different provider, then a cheaper model. Three or four levels is the sweet spot; deeper chains rarely fire, and when they do they add latency and are harder to debug. Examples:
Source: gpt-4o
├─ Priority 1: gpt-4o-mini             (same provider, cheaper — covers OpenAI rate limits)
├─ Priority 2: claude-sonnet-4-5       (different provider — covers OpenAI outage)
└─ Priority 3: gemini-2-5-flash        (last resort — always different region)
Source: text-embedding-3-large
└─ Priority 1: text-embedding-3-small  (same provider — cheaper, lower-dim, but graceful degradation)
For embeddings, mixing providers in a fallback breaks vector compatibility — the dimensions and embedding spaces differ. Keep fallbacks within the same model family or downgrade to a smaller model from the same provider.
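If you do use the large → small fallback, it helps to check which model served and what vector length came back before indexing. A minimal sketch, assuming the gateway also proxies an OpenAI-style /v1/embeddings endpoint (only /v1/chat/completions appears in this guide) and that jq is installed:
# Check which model actually served, and the vector length, before indexing
headers_file=$(mktemp)
dims=$(curl -s -D "$headers_file" https://api.infery.ai/v1/embeddings \
  -H "Authorization: Bearer $INFERY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "text-embedding-3-large", "input": "hi"}' \
  | jq '.data[0].embedding | length')
grep -i '^x-model-used' "$headers_file"   # did a fallback serve this request?
echo "dimensions: $dims"                  # 3072 from -large, 1536 from -small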

3. Create the chain

Settings → Fallbacks → New chain.
  1. Pick the source model
  2. Add fallback models one by one, drag to reorder
  3. (Optional) Per-target overrides: max retries, only retry on specific error codes
  4. Save
The chain is active immediately for every request from the workspace targeting that source slug.
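If it helps to reason about what you just built, here is the gpt-4o chain from step 2 written out as data. This is purely illustrative: chains are created in the dashboard, and every field name below is an assumption, not a supported config format.
# Illustrative sketch only; these field names are assumptions, not a real API or config
cat <<'EOF'
{
  "source": "gpt-4o",
  "targets": [
    {"priority": 1, "model": "gpt-4o-mini"},
    {"priority": 2, "model": "claude-sonnet-4-5", "max_retries": 1},
    {"priority": 3, "model": "gemini-2-5-flash", "retry_on": [429, 502, 503, 504]}
  ]
}
EOF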

4. Test it

Force the primary to fail to confirm the chain works:
# Set the primary's rate limit to 0 in dashboard → save
curl https://api.infery.ai/v1/chat/completions \
  -H "Authorization: Bearer $INFERY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]}' -i
Look at the response headers:
x-fallback-from: gpt-4o
x-model-used: gpt-4o-mini
x-fallback-depth: 1
x-fallback-depth: 1 means the first fallback fired. Depth 0 means the primary served and the chain didn't activate. Depth ≥ 2 means the chain is digging deeper than a single level; investigate before trusting the setup. When you're done, restore the rate limit.
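If you run this check often, a small script can assert on the headers directly. A sketch using only the headers documented above, assuming the primary's rate limit is still set to 0:
# Fail loudly unless exactly the first fallback served the request
headers=$(curl -s -o /dev/null -D - https://api.infery.ai/v1/chat/completions \
  -H "Authorization: Bearer $INFERY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]}')
depth=$(printf '%s' "$headers" | tr -d '\r' | awk -F': ' 'tolower($1)=="x-fallback-depth" {print $2}')
echo "fallback depth: ${depth:-0}"
[ "${depth:-0}" -eq 1 ] || echo "unexpected depth; check the chain and the rate limit override"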

5. Observe in production

Dashboards in Settings → Usage:
  • Fallback rate per chain (how often the primary failed)
  • Depth distribution (how often the chain went 1, 2, 3 deep)
  • Median added latency (one extra hop typically adds 50–200 ms)
Set an alert on spikes in fallback depth ≥ 2 — that usually means a primary outage you want to know about.
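The dashboards are the source of truth, but you can spot-check the live depth distribution from the client side too. A rough sketch that samples 20 requests against the step 4 endpoint and tallies the x-fallback-depth values:
# Rough client-side sample; not a replacement for the Settings → Usage dashboards
for i in $(seq 1 20); do
  curl -s -o /dev/null -D - https://api.infery.ai/v1/chat/completions \
    -H "Authorization: Bearer $INFERY_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}' \
    | tr -d '\r' | awk -F': ' 'tolower($1)=="x-fallback-depth" {print $2}'
done | sort -n | uniq -c   # count per depth, e.g. 18 requests at depth 0, 2 at depth 1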

What does not fall back

The gateway only retries on transient/upstream errors:
  • 429 rate limit, 502/503/504 upstream gateway, network timeout
  • Provider-specific transient codes that the gateway maps to retryable
It does not retry on:
  • 4xx errors caused by the request itself (400 invalid_request_error, 401, 403, 422)
  • 402 insufficient_credits
  • Validation failures
These are caller errors — fallback wouldn’t help.
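You can confirm this the same way as in step 4: send a deliberately malformed request and note that the error comes straight back. A sketch; the exact status and body depend on the upstream provider:
# A 400 invalid_request_error does not trigger the chain
curl https://api.infery.ai/v1/chat/completions \
  -H "Authorization: Bearer $INFERY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o"}' -i   # "messages" omitted on purpose: expect the 4xx back, no fallback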

Disabling per call

For diagnostics, force primary-only:
curl ... -H "x-disable-fallback: true"
Useful when you want to confirm “is the primary actually up right now” rather than “did anything succeed”.
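For example, combined with the step 4 setup (primary rate limit at 0), the full request looks like this; expect the primary's 429 to surface instead of a fallback response:
curl https://api.infery.ai/v1/chat/completions \
  -H "Authorization: Bearer $INFERY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "x-disable-fallback: true" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]}' -i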

Cost

You pay for the model that served the request, not for the failed attempts. Failed primary attempts are free; the fallback that served is what shows up on your invoice, at that model's rates. For example, if 2% of gpt-4o requests fall back to gpt-4o-mini, roughly 2% of that traffic is billed at gpt-4o-mini rates and the rest at gpt-4o rates.

Patterns

  • Cost-tier descent: every step cheaper than the last. Best when quality differences are tolerable for the small fraction of fallback traffic.
  • Provider diversification: every step a different provider. Best for resilience; quality may drift but availability is preserved.
  • Region failover (Enterprise): every step a different region. Best for data-residency-aware deployments.
You can mix these — chain gpt-4o → gpt-4o-mini (cheaper, same provider) → claude-sonnet-4-5 (different provider, similar quality) → gemini-2-5-flash (different provider, faster, last resort).