Fallback chains let you define what happens when a model call fails. Instead of the error bubbling up to your app, Infery transparently retries on a backup model of your choice.

Use cases

  • Primary provider outage — OpenAI 429s during peak hours → fall back to Google Gemini Flash
  • Cost-tier progression — try gpt-4o, on rate-limit step down to gpt-4o-mini, then gemini-flash
  • Capability routing — use a PDF-native model first; if unavailable, use a vision model with our PDF-to-image preprocessor
  • Regional failover — EU customers fall back from an OpenAI model to a Google one hosted in the EU

Setting up a chain

Go to Settings → Fallbacks → New chain (or edit an existing one). A chain is attached to a source model slug and lists fallback models in priority order:
Source: gpt-4o
├─ Priority 1: gpt-4o-mini
├─ Priority 2: gemini-2-5-flash
└─ Priority 3: claude-haiku-4-5
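Conceptually, a chain is an ordered list keyed by the source model slug. A minimal sketch of the lookup, using the slugs from the example above (the data structure is hypothetical — real chains live in Settings → Fallbacks, not in your code):

```python
# Hypothetical in-memory view of the chain configured above.
FALLBACK_CHAINS = {
    "gpt-4o": ["gpt-4o-mini", "gemini-2-5-flash", "claude-haiku-4-5"],
}

def candidates(source_slug: str) -> list[str]:
    """Models to try for a request: the primary first, then fallbacks in priority order."""
    return [source_slug, *FALLBACK_CHAINS.get(source_slug, [])]

print(candidates("gpt-4o"))
```

A slug with no chain simply resolves to itself, which matches the default (no-fallback) behaviour.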

When fallbacks fire

The gateway steps through fallbacks when the primary (or prior fallback) fails with one of:
  • 429 rate_limit_exceeded
  • 503 service_unavailable
  • 502 bad_gateway
  • Provider-specific errors tagged as retryable
  • Network timeout / connection reset
Not retried: errors caused by the request itself — bad prompt, invalid params, auth failure, exhausted quota. Switching models wouldn’t fix these, so they’re returned to the caller immediately.

Transparent to the caller

Client code doesn’t change. The response comes back in OpenAI format as normal, plus extra headers:
x-fallback-from: gpt-4o
x-model-used: gpt-4o-mini
x-fallback-depth: 1
Use these headers in logs/analytics to see which fallbacks are active in production.

Cost accounting

You’re billed for the final model that served the request: if a fallback to a cheaper model succeeds, you pay the cheaper price. Failed attempts earlier in the chain don’t incur cost.

Disabling per call

Add header x-disable-fallback: true on an individual request to force primary-only behaviour (useful for testing which primary is actually up).
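For example, when building the request headers yourself (the helper and key are placeholders; only the `x-disable-fallback` header is from the docs):

```python
# Sketch: per-request opt-out of fallbacks for an OpenAI-compatible call.
def build_headers(api_key: str, disable_fallback: bool = False) -> dict[str, str]:
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if disable_fallback:
        # Force primary-only behaviour; the chain is skipped entirely.
        headers["x-disable-fallback"] = "true"
    return headers

print(build_headers("sk-example", disable_fallback=True)["x-disable-fallback"])  # true
```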