Turning it on
Set "stream": true. The response is text/event-stream: a sequence of data: {...}\n\n chunks ending in data: [DONE]\n\n.
Chunk anatomy
A typical stream is a run of delta chunks, then usage chunks (events whose choices array is empty), then data: [DONE]. The OpenAI SDK ignores chunks with empty choices, so the usage chunks pass through harmlessly; only your raw parser sees them.
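To make the shape concrete, here is an illustrative transcript (all IDs and values are invented; exact payloads vary by model and provider):

```
data: {"id":"cmpl-1","choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}

data: {"id":"cmpl-1","choices":[{"index":0,"delta":{"content":"lo"},"finish_reason":"stop"}]}

data: {"id":"cmpl-1","choices":[],"usage":{"prompt_tokens":8,"completion_tokens":2,"credits_used":0.0004}}

data: [DONE]
```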
Reading credits_used
The chunk that contains credits_used is always the second-to-last event. If you want per-call cost on the client without polling Usage, read it off the stream directly. Note that credits_used is not part of the standard OpenAI response types, so in typed SDKs expect to go through extra_body / a cast rather than typed fields.
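A minimal client-side sketch, assuming credits_used rides on (or next to) the usage object of its chunk; the helper name is ours, so adjust the lookup to the actual payload shape:

```javascript
// Sketch: extract credits_used from parsed stream chunks.
// `chunks` = JSON.parse of each data: line, excluding [DONE].
function creditsUsed(chunks) {
  for (const chunk of chunks) {
    // The docs say the carrying chunk is the second-to-last event,
    // but scanning is more robust than hard-coding the position.
    const c = chunk.credits_used ?? chunk.usage?.credits_used;
    if (c !== undefined) return c;
  }
  return null; // stream ended without a cost chunk
}
```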
Tool calls in a stream
Tool call deltas arrive in the same delta channel: accumulate the function.arguments fragments by tool_calls[i].index until you see finish_reason: "tool_calls", then parse the assembled JSON. See Tool calling.
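The accumulation step can be sketched like this; `deltas` stands in for the delta objects pulled off each chunk's choices[0].delta, and the function name is ours:

```javascript
// Sketch: accumulate streamed tool-call fragments by tool_calls[i].index.
// Field names mirror the OpenAI chunk shape.
function assembleToolCalls(deltas) {
  const calls = []; // index -> { name, arguments }
  for (const delta of deltas) {
    for (const tc of delta.tool_calls ?? []) {
      const slot = (calls[tc.index] ??= { name: "", arguments: "" });
      if (tc.function?.name) slot.name += tc.function.name;
      if (tc.function?.arguments) slot.arguments += tc.function.arguments;
    }
  }
  // Call this once finish_reason === "tool_calls" arrives; only then is
  // each slot's arguments string complete JSON.
  return calls.map((c) => ({ name: c.name, args: JSON.parse(c.arguments) }));
}
```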
Cancellation
Drop the connection (for example, by calling abort() on the AbortController whose signal you passed to fetch). The gateway notices the client is gone, cancels the upstream call within ~250 ms, and only meters tokens already produced. Useful for “stop” buttons in chat UIs.
Errors mid-stream
If the upstream provider errors after we’ve already streamed some tokens, you receive a data: {"error": {...}} chunk before the connection closes. Always handle this branch; it’s rare, but it does happen.
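A minimal way to wire that branch into a per-chunk handler; the function name and the choice to throw are ours, while the error payload shape follows the docs:

```javascript
// Sketch: route each parsed data: payload, including the rare
// mid-stream error chunk.
function handleChunk(payload) {
  if (payload.error) {
    // Mid-stream failure: surface it instead of silently ending.
    throw new Error(`upstream error: ${payload.error.message ?? "unknown"}`);
  }
  // Normal delta chunk: return any text content (empty for usage chunks).
  return payload.choices?.[0]?.delta?.content ?? "";
}
```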
Buffering pitfalls
- Don’t put a buffering proxy (Cloudflare cache, nginx with default buffering) between client and Infery. SSE needs to flush per chunk.
- Browser fetch() with await response.text() swallows the stream; use response.body.getReader() or the OpenAI SDK.
- curl needs -N to disable output buffering.
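A bare-bones getReader() consumer, splitting on the blank line between SSE events; it works on any byte ReadableStream, so point it at response.body from fetch() in real use (the generator name is ours):

```javascript
// Sketch: consume an SSE body chunk by chunk, yielding the payload
// after each "data: " prefix. Caller JSON.parses everything except [DONE].
async function* sseEvents(stream) {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let buf = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    let sep;
    // Events are separated by a blank line (\n\n).
    while ((sep = buf.indexOf("\n\n")) !== -1) {
      const event = buf.slice(0, sep);
      buf = buf.slice(sep + 2);
      if (event.startsWith("data: ")) yield event.slice(6);
    }
  }
}
```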

