Text-to-speech
Pick a voice and model
| Model | Voices | Strong for |
|---|---|---|
tts-1 (OpenAI) | alloy, echo, onyx, nova, shimmer, sage, ash, coral, fable | Fast, low cost, monolingual-EN strong |
tts-1-hd (OpenAI) | same | Higher fidelity, ~2× cost |
gemini-2-5-tts | Multilingual | Natural prosody, 30+ languages |
qwen-tts | Multilingual incl. CN/JA/KO | PRC opt-in required |
Save to file
python
Sample output
“Welcome to Infery — one API for every AI model.” Generated withgemini-2.5-flash-preview-tts, voice Kore.
Stream to a player
For long passages, stream the bytes directly to the user’s audio element instead of downloading then playing:python
Format choice
| Format | Bitrate | Use |
|---|---|---|
mp3 | ~96 kbps | Default; widely supported |
opus | ~64 kbps | Best for streaming voice (web/WebRTC) |
wav | uncompressed | Editing, further processing |
flac | lossless compressed | Archival |
pcm | raw 24kHz mono | Custom pipelines (synth, modems) |
Pacing
speed is 0.25 → 4.0. Most listeners are comfortable at 0.95–1.15 . Speeding past 1.5 is intelligible but tiring; slowing below 0.85 gets robotic.
Speech-to-text
Quick transcription
python
Long-form audio (>25 MB)
Whisper-1 caps at 25 MB. For longer recordings, split first:python
Subtitle export
response_format="srt" or "vtt" returns ready-to-use subtitle files:
python
Word-level timestamps
python
Translation
Useclient.audio.translations.create(...) to transcribe and translate to English in one call. Source language is auto-detected.
Round-trip: voice agent
Combining STT → chat → TTS gives you a basic voice agent:python
Costs at a glance
- TTS: ~30 on
tts-1-hd - Whisper STT: ~$6 per hour of audio
- Long meetings (1 h) typically cost less than the chat completion that follows them
Pitfalls
- Wrong language hint drops STT accuracy. Auto-detect is good but a
language=hint is better when known. - Quiet/clipped recordings — Whisper handles noise well but not clipping. Normalise levels before transcribing.
- TTS swallowing punctuation — write naturally; “Hi—how are you?” reads better than “Hi how are you”.
- Long base64 audio over JSON wastes 33% bandwidth vs. multipart. Use multipart unless you have a JSON-only client.

