POST /v1/audio/transcriptions

Speech-to-text
curl --request POST \
  --url https://api.infery.ai/v1/audio/transcriptions \
  --header 'Authorization: Bearer <api-key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "<string>",
  "audio": "<string>",
  "language": "<string>",
  "response_format": "json"
}
'
{
  "text": "Hello, this is a test transcription.",
  "language": "en",
  "duration": 12.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 12.5,
      "text": "Hello, this is a test transcription.",
      "avg_logprob": -0.21,
      "compression_ratio": 1.3,
      "no_speech_prob": 0.01
    }
  ],
  "credits_used": 3
}
We accept both multipart form data and JSON with base64-encoded audio for STT, so the OpenAI SDK works out of the box.

Multipart (OpenAI SDK default)

from openai import OpenAI

# Point the OpenAI SDK at the Infery base URL; auth and wire format are compatible.
client = OpenAI(api_key=API_KEY, base_url="https://api.infery.ai/v1")

with open("meeting.mp3", "rb") as f:
    tr = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
    )
print(tr.text)

JSON base64 (light HTTP clients)

curl https://api.infery.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $INFERY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "whisper-1",
    "file_base64": "<base64 audio>",
    "filename": "meeting.mp3",
    "response_format": "json"
  }'
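If you prefer to build the JSON body in Python rather than shell, the payload is just the base64-encoded file bytes. A minimal sketch — field names are taken from the curl example above; `build_transcription_payload` is our illustrative helper, and the final POST is left to whatever HTTP client you use:

```python
import base64


def build_transcription_payload(path: str, model: str = "whisper-1") -> dict:
    """Build the JSON body for POST /v1/audio/transcriptions from a local file."""
    with open(path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "file_base64": audio_b64,
        "filename": path.rsplit("/", 1)[-1],
        "response_format": "json",
    }


# Send with any HTTP client, e.g.:
#   requests.post("https://api.infery.ai/v1/audio/transcriptions",
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 json=build_transcription_payload("meeting.mp3"))
```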

Response formats

json (default), text, srt, verbose_json (with segments + word timestamps), vtt.
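If you already have a `verbose_json` response and want SRT (or custom subtitle formatting) client-side, the segment timestamps convert directly. A hypothetical helper, assuming each segment carries `start` and `end` in seconds as documented below:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json segments as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"
```

This is only worthwhile when you need formatting the built-in `srt` response format doesn't give you; otherwise just request `response_format: "srt"`.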

Limits

  • Max 25 MB audio per request
  • Formats: MP3, MP4, M4A, WAV, WebM, OGG, FLAC
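Both limits can be checked client-side before uploading, which saves a round trip on oversized or unsupported files. A small sketch — the 25 MB cap and format list come from above; the helper name is ours, and we assume the cap is binary (25 MiB):

```python
import os

MAX_BYTES = 25 * 1024 * 1024  # assuming a binary 25 MiB per-request cap
ALLOWED_EXTS = {".mp3", ".mp4", ".m4a", ".wav", ".webm", ".ogg", ".flac"}


def validate_audio_file(path: str) -> None:
    """Raise ValueError if the file would be rejected by the API limits."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"file is {size} bytes; max is {MAX_BYTES}")
```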

Authorizations

Authorization
string
header
required

API key in format: Bearer inf_***

Body

application/json
model
string
required

Model ID to use for STT

audio
string
required

Base64-encoded audio data

language
string

Language of the audio (ISO-639-1)

response_format
enum<string>
default:json
Available options:
json,
text,
srt,
verbose_json,
vtt

Response

Transcription result. Shape depends on response_format: JSON (json, verbose_json) or plain text (text, srt, vtt).

response_format: json (default) or verbose_json

text
string
Example:

"Hello, this is a test transcription."

language
string

Detected language (verbose_json only)

Example:

"en"

duration
number

Audio duration in seconds (verbose_json only)

Example:

12.5

segments
object[]

Time-stamped segments (verbose_json only)

credits_used
integer
Example:

3