Stream Speech
Converts text to speech and streams MP3 audio in real time
Endpoint
POST /v1/audio/stream
Authentication
Bearer token required
Overview
The streaming endpoint returns MP3 audio in chunks as the text is synthesized, so playback can begin before the full clip is ready. It suits applications that need low-latency audio playback, such as real-time assistants or live caption-to-speech conversion.
Response Format
audio/mpeg (chunked MP3)
First Byte Latency
< 500ms
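To check the first-byte latency figure against your own network path, you can time how long the first audio chunk takes to arrive. The sketch below is a minimal illustration, not part of the API: `time_to_first_chunk` and its `open_stream` argument are hypothetical names, and `open_stream` stands for any zero-argument callable that sends the request and returns an iterable of byte chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_chunk(
    open_stream: Callable[[], Iterable[bytes]],
) -> Tuple[float, Iterator[bytes]]:
    """Return (seconds until the first chunk arrived, iterator over all chunks).

    `open_stream` is a hypothetical callable that issues the request and
    yields audio chunks; the timer starts just before it is called.
    """
    start = time.monotonic()
    chunks = iter(open_stream())
    first = next(chunks, b"")  # blocks until the first chunk (or end of stream)
    elapsed = time.monotonic() - start

    def replay() -> Iterator[bytes]:
        # Hand the already-consumed first chunk back to the caller.
        if first:
            yield first
        yield from chunks

    return elapsed, replay()
```

The measured value includes connection setup and network round trips, so it can exceed the server-side figure.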
Request Parameters
The text to convert to speech. Maximum 5,000 characters.
The synthesis model to use. Currently supported: legacy-v2.5
The voice ID to use. Get available voices from the voices endpoint.
Adjust the voice pitch. Range: -100% to +100%. Default: +0%
Emotional speaking style. Options: neutral, cheerful, calm, angry, sad, excited, whispering. Default: calm
Intensity of the selected style. Range: 0.5 to 2.0. Default: 1.5
BCP-47 language code (e.g., en-US, fr-FR). Default: en-US
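The documented limits can be checked on the client before a request is sent, which avoids a round trip for inputs the server would reject. The following is a sketch under assumptions: the argument names (`text`, `pitch`, `style`, `style_intensity`) are hypothetical because this page does not show the real field names, and the pitch is assumed to be a signed percentage string such as `+10%`.

```python
from typing import List

MAX_TEXT_CHARS = 5000
ALLOWED_STYLES = {
    "neutral", "cheerful", "calm", "angry", "sad", "excited", "whispering",
}


def validate_request(
    text: str,
    pitch: str = "+0%",           # assumed format: signed percentage string
    style: str = "calm",
    style_intensity: float = 1.5,
) -> List[str]:
    """Return a list of problems; an empty list means the values are in range."""
    problems = []
    if len(text) > MAX_TEXT_CHARS:
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")
    try:
        pitch_value = float(pitch.rstrip("%"))
        if not -100.0 <= pitch_value <= 100.0:
            problems.append("pitch must be between -100% and +100%")
    except ValueError:
        problems.append("pitch must look like '+10%' or '-25%'")
    if style not in ALLOWED_STYLES:
        problems.append(f"unknown style: {style!r}")
    if not 0.5 <= style_intensity <= 2.0:
        problems.append("style intensity must be between 0.5 and 2.0")
    return problems
```

For example, `validate_request("Hello", pitch="+150%")` reports only the pitch problem.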
Examples
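A minimal sketch of a streaming request using only the Python standard library. Several details are assumptions rather than documented facts: the host `api.example.com` is a placeholder, the body is assumed to be JSON, and the field names `text`, `model`, and `voice_id` are hypothetical stand-ins for the real parameter names. Only the path, the HTTP method, the Bearer scheme, and the `legacy-v2.5` model come from this page.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, replace with the real one


def build_stream_request(
    api_key: str,
    text: str,
    model: str = "legacy-v2.5",
    voice_id: str = "VOICE_ID",       # hypothetical field name and value
) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = {"text": text, "model": model, "voice_id": voice_id}
    return urllib.request.Request(
        BASE_URL + "/v1/audio/stream",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed request format
            "Accept": "audio/mpeg",
        },
        method="POST",
    )


def stream_to_file(request: urllib.request.Request, path: str,
                   chunk_size: int = 4096) -> int:
    """Send the request and write audio to `path` as it arrives.

    Returns the number of bytes written.
    """
    total = 0
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # empty read means the stream has ended
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

Usage would look like `stream_to_file(build_stream_request(api_key, "Hello"), "hello.mp3")`, with the key loaded from a server-side secret store rather than hard-coded.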
Response
The endpoint streams MP3 audio data with the following headers:
Content-Type: audio/mpeg
Transfer-Encoding: chunked
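A client can confirm it received audio before handing bytes to a player, since an error response would typically carry a different media type. The sketch below assumes the `audio/mpeg` value is delivered in the standard `Content-Type` header; `is_mp3_stream` and `consume_stream` are hypothetical helper names, and `body` is any binary file-like object, such as the response returned by `urllib.request.urlopen`.

```python
import io
from typing import BinaryIO, Callable, Mapping


def is_mp3_stream(headers: Mapping[str, str]) -> bool:
    """True if the Content-Type header (any letter case) is audio/mpeg."""
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get("content-type", "")
    # Ignore optional parameters such as "; charset=..."
    return content_type.split(";")[0].strip().lower() == "audio/mpeg"


def consume_stream(body: BinaryIO, on_chunk: Callable[[bytes], None],
                   chunk_size: int = 4096) -> int:
    """Pass each chunk to `on_chunk` as it is read; return total bytes seen."""
    total = 0
    # iter() with a sentinel stops at the first empty read (end of stream).
    for chunk in iter(lambda: body.read(chunk_size), b""):
        on_chunk(chunk)
        total += len(chunk)
    return total
```

HTTP client libraries normally decode chunked transfer encoding themselves, so `on_chunk` receives plain MP3 bytes rather than raw chunk framing.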
Error Responses
Best Practices
- Connection Management: Use HTTP/2 or keep-alive connections to reduce latency
- Back-pressure: Process chunks as they arrive to maintain stream health
- Error Recovery: Implement reconnection logic for network interruptions
- Browser Support: Use MediaSource API for optimal browser streaming
- Security: Keep your API key secure and never expose it in client-side code
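The error-recovery advice above can be sketched as a retry wrapper with exponential backoff. This is one common approach, not a documented requirement of the API: `with_retries` is a hypothetical helper, and the attempt count, delays, and exception types are illustrative defaults. This page does not describe a way to resume a stream partway through, so each retry is assumed to restart synthesis from the beginning.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on the given exceptions.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises
    the last error once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Discard any partially received audio before retrying so the listener does not hear the opening of the clip twice.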
Streaming vs Standard Endpoint
Use Streaming When
- Real-time playback is needed
- Low latency is critical
- Processing long texts
- Building conversational apps
Use Standard When
- Saving audio to files
- Offline caching
- Simple playback
- Short text snippets
Authorizations
Your API key as a Bearer token
Body
Response
Successful response
Streaming MP3 audio data