Compare 6 TTS APIs side-by-side. Pricing per million characters, voice quality, cloning, streaming latency, free tiers, and code examples.
Text is tokenized, processed for pronunciation and prosody, synthesized through neural voice models, and delivered as streaming audio or downloadable files in MP3, WAV, OGG, or PCM formats.
Build conversational AI agents with natural-sounding voices. Sub-300ms latency enables real-time voice interactions for customer support, sales, and virtual assistants.
Generate narrated content at scale. Long-form voices with natural pacing, breathing, and emotion turn written content into professional audio productions.
Make apps and content accessible to visually impaired users. Screen readers, navigation guidance, and content read-aloud features powered by natural TTS.
Translate and voice content in 140+ languages. Maintain brand voice consistency across markets with multilingual neural voices and voice cloning.
Generate voiceovers for explainer videos, product demos, and social media content. Studio-quality voices without hiring voice actors or booking studios.
Power interactive voice response systems with dynamic, natural-sounding prompts. SSML control for pronunciation, pauses, and emphasis in phone menus.
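For the IVR use case, prompts are typically assembled as SSML with explicit pauses and emphasis. A minimal sketch of building such a prompt in Python (the `ivr_prompt` helper and tag choices are illustrative; tag support varies by provider, as the SSML Support row in the table shows):

```python
# Sketch: assembling an SSML phone-menu prompt with pauses and emphasis.
# The helper below is a hypothetical example, not a provider SDK function.
def ivr_prompt(options):
    """Build an SSML menu prompt from (digit, label) pairs."""
    lines = ["<speak>", "Welcome. <break time='500ms'/>"]
    for digit, label in options:
        lines.append(
            f"For <emphasis level='strong'>{label}</emphasis>, "
            f"press {digit}. <break time='300ms'/>"
        )
    lines.append("</speak>")
    return " ".join(lines)

ssml = ivr_prompt([(1, "sales"), (2, "support")])
print(ssml)
```

The resulting string can be passed to any provider that accepts SSML input (for example Polly's `TextType="ssml"`, shown further below).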
| Feature | ElevenLabs | OpenAI TTS | Amazon Polly | Google Cloud | Azure Speech | Deepgram Aura |
|---|---|---|---|---|---|---|
| Free Tier | 10K chars/mo | $5 credit | 5M chars/mo | 4M chars/mo | 500K chars/mo | $200 credits |
| Price/M Chars (Standard) | $120-300 | $15 | $4 | $4 | $16 | $15 |
| Price/M Chars (Premium) | $120 | $30 (HD) | $30 (Generative) | $160 (Studio) | $30 (Neural HD) | $15 |
| Languages | 29+ | 57 | 40+ | 50+ | 140+ | English only |
| Voice Count | 10,000+ | 9 | 100+ | 300+ | 600+ | 20+ |
| Voice Cloning | ✓ From $5/mo | ✗ | ✗ | ✓ $60/M chars | ✓ Gated | ✗ |
| Real-time Streaming | ✓ <300ms | ✓ | ✓ | ✓ | ✓ | ✓ 250ms TTFB |
| SSML Support | ✗ Prompt-based | ✗ Prompt-based | ✓ Full | ✓ Legacy models | ✓ Best | ✗ Limited |
| Emotion Control | ✓ Natural | ✓ gpt-4o-mini-tts | ✗ | ✗ | ✓ express-as | ✗ |
| Output Formats | MP3, PCM, ulaw | MP3, Opus, AAC, FLAC, WAV, PCM | MP3, OGG, PCM | MP3, WAV, OGG, mulaw, alaw | MP3, WAV, OGG, WebM, raw | MP3, WAV, PCM, mulaw, alaw |
| Multi-speaker | ✓ Projects | ✗ | ✗ | ✓ Gemini TTS | ✗ | ✗ |
| Voice Marketplace | ✓ 10K+ voices | ✗ | ✗ | ✗ | ✗ | ✗ |
| Long-form Audio | ✓ | ✓ | ✓ $100/M | ✓ | ✓ | ✗ |
| Speech Marks / Timing | ✓ Word-level | ✗ | ✓ Viseme + word | ✓ | ✓ Viseme + word | ✗ |
Prices shown per million characters. Most providers have multiple voice tiers.
Monthly cost estimates at different volumes using each provider's best neural/standard voice pricing.
| Monthly Volume | Amazon Polly (Standard) | Google Cloud (Standard) | OpenAI (tts-1) | Deepgram Aura | Azure (Neural) | ElevenLabs (Pro) |
|---|---|---|---|---|---|---|
| 100K chars | $0 (free) | $0 (free) | $1.50 | $1.50 | $0 (free) | $99/mo plan |
| 1M chars | $0 (free) | $0 (free) | $15 | $15 | $8 (500K free) | $99/mo plan |
| 5M chars | $0 (free) | $4 | $75 | $75 | $72 | $330/mo plan |
| 10M chars | $20 | $24 | $150 | $150 | $152 | $1,320/mo plan |
| 50M chars | $180 | $184 | $750 | $750 | $792 | Enterprise |
| 100M chars | $380 | $384 | $1,500 | $1,500 | $1,592 | Enterprise |
ElevenLabs pricing is plan-based with overage charges. Polly Neural ($16/M) and Google Neural2 ($16/M) are 4x more expensive than their standard tiers but offer better voice quality.
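The table above can be sanity-checked with a few lines of Python. This is a rough estimator using the per-million rates and free-tier allowances from the tables (rates drift over time; verify against each provider's current pricing page before budgeting):

```python
# Rough monthly cost estimator: (chars - free tier) * rate per million.
# Rates and free tiers are taken from the comparison tables above and
# may be out of date -- treat them as assumptions, not quotes.
PROVIDERS = {
    "polly_standard":  {"rate": 4.0,  "free_chars": 5_000_000},
    "google_standard": {"rate": 4.0,  "free_chars": 4_000_000},
    "openai_tts1":     {"rate": 15.0, "free_chars": 0},
    "azure_neural":    {"rate": 16.0, "free_chars": 500_000},
}

def monthly_cost(provider: str, chars: int) -> float:
    p = PROVIDERS[provider]
    billable = max(0, chars - p["free_chars"])
    return billable / 1_000_000 * p["rate"]

print(monthly_cost("polly_standard", 10_000_000))  # 5M billable chars at $4/M
print(monthly_cost("azure_neural", 1_000_000))     # 500K billable chars at $16/M
```

Reproducing the table's $20 Polly figure at 10M characters and the $8 Azure figure at 1M this way makes it easy to plug in your own volumes.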
```python
# ElevenLabs - Generate speech with voice selection
import requests

url = "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM"
headers = {
    "xi-api-key": "your-api-key",
    "Content-Type": "application/json"
}
data = {
    "text": "Hello! This is a test of the ElevenLabs text-to-speech API.",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {
        "stability": 0.5,
        "similarity_boost": 0.75
    }
}
response = requests.post(url, json=data, headers=headers)
with open("output.mp3", "wb") as f:
    f.write(response.content)
```
```python
# OpenAI TTS - Simple text-to-speech
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var
response = client.audio.speech.create(
    model="tts-1",  # or "tts-1-hd" for higher quality
    voice="nova",   # alloy, echo, fable, onyx, nova, shimmer
    input="Hello! This is a test of the OpenAI text-to-speech API.",
    response_format="mp3"  # mp3, opus, aac, flac, wav, pcm
)
response.stream_to_file("output.mp3")
```
```python
# Amazon Polly - Generate speech with SSML
import boto3

polly = boto3.client("polly", region_name="us-east-1")
response = polly.synthesize_speech(
    Text='<speak>Hello! <break time="500ms"/> This is Amazon Polly.</speak>',
    TextType="ssml",
    OutputFormat="mp3",
    VoiceId="Joanna",  # Neural voice
    Engine="neural"    # standard, neural, long-form, generative
)
with open("output.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```
```python
# Google Cloud TTS - WaveNet voice
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
input_text = texttospeech.SynthesisInput(
    text="Hello! This is Google Cloud Text-to-Speech."
)
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D"  # WaveNet, Neural2, or Studio
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
response = client.synthesize_speech(
    input=input_text, voice=voice, audio_config=audio_config
)
with open("output.mp3", "wb") as f:
    f.write(response.audio_content)
```
```python
# Azure Speech - Neural voice with emotion
import azure.cognitiveservices.speech as speechsdk

config = speechsdk.SpeechConfig(
    subscription="your-key",
    region="eastus"
)
config.speech_synthesis_voice_name = "en-US-JennyNeural"
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=config,
    audio_config=speechsdk.audio.AudioOutputConfig(filename="output.wav")
)
# SSML with emotional style (note both the base SSML namespace and
# Azure's mstts extension namespace are required)
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">
      Hello! This is Azure Speech with emotional control.
    </mstts:express-as>
  </voice>
</speak>"""
result = synthesizer.speak_ssml_async(ssml).get()
```
```python
# Deepgram Aura - Fast TTS for voice agents
import requests

url = "https://api.deepgram.com/v1/speak"
headers = {
    "Authorization": "Token your-api-key",
    "Content-Type": "application/json"
}
params = {
    "model": "aura-asteria-en",  # English female voice
    "encoding": "mp3"
}
data = {
    "text": "Hello! This is Deepgram Aura text-to-speech."
}
response = requests.post(url, headers=headers, params=params, json=data)
with open("output.mp3", "wb") as f:
    f.write(response.content)
```
```python
# ElevenLabs - Streaming TTS with WebSocket
import asyncio
import base64
import json

import websockets

async def stream_tts():
    voice_id = "21m00Tcm4TlvDq8ikWAM"
    model = "eleven_flash_v2_5"  # Low-latency model
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}"
    async with websockets.connect(uri) as ws:
        # Initialize stream
        await ws.send(json.dumps({
            "text": " ",
            "xi_api_key": "your-api-key",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8}
        }))
        # Send text chunks
        for chunk in ["Hello! ", "This is ", "streaming TTS."]:
            await ws.send(json.dumps({"text": chunk}))
        # Close stream
        await ws.send(json.dumps({"text": ""}))
        # Receive audio chunks until the server signals the end
        while True:
            msg = await ws.recv()
            data = json.loads(msg)
            if data.get("audio"):
                audio_bytes = base64.b64decode(data["audio"])
                # Play or save audio_bytes
            if data.get("isFinal"):
                break

asyncio.run(stream_tts())
```
```python
# OpenAI TTS - Streaming audio response
from openai import OpenAI

client = OpenAI()
# Stream audio chunks as they're generated
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="Hello! This is streaming text-to-speech from OpenAI.",
    response_format="mp3"
) as response:
    response.stream_to_file("output.mp3")

# Or process chunks manually:
response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="Hello! Streaming TTS.",
)
for chunk in response.iter_bytes(chunk_size=4096):
    # Process audio chunks in real-time
    pass
```
```bash
# cURL examples for quick testing

# ElevenLabs
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM" \
  -H "xi-api-key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello world","model_id":"eleven_multilingual_v2"}' \
  --output speech.mp3

# OpenAI
curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","voice":"nova","input":"Hello world"}' \
  --output speech.mp3

# Deepgram Aura
curl -X POST "https://api.deepgram.com/v1/speak?model=aura-asteria-en" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello world"}' \
  --output speech.mp3
```
```python
# ElevenLabs - Instant Voice Cloning
import requests

# Step 1: Create a cloned voice from an audio sample
url = "https://api.elevenlabs.io/v1/voices/add"
headers = {"xi-api-key": "your-api-key"}
data = {
    "name": "My Cloned Voice",
    "description": "Cloned from audio sample"
}
files = [("files", ("sample.mp3", open("sample.mp3", "rb"), "audio/mpeg"))]
response = requests.post(url, headers=headers, data=data, files=files)
voice_id = response.json()["voice_id"]

# Step 2: Generate speech with the cloned voice
tts_url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
tts_data = {
    "text": "This is my cloned voice speaking!",
    "model_id": "eleven_multilingual_v2"
}
audio = requests.post(tts_url, json=tts_data, headers={
    "xi-api-key": "your-api-key",
    "Content-Type": "application/json"
})
with open("cloned_output.mp3", "wb") as f:
    f.write(audio.content)
```
```python
# Google Cloud - Chirp 3 Instant Custom Voice
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
# Synthesize with a Chirp 3 HD voice (or a custom cloned voice ID)
# Note: Requires Chirp 3 access and additional setup
input_text = texttospeech.SynthesisInput(
    text="This is synthesized with a custom cloned voice."
)
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Chirp3-HD-Achernar"  # Or custom voice ID
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
response = client.synthesize_speech(
    input=input_text,
    voice=voice,
    audio_config=audio_config
)
with open("output.mp3", "wb") as f:
    f.write(response.audio_content)

# Custom voice cloning requires Cloud Console setup:
# 1. Upload reference audio samples
# 2. Create custom voice profile
# 3. Use voice profile ID in API calls
# Pricing: $60 per million characters
```
| Quality Tier | Provider | Price/M Chars | Best For | Latency |
|---|---|---|---|---|
| Basic / Standard | Polly Standard, Google Standard | $4 | IVR, accessibility, high-volume read-aloud | Very low |
| Neural | Polly Neural, Google WaveNet, Azure Neural | $4-16 | General use, notifications, assistants | Low |
| Premium Neural | OpenAI tts-1, Deepgram Aura, Google Neural2 | $15-16 | Voice agents, apps, content | Low-Medium |
| HD / Conversational | OpenAI tts-1-hd, Azure Neural HD, Polly Generative, Google Chirp 3 HD | $30 | Customer-facing agents, brand voice | Medium |
| Studio / Premium | ElevenLabs, Google Studio, Polly Long-Form | $100-300 | Audiobooks, media, premium content | Higher |
| Factor | Cloud TTS APIs | Self-hosted (Coqui / Piper / XTTS) |
|---|---|---|
| Setup Time | Minutes (API key) | Hours to days (GPU, model download, optimization) |
| Voice Quality | Premium (ElevenLabs, Azure) | Good but behind commercial leaders |
| Cost at Scale | $4-300/M chars (ongoing) | GPU cost only ($0.50-2/hr for inference) |
| Latency | 200-500ms (network + inference) | 50-200ms (local inference, no network) |
| Privacy | Text sent to third party | All data stays on your infrastructure |
| Languages | Up to 140+ (Azure) | Varies by model (XTTS: 17 languages) |
| Maintenance | None (managed service) | GPU monitoring, model updates, scaling |
| Best For | Production apps, quick MVP, enterprise | High volume, privacy-critical, offline use |
Open-source models like Piper (fast, lightweight), Coqui XTTS (multilingual cloning), and Bark (expressive) are viable for teams with GPU infrastructure. At 50M+ characters/month, self-hosting typically becomes more cost-effective than cloud APIs.
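The break-even point is easy to estimate. A back-of-envelope comparison, where the GPU hourly rate and synthesis throughput are illustrative assumptions (not benchmarks; measure your own stack):

```python
# Cloud API cost vs. self-hosted GPU cost at a given monthly volume.
# gpu_rate_per_hr and chars_per_gpu_hour below are illustrative
# assumptions -- substitute measured numbers for your model and hardware.
def cloud_cost(chars_per_month: int, rate_per_million: float) -> float:
    return chars_per_month / 1_000_000 * rate_per_million

def selfhost_cost(chars_per_month: int, gpu_rate_per_hr: float,
                  chars_per_gpu_hour: int) -> float:
    hours = chars_per_month / chars_per_gpu_hour
    return hours * gpu_rate_per_hr

volume = 50_000_000  # 50M chars/month
api = cloud_cost(volume, rate_per_million=15.0)      # e.g. OpenAI tts-1 rate
gpu = selfhost_cost(volume, gpu_rate_per_hr=1.0,
                    chars_per_gpu_hour=2_000_000)    # assumed throughput
print(f"cloud ${api:.0f}/mo vs self-hosted ${gpu:.0f}/mo")
```

Even with generous padding for idle GPU time and ops overhead, the raw inference cost gap at high volume explains why 50M+ characters/month is the usual threshold.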
Choose ElevenLabs if you need the most natural, expressive voices for consumer-facing products, audiobooks, or premium content.
Choose Amazon Polly if you need millions of characters per month at the lowest cost: IVR, accessibility, or high-volume notifications.
Choose OpenAI TTS if you want to add TTS in minutes with no subscription or setup complexity, especially if you already use OpenAI for other features.
Choose Azure Speech if you need to support users in dozens of languages with consistent voice quality across all locales.
Choose Deepgram Aura if you're building a conversational AI agent and need the lowest time-to-first-byte for natural dialogue.
Choose ElevenLabs if you need to clone a voice from audio samples without enterprise contracts or gated access programs.