Speech-to-Text API Comparison 2026 — Whisper vs Deepgram vs AssemblyAI vs Google vs AWS vs Rev AI

How Speech-to-Text APIs Work

1. Upload Audio → 2. Audio Processing → 3. Speech Recognition → 4. Post-Processing → 5. Structured Transcript

Audio is uploaded (or streamed in real-time), processed by AI models that convert speech patterns to text, then enhanced with punctuation, speaker labels, timestamps, and optional features like sentiment analysis or summarization.

What Developers Build with Speech-to-Text APIs

🎤

Meeting Transcription

Transcribe meetings, calls, and conferences in real-time with speaker diarization. Zoom, Teams, and Google Meet integrations.

🎧

Podcast & Media

Generate transcripts for podcasts, YouTube videos, and audio content. Improve SEO and accessibility with searchable text.

📞

Call Center Analytics

Transcribe support calls, detect sentiment, extract topics, and monitor agent performance. HIPAA-compliant options for healthcare.

🗣️

Voice Assistants

Power voice interfaces with sub-300ms streaming transcription. Build voice-controlled apps, smart speakers, and accessibility tools.

🎬

Subtitles & Captions

Auto-generate subtitles for video content in 99+ languages. SRT/VTT export with precise word-level timestamps.

🤖

AI Agent Integration

Give AI agents the ability to hear and process audio. Real-time voice-to-action pipelines for customer service and workflow automation.
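
For the subtitles use case, word-level timestamps can be grouped into caption cues and serialized as SRT. The following is a minimal sketch, assuming a `words` array of `{ word, start, end }` objects with times in seconds. The helper names `toSrtTime` and `wordsToSrt`, the sample data, and the words-per-cue limit are illustrative assumptions and not part of any provider SDK.

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Group words into cues of at most maxWords and emit SRT text
function wordsToSrt(words, maxWords = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const chunk = words.slice(i, i + maxWords);
    cues.push({
      start: chunk[0].start,
      end: chunk[chunk.length - 1].end,
      text: chunk.map((w) => w.word).join(' ')
    });
  }
  return cues
    .map((cue, idx) => `${idx + 1}\n${toSrtTime(cue.start)} --> ${toSrtTime(cue.end)}\n${cue.text}\n`)
    .join('\n');
}

// Hypothetical sample input (not real API output)
const sampleWords = [
  { word: 'Welcome', start: 0.0, end: 0.4 },
  { word: 'to', start: 0.4, end: 0.5 },
  { word: 'the', start: 0.5, end: 0.6 },
  { word: 'meeting', start: 0.6, end: 1.1 }
];
console.log(wordsToSrt(sampleWords, 2));
```

A production captioner would more likely break cues on pauses, punctuation, or line length than on a fixed word count; the fixed count just keeps the sketch short.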

Feature Comparison Table

Feature	OpenAI Whisper	Deepgram	AssemblyAI	Google Cloud	Amazon Transcribe	Rev AI
Free Tier	$5 credits	$200 credits	$50 credits	60 min/mo + $300	60 min/mo (12mo)	5 hours
Price/Min (Batch)	$0.006	$0.0043	$0.0061	$0.016	$0.024	$0.003
Price/Min (Stream)	$0.006	$0.0077	$0.0061	$0.016	$0.024	$0.003
Languages	99	36+	99	125+	100+	58+
Real-time Streaming	✓	✓	✓ (6 langs)	✓	✓	✓ (9 langs)
Speaker Diarization	✓ Free	✓ Free (batch)	✓ $0.02/hr	✓ Free	✓ Free	✓ Free
Custom Vocabulary	✗	✓ 100 terms	✓ 200 terms	✓	✓	✓
Sentiment Analysis	✗	✗	✓ Add-on	✗	✗	✓ Add-on
Summarization	✗	✗	✓ Add-on	✗	✗	✓ Add-on
Content Redaction	✗	✓	✓	✗	✓	✗
Medical Model	✗	✓	✗	✗	✓ HIPAA	✗
Code-switching	✗	✓ 10 langs	✓	✓	✗	✗
Open Source Model	✓ Whisper	✗	✗	✗	✗	✓ Reverb
Human Transcription	✗	✗	✗	✗	✗	✓ $1.99/min

Pricing at a Glance

Provider	Price per Hour	Basis	Free Tier
Rev AI (Turbo)	$0.10	English	5 hours free
OpenAI (Mini)	$0.18	99 langs	$5 free credits
Deepgram	$0.26	batch	$200 free credits
AssemblyAI	$0.37	batch	$50 free credits
Google Cloud	$0.96	Chirp 3	60 min/mo free
Amazon Transcribe	$1.44	tier 1	60 min/mo (12mo)

Provider Deep Dive

OpenAI Whisper / GPT-4o Transcribe

$0.003-0.006/min

The most recognizable name in AI. GPT-4o Transcribe builds on Whisper with streaming, diarization, and multimodal understanding. 99-language support with flat-rate pricing.

Pros:

99 languages at flat rate — no per-language pricing
GPT-4o Mini at $0.003/min is very competitive
Diarization included free (gpt-4o-transcribe-diarize model)
Open-source Whisper model for self-hosting
Massive ecosystem integration

Cons:

No volume discounts — flat rate at any scale
$5 free tier is smallest of all providers
No custom vocabulary support
25MB file size limit on whisper-1

Deepgram Nova-3

$0.0043/min

Purpose-built for production speech recognition. Sub-300ms streaming latency, industry-leading accuracy (5.26% WER), and the most generous free tier at $200 in credits.

Pros:

$200 free credits (no expiry, no card), roughly 45K minutes
Best published accuracy: 5.26% WER (Nova-3)
Sub-300ms streaming latency (Deepgram Flux even lower)
Code-switching between 10 languages in real-time
Medical model (Nova-3 Medical) for healthcare

Cons:

Only 36+ languages (vs 99-125+ competitors)
Streaming costs ~79% more than batch ($0.0077 vs $0.0043)
Growth tier requires $4K annual prepay
No sentiment analysis or summarization features

AssemblyAI Universal-2

$0.0061/min

Richest feature set of any STT provider. Audio Intelligence add-ons for sentiment, entities, topics, summaries, and content moderation. 99-language support at flat rate.

Pros:

99 languages at same flat rate, auto language detection
Rich Audio Intelligence: sentiment, entities, topics, summaries
Sub-5% WER consistently across audio conditions
200 custom key terms for vocabulary boosting
64% fewer speaker counting errors (latest diarization)

Cons:

Add-on costs stack up (diarization $0.02/hr + sentiment + entities)
Streaming limited to 6 languages only
$50 free credits are one-time, not recurring
Streaming billed by connection time, not audio time

Google Cloud Speech-to-Text

$0.016/min

Broadest language support (125+) with enterprise-grade compliance. Chirp 3 model with auto language detection. Only provider whose free monthly minutes do not expire.

Pros:

125+ languages — widest coverage
60 minutes/month free (ongoing, every month)
Enterprise: data residency, customer-managed encryption keys
Deep GCP ecosystem (BigQuery, Cloud Functions, etc.)
Diarization included free with Chirp 3

Cons:

Second-highest list price at $0.016/min (only Amazon's tier 1 rate is higher)
Higher WER (~11.6%) than Deepgram or AssemblyAI
Requires Google Cloud account setup
Complex pricing page and billing

Amazon Transcribe

$0.024/min

Best volume discounts (drops to $0.0078/min at 5M+ min). HIPAA-eligible medical transcription. Deep AWS integration with 100+ language support.

Pros:

Best volume discounts: $0.0078/min at 5M+ minutes
HIPAA-eligible medical transcription ($0.075/min)
98%+ diarization accuracy, content redaction built-in
Call Analytics for contact centers
100+ languages, 1-second billing increments

Cons:

Most expensive at low volume ($0.024/min)
Free tier expires after 12 months
Complex IAM/AWS account setup required
Medical model very expensive ($0.075/min)

Rev AI

$0.003/min

Cheapest English-only transcription (Reverb at $0.003/min, Turbo at $0.0017/min). Only provider offering both AI and human transcription through a single API.

Pros:

Cheapest option: Reverb Turbo at $0.0017/min
Unique AI + human transcription hybrid ($1.99/min human)
Open-source Reverb ASR and diarization models
Forced alignment for existing transcripts
Streaming with sub-second latency

Cons:

Cheapest model is English-only
Only 58+ languages (fewer than most competitors)
Small free tier (5 hours one-time)
Streaming limited to 9 languages
15-second minimum per request

Batch vs Streaming: When to Use Each

Aspect	Batch (Pre-recorded)	Real-time Streaming
Latency	Seconds to minutes (depends on file size)	Sub-300ms (Deepgram) to ~1s
Cost	Cheaper (e.g., Deepgram: $0.0043/min)	More expensive (e.g., Deepgram: $0.0077/min)
Use Cases	Podcasts, recorded meetings, media archives	Live captions, voice assistants, call centers
Language Support	Full (all languages supported)	Limited (6-9 langs for some providers)
Features	Full feature set (diarization, summaries, etc.)	Limited features (basic transcription + interim results)
File Size Limits	Varies (25MB-2GB depending on provider)	No file limit (continuous stream)
Best Provider	Rev AI ($0.003/min) or Deepgram ($0.0043/min)	Deepgram (sub-300ms) or OpenAI (WebSocket)

Cost at Scale (Batch Transcription, per Month)

Volume	OpenAI (Mini)	Deepgram	AssemblyAI	Google Cloud	Amazon	Rev AI (Reverb)
100 hours	$18	$26	$37	$96	$144	$18
1,000 hours	$180	$258	$366	$960	$1,440	$180
10,000 hours	$1,800	$2,580	$3,660	$9,600	$11,250*	$1,800
100,000 hours	$18,000	$25,800	$36,600	$96,000	$65,850*	$18,000

* Amazon Transcribe pricing uses volume tiers: $0.024 (0-250K min), $0.015 (250K-1M), $0.0102 (1M-5M), $0.0078 (5M+). The starred figures apply each rate only to the minutes that fall within its tier. At very high volumes Amazon's marginal rate drops to about a third of its list price, but the flat-rate providers in this table remain cheaper. Rev AI prices shown for Reverb ($0.003/min, English only). OpenAI Mini is $0.003/min.
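
The tiered figures can be checked with a short calculation. This sketch assumes the tiers quoted in the footnote are graduated, meaning each rate applies only to the minutes inside its own band; the names `AMAZON_TIERS`, `tieredCost`, and `flatCost` are illustrative, and real invoices may differ.

```javascript
// Tier boundaries (minutes per month) and rates (USD per minute) as quoted in the footnote.
const AMAZON_TIERS = [
  { upTo: 250_000, rate: 0.024 },
  { upTo: 1_000_000, rate: 0.015 },
  { upTo: 5_000_000, rate: 0.0102 },
  { upTo: Infinity, rate: 0.0078 }
];

// Graduated pricing: each tier's rate applies only to the minutes inside that tier.
function tieredCost(minutes, tiers) {
  let cost = 0;
  let lower = 0;
  for (const { upTo, rate } of tiers) {
    if (minutes <= lower) break;
    const minutesInTier = Math.min(minutes, upTo) - lower;
    cost += minutesInTier * rate;
    lower = upTo;
  }
  return cost;
}

// Flat per-minute pricing, for comparison with a single-rate provider.
function flatCost(minutes, rate) {
  return minutes * rate;
}

for (const hours of [100, 1_000, 10_000, 100_000]) {
  const minutes = hours * 60;
  const tiered = tieredCost(minutes, AMAZON_TIERS).toFixed(2);
  const flat = flatCost(minutes, 0.003).toFixed(2);
  console.log(`${hours} hours: tiered $${tiered}, flat at $0.003/min $${flat}`);
}
```

Swapping in another provider's rate for the `0.003` argument makes it easy to find the volume, if any, at which a tiered plan overtakes a flat one.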

Code Examples

Transcribe Audio File

# OpenAI Whisper / GPT-4o Transcribe
# gpt-4o-transcribe returns json or text only. For word-level timestamps, use
# model="whisper-1" with response_format="verbose_json" and timestamp_granularities[]="word".
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@meeting.mp3" \
  -F model="gpt-4o-transcribe" \
  -F response_format="json"

// Node.js
import fs from 'node:fs';
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const transcript = await openai.audio.transcriptions.create({
  file: fs.createReadStream('meeting.mp3'),
  model: 'gpt-4o-transcribe',
  response_format: 'json'
});
console.log(transcript.text);

# Deepgram Nova-3
curl -X POST "https://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/mp3" \
  --data-binary @meeting.mp3

// Node.js
import fs from 'node:fs';
import { createClient } from '@deepgram/sdk';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const { result } = await deepgram.listen.prerecorded.transcribeFile(
  fs.readFileSync('meeting.mp3'),
  { model: 'nova-3', smart_format: true, diarize: true }
);
console.log(result.results.channels[0].alternatives[0].transcript);

# AssemblyAI Universal-2
# Step 1: Upload file
curl -X POST "https://api.assemblyai.com/v2/upload" \
  -H "authorization: $ASSEMBLYAI_API_KEY" \
  --data-binary @meeting.mp3

# Step 2: Create transcription (use upload_url from step 1)
curl -X POST "https://api.assemblyai.com/v2/transcript" \
  -H "authorization: $ASSEMBLYAI_API_KEY" \
  -H "content-type: application/json" \
  -d '{"audio_url": "https://...", "speaker_labels": true}'

// Node.js
import { AssemblyAI } from 'assemblyai';
const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY });

const transcript = await client.transcripts.transcribe({
  audio: './meeting.mp3',
  speaker_labels: true,
  language_detection: true
});
console.log(transcript.text);

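# Rev AI Reverb
# Local files are submitted as multipart form data in the "media" field
curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
  -H "Authorization: Bearer $REVAI_ACCESS_TOKEN" \
  -F "media=@meeting.mp3"

# Check job status (repeat until "status" is "transcribed")
curl "https://api.rev.ai/speechtotext/v1/jobs/{id}" \
  -H "Authorization: Bearer $REVAI_ACCESS_TOKEN"

# Get the transcript
curl "https://api.rev.ai/speechtotext/v1/jobs/{id}/transcript" \
  -H "Authorization: Bearer $REVAI_ACCESS_TOKEN" \
  -H "Accept: text/plain"

// Node.js
import { RevAiApiClient } from 'revai-node-sdk';
const client = new RevAiApiClient(process.env.REVAI_ACCESS_TOKEN);

const job = await client.submitJobLocalFile('./meeting.mp3');
// Poll until the job leaves the in_progress state
let details = await client.getJobDetails(job.id);
while (details.status === 'in_progress') {
  await new Promise((resolve) => setTimeout(resolve, 5000));
  details = await client.getJobDetails(job.id);
}
if (details.status !== 'transcribed') {
  throw new Error(`Job ${job.id} ended with status ${details.status}`);
}
const transcript = await client.getTranscriptText(job.id);
console.log(transcript);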

Real-time Streaming

// Deepgram real-time streaming (Node.js)
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const connection = deepgram.listen.live({
  model: 'nova-3',
  language: 'en',
  smart_format: true,
  interim_results: true
});

connection.on(LiveTranscriptionEvents.Transcript, (data) => {
  const transcript = data.channel.alternatives[0].transcript;
  if (transcript) console.log('Live:', transcript);
});

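// Send audio chunks from a microphone or stream once the socket is open
// (audioBuffer is a placeholder for your own audio source)
connection.on(LiveTranscriptionEvents.Open, () => {
  connection.send(audioBuffer);
});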

// OpenAI GPT-4o Transcribe streaming (WebSocket, using the Node.js "ws" package)
import WebSocket from 'ws';

const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe', {
  headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` }
});

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { input_audio_transcription: { model: 'gpt-4o-transcribe' } }
  }));
});

ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    console.log('Live:', event.transcript);
  }
});

// Once the socket is open, send audio as base64 input_audio_buffer.append events
// (base64Audio is a placeholder for your own encoded audio chunk)
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Audio }));

// AssemblyAI real-time streaming (Node.js)
import { RealtimeTranscriber } from 'assemblyai';

const transcriber = new RealtimeTranscriber({
  apiKey: process.env.ASSEMBLYAI_API_KEY,
  sampleRate: 16_000
});

transcriber.on('transcript', (transcript) => {
  if (transcript.message_type === 'FinalTranscript') {
    console.log('Live:', transcript.text);
  }
});

await transcriber.connect();

// Stream audio from microphone
recorder.on('data', (audio) => {
  transcriber.sendAudio(audio);
});
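
Streaming APIs typically emit interim hypotheses that are later replaced by a final result, so client code has to keep the two apart. The sketch below shows one way to do that, assuming events shaped like the Deepgram handler above (`channel.alternatives[0].transcript`) plus a boolean `is_final` flag. The `LiveCaption` class and `makeEvent` helper are hypothetical, and other providers use different field names.

```javascript
// Keeps finalized text and the latest interim hypothesis separately.
class LiveCaption {
  constructor() {
    this.finalized = []; // segments the provider has marked final
    this.interim = '';   // latest not-yet-final hypothesis
  }

  // event: { is_final: boolean, channel: { alternatives: [{ transcript }] } } (assumed shape)
  handle(event) {
    const text = event.channel.alternatives[0].transcript;
    if (event.is_final) {
      if (text) this.finalized.push(text);
      this.interim = ''; // the final result supersedes any interim text
    } else if (text) {
      this.interim = text;
    }
  }

  // Text to display: all final segments followed by the current interim guess.
  render() {
    return [...this.finalized, this.interim].filter(Boolean).join(' ');
  }
}

// Simulated events (not real API output)
const makeEvent = (transcript, isFinal) => ({
  is_final: isFinal,
  channel: { alternatives: [{ transcript }] }
});

const caption = new LiveCaption();
caption.handle(makeEvent('hello wor', false));
console.log(caption.render());
caption.handle(makeEvent('hello world', true));
caption.handle(makeEvent('how are', false));
console.log(caption.render());
```

Keeping interim text out of the finalized list avoids duplicated or flickering words when the provider revises a hypothesis.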

Speaker Diarization

# OpenAI GPT-4o Transcribe with speaker diarization (no extra cost)
# Speaker labels are returned with response_format="diarized_json";
# chunking_strategy is required for audio longer than 30 seconds.
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@meeting.mp3" \
  -F model="gpt-4o-transcribe-diarize" \
  -F response_format="diarized_json" \
  -F chunking_strategy="auto"

# Response includes per-segment speaker labels:
# { "text": "...",
#   "segments": [{"speaker": "A", "text": "Hello", "start": 0.0, "end": 0.5}] }

# Deepgram with diarization (free for batch)
curl -X POST "https://api.deepgram.com/v1/listen?model=nova-3&diarize=true" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/mp3" \
  --data-binary @meeting.mp3

# Response includes per-word speaker IDs:
# "words": [{"word": "Hello", "speaker": 0, "start": 0.0, "end": 0.5}]

// AssemblyAI with speaker labels (+$0.02/hr)
const transcript = await client.transcripts.transcribe({
  audio: './meeting.mp3',
  speaker_labels: true,
  speakers_expected: 3  // optional hint
});

for (const utterance of transcript.utterances) {
  console.log(`Speaker ${utterance.speaker}: ${utterance.text}`);
}
// Speaker A: Welcome to the meeting
// Speaker B: Thanks for having me
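
When a provider returns per-word speaker IDs instead of ready-made utterances (as in the Deepgram response above), consecutive words can be merged into speaker turns on the client. This is a minimal sketch, assuming `words` entries of the form `{ word, speaker, start, end }`. The function name `wordsToTurns` and the sample data are illustrative.

```javascript
// Collapse consecutive words from the same speaker into one turn.
function wordsToTurns(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      last.text += ' ' + w.word; // same speaker: extend the current turn
      last.end = w.end;
    } else {
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}

// Hypothetical sample input (not real API output)
const diarizedWords = [
  { word: 'Welcome', speaker: 0, start: 0.0, end: 0.4 },
  { word: 'to', speaker: 0, start: 0.4, end: 0.5 },
  { word: 'the', speaker: 0, start: 0.5, end: 0.6 },
  { word: 'meeting', speaker: 0, start: 0.6, end: 1.1 },
  { word: 'Thanks', speaker: 1, start: 1.4, end: 1.8 },
  { word: 'for', speaker: 1, start: 1.8, end: 1.9 },
  { word: 'having', speaker: 1, start: 1.9, end: 2.2 },
  { word: 'me', speaker: 1, start: 2.2, end: 2.4 }
];

for (const turn of wordsToTurns(diarizedWords)) {
  console.log(`Speaker ${turn.speaker} [${turn.start.toFixed(1)}s-${turn.end.toFixed(1)}s]: ${turn.text}`);
}
```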

Accuracy Benchmarks

Provider	Model	Word Error Rate (WER)	Notes
Deepgram	Nova-3	5.26%	Best published WER. 47% lower than competitors (batch)
AssemblyAI	Universal-2	<5%	Consistent sub-5% across diverse audio conditions
OpenAI	GPT-4o Transcribe	~5-7%*	Highest accuracy among OpenAI models. No official WER
Rev AI	Reverb v2	~7-8%*	32% error reduction over v1. Strong across accents
Google Cloud	Chirp 3	~11.6%	Behind specialized STT providers on English benchmarks
Amazon	Transcribe	~8-12%*	92%+ F1 across 65 demographic groups. No official WER

* Estimated from independent benchmarks. Official WER not published by provider. Actual accuracy varies by audio quality, accent, domain, and noise level. Lower WER = better accuracy.
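
Because published WER depends heavily on the test audio, it is worth measuring on your own recordings. WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below is one straightforward implementation; the `tokenize` normalization (lowercasing, stripping punctuation) is an assumption, and benchmark suites normalize text in different ways.

```javascript
// Lowercase, strip punctuation (keeping apostrophes), and split on whitespace.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, '')
    .split(/\s+/)
    .filter(Boolean);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with word-level Levenshtein distance using two rolling rows.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new Error('reference must contain at least one word');

  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

const wer = wordErrorRate(
  'The quick brown fox jumps.',
  'the quick brown box jumps over'
);
console.log(`WER: ${(wer * 100).toFixed(1)}%`);
```

In this example the hypothesis has one substitution ("box" for "fox") and one insertion ("over") against five reference words, so the WER is 2/5, or 40%.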

Which API Should You Choose?

Cheapest English Transcription

English-only batch transcription at the lowest possible cost. Perfect for podcasts, recordings, and media archives.

Pick: Rev AI Reverb Turbo ($0.0017/min)

Best Overall Production API

Best accuracy, lowest latency, generous free tier. Purpose-built for real-time production use cases.

Pick: Deepgram Nova-3 ($200 free, 5.26% WER)

Richest Feature Set

Need sentiment, topics, summaries, entity detection on top of transcription? Best Audio Intelligence add-ons.

Pick: AssemblyAI Universal-2 (99 langs + add-ons)

Most Languages Supported

Global apps needing 100+ languages with enterprise compliance, data residency, and encryption.

Pick: Google Cloud (125+ languages, CMEK)

Massive Volume (5M+ min/mo)

At extreme scale, volume discounts make the biggest difference. Best tier pricing for enterprise workloads.

Pick: Amazon Transcribe ($0.0078/min at 5M+)

AI + Human Hybrid

Need AI speed for most content and 99%+ human accuracy for critical transcription through one API.

Pick: Rev AI ($0.003/min AI + $1.99/min human)

Frequently Asked Questions

What is the cheapest speech-to-text API?

Rev AI Reverb Turbo is the cheapest at $0.0017/min ($0.10/hr) for English-only batch transcription. Among multilingual options, OpenAI GPT-4o Mini Transcribe is cheapest at $0.003/min ($0.18/hr) with 99-language support, followed by Deepgram Nova-3 at $0.0043/min ($0.26/hr). At high volume (5M+ minutes/month), Amazon Transcribe's marginal rate drops to $0.0078/min.

Whisper vs Deepgram: which is better?

Deepgram Nova-3 is better for production use with lower latency (sub-300ms vs batch-only for Whisper), lower cost ($0.0043/min vs $0.006/min), and published accuracy benchmarks (5.26% WER). Whisper is better if you need the open-source model to self-host, want 99-language support, or are already in the OpenAI ecosystem. GPT-4o Transcribe improves on Whisper with streaming and diarization.

Which speech-to-text API has the best free tier?

Deepgram offers $200 in free credits (no expiry, no credit card) covering ~45,000 minutes. AssemblyAI gives $50 in one-time credits (~135 hours at the $0.37/hr batch rate). Rev AI offers 5 free hours. Google Cloud gives 60 min/month free ongoing plus $300 in general cloud credits. OpenAI gives $5 credits (~833 minutes). Amazon offers 60 min/month free for 12 months.

Which API supports the most languages?

Google Cloud Speech-to-Text supports 125+ languages and locales. OpenAI Whisper and AssemblyAI both support 99 languages. Amazon Transcribe supports 100+. Rev AI supports 58+. Deepgram supports 36+ but is expanding. For streaming, language support is more limited than batch mode.

Can I get real-time streaming transcription?

Yes, all six providers support real-time streaming. Deepgram has the lowest latency at sub-300ms. OpenAI GPT-4o Transcribe streams via WebSocket. AssemblyAI streams in 6 languages. Google uses StreamingRecognize. Amazon supports both batch and streaming. Rev AI streams with sub-second latency in 9 languages.

Which API is best for medical transcription?

Amazon Transcribe Medical ($0.075/min) is purpose-built for healthcare and HIPAA-eligible. Deepgram Nova-3 Medical is another option for clinical documentation. Google Cloud offers BAA agreements for HIPAA compliance. For critical medical notes, Rev AI's human transcription ($1.99/min, 99%+ accuracy) may be worth the cost.

What is speaker diarization?

Speaker diarization identifies who is speaking in multi-speaker audio. All six providers support it. OpenAI includes it free via gpt-4o-transcribe-diarize. Deepgram includes it free for batch. AssemblyAI charges $0.02/hr. Google Chirp 3 and Amazon include it free. Rev AI enables it by default at no cost.
