Best Speech-to-Text API Comparison 2026

Compare 6 transcription APIs side-by-side. Pricing per minute, accuracy benchmarks, real-time streaming, speaker diarization, and free tiers.

$200 Free Credits · 99+ Languages · Real-time Streaming · Speaker Diarization · From $0.0017/min

How Speech-to-Text APIs Work

1. Upload Audio
2. Audio Processing
3. Speech Recognition
4. Post-Processing
5. Structured Transcript

Audio is uploaded (or streamed in real-time), processed by AI models that convert speech patterns to text, then enhanced with punctuation, speaker labels, timestamps, and optional features like sentiment analysis or summarization.

What Developers Build with Speech-to-Text APIs

🎤

Meeting Transcription

Transcribe meetings, calls, and conferences in real-time with speaker diarization. Zoom, Teams, and Google Meet integrations.

🎧

Podcast & Media

Generate transcripts for podcasts, YouTube videos, and audio content. Improve SEO and accessibility with searchable text.

📞

Call Center Analytics

Transcribe support calls, detect sentiment, extract topics, and monitor agent performance. HIPAA-compliant options for healthcare.

🗣️

Voice Assistants

Power voice interfaces with sub-300ms streaming transcription. Build voice-controlled apps, smart speakers, and accessibility tools.

🎬

Subtitles & Captions

Auto-generate subtitles for video content in 99+ languages. SRT/VTT export with precise word-level timestamps.

🤖

AI Agent Integration

Give AI agents the ability to hear and process audio. Real-time voice-to-action pipelines for customer service and workflow automation.
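
For the subtitles use case above, word-level timestamps can be turned into SRT cues with a few lines of code. The following is a minimal sketch, assuming each word arrives as `{ word, start, end }` with times in seconds (the shape used in the response examples later in this article). The helper names `toSrtTime` and `wordsToSrt`, and the fixed words-per-cue grouping, are illustrative choices rather than any provider's API.

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Group words into numbered cues of at most maxWords words each (hypothetical rule;
// real captioning usually also splits on pauses and line length).
function wordsToSrt(words, maxWords = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const chunk = words.slice(i, i + maxWords);
    const start = toSrtTime(chunk[0].start);
    const end = toSrtTime(chunk[chunk.length - 1].end);
    const text = chunk.map((w) => w.word).join(' ');
    cues.push(`${cues.length + 1}\n${start} --> ${end}\n${text}\n`);
  }
  return cues.join('\n');
}

const sampleWords = [
  { word: 'Welcome', start: 0.0, end: 0.5 },
  { word: 'to', start: 0.5, end: 0.62 },
  { word: 'the', start: 0.62, end: 0.75 },
  { word: 'meeting', start: 0.75, end: 1.3 },
];
console.log(wordsToSrt(sampleWords, 2));
```

WebVTT output differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.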

Feature Comparison Table

Feature OpenAI Whisper Deepgram AssemblyAI Google Cloud Amazon Transcribe Rev AI
Free Tier $5 credits $200 credits $50 credits 60 min/mo + $300 60 min/mo (12mo) 5 hours
Price/Min (Batch) $0.006 $0.0043 $0.0061 $0.016 $0.024 $0.003
Price/Min (Stream) $0.006 $0.0077 $0.0061 $0.016 $0.024 $0.003
Languages 99 36+ 99 125+ 100+ 58+
Real-time Streaming ✓ ✓ ✓ (6 langs) ✓ ✓ ✓ (9 langs)
Speaker Diarization ✓ Free ✓ Free (batch) ✓ $0.02/hr ✓ Free ✓ Free ✓ Free
Custom Vocabulary ✓ 100 terms ✓ 200 terms
Sentiment Analysis ✓ Add-on ✓ Add-on
Summarization ✓ Add-on ✓ Add-on
Content Redaction
Medical Model ✓ HIPAA
Code-switching ✓ 10 langs
Open Source Model ✓ Whisper ✓ Reverb
Human Transcription ✓ $1.99/min

Pricing at a Glance

Rev AI (Turbo): $0.10 per hour (English), 5 hours free
OpenAI (Mini): $0.18 per hour (99 langs), $5 free credits
Deepgram: $0.26 per hour (batch), $200 free credits
AssemblyAI: $0.37 per hour (batch), $50 free credits
Google Cloud: $0.96 per hour (Chirp 3), 60 min/mo free
Amazon Transcribe: $1.44 per hour (tier 1), 60 min/mo free (12mo)

Provider Deep Dive

OpenAI Whisper / GPT-4o Transcribe

$0.003-0.006/min
The most recognizable name in AI. GPT-4o Transcribe builds on Whisper with streaming, diarization, and multimodal understanding. 99-language support with flat-rate pricing.
Pros:
  • 99 languages at flat rate — no per-language pricing
  • GPT-4o Mini at $0.003/min is very competitive
  • Diarization included free (gpt-4o-transcribe-diarize model)
  • Open-source Whisper model for self-hosting
  • Massive ecosystem integration
Cons:
  • No volume discounts — flat rate at any scale
  • $5 free tier is smallest of all providers
  • No custom vocabulary support
  • 25MB file size limit on whisper-1
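
The 25MB limit on whisper-1 means long recordings have to be split before upload. Below is a rough planning sketch, assuming constant-bitrate audio and reading "25MB" as 25 × 1024 × 1024 bytes (both are assumptions; variable-bitrate files and container overhead will shift the numbers). `planChunks` is a hypothetical helper, not part of any SDK.

```javascript
// Assumption: the limit is 25 MiB; lower this value to leave extra headroom.
const WHISPER_LIMIT_BYTES = 25 * 1024 * 1024;

// Estimate how many pieces a constant-bitrate recording must be split into
// so that each piece stays under the upload limit.
function planChunks(durationSec, bitrateKbps, limitBytes = WHISPER_LIMIT_BYTES) {
  const bytesPerSec = (bitrateKbps * 1000) / 8;
  // Longest chunk (in whole seconds) that still fits under the limit
  const maxChunkSec = Math.floor(limitBytes / bytesPerSec);
  const chunkCount = Math.ceil(durationSec / maxChunkSec);
  return { estimatedBytes: durationSec * bytesPerSec, maxChunkSec, chunkCount };
}

// A 2-hour MP3 at 128 kbps
console.log(planChunks(2 * 60 * 60, 128));
```

Cutting on silence rather than at fixed offsets usually avoids splitting a word across two chunks.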

Deepgram Nova-3

$0.0043/min
Purpose-built for production speech recognition. Sub-300ms streaming latency, industry-leading accuracy (5.26% WER), and the most generous free tier at $200 in credits.
Pros:
  • $200 in free credits (no expiry, no card), roughly 45K minutes
  • Best published accuracy: 5.26% WER (Nova-3)
  • Sub-300ms streaming latency (Deepgram Flux even lower)
  • Code-switching between 10 languages in real-time
  • Medical model (Nova-3 Medical) for healthcare
Cons:
  • Only 36+ languages (vs 99-125+ competitors)
  • Streaming costs ~79% more than batch ($0.0077 vs $0.0043)
  • Growth tier requires $4K annual prepay
  • No sentiment analysis or summarization features

AssemblyAI Universal-2

$0.0061/min
Richest feature set of any STT provider. Audio Intelligence add-ons for sentiment, entities, topics, summaries, and content moderation. 99-language support at flat rate.
Pros:
  • 99 languages at same flat rate, auto language detection
  • Rich Audio Intelligence: sentiment, entities, topics, summaries
  • Sub-5% WER consistently across audio conditions
  • 200 custom key terms for vocabulary boosting
  • 64% fewer speaker counting errors (latest diarization)
Cons:
  • Add-on costs stack up (diarization $0.02/hr + sentiment + entities)
  • Streaming limited to 6 languages only
  • $50 free credits are one-time, not recurring
  • Streaming billed by connection time, not audio time

Google Cloud Speech-to-Text

$0.016/min
Broadest language support (125+) with enterprise-grade compliance. Chirp 3 model with auto language detection. Only provider whose free monthly minutes never expire.
Pros:
  • 125+ languages — widest coverage
  • 60 minutes/month free (ongoing, every month)
  • Enterprise: data residency, customer-managed encryption keys
  • Deep GCP ecosystem (BigQuery, Cloud Functions, etc.)
  • Diarization included free with Chirp 3
Cons:
  • Second-highest list price at $0.016/min (only Amazon's $0.024/min is higher)
  • Higher WER (~11.6%) than Deepgram or AssemblyAI
  • Requires Google Cloud account setup
  • Complex pricing page and billing

Amazon Transcribe

$0.024/min
Best volume discounts (drops to $0.0078/min at 5M+ min). HIPAA-eligible medical transcription. Deep AWS integration with 100+ language support.
Pros:
  • Best volume discounts: $0.0078/min at 5M+ minutes
  • HIPAA-eligible medical transcription ($0.075/min)
  • 98%+ diarization accuracy, content redaction built-in
  • Call Analytics for contact centers
  • 100+ languages, 1-second billing increments
Cons:
  • Most expensive at low volume ($0.024/min)
  • Free tier expires after 12 months
  • Complex IAM/AWS account setup required
  • Medical model very expensive ($0.075/min)

Rev AI

$0.003/min
Cheapest English-only transcription (Reverb at $0.003/min, Turbo at $0.0017/min). Only provider offering both AI and human transcription through a single API.
Pros:
  • Cheapest option: Reverb Turbo at $0.0017/min
  • Unique AI + human transcription hybrid ($1.99/min human)
  • Open-source Reverb ASR and diarization models
  • Forced alignment for existing transcripts
  • Streaming with sub-second latency
Cons:
  • Cheapest model is English-only
  • Only 58+ languages (fewer than competitors)
  • Small free tier (5 hours one-time)
  • Streaming limited to 9 languages
  • 15-second minimum per request

Batch vs Streaming: When to Use Each

| Aspect | Batch (Pre-recorded) | Real-time Streaming |
| --- | --- | --- |
| Latency | Seconds to minutes (depends on file size) | Sub-300ms (Deepgram) to ~1s |
| Cost | Cheaper (e.g., Deepgram: $0.0043/min) | More expensive (e.g., Deepgram: $0.0077/min) |
| Use Cases | Podcasts, recorded meetings, media archives | Live captions, voice assistants, call centers |
| Language Support | Full (all languages supported) | Limited (6-9 langs for some providers) |
| Features | Full feature set (diarization, summaries, etc.) | Limited features (basic transcription + interim results) |
| File Size Limits | Varies (25MB-2GB depending on provider) | No file limit (continuous stream) |
| Best Provider | Rev AI ($0.003/min) or Deepgram ($0.0043/min) | Deepgram (sub-300ms) or OpenAI (WebSocket) |

Cost at Scale (Batch Transcription, per Month)

| Volume | OpenAI (Mini) | Deepgram | AssemblyAI | Google Cloud | Amazon | Rev AI (Reverb) |
| --- | --- | --- | --- | --- | --- | --- |
| 100 hours | $18 | $26 | $37 | $96 | $144 | $18 |
| 1,000 hours | $180 | $258 | $366 | $960 | $1,440 | $180 |
| 10,000 hours | $1,800 | $2,580 | $3,660 | $9,600 | $11,250* | $1,800 |
| 100,000 hours | $18,000 | $25,800 | $36,600 | $96,000 | $65,850* | $18,000 |

* Amazon Transcribe pricing uses volume tiers: $0.024 (0-250K min), $0.015 (250K-1M), $0.0102 (1M-5M), $0.0078 (5M+). Starred figures apply each rate only to the minutes that fall inside its tier, so Amazon's effective per-minute rate keeps falling as volume grows. Rev AI prices shown for Reverb ($0.003/min, English only). OpenAI Mini is $0.003/min.
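
The tiered arithmetic can be checked with a short script. This sketch assumes graduated billing, meaning each rate applies only to the minutes that fall inside its tier. That is an assumption about how the tiers combine, and it changes the result noticeably at high volume. The tier table is copied from the footnote; `graduatedCost` and `flatCost` are illustrative helper names.

```javascript
// Tier upper bounds (minutes per month) and USD-per-minute rates from the footnote.
const AMAZON_TIERS = [
  { upTo: 250_000, rate: 0.024 },
  { upTo: 1_000_000, rate: 0.015 },
  { upTo: 5_000_000, rate: 0.0102 },
  { upTo: Infinity, rate: 0.0078 },
];

// Graduated billing: charge each tier's rate only for the minutes inside that tier.
function graduatedCost(minutes, tiers) {
  let cost = 0;
  let tierStart = 0;
  for (const { upTo, rate } of tiers) {
    if (minutes <= tierStart) break;
    cost += (Math.min(minutes, upTo) - tierStart) * rate;
    tierStart = upTo;
  }
  return cost;
}

// Flat-rate providers charge the same rate for every minute.
const flatCost = (minutes, rate) => minutes * rate;

for (const hours of [100, 1_000, 10_000, 100_000]) {
  const minutes = hours * 60;
  console.log(
    `${hours} h: Amazon $${graduatedCost(minutes, AMAZON_TIERS).toFixed(2)}, ` +
      `Deepgram $${flatCost(minutes, 0.0043).toFixed(2)}`
  );
}
```

Under this assumption, 10,000 hours (600,000 minutes) works out to 250,000 × $0.024 + 350,000 × $0.015 = $11,250.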

Code Examples

Transcribe Audio File

# OpenAI Whisper / GPT-4o Transcribe
# Word timestamps need whisper-1 with verbose_json. The gpt-4o-transcribe models
# return json or text only, so drop the last two fields when switching models.
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@meeting.mp3" \
  -F model="whisper-1" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"
// Node.js
import fs from 'node:fs';
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const transcript = await openai.audio.transcriptions.create({
  file: fs.createReadStream('meeting.mp3'),
  model: 'whisper-1',
  response_format: 'verbose_json',
  timestamp_granularities: ['word']
});
console.log(transcript.text);
# Deepgram Nova-3
curl -X POST "https://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/mp3" \
  --data-binary @meeting.mp3
// Node.js
import fs from 'node:fs';
import { createClient } from '@deepgram/sdk';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// transcribeFile resolves to { result, error } instead of throwing
const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
  fs.readFileSync('meeting.mp3'),
  { model: 'nova-3', smart_format: true, diarize: true }
);
if (error) throw error;
console.log(result.results.channels[0].alternatives[0].transcript);
# AssemblyAI Universal-2
# Step 1: Upload file
curl -X POST "https://api.assemblyai.com/v2/upload" \
  -H "authorization: $ASSEMBLYAI_API_KEY" \
  --data-binary @meeting.mp3

# Step 2: Create transcription (use upload_url from step 1)
curl -X POST "https://api.assemblyai.com/v2/transcript" \
  -H "authorization: $ASSEMBLYAI_API_KEY" \
  -H "content-type: application/json" \
  -d '{"audio_url": "https://...", "speaker_labels": true}'
// Node.js
import { AssemblyAI } from 'assemblyai';
const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY });

const transcript = await client.transcripts.transcribe({
  audio: './meeting.mp3',
  speaker_labels: true,
  language_detection: true
});
console.log(transcript.text);
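# Rev AI Reverb
# Submit the file as multipart form data (field name: media)
curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
  -H "Authorization: Bearer $REVAI_ACCESS_TOKEN" \
  -F "media=@meeting.mp3"

# Check status (repeat until "status" is "transcribed")
curl "https://api.rev.ai/speechtotext/v1/jobs/{id}" \
  -H "Authorization: Bearer $REVAI_ACCESS_TOKEN"

# Get transcript
curl "https://api.rev.ai/speechtotext/v1/jobs/{id}/transcript" \
  -H "Authorization: Bearer $REVAI_ACCESS_TOKEN" \
  -H "Accept: text/plain"
// Node.js
import { RevAiApiClient } from 'revai-node-sdk';
const client = new RevAiApiClient(process.env.REVAI_ACCESS_TOKEN);

const job = await client.submitJobLocalFile('./meeting.mp3');
// Poll until the job leaves the in_progress state
let details = await client.getJobDetails(job.id);
while (details.status === 'in_progress') {
  await new Promise((resolve) => setTimeout(resolve, 5000));
  details = await client.getJobDetails(job.id);
}
if (details.status !== 'transcribed') {
  throw new Error(`Job ended with status ${details.status}`);
}
const transcript = await client.getTranscriptText(job.id);
console.log(transcript);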
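
Asynchronous job APIs such as the AssemblyAI and Rev AI flows above return before the transcript is ready, so the client has to poll (or register a webhook). Here is a minimal, provider-agnostic polling sketch with a timeout. `pollUntil` and the `{ done, value }` contract are hypothetical conventions for this example, and the demo uses a fake job instead of a network call.

```javascript
// Call check() repeatedly until it reports done, sleeping intervalMs between attempts.
// check must resolve to { done: boolean, value?: any }.
async function pollUntil(check, { intervalMs = 2000, timeoutMs = 600000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = await check();
    if (result.done) return result.value;
    // Give up if the next sleep would run past the deadline
    if (Date.now() + intervalMs > deadline) {
      throw new Error('Timed out waiting for job');
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Demo: a fake job that reports completion on the third status check.
let calls = 0;
const fakeCheck = async () => {
  calls += 1;
  return calls >= 3 ? { done: true, value: 'transcribed' } : { done: false };
};

pollUntil(fakeCheck, { intervalMs: 10, timeoutMs: 1000 }).then((status) => {
  console.log(`${status} after ${calls} checks`);
});
```

With a real SDK, `check` would wrap the provider's job-status call and map its status field onto `done`.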

Real-time Streaming

// Deepgram real-time streaming (Node.js)
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const connection = deepgram.listen.live({
  model: 'nova-3',
  language: 'en',
  smart_format: true,
  interim_results: true
});

connection.on(LiveTranscriptionEvents.Transcript, (data) => {
  const transcript = data.channel.alternatives[0].transcript;
  if (transcript) console.log('Live:', transcript);
});

// Wait for the socket to open before sending audio chunks from a microphone or stream
// (audioBuffer is a placeholder for your own audio source)
connection.on(LiveTranscriptionEvents.Open, () => {
  connection.send(audioBuffer);
});
// OpenAI GPT-4o Transcribe streaming (WebSocket, Node.js with the 'ws' package)
import WebSocket from 'ws';
const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe', {
  headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` }
});

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { input_audio_transcription: { model: 'gpt-4o-transcribe' } }
  }));
});

ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    console.log('Live:', event.transcript);
  }
});

// Send audio as base64 input_audio_buffer.append events, but only once the socket is open
function sendAudio(base64Audio) {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Audio }));
  }
}
// AssemblyAI real-time streaming (Node.js)
import { RealtimeTranscriber } from 'assemblyai';

const transcriber = new RealtimeTranscriber({
  apiKey: process.env.ASSEMBLYAI_API_KEY,
  sampleRate: 16_000
});

transcriber.on('transcript', (transcript) => {
  if (transcript.message_type === 'FinalTranscript') {
    console.log('Live:', transcript.text);
  }
});

await transcriber.connect();

// Stream audio from microphone
recorder.on('data', (audio) => {
  transcriber.sendAudio(audio);
});

Speaker Diarization

# OpenAI GPT-4o Transcribe with speaker diarization (no extra cost)
# Speaker labels are returned with response_format=diarized_json;
# chunking_strategy is needed for longer recordings.
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@meeting.mp3" \
  -F model="gpt-4o-transcribe-diarize" \
  -F response_format="diarized_json" \
  -F chunking_strategy="auto"

# Response includes speaker-labeled segments, roughly (exact fields may differ):
# { "text": "...",
#   "segments": [{"speaker": "A", "start": 0.0, "end": 0.5, "text": "Hello"}] }
# Deepgram with diarization (free for batch)
curl -X POST "https://api.deepgram.com/v1/listen?model=nova-3&diarize=true" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/mp3" \
  --data-binary @meeting.mp3

# Response includes per-word speaker IDs:
# "words": [{"word": "Hello", "speaker": 0, "start": 0.0, "end": 0.5}]
// AssemblyAI with speaker labels (+$0.02/hr)
const transcript = await client.transcripts.transcribe({
  audio: './meeting.mp3',
  speaker_labels: true,
  speakers_expected: 3  // optional hint
});

for (const utterance of transcript.utterances) {
  console.log(`Speaker ${utterance.speaker}: ${utterance.text}`);
}
// Speaker A: Welcome to the meeting
// Speaker B: Thanks for having me
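
When a provider returns a speaker ID per word rather than ready-made utterances (as in the Deepgram response above), consecutive words from the same speaker can be merged into turns. The sketch below assumes a word shape of `{ word, speaker, start, end }`. `groupBySpeaker` is an illustrative helper, not an SDK function.

```javascript
// Merge consecutive words that share a speaker ID into speaker turns.
function groupBySpeaker(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      // Same speaker is still talking: extend the current turn
      last.text += ' ' + w.word;
      last.end = w.end;
    } else {
      // Speaker changed (or first word): start a new turn
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}

const diarizedWords = [
  { word: 'Welcome', speaker: 0, start: 0.0, end: 0.5 },
  { word: 'everyone', speaker: 0, start: 0.5, end: 1.1 },
  { word: 'Thanks', speaker: 1, start: 1.4, end: 1.8 },
  { word: 'Great', speaker: 0, start: 2.0, end: 2.4 },
];

for (const turn of groupBySpeaker(diarizedWords)) {
  console.log(`Speaker ${turn.speaker}: ${turn.text}`);
}
```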

Accuracy Benchmarks

| Provider | Model | Word Error Rate (WER) | Notes |
| --- | --- | --- | --- |
| Deepgram | Nova-3 | 5.26% | Best published WER. 47% lower than competitors (batch) |
| AssemblyAI | Universal-2 | <5% | Consistent sub-5% across diverse audio conditions |
| OpenAI | GPT-4o Transcribe | ~5-7%* | Highest accuracy among OpenAI models. No official WER |
| Rev AI | Reverb v2 | ~7-8%* | 32% error reduction over v1. Strong across accents |
| Google Cloud | Chirp 3 | ~11.6% | Behind specialized STT providers on English benchmarks |
| Amazon | Transcribe | ~8-12%* | 92%+ F1 across 65 demographic groups. No official WER |

* Estimated from independent benchmarks. Official WER not published by provider. Actual accuracy varies by audio quality, accent, domain, and noise level. Lower WER = better accuracy.
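
Because accuracy depends so heavily on the audio, it is worth measuring WER on your own recordings. WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below is a minimal implementation. The normalization in `tokenize` (lowercasing, stripping punctuation) is an assumption, and published benchmarks may normalize differently.

```javascript
// Lowercase, drop punctuation (keeping apostrophes), and split on whitespace.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}'\s]/gu, '')
    .split(/\s+/)
    .filter(Boolean);
}

// Word error rate: word-level Levenshtein distance / reference length.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new Error('reference must contain at least one word');

  // prev[j] holds the edit distance between the ref words seen so far and hyp[0..j)
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution ("the" -> "a") and one deletion ("everyone") over 5 reference words
console.log(wordErrorRate('Welcome to the meeting, everyone.', 'welcome to a meeting')); // 0.4
```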

Which API Should You Choose?

Cheapest English Transcription

English-only batch transcription at the lowest possible cost. Perfect for podcasts, recordings, and media archives.

Pick: Rev AI Reverb Turbo ($0.0017/min)

Best Overall Production API

Best accuracy, lowest latency, generous free tier. Purpose-built for real-time production use cases.

Pick: Deepgram Nova-3 ($200 free, 5.26% WER)

Richest Feature Set

Need sentiment, topics, summaries, entity detection on top of transcription? Best Audio Intelligence add-ons.

Pick: AssemblyAI Universal-2 (99 langs + add-ons)

Most Languages Supported

Global apps needing 100+ languages with enterprise compliance, data residency, and encryption.

Pick: Google Cloud (125+ languages, CMEK)

Massive Volume (5M+ min/mo)

At extreme scale, volume discounts make the biggest difference. Best tier pricing for enterprise workloads.

Pick: Amazon Transcribe ($0.0078/min at 5M+)

AI + Human Hybrid

Need AI speed for most content and 99%+ human accuracy for critical transcription through one API.

Pick: Rev AI ($0.003/min AI + $1.99/min human)

Frequently Asked Questions

What is the cheapest speech-to-text API?
Rev AI Reverb Turbo is the cheapest at $0.0017/min ($0.10/hr) for English-only batch transcription. OpenAI GPT-4o Mini Transcribe is the cheapest multilingual option at $0.003/min ($0.18/hr) with 99-language support, followed by Deepgram Nova-3 at $0.0043/min ($0.26/hr). At high volume (5M+ minutes/month), Amazon Transcribe drops to $0.0078/min.
Whisper vs Deepgram: which is better?
Deepgram Nova-3 is better for production use with lower latency (sub-300ms vs batch-only for Whisper), lower cost ($0.0043/min vs $0.006/min), and published accuracy benchmarks (5.26% WER). Whisper is better if you need the open-source model to self-host, want 99-language support, or are already in the OpenAI ecosystem. GPT-4o Transcribe improves on Whisper with streaming and diarization.
Which speech-to-text API has the best free tier?
Deepgram offers $200 in free credits (no expiry, no credit card) covering ~45,000 minutes. AssemblyAI gives $50 in one-time credits (~137 hours at $0.0061/min). Rev AI offers 5 free hours. Google Cloud gives 60 min/month free ongoing plus $300 in general cloud credits. OpenAI gives $5 credits (~833 minutes). Amazon offers 60 min/month free for 12 months.
Which API supports the most languages?
Google Cloud Speech-to-Text supports 125+ languages and locales. OpenAI Whisper and AssemblyAI both support 99 languages. Amazon Transcribe supports 100+. Rev AI supports 58+. Deepgram supports 36+ but is expanding. For streaming, language support is more limited than batch mode.
Can I get real-time streaming transcription?
Yes, all six providers support real-time streaming. Deepgram has the lowest latency at sub-300ms. OpenAI GPT-4o Transcribe streams via WebSocket. AssemblyAI streams in 6 languages. Google uses StreamingRecognize. Amazon supports both batch and streaming. Rev AI streams with sub-second latency in 9 languages.
Which API is best for medical transcription?
Amazon Transcribe Medical ($0.075/min) is purpose-built for healthcare and HIPAA-eligible. Deepgram Nova-3 Medical is another option for clinical documentation. Google Cloud offers BAA agreements for HIPAA compliance. For critical medical notes, Rev AI's human transcription ($1.99/min, 99%+ accuracy) may be worth the cost.
What is speaker diarization?
Speaker diarization identifies who is speaking in multi-speaker audio. All six providers support it. OpenAI includes it free via gpt-4o-transcribe-diarize. Deepgram includes it free for batch. AssemblyAI charges $0.02/hr. Google Chirp 3 and Amazon include it free. Rev AI enables it by default at no cost.