Compare 6 transcription APIs side-by-side. Pricing per minute, accuracy benchmarks, real-time streaming, speaker diarization, and free tiers.
Audio is uploaded (or streamed in real time), processed by AI models that convert speech patterns to text, then enhanced with punctuation, speaker labels, timestamps, and optional features like sentiment analysis or summarization.
Transcribe meetings, calls, and conferences in real time with speaker diarization. Zoom, Teams, and Google Meet integrations.
Generate transcripts for podcasts, YouTube videos, and audio content. Improve SEO and accessibility with searchable text.
Transcribe support calls, detect sentiment, extract topics, and monitor agent performance. HIPAA-compliant options for healthcare.
Power voice interfaces with sub-300ms streaming transcription. Build voice-controlled apps, smart speakers, and accessibility tools.
Auto-generate subtitles for video content in 99+ languages. SRT/VTT export with precise word-level timestamps.
Give AI agents the ability to hear and process audio. Real-time voice-to-action pipelines for customer service and workflow automation.
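For the subtitle use case, word-level timestamps can be grouped into SRT cues. The following is a minimal sketch, not any provider's export feature. It assumes a `words` array of `{ word, start, end }` objects with times in seconds; the function names `formatSrtTime` and `wordsToSrt` and the seven-words-per-cue default are illustrative choices.

```javascript
// Format a time in seconds as the SRT timestamp HH:MM:SS,mmm.
function formatSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Group consecutive words into numbered SRT cues of at most maxWordsPerCue words.
function wordsToSrt(words, maxWordsPerCue = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += maxWordsPerCue) {
    const group = words.slice(i, i + maxWordsPerCue);
    const start = formatSrtTime(group[0].start);
    const end = formatSrtTime(group[group.length - 1].end);
    const text = group.map((w) => w.word).join(' ');
    cues.push(`${cues.length + 1}\n${start} --> ${end}\n${text}`);
  }
  return cues.length ? cues.join('\n\n') + '\n' : '';
}

const sample = [
  { word: 'Hello', start: 0.0, end: 0.5 },
  { word: 'world', start: 0.5, end: 1.2 },
];
console.log(wordsToSrt(sample));
```

A fixed word count per cue keeps the sketch short. Production subtitles usually also break on punctuation and cap the cue duration.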
| Feature | OpenAI Whisper | Deepgram | AssemblyAI | Google Cloud | Amazon Transcribe | Rev AI |
|---|---|---|---|---|---|---|
| Free Tier | $5 credits | $200 credits | $50 credits | 60 min/mo + $300 | 60 min/mo (12mo) | 5 hours |
| Price/Min (Batch) | $0.006 | $0.0043 | $0.0061 | $0.016 | $0.024 | $0.003 |
| Price/Min (Stream) | $0.006 | $0.0077 | $0.0061 | $0.016 | $0.024 | $0.003 |
| Languages | 99 | 36+ | 99 | 125+ | 100+ | 58+ |
| Real-time Streaming | ✓ | ✓ | ✓ (6 langs) | ✓ | ✓ | ✓ (9 langs) |
| Speaker Diarization | ✓ Free | ✓ Free (batch) | ✓ $0.02/hr | ✓ Free | ✓ Free | ✓ Free |
| Custom Vocabulary | ✗ | ✓ 100 terms | ✓ 200 terms | ✓ | ✓ | ✓ |
| Sentiment Analysis | ✗ | ✗ | ✓ Add-on | ✗ | ✗ | ✓ Add-on |
| Summarization | ✗ | ✗ | ✓ Add-on | ✗ | ✗ | ✓ Add-on |
| Content Redaction | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ |
| Medical Model | ✗ | ✓ | ✗ | ✗ | ✓ HIPAA | ✗ |
| Code-switching | ✗ | ✓ 10 langs | ✓ | ✓ | ✗ | ✗ |
| Open Source Model | ✓ Whisper | ✗ | ✗ | ✗ | ✗ | ✓ Reverb |
| Human Transcription | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ $1.99/min |
| Aspect | Batch (Pre-recorded) | Real-time Streaming |
|---|---|---|
| Latency | Seconds to minutes (depends on file size) | Sub-300ms (Deepgram) to ~1s |
| Cost | Cheaper (e.g., Deepgram: $0.0043/min) | More expensive (e.g., Deepgram: $0.0077/min) |
| Use Cases | Podcasts, recorded meetings, media archives | Live captions, voice assistants, call centers |
| Language Support | Full (all languages supported) | Limited (6-9 langs for some providers) |
| Features | Full feature set (diarization, summaries, etc.) | Limited features (basic transcription + interim results) |
| File Size Limits | Varies (25MB-2GB depending on provider) | No file limit (continuous stream) |
| Best Provider | Rev AI ($0.003/min) or Deepgram ($0.0043/min) | Deepgram (sub-300ms) or OpenAI (WebSocket) |
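The cost gap between the two modes matters most for mixed workloads. As a rough sketch, the Deepgram rates from the table above can be blended by the share of audio that is streamed. The function name `blendedMonthlyCost` and the `streamShare` parameter are hypothetical, and the rates are simply the table's list prices.

```javascript
// Per-minute rates copied from the comparison table (Deepgram column).
const BATCH_RATE = 0.0043;
const STREAM_RATE = 0.0077;

// Estimate monthly cost in dollars when a fraction of the audio is streamed.
function blendedMonthlyCost(hours, streamShare) {
  if (streamShare < 0 || streamShare > 1) {
    throw new RangeError('streamShare must be between 0 and 1');
  }
  const minutes = hours * 60;
  const streamCost = minutes * streamShare * STREAM_RATE;
  const batchCost = minutes * (1 - streamShare) * BATCH_RATE;
  // Round to cents to hide floating-point noise.
  return Math.round((streamCost + batchCost) * 100) / 100;
}

console.log(blendedMonthlyCost(1000, 0));    // 258
console.log(blendedMonthlyCost(1000, 0.25)); // 309
console.log(blendedMonthlyCost(1000, 1));    // 462
```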
| Volume | OpenAI (Mini) | Deepgram | AssemblyAI | Google Cloud | Amazon | Rev AI (Reverb) |
|---|---|---|---|---|---|---|
| 100 hours | $18 | $26 | $37 | $96 | $144 | $18 |
| 1,000 hours | $180 | $258 | $366 | $960 | $1,440 | $180 |
| 10,000 hours | $1,800 | $2,580 | $3,660 | $9,600 | $11,250* | $1,800 |
| 100,000 hours | $18,000 | $25,800 | $36,600 | $96,000 | $65,850* | $18,000 |
* Amazon Transcribe pricing uses graduated volume tiers: $0.024/min for the first 250K minutes, $0.015 for 250K-1M, $0.0102 for 1M-5M, and $0.0078 beyond 5M. Each rate applies only to the minutes inside its tier, so Amazon's effective rate falls as volume grows, although at the volumes shown it stays above the $0.003/min options. Rev AI prices shown are for Reverb ($0.003/min, English only). OpenAI Mini is $0.003/min.
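The tier boundaries in the footnote can be turned into a small calculator. This sketch assumes the tiers are graduated, meaning each rate applies only to the minutes that fall inside its tier. That is an assumption about billing, and applying a single rate to all minutes would give different totals. The names `AMAZON_TIERS` and `graduatedCost` are illustrative.

```javascript
// Tier boundaries (in minutes) and per-minute rates from the footnote.
const AMAZON_TIERS = [
  { upTo: 250_000, rate: 0.024 },
  { upTo: 1_000_000, rate: 0.015 },
  { upTo: 5_000_000, rate: 0.0102 },
  { upTo: Infinity, rate: 0.0078 },
];

// Sum the cost tier by tier, charging each rate only for minutes inside that tier.
function graduatedCost(minutes, tiers) {
  let total = 0;
  let floor = 0; // lower bound of the current tier
  for (const { upTo, rate } of tiers) {
    if (minutes <= floor) break;
    const minutesInTier = Math.min(minutes, upTo) - floor;
    total += minutesInTier * rate;
    floor = upTo;
  }
  return Math.round(total * 100) / 100; // round to cents
}

console.log(graduatedCost(100 * 60, AMAZON_TIERS));     // 144
console.log(graduatedCost(10_000 * 60, AMAZON_TIERS));  // 11250
console.log(graduatedCost(100_000 * 60, AMAZON_TIERS)); // 65850
```

For comparison, a flat $0.003/min provider costs $18,000 at 100,000 hours (6,000,000 minutes).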
# OpenAI Whisper / GPT-4o Transcribe
# Word-level timestamps need whisper-1 with verbose_json; gpt-4o-transcribe returns only json or text
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@meeting.mp3" \
  -F model="whisper-1" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"
// Node.js
import fs from 'fs';
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const transcript = await openai.audio.transcriptions.create({
  file: fs.createReadStream('meeting.mp3'),
  model: 'whisper-1', // for gpt-4o-transcribe, drop the two options below
  response_format: 'verbose_json',
  timestamp_granularities: ['word']
});
console.log(transcript.text);
# Deepgram Nova-3
curl -X POST "https://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true" \
-H "Authorization: Token $DEEPGRAM_API_KEY" \
-H "Content-Type: audio/mp3" \
--data-binary @meeting.mp3
// Node.js
import fs from 'fs';
import { createClient } from '@deepgram/sdk';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
  fs.readFileSync('meeting.mp3'),
  { model: 'nova-3', smart_format: true, diarize: true }
);
if (error) throw error;
console.log(result.results.channels[0].alternatives[0].transcript);
# AssemblyAI Universal-2
# Step 1: Upload file
curl -X POST "https://api.assemblyai.com/v2/upload" \
-H "authorization: $ASSEMBLYAI_API_KEY" \
--data-binary @meeting.mp3
# Step 2: Create transcription (use upload_url from step 1)
curl -X POST "https://api.assemblyai.com/v2/transcript" \
-H "authorization: $ASSEMBLYAI_API_KEY" \
-H "content-type: application/json" \
-d '{"audio_url": "https://...", "speaker_labels": true}'
// Node.js
import { AssemblyAI } from 'assemblyai';
const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY });
const transcript = await client.transcripts.transcribe({
audio: './meeting.mp3',
speaker_labels: true,
language_detection: true
});
console.log(transcript.text);
# Rev AI Reverb
# Local files are submitted as multipart form data in the "media" field
curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
  -H "Authorization: Bearer $REVAI_ACCESS_TOKEN" \
  -F "media=@meeting.mp3"
# Check status and get transcript
curl "https://api.rev.ai/speechtotext/v1/jobs/{id}/transcript" \
-H "Authorization: Bearer $REVAI_ACCESS_TOKEN" \
-H "Accept: text/plain"
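// Node.js
import { RevAiApiClient } from 'revai-node-sdk';
const client = new RevAiApiClient(process.env.REVAI_ACCESS_TOKEN);
const job = await client.submitJobLocalFile('./meeting.mp3');
// Poll every 5 seconds until the job leaves the in_progress state
let details = await client.getJobDetails(job.id);
while (details.status === 'in_progress') {
  await new Promise((resolve) => setTimeout(resolve, 5000));
  details = await client.getJobDetails(job.id);
}
if (details.status !== 'transcribed') throw new Error(`Job ${job.id} ended with status ${details.status}`);
const transcript = await client.getTranscriptText(job.id);
console.log(transcript);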
// Deepgram real-time streaming (Node.js)
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
const connection = deepgram.listen.live({
  model: 'nova-3',
  language: 'en',
  smart_format: true,
  interim_results: true
});
connection.on(LiveTranscriptionEvents.Transcript, (data) => {
  const transcript = data.channel.alternatives[0].transcript;
  if (transcript) console.log('Live:', transcript);
});
// Wait for the socket to open, then send audio chunks from a microphone or stream
// (audioBuffer is a placeholder for your own audio data)
connection.on(LiveTranscriptionEvents.Open, () => {
  connection.send(audioBuffer);
});
// OpenAI GPT-4o Transcribe streaming (WebSocket)
import WebSocket from 'ws';
const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe', {
  headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` }
});
ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { input_audio_transcription: { model: 'gpt-4o-transcribe' } }
  }));
  // Once the socket is open, send audio as base64 input_audio_buffer.append events
  // (base64Audio is a placeholder for your own encoded audio chunk)
  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Audio }));
});
ws.on('message', (data) => {
  const event = JSON.parse(data.toString());
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    console.log('Live:', event.transcript);
  }
});
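Audio has to be split into small frames before it is sent as `input_audio_buffer.append` events. Below is a minimal sketch of that framing step, assuming raw 16-bit mono PCM held in a Node.js `Buffer`. The 24 kHz sample rate, the 100 ms frame size, and the helper name `toAppendEvents` are assumptions for illustration, so check the session's configured audio format before relying on them.

```javascript
// Split raw 16-bit mono PCM into fixed-duration frames and wrap each frame
// in a JSON-encoded input_audio_buffer.append event with a base64 payload.
function toAppendEvents(pcm, sampleRate = 24000, frameMs = 100) {
  const samplesPerFrame = Math.floor((sampleRate * frameMs) / 1000);
  const bytesPerFrame = samplesPerFrame * 2; // 2 bytes per 16-bit sample
  const events = [];
  for (let offset = 0; offset < pcm.length; offset += bytesPerFrame) {
    const frame = pcm.subarray(offset, offset + bytesPerFrame);
    events.push(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: frame.toString('base64'),
    }));
  }
  return events;
}

// Example: 10,000 bytes of silence become two full frames and one short one.
const events = toAppendEvents(Buffer.alloc(10000));
console.log(events.length); // 3
```

Each returned string can be passed to `ws.send` after the socket has opened.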
// AssemblyAI real-time streaming (Node.js)
import { RealtimeTranscriber } from 'assemblyai';
const transcriber = new RealtimeTranscriber({
  apiKey: process.env.ASSEMBLYAI_API_KEY,
  sampleRate: 16_000
});
transcriber.on('transcript', (transcript) => {
  if (transcript.message_type === 'FinalTranscript') {
    console.log('Live:', transcript.text);
  }
});
await transcriber.connect();
// Stream audio from microphone
// (recorder is a placeholder for any source that emits 16 kHz PCM chunks)
recorder.on('data', (audio) => {
  transcriber.sendAudio(audio);
});
# OpenAI GPT-4o Transcribe with speaker diarization (no extra cost)
# Speaker labels are returned with response_format=diarized_json
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@meeting.mp3" \
  -F model="gpt-4o-transcribe-diarize" \
  -F response_format="diarized_json" \
  -F chunking_strategy="auto"
# Response includes speaker-labeled segments:
# { "text": "...",
# "segments": [{"speaker": "A", "text": "Hello", "start": 0.0, "end": 0.5}] }
# Deepgram with diarization (free for batch)
curl -X POST "https://api.deepgram.com/v1/listen?model=nova-3&diarize=true" \
-H "Authorization: Token $DEEPGRAM_API_KEY" \
-H "Content-Type: audio/mp3" \
--data-binary @meeting.mp3
# Response includes per-word speaker IDs:
# "words": [{"word": "Hello", "speaker": 0, "start": 0.0, "end": 0.5}]
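Per-word speaker IDs are usually more useful once merged into speaker turns. The sketch below assumes a `words` array shaped like the Deepgram response above (`word`, `speaker`, `start`, `end`); the helper name `groupBySpeaker` and the turn object layout are illustrative.

```javascript
// Merge consecutive words from the same speaker into one turn.
function groupBySpeaker(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      // Same speaker is still talking: extend the current turn.
      last.text += ' ' + w.word;
      last.end = w.end;
    } else {
      // Speaker changed (or first word): start a new turn.
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}

const words = [
  { word: 'Hello', speaker: 0, start: 0.0, end: 0.5 },
  { word: 'everyone', speaker: 0, start: 0.5, end: 1.0 },
  { word: 'Thanks', speaker: 1, start: 1.4, end: 1.8 },
];
for (const turn of groupBySpeaker(words)) {
  console.log(`Speaker ${turn.speaker}: ${turn.text}`);
}
// Speaker 0: Hello everyone
// Speaker 1: Thanks
```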
// AssemblyAI with speaker labels (+$0.02/hr)
const transcript = await client.transcripts.transcribe({
audio: './meeting.mp3',
speaker_labels: true,
speakers_expected: 3 // optional hint
});
for (const utterance of transcript.utterances) {
console.log(`Speaker ${utterance.speaker}: ${utterance.text}`);
}
// Speaker A: Welcome to the meeting
// Speaker B: Thanks for having me
| Provider | Model | Word Error Rate (WER) | Notes |
|---|---|---|---|
| Deepgram | Nova-3 | 5.26% | Vendor-reported. Deepgram cites a 47% lower WER than competitors (batch) |
| AssemblyAI | Universal-2 | <5% | Consistent sub-5% across diverse audio conditions |
| OpenAI | GPT-4o Transcribe | ~5-7%* | Highest accuracy among OpenAI models. No official WER |
| Rev AI | Reverb v2 | ~7-8%* | 32% error reduction over v1. Strong across accents |
| Google Cloud | Chirp 3 | ~11.6% | Behind specialized STT providers on English benchmarks |
| Amazon | Transcribe | ~8-12%* | 92%+ F1 across 65 demographic groups. No official WER |
* Estimated from independent benchmarks. Official WER not published by provider. Actual accuracy varies by audio quality, accent, domain, and noise level. Lower WER = better accuracy.
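WER is the word-level edit distance between a reference transcript and the API output, divided by the number of reference words. Here is a minimal sketch for measuring it on your own audio. It assumes lowercasing and whitespace tokenization, which is simpler than the text normalization that published benchmarks apply, and the name `wordErrorRate` is illustrative.

```javascript
// WER = (substitutions + deletions + insertions) / number of reference words.
function wordErrorRate(reference, hypothesis) {
  const tokenize = (s) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new Error('reference must contain at least one word');

  // prev[j] holds the edit distance between the reference prefix processed
  // so far and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, substitution);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution (fox -> dog) and one deletion (jumps) over 5 reference words.
console.log(wordErrorRate('the quick brown fox jumps', 'the quick brown dog')); // 0.4
```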
English-only batch transcription at the lowest possible cost. Perfect for podcasts, recordings, and media archives.
Best accuracy, lowest latency, generous free tier. Purpose-built for real-time production use cases.
Need sentiment, topics, summaries, entity detection on top of transcription? Best Audio Intelligence add-ons.
Global apps needing 100+ languages with enterprise compliance, data residency, and encryption.
At extreme scale, volume discounts make the biggest difference. Best tier pricing for enterprise workloads.
Need AI speed for most content and 99%+ human accuracy for critical transcription through one API.