How transcription works
When a recording is captured (via bot, desktop app, browser extension, or mobile app), the audio is sent to Mavio’s transcription engine. The process involves multiple stages:
Audio preprocessing
The raw audio is normalized, noise-reduced, and segmented into manageable chunks. Silence detection removes dead air to speed up processing.
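To make the preprocessing stage concrete, here is a minimal Python sketch of peak normalization and silence-based chunking. It is an illustration only, not Mavio's implementation: the function names, the frame size, and the energy threshold are all assumptions, and noise reduction is left out.

```python
from typing import List


def normalize_peak(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale samples so the loudest one reaches target_peak (hypothetical default)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def split_on_silence(
    samples: List[float], frame_size: int = 4, threshold: float = 0.05
) -> List[List[float]]:
    """Drop silent frames and group the remaining consecutive frames into chunks.

    A frame counts as silent when its mean absolute amplitude is below
    `threshold`. Both parameters are illustrative values.
    """
    chunks: List[List[float]] = []
    current: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy >= threshold:
            current.extend(frame)
        elif current:
            chunks.append(current)  # silence closes the current chunk
            current = []
    if current:
        chunks.append(current)
    return chunks


# Two bursts of speech separated by dead air become two chunks.
audio = [0.0] * 4 + [0.5, -0.5, 0.25, 0.1] + [0.0] * 4 + [0.2] * 4
chunks = split_on_silence(normalize_peak(audio))
```

Real pipelines typically work on short windows of sampled audio (tens of milliseconds) rather than four-sample frames; the tiny frame size here just keeps the example readable.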
Speech-to-text conversion
Each audio segment is processed by Mavio’s ASR (automatic speech recognition) models. These models are trained on millions of hours of conversational speech and optimized for meeting scenarios.
Punctuation and formatting
AI adds sentence boundaries, punctuation, capitalization, and paragraph breaks to produce natural, readable text.
Speaker diarization
Speaker identification models determine who is speaking at each moment. Segments are labeled with speaker names when known or speaker numbers (Speaker 1, Speaker 2) when not yet identified.
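The labeling rule described above can be sketched in a few lines of Python. This assumes the diarization step has already assigned each segment an opaque voice identifier; the `voice_id` strings and the `known_names` mapping are hypothetical and only illustrate the fallback from names to numbered speakers.

```python
from typing import Dict, List


def label_speakers(voice_ids: List[str], known_names: Dict[str, str]) -> List[str]:
    """Return one display label per segment.

    Known voices get their name; unknown voices get "Speaker N",
    numbered in order of first appearance.
    """
    numbered: Dict[str, str] = {}
    labels: List[str] = []
    for voice_id in voice_ids:
        if voice_id in known_names:
            labels.append(known_names[voice_id])
            continue
        if voice_id not in numbered:
            numbered[voice_id] = f"Speaker {len(numbered) + 1}"
        labels.append(numbered[voice_id])
    return labels


labels = label_speakers(["v1", "v2", "v1", "v3"], {"v2": "Dana"})
```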
Accuracy
Mavio achieves 95%+ word-level accuracy under standard meeting conditions:
| Condition | Typical word-level accuracy |
|---|---|
| Clear audio, one speaker at a time | 97-99% |
| Multiple speakers with minimal crosstalk | 95-97% |
| Background noise (office, cafe) | 90-95% |
| Heavy accents or fast speech | 88-95% |
| Poor audio quality or heavy compression | 80-90% |
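If you want to check accuracy on your own recordings, a common convention is to report word-level accuracy as one minus the word error rate (WER) against a hand-corrected reference. The sketch below assumes that convention; the source does not state how Mavio computes its figures, and the function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


reference = "we will ship the new release on friday next week"
hypothesis = "we will ship a new release on friday next week"
accuracy = word_accuracy(reference, hypothesis)  # one wrong word out of ten
```

This simple version compares lowercase, whitespace-separated words, so punctuation attached to a word counts as a mismatch; strip punctuation first if you only care about the words themselves.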
Supported languages
Mavio supports transcription in 40+ languages including:
Widely supported
English, Spanish, French, German, Portuguese, Italian, Dutch, Japanese, Korean, Mandarin Chinese, Hindi, Arabic
European
Polish, Czech, Slovak, Romanian, Hungarian, Greek, Swedish, Norwegian, Danish, Finnish, Turkish, Ukrainian
Asian & other
Thai, Vietnamese, Indonesian, Malay, Tagalog, Bengali, Tamil, Urdu, Hebrew, Russian, Cantonese
Automatic language detection
Mavio detects the spoken language automatically. If a meeting includes multiple languages, Mavio identifies language switches and transcribes each segment in the appropriate language. To force a specific language, go to Settings > Transcription > Default language and select your preferred language.
Real-time transcription
When using the desktop app or browser extension, you can enable live transcription to see text appear as the meeting progresses:
- Go to Settings > Transcription > Live transcription and toggle it on.
- During a recording, open the Mavio window to see the live transcript feed.
Real-time transcription requires a stable internet connection with at least 1 Mbps upload speed. It is not available in privacy mode, which processes audio on-device after the recording ends.
Custom vocabulary
Improve transcription of domain-specific terms, product names, and jargon:
- Go to Settings > Transcription > Custom vocabulary.
- Add words, phrases, or acronyms that Mavio should recognize (e.g., “Kubernetes”, “OKRs”, “Mavio”).
- Optionally add pronunciation hints for unusual spellings.
Editing transcripts
Click any transcript segment in the meeting viewer to edit the text. Edits are saved immediately and reflected in the summary and search index. Edited segments are marked with a pencil icon so you know what was modified.
Exporting transcripts
Export transcripts from the meeting detail page in these formats:
- TXT — plain text with speaker labels
- SRT / VTT — subtitle format with timestamps
- PDF — formatted document with speaker labels and timestamps
- Markdown — for pasting into wikis, docs, or Notion
- JSON — structured data for programmatic access
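As an example of programmatic use, the following Python sketch converts a list of transcript segments into SRT subtitle text. The segment fields (`start`, `end`, `speaker`, `text`) are assumptions made for illustration; check an actual JSON export for the real field names before relying on them. The SRT layout itself (a counter, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, then the text) follows the standard subtitle format.

```python
from typing import Dict, List, Union

Segment = Dict[str, Union[str, float]]


def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as SRT cues, prefixing each line with its speaker label."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{time_range}\n{seg['speaker']}: {seg['text']}")
    return "\n\n".join(cues) + "\n"


# Hypothetical segments, shaped the way a JSON export might be.
segments = [
    {"start": 0.0, "end": 2.5, "speaker": "Speaker 1", "text": "Hello everyone."},
    {"start": 3.0, "end": 5.25, "speaker": "Dana", "text": "Thanks for joining."},
]
srt_text = to_srt(segments)
```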
Transcription settings
Language detection vs manual language selection
By default, Mavio uses automatic language detection, which identifies the spoken language from the audio and selects the appropriate transcription model. This works well for most meetings, including those where participants switch between languages mid-conversation.
If automatic detection is producing incorrect results (for example, confusing closely related languages like Norwegian and Danish), you can force a specific language:
- Go to Settings > Transcription > Default language.
- Select your preferred language from the list.
- All future recordings will use that language model exclusively.
Punctuation and formatting options
Mavio automatically adds punctuation, capitalization, and paragraph breaks to produce readable transcripts. You can fine-tune formatting behavior in Settings > Transcription > Formatting:
- Paragraph breaks — choose between breaking on speaker changes only, or also inserting breaks on long pauses (default: speaker changes + pauses over 3 seconds).
- Filler word removal — automatically strip “um”, “uh”, “like”, “you know” and similar fillers from the transcript. Disabled by default to preserve the verbatim record.
- Number formatting — choose between spelled-out numbers (“twenty-three”) or numeric format (“23”). Default is numeric for values above ten.
- Timestamp granularity — display timestamps per sentence, per paragraph, or per speaker turn.
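Two of these options, paragraph breaks and filler word removal, are easy to reason about with a small Python sketch. The code below is a simplified illustration and not how Mavio implements them: the segment fields are hypothetical, and the filler pattern covers only "um" and "uh", because words such as "like" need context to tell a filler from a meaningful word.

```python
import re
from typing import Dict, List, Union

Segment = Dict[str, Union[str, float]]

# Matches a standalone "um" or "uh", an optional trailing comma, and whitespace.
FILLER_PATTERN = re.compile(r"\b(?:um|uh)\b,?\s*", re.IGNORECASE)


def strip_fillers(text: str) -> str:
    """Remove simple fillers and collapse any doubled spaces left behind."""
    cleaned = FILLER_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def group_paragraphs(
    segments: List[Segment],
    pause_threshold: float = 3.0,
    break_on_pause: bool = True,
) -> List[List[Segment]]:
    """Start a new paragraph on a speaker change and, optionally,
    when the gap since the previous segment exceeds pause_threshold seconds."""
    paragraphs: List[List[Segment]] = []
    for seg in segments:
        start_new = True
        if paragraphs:
            last = paragraphs[-1][-1]
            speaker_changed = seg["speaker"] != last["speaker"]
            long_pause = break_on_pause and (seg["start"] - last["end"]) > pause_threshold
            start_new = speaker_changed or long_pause
        if start_new:
            paragraphs.append([seg])
        else:
            paragraphs[-1].append(seg)
    return paragraphs


segments = [
    {"speaker": "A", "start": 0.0, "end": 2.0, "text": "Um, I think, uh, we should ship."},
    {"speaker": "A", "start": 2.5, "end": 4.0, "text": "The tests pass."},
    {"speaker": "A", "start": 8.0, "end": 9.0, "text": "One more thing."},   # 4 s pause
    {"speaker": "B", "start": 9.2, "end": 10.0, "text": "Go ahead."},        # new speaker
]
default_paragraphs = group_paragraphs(segments)
speaker_only = group_paragraphs(segments, break_on_pause=False)
```

With the default settings the four segments form three paragraphs; breaking on speaker changes only yields two.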
Custom vocabulary and industry terms
Add domain-specific terms to improve recognition accuracy for specialized vocabulary:
- Go to Settings > Transcription > Custom vocabulary.
- Click Add term and enter the word or phrase exactly as it should appear in the transcript.
- Optionally add a pronunciation hint (phonetic spelling) for unusual words.
- Click Save.
Good candidates for custom vocabulary include:
- Product names (e.g., “Mavio”, “Kubernetes”, “BigQuery”)
- Industry acronyms (e.g., “OKRs”, “KPIs”, “HIPAA”)
- People’s names that are frequently misspelled
- Company-specific jargon
Speaker diarization sensitivity
Speaker diarization is the process of determining “who spoke when.” You can adjust sensitivity in Settings > Transcription > Speaker detection:
- Conservative — requires clear speaker changes with distinct voice characteristics. Fewer false speaker switches, but may merge speakers with similar voices into one label.
- Balanced (default) — good tradeoff between accuracy and sensitivity for most meeting scenarios with 2-8 participants.
- Aggressive — detects subtle speaker changes. Best for meetings with many participants or speakers with similar voices. May occasionally split a single speaker into two labels.
Real-time vs post-processing accuracy comparison
Mavio offers two transcription modes, and their accuracy characteristics differ. The real-time transcript is a preview designed for following along during the meeting. The final post-processed transcript is the authoritative version and is what gets used for summaries, action items, and search indexing. The two modes compare as follows:
| Aspect | Real-time transcription | Post-processing transcription |
|---|---|---|
| Accuracy | 90-95% | 95-99% |
| Latency | Under 2 seconds | 2-5 minutes after recording ends |
| Context awareness | Limited (processes small audio chunks) | Full (analyzes the entire recording) |
| Custom vocabulary | Not applied | Fully applied |
| Speaker labels | Basic (Speaker 1, Speaker 2) | Matched to known voice profiles |
| Punctuation | Basic | Full with paragraph formatting |
| Availability | Desktop app and browser extension only | All recording methods |