Mavio converts spoken words into text using state-of-the-art speech-to-text models. Every recording is automatically transcribed with speaker labels, timestamps, and punctuation — no manual setup required.

How transcription works

When a recording is captured (via bot, desktop app, browser extension, or mobile app), the audio is sent to Mavio’s transcription engine. The process runs in five stages:
1. Audio preprocessing

The raw audio is normalized, noise-reduced, and segmented into manageable chunks. Silence detection removes dead air to speed up processing.

2. Speech-to-text conversion

Each audio segment is processed by Mavio’s ASR (automatic speech recognition) models. These models are trained on millions of hours of conversational speech and optimized for meeting scenarios.

3. Punctuation and formatting

AI adds sentence boundaries, punctuation, capitalization, and paragraph breaks to produce natural, readable text.

4. Speaker diarization

Speaker identification models determine who is speaking at each moment. Segments are labeled with speaker names when known or speaker numbers (Speaker 1, Speaker 2) when not yet identified.

5. Post-processing

Domain-specific terms, proper nouns, and acronyms are refined using context from your organization’s custom vocabulary (if configured).
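
As a rough illustration of the silence detection described in stage 1, the sketch below locates speech regions by frame energy and drops long silent stretches. It is not Mavio's implementation: the function name, the energy threshold, and the frame and pause lengths are all assumptions chosen for the example.

```python
from typing import List, Tuple


def find_speech_regions(
    samples: List[float],
    sample_rate: int,
    frame_ms: int = 20,               # assumed frame size
    energy_threshold: float = 0.01,   # hypothetical RMS threshold
    min_silence_ms: int = 500,        # assumed pause length that ends a region
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of regions that contain speech."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_silent_frames = max(1, min_silence_ms // frame_ms)

    regions: List[Tuple[int, int]] = []
    start = None          # start index of the region currently being built
    last_voiced_end = 0   # end index of the most recent voiced frame
    silent_run = 0        # consecutive silent frames inside the current region

    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= energy_threshold:
            if start is None:
                start = offset
            last_voiced_end = offset + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= max_silent_frames:
                # The pause is long enough: close the region at the last voiced frame.
                regions.append((start, last_voiced_end))
                start = None
                silent_run = 0

    if start is not None:
        regions.append((start, last_voiced_end))
    return regions
```

Short pauses stay inside a region so that sentences are not cut mid-thought. Only silences of at least `min_silence_ms` split the audio.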

Accuracy

Mavio achieves 95%+ word-level accuracy under standard meeting conditions:
Factor | Typical accuracy
Clear audio, one speaker at a time | 97-99%
Multiple speakers with minimal crosstalk | 95-97%
Background noise (office, cafe) | 90-95%
Heavy accents or fast speech | 88-95%
Poor audio quality or heavy compression | 80-90%
You can improve accuracy significantly by using the meeting bot or system audio capture instead of mobile recording in noisy environments. Direct audio streams produce the cleanest input.
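
To make "word-level accuracy" concrete: a common convention is to report it as 1 minus the word error rate (WER), where WER is the number of word substitutions, insertions, and deletions divided by the number of words in a reference transcript. This page does not state how Mavio measures accuracy, so treat the sketch below as an illustration of that general convention, with a hypothetical function name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over the lazy dog today"
accuracy = 1 - word_error_rate(reference, hypothesis)  # one wrong word out of ten
```

Under this convention, 95% accuracy means roughly one word in twenty differs from what was actually said.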

Supported languages

Mavio supports transcription in 40+ languages including:

Widely supported

English, Spanish, French, German, Portuguese, Italian, Dutch, Japanese, Korean, Mandarin Chinese, Hindi, Arabic

European

Polish, Czech, Slovak, Romanian, Hungarian, Greek, Swedish, Norwegian, Danish, Finnish, Turkish, Ukrainian

Asian & other

Thai, Vietnamese, Indonesian, Malay, Tagalog, Bengali, Tamil, Urdu, Hebrew, Russian, Cantonese

Automatic language detection

Mavio detects the spoken language automatically. If a meeting includes multiple languages, Mavio identifies language switches and transcribes each segment in the appropriate language. To force a specific language, go to Settings > Transcription > Default language and select your preferred language.
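
One way to picture language switching is as a sequence of segments, each tagged with a detected language, that are then merged into continuous runs. The sketch below shows only that merging step. The segment shape (start, end, language code) and the function name are assumptions for illustration, not Mavio's data model.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical segment shape: (start_seconds, end_seconds, language_code)
Segment = Tuple[float, float, str]


def language_runs(segments: List[Segment]) -> List[Segment]:
    """Merge consecutive segments that share the same detected language."""
    runs: List[Segment] = []
    for language, group in groupby(segments, key=lambda seg: seg[2]):
        items = list(group)
        runs.append((items[0][0], items[-1][1], language))
    return runs


segments = [(0, 5, "en"), (5, 9, "en"), (9, 14, "es"), (14, 20, "en")]
runs = language_runs(segments)
# Three runs: English, then Spanish, then English again.
```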

Real-time transcription

When using the desktop app or browser extension, you can enable live transcription to see text appear as the meeting progresses:
  1. Go to Settings > Transcription > Live transcription and toggle it on.
  2. During a recording, open the Mavio window to see the live transcript feed.
Live transcription has slightly lower accuracy than the final processed transcript, which benefits from additional context and post-processing.
Real-time transcription requires a stable internet connection with at least 1 Mbps upload speed. It is not available in privacy mode, which processes audio on-device after the recording ends.

Custom vocabulary

Improve transcription of domain-specific terms, product names, and jargon:
  1. Go to Settings > Transcription > Custom vocabulary.
  2. Add words, phrases, or acronyms that Mavio should recognize (e.g., “Kubernetes”, “OKRs”, “Mavio”).
  3. Optionally add pronunciation hints for unusual spellings.
Custom vocabulary is applied during post-processing and can significantly improve accuracy for technical discussions.

Editing transcripts

Click any transcript segment in the meeting viewer to edit the text. Edits are saved immediately and reflected in the summary and search index. Edited segments are marked with a pencil icon so you know what was modified.

Exporting transcripts

Export transcripts from the meeting detail page in these formats:
  • TXT — plain text with speaker labels
  • SRT / VTT — subtitle format with timestamps
  • PDF — formatted document with speaker labels and timestamps
  • Markdown — for pasting into wikis, docs, or Notion
  • JSON — structured data for programmatic access
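
If you want to build subtitles yourself from transcript data, the sketch below converts a list of segments into SRT text. The SRT layout (a counter, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the text, then a blank line) is the standard subtitle format. The segment shape and the "Speaker: text" prefix are assumptions, since this page does not describe the structure of Mavio's exports.

```python
from typing import Iterable, Tuple

# Hypothetical segment shape: (start_seconds, end_seconds, speaker, text)
Segment = Tuple[float, float, str, str]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    return "\n".join(cues)


print(to_srt([
    (0.0, 2.25, "Speaker 1", "Hello everyone."),
    (2.25, 5.0, "Speaker 2", "Hi."),
]))
```

WebVTT is similar but begins with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.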

Transcription settings

By default, Mavio uses automatic language detection, which identifies the spoken language from the audio and selects the appropriate transcription model. This works well for most meetings, including those where participants switch between languages mid-conversation.
If automatic detection is producing incorrect results (for example, confusing closely related languages like Norwegian and Danish), you can force a specific language:
  1. Go to Settings > Transcription > Default language.
  2. Select your preferred language from the list.
  3. All future recordings will use that language model exclusively.
You can also set the language for a single recording by clicking the language selector on the recording detail page before transcription begins.
Leave automatic detection enabled if your meetings regularly include multiple languages. Manual selection is best when all your meetings are consistently in one language and you want to avoid occasional misdetection.
Mavio automatically adds punctuation, capitalization, and paragraph breaks to produce readable transcripts. You can fine-tune formatting behavior in Settings > Transcription > Formatting:
  • Paragraph breaks — choose between breaking on speaker changes only, or also inserting breaks on long pauses (default: speaker changes + pauses over 3 seconds).
  • Filler word removal — automatically strip “um”, “uh”, “like”, “you know” and similar fillers from the transcript. Disabled by default to preserve the verbatim record.
  • Number formatting — choose between spelled-out numbers (“twenty-three”) or numeric format (“23”). Default is numeric for values above ten.
  • Timestamp granularity — display timestamps per sentence, per paragraph, or per speaker turn.
Formatting changes apply to new recordings. To reformat an existing transcript, click Reprocess on the recording detail page.
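
The default paragraph rule (break on speaker changes and on pauses over 3 seconds) can be sketched as follows. This is an illustration of the rule as described above, not Mavio's code. The segment shape and the function and parameter names are assumptions.

```python
from typing import List, Tuple

# Hypothetical segment shape: (start_seconds, end_seconds, speaker, text)
Segment = Tuple[float, float, str, str]


def group_paragraphs(
    segments: List[Segment],
    pause_threshold: float = 3.0,   # default from the setting described above
    break_on_pauses: bool = True,   # False = break on speaker changes only
) -> List[Tuple[str, str]]:
    """Return (speaker, paragraph_text) pairs."""
    paragraphs: List[Tuple[str, str]] = []
    prev = None
    for seg in segments:
        start, _end, speaker, text = seg
        starts_new = (
            prev is None
            or speaker != prev[2]
            or (break_on_pauses and start - prev[1] > pause_threshold)
        )
        if starts_new:
            paragraphs.append((speaker, text))
        else:
            last_speaker, last_text = paragraphs[-1]
            paragraphs[-1] = (last_speaker, last_text + " " + text)
        prev = seg
    return paragraphs


segments = [
    (0.0, 2.0, "A", "Hi."),
    (2.5, 4.0, "A", "Welcome."),
    (8.0, 9.0, "A", "Let's start."),  # 4-second pause: new paragraph
    (9.2, 10.0, "B", "Sure."),        # speaker change: new paragraph
]
paragraphs = group_paragraphs(segments)
```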
Add domain-specific terms to improve recognition accuracy for specialized vocabulary:
  1. Go to Settings > Transcription > Custom vocabulary.
  2. Click Add term and enter the word or phrase exactly as it should appear in the transcript.
  3. Optionally add a pronunciation hint (phonetic spelling) for unusual words.
  4. Click Save.
Examples of terms worth adding:
  • Product names (e.g., “Mavio”, “Kubernetes”, “BigQuery”)
  • Industry acronyms (e.g., “OKRs”, “KPIs”, “HIPAA”)
  • People’s names that are frequently misspelled
  • Company-specific jargon
Custom vocabulary is applied during the post-processing stage and can be added or updated at any time. Changes apply to all future recordings. To apply custom vocabulary to an existing transcript, use the Reprocess button.
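
A simplified way to think about vocabulary-based post-processing is as a pass that rewrites recognized terms to the exact form you entered. The sketch below does only that (case-insensitive, whole-word matching) and is an assumption about the general idea, not a description of Mavio's engine, which may also use pronunciation hints and surrounding context. The function name is hypothetical.

```python
import re
from typing import Iterable


def apply_vocabulary(text: str, terms: Iterable[str]) -> str:
    """Rewrite case-insensitive whole-word matches to their canonical spelling."""
    canonical = {term.lower(): term for term in terms}
    if not canonical:
        return text
    # Try longer terms first so multi-word phrases win over their sub-terms.
    alternatives = sorted(canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: canonical.get(m.group(0).lower(), m.group(0)),
        text,
    )


raw = "we deploy on kubernetes and track okrs in bigquery"
fixed = apply_vocabulary(raw, ["Kubernetes", "OKRs", "BigQuery"])
```

This is why step 2 asks you to enter each term exactly as it should appear in the transcript: the entered spelling is the form that ends up in the text.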
Speaker diarization is the process of determining “who spoke when.” You can adjust sensitivity in Settings > Transcription > Speaker detection:
  • Conservative — requires clear speaker changes with distinct voice characteristics. Fewer false speaker switches, but may merge speakers with similar voices into one label.
  • Balanced (default) — good tradeoff between accuracy and sensitivity for most meeting scenarios with 2-8 participants.
  • Aggressive — detects subtle speaker changes. Best for meetings with many participants or speakers with similar voices. May occasionally split a single speaker into two labels.
Regardless of the setting, you can always correct speaker labels manually on the transcript page. Corrections improve future speaker matching when voice profiles are enabled.
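
To see why a more sensitive setting produces more speaker switches, consider a toy model in which each stretch of audio is summarized as a voice embedding and a speaker change is flagged whenever consecutive embeddings are not similar enough. Everything in this sketch is hypothetical, including the threshold values and function names. It illustrates the tradeoff only and does not describe Mavio's diarization models.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical similarity thresholds: a higher threshold flags subtler changes.
CHANGE_THRESHOLDS: Dict[str, float] = {
    "conservative": 0.5,
    "balanced": 0.7,
    "aggressive": 0.9,
}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def speaker_change_points(
    embeddings: List[Sequence[float]],
    sensitivity: str = "balanced",
) -> List[int]:
    """Return indices where the voice differs enough from the previous window."""
    threshold = CHANGE_THRESHOLDS[sensitivity]
    return [
        i
        for i in range(1, len(embeddings))
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold
    ]


windows = [(1.0, 0.0), (1.0, 0.1), (0.8, 0.6), (0.0, 1.0)]
changes = {name: speaker_change_points(windows, name) for name in CHANGE_THRESHOLDS}
```

With these toy values, the conservative setting reports no changes, balanced reports one, and aggressive reports two, which mirrors the merge-versus-split tradeoff described above.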
Mavio offers two transcription modes, and their accuracy characteristics differ:
Aspect | Real-time transcription | Post-processing transcription
Accuracy | 90-95% | 95-99%
Latency | Under 2 seconds | 2-5 minutes after recording ends
Context awareness | Limited (processes small audio chunks) | Full (analyzes the entire recording)
Custom vocabulary | Not applied | Fully applied
Speaker labels | Basic (Speaker 1, Speaker 2) | Matched to known voice profiles
Punctuation | Basic | Full with paragraph formatting
Availability | Desktop app and browser extension only | All recording methods
The real-time transcript is a preview designed for following along during the meeting. The final post-processed transcript is the authoritative version and is what gets used for summaries, action items, and search indexing.