Transcription

Mavio converts spoken words into text using state-of-the-art speech-to-text models. Every recording is automatically transcribed with speaker labels, timestamps, and punctuation — no manual setup required.

How transcription works

When a recording is captured (via bot, desktop app, browser extension, or mobile app), the audio is sent to Mavio’s transcription engine. The audio then passes through five stages:

Audio preprocessing

The raw audio is normalized, noise-reduced, and segmented into manageable chunks. Silence detection removes dead air to speed up processing.
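As a rough illustration of this stage, the sketch below shows peak normalization and a simple energy-based silence filter on a list of audio samples. It is not Mavio's implementation: the function names, frame size, and threshold are hypothetical values chosen for the example, and noise reduction is left out.

```python
from typing import List


def normalize_peak(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale samples so the loudest one reaches target_peak (illustrative value)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Pure silence: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def drop_silence(samples: List[float], frame_size: int = 4,
                 threshold: float = 0.05) -> List[float]:
    """Keep only frames whose mean absolute amplitude exceeds threshold.

    frame_size and threshold are assumptions for this sketch, not Mavio settings.
    """
    kept: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy > threshold:
            kept.extend(frame)
    return kept
```

A production pipeline would typically work on much longer frames (tens of milliseconds of audio) and keep track of the removed time so that timestamps still line up with the original recording.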

Speech-to-text conversion

Each audio segment is processed by Mavio’s ASR (automatic speech recognition) models. These models are trained on millions of hours of conversational speech and optimized for meeting scenarios.

Punctuation and formatting

AI adds sentence boundaries, punctuation, capitalization, and paragraph breaks to produce natural, readable text.

Speaker diarization

Speaker identification models determine who is speaking at each moment. Segments are labeled with speaker names when known or speaker numbers (Speaker 1, Speaker 2) when not yet identified.
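The labeling rule described above (a name when the speaker is known, a numbered placeholder otherwise) can be sketched as follows. The voice IDs and the `known_names` mapping are hypothetical stand-ins for whatever the diarization model produces; this only illustrates the fallback logic.

```python
from typing import Dict, List


def label_speakers(segment_voice_ids: List[str],
                   known_names: Dict[str, str]) -> List[str]:
    """Return one label per segment: a known name, or 'Speaker N' otherwise."""
    labels: List[str] = []
    placeholders: Dict[str, str] = {}  # unknown voice ID -> "Speaker N"
    for voice_id in segment_voice_ids:
        if voice_id in known_names:
            labels.append(known_names[voice_id])
        else:
            if voice_id not in placeholders:
                # Number unknown speakers in order of first appearance.
                placeholders[voice_id] = f"Speaker {len(placeholders) + 1}"
            labels.append(placeholders[voice_id])
    return labels
```

For example, `label_speakers(["a", "b", "a", "c"], {"b": "Dana"})` yields `["Speaker 1", "Dana", "Speaker 1", "Speaker 2"]`.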

Post-processing

Domain-specific terms, proper nouns, and acronyms are refined using context from your organization’s custom vocabulary (if configured).

Accuracy

Mavio achieves 95%+ word-level accuracy under standard meeting conditions:

Recording conditions	Typical accuracy
Clear audio, one speaker at a time	97-99%
Multiple speakers with minimal crosstalk	95-97%
Background noise (office, cafe)	90-95%
Heavy accents or fast speech	88-95%
Poor audio quality or heavy compression	80-90%

In noisy environments, you can improve accuracy significantly by using the meeting bot or system audio capture instead of mobile recording. Direct audio streams produce the cleanest input.
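This page does not say how the accuracy figures above are measured. One common convention is to report word-level accuracy as 1 minus the word error rate (WER), where WER is the word-level edit distance between a reference transcript and the machine transcript, divided by the number of reference words. The sketch below assumes that convention; the function name is hypothetical, and you could use it to spot-check a transcript against a hand-corrected sample.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, using word-level Levenshtein distance (assumed metric)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr

    wer = prev[-1] / len(ref)
    # WER can exceed 1.0 when the hypothesis has many extra words; clamp at 0.
    return max(0.0, 1.0 - wer)
```

With a five-word reference and one substituted word, the function returns an accuracy of 0.8.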

Supported languages

Mavio supports transcription in 40+ languages, including:

Widely supported

English, Spanish, French, German, Portuguese, Italian, Dutch, Japanese, Korean, Mandarin Chinese, Hindi, Arabic

European

Polish, Czech, Slovak, Romanian, Hungarian, Greek, Swedish, Norwegian, Danish, Finnish, Turkish, Ukrainian

Asian & other

Thai, Vietnamese, Indonesian, Malay, Tagalog, Bengali, Tamil, Urdu, Hebrew, Russian, Cantonese

Automatic language detection

Mavio detects the spoken language automatically. If a meeting includes multiple languages, Mavio identifies language switches and transcribes each segment in the appropriate language. To force a specific language, go to Settings > Transcription > Default language and select your preferred language.

Real-time transcription

When using the desktop app or browser extension, you can enable live transcription to see text appear as the meeting progresses:

Go to Settings > Transcription > Live transcription and toggle it on.
During a recording, open the Mavio window to see the live transcript feed.

Live transcription has slightly lower accuracy than the final processed transcript, which benefits from additional context and post-processing.

Real-time transcription requires a stable internet connection with at least 1 Mbps upload speed. It is not available in privacy mode, which processes audio on-device after the recording ends.

Custom vocabulary

Improve transcription of domain-specific terms, product names, and jargon:

Go to Settings > Transcription > Custom vocabulary.
Add words, phrases, or acronyms that Mavio should recognize (e.g., “Kubernetes”, “OKRs”, “Mavio”).
Optionally add pronunciation hints for unusual spellings.

Custom vocabulary is applied during post-processing and can significantly improve accuracy for technical discussions.

Editing transcripts

Click any transcript segment in the meeting viewer to edit the text. Edits are saved immediately and reflected in the summary and search index. Edited segments are marked with a pencil icon so you know what was modified.

Exporting transcripts

Export transcripts from the meeting detail page in these formats:

TXT — plain text with speaker labels
SRT / VTT — subtitle format with timestamps
PDF — formatted document with speaker labels and timestamps
Markdown — for pasting into wikis, docs, or Notion
JSON — structured data for programmatic access
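If you work with the structured export programmatically, converting segments to a subtitle format is a typical task. The sketch below builds SRT text from a list of segments. The segment fields (`start`, `end`, `speaker`, `text`) are assumptions made for this example and may not match the actual JSON export schema; the SRT layout itself (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the standard one.

```python
from typing import Dict, List, Union

Segment = Dict[str, Union[float, str]]  # hypothetical shape, not the real export schema


def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as SRT cues, prefixing each cue with its speaker label."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}"
        )
    return "\n\n".join(cues) + "\n"
```

In practice the built-in SRT / VTT export is the simpler route; a script like this is mainly useful when you need custom cue text or want to merge segments first.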

Transcription settings

Language detection vs manual language selection

By default, Mavio uses automatic language detection, which identifies the spoken language from the audio and selects the appropriate transcription model. This works well for most meetings, including those where participants switch between languages mid-conversation.

If automatic detection is producing incorrect results (for example, confusing closely related languages like Norwegian and Danish), you can force a specific language:

Go to Settings > Transcription > Default language.
Select your preferred language from the list.
All future recordings will use that language model exclusively.

You can also set the language per-recording by clicking the language selector on the recording detail page before transcription begins.

Leave automatic detection enabled if your meetings regularly include multiple languages. Manual selection is best when all your meetings are consistently in one language and you want to avoid occasional misdetection.

Punctuation and formatting options

Mavio automatically adds punctuation, capitalization, and paragraph breaks to produce readable transcripts. You can fine-tune formatting behavior in Settings > Transcription > Formatting:

Paragraph breaks — choose between breaking on speaker changes only, or also inserting breaks on long pauses (default: speaker changes + pauses over 3 seconds).
Filler word removal — automatically strip “um”, “uh”, “like”, “you know” and similar fillers from the transcript. Disabled by default to preserve the verbatim record.
Number formatting — choose between spelled-out numbers (“twenty-three”) or numeric format (“23”). Default is numeric for values above ten.
Timestamp granularity — display timestamps per sentence, per paragraph, or per speaker turn.

Formatting changes apply to new recordings. To reformat an existing transcript, click Reprocess on the recording detail page.
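To make the default paragraph-break rule concrete (break on speaker changes and on pauses over 3 seconds), here is a minimal sketch. The segment fields and the function name are hypothetical; only the rule itself comes from the setting described above.

```python
from typing import Dict, List, Union

Segment = Dict[str, Union[float, str]]  # hypothetical: speaker, start, end, text


def split_paragraphs(segments: List[Segment],
                     pause_threshold: float = 3.0) -> List[List[Segment]]:
    """Group consecutive segments into paragraphs.

    A new paragraph starts when the speaker changes or when the gap since
    the previous segment is longer than pause_threshold seconds.
    """
    paragraphs: List[List[Segment]] = []
    current: List[Segment] = []
    for seg in segments:
        if current:
            prev = current[-1]
            speaker_changed = seg["speaker"] != prev["speaker"]
            long_pause = seg["start"] - prev["end"] > pause_threshold
            if speaker_changed or long_pause:
                paragraphs.append(current)
                current = []
        current.append(seg)
    if current:
        paragraphs.append(current)
    return paragraphs
```

Passing a very large `pause_threshold` (for example `float("inf")`) reproduces the "speaker changes only" option.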

Custom vocabulary and industry terms

Add domain-specific terms to improve recognition accuracy for specialized vocabulary:

Go to Settings > Transcription > Custom vocabulary.
Click Add term and enter the word or phrase exactly as it should appear in the transcript.
Optionally add a pronunciation hint (phonetic spelling) for unusual words.
Click Save.

Examples of terms worth adding:

Product names (e.g., “Mavio”, “Kubernetes”, “BigQuery”)
Industry acronyms (e.g., “OKRs”, “KPIs”, “HIPAA”)
People’s names that are frequently misspelled
Company-specific jargon

Custom vocabulary is applied during the post-processing stage and can be added or updated at any time. Changes apply to all future recordings. To apply custom vocabulary to an existing transcript, use the Reprocess button.
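As a simplified picture of what a vocabulary pass can do, the sketch below rewrites case-insensitive matches of each term to the exact spelling you entered. This is an assumption-laden illustration, not a description of Mavio's post-processing, which presumably also uses audio and context to recover words that were misheard rather than just miscased. The function name is hypothetical.

```python
import re
from typing import List


def apply_vocabulary(text: str, terms: List[str]) -> str:
    """Replace case-insensitive whole-word matches with the canonical spelling."""
    # Process longer terms first so a phrase wins over a shorter term inside it.
    for term in sorted(terms, key=len, reverse=True):
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)",  # whole-word match only
            re.IGNORECASE,
        )
        # A function replacement avoids treating backslashes in the term as escapes.
        text = pattern.sub(lambda match, canonical=term: canonical, text)
    return text
```

For example, `apply_vocabulary("we deploy on kubernetes and track okrs", ["Kubernetes", "OKRs"])` returns `"we deploy on Kubernetes and track OKRs"`.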

Speaker diarization sensitivity

Speaker diarization is the process of determining “who spoke when.” You can adjust sensitivity in Settings > Transcription > Speaker detection:

Conservative — requires clear speaker changes with distinct voice characteristics. Fewer false speaker switches, but may merge speakers with similar voices into one label.
Balanced (default) — good tradeoff between accuracy and sensitivity for most meeting scenarios with 2-8 participants.
Aggressive — detects subtle speaker changes. Best for meetings with many participants or speakers with similar voices. May occasionally split a single speaker into two labels.

Regardless of the setting, you can always correct speaker labels manually on the transcript page. Corrections improve future speaker matching when voice profiles are enabled.

Real-time vs post-processing accuracy comparison

Mavio offers two transcription modes, and their accuracy characteristics differ:

Aspect	Real-time transcription	Post-processing transcription
Accuracy	90-95%	95-99%
Latency	Under 2 seconds	2-5 minutes after recording ends
Context awareness	Limited (processes small audio chunks)	Full (analyzes the entire recording)
Custom vocabulary	Not applied	Fully applied
Speaker labels	Basic (Speaker 1, Speaker 2)	Matched to known voice profiles
Punctuation	Basic	Full with paragraph formatting
Availability	Desktop app and browser extension only	All recording methods

The real-time transcript is a preview designed for following along during the meeting. The final post-processed transcript is the authoritative version and is what gets used for summaries, action items, and search indexing.
