Automatic speaker identification for any audio or video file.

Vook detects every voice in your recording, labels each speaker, and adds timestamps throughout the transcript. Up to 99% accuracy, processed in under a minute per hour of audio, hosted in the EU.

Audio transcribed in under a minute with over 98% accuracy — New York Times

Speaker identification

MP3, WAV, MP4, M4A, MOV, OGG, +14 more

Trusted by over 75,000 people worldwide

99% accuracy

1 free transcription per day

With or without a plan

Accuracy on clear audio: 99%
Per hour of audio: < 1 min
Languages supported: 100+
Professionals trust Vook.ai: 75k+

How it works

Three steps to a labeled transcript

No software to install, no forms to fill out. Drop your file and we'll handle the rest.

Upload your file

Drag and drop your file or pick it from your computer. Files up to 6 GB are accepted, and nothing needs to be installed.

Vook.ai transcribes in minutes

Vook.ai detects speakers, adds timestamps, and produces a clean, punctuated transcript. Processing typically takes under one minute per hour of audio.

Edit, export, ask

Review in our editor, export to PDF, DOCX, MD, SRT or HTML, and ask the chat to summarize, extract quotes, or pull themes.

Why Vook

The transcription AI that doesn't read your data.

European sovereignty isn't a feature, it's the foundation. Your files stay yours: encrypted, EU-hosted, and never used for training.

Hosted in the EU

Your files stay on French infrastructure and never cross the Atlantic. GDPR-native, no Cloud Act exposure.

AES-256 encryption

Encrypted at rest with AES-256. Only you can access your transcripts.

Never used for training

Your audio and transcripts are never used for training, never resold, never analyzed for ads.

GDPR-native

Built from day one for European compliance. DPA on request, full audit trail, your right to deletion respected.

Formats

Works with every major audio and video format

Vook.ai reads every common audio and video format, and exports to whatever your workflow needs.

We built speaker identification into Vook because a transcript without context is just noise. Knowing who said what changes everything.

Vook.ai engineering team

Input formats

.mp3: Most common
.wav: Lossless
.mp4: Video audio
.m4a: Apple devices
.mov: QuickTime
.ogg: Open source
.mpga: MPEG audio
.mpeg: MPEG audio
.opus: Low-bitrate
.flac: Studio quality
.aac: Streaming
.webm: Web recordings
.wma: Windows
.avi: Video
.mts: AVCHD video
.m4v: Apple video
.mkv: Matroska video
.wmv: Windows video
.flv: Flash video
.3gp: Mobile video

Export to

.pdf: Print-ready
.docx: Word document
.md: Markdown
.srt: Subtitles
.html: Web page

For your profession

Made for people who work with words.

From newsrooms to research labs, anyone working with multi-speaker recordings gets more value from a labeled transcript.

Interview transcription for journalists and newsrooms

Interview transcription, without typing a line

“Every speaker identified”
“Quotes ready to extract”
“Accurate transcripts in minutes”

Learn more

Guide

Speaker identification: everything you need to know

What is speaker identification?

Speaker identification is the process of detecting and labeling each distinct voice in an audio recording. The output is a transcript where every line is attributed to a specific speaker, often combined with timestamps so you know exactly when each person spoke.

This is also called speaker diarization. It answers the question "who spoke when?" rather than just "what was said?" For any recording with more than one person, a diarized transcript is far more useful than a plain text block.

How AI diarization works

Modern diarization systems analyze the acoustic properties of a recording to segment it into regions where a single speaker is active. Each segment is then grouped with other segments that share the same voice characteristics. The result is a set of speaker clusters, each assigned a label.

Voice activity detection: the system first identifies which parts of the audio contain speech versus silence or background noise.
Speaker segmentation: the audio is split at points where the speaker changes.
Speaker clustering: segments from the same voice are grouped together under one label.
Transcript alignment: the speech-to-text output is matched to the speaker segments, producing a labeled transcript.
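The four steps above can be sketched on toy data. This is an illustration only, not Vook's implementation: the `Frame` type, the single-number `feature` (a stand-in for the high-dimensional voice embeddings real systems use), and every threshold are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    start: float    # seconds
    end: float      # seconds
    energy: float   # toy loudness value
    feature: float  # toy 1-D stand-in for a voice embedding

def detect_speech(frames, energy_threshold=0.1):
    """Step 1, voice activity detection: keep frames loud enough to be speech."""
    return [f for f in frames if f.energy >= energy_threshold]

def segment(frames, max_jump=20.0, max_gap=0.5):
    """Step 2, segmentation: start a new segment when the voice feature jumps
    or when there is a pause between consecutive speech frames."""
    segments = []
    for f in frames:
        if segments:
            prev = segments[-1][-1]
            if abs(f.feature - prev.feature) <= max_jump and f.start - prev.end <= max_gap:
                segments[-1].append(f)
                continue
        segments.append([f])
    return segments

def cluster(segments, max_distance=20.0):
    """Step 3, clustering: greedily assign each segment to the nearest
    existing speaker centroid, or open a new speaker if none is close."""
    centroids, counts, labeled = [], [], []
    for seg in segments:
        mean = sum(f.feature for f in seg) / len(seg)
        best, best_dist = None, max_distance
        for i, c in enumerate(centroids):
            d = abs(mean - c)
            if d <= best_dist:
                best, best_dist = i, d
        if best is None:
            centroids.append(mean)
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the matched speaker.
            centroids[best] = (centroids[best] * counts[best] + mean) / (counts[best] + 1)
            counts[best] += 1
        labeled.append((seg[0].start, seg[-1].end, f"Speaker {best + 1}"))
    return labeled

def align(words, labeled_segments):
    """Step 4, alignment: give each word the label of the segment
    that contains the word's midpoint."""
    result = []
    for text, w_start, w_end in words:
        mid = (w_start + w_end) / 2
        speaker = next((s for a, b, s in labeled_segments if a <= mid <= b), "Unknown")
        result.append((speaker, text))
    return result

frames = [
    Frame(0.0, 1.0, 0.8, 120.0),
    Frame(1.0, 2.0, 0.7, 125.0),
    Frame(2.0, 3.0, 0.01, 0.0),   # silence, dropped by detect_speech
    Frame(3.0, 4.0, 0.9, 210.0),
    Frame(4.0, 5.0, 0.8, 118.0),
]
words = [("Hello", 0.2, 0.8), ("there", 1.1, 1.7), ("Hi", 3.2, 3.6), ("Welcome", 4.1, 4.9)]

labeled = cluster(segment(detect_speech(frames)))
transcript = align(words, labeled)
```

With this input, the first and last stretches of speech end up under the same label because their features are close, even though another voice speaks in between.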

Speaker identification vs. speaker verification

These two terms are often confused. Speaker identification asks "which of these known or unknown voices is speaking?" without requiring a pre-registered voiceprint. Speaker verification asks "is this person who they claim to be?" and compares against a stored reference.

Vook performs speaker identification and diarization: it detects and labels voices without needing any prior enrollment. This is the right tool for transcription, journalism, research, and content production. Speaker verification is a separate biometric use case.
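The difference can be shown with a small sketch. It assumes each voice sample has already been reduced to a numeric vector; the vectors, the `0.9` threshold, and the function names `verify` and `label_voice` are hypothetical and do not describe Vook's internals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def verify(sample, enrolled_voiceprint, threshold=0.9):
    """Verification: compare a sample against ONE stored reference
    and answer yes or no."""
    return cosine(sample, enrolled_voiceprint) >= threshold

def label_voice(sample, seen_voices, threshold=0.9):
    """Labeling without enrollment: reuse the closest voice heard so far
    in this recording, or create a new anonymous label."""
    best_label, best_score = None, threshold
    for label, vec in seen_voices.items():
        score = cosine(sample, vec)
        if score >= best_score:
            best_label, best_score = label, score
    if best_label is None:
        best_label = f"Speaker {len(seen_voices) + 1}"
        seen_voices[best_label] = sample
    return best_label

seen = {}
labels = [
    label_voice([1.0, 0.0], seen),    # first voice heard
    label_voice([0.0, 1.0], seen),    # clearly different voice
    label_voice([0.99, 0.05], seen),  # close to the first voice
]
```

Verification needs a reference recorded in advance, while labeling starts from an empty `seen` dictionary and only ever produces anonymous labels such as Speaker 1.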

What affects accuracy

Vook reaches up to 99% accuracy on clear audio in supported languages. Speaker separation quality depends on several factors:

Audio clarity: recordings made in quiet environments with a good microphone produce the best results.
Voice overlap: simultaneous speech makes it harder to separate speakers cleanly. The built-in editor lets you correct any misattributions.
Number of speakers: files with two or three distinct voices are easier to diarize than large group recordings.
Recording quality: low-bitrate phone calls or heavily compressed files reduce accuracy. WAV or FLAC files give the best results.

How to get the best results with Vook

A few simple steps before uploading can significantly improve speaker identification quality. Record in a quiet space, use a dedicated microphone rather than a built-in laptop mic, and avoid having multiple people speak at the same time.

Choose a high-quality format: WAV or FLAC preserves audio detail that helps the diarization engine separate voices accurately.
Use the built-in editor: after processing, rename Speaker 1, Speaker 2, etc. to real names, merge any incorrectly split segments, and mask sensitive information before exporting.
Try Vook Chat: once the transcript is labeled, use Vook Chat to extract quotes per speaker, summarize each person's contributions, or identify key themes across the conversation.
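For readers who post-process an exported transcript themselves, the rename-and-merge cleanup described above might look like the sketch below. The segment dictionaries and the helper names are hypothetical; they are not an export format or API documented on this page.

```python
def rename_speakers(segments, names):
    """Replace generic labels with real names; unknown labels are kept as-is."""
    return [{**s, "speaker": names.get(s["speaker"], s["speaker"])} for s in segments]

def merge_adjacent(segments):
    """Join consecutive segments that now belong to the same speaker."""
    merged = []
    for s in segments:
        if merged and merged[-1]["speaker"] == s["speaker"]:
            last = merged[-1]
            merged[-1] = {**last, "end": s["end"], "text": last["text"] + " " + s["text"]}
        else:
            merged.append(dict(s))
    return merged

segments = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 2.0, "text": "Thanks for joining."},
    {"speaker": "Speaker 3", "start": 2.0, "end": 4.0, "text": "Let's begin."},
    {"speaker": "Speaker 2", "start": 4.0, "end": 6.0, "text": "Happy to be here."},
]

# Speaker 1 and Speaker 3 turned out to be the same person.
names = {"Speaker 1": "Alice", "Speaker 3": "Alice", "Speaker 2": "Bob"}
cleaned = merge_adjacent(rename_speakers(segments, names))
```

Mapping two generic labels to the same name and then merging neighbours repairs a voice that was split into two speakers by mistake.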

Privacy and compliance considerations

Speaker identification involves processing voice data, which can qualify as biometric data under GDPR when it is processed to uniquely identify a person. Choosing a processor that is GDPR-native and hosted in the EU is essential for organizations subject to European data protection law.

Vook stores all files on servers in France and encrypts data at rest with AES-256. Your recordings are never used to train AI models and are never shared with third parties. A Data Processing Agreement is available on request for organizations that need formal documentation.

FAQ

Frequently Asked Questions

Have a different question and can’t find the answer you’re looking for? Contact us.

How does Vook's speaker identification work?

Vook uses AI-powered diarization to detect and label each distinct voice in your audio or video file. Every speaker gets a unique label (Speaker 1, Speaker 2, etc.) with timestamps, so you can follow who said what throughout the transcript.

How many speakers can Vook identify in a single file?

Vook can identify multiple speakers in a single recording. The diarization engine separates voices automatically, and you can merge or rename speaker labels in the built-in editor after processing.

Is speaker identification free?

Yes. You get one free transcription with speaker identification per day, with no sign-up and no credit card required. Paid plans unlock saved files and additional features.

What file formats are supported for speaker identification?

Vook accepts 20 audio and video formats, including MP3, WAV, MP4, M4A, MOV, and OGG, up to 6 GB per file. Both audio and video files are supported.
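A quick local check before uploading can catch an unsupported extension or an oversized file. The sketch below uses the 20 extensions listed on this page; the helper name `check_upload` is hypothetical, and reading "6 GB" as 6 × 1024³ bytes is an assumption, since the page does not say which definition of a gigabyte applies.

```python
from pathlib import PurePath

# Extensions listed on this page as accepted inputs.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".mp4", ".m4a", ".mov", ".ogg", ".mpga", ".mpeg",
    ".opus", ".flac", ".aac", ".webm", ".wma", ".avi", ".mts", ".m4v",
    ".mkv", ".wmv", ".flv", ".3gp",
}

# Assumption: "6 GB" is taken to mean 6 * 1024**3 bytes.
MAX_SIZE_BYTES = 6 * 1024**3

def check_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = PurePath(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than 6 GB")
    return problems
```

For example, `check_upload("interview.MP3", 50_000_000)` returns an empty list because the extension check is case-insensitive and the file is well under the limit.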

How accurate is the speaker identification?

Vook reaches up to 99% accuracy on clear audio in supported languages. Speaker separation works best when voices are distinct and there is minimal overlap. The built-in editor lets you correct any labeling errors quickly.

Where is my audio stored and for how long?

Your files are stored on EU servers in France. Vook never uses your audio to train AI models and never sells your data.

Can I export the transcript with speaker labels?

Yes. Vook exports transcripts as PDF, DOCX, MD, SRT, and HTML. All formats preserve speaker labels and timestamps, so the structure of the conversation is clear in every file.
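As an illustration of how speaker labels and timestamps can coexist in a subtitle file, here is a minimal sketch that writes labeled segments in the SubRip (SRT) layout. Prefixing each cue with `Speaker N:` is an assumption made for the example; the exact layout of Vook's own SRT export is not described on this page.

```python
def srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(segments):
    """Build SRT text from (start, end, speaker, text) tuples.
    Each cue is: index, time range, text, then a blank line."""
    blocks = []
    for index, (start, end, speaker, text) in enumerate(segments, 1):
        blocks.append(
            f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{speaker}: {text}\n"
        )
    return "\n".join(blocks)

srt_text = to_srt([
    (0.0, 2.5, "Speaker 1", "Hello."),
    (2.5, 4.0, "Speaker 2", "Hi."),
])
```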

Free plan

Get 1 free transcript per day. Upgrade for unlimited power.

Subscribe now, cancel anytime

Get 4 months free with annual plans

API plan

Integrate Vook.ai into your stack

Custom pricing and features

Explore

Dedicated API access
Custom-built features
Centralized billing

Credits never expire

10h pass - no subscription

Use these hours whenever you want; they never expire.

per hour

Buy hours

Know exactly who said what, in minutes

Free for occasional use. No credit card. One file per day, every day, forever.

Try now

Related conversion tools

Timestamped transcription, Transcribe long audio, Private transcription, MP3 to text, MP4 to text, WAV to text