Microsoft MAI-Transcribe 1.5 Review 2026: STT Pricing &

AIUnpacker Editorial

Jun 5, 2026 · Updated Jun 5, 2026 · 14 min read

Key Takeaways

Microsoft's new MAI-Transcribe 1.5 promises blazing-fast speech-to-text on Azure. I checked the pricing, accuracy, and features against Whisper and other top STT models.

Microsoft just dropped MAI-Transcribe 1.5 - a new speech-to-text model built by the company’s Superintelligence team and delivered through the Azure LLM Speech API. It’s fast. It’s multilingual. And it comes from a team we don’t usually hear from.

I’ve been testing it alongside Whisper, AssemblyAI, and the standard Azure STT offering. Here’s everything you need to know: what it does well, where it falls short, how much it costs, and whether it deserves a spot in your production stack.

What Is MAI-Transcribe 1.5?

MAI-Transcribe 1.5 is a multimodal speech recognition model developed by the Microsoft AI (MAI) Superintelligence team and hosted on Azure AI Speech (now called Azure Speech in Foundry Tools). It’s the successor to the original MAI-Transcribe-1 and represents a significant expansion - jumping from roughly 27 supported languages in v1 to 47 languages in v1.5.

The model sits inside the LLM Speech API, which means it rides on the same GPU-accelerated inference infrastructure that powers Microsoft’s LLM-based transcription. You access it through the same /speechtotext/transcriptions:transcribe endpoint you’d use for fast transcription, but with an enhancedMode flag and the model set to mai-transcribe-1.5.

It’s currently in public preview - so expect changes, and don’t bet your production workloads on it just yet.

Key Features at a Glance

Here’s what ships with MAI-Transcribe 1.5:

47-language multilingual transcription (up from ~27 in v1), including English, Spanish, French, German, Japanese, Korean, Hindi, Arabic, Tamil, Telugu, and many more
Segment-level timestamps - you get precise offsetMilliseconds and durationMilliseconds per phrase
Profanity filtering - mask or remove profanity natively
Phrase list support (new in v1.5) - entity biasing for domain-specific terminology like product names, acronyms, or proper nouns
Transcribe style (new in v1.5) - choose between verbatim (includes filler words and disfluencies) and the default readability-optimized output
Speed - synchronous processing “faster than real-time”
REST API, Python, C#, JavaScript, and Java SDKs
Voice Live API integration - use MAI-Transcribe for input audio transcription in real-time voice agent sessions

Here’s a quick comparison with Microsoft’s other transcription modes:

Feature	Fast Transcription (Default)	LLM Speech (Enhanced)	MAI-Transcribe 1.5
Transcription	Yes	Yes	Yes
Translation	No	Yes	No
Diarization (speaker labels)	Yes	Yes	No
Stereo channels	Yes	Yes	No
Profanity filtering	Yes	Yes	Yes
Specify locale	Yes	Yes	Yes
Custom prompting	No	Yes	No
Phrase list	Yes	No	Yes (v1.5 new)
Segment-level timestamps	Yes	Yes	Yes
Word-level timestamps	Yes	Yes	No
Transcribe style (verbatim)	No	No	Yes (v1.5 new)

Source: Microsoft Learn - Fast Transcription API documentation

The feature table tells a clear story. MAI-Transcribe 1.5 is laser-focused on raw transcription speed and multilingual accuracy. It traded away diarization, word-level timestamps, and stereo channel support to get there. Whether that’s a dealbreaker depends on your use case.
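One way to put the table to work is to encode it as data and ask which modes cover a given set of requirements. This is a minimal sketch: the feature keys and the `modes_supporting` helper are my own naming, and the support values are copied from the table above, which may change while the model is in preview.

```python
# Feature support per transcription mode, transcribed from the comparison table.
# The key names are illustrative labels, not API identifiers.
FEATURES = {
    "fast": {
        "transcription", "diarization", "stereo_channels", "profanity_filtering",
        "locale", "phrase_list", "segment_timestamps", "word_timestamps",
    },
    "llm_speech": {
        "transcription", "translation", "diarization", "stereo_channels",
        "profanity_filtering", "locale", "custom_prompting",
        "segment_timestamps", "word_timestamps",
    },
    "mai_transcribe_1_5": {
        "transcription", "profanity_filtering", "locale", "phrase_list",
        "segment_timestamps", "verbatim_style",
    },
}


def modes_supporting(required):
    """Return the modes (sorted by name) that support every required feature."""
    needed = set(required)
    return sorted(mode for mode, feats in FEATURES.items() if needed <= feats)


print(modes_supporting(["phrase_list"]))                 # ['fast', 'mai_transcribe_1_5']
print(modes_supporting(["diarization", "phrase_list"]))  # ['fast']
print(modes_supporting(["verbatim_style", "word_timestamps"]))  # []
```

The empty result on the last line is the trade-off in miniature: verbatim output and word-level timestamps are not available in the same mode.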

Pricing: What MAI-Transcribe 1.5 Actually Costs

Here’s what’s clear from Microsoft’s pricing documentation: MAI-Transcribe shares the same SKU and pricing tier as Fast Transcription and LLM Speech. Microsoft’s pricing page groups them under a single “LLM Speech” line item with separate entries for Standard Transcription, Standard Translation, and MAI-Transcribe - all at the same per-hour rate.

The Azure Speech pricing model is pay-as-you-go, billed per audio hour in one-second increments. There’s also a free tier (F0) that gives you 5 audio hours per month for speech-to-text.

For high-volume users, commitment tiers offer discounts at 2,000, 10,000, and 50,000 hours per month.

When using MAI-Transcribe with the Voice Live API, Microsoft notes that “Standard-Audio pricing applies”. This means the per-hour audio input rate is the same regardless of whether you’re using MAI-Transcribe or the default speech model for voice agent applications.

To get the exact per-hour dollar amount, you’ll need to check the Azure pricing calculator or the Speech Services pricing page - Microsoft renders prices dynamically based on your region and currency and doesn’t hardcode them in the static page.

Cost Estimate (Based on Historical Azure STT Rates)

Azure standard speech-to-text has historically priced around $1.00 per audio hour for batch/real-time transcription in US regions, with fast transcription at a similar rate. Commitment tiers bring that down significantly. If MAI-Transcribe follows the same structure:

Pay-as-you-go: ~$1.00/hour
2,000 hrs/month commitment: ~$0.78/hour
10,000 hrs/month commitment: ~$0.60/hour

On list price, that puts MAI-Transcribe above its main rivals: AssemblyAI’s best-in-class models run around $0.37–$0.65/hour depending on features, and OpenAI’s Whisper API (via Azure) is approximately $0.36/hour. What you’re paying for is speed - MAI-Transcribe runs on GPU-accelerated infrastructure that delivers sub-real-time latency, which neither Whisper nor AssemblyAI’s batch APIs can consistently match for large files.
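To see what those estimated rates mean for a real workload, here is a rough calculator. It is a sketch under several assumptions: the rates are the historical estimates from this section, not official MAI-Transcribe prices; each clip is rounded up to a whole second (the pricing model says one-second increments but the rounding direction is my guess); and commitment tiers are treated as a flat effective rate, ignoring the free tier and any overage rules.

```python
import math

# Illustrative per-hour rates taken from the estimates above; not official prices.
RATES_USD_PER_HOUR = {
    "pay_as_you_go": 1.00,
    "commit_2000h": 0.78,
    "commit_10000h": 0.60,
}


def billable_seconds(duration_seconds):
    """Round a clip up to whole seconds (assumed rounding for one-second billing)."""
    return math.ceil(duration_seconds)


def monthly_cost(clip_durations_seconds, rate_per_hour):
    """Estimated USD cost for a month's clips at a flat per-hour rate."""
    total_seconds = sum(billable_seconds(d) for d in clip_durations_seconds)
    return round(total_seconds / 3600 * rate_per_hour, 2)


# Example workload: 1,000 voicemails of 90.4 seconds each (91 billable seconds per clip).
clips = [90.4] * 1000
for tier, rate in RATES_USD_PER_HOUR.items():
    print(tier, monthly_cost(clips, rate))
```

For this workload (about 25.3 billable hours) the pay-as-you-go estimate comes to roughly $25, so the commitment tiers only matter at volumes far above it.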

Languages: MAI-Transcribe 1.5 vs 1.0

The language expansion is one of the biggest selling points of v1.5. Here’s the full supported language list:

Language	v1.0	v1.5	Language	v1.0	v1.5
Arabic	Yes	Yes	Lithuanian	-	Yes
Assamese	-	Yes	Malayalam	-	Yes
Bulgarian	-	Yes	Marathi	-	Yes
Bengali	-	Yes	Norwegian	Yes	Yes
Catalan	-	Yes	Dutch	Yes	Yes
Czech	Yes	Yes	Odia	-	Yes
Danish	Yes	Yes	Punjabi	-	Yes
German	Yes	Yes	Polish	Yes	Yes
Greek	-	Yes	Portuguese	Yes	Yes
English	Yes	Yes	Romanian	Yes	Yes
Spanish	Yes	Yes	Russian	Yes	Yes
Estonian	-	Yes	Slovak	-	Yes
Finnish	Yes	Yes	Slovenian	-	Yes
French	Yes	Yes	Swedish	Yes	Yes
Gujarati	-	Yes	Tamil	-	Yes
Hindi	Yes	Yes	Telugu	-	Yes
Hungarian	Yes	Yes	Thai	Yes	Yes
Indonesian	Yes	Yes	Turkish	Yes	Yes
Italian	Yes	Yes	Ukrainian	-	Yes
Japanese	Yes	Yes	Vietnamese	Yes	Yes
Kannada	-	Yes
Korean	Yes	Yes

v1.5 is billed as supporting 47 languages, roughly 20 more than v1. The table above lists 42 of them, 18 of which are new: major Indic languages (Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu), Eastern European languages (Bulgarian, Estonian, Lithuanian, Slovak, Slovenian, Ukrainian), and others like Greek, Catalan, Assamese, and Bengali.
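A quick tally of the table rows is a useful sanity check on the headline numbers. The two lists below are transcribed from the table above; the variable names are my own. The table comes up five short of the 47-language headline figure, so treat it as a partial list and confirm against Microsoft's language support page.

```python
# Languages marked "Yes" for v1.0 in the table above.
IN_V1_0 = [
    "Arabic", "Czech", "Danish", "German", "English", "Spanish", "Finnish",
    "French", "Hindi", "Hungarian", "Indonesian", "Italian", "Japanese",
    "Korean", "Norwegian", "Dutch", "Polish", "Portuguese", "Romanian",
    "Russian", "Swedish", "Thai", "Turkish", "Vietnamese",
]

# Languages marked "-" for v1.0 and "Yes" for v1.5.
NEW_IN_V1_5 = [
    "Assamese", "Bulgarian", "Bengali", "Catalan", "Greek", "Estonian",
    "Gujarati", "Kannada", "Lithuanian", "Malayalam", "Marathi", "Odia",
    "Punjabi", "Slovak", "Slovenian", "Tamil", "Telugu", "Ukrainian",
]

listed_total = len(IN_V1_0) + len(NEW_IN_V1_5)
print("listed for v1.0:", len(IN_V1_0))      # 24
print("new in v1.5:", len(NEW_IN_V1_5))      # 18
print("listed for v1.5:", listed_total)      # 42
print("missing vs. 47:", 47 - listed_total)  # 5
```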

The model operates in multilingual mode by default - you don’t need to specify a locale, and it detects the language automatically. You can optionally constrain it to a single language by setting the locales parameter, which also improves latency.

API Integration: How to Use It

Using MAI-Transcribe 1.5 is straightforward. Here’s a minimal REST call:

curl --location 'https://YourResourceName.cognitiveservices.azure.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourKey>' \
--form 'audio=@"audio.mp3"' \
--form 'definition={
 "enhancedMode": {
 "enabled": true,
 "model": "mai-transcribe-1.5"
 }
}'

To get verbatim output (with filler words like “um” and “uh” preserved):

--form 'definition={
 "enhancedMode": {
 "enabled": true,
 "model": "mai-transcribe-1.5",
 "transcribeStyle": "verbatim"
 }
}'

To add entity biasing (new in v1.5):

--form 'definition={
 "phraseList": {
 "phrases": ["Contoso", "Rehaan", "MAI-Transcribe"]
 },
 "enhancedMode": {
 "enabled": true,
 "model": "mai-transcribe-1.5"
 }
}'
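If you're assembling these payloads in code rather than by hand, a small builder keeps the options in one place. This is a sketch, not SDK code: `build_definition` is a hypothetical helper, the field names mirror the curl snippets above, and placing `locales` at the top level of the definition is my assumption, so check it against the current API reference.

```python
import json


def build_definition(model="mai-transcribe-1.5", verbatim=False,
                     phrases=None, locales=None):
    """Build the JSON string for the `definition` form field."""
    enhanced = {"enabled": True, "model": model}
    if verbatim:
        # Preserve filler words and disfluencies, as in the verbatim snippet above.
        enhanced["transcribeStyle"] = "verbatim"
    definition = {"enhancedMode": enhanced}
    if phrases:
        # Entity biasing: names, acronyms, product terms.
        definition["phraseList"] = {"phrases": list(phrases)}
    if locales:
        # Assumption: `locales` sits at the top level of the definition.
        definition["locales"] = list(locales)
    return json.dumps(definition)


print(build_definition(verbatim=True, phrases=["Contoso", "MAI-Transcribe"]))
```

The returned string is what you would pass as the value of the `definition` form field alongside the audio file.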

Supported audio formats: WAV, MP3, FLAC (max 300 MB per file).

Available regions: eastus, northeurope, southeastasia, westus. That’s only 4 regions - a significant limitation compared to standard fast transcription which is available in 20+ regions.

SDK support: Python (azure-ai-transcription), C# (Azure.AI.Speech.Transcription), JavaScript (@azure/ai-speech-transcription), and Java (azure-ai-speech-transcription). All four ship with LLM Speech support including MAI-Transcribe model selection.
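The format, size, and region limits above are easy to check before you upload anything. The sketch below is a hypothetical pre-flight helper, not part of any SDK; it assumes "300 MB" means 300 MiB and judges the format by file extension alone.

```python
import os

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac"}
MAX_FILE_BYTES = 300 * 1024 * 1024  # assumption: "300 MB" read as 300 MiB
SUPPORTED_REGIONS = {"eastus", "northeurope", "southeastasia", "westus"}


def preflight_problems(filename, size_bytes, region):
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'no extension'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 300 MB")
    if region.lower() not in SUPPORTED_REGIONS:
        problems.append(f"region not supported: {region}")
    return problems


print(preflight_problems("voicemail.mp3", 4_000_000, "eastus"))  # []
print(preflight_problems("call.ogg", 4_000_000, "westeurope"))
```

Returning a list rather than raising on the first failure lets you report every issue with a file in one pass.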

Real-Time vs Batch Processing

MAI-Transcribe 1.5 operates in a synchronous, sub-real-time mode. You upload an audio file, the API processes it, and returns the full transcript in a single response - faster than the audio’s actual duration.

This makes it ideal for:

Voicemail transcription - get the transcript back before the user finishes checking their inbox
Meeting note generation - process a 30-minute recording in under 10 minutes
Quick subtitle generation - transcribe a video without waiting for batch processing queues
Voice agent input - use the Voice Live API for real-time transcription within AI agent calls
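"Faster than real-time" is usually expressed as a real-time factor (RTF): processing time divided by audio duration, where anything below 1.0 beats real time. The helper below is a generic sketch, and the example plugs in the meeting-notes figure from the list above as an upper bound rather than a measured result.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# 30 minutes of audio processed in 10 minutes (the "under 10 minutes" bound above).
rtf = real_time_factor(30 * 60, 10 * 60)
print(f"RTF = {rtf:.2f}")  # RTF = 0.33
```

Timing your own files this way gives you a number you can compare across providers and file lengths.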

For high-volume, asynchronous processing (hundreds or thousands of hours), Microsoft’s batch transcription API is still the better choice. Batch transcription supports custom speech models, up to 35-speaker diarization, and large-scale job scheduling. MAI-Transcribe’s sweet spot is speed and simplicity for individual audio files.

Accuracy and Word Error Rate

Microsoft hasn’t published formal WER benchmarks for MAI-Transcribe 1.5 on standard academic datasets as of this writing. The model is still in public preview.

However, from the official documentation examples, we can observe confidence scores. The standard fast transcription API shows per-phrase confidence scores (e.g., 0.93554276, 0.92022026, 0.93265927), while the LLM Speech API (which powers MAI-Transcribe) returns confidence: 0 for all outputs - indicating the confidence scoring mechanism isn’t implemented for the multimodal model path yet.
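If your pipeline filters or flags segments by confidence, it's worth detecting this case explicitly instead of treating every MAI-Transcribe phrase as low-confidence. A minimal sketch, assuming the response is a JSON object with a `phrases` array whose items carry a `confidence` field; that shape is inferred from the documentation examples described above, and the sample payloads are made up.

```python
def confidence_is_populated(response):
    """True if at least one phrase in the response has a non-zero confidence."""
    phrases = response.get("phrases", [])
    return any(p.get("confidence", 0) > 0 for p in phrases)


# Hypothetical payloads shaped like the two behaviours described above.
fast_response = {"phrases": [{"text": "hello there", "confidence": 0.93554276}]}
mai_response = {"phrases": [{"text": "hello there", "confidence": 0}]}

print(confidence_is_populated(fast_response))  # True
print(confidence_is_populated(mai_response))   # False
```

When the check returns False, skip confidence-based thresholds for that transcript rather than discarding it.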

What we do know:

MAI-Transcribe 1.5 uses a multimodal model architecture - meaning it leverages both acoustic and linguistic understanding simultaneously, similar to how LLMs process text
The model is described as “optimized for both high accuracy and high efficiency”
The transcribeStyle: verbatim feature is a rarity in the STT space - most APIs don’t give you control over filler-word preservation
The phrase list / entity biasing feature is a direct accuracy lever for specialized domains (medical, legal, technical)

In my testing on clean English speech, MAI-Transcribe 1.5 performs comparably to Azure’s standard fast transcription models. The big differentiator is on multilingual and accented speech - where the multimodal architecture appears to handle code-switching and non-native accents better than traditional acoustic+language model pipelines.

Microsoft MAI-Transcribe 1.5 vs Whisper: A Side-by-Side Look

This is the comparison everyone’s asking about.

Capability	MAI-Transcribe 1.5	OpenAI Whisper (Azure)	OpenAI Whisper (API)
Model type	Multimodal LLM-based	Encoder-decoder transformer	Encoder-decoder transformer
Languages	47	~99 (Whisper large-v3)	~99 (Whisper large-v3)
Diarization	No	No	No
Word-level timestamps	No	Yes (via Azure)	Via third-party tools
Real-time latency	Sub-real-time synchronous	Batch only (on Azure)	Batch only
Phrase list / biasing	Yes (v1.5)	No	No
Verbatim mode	Yes (v1.5)	No	No
Profanity filtering	Yes	Yes (Azure)	No
Self-hosted option	No	No	Yes (open-source)
Region availability	4 regions	6+ regions	Global (API)
Pricing	Shared w/ Fast Trans. SKU	~$0.36/hr (Azure)	$0.36/hr (API)

The bottom line: Whisper wins on language breadth (99 vs 47) and has the massive advantage of being open-source - you can run it locally, fine-tune it, and avoid per-call API costs. MAI-Transcribe 1.5 wins on speed (synchronous sub-real-time vs Whisper’s batch-only on Azure), the phrase list biasing feature, and the verbatim transcription option. If you need diarization, neither handles it natively - you’ll need an add-on or a different API.

Real-World Use Cases

1. Meeting Transcription and Note Generation

MAI-Transcribe 1.5’s sub-real-time speed makes it a strong candidate for post-meeting transcription workflows. Record a 45-minute meeting, upload the file, and get your transcript back in under 5 minutes - with proper punctuation, capitalization, and segment-level timestamps for navigation.

The phrase list feature is particularly valuable here. Add your team members’ names, project code names, and company-specific jargon to improve accuracy. Among Microsoft’s LLM-based options it’s unique to MAI-Transcribe: LLM Speech (enhanced) has no phrase list, though standard fast transcription does.

Limitation: No diarization. You won’t get speaker labels (“Speaker 1,” “Speaker 2”). For that, you’d need to use LLM Speech (enhanced mode) or a separate diarization service.

2. Call Center and Customer Service

Microsoft explicitly calls out voicemail transcription and call center scenarios as target use cases for the fast transcription API that MAI-Transcribe rides on.

The profanity filtering is a practical feature for customer-facing transcripts. The speed means agents can review transcribed calls almost immediately after they end. And the multilingual support covers most major call center languages (English, Spanish, French, German, Japanese, Korean, Hindi, Arabic, Portuguese).

3. Content Production and Subtitle Generation

Video editors and content teams need quick turnarounds. MAI-Transcribe 1.5 can process a 20-minute video in a few minutes - giving you a clean transcript ready for subtitle formatting.

The transcribeStyle: verbatim option is interesting for content creators who want to preserve the raw, unpolished nature of interviews or podcasts. Most STT APIs strip filler words by default with no way to get them back.
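Turning segment-level timestamps into subtitles is mostly a formatting job. The sketch below converts phrases into SubRip (SRT) cues using the `offsetMilliseconds` and `durationMilliseconds` fields mentioned earlier; the `text` key and the sample phrases are assumptions for illustration. Because the timestamps are per segment, each cue spans a whole phrase, so long segments may need manual splitting.

```python
def srt_timestamp(ms):
    """Format integer milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def phrases_to_srt(phrases):
    """Build SRT text with one numbered cue per transcribed phrase."""
    cues = []
    for index, phrase in enumerate(phrases, start=1):
        start = phrase["offsetMilliseconds"]
        end = start + phrase["durationMilliseconds"]
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{phrase['text']}\n"
        )
    return "\n".join(cues)


# Made-up sample phrases in the assumed response shape.
sample = [
    {"offsetMilliseconds": 0, "durationMilliseconds": 1500, "text": "Welcome back."},
    {"offsetMilliseconds": 1800, "durationMilliseconds": 2200, "text": "Let's get started."},
]
print(phrases_to_srt(sample))
```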

4. Accessibility and Live Captioning

While MAI-Transcribe 1.5 isn’t a real-time streaming API (it works on pre-recorded files), its sub-real-time processing makes it suitable for near-real-time captioning of recorded webinars, training videos, and on-demand content. Combined with Azure’s text-to-speech capabilities, you can build comprehensive accessibility workflows entirely within the Azure ecosystem.

5. Voice Agents

The Voice Live API integration is noteworthy. Voice Live is Microsoft’s real-time voice agent platform - think AI customer service agents that speak naturally. MAI-Transcribe handles the input audio transcription side of that equation. Standard-Audio pricing applies, meaning there’s no premium for using the MAI model over the default speech recognition.

What’s Missing: The Limitations

MAI-Transcribe 1.5 is a purpose-built model, and that means trade-offs.

No diarization. This is the biggest gap. Most real-world transcription needs - meetings, interviews, call centers, podcasts - involve multiple speakers. Without speaker labels, the output is a wall of text with no attribution. Microsoft’s own LLM Speech (enhanced mode) supports diarization, as does the standard fast transcription API. The omission here suggests MAI-Transcribe is optimized for single-speaker scenarios (voicemails, dictation, lectures).

No word-level timestamps. Segment-level timestamps are useful, but word-level precision is critical for subtitling, video editing, and searchable transcripts. Both standard fast transcription and LLM Speech support word-level timestamps. MAI-Transcribe does not.

No stereo channel support. Multi-channel audio files are merged to mono before processing. For call center recordings where agent and customer are on separate channels, you lose that structural advantage.

Only 4 regions. At launch, MAI-Transcribe is available in eastus, northeurope, southeastasia, and westus. If you have data residency requirements in other regions, you’re out of luck for now.

No custom model training. Unlike Azure’s standard speech-to-text, which supports custom speech models trained on your own data, MAI-Transcribe 1.5 is a fixed model. The phrase list is your only customization lever.

No translation. LLM Speech supports translation to 9 target languages. MAI-Transcribe is transcription-only.

Public preview caveats. No SLA, not recommended for production workloads, features may change or be removed.

Who Should Use MAI-Transcribe 1.5?

Good fit for:

Developers who need fast, synchronous transcription of single-speaker audio files
Multilingual applications covering the 47 supported languages (especially Indic and Eastern European languages new in v1.5)
Use cases where phrase-level entity biasing improves accuracy (medical, legal, technical domains)
Voicemail transcription, dictation, lecture notes, single-speaker content
Voice agent input transcription via Voice Live API
Teams already invested in the Azure ecosystem who want the fastest transcription option

Not a good fit for:

Multi-speaker meetings or conversations (no diarization)
Subtitle workflows requiring word-level timestamps
Applications needing stereo channel separation
High-volume production workloads (it’s still in preview)
Teams with data residency requirements outside the 4 supported regions
Use cases where open-source, self-hosted models are preferred (look at Whisper)

The Verdict

MAI-Transcribe 1.5 is an impressive step forward for Microsoft’s speech AI portfolio. The language expansion from 27 to 47, the phrase list entity biasing, and the verbatim transcription mode are genuinely useful additions. The sub-real-time synchronous processing is fast enough to feel like magic when you first try it.

But it’s not a Swiss Army knife. The lack of diarization is a deliberate trade-off that limits its utility for the most common multi-speaker scenarios. The limited region availability and preview status mean it’s not ready for enterprise production deployment yet.

In the broader speech-to-text comparison landscape, MAI-Transcribe 1.5 carves out a clear niche: the fastest synchronous multilingual transcription on Azure, with a verbatim mode that Microsoft’s other transcription modes lack and phrase biasing that LLM Speech (enhanced) doesn’t offer. For single-speaker use cases in the supported language set, it’s genuinely compelling. For everything else, Microsoft’s own LLM Speech (enhanced) or a dedicated STT provider might serve you better.

I’m watching this space closely. If Microsoft adds diarization, word-level timestamps, and expands region support by GA, MAI-Transcribe could become the default choice for Azure STT workloads. For now, it’s a powerful but specialized tool - and for the right use case, it’s absolutely worth trying.

Sources

Microsoft Learn: MAI-Transcribe in LLM Speech API - Official documentation for MAI-Transcribe models
Microsoft Learn: Language and Voice Support for Azure Speech - Complete language support tables including MAI-Transcribe
Microsoft Learn: MAI-Transcribe Documentation - Public preview limitations and usage instructions
Microsoft Learn: Fast Transcription API - Feature comparison table across Fast Transcription, LLM Speech, and MAI-Transcribe
Microsoft Learn: Voice Live API Customization - MAI-Transcribe integration with Voice Live
Azure Pricing: Speech Services - Official pricing page noting LLM Speech shares SKU with Fast Transcription
Azure Pricing: Speech Services Free Tier - Free tier details (5 audio hours/month)
Azure Pricing: Speech Services - Voice Live - Standard-Audio pricing for MAI-Transcribe via Voice Live
Microsoft Learn: Speech Service Regions - LLM Speech region availability including MAI-Transcribe
Microsoft Learn: LLM Speech API - LLM Speech documentation with confidence score behavior and response format
