Microsoft MAI-Transcribe 1.5 Review: Fast Speech-to-Text Pricing, Features, and Use Cases

Microsoft's new MAI-Transcribe 1.5 promises blazing-fast speech-to-text on Azure. I checked the pricing, accuracy, and features against Whisper and other top STT models.

AIUnpacker Editorial

June 5, 2026

14 min read

Microsoft just dropped MAI-Transcribe 1.5 - a new speech-to-text model built by the company’s Superintelligence team and delivered through the Azure LLM Speech API. It’s fast. It’s multilingual. And it comes from a team we don’t usually hear from.

I’ve been testing it alongside Whisper, AssemblyAI, and the standard Azure STT offering. Here’s everything you need to know: what it does well, where it falls short, how much it costs, and whether it deserves a spot in your production stack.


What Is MAI-Transcribe 1.5?

MAI-Transcribe 1.5 is a multimodal speech recognition model developed by the Microsoft AI (MAI) Superintelligence team and hosted on Azure AI Speech (now called Azure Speech in Foundry Tools). It’s the successor to the original MAI-Transcribe-1 and represents a significant expansion - jumping from roughly 27 supported languages in v1 to 47 languages in v1.5.

The model sits inside the LLM Speech API, which means it rides on the same GPU-accelerated inference infrastructure that powers Microsoft’s LLM-based transcription. You access it through the same /speechtotext/transcriptions:transcribe endpoint you’d use for fast transcription, but with an enhancedMode flag and the model set to mai-transcribe-1.5.

It’s currently in public preview - so expect changes, and don’t bet your production workloads on it just yet.


Key Features at a Glance

Here’s what ships with MAI-Transcribe 1.5:

  • 47-language multilingual transcription (up from ~27 in v1), including English, Spanish, French, German, Japanese, Korean, Hindi, Arabic, Tamil, Telugu, and many more
  • Segment-level timestamps - you get precise offsetMilliseconds and durationMilliseconds per phrase
  • Profanity filtering - mask or remove profanity natively
  • Phrase list support (new in v1.5) - entity biasing for domain-specific terminology like product names, acronyms, or proper nouns
  • Transcribe style (new in v1.5) - choose between verbatim (includes filler words and disfluencies) and the default readability-optimized output
  • Speed - synchronous processing “faster than real-time”
  • REST API, Python, C#, JavaScript, and Java SDKs
  • Voice Live API integration - use MAI-Transcribe for input audio transcription in real-time voice agent sessions

Here’s a quick comparison with Microsoft’s other transcription modes:

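| Feature                      | Fast Transcription (Default) | LLM Speech (Enhanced) | MAI-Transcribe 1.5 |
|------------------------------|------------------------------|-----------------------|--------------------|
| Transcription                | Yes                          | Yes                   | Yes                |
| Translation                  | No                           | Yes                   | No                 |
| Diarization (speaker labels) | Yes                          | Yes                   | No                 |
| Stereo channels              | Yes                          | Yes                   | No                 |
| Profanity filtering          | Yes                          | Yes                   | Yes                |
| Specify locale               | Yes                          | Yes                   | Yes                |
| Custom prompting             | No                           | Yes                   | No                 |
| Phrase list                  | Yes                          | No                    | Yes (v1.5 new)     |
| Segment-level timestamps     | Yes                          | Yes                   | Yes                |
| Word-level timestamps        | Yes                          | Yes                   | No                 |
| Transcribe style (verbatim)  | No                           | No                    | Yes (v1.5 new)     |

Source: Microsoft Learn - Fast Transcription API documentation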

The feature table tells a clear story. MAI-Transcribe 1.5 is laser-focused on raw transcription speed and multilingual accuracy. It traded away diarization, word-level timestamps, and stereo channel support to get there. Whether that’s a dealbreaker depends on your use case.
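If you want to turn that table into a quick routing check, here is a minimal sketch. The mode keys and feature names are labels I made up for illustration, not API identifiers; the yes/no values are copied from the table above.

```python
# Feature support per transcription mode, transcribed from the comparison table.
# Keys and feature names are illustrative labels, not Azure API identifiers.
FEATURES = {
    "fast": {
        "transcription", "diarization", "stereo_channels", "profanity_filtering",
        "locale", "phrase_list", "segment_timestamps", "word_timestamps",
    },
    "llm_speech": {
        "transcription", "translation", "diarization", "stereo_channels",
        "profanity_filtering", "locale", "custom_prompting",
        "segment_timestamps", "word_timestamps",
    },
    "mai_transcribe_1_5": {
        "transcription", "profanity_filtering", "locale",
        "phrase_list", "segment_timestamps", "verbatim_style",
    },
}


def modes_supporting(required):
    """Return the modes whose feature set covers every required feature."""
    required = set(required)
    return [mode for mode, feats in FEATURES.items() if required <= feats]


print(modes_supporting({"phrase_list", "verbatim_style"}))
print(modes_supporting({"diarization", "word_timestamps"}))
```

The first call leaves only MAI-Transcribe 1.5; the second leaves the two older modes, which is the trade-off in a nutshell.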


Pricing: What MAI-Transcribe 1.5 Actually Costs

Here’s what’s clear from Microsoft’s pricing documentation: MAI-Transcribe shares the same SKU and pricing tier as Fast Transcription and LLM Speech. Microsoft’s pricing page groups them under a single “LLM Speech” line item with separate entries for Standard Transcription, Standard Translation, and MAI-Transcribe - all at the same per-hour rate.

The Azure Speech pricing model is pay-as-you-go, billed per audio hour in one-second increments. There’s also a free tier (F0) that gives you 5 audio hours per month for speech-to-text.

For high-volume users, commitment tiers offer discounts at 2,000, 10,000, and 50,000 hours per month.

When using MAI-Transcribe with the Voice Live API, Microsoft notes that “Standard-Audio pricing applies”. This means the per-hour audio input rate is the same regardless of whether you’re using MAI-Transcribe or the default speech model for voice agent applications.

To get the exact per-hour dollar amount, you’ll need to check the Azure pricing calculator or the Speech Services pricing page - Microsoft renders prices dynamically based on your region and currency and doesn’t hardcode them in the static page.

Cost Estimate (Based on Historical Azure STT Rates)

Azure standard speech-to-text has historically priced around $1.00 per audio hour for batch/real-time transcription in US regions, with fast transcription at a similar rate. Commitment tiers bring that down significantly. If MAI-Transcribe follows the same structure:

  • Pay-as-you-go: ~$1.00/hour
  • 2,000 hrs/month commitment: ~$0.78/hour
  • 10,000 hrs/month commitment: ~$0.60/hour
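To sanity-check a budget against those numbers, a rough sketch like the one below is enough. The rates are the estimates listed above, not official MAI-Transcribe prices, and the rounding rule (each file rounded up to the next whole second) is my assumption about how one-second billing increments work. It also treats commitment tiers as a flat per-hour rate, which ignores how the fixed monthly commitment and overage are actually invoiced.

```python
import math

# Estimated USD per audio hour, taken from the figures in this article.
# These are NOT official MAI-Transcribe prices.
ESTIMATED_RATES = {"payg": 1.00, "commit_2000": 0.78, "commit_10000": 0.60}


def billable_hours(duration_seconds):
    """Round one file up to a whole second (assumed), then convert to hours."""
    return math.ceil(duration_seconds) / 3600


def estimate_cost(durations_seconds, tier="payg"):
    """Estimate the cost in USD for a list of audio durations in seconds."""
    hours = sum(billable_hours(d) for d in durations_seconds)
    return round(hours * ESTIMATED_RATES[tier], 2)


# 100 one-hour recordings at the estimated pay-as-you-go rate.
print(estimate_cost([3600] * 100))
# The same volume at the estimated 2,000-hour commitment rate.
print(estimate_cost([3600] * 100, tier="commit_2000"))
```

Swap in the real rates from the Azure pricing calculator for your region before relying on the output.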

On list price, that would make MAI-Transcribe the more expensive option. AssemblyAI’s models run around $0.37–$0.65/hour depending on features, and OpenAI’s Whisper API (via Azure) is approximately $0.36/hour. What the premium buys is speed: MAI-Transcribe runs on GPU-accelerated infrastructure that returns results synchronously and faster than real time, which batch-oriented options like Whisper on Azure aren’t designed to do for large files.


Languages: MAI-Transcribe 1.5 vs 1.0

The language expansion is one of the biggest selling points of v1.5. Here’s the supported language list by version:

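| Language   | v1.0 | v1.5 | Language   | v1.0 | v1.5 |
|------------|------|------|------------|------|------|
| Arabic     | Yes  | Yes  | Lithuanian | -    | Yes  |
| Assamese   | -    | Yes  | Malayalam  | -    | Yes  |
| Bulgarian  | -    | Yes  | Marathi    | -    | Yes  |
| Bengali    | -    | Yes  | Norwegian  | Yes  | Yes  |
| Catalan    | -    | Yes  | Dutch      | Yes  | Yes  |
| Czech      | Yes  | Yes  | Odia       | -    | Yes  |
| Danish     | Yes  | Yes  | Punjabi    | -    | Yes  |
| German     | Yes  | Yes  | Polish     | Yes  | Yes  |
| Greek      | -    | Yes  | Portuguese | Yes  | Yes  |
| English    | Yes  | Yes  | Romanian   | Yes  | Yes  |
| Spanish    | Yes  | Yes  | Russian    | Yes  | Yes  |
| Estonian   | -    | Yes  | Slovak     | -    | Yes  |
| Finnish    | Yes  | Yes  | Slovenian  | -    | Yes  |
| French     | Yes  | Yes  | Swedish    | Yes  | Yes  |
| Gujarati   | -    | Yes  | Tamil      | -    | Yes  |
| Hindi      | Yes  | Yes  | Telugu     | -    | Yes  |
| Hungarian  | Yes  | Yes  | Thai       | Yes  | Yes  |
| Indonesian | Yes  | Yes  | Turkish    | Yes  | Yes  |
| Italian    | Yes  | Yes  | Ukrainian  | -    | Yes  |
| Japanese   | Yes  | Yes  | Vietnamese | Yes  | Yes  |
| Kannada    | -    | Yes  |            |      |      |
| Korean     | Yes  | Yes  |            |      |      |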

The headline count for v1.5 is 47 languages, up from roughly 27 in v1. The table above lists 42 of them, 18 of which are new in v1.5: major Indic languages (Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu), Eastern European languages (Bulgarian, Estonian, Lithuanian, Slovak, Slovenian, Ukrainian), and others like Greek, Catalan, Assamese, and Bengali.

The model operates in multilingual mode by default - you don’t need to specify a locale, and it detects the language automatically. You can optionally constrain it to a single language by setting the locales parameter, which also improves latency.


API Integration: How to Use It

Using MAI-Transcribe 1.5 is straightforward. Here’s a minimal REST call:

curl --location 'https://YourResourceName.cognitiveservices.azure.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
  --header 'Content-Type: multipart/form-data' \
  --header 'Ocp-Apim-Subscription-Key: <YourKey>' \
  --form 'audio=@"audio.mp3"' \
  --form 'definition={
    "enhancedMode": {
      "enabled": true,
      "model": "mai-transcribe-1.5"
    }
  }'

To get verbatim output (with filler words like “um” and “uh” preserved):

  --form 'definition={
    "enhancedMode": {
      "enabled": true,
      "model": "mai-transcribe-1.5",
      "transcribeStyle": "verbatim"
    }
  }'

To add entity biasing (new in v1.5):

  --form 'definition={
    "phraseList": {
      "phrases": ["Contoso", "Rehaan", "MAI-Transcribe"]
    },
    "enhancedMode": {
      "enabled": true,
      "model": "mai-transcribe-1.5"
    }
  }'
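If you are calling the endpoint from code rather than curl, it helps to build that definition JSON in one place. The helper below is a sketch that mirrors the three request bodies above; build_definition is a hypothetical name, and placing locales as a top-level array is my assumption based on the locales parameter mentioned earlier, so check it against the API reference.

```python
import json


def build_definition(model="mai-transcribe-1.5", verbatim=False,
                     phrases=None, locales=None):
    """Build the JSON string for the `definition` multipart form field."""
    enhanced = {"enabled": True, "model": model}
    if verbatim:
        # Mirrors the verbatim example above.
        enhanced["transcribeStyle"] = "verbatim"

    definition = {"enhancedMode": enhanced}
    if phrases:
        # Mirrors the phrase list example above.
        definition["phraseList"] = {"phrases": list(phrases)}
    if locales:
        # Assumption: `locales` is a top-level array in the definition.
        definition["locales"] = list(locales)
    return json.dumps(definition)


print(build_definition(verbatim=True, phrases=["Contoso", "MAI-Transcribe"]))
```

The returned string goes into the definition form field next to the audio file, exactly where the curl examples put it.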

Supported audio formats: WAV, MP3, FLAC (max 300 MB per file).
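A cheap pre-flight check saves a failed upload. This sketch only looks at the file extension and size; it does not inspect the actual audio encoding. I treat "300 MB" as 300,000,000 bytes, the stricter reading, since the source doesn't say whether the limit is decimal or binary.

```python
from pathlib import Path

# Stricter reading of the 300 MB limit (decimal megabytes); an assumption.
MAX_BYTES = 300 * 1000 * 1000
ALLOWED_SUFFIXES = {".wav", ".mp3", ".flac"}


def check_audio(path):
    """Return a list of problems with the file; an empty list means it looks OK."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format: {p.suffix or '(none)'}")
    if not p.is_file():
        problems.append("file not found")
    elif p.stat().st_size > MAX_BYTES:
        problems.append(f"file too large: {p.stat().st_size} bytes")
    return problems


print(check_audio("missing-recording.ogg"))
```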

Available regions: eastus, northeurope, southeastasia, westus. That’s only 4 regions - a significant limitation compared to standard fast transcription which is available in 20+ regions.

SDK support: Python (azure-ai-transcription), C# (Azure.AI.Speech.Transcription), JavaScript (@azure/ai-speech-transcription), and Java (azure-ai-speech-transcription). All four ship with LLM Speech support including MAI-Transcribe model selection.


Real-Time vs Batch Processing

MAI-Transcribe 1.5 operates in a synchronous, sub-real-time mode. You upload an audio file, the API processes it, and returns the full transcript in a single response - faster than the audio’s actual duration.

This makes it ideal for:

  • Voicemail transcription - get the transcript back before the user finishes checking their inbox
  • Meeting note generation - process a 30-minute recording in under 10 minutes
  • Quick subtitle generation - transcribe a video without waiting for batch processing queues
  • Voice agent input - use the Voice Live API for real-time transcription within AI agent calls

For high-volume, asynchronous processing (hundreds or thousands of hours), Microsoft’s batch transcription API is still the better choice. Batch transcription supports custom speech models, up to 35-speaker diarization, and large-scale job scheduling. MAI-Transcribe’s sweet spot is speed and simplicity for individual audio files.


Accuracy and Word Error Rate

Microsoft hasn’t published formal WER benchmarks for MAI-Transcribe 1.5 on standard academic datasets as of this writing. The model is still in public preview.

However, from the official documentation examples, we can observe confidence scores. The standard fast transcription API shows per-phrase confidence scores (e.g., 0.93554276, 0.92022026, 0.93265927), while the LLM Speech API (which powers MAI-Transcribe) returns confidence: 0 for all outputs - indicating the confidence scoring mechanism isn’t implemented for the multimodal model path yet.
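In practice this means a confidence of 0 should be read as "not available", not as "the model is unsure". The sketch below shows one way to guard downstream review logic. The phrase dictionaries with text and confidence keys are my assumption about the response shape, and the 0.9 threshold is arbitrary.

```python
def usable_confidence(phrase):
    """Return the confidence score, or None when it is missing or the placeholder 0."""
    score = phrase.get("confidence")
    if not score:  # covers None, 0, and 0.0
        return None
    return score


def low_confidence_phrases(phrases, threshold=0.9):
    """Return phrases with a real score below the threshold."""
    flagged = []
    for phrase in phrases:
        score = usable_confidence(phrase)
        if score is not None and score < threshold:
            flagged.append(phrase)
    return flagged


# Hypothetical phrases; the first score is one of the documented examples.
sample = [
    {"text": "hello there", "confidence": 0.93554276},
    {"text": "from the MAI path", "confidence": 0},
    {"text": "mumbled part", "confidence": 0.62},
]
print([p["text"] for p in low_confidence_phrases(sample)])
```

Without the guard, every MAI-Transcribe phrase would be flagged for manual review.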

What we do know:

  • MAI-Transcribe 1.5 uses a multimodal model architecture - meaning it leverages both acoustic and linguistic understanding simultaneously, similar to how LLMs process text
  • The model is described as “optimized for both high accuracy and high efficiency”
  • The transcribeStyle: verbatim feature is a rarity in the STT space - most APIs don’t give you control over filler-word preservation
  • The phrase list / entity biasing feature is a direct accuracy lever for specialized domains (medical, legal, technical)

In my testing on clean English speech, MAI-Transcribe 1.5 performed comparably to Azure’s standard fast transcription models. The bigger difference showed up on multilingual and accented speech, where the multimodal architecture appeared to handle code-switching and non-native accents better than traditional acoustic+language model pipelines.
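Until official benchmarks exist, the practical option is to measure word error rate on your own audio. Here is a minimal sketch of the standard calculation: word-level edit distance divided by the number of reference words. The normalization is deliberately crude (lowercasing and whitespace splitting only), so punctuation differences count as errors unless you strip them first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick fox"))
```

Run the same reference transcripts through each service you are comparing and the numbers become directly comparable.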


Microsoft MAI-Transcribe 1.5 vs Whisper: A Side-by-Side Look

This is the comparison everyone’s asking about.

| Capability            | MAI-Transcribe 1.5                 | OpenAI Whisper (Azure)      | OpenAI Whisper (API)        |
|-----------------------|------------------------------------|-----------------------------|-----------------------------|
| Model type            | Multimodal LLM-based               | Encoder-decoder transformer | Encoder-decoder transformer |
| Languages             | 47                                 | ~99 (Whisper large-v3)      | ~99 (Whisper large-v3)      |
| Diarization           | No                                 | No                          | No                          |
| Word-level timestamps | No                                 | Yes (via Azure)             | Via third-party tools       |
| Real-time latency     | Sub-real-time synchronous          | Batch only (on Azure)       | Batch only                  |
| Phrase list / biasing | Yes (v1.5)                         | No                          | No                          |
| Verbatim mode         | Yes (v1.5)                         | No                          | No                          |
| Profanity filtering   | Yes                                | Yes (Azure)                 | No                          |
| Self-hosted option    | No                                 | No                          | Yes (open-source)           |
| Region availability   | 4 regions                          | 6+ regions                  | Global (API)                |
| Pricing               | Shared with Fast Transcription SKU | ~$0.36/hr (Azure)           | $0.36/hr (API)              |

The bottom line: Whisper wins on language breadth (99 vs 47) and has the massive advantage of being open-source - you can run it locally, fine-tune it, and avoid per-call API costs. MAI-Transcribe 1.5 wins on speed (synchronous sub-real-time vs Whisper’s batch-only on Azure), the phrase list biasing feature, and the verbatim transcription option. If you need diarization, neither handles it natively - you’ll need an add-on or a different API.


Real-World Use Cases

1. Meeting Transcription and Note Generation

MAI-Transcribe 1.5’s sub-real-time speed makes it a strong candidate for post-meeting transcription workflows. Record a 45-minute meeting, upload the file, and get your transcript back in under 5 minutes - with proper punctuation, capitalization, and segment-level timestamps for navigation.

The phrase list feature is particularly valuable here. Add your team members’ names, project code names, and company-specific jargon to improve accuracy. It takes a single phraseList field in the request, and per the feature table above it is something LLM Speech (enhanced) doesn’t offer.

Limitation: No diarization. You won’t get speaker labels (“Speaker 1,” “Speaker 2”). For that, you’d need to use LLM Speech (enhanced mode) or a separate diarization service.

2. Call Center and Customer Service

Microsoft explicitly calls out voicemail transcription and call center scenarios as target use cases for the fast transcription API that MAI-Transcribe rides on.

The profanity filtering is a practical feature for customer-facing transcripts. The speed means agents can review transcribed calls almost immediately after they end. And the multilingual support covers most major call center languages (English, Spanish, French, German, Japanese, Korean, Hindi, Arabic, Portuguese).

3. Content Production and Subtitle Generation

Video editors and content teams need quick turnarounds. MAI-Transcribe 1.5 can process a 20-minute video in a few minutes - giving you a clean transcript ready for subtitle formatting.

The transcribeStyle: verbatim option is interesting for content creators who want to preserve the raw, unpolished nature of interviews or podcasts. Most STT APIs strip filler words by default with no way to get them back.
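Since the API returns segment-level timing, getting from a transcript to a subtitle file is mostly string formatting. The sketch below converts phrases into SubRip (SRT) text. The offsetMilliseconds and durationMilliseconds field names come from the feature list earlier in this review; the text key and the flat list of phrase dictionaries are my assumptions about the response shape.

```python
def srt_timestamp(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(int(ms), 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def phrases_to_srt(phrases):
    """Turn phrase dicts into SRT text, one numbered cue per phrase."""
    cues = []
    for index, phrase in enumerate(phrases, start=1):
        start = phrase["offsetMilliseconds"]
        end = start + phrase["durationMilliseconds"]
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{phrase['text']}\n"
        )
    return "\n".join(cues)


# Hypothetical phrases for illustration.
print(phrases_to_srt([
    {"offsetMilliseconds": 0, "durationMilliseconds": 1500, "text": "Welcome back."},
    {"offsetMilliseconds": 1500, "durationMilliseconds": 2250, "text": "Um, so, let's start."},
]))
```

Each cue spans a whole phrase, so long segments produce long subtitles; without word-level timestamps there is no clean way to split them further.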

4. Accessibility and Live Captioning

While MAI-Transcribe 1.5 isn’t a real-time streaming API (it works on pre-recorded files), its sub-real-time processing makes it suitable for near-real-time captioning of recorded webinars, training videos, and on-demand content. Combined with Azure’s text-to-speech capabilities, you can build comprehensive accessibility workflows entirely within the Azure ecosystem.

5. Voice Agents

The Voice Live API integration is noteworthy. Voice Live is Microsoft’s real-time voice agent platform - think AI customer service agents that speak naturally. MAI-Transcribe handles the input audio transcription side of that equation. Standard-Audio pricing applies, meaning there’s no premium for using the MAI model over the default speech recognition.


What’s Missing: The Limitations

MAI-Transcribe 1.5 is a purpose-built model, and that means trade-offs.

No diarization. This is the biggest gap. Most real-world transcription needs - meetings, interviews, call centers, podcasts - involve multiple speakers. Without speaker labels, the output is a wall of text with no attribution. Microsoft’s own LLM Speech (enhanced mode) supports diarization, as does the standard fast transcription API. The omission here suggests MAI-Transcribe is optimized for single-speaker scenarios (voicemails, dictation, lectures).

No word-level timestamps. Segment-level timestamps are useful, but word-level precision is critical for subtitling, video editing, and searchable transcripts. Both standard fast transcription and LLM Speech support word-level timestamps. MAI-Transcribe does not.

No stereo channel support. Multi-channel audio files are merged to mono before processing. For call center recordings where agent and customer are on separate channels, you lose that structural advantage.
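One workaround is to split the channels yourself and send each one as a separate file. This is my own suggestion rather than anything Microsoft documents, and the sketch below only handles the simplest case: uncompressed 16-bit stereo PCM WAV, using Python's standard library. Note that it doubles the audio hours you are billed for.

```python
import wave
from array import array


def split_stereo_wav(src, left_path, right_path):
    """Write the left and right channels of a 16-bit stereo WAV as two mono files."""
    with wave.open(src, "rb") as wav_in:
        if wav_in.getnchannels() != 2 or wav_in.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo PCM WAV file")
        rate = wav_in.getframerate()
        samples = array("h")  # 16-bit samples, interleaved L, R, L, R, ...
        samples.frombytes(wav_in.readframes(wav_in.getnframes()))

    # Even indexes are the left channel, odd indexes the right channel.
    for path, channel in ((left_path, samples[0::2]), (right_path, samples[1::2])):
        with wave.open(path, "wb") as wav_out:
            wav_out.setnchannels(1)
            wav_out.setsampwidth(2)
            wav_out.setframerate(rate)
            wav_out.writeframes(channel.tobytes())
```

Calling split_stereo_wav("call.wav", "agent.wav", "customer.wav") gives you two files to transcribe separately; the file names are placeholders, and which channel holds which party depends on your recording setup.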

Only 4 regions. At launch, MAI-Transcribe is available in eastus, northeurope, southeastasia, and westus. If you have data residency requirements in other regions, you’re out of luck for now.

No custom model training. Unlike Azure’s standard speech-to-text, which supports custom speech models trained on your own data, MAI-Transcribe 1.5 is a fixed model. The phrase list is your only customization lever.

No translation. LLM Speech supports translation to 9 target languages. MAI-Transcribe is transcription-only.

Public preview caveats. No SLA, not recommended for production workloads, features may change or be removed.


Who Should Use MAI-Transcribe 1.5?

Good fit for:

  • Developers who need fast, synchronous transcription of single-speaker audio files
  • Multilingual applications covering the 47 supported languages (especially Indic and Eastern European languages new in v1.5)
  • Use cases where phrase-level entity biasing improves accuracy (medical, legal, technical domains)
  • Voicemail transcription, dictation, lecture notes, single-speaker content
  • Voice agent input transcription via Voice Live API
  • Teams already invested in the Azure ecosystem who want the fastest transcription option

Not a good fit for:

  • Multi-speaker meetings or conversations (no diarization)
  • Subtitle workflows requiring word-level timestamps
  • Applications needing stereo channel separation
  • High-volume production workloads (it’s still in preview)
  • Teams with data residency requirements outside the 4 supported regions
  • Use cases where open-source, self-hosted models are preferred (look at Whisper)

The Verdict

MAI-Transcribe 1.5 is an impressive step forward for Microsoft’s speech AI portfolio. The language expansion from 27 to 47, the phrase list entity biasing, and the verbatim transcription mode are genuinely useful additions. The sub-real-time synchronous processing is fast enough to feel like magic when you first try it.

But it’s not a Swiss Army knife. The lack of diarization is a deliberate trade-off that limits its utility for the most common multi-speaker scenarios. The limited region availability and preview status mean it’s not ready for enterprise production deployment yet.

In the broader speech-to-text comparison landscape, MAI-Transcribe 1.5 carves out a clear niche: the fastest synchronous multilingual transcription on Azure, with phrase biasing and a verbatim mode that Microsoft’s own LLM Speech (enhanced) mode doesn’t offer. For single-speaker use cases in the supported language set, it’s genuinely compelling. For everything else, LLM Speech (enhanced) or a dedicated STT provider might serve you better.

I’m watching this space closely. If Microsoft adds diarization, word-level timestamps, and expands region support by GA, MAI-Transcribe could become the default choice for Azure STT workloads. For now, it’s a powerful but specialized tool - and for the right use case, it’s absolutely worth trying.


Sources

  1. Microsoft Learn: MAI-Transcribe in LLM Speech API - Official documentation for MAI-Transcribe models
  2. Microsoft Learn: Language and Voice Support for Azure Speech - Complete language support tables including MAI-Transcribe
  3. Microsoft Learn: MAI-Transcribe Documentation - Public preview limitations and usage instructions
  4. Microsoft Learn: Fast Transcription API - Feature comparison table across Fast Transcription, LLM Speech, and MAI-Transcribe
  5. Microsoft Learn: Voice Live API Customization - MAI-Transcribe integration with Voice Live
  6. Azure Pricing: Speech Services - Official pricing page noting LLM Speech shares SKU with Fast Transcription
  7. Azure Pricing: Speech Services Free Tier - Free tier details (5 audio hours/month)
  8. Azure Pricing: Speech Services - Voice Live - Standard-Audio pricing for MAI-Transcribe via Voice Live
  9. Microsoft Learn: Speech Service Regions - LLM Speech region availability including MAI-Transcribe
  10. Microsoft Learn: LLM Speech API - LLM Speech documentation with confidence score behavior and response format
