What Is MAI-Transcribe 1.5? Microsoft’s AI Transcription Model Explained
If you’ve been keeping tabs on Microsoft’s AI releases lately, you’ve probably heard whispers about MAI-Transcribe 1.5. It’s Microsoft’s newest speech-to-text model, and it’s making waves for all the right reasons. But what exactly is it? Who built it? How does it compare to the older version? And more importantly - should you, as a developer or enterprise decision-maker, actually care?
I’ve spent the last week digging through every piece of official documentation, pricing page, and API reference Microsoft has published on this model. Here’s everything you need to know, no fluff.
What Is MAI-Transcribe 1.5? The Short Version
MAI-Transcribe 1.5 is a multimodal speech recognition model built by the Microsoft AI (MAI) Superintelligence team. It’s the second-generation version of their in-house transcription model, replacing the original mai-transcribe-1. The model lives inside Azure’s Speech service, accessed through the LLM Speech API - the same endpoint that powers Microsoft’s other fast transcription offerings.
Think of it as Microsoft’s own homegrown answer to OpenAI’s Whisper. It’s designed for two things: high accuracy and high efficiency. It transcribes audio files synchronously - you send an audio file, you get text back - and it does it faster than real-time playback.
Right now, MAI-Transcribe 1.5 is in public preview. That means it’s available to use, but it’s not yet covered by Microsoft’s standard SLA and shouldn’t be treated as production-ready. Still, its feature set is unusually mature for a preview offering.
Who Built It? The MAI Superintelligence Team
This is where things get interesting. MAI-Transcribe wasn’t built by the Azure Cognitive Services team that traditionally handled speech recognition at Microsoft. It comes from the Microsoft AI (MAI) Superintelligence team - a relatively new research group within Microsoft that’s been pumping out foundation models at a rapid clip.
You may have heard of other MAI-branded models: MAI-Code-1-Flash, MAI-1, and various other “MAI” prefixed releases that have appeared on Microsoft’s model catalog over the past year. The “MAI” prefix signals models developed by Microsoft’s in-house AI research division rather than licensed from partners like OpenAI. The Superintelligence team’s mandate, based on their public output, appears to be building cost-efficient, high-performance models that compete with both open-source alternatives and proprietary offerings.
MAI-Transcribe 1.5 is their second crack at speech recognition. The first version (mai-transcribe-1) launched earlier and supported 24 languages. Version 1.5 raises that to 42, a 75% increase.
How MAI-Transcribe 1.5 Works (The Technical Picture)
Under the hood, MAI-Transcribe 1.5 is a multimodal model, meaning it processes audio input directly through neural architectures designed to handle both acoustic and linguistic patterns simultaneously. This is different from traditional speech-to-text pipelines that use separate acoustic models, language models, and decoders stitched together.
The model operates through the LLM Speech API, which is Microsoft’s umbrella endpoint for large language model-enhanced speech services. You interact with it via a straightforward REST call:
curl --location 'https://YourResourceName.cognitiveservices.azure.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
  "enhancedMode": {
    "enabled": true,
    "model": "mai-transcribe-1.5"
  }
}'
The API version 2025-10-15 is the current standard. You set enhancedMode.enabled to true and specify mai-transcribe-1.5 as the model name. That’s it. The endpoint handles authentication via either an API key or Entra ID (formerly Azure AD) token.
The model returns results in the same JSON structure as other fast transcription endpoints: a combinedPhrases array with the full transcript and a phrases array with segment-level breakdowns including offsetMilliseconds, durationMilliseconds, text, and confidence scores.
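To make that response shape concrete, here is a minimal Python sketch that pulls the full transcript and per-segment timing out of a parsed response. The field names come from the description above; the `sample` payload and the `summarize_transcription` helper are made up for illustration and are not taken from Microsoft's documentation.

```python
def summarize_transcription(response):
    """Return (full_transcript, segments) from a parsed transcription response."""
    # combinedPhrases is an array; join the text of every entry.
    transcript = " ".join(p["text"] for p in response.get("combinedPhrases", []))
    segments = [
        {
            "start_s": p["offsetMilliseconds"] / 1000,
            "end_s": (p["offsetMilliseconds"] + p["durationMilliseconds"]) / 1000,
            "text": p["text"],
            "confidence": p.get("confidence"),
        }
        for p in response.get("phrases", [])
    ]
    return transcript, segments


# Hypothetical payload that only mirrors the documented field names.
sample = {
    "combinedPhrases": [{"text": "Hello there. How are you?"}],
    "phrases": [
        {"offsetMilliseconds": 80, "durationMilliseconds": 920,
         "text": "Hello there.", "confidence": 0.93},
        {"offsetMilliseconds": 1200, "durationMilliseconds": 800,
         "text": "How are you?", "confidence": 0.91},
    ],
}

transcript, segments = summarize_transcription(sample)
```

Converting milliseconds to seconds up front tends to make the segments easier to feed into subtitle or search tooling later.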
Supported audio formats are WAV, MP3, and FLAC. Files must be under 300 MB.
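A cheap pre-flight check can catch unsupported files before you spend a request on them. The sketch below is one way to do it, assuming that "300 MB" means 300 × 1024 × 1024 bytes (the source doesn't say which definition applies) and that the file extension is a good enough proxy for the format. The `check_audio` helper is hypothetical.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac"}
# Assumption: "300 MB" is read as 300 MiB; adjust if the service counts differently.
MAX_BYTES = 300 * 1024 * 1024


def check_audio(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes >= MAX_BYTES:
        problems.append("file is not under 300 MB")
    return problems
```

In real use you would pass `os.path.getsize(path)` as `size_bytes`.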
MAI-Transcribe 1.5 vs. MAI-Transcribe 1: What’s New?
The jump from v1 to v1.5 is significant. Here’s the breakdown:
Language Support: 24 → 42 Languages
Version 1 supported 24 languages. Version 1.5 adds 18 new languages, bringing the total to 42. The new additions include Indic languages (Assamese, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu), Eastern European languages (Bulgarian, Slovak, Slovenian, Ukrainian), and others like Catalan, Greek, Estonian, and Lithuanian.
Here’s the full language comparison:
| Language | MAI-Transcribe 1 | MAI-Transcribe 1.5 |
|---|---|---|
| Arabic | ✅ | ✅ |
| Assamese | ❌ | ✅ (new) |
| Bulgarian | ❌ | ✅ (new) |
| Bengali | ❌ | ✅ (new) |
| Catalan | ❌ | ✅ (new) |
| Czech | ✅ | ✅ |
| Danish | ✅ | ✅ |
| German | ✅ | ✅ |
| Greek | ❌ | ✅ (new) |
| English | ✅ | ✅ |
| Spanish | ✅ | ✅ |
| Estonian | ❌ | ✅ (new) |
| Finnish | ✅ | ✅ |
| French | ✅ | ✅ |
| Gujarati | ❌ | ✅ (new) |
| Hindi | ✅ | ✅ |
| Hungarian | ✅ | ✅ |
| Indonesian | ✅ | ✅ |
| Italian | ✅ | ✅ |
| Japanese | ✅ | ✅ |
| Kannada | ❌ | ✅ (new) |
| Korean | ✅ | ✅ |
| Lithuanian | ❌ | ✅ (new) |
| Malayalam | ❌ | ✅ (new) |
| Marathi | ❌ | ✅ (new) |
| Norwegian Bokmål | ✅ | ✅ |
| Dutch | ✅ | ✅ |
| Odia | ❌ | ✅ (new) |
| Punjabi (Gurmukhi) | ❌ | ✅ (new) |
| Polish | ✅ | ✅ |
| Portuguese | ✅ | ✅ |
| Romanian | ✅ | ✅ |
| Russian | ✅ | ✅ |
| Slovak | ❌ | ✅ (new) |
| Slovenian | ❌ | ✅ (new) |
| Swedish | ✅ | ✅ |
| Tamil | ❌ | ✅ (new) |
| Telugu | ❌ | ✅ (new) |
| Thai | ✅ | ✅ |
| Turkish | ✅ | ✅ |
| Ukrainian | ❌ | ✅ (new) |
| Vietnamese | ✅ | ✅ |
New Features: Phrase Lists and Transcription Style
Two major features landed in v1.5 that weren’t available in v1:
- Phrase Lists (Entity Biasing): You can now pass a list of domain-specific terms - company names, product codes, technical jargon - and the model will bias its recognition toward those phrases. This is huge for enterprise use cases where accuracy on proper nouns and industry terms is non-negotiable.
"phraseList": {
"phrases": ["Contoso", "Jessie", "Rehaan"]
}
- Transcription Style Control: v1.5 introduces a transcribeStyle parameter that lets you toggle between two output modes:
- Default (display): Returns a readability-optimized transcript with punctuation and capitalization, cleaned of filler words.
- Verbatim: Preserves the original spoken content, including filler words (“um,” “uh”) and disfluencies.
"enhancedMode": {
"enabled": true,
"model":"mai-transcribe-1.5",
"transcribeStyle":"verbatim"
}
These two additions alone make v1.5 significantly more practical for real-world applications.
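If you build the definition payload in code, both options can be toggled from one small helper. This is a sketch under one assumption worth flagging: I've placed phraseList as a top-level sibling of enhancedMode, which is how the snippets above read, but the source never shows the two combined in a single payload. The `build_definition` helper is hypothetical.

```python
import json


def build_definition(phrases=None, verbatim=False, model="mai-transcribe-1.5"):
    """Serialize the 'definition' form field for a transcription request."""
    enhanced = {"enabled": True, "model": model}
    if verbatim:
        # Omitting transcribeStyle leaves the default display style in effect.
        enhanced["transcribeStyle"] = "verbatim"
    definition = {"enhancedMode": enhanced}
    if phrases:
        # Assumption: phraseList sits next to enhancedMode, not inside it.
        definition["phraseList"] = {"phrases": list(phrases)}
    return json.dumps(definition)


payload = build_definition(["Contoso", "Jessie", "Rehaan"], verbatim=True)
```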
Feature Comparison: MAI-Transcribe vs. Default Fast Transcription vs. LLM Speech
Microsoft now offers three distinct transcription modes through the same API endpoint. Here’s how they stack up:
| Feature | Fast Transcription (Default) | LLM Speech (Enhanced) | MAI-Transcribe 1.5 |
|---|---|---|---|
| Model type | Traditional speech models | Multimodal LLM | Multimodal LLM |
| Transcription | ✅ | ✅ | ✅ |
| Translation | ❌ | ✅ | ❌ |
| Speaker diarization | ✅ | ✅ | ❌ |
| Channel separation (stereo) | ✅ | ✅ | ❌ |
| Profanity filtering | ✅ | ✅ | ✅ |
| Specify locale | ✅ | ✅ | ✅ |
| Custom prompting | ❌ | ✅ | ❌ |
| Phrase lists | ✅ | ❌¹ | ✅ |
| Segment-level timestamps | ✅ | ✅ | ✅ |
| Word-level timestamps | ✅ | ✅ | ❌ |
¹ LLM Speech uses prompting instead of explicit phrase lists.
The takeaway: MAI-Transcribe 1.5 occupies a middle ground. It gives you the multimodal model architecture of LLM Speech (for higher baseline accuracy) but strips away advanced features like diarization and translation in exchange for a more focused, potentially faster transcription pipeline. It’s the model you choose when you want the best raw transcription quality without the overhead of prompting or speaker separation.
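If you'd rather encode that table than memorize it, a small lookup does the job. The sketch below copies the ✅ cells from the comparison above into Python sets; the mode keys and feature names are labels I made up for illustration, not API identifiers.

```python
# Each set lists the features marked with a check in the comparison table.
FEATURES = {
    "fast": {
        "transcription", "diarization", "channel_separation",
        "profanity_filtering", "locale", "phrase_lists",
        "segment_timestamps", "word_timestamps",
    },
    "llm_speech": {
        "transcription", "translation", "diarization", "channel_separation",
        "profanity_filtering", "locale", "custom_prompting",
        "segment_timestamps", "word_timestamps",
    },
    "mai-transcribe-1.5": {
        "transcription", "profanity_filtering", "locale",
        "phrase_lists", "segment_timestamps",
    },
}


def modes_supporting(*required):
    """Return the modes whose feature set covers every required feature."""
    needed = set(required)
    return [mode for mode, feats in FEATURES.items() if needed <= feats]
```

For example, `modes_supporting("diarization", "phrase_lists")` leaves only the default fast transcription mode, and asking for translation plus phrase lists returns nothing at all.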
Accuracy: What We Know
Because MAI-Transcribe 1.5 is still in public preview, Microsoft hasn’t published formal benchmark results or Word Error Rate (WER) comparisons against Whisper, the default fast transcription model, or third-party alternatives.
However, the official documentation describes the model as “optimized for both high accuracy and high efficiency”. The fact that it’s a multimodal model - processing audio holistically rather than through pipelined components - suggests an architecture that can capture context better than traditional ASR systems. The inclusion of phrase lists for domain adaptation and verbatim mode for disfluency preservation also indicates the team has thought carefully about enterprise accuracy requirements.
In practice, confidence scores are returned with each transcribed phrase - typically in the 0.90–0.95 range based on sample outputs in Microsoft’s documentation. These are comparable to what you’d expect from other production-grade speech recognition systems, though without independent benchmarks, treat that as a directional signal rather than a hard metric.
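One practical use of those scores is routing shaky segments to a human reviewer. Here is a minimal sketch; the 0.90 threshold is an arbitrary starting point I picked for illustration, not a Microsoft recommendation, and the `flag_for_review` helper is hypothetical.

```python
def flag_for_review(phrases, threshold=0.90):
    """Return the phrases whose confidence falls below the threshold."""
    # Phrases without a confidence value are treated as needing review.
    return [p for p in phrases if p.get("confidence", 0.0) < threshold]


# Hypothetical phrases that mirror the documented field names.
phrases = [
    {"text": "Welcome to Contoso.", "confidence": 0.95},
    {"text": "Please hold for Rehaan.", "confidence": 0.82},
    {"text": "Thank you."},
]

needs_review = flag_for_review(phrases)
```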
Pricing: What It Costs
MAI-Transcribe pricing appears as its own line item on the Azure Speech pricing page under the “Speech Model Prices” section, listed simply as “MAI-transcribe” at a per-hour rate. It shares the same pricing SKU as LLM Speech and fast transcription - meaning you’re not paying a premium for the MAI model over Microsoft’s other transcription offerings.
The pricing is per audio hour, billed in one-second increments. There’s also a free tier (F0) that gives you 5 audio hours per month for real-time transcription, though fast transcription and LLM Speech endpoints are pay-as-you-go only.
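Per-second billing makes cost estimates a one-liner. The sketch below assumes partial seconds round up to the next whole second, which the pricing description doesn't spell out, and the rate in the example is a placeholder, not a real Azure price. Look up the current rate on the pricing page before trusting any number it produces.

```python
import math
from decimal import Decimal


def estimate_cost(duration_seconds, rate_per_hour):
    """Estimate the charge for one file billed per second at an hourly rate."""
    # Assumption: a partial second is billed as a full second.
    billed_seconds = math.ceil(duration_seconds)
    return Decimal(billed_seconds) * Decimal(str(rate_per_hour)) / Decimal(3600)


# Placeholder rate for illustration only.
HYPOTHETICAL_RATE = "0.36"
cost = estimate_cost(89.2, HYPOTHETICAL_RATE)
```

Decimal keeps the arithmetic exact, which matters once you start summing thousands of short files.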
For high-volume users, Microsoft offers commitment tiers starting at 2,000 hours per month with discounted rates and overage pricing. If you’re processing more than a few hundred hours of audio monthly, it’s worth running the numbers through the Azure pricing calculator.
Integration Options: How to Use MAI-Transcribe 1.5
You’ve got several paths to integrate this model into your stack:
1. REST API (Direct HTTP Calls)
The simplest approach. Send a multipart/form-data POST request to the transcriptions:transcribe endpoint. Works with any HTTP client - curl, Python’s requests, Node’s fetch, you name it. No SDK required.
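To show there's nothing magic in the request, here is a standard-library Python sketch that assembles the same multipart POST by hand. The URL path, form field names, and header come from the curl example earlier in this article; the `build_transcribe_request` helper, the octet-stream content type on the audio part, and the example endpoint are my own assumptions.

```python
import json
import urllib.request
import uuid

API_PATH = "/speechtotext/transcriptions:transcribe?api-version=2025-10-15"


def build_transcribe_request(endpoint, key, audio_bytes, filename,
                             model="mai-transcribe-1.5"):
    """Build (but do not send) the multipart POST for the transcribe endpoint."""
    boundary = uuid.uuid4().hex
    definition = json.dumps({"enhancedMode": {"enabled": True, "model": model}})
    crlf = b"\r\n"
    body = b"".join([
        # Part 1: the audio file.
        f"--{boundary}".encode(), crlf,
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"'.encode(), crlf,
        b"Content-Type: application/octet-stream", crlf, crlf,
        audio_bytes, crlf,
        # Part 2: the JSON definition.
        f"--{boundary}".encode(), crlf,
        b'Content-Disposition: form-data; name="definition"', crlf, crlf,
        definition.encode(), crlf,
        # Closing boundary.
        f"--{boundary}--".encode(), crlf,
    ])
    return urllib.request.Request(
        endpoint.rstrip("/") + API_PATH,
        data=body,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder endpoint and key; nothing is sent until urlopen is called.
req = build_transcribe_request(
    "https://example.cognitiveservices.azure.com/", "YOUR_KEY", b"RIFF", "a.wav"
)
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

In practice a library like requests handles the multipart encoding for you; the hand-rolled version is only here to make the wire format visible.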
2. Python SDK
Microsoft provides an azure-ai-transcription Python package (available on PyPI) that wraps the REST calls in a clean, idiomatic interface. Use the EnhancedModeProperties class to specify the model.
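import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient
from azure.ai.transcription.models import EnhancedModeProperties, TranscriptionOptions, TranscriptionContent

# Endpoint and key come from environment variables (the variable names are placeholders).
client = TranscriptionClient(
    endpoint=os.environ["AZURE_SPEECH_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_SPEECH_KEY"]),
)
# Select MAI-Transcribe 1.5 through the enhanced-mode options.
enhanced_mode = EnhancedModeProperties(task="transcribe", model="mai-transcribe-1.5")
options = TranscriptionOptions(enhanced_mode=enhanced_mode)
with open("YourAudioFile.wav", "rb") as audio_file:
    request_content = TranscriptionContent(definition=options, audio=audio_file)
    result = client.transcribe(request_content)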
3. .NET SDK
The Azure.AI.Speech.Transcription NuGet package supports MAI-Transcribe through the EnhancedModeProperties class. Set Model = "mai-transcribe-1.5" in your options.
4. JavaScript/TypeScript SDK
The @azure/ai-speech-transcription npm package provides the same functionality for Node.js environments.
5. Java SDK
The azure-ai-speech-transcription Maven package is available for JVM-based applications.
6. Voice Live API (Real-Time Voice Agents)
MAI-Transcribe 1.5 can be plugged into Microsoft’s Voice Live API as the input audio transcription engine for real-time voice agents. Set the model field in the input_audio_transcription session configuration. This means you can use MAI-Transcribe as the speech recognition layer for conversational AI applications like customer service bots and voice assistants.
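As a rough idea of what that configuration could look like, the sketch below builds a session update message in Python. Only the input_audio_transcription block and its model field come from the description above; the surrounding "session.update" envelope is my assumption about the event shape, so check it against the Voice Live reference before relying on it. The `build_session_update` helper is hypothetical.

```python
import json


def build_session_update(model="mai-transcribe-1.5"):
    """Serialize a session configuration that selects the transcription model."""
    message = {
        # Assumption: session settings are sent as a "session.update" event.
        "type": "session.update",
        "session": {
            "input_audio_transcription": {"model": model},
        },
    }
    return json.dumps(message)


session_update = build_session_update()
```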
7. Microsoft Foundry Portal (No-Code)
You can test MAI-Transcribe directly in the Microsoft Foundry portal (the new Azure AI Foundry experience) without writing a single line of code. Navigate to Build → Models → Azure Speech - Speech to text, select LLM speech from the dropdown, and specify mai-transcribe-1.5 as the model.
Region Availability
MAI-Transcribe 1.5 is available in four Azure regions as of June 2026:
- East US (eastus)
- North Europe (northeurope)
- Southeast Asia (southeastasia)
- West US (westus)
If your application needs to run in a specific geography for compliance or latency reasons, check whether one of these four regions works for you before building a dependency on this model. More regions will likely be added as the model moves toward general availability.
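If you provision resources from scripts, a guard like the one below can fail fast on an unsupported region. It's a minimal sketch that hard-codes the four regions listed above, so the set needs updating whenever Microsoft adds more; the `is_supported_region` helper is hypothetical.

```python
# Regions listed for MAI-Transcribe 1.5 at the time of writing.
SUPPORTED_REGIONS = {"eastus", "northeurope", "southeastasia", "westus"}


def is_supported_region(region):
    """Accept either the region ID ('eastus') or the display name ('East US')."""
    normalized = region.strip().lower().replace(" ", "")
    return normalized in SUPPORTED_REGIONS
```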
Limitations You Should Know About
MAI-Transcribe 1.5 isn’t a Swiss Army knife. Here’s what it can’t do:
- No speaker diarization. If you need “who said what” attribution in multi-speaker audio, you’ll need to use the default fast transcription or LLM Speech modes instead.
- No channel separation. Stereo audio with separate channels for different speakers won’t be processed independently.
- No translation. The model transcribes in the source language only. For speech translation, use the LLM Speech mode with task: "translate".
- No custom prompting. Unlike the full LLM Speech mode, you can’t guide the model with natural language instructions about output formatting or domain context (though phrase lists partially compensate for this).
- No word-level timestamps. You get segment-level timing but not per-word offsets.
- Public preview limitations. No SLA, not recommended for production workloads, and features may change before GA.
Best Use Cases: When Should You Use MAI-Transcribe 1.5?
Based on its feature set and limitations, here’s where MAI-Transcribe 1.5 shines:
1. High-Volume Audio/Video Transcription
If you’re transcribing podcasts, meeting recordings, webinars, or training videos at scale, MAI-Transcribe 1.5’s synchronous, faster-than-real-time processing and broad language support make it an excellent fit. The phrase list feature means you can tune it for names and terminology specific to your content.
2. Call Center Analytics (Single-Channel)
For post-call analysis where you’re processing agent-side or customer-side audio separately (not joint stereo), the model’s accuracy and profanity filtering give you clean transcripts ready for downstream NLP. Note: the lack of diarization means you can’t automatically split speaker turns from a mono recording.
3. Multilingual Content Platforms
With 42 supported languages, MAI-Transcribe 1.5 is one of the most broadly multilingual transcription models available through Azure. If your platform serves content in Tamil, Telugu, Bengali, or other Indic languages newly supported in v1.5 (Hindi was already covered in v1), this is a compelling option.
4. Voice Agent Speech Recognition
When paired with the Voice Live API, MAI-Transcribe 1.5 handles the speech-to-text leg of real-time voice conversations. This is ideal for customer service bots, in-car assistants, and interactive voice response systems.
5. Domain-Specific Transcription (Legal, Medical, Technical)
The phrase list feature is a game-changer for vertical applications. Legal firms can bias toward case law citations. Medical practices can ensure drug names and conditions are captured accurately. Engineering teams can handle product codes and technical acronyms.
When NOT to Use It
Skip MAI-Transcribe 1.5 and use the full LLM Speech mode instead if you need any of the following:
- Multi-speaker diarization
- Speech translation
- Word-level timestamps
- Custom LLM-style prompting for output formatting
How MAI-Transcribe 1.5 Fits into Microsoft’s AI Strategy
Stepping back, MAI-Transcribe 1.5 is part of a broader pattern at Microsoft. The company is systematically building in-house alternatives to the third-party models it hosts on Azure. Whisper (OpenAI’s speech model) is available through Azure OpenAI Service and Azure AI Speech. But with MAI-Transcribe, Microsoft now has a first-party option that it controls end-to-end - from training data to architecture to deployment.
This matters for a few reasons:
- Cost control. First-party models typically have better margins, which can translate to competitive pricing.
- Data residency. Microsoft can ensure MAI-Transcribe runs entirely within Azure’s regional boundaries, which is critical for regulated industries.
- Integration depth. A first-party model can be more tightly integrated with other Azure services (Voice Live, Foundry, Copilot stack) than a third-party API.
The MAI Superintelligence team is clearly moving fast. MAI-Transcribe 1.5 went from not existing to supporting 42 languages with phrase lists and verbatim mode in what appears to be a matter of months. If the trajectory holds, we’ll likely see diarization, translation, and word-level timestamps in a future version.
Getting Started: A 60-Second Quickstart
- Create an Azure Speech resource in one of the four supported regions (eastus, northeurope, southeastasia, or westus).
- Grab your resource key from the Azure portal.
- Open a terminal and run:
curl --location 'https://YOUR_REGION.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Ocp-Apim-Subscription-Key: YOUR_KEY' \
--form 'audio=@"your-audio.mp3"' \
--form 'definition={"enhancedMode":{"enabled":true,"model":"mai-transcribe-1.5"}}'
- Read the JSON response. The text field inside the combinedPhrases array contains your transcript.
That’s genuinely it. No model deployment, no infrastructure provisioning, no GPU management.
The Bottom Line
MAI-Transcribe 1.5 is a serious speech recognition model that’s still in preview but already shipping features that make it practical for real workloads. The 42-language support, phrase list customization, and verbatim transcription mode put it ahead of many alternatives in the Azure ecosystem, and the fact that it’s built by Microsoft’s in-house Superintelligence team suggests aggressive iteration will continue.
If you’re already on Azure and need transcription that’s faster than real-time, supports your target languages, and doesn’t require diarization or translation, MAI-Transcribe 1.5 is worth testing right now. The pricing is competitive, the API is dead simple, and the feature gap from the “full” LLM Speech mode is narrowing fast.
Sources
- Microsoft Learn - MAI-Transcribe in LLM Speech API (Accessed June 2026)
- Microsoft Learn - Fast Transcription API Overview (Accessed June 2026)
- Microsoft Learn - MAI-Transcribe Documentation (Updated May 2026)
- Microsoft Learn - Transcriptions - Transcribe REST API Reference (API Version 2025-10-15)
- Azure - Speech Services Pricing (Accessed June 2026)
- Microsoft Learn - LLM Speech API Quickstart (Accessed June 2026)
- Microsoft Learn - Voice Live API Overview (Accessed June 2026)
- Microsoft Learn - Speech Service Regions (Accessed June 2026)
- Microsoft Learn - Language and Voice Support for Azure Speech (Updated December 2025)
- Microsoft Learn - Speech to Text Overview (Updated February 2026)