What Is MAI-Transcribe 1.5? Microsoft AI Transcription

AIUnpacker Editorial

AIUnpacker

Jun 5, 2026 · Updated Jun 5, 2026 · 13 min read

Key Takeaways

Microsoft's MAI-Transcribe 1.5 is their latest AI transcription model on Azure. Here's what it is, how it works, and why it matters for developers.

If you’ve been keeping tabs on Microsoft’s AI releases lately, you’ve probably heard whispers about MAI-Transcribe 1.5. It’s Microsoft’s newest speech-to-text model, and it’s making waves for all the right reasons. But what exactly is it? Who built it? How does it compare to the older version? And more importantly - should you, as a developer or enterprise decision-maker, actually care?

I’ve spent the last week digging through every piece of official documentation, pricing page, and API reference Microsoft has published on this model. Here’s everything you need to know, no fluff.

What Is MAI-Transcribe 1.5? The Short Version

MAI-Transcribe 1.5 is a multimodal speech recognition model built by the Microsoft AI (MAI) Superintelligence team. It’s the second-generation version of their in-house transcription model, replacing the original mai-transcribe-1. The model lives inside Azure’s Speech service, accessed through the LLM Speech API - the same endpoint that powers Microsoft’s other fast transcription offerings.

Think of it as Microsoft’s own homegrown answer to OpenAI’s Whisper. It’s designed for two things: high accuracy and high efficiency. It transcribes audio files synchronously - you send an audio file, you get text back - and it does it faster than real-time playback.

Right now, MAI-Transcribe 1.5 is in public preview. That means it’s available to use, but it’s not yet covered by Microsoft’s standard SLA and shouldn’t be treated as production-ready. Still, its feature set is unusually mature for a preview offering.

Who Built It? The MAI Superintelligence Team

This is where things get interesting. MAI-Transcribe wasn’t built by the Azure Cognitive Services team that traditionally handled speech recognition at Microsoft. It comes from the Microsoft AI (MAI) Superintelligence team - a relatively new research group within Microsoft that’s been pumping out foundation models at a rapid clip.

You may have heard of other MAI-branded models: MAI-Code-1-Flash, MAI-1, and various other “MAI” prefixed releases that have appeared on Microsoft’s model catalog over the past year. The “MAI” prefix signals models developed by Microsoft’s in-house AI research division rather than licensed from partners like OpenAI. The Superintelligence team’s mandate, based on their public output, appears to be building cost-efficient, high-performance models that compete with both open-source alternatives and proprietary offerings.

MAI-Transcribe 1.5 is their second crack at speech recognition. The first version (mai-transcribe-1) launched earlier and supported 24 languages. Version 1.5 raises that to 42.

How MAI-Transcribe 1.5 Works (The Technical Picture)

Under the hood, MAI-Transcribe 1.5 is a multimodal model, meaning it processes audio input directly through neural architectures designed to handle both acoustic and linguistic patterns simultaneously. This is different from traditional speech-to-text pipelines that use separate acoustic models, language models, and decoders stitched together.

The model operates through the LLM Speech API, which is Microsoft’s umbrella endpoint for large language model-enhanced speech services. You interact with it via a straightforward REST call:

curl --location 'https://YourResourceName.cognitiveservices.azure.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
 "enhancedMode": {
 "enabled": true,
 "model":"mai-transcribe-1.5"
 }
}'

The API version 2025-10-15 is the current standard. You set enhancedMode.enabled to true and specify mai-transcribe-1.5 as the model name. That’s it. The endpoint handles authentication via either an API key or Entra ID (formerly Azure AD) token.
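If you would rather stay in Python than shell out to curl, the same request can be sketched with nothing but the standard library. Treat this as a rough sketch rather than Microsoft’s client: the helper names default_definition and build_request are hypothetical, and the multipart body is assembled by hand so the request can be inspected before anything is sent.

```python
import json
import uuid
import urllib.request

API_VERSION = "2025-10-15"


def default_definition(model: str = "mai-transcribe-1.5") -> str:
    """Return the JSON `definition` form field used in the curl example."""
    return json.dumps({"enhancedMode": {"enabled": True, "model": model}})


def build_request(endpoint: str, key: str, audio: bytes, filename: str) -> urllib.request.Request:
    """Assemble (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    url = (
        f"{endpoint.rstrip('/')}/speechtotext/transcriptions:transcribe"
        f"?api-version={API_VERSION}"
    )

    def part(name: str, payload: bytes, extra: str = "") -> bytes:
        # One form field: boundary line, disposition header, blank line, payload.
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"{extra}\r\n\r\n'
        return head.encode() + payload + b"\r\n"

    body = (
        part("audio", audio, f'; filename="{filename}"\r\nContent-Type: application/octet-stream')
        + part("definition", default_definition().encode())
        + f"--{boundary}--\r\n".encode()
    )
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Ocp-Apim-Subscription-Key": key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen would send it. Building the request separately from sending it keeps the payload easy to log and unit-test.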

The model returns results in the same JSON structure as other fast transcription endpoints: a combinedPhrases array with the full transcript and a phrases array with segment-level breakdowns including offsetMilliseconds, durationMilliseconds, text, and confidence scores.
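To make that response shape concrete, here is a small Python sketch that pulls the transcript and the segment timings out of a parsed response. The field names come from the description above; the sample dictionary is made up for illustration and is not real service output, and the helper names are hypothetical.

```python
from typing import Any


def extract_transcript(response: dict[str, Any]) -> str:
    """Join the text of every entry in combinedPhrases into one transcript."""
    return " ".join(p["text"] for p in response.get("combinedPhrases", []))


def extract_segments(response: dict[str, Any]) -> list[tuple[float, float, str, float]]:
    """Return (start_seconds, end_seconds, text, confidence) for each phrase."""
    segments = []
    for phrase in response.get("phrases", []):
        start = phrase["offsetMilliseconds"] / 1000
        end = start + phrase["durationMilliseconds"] / 1000
        segments.append((start, end, phrase["text"], phrase["confidence"]))
    return segments


# Made-up response for illustration only.
sample = {
    "combinedPhrases": [{"text": "Hello Contoso."}],
    "phrases": [
        {
            "offsetMilliseconds": 500,
            "durationMilliseconds": 1500,
            "text": "Hello Contoso.",
            "confidence": 0.93,
        }
    ],
}

print(extract_transcript(sample))
print(extract_segments(sample))
```

Converting offsets to seconds up front makes the segments easy to turn into subtitle cues or to line up against a media player.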

Supported audio formats are WAV, MP3, and FLAC. Files must be under 300 MB.
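A cheap pre-flight check saves a wasted upload. The sketch below checks only the two constraints just mentioned. It assumes “300 MB” means 300 × 1024 × 1024 bytes, which the documentation summary above does not spell out, and the function name check_audio is hypothetical.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac"}
# Assumption: "300 MB" is read as 300 MiB; adjust if the service counts differently.
MAX_BYTES = 300 * 1024 * 1024


def check_audio(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes >= MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

For a real file, pass Path(path).stat().st_size as the size. Note that the extension check is only a heuristic and does not inspect the actual audio encoding.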

MAI-Transcribe 1.5 vs. MAI-Transcribe 1: What’s New?

The jump from v1 to v1.5 is significant. Here’s the breakdown:

Language Support: 24 → 42 Languages

Version 1 supported 24 languages. Version 1.5 adds 18 new languages, bringing the total to 42. The new additions include Indic languages (Assamese, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu), Eastern European languages (Bulgarian, Slovak, Slovenian, Ukrainian), and others like Catalan, Greek, Estonian, and Lithuanian.

Here’s the full language comparison:

Language	MAI-Transcribe 1	MAI-Transcribe 1.5
Arabic	✅	✅
Assamese	❌	✅ (new)
Bulgarian	❌	✅ (new)
Bengali	❌	✅ (new)
Catalan	❌	✅ (new)
Czech	✅	✅
Danish	✅	✅
German	✅	✅
Greek	❌	✅ (new)
English	✅	✅
Spanish	✅	✅
Estonian	❌	✅ (new)
Finnish	✅	✅
French	✅	✅
Gujarati	❌	✅ (new)
Hindi	✅	✅
Hungarian	✅	✅
Indonesian	✅	✅
Italian	✅	✅
Japanese	✅	✅
Kannada	❌	✅ (new)
Korean	✅	✅
Lithuanian	❌	✅ (new)
Malayalam	❌	✅ (new)
Marathi	❌	✅ (new)
Norwegian Bokmål	✅	✅
Dutch	✅	✅
Odia	❌	✅ (new)
Punjabi (Gurmukhi)	❌	✅ (new)
Polish	✅	✅
Portuguese	✅	✅
Romanian	✅	✅
Russian	✅	✅
Slovak	❌	✅ (new)
Slovenian	❌	✅ (new)
Swedish	✅	✅
Tamil	❌	✅ (new)
Telugu	❌	✅ (new)
Thai	✅	✅
Turkish	✅	✅
Ukrainian	❌	✅ (new)
Vietnamese	✅	✅

New Features: Phrase Lists and Transcription Style

Two major features landed in v1.5 that weren’t available in v1:

Phrase Lists (Entity Biasing): You can now pass a list of domain-specific terms - company names, product codes, technical jargon - and the model will bias its recognition toward those phrases. This is huge for enterprise use cases where accuracy on proper nouns and industry terms is non-negotiable.

"phraseList": {
"phrases": ["Contoso", "Jessie", "Rehaan"]
}

Transcription Style Control: v1.5 introduces a transcribeStyle parameter that lets you toggle between two output modes:

Default (display): Returns a readability-optimized transcript with punctuation and capitalization, cleaned of filler words.
Verbatim: Preserves the original spoken content, including filler words (“um,” “uh”) and disfluencies.

"enhancedMode": {
"enabled": true,
"model":"mai-transcribe-1.5",
"transcribeStyle":"verbatim"
}

These two additions alone make v1.5 significantly more practical for real-world applications.
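Both options can travel in the same definition payload. Here is a hedged Python sketch that builds it. The snippets above do not show where phraseList sits relative to enhancedMode, so placing it at the top level of the definition is an assumption, and build_definition is a hypothetical helper name.

```python
import json
from typing import Any, Optional


def build_definition(phrases: Optional[list[str]] = None, verbatim: bool = False) -> str:
    """Build the JSON `definition` field for a MAI-Transcribe 1.5 request."""
    enhanced: dict[str, Any] = {"enabled": True, "model": "mai-transcribe-1.5"}
    if verbatim:
        # Omitting transcribeStyle leaves the default, readability-optimized output.
        enhanced["transcribeStyle"] = "verbatim"

    definition: dict[str, Any] = {"enhancedMode": enhanced}
    if phrases:
        # Assumption: phraseList sits beside enhancedMode at the top level.
        definition["phraseList"] = {"phrases": list(phrases)}
    return json.dumps(definition, indent=2)


print(build_definition(["Contoso", "Jessie", "Rehaan"], verbatim=True))
```

Leaving both arguments at their defaults produces the same minimal payload as the earlier curl example.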

Feature Comparison: MAI-Transcribe vs. Default Fast Transcription vs. LLM Speech

Microsoft now offers three distinct transcription modes through the same API endpoint. Here’s how they stack up:

Feature	Fast Transcription (Default)	LLM Speech (Enhanced)	MAI-Transcribe 1.5
Model type	Traditional speech models	Multimodal LLM	Multimodal LLM
Transcription	✅	✅	✅
Translation	❌	✅	❌
Speaker diarization	✅	✅	❌
Channel separation (stereo)	✅	✅	❌
Profanity filtering	✅	✅	✅
Specify locale	✅	✅	✅
Custom prompting	❌	✅	❌
Phrase lists	✅	❌¹	✅
Segment-level timestamps	✅	✅	✅
Word-level timestamps	✅	✅	❌

¹ LLM Speech uses prompting instead of explicit phrase lists.

The takeaway: MAI-Transcribe 1.5 occupies a middle ground. It gives you the multimodal model architecture of LLM Speech (for higher baseline accuracy) but strips away advanced features like diarization and translation in exchange for a more focused, potentially faster transcription pipeline. It’s the model you choose when you want the best raw transcription quality without the overhead of prompting or speaker separation.
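If you are choosing a mode programmatically, the comparison table can be encoded as data. This is only a sketch of the table as printed above: the feature keys and mode labels are my own shorthand, not API values.

```python
# Feature sets transcribed from the comparison table; keys are informal labels.
FEATURES = {
    "fast": {
        "transcription", "diarization", "channel_separation", "profanity_filter",
        "locale", "phrase_lists", "segment_timestamps", "word_timestamps",
    },
    "llm_speech": {
        "transcription", "translation", "diarization", "channel_separation",
        "profanity_filter", "locale", "custom_prompting",
        "segment_timestamps", "word_timestamps",
    },
    "mai-transcribe-1.5": {
        "transcription", "profanity_filter", "locale",
        "phrase_lists", "segment_timestamps",
    },
}


def modes_supporting(required: set[str]) -> list[str]:
    """Return every mode whose feature set covers all required features."""
    return [mode for mode, feats in FEATURES.items() if required <= feats]


print(modes_supporting({"phrase_lists"}))
print(modes_supporting({"diarization", "word_timestamps"}))
```

An empty result, for example when asking for both phrase lists and translation, means no single mode in the table covers the request and you would need two passes.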

Accuracy: What We Know

Because MAI-Transcribe 1.5 is still in public preview, Microsoft hasn’t published formal benchmark results or Word Error Rate (WER) comparisons against Whisper, the default fast transcription model, or third-party alternatives.

However, the official documentation describes the model as “optimized for both high accuracy and high efficiency”. The fact that it’s a multimodal model - processing audio holistically rather than through pipelined components - suggests an architecture that can capture context better than traditional ASR systems. The inclusion of phrase lists for domain adaptation and verbatim mode for disfluency preservation also indicates the team has thought carefully about enterprise accuracy requirements.

In practice, confidence scores are returned with each transcribed phrase - typically in the 0.90–0.95 range based on sample outputs in Microsoft’s documentation. These are comparable to what you’d expect from other production-grade speech recognition systems, though without independent benchmarks, treat that as a directional signal rather than a hard metric.
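One practical way to use those scores is to route shaky segments to a human reviewer. The sketch below assumes phrases shaped like the response described earlier. The 0.85 threshold is a hypothetical starting point, not a value Microsoft recommends.

```python
from typing import Any

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off; tune it against your own audio


def needs_review(
    phrases: list[dict[str, Any]], threshold: float = REVIEW_THRESHOLD
) -> list[dict[str, Any]]:
    """Return the phrases whose confidence falls below the threshold."""
    return [p for p in phrases if p.get("confidence", 0.0) < threshold]


def mean_confidence(phrases: list[dict[str, Any]]) -> float:
    """Average confidence, weighted by each segment's duration."""
    total_ms = sum(p["durationMilliseconds"] for p in phrases)
    if total_ms == 0:
        return 0.0
    weighted = sum(p["confidence"] * p["durationMilliseconds"] for p in phrases)
    return weighted / total_ms
```

Weighting by duration keeps a one-word segment from counting as much as a thirty-second one when you summarize a whole file.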

Pricing: What It Costs

MAI-Transcribe pricing appears on the Azure Speech pricing page under the “Speech Model Prices” section, listed simply as “MAI-transcribe” at a per-hour rate. Although it has its own row, it is priced the same as LLM Speech and fast transcription, so you’re not paying a premium for the MAI model over Microsoft’s other transcription offerings.

The pricing is per audio hour, billed in one-second increments. There’s also a free tier (F0) that gives you 5 audio hours per month for real-time transcription, though fast transcription and LLM Speech endpoints are pay-as-you-go only.

For high-volume users, Microsoft offers commitment tiers starting at 2,000 hours per month with discounted rates and overage pricing. If you’re processing more than a few hundred hours of audio monthly, it’s worth running the numbers through the Azure pricing calculator.
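To run rough numbers before opening the calculator, per-second billing works out as in the sketch below. Two assumptions are baked in: each file is rounded up to the next whole second, and the rate is whatever you read off the pricing page. The 1.00 used in the example is a placeholder, not a real price.

```python
import math
from collections.abc import Iterable


def billable_hours(duration_seconds: float) -> float:
    """Convert one file's duration to billable hours, in one-second increments."""
    # Assumption: partial seconds are rounded up.
    return math.ceil(duration_seconds) / 3600


def estimate_cost(durations_seconds: Iterable[float], rate_per_hour: float) -> float:
    """Estimate pay-as-you-go cost for a batch of files at a given hourly rate."""
    return sum(billable_hours(d) for d in durations_seconds) * rate_per_hour


# Two 30-minute recordings at a placeholder rate of 1.00 per audio hour.
print(estimate_cost([1800, 1800], rate_per_hour=1.00))
```

This ignores commitment-tier discounts and overage pricing, so treat the result as an upper-level estimate for pay-as-you-go usage only.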

Integration Options: How to Use MAI-Transcribe 1.5

You’ve got several paths to integrate this model into your stack:

1. REST API (Direct HTTP Calls)

The simplest approach. Send a multipart/form-data POST request to the transcriptions:transcribe endpoint. Works with any HTTP client - curl, Python’s requests, Node’s fetch, you name it. No SDK required.

2. Python SDK

Microsoft provides an azure-ai-transcription Python package (available on PyPI) that wraps the REST calls in a clean, idiomatic interface. Use the EnhancedModeProperties class to specify the model.

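import os

from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient
from azure.ai.transcription.models import EnhancedModeProperties, TranscriptionOptions, TranscriptionContent

# Endpoint and key come from your Speech resource; these environment variable names are placeholders.
client = TranscriptionClient(
    endpoint=os.environ["AZURE_SPEECH_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_SPEECH_API_KEY"]),
)

# Select MAI-Transcribe 1.5 through the enhanced mode options.
enhanced_mode = EnhancedModeProperties(task="transcribe", model="mai-transcribe-1.5")
options = TranscriptionOptions(enhanced_mode=enhanced_mode)

# Open the audio in binary mode and send it for synchronous transcription.
with open("YourAudioFile.wav", "rb") as audio_file:
    request_content = TranscriptionContent(definition=options, audio=audio_file)
    result = client.transcribe(request_content)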

3. .NET SDK

The Azure.AI.Speech.Transcription NuGet package supports MAI-Transcribe through the EnhancedModeProperties class. Set Model = "mai-transcribe-1.5" in your options.

4. JavaScript/TypeScript SDK

The @azure/ai-speech-transcription npm package provides the same functionality for Node.js environments.

5. Java SDK

The azure-ai-speech-transcription Maven package is available for JVM-based applications.

6. Voice Live API (Real-Time Voice Agents)

MAI-Transcribe 1.5 can be plugged into Microsoft’s Voice Live API as the input audio transcription engine for real-time voice agents. Set the model field in the input_audio_transcription session configuration. This means you can use MAI-Transcribe as the speech recognition layer for conversational AI applications like customer service bots and voice assistants.
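As a rough illustration of that setting, the sketch below builds a session configuration message in Python. Only the input_audio_transcription block and its model field come from the description above. The surrounding "session.update" event envelope is an assumption modeled on realtime-style voice APIs, so check the Voice Live reference for the exact shape before relying on it.

```python
import json


def session_update(model: str = "mai-transcribe-1.5") -> str:
    """Build a session configuration message that selects the transcription model."""
    event = {
        # Assumption: the event type and envelope are illustrative, not confirmed.
        "type": "session.update",
        "session": {
            "input_audio_transcription": {"model": model},
        },
    }
    return json.dumps(event)


print(session_update())
```

In a real agent this string would be sent over the session’s connection once it opens, before any audio is streamed.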

7. Microsoft Foundry Portal (No-Code)

You can test MAI-Transcribe directly in the Microsoft Foundry portal (the new Azure AI Foundry experience) without writing a single line of code. Navigate to Build → Models → Azure Speech - Speech to text, select LLM speech from the dropdown, and specify mai-transcribe-1.5 as the model.

Region Availability

MAI-Transcribe 1.5 is available in four Azure regions as of June 2026:

East US (eastus)
North Europe (northeurope)
Southeast Asia (southeastasia)
West US (westus)

If your application needs to run in a specific geography for compliance or latency reasons, check whether one of these four regions works for you before building a dependency on this model. More regions will likely be added as the model moves toward general availability.
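A guard like the following catches an unsupported region at startup instead of at the first failed request. The region list is the one above as of June 2026 and will go stale. The host pattern is the regional one used in the quickstart curl command, and regional_endpoint is a hypothetical helper name.

```python
# Regions listed in this article as of June 2026; update as availability changes.
SUPPORTED_REGIONS = ("eastus", "northeurope", "southeastasia", "westus")


def normalize_region(name: str) -> str:
    """Turn a display name such as 'East US' into its slug form, 'eastus'."""
    return name.strip().lower().replace(" ", "")


def regional_endpoint(region: str) -> str:
    """Return the regional endpoint, or raise if the region is not on the list."""
    slug = normalize_region(region)
    if slug not in SUPPORTED_REGIONS:
        raise ValueError(f"MAI-Transcribe 1.5 is not listed for region {slug!r}")
    return f"https://{slug}.api.cognitive.microsoft.com"


print(regional_endpoint("East US"))
```

Keeping the list in one constant makes it a one-line change when Microsoft adds regions.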

Limitations You Should Know About

MAI-Transcribe 1.5 isn’t a Swiss Army knife. Here’s what it can’t do:

No speaker diarization. If you need “who said what” attribution in multi-speaker audio, you’ll need to use the default fast transcription or LLM Speech modes instead.
No channel separation. Stereo audio with separate channels for different speakers won’t be processed independently.
No translation. The model transcribes in the source language only. For speech translation, use the LLM Speech mode with task: "translate".
No custom prompting. Unlike the full LLM Speech mode, you can’t guide the model with natural language instructions about output formatting or domain context (though phrase lists partially compensate for this).
No word-level timestamps. You get segment-level timing but not per-word offsets.
Public preview limitations. No SLA, not recommended for production workloads, and features may change before GA.

Best Use Cases: When Should You Use MAI-Transcribe 1.5?

Based on its feature set and limitations, here’s where MAI-Transcribe 1.5 shines:

1. High-Volume Audio/Video Transcription

If you’re transcribing podcasts, meeting recordings, webinars, or training videos at scale, MAI-Transcribe 1.5’s synchronous, faster-than-real-time processing and broad language support make it an excellent fit. The phrase list feature means you can tune it for names and terminology specific to your content.

2. Call Center Analytics (Single-Channel)

For post-call analysis where you’re processing agent-side or customer-side audio separately (not joint stereo), the model’s accuracy and profanity filtering give you clean transcripts ready for downstream NLP. Note: the lack of diarization means you can’t automatically split speaker turns from a mono recording.

3. Multilingual Content Platforms

With 42 supported languages, MAI-Transcribe 1.5 is one of the most broadly multilingual transcription models available through Azure. If your platform serves content in Tamil, Telugu, Bengali, or the other Indic languages newly supported in v1.5 (Hindi was already covered by v1), this is a compelling option.

4. Voice Agent Speech Recognition

When paired with the Voice Live API, MAI-Transcribe 1.5 handles the speech-to-text leg of real-time voice conversations. This is ideal for customer service bots, in-car assistants, and interactive voice response systems.

5. Domain-Specific Transcription (Legal, Medical, Technical)

The phrase list feature is a game-changer for vertical applications. Legal firms can bias toward case law citations. Medical practices can ensure drug names and conditions are captured accurately. Engineering teams can handle product codes and technical acronyms.

When NOT to Use It

Skip MAI-Transcribe 1.5 and use the full LLM Speech mode instead if you need any of the following:

Multi-speaker diarization
Speech translation
Word-level timestamps
Custom LLM-style prompting for output formatting

How MAI-Transcribe 1.5 Fits into Microsoft’s AI Strategy

Stepping back, MAI-Transcribe 1.5 is part of a broader pattern at Microsoft. The company is systematically building in-house alternatives to the third-party models it hosts on Azure. Whisper (OpenAI’s speech model) is available through Azure OpenAI Service and Azure AI Speech. But with MAI-Transcribe, Microsoft now has a first-party option that it controls end-to-end - from training data to architecture to deployment.

This matters for a few reasons:

Cost control. First-party models typically have better margins, which can translate to competitive pricing.
Data residency. Microsoft can ensure MAI-Transcribe runs entirely within Azure’s regional boundaries, which is critical for regulated industries.
Integration depth. A first-party model can be more tightly integrated with other Azure services (Voice Live, Foundry, Copilot stack) than a third-party API.

The MAI Superintelligence team is clearly moving fast. The MAI-Transcribe line went from 24 languages in v1 to 42 languages with phrase lists and verbatim mode in v1.5, in what appears to be a matter of months. If the trajectory holds, we’ll likely see diarization, translation, and word-level timestamps in a future version.

Getting Started: A 60-Second Quickstart

Create an Azure Speech resource in one of the four supported regions (eastus, northeurope, southeastasia, or westus).
Grab your resource key from the Azure portal.
Open a terminal and run:

curl --location 'https://YOUR_REGION.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Ocp-Apim-Subscription-Key: YOUR_KEY' \
--form 'audio=@"your-audio.mp3"' \
--form 'definition={"enhancedMode":{"enabled":true,"model":"mai-transcribe-1.5"}}'

Read the JSON response. The text field of the first entry in the combinedPhrases array contains your transcript.

That’s genuinely it. No model deployment, no infrastructure provisioning, no GPU management.

The Bottom Line

MAI-Transcribe 1.5 is a serious speech recognition model that’s still in preview but already shipping features that make it practical for real workloads. The 42-language support, phrase list customization, and verbatim transcription mode put it ahead of many alternatives in the Azure ecosystem, and the fact that it’s built by Microsoft’s in-house Superintelligence team suggests aggressive iteration will continue.

If you’re already on Azure and need transcription that’s faster than real-time, supports your target languages, and doesn’t require diarization or translation, MAI-Transcribe 1.5 is worth testing right now. The pricing is competitive, the API is dead simple, and the feature gap from the “full” LLM Speech mode is narrowing fast.

Sources

Microsoft Learn - MAI-Transcribe in LLM Speech API (Accessed June 2026)
Microsoft Learn - Fast Transcription API Overview (Accessed June 2026)
Microsoft Learn - MAI-Transcribe Documentation (Updated May 2026)
Microsoft Learn - Transcriptions - Transcribe REST API Reference (API Version 2025-10-15)
Azure - Speech Services Pricing (Accessed June 2026)
Microsoft Learn - LLM Speech API Quickstart (Accessed June 2026)
Microsoft Learn - Voice Live API Overview (Accessed June 2026)
Microsoft Learn - Speech Service Regions (Accessed June 2026)
Microsoft Learn - Language and Voice Support for Azure Speech (Updated December 2025)
Microsoft Learn - Speech to Text Overview (Updated February 2026)
