Microsoft MAI-Voice-2 Review: Pricing, Voices, Languages, and Text-to-Speech Features
Microsoft dropped MAI-Voice-2 at Build 2026 on June 2, and it’s easily the most impressive text-to-speech model they’ve ever shipped. I’ve spent the last few days digging through the docs, testing the voices, and comparing pricing. Here’s everything you need to know.
This model is part of the new MAI family - Microsoft’s in-house AI models built by their Superintelligence team. It sits alongside MAI-Transcribe-1.5 (speech-to-text), MAI-Thinking-1 (reasoning), and MAI-Code-1-Flash (coding). Voice-2 is the second generation, and the leap from MAI-Voice-1 is dramatic.
The TL;DR: MAI-Voice-2 delivers speech that’s basically indistinguishable from human recordings. It supports 15+ languages, offers zero-shot voice cloning with consent guardrails, has granular emotional control, and runs on Azure’s existing Speech infrastructure. If you’re already on Azure, this is a no-brainer upgrade. If you’re on another provider, the pricing and quality math might make you switch.
What Is MAI-Voice-2?
MAI-Voice-2 is Microsoft’s flagship text-to-speech model - a neural TTS system that processes input text holistically, understands sentiment and context, and generates speech with appropriate emotion, pacing, and prosody. It’s not just reading words aloud. It’s performing them.
The model was built from scratch by Microsoft AI’s in-house lab. No distillation from third-party models. Clean, commercially licensed training data. That matters if you’re deploying in production and care about data lineage.
Key architecture highlights:
- Multilingual by design. Expands from English-only (MAI-Voice-1) to 15+ languages with the same naturalness as English output.
- Emotion-aware synthesis. The model reads the semantic content of your text and automatically adjusts tone. You can also explicitly control emotion with SSML tags.
- Zero-shot voice prompting. Give it 5–60 seconds of reference audio and it clones the voice - no fine-tuning required. This is gated behind consent verification.
- Code-switching. Handles Hindi-English and Spanish-English mid-sentence switches naturally.
- Long-form stability. Speaker identity stays consistent across hours of content. Audiobooks and podcasts won’t drift.
In Microsoft’s internal blind tests across 2,500 evaluations, MAI-Voice-2 was preferred over MAI-Voice-1 72.1% of the time. In a separate test across 11 languages, listeners could not reliably tell the difference between MAI-Voice-2 and real human speech: 45.5% preferred the AI, 44% preferred the human recording, and 10.5% called it a tie.
That’s the “uncanny valley is dead” metric.
Pricing: What MAI-Voice-2 Actually Costs
MAI-Voice-2 is available through Azure Speech in Foundry Tools (formerly Azure AI Speech). Pricing follows Azure’s existing text-to-speech tiers, billed per character.
Here’s the breakdown:
| Voice Tier | Price (per 1M characters) | Free Tier |
|---|---|---|
| MAI-Voice-2 (Standard/Neural) | $15.00 | 0.5M chars/month |
| MAI-Voice-2 + Voice Prompting (Personal Voice) | $15.00 (synthesis) + profile storage fees | None for synthesis |
Pricing is pay-as-you-go. No upfront commitments. You get billed for every character in the SSML body, including punctuation, spaces, and most markup tags. Chinese, Japanese, and Korean characters count double.
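If you want to sanity-check a bill before sending text, here’s a minimal sketch of that counting rule. The Unicode ranges I use to decide what counts as a Chinese, Japanese, or Korean character are my own assumption, not Microsoft’s published definition, and `billable_chars` is a hypothetical helper name.

```python
# Rough billable-character counter for the per-character billing rule.
# ASSUMPTION: "CJK" is approximated with the Unicode blocks below; the
# exact set Azure double-counts is not spelled out in this article.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul syllables
)


def is_cjk(ch: str) -> bool:
    """Return True if the character falls in one of the assumed CJK blocks."""
    code = ord(ch)
    return any(low <= code <= high for low, high in CJK_RANGES)


def billable_chars(text: str) -> int:
    """Count every character once and every CJK character twice."""
    return sum(2 if is_cjk(ch) else 1 for ch in text)
```

Pass it the full SSML body rather than just the spoken text, since punctuation, spaces, and most markup are billed too.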
For context, here’s what real-world usage costs:
| Use Case | Monthly Characters | Estimated Cost |
|---|---|---|
| Small chatbot (100 requests/day, avg 100 chars) | ~300K | Free tier covers it |
| Medium voice app (1,000 requests/day) | ~3M | ~$45/month |
| Audiobook production (80K-word book) | ~500K | ~$7.50/book |
| Large call center (1M calls/month, avg 200 chars/response) | ~200M | ~$3,000/month |
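The estimates in that table are plain arithmetic on the $15 per 1M characters list price. Here’s a small sketch if you want to plug in your own volumes. `monthly_cost` is a hypothetical helper, and the optional `free_chars` argument assumes the free allowance is simply subtracted each month.

```python
PRICE_PER_MILLION_USD = 15.00  # list price quoted in the pricing table


def monthly_cost(chars_per_month, price_per_million=PRICE_PER_MILLION_USD,
                 free_chars=0):
    """Estimate a monthly bill in USD from a character count.

    free_chars is an ASSUMED flat monthly allowance deducted before billing.
    """
    billable = max(0, chars_per_month - free_chars)
    return billable / 1_000_000 * price_per_million


# Rows from the table: requests/day * chars/request * 30 days.
medium_app_chars = 1_000 * 100 * 30      # 3M characters
call_center_chars = 1_000_000 * 200      # 200M characters

print(monthly_cost(medium_app_chars))
print(monthly_cost(call_center_chars))
print(monthly_cost(300_000, free_chars=500_000))
```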
Azure also offers commitment tiers for high-volume users. If you’re pushing 80M+ characters per month, you get a discounted rate. The 400M tier and 2B tier drop the per-character cost significantly - contact Microsoft sales for exact numbers.
MAI-Voice-2-Flash, announced at Build and “coming soon,” will offer a lower-cost, lower-latency variant optimized for real-time voice agents. Pricing hasn’t been published yet.
Voices & Language Support
MAI-Voice-2 ships with 46 prebuilt voices across 18 locales covering 15 languages. Every voice supports emotional style tags, and most support 15+ discrete emotion states. Here’s the full list:
| Language | Locale | Male Voices | Female Voices |
|---|---|---|---|
| English (US) | en-US | Ethan, Grant, Jasper | Harper, Iris, Olivia |
| English (Australia) | en-AU | - | Lisa |
| German (Germany) | de-DE | Klaus | Mia |
| Spanish (Spain) | es-ES | - | Marta |
| Spanish (Mexico) | es-MX | Alejo | Valeria |
| French (France) | fr-FR | Marc | Soleil |
| Hindi (India) | hi-IN | Arjun, Dhruv | Kavya, Priya |
| Italian (Italy) | it-IT | Luca | Rosa |
| Portuguese (Brazil) | pt-BR | Caio, Pedro, Rafael | Luana |
| Portuguese (Portugal) | pt-PT | Rui | - |
| Korean (Korea) | ko-KR | Junho | Hana |
| Chinese (Simplified) | zh-CN | Bo | Lan, Mei |
| Russian | ru-RU | Lev | Masha |
| Turkish | tr-TR | Aydin | Elif |
| Thai | th-TH | Krit, Nattapong | - |
| Hungarian | hu-HU | Bence, Levente | Lilla, Réka |
| Dutch | nl-NL | Sander | Fleur |
| Romanian | ro-RO | Andrei, Radu | Elena, Ioana |
Languages announced with additional support upcoming: Portuguese (Portugal), Korean, Chinese, Russian, Turkish, Thai, Hungarian, Dutch, and Romanian.
Code-switching works for Hindi-English and Spanish-English pairs. The model switches between languages mid-sentence without losing prosody or speaker identity. If you’re building for markets where people naturally mix languages (India, US-Hispanic communities), this is genuinely useful.
Emotion & Style Control
MAI-Voice-2 supports explicit emotion control via SSML (mstts:express-as). The supported styles vary by voice, but most English voices support:
angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering
You can also set styledegree (0.0–2.0) for intensity. Here’s an example:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-Harper:MAI-Voice-2">
    <mstts:express-as style="excited" styledegree="1.5">
      Welcome to Microsoft Build! MAI Voice 2 is here.
    </mstts:express-as>
  </voice>
</speak>
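If you’re generating SSML like this from code, it’s worth escaping the text and range-checking the intensity before you send it. A minimal sketch follows: `build_express_as_ssml` is a hypothetical helper of mine, not part of the Speech SDK, and it just reproduces the markup shown above.

```python
from xml.sax.saxutils import escape, quoteattr


def build_express_as_ssml(text, voice, style, styledegree=1.0, lang="en-US"):
    """Wrap text in <mstts:express-as> with a style and an intensity."""
    # Range taken from the article (0.0-2.0); check the docs for exact bounds.
    if not 0.0 <= styledegree <= 2.0:
        raise ValueError("styledegree must be between 0.0 and 2.0")
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<mstts:express-as style={quoteattr(style)} '
        f'styledegree={quoteattr(str(styledegree))}>'
        f'{escape(text)}'  # escape &, <, > so user text cannot break the XML
        '</mstts:express-as></voice></speak>'
    )


ssml = build_express_as_ssml(
    "Welcome to Microsoft Build! MAI Voice 2 is here.",
    voice="en-US-Harper:MAI-Voice-2",
    style="excited",
    styledegree=1.5,
)
```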
Voice Prompting (Zero-Shot Cloning)
This is the feature that sets MAI-Voice-2 apart from basic TTS APIs. You upload 5–60 seconds of reference audio, and the model generates speech matching that speaker’s timbre, cadence, and style. No training. No fine-tuning. It just works - across all supported languages.
Catch: It’s gated. You need to apply through Microsoft’s Limited Access Review, submit a consent recording from the voice actor, and get approved. This isn’t ElevenLabs-style instant cloning for anyone with a URL. Microsoft built consent in at the system level: without a verified consent recording, you don’t get a voice profile.
The API flow:
- Apply for access → get approved
- Upload consent audio + reference clip (5–60 seconds, matching the limit above)
- Create a voice profile via Personal Voice APIs
- Synthesize using speakerProfileId in SSML
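The last step of that flow might look something like the sketch below. It rests on two assumptions: that MAI-Voice-2 uses the same `<mstts:ttsembedding speakerProfileId="...">` wrapper as Azure’s existing Personal Voice feature, and that you pass whatever base voice name the docs tell you to. I haven’t confirmed either for this model. `build_voice_prompt_ssml` is a hypothetical helper.

```python
from xml.sax.saxutils import escape, quoteattr


def build_voice_prompt_ssml(text, base_voice, speaker_profile_id, lang="en-US"):
    """Build SSML that synthesizes text with a stored speaker profile.

    ASSUMPTION: the <mstts:ttsembedding> element from Azure Personal Voice
    also applies to MAI-Voice-2 voice prompting.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(base_voice)}>'
        f'<mstts:ttsembedding speakerProfileId={quoteattr(speaker_profile_id)}>'
        f'{escape(text)}'
        '</mstts:ttsembedding></voice></speak>'
    )


# Both arguments below are placeholders, not real identifiers.
ssml = build_voice_prompt_ssml(
    "Thanks for calling. How can I help?",
    base_voice="<base-voice-name-from-docs>",
    speaker_profile_id="<your-speaker-profile-id>",
)
```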
For production applications where you need a consistent brand voice across all content, this is the right approach. For quick experimentation, it’s bureaucratic.
Developer Experience & SSML
MAI-Voice-2 works with Azure’s existing Speech SDK (C#, Python, Java, JavaScript) and REST API. If you’ve used Azure Neural TTS before, the migration is swapping the voice name:
From:
en-US-Ava:DragonHDLatestNeural
To:
en-US-Harper:MAI-Voice-2
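If you have a pile of stored SSML templates, the swap is easy to script. Here’s a rough sketch using plain string replacement. The Ava-to-Harper pairing is just the example above, not an official mapping, and `migrate_voice_names` is a hypothetical helper.

```python
# Old voice name -> new voice name. Only the pair from the example above
# is filled in; extend it with your own choices.
VOICE_MIGRATION = {
    "en-US-Ava:DragonHDLatestNeural": "en-US-Harper:MAI-Voice-2",
}


def migrate_voice_names(ssml: str, mapping: dict = VOICE_MIGRATION) -> str:
    """Rewrite <voice name="..."> attributes according to the mapping."""
    for old, new in mapping.items():
        # Handle both double- and single-quoted attribute values.
        ssml = ssml.replace(f'name="{old}"', f'name="{new}"')
        ssml = ssml.replace(f"name='{old}'", f"name='{new}'")
    return ssml
```

Swapping the name is only half the job: templates that rely on tags MAI-Voice-2 doesn’t support still need a manual pass.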
Basic Python example:
import requests

# Regional TTS endpoint; swap "eastus" for your Speech resource's region.
endpoint = "https://eastus.tts.speech.microsoft.com/cognitiveservices/v1"
headers = {
    "Content-Type": "application/ssml+xml",
    "Ocp-Apim-Subscription-Key": "<your-key>",
    "X-Microsoft-OutputFormat": "audio-24khz-160kbitrate-mono-mp3",
}
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-Harper:MAI-Voice-2">
    <mstts:express-as style="excited">
      Hey, this MAI Voice 2 is actually really good.
    </mstts:express-as>
  </voice>
</speak>"""

# Send the SSML as UTF-8 bytes and fail loudly on any HTTP error.
resp = requests.post(endpoint, headers=headers, data=ssml.encode("utf-8"), timeout=30)
resp.raise_for_status()

# The response body is the encoded audio.
with open("output.mp3", "wb") as f:
    f.write(resp.content)
Output formats: MP3, WAV, PCM, Opus, TrueSilk. Sample rates: 8, 16, 24, 48 kHz.
SSML limitations: MAI-Voice-2 supports a subset of SSML tags. It supports <voice>, <lang xml:lang>, <phoneme>, <lexicon>, <say-as>, <sub>, <break>, <p>, <s>, and <mstts:express-as>. It does NOT support <prosody>, <emphasis>, <audio>, <bookmark>, or <mstts:silence>. If your workflow relies on fine-grained prosody control, you’ll want to use Azure’s DragonHD voices instead.
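Before migrating existing templates, you can scan them for the unsupported tags listed above. This is a minimal sketch with Python’s standard XML parser: it matches on local tag names only, so it ignores which namespace a tag lives in, and `find_unsupported_tags` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Local tag name -> display name, taken from the unsupported list above.
UNSUPPORTED_TAGS = {
    "prosody": "<prosody>",
    "emphasis": "<emphasis>",
    "audio": "<audio>",
    "bookmark": "<bookmark>",
    "silence": "<mstts:silence>",
}


def find_unsupported_tags(ssml: str) -> list:
    """Return the sorted display names of unsupported tags found in the SSML."""
    root = ET.fromstring(ssml)
    found = set()
    for element in root.iter():
        # ElementTree reports tags as "{namespace}local"; keep the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in UNSUPPORTED_TAGS:
            found.add(UNSUPPORTED_TAGS[local_name])
    return sorted(found)
```

An empty list means the template only uses tags this check doesn’t flag. It doesn’t prove the service will accept it.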
Audio Quality Benchmarks
Microsoft published rigorous human evaluation data for MAI-Voice-2:
- Side-by-side preference: MAI-Voice-2 won 72.1% of 2,500 blind comparisons against MAI-Voice-1.
- Human parity test: Across 11 languages and 2,222 responses, listeners could not reliably distinguish MAI-Voice-2 from real human speech. 45.5% preferred the AI output, 44% preferred real recordings, 10.5% tied.
- Speaker similarity: In speaker identity evaluations, MAI-Voice-2 output was indistinguishable from genuine recordings of the same speaker.
These are published figures from Microsoft. Independent third-party benchmarks don’t exist yet (the model is 3 days old as of this writing). But the audio samples on the MAI blog are genuinely impressive - particularly the Hindi-English code-switching clip and the German emotional samples.
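For a rough gut-check on the human parity numbers, you can ask whether a 45.5% vs. 44% split is distinguishable from a coin flip. The sketch below is my own back-of-the-envelope check, not Microsoft’s methodology. It assumes the percentages are shares of the 2,222 responses, treats responses as independent, drops ties, and uses a normal approximation.

```python
import math


def preference_z_score(n_responses, pct_ai, pct_human):
    """z-score of the AI share among non-tied responses against a 50/50 split."""
    decided = n_responses * (pct_ai + pct_human) / 100  # responses that picked a side
    share_ai = pct_ai / (pct_ai + pct_human)            # AI share among those
    standard_error = math.sqrt(0.25 / decided)          # SE of a proportion at p = 0.5
    return (share_ai - 0.5) / standard_error


z = preference_z_score(2222, 45.5, 44.0)
print(f"z = {z:.2f}")  # |z| below 1.96 means no significant preference at the 5% level
```

Under those assumptions z comes out at roughly 0.75, well short of 1.96, which is consistent with the claim that listeners showed no reliable preference.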
Comparison: MAI-Voice-2 vs. Competitors
Here’s how MAI-Voice-2 stacks up against the major TTS providers in June 2026:
| Feature | MAI-Voice-2 | OpenAI TTS | ElevenLabs | Google Cloud TTS | Amazon Polly |
|---|---|---|---|---|---|
| Pricing (per 1M chars) | $15.00 (Neural) | $15.00 (tts-1) / $30.00 (tts-1-hd) | ~$20-$36/hour (credit-based) | $16.00 (Neural2) / $30.00 (Chirp 3 HD) | $16.00 (Neural) / $30.00 (Generative) |
| Languages | 15+ | 57 (via Whisper) | 32 | 40+ | 30+ |
| Prebuilt Voices | 46 across 18 locales | 13 (GPT-4o-mini-tts) | Hundreds (community + professional) | 220+ | 60+ |
| Voice Cloning | Zero-shot (5-60s, gated) | Custom voices (gated, enterprise only) | Instant (free tier) & Professional | Instant custom voice (Chirp 3) | Not available |
| Emotion Control | SSML tags (18+ emotions) | Natural language instructions | Voice settings + prompt | SSML (limited) | SSML (limited) |
| Code-Switching | Hindi-English, Spanish-English | Not supported | Some multilingual models | Not supported | Not supported |
| SSML Support | Subset (no prosody/emphasis) | Not applicable | Not applicable | Full | Full |
| Streaming | Real-time (<300ms) via SDK | Real-time via API | Real-time via WebSocket API | Real-time via gRPC | Real-time via SDK |
| Audio Quality (max) | 48 kHz | 24 kHz | 44.1 kHz | 24 kHz | 24 kHz |
| Multi-Speaker | Yes (single synthesis flow) | No | Yes (via projects) | No | No |
| Free Tier | 0.5M chars/month | None | 10K credits (~10 min) | 1M chars/month (Neural2) | 1M chars/month (Neural) |
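To turn the pricing and free-tier rows into a number for your own volume, here’s a quick sketch. It uses only the per-character prices from the table, skips ElevenLabs because its credit-based pricing doesn’t convert cleanly, and assumes each free allowance is a flat monthly deduction. Free tiers often carry time limits or other conditions, so treat the output as a rough estimate.

```python
# (USD per 1M characters, free characters per month), copied from the table.
PROVIDERS = {
    "MAI-Voice-2 (Neural)": (15.00, 500_000),
    "OpenAI tts-1": (15.00, 0),
    "Google Cloud TTS (Neural2)": (16.00, 1_000_000),
    "Amazon Polly (Neural)": (16.00, 1_000_000),
}


def compare_costs(chars_per_month):
    """Return {provider: estimated monthly USD}, cheapest first."""
    costs = {}
    for name, (price_per_million, free_chars) in PROVIDERS.items():
        billable = max(0, chars_per_month - free_chars)
        costs[name] = round(billable / 1_000_000 * price_per_million, 2)
    return dict(sorted(costs.items(), key=lambda item: item[1]))


for provider, cost in compare_costs(3_000_000).items():
    print(f"{provider}: ${cost:,.2f}")
```

These figures net out the free allowances, so the MAI-Voice-2 number lands below the list-price estimates in the usage table earlier in this review.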
MAI-Voice-2 vs. ElevenLabs
ElevenLabs still has the edge in raw voice variety and community ecosystem. Their instant voice cloning doesn’t require an access application, so if you need a voice right now, ElevenLabs is faster.
But MAI-Voice-2 wins on:
- Enterprise trust. Microsoft’s consent verification system means you can deploy branded voices without legal exposure. ElevenLabs has faced scrutiny over unauthorized cloning.
- Azure integration. If your stack is already on Azure, adding MAI-Voice-2 is a few API calls. No new vendor relationship.
- Code-switching. ElevenLabs doesn’t offer natural mid-sentence language switching.
- Long-form stability. MAI-Voice-2 is explicitly designed to maintain persona consistency across hours of content.
MAI-Voice-2 vs. OpenAI TTS
OpenAI’s gpt-4o-mini-tts model is excellent. It supports 13 voices, natural language emotion instructions, and 57 languages. The quality is on par with MAI-Voice-2 for English.
MAI-Voice-2 advantages:
- More prebuilt voices per language. OpenAI has 13 voices total; MAI-Voice-2 has up to 6 per locale.
- Fine-grained emotion control. SSML tags give you 18 discrete emotions. OpenAI uses free-text instructions, which are flexible but less predictable.
- Multi-speaker synthesis. MAI-Voice-2 can generate multi-person dialogue in a single API call. OpenAI can’t.
- Enterprise data guarantees. MAI models are trained on clean, licensed data. OpenAI’s data provenance has been a point of contention.
MAI-Voice-2 vs. Google Cloud TTS
Google’s Chirp 3 HD voices are competitive on quality, though at $30/M chars they cost twice MAI-Voice-2’s $15/M. Google also has the edge on language count (40+ vs 15+). Where MAI-Voice-2 pulls ahead is expressive depth: each voice supports 15+ emotion tags, while Google’s SSML emotion support is more limited.
MAI-Voice-2 advantages:
- Emotion depth. 18 discrete emotions per voice vs. Google’s 4–5.
- Voice prompting. Google’s instant custom voice (Chirp 3) is comparable, but MAI-Voice-2’s consent infrastructure is more enterprise-ready.
- Multi-speaker and code-switching. Unique features Google doesn’t match.
MAI-Voice-2 vs. Amazon Polly
Polly is the budget option: Standard voices cost $4/M chars and Neural voices $16/M chars. But Polly’s voice quality hasn’t kept pace with the latest generation of TTS. If cost per character is your only metric, Polly’s Standard tier wins. If you care about how your brand sounds, MAI-Voice-2 is in a different league.
Real-World Use Cases
1. AI Voice Agents
MAI-Voice-2 is purpose-built for conversational AI. The emotional range, natural prosody, and real-time streaming make it a strong back-end for customer support bots, virtual assistants, and voice-enabled Copilot experiences. Microsoft is integrating it directly into Dynamics 365 Contact Center.
2. Audiobooks & Podcasts
Long-form stability means a single voice maintains consistent character across an 8-hour audiobook. Multi-speaker support lets you generate full-cast productions from a single SSML document. Combine that with zero-shot voice cloning and you can produce an audiobook in the author’s own voice (with consent).
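A full-cast script might be assembled something like this. I’m assuming multi-speaker synthesis is expressed as several `<voice>` elements inside one `<speak>` document, which is how Azure SSML normally switches voices, and that the other prebuilt voices follow the same naming pattern as `en-US-Harper:MAI-Voice-2`. Neither is confirmed for this model. `build_dialogue_ssml` is a hypothetical helper.

```python
from xml.sax.saxutils import escape, quoteattr


def build_dialogue_ssml(lines, lang="en-US"):
    """Build one SSML document from (voice_name, text) pairs in speaking order."""
    turns = "".join(
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        for voice, text in lines
    )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(lang)}>{turns}</speak>'
    )


script = [
    ("en-US-Harper:MAI-Voice-2", "Did you finish the chapter?"),
    # ASSUMED name: Ethan is in the voice table, but this exact ID is a guess.
    ("en-US-Ethan:MAI-Voice-2", "Almost. Two pages to go."),
]
ssml = build_dialogue_ssml(script)
```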
3. Content Localization
15 languages, same voice persona, emotional expressiveness preserved across all of them. For companies localizing video content, e-learning courses, or marketing materials, this cuts the number of tools and voice actors needed.
4. Accessibility
Screen readers and accessibility tools live and die by voice quality. A voice that doesn’t cause listening fatigue after 30 minutes is a genuine accessibility win. MAI-Voice-2’s natural prosody and emotional variation make extended listening dramatically more comfortable.
5. Gaming & Interactive Media
Prebuilt character voices with explicit emotion control let game developers prototype dialogue before hiring voice actors. Zero-shot cloning lets indie developers create consistent character voices from limited reference audio.
What’s Missing (Limitations)
No product review is honest without the downsides:
- SSML limitations. If your workflow depends on <prosody> for fine pitch/rate control or <emphasis> for word-level stress, MAI-Voice-2 won’t work for you. Use Azure’s DragonHD voices instead.
- Language count. 15 languages is decent. But Google supports 40+, ElevenLabs supports 32. If you need TTS in Finnish, Greek, or Vietnamese, you’re waiting for future updates.
- Gated voice cloning. The consent requirement is good for ethics and bad for speed. Applying, getting approved, and uploading consent audio adds days or weeks to your workflow.
- Public preview. MAI-Voice-2 is currently in public preview. No SLA. Not recommended for production workloads per Microsoft’s own docs. The model will likely go GA within months, but plan accordingly.
- Latency tuning. The model prioritizes naturalness over latency. For ultra-low-latency voice agents (sub-100ms), you’ll want MAI-Voice-2-Flash when it ships, or stick with Azure’s standard Neural voices.
Should You Use MAI-Voice-2?
Yes, if:
- You’re already on Azure and want the best voice quality available
- You need multilingual TTS with deep emotional control
- You’re building voice agents, audiobooks, or accessibility tools
- Enterprise data lineage and consent verification matter to your legal team
- You need code-switching for Hindi-English or Spanish-English content
Wait, if:
- You need 40+ languages today (go with Google or ElevenLabs)
- You need instant voice cloning without an approval process (use ElevenLabs)
- You need full SSML support including prosody control (use Azure DragonHD)
- You’re cost-optimizing at scale and audio quality isn’t differentiating (use Amazon Polly)
The Bottom Line
MAI-Voice-2 is the best text-to-speech model Microsoft has ever built. The audio quality is genuinely at human parity for supported languages. The emotional range and voice prompting capabilities rival ElevenLabs while offering Azure’s enterprise infrastructure and consent guardrails.
If Microsoft ships MAI-Voice-2-Flash soon with lower latency and cost, and expands to 30+ languages, it’ll be the default recommendation for most TTS use cases.
For now, it’s the obvious choice if you’re on Azure and an extremely compelling reason to switch if you’re not.
Sources
- Introducing MAI-Voice-2 - Microsoft AI Blog (June 2, 2026)
- What is MAI-Voice? - Microsoft Learn (updated June 4, 2026)
- Azure Speech in Foundry Tools Pricing
- Text to Speech Overview - Microsoft Learn
- Building a Hill-Climbing Machine: Launching Seven New MAI Models (June 2, 2026)
- OpenAI Text to Speech API Documentation
- Google Cloud Text-to-Speech Pricing
- Amazon Polly Pricing
- ElevenLabs Pricing
- Language and Voice Support - Microsoft Learn
- Neural Text to Speech HD Voices - Microsoft Learn