Microsoft MAI-Voice-2 Review: Pricing, Voices, Languages, and Text-to-Speech Features

Microsoft's MAI-Voice-2 is the latest text-to-speech model on Azure. I tested its voices, checked the pricing, and compared it against every major TTS player.

AIUnpacker Editorial

June 5, 2026

13 min read

Editorial Disclosure & Affiliate Notice

This content is published for informational and educational purposes only. It is not intended as a substitute for professional, legal, financial, or medical advice. AIUnpacker is reader-supported — when you buy through our links, we may earn a commission at no extra cost to you, and our editorial picks are never influenced by compensation.

  • For educational purposes only. Nothing here should be taken as a guarantee or professional recommendation.
  • AI-assisted editing. Drafts are produced with AI assistance and reviewed by our human editorial team.
  • Opinions are our own. We are not affiliated with the tools we cover unless explicitly stated.
  • Information may be outdated. Verify pricing, features, and policies directly with the vendor.
  • Last reviewed: June 5, 2026.

Read more on our About page, Terms and Editorial Policy.

Microsoft dropped MAI-Voice-2 at Build 2026 on June 2, and it’s easily the most impressive text-to-speech model they’ve ever shipped. I’ve spent the last few days digging through the docs, testing the voices, and comparing pricing. Here’s everything you need to know.

This model is part of the new MAI family - Microsoft’s in-house AI models built by their Superintelligence team. It sits alongside MAI-Transcribe-1.5 (speech-to-text), MAI-Thinking-1 (reasoning), and MAI-Code-1-Flash (coding). Voice-2 is the second generation, and the leap from MAI-Voice-1 is dramatic.

The TL;DR: In Microsoft’s own blind tests, listeners could not reliably tell MAI-Voice-2 apart from human recordings. It supports 15+ languages, offers zero-shot voice cloning with consent guardrails, has granular emotional control, and runs on Azure’s existing Speech infrastructure. If you’re already on Azure, this is a no-brainer upgrade. If you’re on another provider, the pricing and quality math might make you switch.


What Is MAI-Voice-2?

MAI-Voice-2 is Microsoft’s flagship text-to-speech model - a neural TTS system that processes input text holistically, understands sentiment and context, and generates speech with appropriate emotion, pacing, and prosody. It’s not just reading words aloud. It’s performing them.

The model was built from scratch by Microsoft AI’s in-house lab. No distillation from third-party models. Clean, commercially licensed training data. That matters if you’re deploying in production and care about data lineage.

Key architecture highlights:

  • Multilingual by design. Expands from English-only (MAI-Voice-1) to 15+ languages with the same naturalness as English output.
  • Emotion-aware synthesis. The model reads the semantic content of your text and automatically adjusts tone. You can also explicitly control emotion with SSML tags.
  • Zero-shot voice prompting. Give it 5–60 seconds of reference audio and it clones the voice - no fine-tuning required. This is gated behind consent verification.
  • Code-switching. Handles Hindi-English and Spanish-English mid-sentence switches naturally.
  • Long-form stability. Speaker identity stays consistent across hours of content. Audiobooks and podcasts won’t drift.

In Microsoft’s internal blind tests across 2,500 evaluations, MAI-Voice-2 was preferred over MAI-Voice-1 72.1% of the time. In a separate test across 11 languages, listeners could not reliably tell the difference between MAI-Voice-2 and real human speech: 45.5% preferred the AI, 44% preferred the human recording, and 10.5% called it a tie.

That’s the “uncanny valley is dead” metric.


Pricing: What MAI-Voice-2 Actually Costs

MAI-Voice-2 is available through Azure Speech in Foundry Tools (formerly Azure AI Speech). Pricing follows Azure’s existing text-to-speech tiers, billed per character.

Here’s the breakdown:

| Voice Tier | Price (per 1M characters) | Free Tier |
| --- | --- | --- |
| MAI-Voice-2 (Standard/Neural) | $15.00 | 0.5M chars/month |
| MAI-Voice-2 + Voice Prompting (Personal Voice) | $15.00 (synthesis) + profile storage fees | None for synthesis |

Pricing is pay-as-you-go. No upfront commitments. You get billed for every character in the SSML body, including punctuation, spaces, and most markup tags. Chinese, Japanese, and Korean characters count double.
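
To make the billing rule concrete, here is a minimal sketch of a billable-character counter. The Unicode ranges used to detect Chinese, Japanese, and Korean characters are an assumption for illustration, since the exact set Azure counts double isn't stated here, and the function simply counts whatever string you hand it, so pass in only the portion of the SSML you expect to be billed.

```python
# Rough billable-character estimate: every character counts once,
# CJK characters count twice. The ranges below are an assumption
# (common ideograph, kana, and Hangul blocks), not Azure's official list.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch: str) -> bool:
    code = ord(ch)
    return any(start <= code <= end for start, end in CJK_RANGES)


def billable_characters(text: str) -> int:
    """Count characters, including spaces and punctuation, doubling CJK."""
    return sum(2 if is_cjk(ch) else 1 for ch in text)


print(billable_characters("Hello, world!"))  # 13 Latin characters
print(billable_characters("你好"))            # 2 ideographs, counted double
```

Treat the result as an estimate for budgeting, and check your Azure usage metrics for the authoritative count.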

For context, here’s what real-world usage costs:

| Use Case | Monthly Characters | Estimated Cost |
| --- | --- | --- |
| Small chatbot (100 requests/day, avg 100 chars) | ~300K | Free tier covers it |
| Medium voice app (1,000 requests/day) | ~3M | ~$45/month |
| Audiobook production (80K-word book) | ~500K | ~$7.50/book |
| Large call center (1M calls/month, avg 200 chars/response) | ~200M | ~$3,000/month |
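
The estimates above are straightforward arithmetic on the $15 per 1M characters rate. The sketch below reproduces them; the function name and the `apply_free_tier` switch are illustrative, and the table's figures appear to ignore the 0.5M free tier, so that is the default here.

```python
PRICE_PER_MILLION_USD = 15.00   # pay-as-you-go rate quoted above
FREE_TIER_CHARS = 500_000       # 0.5M characters per month


def estimate_monthly_cost(chars: int, apply_free_tier: bool = False) -> float:
    """Return the estimated monthly cost in USD for a character volume."""
    billable = max(chars - FREE_TIER_CHARS, 0) if apply_free_tier else chars
    return billable / 1_000_000 * PRICE_PER_MILLION_USD


# 1,000 requests/day * 100 chars * 30 days = 3M characters
print(estimate_monthly_cost(3_000_000))                        # 45.0
print(estimate_monthly_cost(3_000_000, apply_free_tier=True))  # 37.5
print(estimate_monthly_cost(200_000_000))                      # 3000.0
```

Commitment-tier discounts are not modeled because their rates aren't published in this review.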

Azure also offers commitment tiers for high-volume users. If you’re pushing 80M+ characters per month, you get a discounted rate. The 400M tier and 2B tier drop the per-character cost significantly - contact Microsoft sales for exact numbers.

MAI-Voice-2-Flash, announced at Build and “coming soon,” will offer a lower-cost, lower-latency variant optimized for real-time voice agents. Pricing hasn’t been published yet.


Voices & Language Support

MAI-Voice-2 ships with 46 prebuilt voices across 18 locales covering 15 languages. Every voice supports emotional style tags, and most support 15+ discrete emotion states. Here’s the full list:

| Language | Locale | Male Voices | Female Voices |
| --- | --- | --- | --- |
| English (US) | en-US | Ethan, Grant, Jasper | Harper, Iris, Olivia |
| English (Australia) | en-AU | - | Lisa |
| German (Germany) | de-DE | Klaus | Mia |
| Spanish (Spain) | es-ES | - | Marta |
| Spanish (Mexico) | es-MX | Alejo | Valeria |
| French (France) | fr-FR | Marc | Soleil |
| Hindi (India) | hi-IN | Arjun, Dhruv | Kavya, Priya |
| Italian (Italy) | it-IT | Luca | Rosa |
| Portuguese (Brazil) | pt-BR | Caio, Pedro, Rafael | Luana |
| Portuguese (Portugal) | pt-PT | Rui | - |
| Korean (Korea) | ko-KR | Junho | Hana |
| Chinese (Simplified) | zh-CN | Bo | Lan, Mei |
| Russian | ru-RU | Lev | Masha |
| Turkish | tr-TR | Aydin | Elif |
| Thai | th-TH | Krit, Nattapong | - |
| Hungarian | hu-HU | Bence, Levente | Lilla, Réka |
| Dutch | nl-NL | Sander | Fleur |
| Romanian | ro-RO | Andrei, Radu | Elena, Ioana |

Microsoft has announced that additional support is still to come for the following languages: Portuguese (Portugal), Korean, Chinese, Russian, Turkish, Thai, Hungarian, Dutch, and Romanian.

Code-switching works for Hindi-English and Spanish-English pairs. The model switches between languages mid-sentence without losing prosody or speaker identity. If you’re building for markets where people naturally mix languages (India, US-Hispanic communities), this is genuinely useful.

Emotion & Style Control

MAI-Voice-2 supports explicit emotion control via SSML (mstts:express-as). The supported styles vary by voice, but most English voices support:

angry, confused, determined, disgusted, embarrassed, excited, fearful, happy, hopeful, jealous, joyful, regretful, relieved, sad, shouting, softvoice, surprised, whispering

You can also set styledegree (0.0–2.0) for intensity. Here’s an example:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-Harper:MAI-Voice-2">
    <mstts:express-as style="excited" styledegree="1.5">
      Welcome to Microsoft Build! MAI Voice 2 is here.
    </mstts:express-as>
  </voice>
</speak>
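
If you generate this markup in code, a small builder helps keep text escaped and the intensity within range. This is a sketch only: the function name is illustrative, and the 0.0–2.0 bound is the range quoted above rather than something verified against the service.

```python
from xml.sax.saxutils import escape, quoteattr


def build_express_as_ssml(text: str, voice: str, style: str,
                          styledegree: float = 1.0, lang: str = "en-US") -> str:
    """Build an SSML document with one voice and one express-as style."""
    # Range taken from the review text; treat it as an assumption.
    if not 0.0 <= styledegree <= 2.0:
        raise ValueError("styledegree must be between 0.0 and 2.0")
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<mstts:express-as style={quoteattr(style)} styledegree="{styledegree}">'
        f'{escape(text)}'  # escapes &, <, > so user text cannot break the XML
        '</mstts:express-as></voice></speak>'
    )


ssml = build_express_as_ssml(
    "Welcome to Microsoft Build! Q&A starts at noon.",
    voice="en-US-Harper:MAI-Voice-2",
    style="excited",
    styledegree=1.5,
)
print(ssml)
```

Escaping matters more than it looks: an unescaped ampersand in user-supplied text is enough to make the whole request invalid XML.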

Voice Prompting (Zero-Shot Cloning)

This is the feature that sets MAI-Voice-2 apart from basic TTS APIs. You upload 5–60 seconds of reference audio, and the model generates speech matching that speaker’s timbre, cadence, and style. No training. No fine-tuning. It just works - across all supported languages.

The catch: it’s gated. You need to apply through Microsoft’s Limited Access Review, submit a consent recording from the voice actor, and get approved. This isn’t ElevenLabs-style instant cloning for anyone with a URL. Microsoft built consent in at the system level, so a voice can’t be cloned through the service without a verified consent recording.

The API flow:

  1. Apply for access → get approved
  2. Upload consent audio + reference clip (5–60 seconds)
  3. Create a voice profile via Personal Voice APIs
  4. Synthesize using speakerProfileId in SSML
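
Step 4 can be sketched as follows. Azure’s existing Personal Voice feature passes the profile ID through an `mstts:ttsembedding` element; whether MAI-Voice-2 voice prompting uses the same element, and which base voice name it expects, are assumptions here, so the voice name is left as a parameter and the values shown are placeholders.

```python
from xml.sax.saxutils import escape, quoteattr


def build_personal_voice_ssml(text: str, speaker_profile_id: str,
                              base_voice: str, lang: str = "en-US") -> str:
    """Wrap text in a ttsembedding element that references a voice profile.

    Assumption: MAI-Voice-2 accepts the same element as Azure Personal Voice.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(base_voice)}>'
        f'<mstts:ttsembedding speakerProfileId={quoteattr(speaker_profile_id)}>'
        f'{escape(text)}'
        '</mstts:ttsembedding></voice></speak>'
    )


# Both identifiers below are hypothetical placeholders.
print(build_personal_voice_ssml(
    "Thanks for calling. How can I help?",
    speaker_profile_id="your-speaker-profile-id",
    base_voice="your-base-voice-name",
))
```

Steps 1 to 3 go through the access review and the Personal Voice APIs, so they are not shown; check Microsoft Learn for the current request shapes before wiring this up.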

For production applications where you need a consistent brand voice across all content, this is the right approach. For quick experimentation, it’s bureaucratic.


Developer Experience & SSML

MAI-Voice-2 works with Azure’s existing Speech SDK (C#, Python, Java, JavaScript) and REST API. If you’ve used Azure Neural TTS before, the migration is swapping the voice name:

From:

en-US-Ava:DragonHDLatestNeural

To:

en-US-Harper:MAI-Voice-2
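
If you have a library of stored SSML, the swap can be scripted. The sketch below rewrites only the `name` attribute of `<voice>` elements; the mapping is hypothetical and contains just the single pair from this example, so extend it with whichever replacements you actually choose.

```python
import re

# Hypothetical mapping: old voice name -> MAI-Voice-2 replacement.
VOICE_MAP = {
    "en-US-Ava:DragonHDLatestNeural": "en-US-Harper:MAI-Voice-2",
}

# Matches the name attribute of a <voice> tag with either quote style.
VOICE_NAME = re.compile(r'''(<voice\b[^>]*?\bname=)(["'])(.*?)\2''')


def migrate_voice_names(ssml: str, voice_map: dict = VOICE_MAP) -> str:
    """Replace mapped voice names; leave unknown voices untouched."""
    def swap(match: re.Match) -> str:
        prefix, quote, old_name = match.group(1), match.group(2), match.group(3)
        return f"{prefix}{quote}{voice_map.get(old_name, old_name)}{quote}"

    return VOICE_NAME.sub(swap, ssml)


old = '<voice name="en-US-Ava:DragonHDLatestNeural">Hello there.</voice>'
print(migrate_voice_names(old))
```

Renaming the voice is only part of a migration: any SSML tags MAI-Voice-2 does not support still need to be removed or replaced.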

Basic Python example:

import requests

# Region-specific REST endpoint; replace "eastus" with your resource's region.
endpoint = "https://eastus.tts.speech.microsoft.com/cognitiveservices/v1"
headers = {
    "Content-Type": "application/ssml+xml",
    "Ocp-Apim-Subscription-Key": "<your-key>",
    "X-Microsoft-OutputFormat": "audio-24khz-160kbitrate-mono-mp3",
}

ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-Harper:MAI-Voice-2">
    <mstts:express-as style="excited">
      Hey, this MAI Voice 2 is actually really good.
    </mstts:express-as>
  </voice>
</speak>"""

resp = requests.post(endpoint, headers=headers, data=ssml.encode("utf-8"), timeout=30)
resp.raise_for_status()  # fail on 4xx/5xx instead of saving an error body as audio

# The response body is the encoded audio in the requested output format.
with open("output.mp3", "wb") as f:
    f.write(resp.content)

Output formats: MP3, WAV, PCM, Opus, TrueSilk. Sample rates: 8, 16, 24, 48 kHz.

SSML limitations: MAI-Voice-2 supports a subset of SSML tags. It supports <voice>, <lang xml:lang>, <phoneme>, <lexicon>, <say-as>, <sub>, <break>, <p>, <s>, and <mstts:express-as>. It does NOT support <prosody>, <emphasis>, <audio>, <bookmark>, or <mstts:silence>. If your workflow relies on fine-grained prosody control, you’ll want to use Azure’s DragonHD voices instead.
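
Before migrating existing SSML, it can help to scan it for the unsupported tags listed above. The following is a minimal sketch using the standard library’s XML parser; the function name is illustrative, and the tag list is taken from this review rather than checked against the service.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "http://www.w3.org/2001/mstts"

# Tags this review lists as unsupported by MAI-Voice-2.
UNSUPPORTED = {
    f"{{{SSML_NS}}}prosody",
    f"{{{SSML_NS}}}emphasis",
    f"{{{SSML_NS}}}audio",
    f"{{{SSML_NS}}}bookmark",
    f"{{{MSTTS_NS}}}silence",
}


def find_unsupported_tags(ssml: str) -> list[str]:
    """Return the sorted local names of unsupported tags found in the SSML."""
    root = ET.fromstring(ssml)
    found = {el.tag.split("}")[1] for el in root.iter() if el.tag in UNSUPPORTED}
    return sorted(found)


sample = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">'
    '<voice name="en-US-Harper:MAI-Voice-2">'
    '<prosody rate="+10%">Hello.</prosody><break time="200ms"/>'
    '<mstts:silence type="Leading" value="200ms"/>'
    '</voice></speak>'
)
print(find_unsupported_tags(sample))  # ['prosody', 'silence']
```

Running a check like this in CI catches documents that would otherwise fail or render differently after a voice swap.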


Audio Quality Benchmarks

Microsoft published rigorous human evaluation data for MAI-Voice-2:

  1. Side-by-side preference: MAI-Voice-2 won 72.1% of 2,500 blind comparisons against MAI-Voice-1.
  2. Human parity test: Across 11 languages and 2,222 responses, listeners could not reliably distinguish MAI-Voice-2 from real human speech. 45.5% preferred the AI output, 44% preferred real recordings, 10.5% tied.
  3. Speaker similarity: In speaker identity evaluations, MAI-Voice-2 output was indistinguishable from genuine recordings of the same speaker.

These are published figures from Microsoft. Independent third-party benchmarks don’t exist yet (the model is 3 days old as of this writing). But the audio samples on the MAI blog are genuinely impressive - particularly the Hindi-English code-switching clip and the German emotional samples.
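
A quick back-of-the-envelope check shows why a 45.5% vs. 44% split supports the "could not reliably distinguish" reading. The sketch below runs a normal-approximation sign test on the non-tied responses. It rests on assumptions: the counts are reconstructed from the rounded percentages and the 2,222 total, and responses are treated as independent, which Microsoft's write-up may not guarantee.

```python
import math


def preference_z_score(total: int, pct_ai: float, pct_human: float) -> float:
    """z-score for 'AI preferred' vs. a 50/50 split, ignoring ties."""
    ai = round(total * pct_ai / 100)        # reconstructed count, not published
    human = round(total * pct_human / 100)  # reconstructed count, not published
    decided = ai + human
    p_hat = ai / decided
    std_err = math.sqrt(0.25 / decided)     # standard error under p = 0.5
    return (p_hat - 0.5) / std_err


z = preference_z_score(2222, 45.5, 44.0)
print(f"z = {z:.2f}")  # well below 1.96, the usual 5% two-sided threshold
```

Under these assumptions the z-score comes out around 0.74, so the small lead for the AI output is consistent with chance, which is exactly what a parity claim needs.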


Comparison: MAI-Voice-2 vs. Competitors

Here’s how MAI-Voice-2 stacks up against the major TTS providers in June 2026:

| Feature | MAI-Voice-2 | OpenAI TTS | ElevenLabs | Google Cloud TTS | Amazon Polly |
| --- | --- | --- | --- | --- | --- |
| Pricing (per 1M chars) | $15.00 (Neural) | $15.00 (tts-1) / $30.00 (tts-1-hd) | ~$20-$36/hour (credit-based) | $16.00 (Neural2) / $30.00 (Chirp 3 HD) | $16.00 (Neural) / $30.00 (Generative) |
| Languages | 15+ | 57 (via Whisper) | 32 | 40+ | 30+ |
| Prebuilt Voices | 46 across 18 locales | 13 (GPT-4o-mini-tts) | Hundreds (community + professional) | 220+ | 60+ |
| Voice Cloning | Zero-shot (5-60s, gated) | Custom voices (gated, enterprise only) | Instant (free tier) & Professional | Instant custom voice (Chirp 3) | Not available |
| Emotion Control | SSML tags (18+ emotions) | Natural language instructions | Voice settings + prompt | SSML (limited) | SSML (limited) |
| Code-Switching | Hindi-English, Spanish-English | Not supported | Some multilingual models | Not supported | Not supported |
| SSML Support | Subset (no prosody/emphasis) | Not applicable | Not applicable | Full | Full |
| Streaming | Real-time (<300ms) via SDK | Real-time via API | Real-time via WebSocket API | Real-time via gRPC | Real-time via SDK |
| Audio Quality (max sample rate) | 48 kHz | 24 kHz | 44.1 kHz | 24 kHz | 24 kHz |
| Multi-Speaker | Yes (single synthesis flow) | No | Yes (via projects) | No | No |
| Free Tier | 0.5M chars/month | None | 10K credits (~10 min) | 1M chars/month (Neural2) | 1M chars/month (Neural) |

MAI-Voice-2 vs. ElevenLabs

ElevenLabs still has the edge in raw voice variety and community ecosystem. Their instant voice cloning doesn’t require an access application, so if you need a voice right now, ElevenLabs is faster.

But MAI-Voice-2 wins on:

  • Enterprise trust. Microsoft’s consent verification system means you can deploy branded voices without legal exposure. ElevenLabs has faced scrutiny over unauthorized cloning.
  • Azure integration. If your stack is already on Azure, adding MAI-Voice-2 is a few API calls. No new vendor relationship.
  • Code-switching. ElevenLabs doesn’t offer natural mid-sentence language switching.
  • Long-form stability. MAI-Voice-2 is explicitly designed to maintain persona consistency across hours of content.

MAI-Voice-2 vs. OpenAI TTS

OpenAI’s gpt-4o-mini-tts model is excellent. It supports 13 voices, natural language emotion instructions, and 57 languages. The quality is on par with MAI-Voice-2 for English.

MAI-Voice-2 advantages:

  • More prebuilt voices per language. OpenAI has 13 voices total; MAI-Voice-2 has up to 6 per locale.
  • Fine-grained emotion control. SSML tags give you 18 discrete emotions. OpenAI uses free-text instructions, which are flexible but less predictable.
  • Multi-speaker synthesis. MAI-Voice-2 can generate multi-person dialogue in a single API call. OpenAI can’t.
  • Enterprise data guarantees. MAI models are trained on clean, licensed data. OpenAI’s data provenance has been a point of contention.

MAI-Voice-2 vs. Google Cloud TTS

Google’s Chirp 3 HD voices are competitive on quality, but at $30/M chars they cost twice MAI-Voice-2’s $15/M. Google has the edge on language count (40+ vs 15+), but MAI-Voice-2’s supported languages have deeper expressive capabilities: most voices support 15+ emotion tags, while Google’s SSML emotion support is more limited.

MAI-Voice-2 advantages:

  • Emotion depth. 18 discrete emotions per voice vs. Google’s 4–5.
  • Voice prompting. Google’s instant custom voice (Chirp 3) is comparable, but MAI-Voice-2’s consent infrastructure is more enterprise-ready.
  • Multi-speaker and code-switching. Unique features Google doesn’t match.

MAI-Voice-2 vs. Amazon Polly

Polly is the budget option, with Standard voices at $4/M chars and Neural at $16/M chars. But Polly’s voice quality hasn’t kept pace with the latest generation of TTS. If cost per character is your only metric, Polly’s Standard voices win; its Neural tier actually costs a dollar more per million characters than MAI-Voice-2. If you care about how your brand sounds, MAI-Voice-2 is in a different league.
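
To see how the per-character prices play out at a given volume, here is a small sketch that ranks the character-billed tiers quoted in this review. ElevenLabs is left out because its pricing is credit-based rather than per character, and free tiers are ignored; the dictionary keys are just labels for this example.

```python
# USD per 1M characters, as quoted in this review.
PRICES_PER_MILLION_USD = {
    "MAI-Voice-2": 15.00,
    "OpenAI tts-1": 15.00,
    "OpenAI tts-1-hd": 30.00,
    "Google Neural2": 16.00,
    "Google Chirp 3 HD": 30.00,
    "Amazon Polly Standard": 4.00,
    "Amazon Polly Neural": 16.00,
    "Amazon Polly Generative": 30.00,
}


def monthly_costs(chars: int) -> list[tuple[str, float]]:
    """Return (tier, monthly cost in USD) pairs, cheapest first."""
    costs = {
        name: chars / 1_000_000 * price
        for name, price in PRICES_PER_MILLION_USD.items()
    }
    return sorted(costs.items(), key=lambda item: item[1])


for name, cost in monthly_costs(3_000_000):
    print(f"{name:<26} ${cost:>7.2f}")
```

At 3M characters a month the spread runs from $12 for Polly Standard to $90 for the HD and Generative tiers, with MAI-Voice-2 at $45.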


Real-World Use Cases

1. AI Voice Agents

MAI-Voice-2 is purpose-built for conversational AI. The emotional range, natural prosody, and real-time streaming make it a strong back-end for customer support bots, virtual assistants, and voice-enabled Copilot experiences. Microsoft is integrating it directly into Dynamics 365 Contact Center.

2. Audiobooks & Podcasts

Long-form stability means a single voice maintains consistent character across an 8-hour audiobook. Multi-speaker support lets you generate full-cast productions from a single SSML document. Combine that with zero-shot voice cloning and you can produce an audiobook in the author’s own voice (with consent).
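
A full-cast script can be assembled as one SSML document with a `<voice>` element per line of dialogue, which is how Azure SSML generally handles multiple speakers. Whether MAI-Voice-2’s single-flow multi-speaker synthesis is driven exactly this way is an assumption, and the Ethan voice name below follows the naming pattern of the Harper example rather than a documented identifier.

```python
from xml.sax.saxutils import escape, quoteattr


def build_dialogue_ssml(lines: list[tuple[str, str]], lang: str = "en-US") -> str:
    """Build one SSML document from (voice_name, text) pairs, in order."""
    parts = [
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(lang)}>'
    ]
    for voice, text in lines:
        parts.append(f'  <voice name={quoteattr(voice)}>{escape(text)}</voice>')
    parts.append('</speak>')
    return "\n".join(parts)


script = [
    ("en-US-Harper:MAI-Voice-2", "Did you finish the chapter?"),
    ("en-US-Ethan:MAI-Voice-2", "Almost. Two pages left."),  # assumed voice name
]
print(build_dialogue_ssml(script))
```

For book-length content you would still split the script into multiple requests, since any service limits on request size apply per document.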

3. Content Localization

15 languages, same voice persona, emotional expressiveness preserved across all of them. For companies localizing video content, e-learning courses, or marketing materials, this cuts the number of tools and voice actors needed.

4. Accessibility

Screen readers and accessibility tools live and die by voice quality. A voice that doesn’t cause listening fatigue after 30 minutes is a genuine accessibility win. MAI-Voice-2’s natural prosody and emotional variation make extended listening dramatically more comfortable.

5. Gaming & Interactive Media

Prebuilt character voices with explicit emotion control let game developers prototype dialogue before hiring voice actors. Zero-shot cloning lets indie developers create consistent character voices from limited reference audio.


What’s Missing (Limitations)

No product review is honest without the downsides:

  1. SSML limitations. If your workflow depends on <prosody> for fine pitch/rate control or <emphasis> for word-level stress, MAI-Voice-2 won’t work for you. Use Azure’s DragonHD voices instead.
  2. Language count. 15 languages is decent. But Google supports 40+, ElevenLabs supports 32. If you need TTS in Finnish, Greek, or Vietnamese, you’re waiting for future updates.
  3. Gated voice cloning. The consent requirement is good for ethics and bad for speed. Applying, getting approved, and uploading consent audio adds days or weeks to your workflow.
  4. Public preview. MAI-Voice-2 is currently in public preview. No SLA. Not recommended for production workloads per Microsoft’s own docs. The model will likely go GA within months, but plan accordingly.
  5. Latency tuning. The model prioritizes naturalness over latency. For ultra-low-latency voice agents (sub-100ms), you’ll want MAI-Voice-2-Flash when it ships, or stick with Azure’s standard Neural voices.

Should You Use MAI-Voice-2?

Yes, if:

  • You’re already on Azure and want the best voice quality available
  • You need multilingual TTS with deep emotional control
  • You’re building voice agents, audiobooks, or accessibility tools
  • Enterprise data lineage and consent verification matter to your legal team
  • You need code-switching for Hindi-English or Spanish-English content

Wait, if:

  • You need 40+ languages today (go with Google or ElevenLabs)
  • You need instant voice cloning without an approval process (use ElevenLabs)
  • You need full SSML support including prosody control (use Azure DragonHD)
  • You’re cost-optimizing at scale and audio quality isn’t differentiating (use Amazon Polly)

The Bottom Line

MAI-Voice-2 is the best text-to-speech model Microsoft has ever built. By Microsoft’s own blind tests, the audio quality is at human parity in the 11 languages evaluated. The emotional range and voice prompting capabilities rival ElevenLabs while offering Azure’s enterprise infrastructure and consent guardrails.

If Microsoft ships MAI-Voice-2-Flash soon with lower latency and cost, and expands to 30+ languages, it’ll be the default recommendation for most TTS use cases.

For now, it’s the obvious choice if you’re on Azure and an extremely compelling reason to switch if you’re not.


Sources

  1. Introducing MAI-Voice-2 - Microsoft AI Blog (June 2, 2026)
  2. What is MAI-Voice? - Microsoft Learn (updated June 4, 2026)
  3. Azure Speech in Foundry Tools Pricing
  4. Text to Speech Overview - Microsoft Learn
  5. Building a Hill-Climbing Machine: Launching Seven New MAI Models (June 2, 2026)
  6. OpenAI Text to Speech API Documentation
  7. Google Cloud Text-to-Speech Pricing
  8. Amazon Polly Pricing
  9. ElevenLabs Pricing
  10. Language and Voice Support - Microsoft Learn
  11. Neural Text to Speech HD Voices - Microsoft Learn
