Microsoft MAI-Voice-2 Pricing Guide 2026: Azure TTS Costs &

AIUnpacker Editorial

Jun 5, 2026 · Updated Jun 5, 2026 · 14 min read · 2,993 words

Key Takeaways

Let's talk real numbers: what does Microsoft MAI-Voice-2 actually cost on Azure? I broke down every pricing tier and calculated costs for five real-world scenarios.

Last reviewed: June 5, 2026. Published June 5, 2026.


If you’ve been circling around Microsoft MAI-Voice-2 pricing trying to figure out what this thing actually costs, you’re not alone. Azure’s pricing page buries the real numbers behind JavaScript widgets and tiered dropdowns. I spent a week digging through docs, blog posts, and the Azure pricing calculator so you don’t have to.

Here’s the bottom line upfront: MAI-Voice-2 costs $22 per 1 million characters for pay-as-you-go synthesis. That gets you Microsoft’s latest multilingual TTS model with voice cloning and voice prompting across 15+ languages. Standard neural voices cost less ($16/M chars), custom professional voices cost more ($24/M chars), and the free tier gives you 500,000 characters per month to experiment.

But that’s just the sticker price. The real question is what it costs to run actual workloads. Let’s get into it.

What Is MAI-Voice-2, Actually?

MAI-Voice-2 is Microsoft’s second-generation multilingual text-to-speech model, announced at Build 2026. It’s part of the MAI model family that powers Copilot, Bing, and PowerPoint. Unlike the older standard neural voices, MAI-Voice-2 brings two capabilities that change how you build voice applications:

Identity preservation - also known as voice cloning. You give the model a short reference sample of a specific person’s voice, and it “speaks as” that individual across all supported languages. One cloned voice, 15+ languages, no per-language voice libraries to manage.

Voice prompting - instead of picking from a dropdown menu of voice styles, you feed the model a short audio clip that demonstrates the tone, emotion, accent, and pacing you want. The output matches that reference. Think of it like prompt engineering, but for speech.

These two features make MAI-Voice-2 genuinely different from standard TTS models. You’re not locked into pre-defined personas. You can clone a CEO’s voice for internal comms, match a brand ambassador’s cadence for marketing, or preserve a family member’s voice for accessibility applications.

Microsoft also announced MAI-Voice-2 Flash, a faster, more cost-efficient variant coming soon.

Azure AI Speech TTS Pricing Tiers: The Full Breakdown

Azure’s TTS pricing isn’t one-size-fits-all. Here’s how the tiers actually work:

Free Tier (F0)

What You Get	Details
Neural TTS characters	500,000 characters free per month
Speech-to-text	5 audio hours free per month
Custom Voice endpoint	1 model hosted free per month
Best for	Prototyping, testing, personal projects

The free tier is genuinely useful for experimentation. Half a million characters is roughly 8-10 hours of spoken audio. But there’s a catch: F0 has strict concurrency limits (typically 1 concurrent request), so don’t try to run production workloads on it.
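The 8-10 hour estimate can be sanity-checked with a few lines of Python. This is a rough sketch, not an Azure formula: the speaking rates (150 and 180 words per minute) and the 5.5 characters-per-word average are assumptions, and the function name is made up for illustration.

```python
def chars_to_audio_hours(chars: int, words_per_minute: float = 150,
                         chars_per_word: float = 5.5) -> float:
    """Convert a character budget into approximate hours of spoken audio."""
    chars_per_minute = words_per_minute * chars_per_word
    return chars / chars_per_minute / 60


# Free-tier budget at a relaxed and a brisk speaking rate (both assumed)
for wpm in (150, 180):
    hours = chars_to_audio_hours(500_000, words_per_minute=wpm)
    print(f"{wpm} wpm: about {hours:.1f} hours")
```

Slower delivery stretches the same character budget over more audio, which is why the estimate is a range rather than a single number.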

Pay-As-You-Go (Standard Pricing)

Voice Type	Price per 1M Characters	Best For
MAI-Voice-2	$22.00	Voice cloning, multilingual projects, voice prompting
Standard Neural (non-HD)	$16.00	General TTS, chatbots, accessibility, narrations
Neural HD (DragonHD)	$24.00	Content creation, audiobooks, podcasts, professional output
Dragon HD Omni	$24.00	700+ voices, automatic style prediction, diverse content
Custom Voice - Professional (synthesis)	$24.00	Branded voices, enterprise agents, character voices
Custom Voice - Professional (training)	Priced per compute hour	One-time cost for voice model creation
Custom Voice - Professional (hosting)	Priced per model per hour	Ongoing cost for deployed endpoints
Personal Voice (synthesis)	$22.00	Personal voice cloning (approved use cases only)
Personal Voice (profile storage)	Per 1,000 profiles/month	Ongoing storage for voice profiles

Pricing note: Azure bills per character, not per word or per second. Every letter, number, space, punctuation mark, and SSML tag inside the text body counts. Chinese characters count as two characters each. SSML tags <speak> and <voice> are excluded from billing.
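To see how those counting rules play out on a real request, here is a minimal estimator. It is a sketch under two assumptions: that only the `<speak>` and `<voice>` tags themselves are free (every other tag is billed as written), and that "Chinese characters" means the CJK Unified Ideographs block. Azure's meter may differ in edge cases, so treat the result as an estimate; the function name is hypothetical.

```python
import re

# Assumption: only the <speak> and <voice> tags themselves are free;
# all other markup and all text in the body is billed.
_FREE_TAGS = re.compile(r"</?(?:speak|voice)\b[^>]*>")
# Assumption: double-counted characters are the CJK Unified Ideographs block.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def estimate_billable_chars(ssml: str) -> int:
    """Rough billable-character estimate for one SSML request body."""
    body = _FREE_TAGS.sub("", ssml)
    # Every remaining character counts once; CJK ideographs count once more.
    return len(body) + len(_CJK.findall(body))


request = '<speak><voice name="x">Hi<break time="300ms"/></voice></speak>'
print(estimate_billable_chars(request))  # "Hi" plus the 21-character break tag
```

Note that whitespace and newlines used to indent an SSML document are counted too in this estimate, so compact markup is slightly cheaper than pretty-printed markup.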

Commitment Tiers (for High-Volume Users)

If you’re processing more than 80 million characters per month, commitment tiers give you steep discounts:

Monthly Volume	Price per 1M Characters	Discount vs. Pay-As-You-Go
80M characters	~$12.75	~20% off standard neural
400M characters	~$11.25	~30% off standard neural
2,000M characters	~$10.00	~38% off standard neural

Important: Commitment tiers apply to standard neural (non-HD) voices only. HD voices, MAI-Voice-2, custom voices, and OpenAI voices are not included. You can’t mix and match voice types within a commitment tier.
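Whether a commitment tier pays off depends on how close your real volume is to the committed volume. The sketch below compares the plans using the approximate prices from the tables above. It assumes that a tier bills the full committed volume even if you use less, and that overage is billed at the tier's per-million rate; both are assumptions to verify on the pricing page, and the function names are illustrative.

```python
PAYG_PER_M = 16.00  # standard neural pay-as-you-go price per 1M characters
# (committed volume in millions of characters, approximate price per million)
TIERS = [(80, 12.75), (400, 11.25), (2000, 10.00)]


def monthly_cost_options(million_chars: float) -> dict[str, float]:
    """Estimated monthly bill under each plan for a given monthly volume."""
    options = {"pay-as-you-go": million_chars * PAYG_PER_M}
    for committed, rate in TIERS:
        # Assumption: the whole commitment is billed even when usage is lower,
        # and usage above it is billed at the same per-million rate.
        billed = max(million_chars, committed)
        options[f"{committed}M tier"] = billed * rate
    return options


def cheapest_plan(million_chars: float) -> tuple[str, float]:
    """Name and cost of the cheapest plan for the given volume."""
    options = monthly_cost_options(million_chars)
    name = min(options, key=options.get)
    return name, round(options[name], 2)


for volume in (50, 70, 450):
    print(volume, cheapest_plan(volume))
```

Under these assumptions the 80M tier costs about $1,020 per month regardless of usage, so it starts to beat pay-as-you-go at roughly 64M characters (1,020 / 16), somewhat below the 80M threshold itself.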

Real Cost Examples: 5 Scenarios Calculated

Enough with the pricing tables. Let’s calculate actual costs for real-world use cases.

Scenario 1: Weekly Podcast Production

You publish one 45-minute podcast episode per week. Each episode has two hosts (male and female), intro/outro music handled separately.

Words per episode (average): ~7,000 words for both hosts
Characters per word (English average, including spaces): ~5.5
Characters per episode: 7,000 x 5.5 = 38,500
Monthly characters (4 episodes): 154,000
Voice type: Neural HD (you want professional quality)
Monthly cost: 154,000 / 1,000,000 x $24 = $3.70

Yes, three dollars and seventy cents. For a full month of podcast narration with Neural HD voices. You’ll spend more on your podcast hosting platform.

With MAI-Voice-2 and voice prompting, you could use a reference clip of each host’s natural speaking style to get more authentic delivery. At $22/M chars, the same monthly volume would cost $3.39, slightly less than Neural HD, with far more flexibility over how the voices sound.
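The podcast arithmetic is easy to reproduce for your own script lengths. The helper below is a small sketch; the function name is made up, and the 5.5 characters-per-word average is the same assumption used in the scenario.

```python
CHARS_PER_WORD = 5.5  # English average including spaces (assumed)


def tts_cost(words: int, price_per_million: float,
             chars_per_word: float = CHARS_PER_WORD) -> float:
    """Dollar cost of synthesizing `words` words at a per-1M-character price."""
    chars = words * chars_per_word
    return chars / 1_000_000 * price_per_million


monthly_words = 7_000 * 4  # four episodes of about 7,000 words each
print(f"Neural HD:   ${tts_cost(monthly_words, 24):.2f}")
print(f"MAI-Voice-2: ${tts_cost(monthly_words, 22):.2f}")
```

Swap in your own word count and price to rerun the estimate for a longer show or a different voice tier.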

Scenario 2: Enterprise Customer Service Voice Bot

A mid-size company runs a customer service voice bot handling 50,000 calls per month. Average call duration is 4 minutes, with the bot speaking roughly 40% of the time.

Total bot speaking time per month: 50,000 x 4 x 0.4 = 80,000 minutes
Words spoken per minute (average TTS rate): ~150
Total words per month: 80,000 x 150 = 12,000,000
Characters per month: 12,000,000 x 5.5 = 66,000,000
Voice type: Standard Neural (good enough for CS interactions)
Monthly TTS cost: 66,000,000 / 1,000,000 x $16 = $1,056

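At 66M characters/month, you’re close to the 80M commitment tier threshold. Assuming a commitment bills the full committed volume, the 80M tier costs about $1,020/month (80M x ~$12.75/M) whether you use all of it or not. That already edges out pay-as-you-go at this volume, but the full ~20% per-character saving only arrives once you actually use all 80M.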

If you add custom voice for brand consistency: $1,584/month (66M x $24/M) plus hosting costs.
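The biggest unknown in this scenario is how much of each call the bot actually spends talking. The sketch below reruns the estimate for a few talk shares so you can see how sensitive the bill is to that guess. The function name and default rates are illustrative assumptions, not measured values.

```python
def bot_monthly_cost(calls: int, minutes_per_call: float, talk_share: float,
                     price_per_million: float = 16.0,
                     words_per_minute: float = 150,
                     chars_per_word: float = 5.5) -> float:
    """Estimated monthly TTS bill in dollars for a voice bot."""
    speaking_minutes = calls * minutes_per_call * talk_share
    chars = speaking_minutes * words_per_minute * chars_per_word
    return chars / 1_000_000 * price_per_million


# How the bill moves if the bot talks 30%, 40%, or 50% of each call
for share in (0.3, 0.4, 0.5):
    cost = bot_monthly_cost(50_000, 4, share)
    print(f"bot talks {share:.0%} of the call: ${cost:,.0f}/month")
```

At the 40% share used above this reproduces the $1,056 figure; each additional 10 points of talk time adds roughly $264 per month at standard neural pricing.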

Scenario 3: AI Content Creation Platform

A SaaS platform lets users generate narrated videos. Users produce 200,000 characters of TTS output daily across all accounts.

Daily characters: 200,000
Monthly characters: 6,000,000
Voice type: Neural HD for pro users, Standard Neural for free users
Estimated split: 40% HD ($24/M), 60% Standard ($16/M)
HD monthly cost: 2.4M / 1M x $24 = $57.60
Standard monthly cost: 3.6M / 1M x $16 = $57.60
Total monthly TTS cost: $115.20

That’s $115/month to narrate 6 million characters of content. To put that in perspective, 6 million characters is roughly 13 average-length novels.
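When traffic is split across voice types, it helps to think in terms of a single blended rate. Here is a minimal sketch of that calculation; the function name is hypothetical and the 40/60 split is the estimate from this scenario.

```python
def blended_cost(total_chars: int, mix: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (effective price per 1M chars, monthly cost).

    `mix` is a list of (traffic_share, price_per_million) pairs.
    """
    shares = sum(share for share, _ in mix)
    if abs(shares - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    rate = sum(share * price for share, price in mix)
    return rate, total_chars / 1_000_000 * rate


rate, cost = blended_cost(6_000_000, [(0.4, 24.0), (0.6, 16.0)])
print(f"effective rate ${rate:.2f}/M, monthly cost ${cost:.2f}")
```

With this split the effective rate works out to $19.20 per million characters, so every 10 points of traffic shifted from HD to standard lowers it by $0.80.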

Scenario 4: Enterprise Accessibility Suite

A university deploys TTS across its learning management system. 30,000 students use it for reading course materials, averaging 15,000 characters each per month.

Monthly characters: 30,000 x 15,000 = 450,000,000
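Voice type: Standard Neural on the 400M commitment tier (the 2,000M tier’s $10.00/M rate requires committing to 2,000M characters, far more than this workload uses)
Tier: ~$11.25/M characters (committed volume: 400M)
Monthly cost: 450M / 1M x $11.25 = ~$5,063, assuming the 50M overage is billed at the tier rate

At this scale, the commitment tier saves roughly $2,100/month compared to pay-as-you-go pricing ($7,200). With 30,000 students, that’s about $0.17 per student per month.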

Scenario 5: Multilingual Marketing with MAI-Voice-2

A global brand produces localized video ads in 12 languages using MAI-Voice-2’s identity preservation. They clone their brand ambassador’s voice once and generate TTS in all markets.

Monthly characters across all languages: 30,000,000
Voice type: MAI-Voice-2 ($22/M)
Monthly TTS cost: 30,000,000 / 1,000,000 x $22 = $660

Without voice cloning, they’d need to hire voice actors for each market, record studio sessions, and manage revisions. The TTS approach costs $660/month versus thousands in talent and production fees.

SSML Features: What You Can Actually Control

Speech Synthesis Markup Language (SSML) is the XML-based markup language that lets you fine-tune Azure TTS output. Here’s what matters for practical use:

What You Can Adjust with SSML

SSML Feature	What It Does	Supported on MAI-Voice-2
`<voice>`	Select voice, language, and model variant	Yes
`<lang>`	Switch languages mid-document	Yes
`<break>`	Insert pauses (strength or duration)	Yes
`<prosody>`	Adjust pitch, rate, volume, contour	Partial (model-dependent)
`<emphasis>`	Stress specific words	Partial
`<say-as>`	Interpret dates, numbers, currency	Yes
`<phoneme>`	Specify exact pronunciation	Yes
`<sub>`	Substitute text for pronunciation	Yes
`<lexicon>`	Custom pronunciation dictionaries	Yes (alias only on HD)
`<mstts:express-as>`	Apply speaking styles (emotions, roles)	Partial

SSML Styles on Dragon HD Voices

If you’re using DragonHD voices (not MAI-Voice-2), you get access to 60+ speaking styles:

amazed, amused, angry, annoyed, anxious, appreciative, calm, cautious, concerned, confident, confused, curious, defeated, defensive, defiant, determined, disappointed, disgusted, doubtful, ecstatic, encouraging, excited, fast, fearful, frustrated, happy, hesitant, hurt, impatient, impressed, intrigued, joking, laughing, optimistic, painful, panicked, panting, pleading, proud, quiet, reassuring, reflective, relieved, remorseful, resigned, sad, sarcastic, secretive, serious, shocked, shouting, shy, skeptical, slow, struggling, surprised, suspicious, sympathetic, terrified, upset, urgent, whispering

Plus paralinguistic tags: laughter, coughing, throat_clearing, breathing, sighing, yawning.

SSML Example

Here’s what a practical SSML document looks like for a customer service interaction with a neural voice:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="customerservice">
      <prosody rate="-5%" pitch="+10%">
        Hello! <break time="300ms"/> Thank you for calling.
      </prosody>
    </mstts:express-as>
    <break time="500ms"/>
    <prosody rate="medium" volume="medium">
      Your order number <say-as interpret-as="characters">A4B9X7</say-as>
      will arrive on <say-as interpret-as="date" format="mdy">06/12/2026</say-as>.
    </prosody>
    <break time="300ms"/>
    Is there anything else I can help with?
  </voice>
</speak>
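If the text you synthesize comes from users or a database, characters like `&` and `<` will break the XML unless they are escaped. The helper below is a minimal sketch of building SSML safely with Python's standard library; `build_ssml` and its parameters are hypothetical names for illustration, not part of the Speech SDK.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               lang: str = "en-US", pause_ms: int = 0) -> str:
    """Wrap plain text in a minimal SSML document, escaping XML specials."""
    body = escape(text)  # turns &, <, > into entities
    if pause_ms > 0:
        # Optional trailing pause after the spoken text
        body += f'<break time="{int(pause_ms)}ms"/>'
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


print(build_ssml("Tom & Jerry <3", pause_ms=300))
```

The resulting string can be passed to the SDK's SSML synthesis call in place of a hand-written document. Escaping does lengthen the text slightly (for example `&` becomes `&amp;`), which may matter for billing if entities are counted as written.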

Voice Customization Options

Azure Speech gives you three paths for voice customization:

1. Standard Voices (Pre-Built)

500+ voices across 100+ languages and locales
Available in 24 kHz and 48 kHz
No training required, just pick and use
Best for: General applications, prototyping, any project that doesn’t need a unique voice

2. Professional Voice (Custom Neural Voice)

Train a model on 300+ utterances from a voice talent
Takes 20-40 compute hours to train (capped at 96 hours billing)
Requires voice talent consent recording
Limited access - you must apply and be approved
Best for: Brand voices, enterprise agents, character voices in games/media

3. Personal Voice

Zero-shot cloning from a short voice sample
Free voice creation, charged for synthesis and profile storage
Restricted to pre-approved use cases only (accessibility, personal assistants, etc.)
Best for: Individual voice preservation, personal assistants

MAI-Voice-2 sits in a unique position here. It offers cloning-like capabilities (identity preservation) without requiring the full custom voice application process. You still need responsible use approval, but the technical barrier is lower.

When to Use MAI-Voice-2 vs. DragonHD vs. Standard Neural

Question	Standard Neural	DragonHD	MAI-Voice-2
Cost per 1M chars	$16	$24	$22
Voice count	500+	30+ (DragonHD) / 700+ (Omni)	Model-based (cloning)
Multilingual	Some (dedicated multilingual voices)	Yes	Yes, 15+ languages
Voice cloning	No	No	Yes (identity preservation)
Voice prompting	No	No	Yes (reference audio)
Style control	Varies by voice	60+ styles + paralinguistics	Limited (prompt-based)
SSML prosody	Full support	Limited (no `<prosody>`)	Partial
Batch synthesis	Yes	No (real-time only)	Yes
Deployment options	Cloud, container, embedded	Cloud only	Cloud only
Best for	Chatbots, IVR, accessibility, any TTS	Audiobooks, podcasts, professional narration	Localized content, brand voice, multispeaker

How to Get Started: Step-by-Step

Here’s the quickest path from zero to synthesized speech with MAI-Voice-2:

Step 1: Create an Azure Account and Speech Resource

# Sign up at https://azure.microsoft.com/free
# Create a Speech resource in Azure Portal
# Note your key and region (e.g., eastus)

Step 2: Set Environment Variables

export SPEECH_KEY="your-key-here"
export SPEECH_REGION="eastus"

Step 3: Install the Speech SDK (Python)

pip install azure-cognitiveservices-speech

Step 4: Write Your First TTS Script

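import os
import azure.cognitiveservices.speech as speechsdk

# Read the key and region set in Step 2
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ.get('SPEECH_KEY'),
    region=os.environ.get('SPEECH_REGION')
)

# A DragonHD voice is used here as a working starting point
speech_config.speech_synthesis_voice_name = "en-US-Ava:DragonHDLatestNeural"

# For MAI-Voice-2, swap in its voice name. The name below is a placeholder,
# not a confirmed identifier; check the latest docs for the exact format.
# speech_config.speech_synthesis_voice_name = "en-US-Ava:MAIVoice2LatestNeural"

# Write the audio to a WAV file instead of playing it on the default speaker
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.wav")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

result = synthesizer.speak_text_async("Hello from Azure TTS. This is my voice now.").get()

# Report failures (bad key, wrong region, unknown voice) instead of failing silently
if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
    print(f"Synthesis did not complete: {result.reason}")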

Step 5: Use SSML for More Control

# Reuses the synthesizer created in Step 4.
# A standard neural voice is used here because, per the comparison table,
# DragonHD voices do not support <prosody>.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="0.9" pitch="+5%">
      This is fine-tuned speech with custom pacing and pitch.
    </prosody>
  </voice>
</speak>
"""
synthesizer.speak_ssml_async(ssml).get()

Step 6: Try the Voice Gallery First (No Code)

Browse to speech.microsoft.com/portal/voicegallery to audition 500+ voices without writing a single line of code. You can filter by language, gender, and use case.

Pricing Comparison: Azure vs. Competitors

How does Microsoft MAI-Voice-2 pricing stack up against alternatives?

Provider	Model	Price per 1M Characters	Notes
Microsoft (MAI-Voice-2)	MAI-Voice-2	$22.00	Voice cloning, 15+ languages, voice prompting
Microsoft (Standard Neural)	Neural TTS	$16.00	500+ voices, full SSML support
Microsoft (Neural HD)	DragonHD	$24.00	60+ styles, multi-talker, professional quality
Google Cloud	Chirp 3 HD	$24.00	31 voices, 8 languages, studio quality
Google Cloud	Standard voices	$16.00	Standard quality voices
Google Cloud	Studio voices	$24.00	Premium voices with expression
ElevenLabs	Eleven Multilingual v2	~$0.30 per 1K chars ($300/M chars)	Voice cloning, 29 languages, API-based
ElevenLabs	Eleven Flash v2.5	~$0.015 per 1K chars ($15/M chars)	Fast, lower quality, cloning
OpenAI (via Azure)	TTS-1 / TTS-1 HD	See Azure OpenAI pricing	6 voices, REST API only in Azure OpenAI
Amazon Polly	Neural	$16.00	Standard neural quality
Amazon Polly	Generative	$32.00	LLM-powered, longer-form content

Note: Direct API pricing comparisons are tricky because billing units can differ between providers (characters vs. tokens vs. requests). The ElevenLabs, Google, and Amazon figures above are all per-character prices; the OpenAI row is left unpriced because it does not map cleanly onto a per-character rate.

The key takeaway: MAI-Voice-2 at $22/M chars sits in the mid-premium range. It’s cheaper than ElevenLabs’ multilingual model ($300/M chars) while offering comparable cloning capabilities. It’s more expensive than standard neural TTS but the feature set justifies the premium for multilingual and cloning workflows.
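Because providers quote prices in different units (per 1K vs. per 1M characters), it helps to normalize before comparing. The sketch below does that for a few rows of the table and prices the same 30M-character workload on each. The prices are the approximate figures quoted in this article, not live quotes, and the names are illustrative.

```python
def per_million(price: float, unit_chars: int) -> float:
    """Convert a price quoted per `unit_chars` characters to a per-1M price."""
    return price * 1_000_000 / unit_chars


# (provider/model, quoted price, characters per quoted unit), from the table above
quotes = [
    ("MAI-Voice-2", 22.00, 1_000_000),
    ("Azure Standard Neural", 16.00, 1_000_000),
    ("ElevenLabs Multilingual v2", 0.30, 1_000),
    ("ElevenLabs Flash v2.5", 0.015, 1_000),
]

workload_millions = 30  # monthly characters, in millions
for name, price, unit in sorted(quotes, key=lambda q: per_million(q[1], q[2])):
    rate = per_million(price, unit)
    print(f"{name:28s} ${rate:7.2f}/M  -> ${rate * workload_millions:,.0f}/month")
```

Normalized this way, the gap between the two ElevenLabs models is larger than the gap between any two Azure tiers, so model choice matters more than provider choice at the cheap end.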

Watch Your Bill: Cost Traps to Avoid

After working with Azure TTS for a while, here are the ways people accidentally overspend:

Forgetting SSML tags count as characters. Anything inside your text body that isn’t <speak> or <voice> is billable. A complex SSML document with lots of <phoneme>, <prosody>, and <break> tags can add 10-15% to your character count.
Leaving custom voice endpoints running. Custom Professional Voice endpoints bill per hour whether you’re synthesizing or not. If you train a voice model and deploy it, suspend the endpoint when you’re done testing. A forgotten endpoint running 24/7 adds roughly $100-200/month depending on region.
Chinese/Japanese/Korean text costs double. Each CJK character is counted as two characters for billing purposes, so fully CJK text costs twice as much per character as Latin text. If half the characters in your content are CJK, your billable count, and your bill, rises by about 50%.
Not monitoring the free tier threshold. The free tier gives you 500K neural characters per month. If you accidentally deploy to production and start serving users, you’ll blow through that allocation fast and start paying.
Using Neural HD for everything. DragonHD and Dragon HD Omni cost 50% more than standard neural ($24 vs. $16 per 1M characters). For IVR systems, automated alerts, and basic narration, standard neural sounds just as good to most listeners. Reserve HD for content where audio quality affects the user experience.

The Bottom Line

MAI-Voice-2 is Microsoft’s strongest TTS model yet, and $22 per 1M characters makes it competitive for production workloads. The voice cloning and voice prompting features alone justify the premium over standard neural voices if you’re doing multilingual content or need consistent brand voice across markets.

For most teams, I’d recommend this decision framework:

Prototyping or personal projects: Free tier (500K chars/month) with standard neural voices
Basic TTS (chatbots, IVR, accessibility): Standard Neural at $16/M chars, commit to a tier once you exceed 80M chars/month
Content creation (podcasts, audiobooks, video): DragonHD Omni at $24/M chars for the voice variety and automatic style prediction
Multilingual brand voice or cloning: MAI-Voice-2 at $22/M chars, especially if you need identity preservation across languages
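Enterprise brand voice at scale: Custom Professional Voice at $24/M chars plus training and hosting - expensive upfront but the best long-term value for high-volume branded applications. Commitment tier discounts, as noted above, cover standard neural voices only.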

Start with the free tier, experiment in the Voice Gallery, and only pay for production workloads when you know exactly what you need.

Sources

Microsoft Tech Community - “New MAI models in Microsoft Foundry across text, image, voice, and speech” (June 2, 2026). https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/new-mai-models-in-microsoft-foundry-across-text-image-voice-and-speech/4524632
Azure Speech in Foundry Tools Pricing Page. https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/
Microsoft Learn - “Speech Synthesis Markup Language (SSML) Overview.” https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup
Microsoft Learn - “What are neural text to speech HD voices?” https://learn.microsoft.com/en-us/azure/ai-services/speech-service/high-definition-voices
Microsoft Learn - “Custom voice overview.” https://learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-neural-voice
Microsoft Tech Community - “Azure Speech at Build 2026: Powering Voice Agents with Real-Time and Life-like Experiences” (June 3, 2026). https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/azure-speech-at-build-2026-powering-voice-agents-with-real-time-and-life-like-ex/4524638
Microsoft Learn - “Text to speech overview.” https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech
Microsoft Learn - “Language and Voice Support for Azure Speech.” https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts
