MiniMax M3 Pricing, Features, Video Input, Context Window, and Best Use Cases
MiniMax dropped M3 on June 1, 2026, and if you’re trying to figure out MiniMax M3 pricing, what its video input actually does, or whether that 1,000,000-token context window is worth paying for - you’re in the right place. I’ve read the entire technical report, combed through the API docs, and cross-checked the pricing tiers so you don’t have to.
Here’s the short version: M3 is the first open-weight model that combines frontier coding, native multimodal input (text + image + video), and a million-token context window in one package. It uses a new sparse attention architecture called MSA that makes long-context inference actually feasible. And its pricing - especially with the launch-week 50% discount - seriously undercuts comparable closed-source models.
Let’s get into the details.
MiniMax M3 Pricing: How Much Does It Actually Cost?
MiniMax M3 pricing splits into two tracks: pay-as-you-go API pricing for enterprises and Token Plan subscriptions for individuals and small teams. Here’s exactly what you’ll pay.
Pay-As-You-Go API Pricing
M3’s API pricing uses a two-tier structure based on context length. For calls with 512K or fewer input tokens - which covers the vast majority of conversation and coding use cases - you get the standard rate. Push past 512K (think full-repo code understanding or all-day video analysis sessions), and you hit the long-context rate.
| Tier | Input (per 1M tokens) | Output (per 1M tokens) | Cache Read (per 1M tokens) |
|---|---|---|---|
| Standard ≤512K (7-day 50% off) | $0.30 (normally $0.60) | $1.20 (normally $2.40) | $0.06 (normally $0.12) |
| Standard >512K | $1.20 | $4.80 | $0.24 |
| Priority ≤512K (7-day 50% off) | $0.45 (normally $0.90) | $1.80 (normally $3.60) | $0.09 (normally $0.18) |
| Priority >512K | $1.80 | $7.20 | $0.36 |
Priority tier gets you scheduling priority and more stable latency under high concurrency - useful if you’re building a production service with SLA requirements. For most developers, Standard tier is the right starting point.
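To make the tier table easier to reason about, here is a minimal cost-estimator sketch using the launch-discount rates above. It rests on two assumptions that the table does not spell out: that "512K" means 512,000 tokens, and that a call crossing the threshold is billed entirely at the long-context rate. The names `estimate_cost`, `RATES`, and `LONG_CONTEXT_THRESHOLD` are illustrative, not part of any MiniMax SDK.

```python
# Assumption: "512K" is 512,000 tokens (it could also mean 524,288).
LONG_CONTEXT_THRESHOLD = 512_000

# (input, output) USD per 1M tokens, keyed by (tier, is_long_context).
# The <=512K rows use the launch-week discounted prices from the table.
RATES = {
    ("standard", False): (0.30, 1.20),
    ("standard", True): (1.20, 4.80),
    ("priority", False): (0.45, 1.80),
    ("priority", True): (1.80, 7.20),
}


def estimate_cost(input_tokens, output_tokens, tier="standard"):
    """Estimate the USD cost of one call, ignoring prompt-cache hits.

    Assumption: the whole call is billed at the long-context rate once
    the input exceeds the threshold.
    """
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    input_rate, output_rate = RATES[(tier, long_context)]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    print(f"100K in / 10K out, standard: ${estimate_cost(100_000, 10_000):.4f}")
    print(f"600K in / 10K out, standard: ${estimate_cost(600_000, 10_000):.4f}")
```

Under these assumptions, pushing a request from 100K to 600K input tokens costs far more than six times as much, because every token in the call moves to the higher rate.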
At the launch-week discounted rate of $0.30/M input and $1.20/M output, M3 is cheaper than Claude Opus 4.7 ($5/$25), GPT-5.x, and Gemini 3.1 Pro. Even at normal pricing of $0.60/$2.40, it’s still comfortably below the big closed-source models for input cost, though output pricing is closer to parity.
Prompt caching brings costs down further. M3 supports automatic prompt caching for requests with 512+ input tokens. Cache-hit tokens bill at just $0.06/M (discounted) - that’s an 80% discount versus the standard input rate. If you’re building a chatbot or an agent that reuses system prompts, tool definitions, or conversation prefixes, the savings stack up fast. MiniMax’s docs show a real example where caching reduced total cost by roughly 67% on a request with 45,000 cached tokens out of 50,000 total.
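As a rough check on the caching math, the sketch below computes the input-side cost of the 45,000-of-50,000 example at the discounted Standard rates. It covers input tokens only and assumes every cached token bills at the cache-read rate, so it will not reproduce the docs' total-cost figure exactly, which presumably also counts output tokens. The function names are hypothetical.

```python
def input_cost(total_tokens, cached_tokens, input_rate=0.30, cache_rate=0.06):
    """Input-side USD cost of one request; rates are USD per 1M tokens.

    Assumption: cached tokens bill at cache_rate, the rest at input_rate.
    """
    uncached_tokens = total_tokens - cached_tokens
    return (uncached_tokens * input_rate + cached_tokens * cache_rate) / 1_000_000


def cache_saving(total_tokens, cached_tokens):
    """Fraction of input cost saved relative to a request with no cache hits."""
    without_cache = input_cost(total_tokens, 0)
    with_cache = input_cost(total_tokens, cached_tokens)
    return 1 - with_cache / without_cache


if __name__ == "__main__":
    print(f"no cache:   ${input_cost(50_000, 0):.4f}")
    print(f"with cache: ${input_cost(50_000, 45_000):.4f}")
    print(f"saving:     {cache_saving(50_000, 45_000):.0%}")
```

On input tokens alone the saving comes out at about 72%. Adding uncached output tokens to both sides of the comparison would pull that percentage down.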
Token Plan Subscriptions
If you’d rather pay a flat monthly fee than meter every token, the Token Plan covers all MiniMax models - text, image, speech, video, and music - under one quota.
| Plan | Price | Estimated M3 Token Capacity | Best For |
|---|---|---|---|
| Plus | $20/month | ~1.7B tokens/month | Personal projects, prototyping |
| Max | $50/month | ~5.1B tokens/month | Daily coding with agents, multimodal work |
| Ultra | $120/month | ~9.8B tokens/month | Heavy agent workflows, extended sessions |
All three tiers share the same rate limits: 200 requests per minute and 10 million tokens per minute. Usage draws from a shared credit pool with 5-hour rolling and weekly quota windows. If you blow through your subscription quota, purchased Credits ($1 = 1,000 credits) cover the overflow automatically.
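If you want to stay under those limits on the client side, a small sliding-window limiter is one way to do it. This is a sketch under assumptions: the source gives only the per-minute caps, so the 60-second sliding window here is a guess at how the server counts, and the class name and `try_acquire` method are made up for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for requests-per-minute and tokens-per-minute caps."""

    def __init__(self, max_requests=200, max_tokens=10_000_000,
                 window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window
        self.clock = clock          # injectable so the limiter can be tested
        self._events = deque()      # (timestamp, tokens) per admitted request
        self._tokens_in_window = 0

    def _evict_expired(self, now):
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Return True and record the request if it fits both caps."""
        now = self.clock()
        self._evict_expired(now)
        if len(self._events) >= self.max_requests:
            return False
        if self._tokens_in_window + tokens > self.max_tokens:
            return False
        self._events.append((now, tokens))
        self._tokens_in_window += tokens
        return True
```

A caller would check `limiter.try_acquire(estimated_tokens)` before each request and sleep briefly when it returns `False`. The server remains the source of truth, so retry handling for rate-limit errors is still needed.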
For context: $50/month for roughly 5 billion tokens of frontier-model access is aggressive pricing. Comparable coding-focused subscriptions from other providers typically deliver fewer tokens at a higher price point.
Token Plan vs. Pay-As-You-Go: Which Should You Choose?
- Go Token Plan if you’re a solo developer or small team using M3 daily across multiple tools (Claude Code, Cursor, OpenClaw). The flat fee caps your cost and the credit pool covers speech, image, and video generation too.
- Go Pay-As-You-Go if you need programmatic API access with no usage windows, want the Priority service tier, or have unpredictable burst workloads that don’t fit neatly into subscription quotas.
Key Features of MiniMax M3
M3 isn’t a single-feature model. It’s designed to do three hard things simultaneously - and that’s what makes it different from most open-weight releases.
1. Native Multimodal: Text, Image, and Video Input
MiniMax rebuilt its entire data pipeline to train M3 on text, images, and video from step zero - not as a post-hoc fine-tuning bolt-on. The training corpus exceeds 100 trillion tokens with a heavy emphasis on interleaved multimodal data (documents where text and images naturally mix within sequences).
Image input supports JPEG, PNG, GIF, and WEBP formats. You can pass images via URL or base64 encoding, with files up to 10 MB. At the “high” detail setting, a single image can consume up to roughly 15K tokens. At “low” detail, it’s usually a few hundred tokens. The model handles charts, diagrams, photographs, screenshots, and document scans.
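A minimal sketch of preparing a local image for the Anthropic-compatible endpoint might look like this. The block shape follows the Anthropic Messages API image format; that M3's endpoint accepts exactly this shape is an assumption, as is treating "10 MB" as 10 MiB. The helper name `image_block` is hypothetical.

```python
import base64
from pathlib import Path

# Formats listed above, mapped to their MIME types.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" means 10 MiB


def image_block(path):
    """Build a base64 image content block after checking format and size."""
    path = Path(path)
    media_type = MEDIA_TYPES.get(path.suffix.lower())
    if media_type is None:
        raise ValueError(f"unsupported image format: {path.suffix}")
    data = path.read_bytes()
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 10 MB limit")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("ascii"),
        },
    }
```

The returned dict can sit next to a `{"type": "text", ...}` block in a user message's `content` list.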
Video input is where M3 genuinely stands out. Supported formats include MP4, AVI, MOV, and MKV. You can send videos via URL, base64, or through the Files API (which handles files up to 512 MB - URL/base64 caps at 50 MB). M3 processes video at 1 frame per second, with support for up to 1,024 frames at resolutions between 336–1,008 pixels on the long edge.
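Those limits can be turned into a quick pre-flight check. The sketch below picks an upload route from the file size and estimates how many frames a clip yields at 1 FPS. It assumes "MB" means MiB, and it only flags clips that exceed the 1,024-frame cap, since the source does not say whether the API truncates or resamples them. `plan_video_upload` is a hypothetical helper, not an SDK function.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}
MIB = 1024 * 1024                  # assumption: "MB" in the limits means MiB
INLINE_LIMIT = 50 * MIB            # URL / base64 cap
FILES_API_LIMIT = 512 * MIB        # Files API cap
SAMPLE_FPS = 1                     # frames sampled per second of video
MAX_FRAMES = 1024


def plan_video_upload(filename, size_bytes, duration_seconds):
    """Choose an upload route and estimate frame usage for one video."""
    if Path(filename).suffix.lower() not in VIDEO_EXTENSIONS:
        raise ValueError(f"unsupported video format: {filename}")
    if size_bytes > FILES_API_LIMIT:
        raise ValueError("video exceeds the 512 MB Files API limit")
    method = "inline" if size_bytes <= INLINE_LIMIT else "files_api"
    frames_wanted = int(duration_seconds * SAMPLE_FPS)
    return {
        "method": method,
        "frames": min(frames_wanted, MAX_FRAMES),
        # True when the clip is longer than the frame cap covers at 1 FPS.
        "exceeds_frame_cap": frames_wanted > MAX_FRAMES,
    }
```

For example, a 10-minute, 10 MB clip would go inline with 600 frames, while a one-hour, 100 MB recording would go through the Files API and be flagged as exceeding the frame cap.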
On Video-MMMU, a challenging multimodal video understanding benchmark, M3 scores competitively against closed-source models. On the more widely-used Video-MME benchmark, it hits 84.6 at 512 frames.
The practical implication: you can upload a product demo, a security camera clip, a lecture recording, or a gameplay video and ask M3 to describe what’s happening, answer questions about specific moments, extract timestamps, or summarize the content. No separate vision pipeline needed.
2. 1M Token Context Window via MSA Architecture
A million-token context window isn’t just a spec-sheet flex - it changes what you can build. M3 achieves this with MiniMax Sparse Attention (MSA), a sparse attention mechanism designed from scratch to avoid the quadratic compute scaling of full attention.
Here’s what MSA actually does:
- KV-block-based sparse routing. The key-value cache gets partitioned into blocks, and queries only route to the most relevant blocks. MSA’s partitioning is finer-grained than earlier approaches like DSA or MoBA, giving it better effective context coverage.
- Operator-level optimization. They use a “KV outer gather Q” approach where KV blocks act as the outer loop, aggregating queries that hit each block. Each block is read exactly once with contiguous memory access. Under M3’s head configuration, this runs over 4x faster than open-source Flash-Sparse-Attention and flash-moba.
- Real throughput. At 1M context length, per-token compute drops to roughly 1/20th of the previous generation. Prefilling speeds up by over 9x, decoding by over 15x.
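The routing idea in the bullets above can be illustrated with a toy, pure-Python block-sparse attention. This is not MSA's actual kernel. Summarizing each block by its mean key, picking the top-k blocks per query, and the function name are all assumptions made for illustration. The one thing it borrows from the description is the loop order: KV blocks form the outer loop, and each block is visited once while serving every query routed to it.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(queries, keys, values, block_size, top_k):
    """Toy attention where each query attends only to its top_k KV blocks."""
    dim, value_dim = len(keys[0]), len(values[0])
    n_blocks = math.ceil(len(keys) / block_size)
    blocks = [range(b * block_size, min((b + 1) * block_size, len(keys)))
              for b in range(n_blocks)]

    # Assumption: a block's routing summary is the mean of its keys.
    summaries = [[sum(keys[i][d] for i in blk) / len(blk) for d in range(dim)]
                 for blk in blocks]

    # Routing: record, per block, which queries selected it.
    routed = [[] for _ in blocks]
    for qi, q in enumerate(queries):
        ranked = sorted(range(n_blocks),
                        key=lambda b: dot(q, summaries[b]), reverse=True)
        for b in ranked[:top_k]:
            routed[b].append(qi)

    # KV blocks as the outer loop; online softmax merges partial results.
    scale = 1.0 / math.sqrt(dim)
    running_max = [-math.inf] * len(queries)
    denom = [0.0] * len(queries)
    acc = [[0.0] * value_dim for _ in queries]
    for b, blk in enumerate(blocks):
        for qi in routed[b]:
            scores = [dot(queries[qi], keys[i]) * scale for i in blk]
            new_max = max(running_max[qi], max(scores))
            rescale = math.exp(running_max[qi] - new_max)  # 0.0 on first visit
            denom[qi] *= rescale
            acc[qi] = [a * rescale for a in acc[qi]]
            for s, i in zip(scores, blk):
                w = math.exp(s - new_max)
                denom[qi] += w
                for d in range(value_dim):
                    acc[qi][d] += w * values[i][d]
            running_max[qi] = new_max
    return [[a / denom[qi] for a in acc[qi]] for qi in range(len(queries))]
```

When `top_k` equals the number of blocks, this reduces to ordinary full attention, which makes a handy sanity check. With a smaller `top_k`, per-query work scales with the number of selected blocks rather than the full sequence length.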
The team’s internal testing showed MSA matched full attention on the vast majority of capability dimensions - reasoning, retrieval, multi-hop QA - without the performance degradation that historically plagued sparse attention methods.
What the 1M window enables in practice:
- Dropping an entire 500-page technical specification, its test suite, and the full codebase into a single prompt
- Processing 12+ hours of agent conversation history with full recall of every decision, error, and tool call
- Analyzing hour-long video recordings without splitting them into chunks
- Running multi-day autonomous coding sessions where the agent never forgets what it did three hours ago
The guaranteed minimum is 512K tokens, with the full 1M available through API configuration. Input beyond 512K currently requires contacting sales, though public availability is expected shortly after launch.
3. Interleaved Thinking and Tool Use
M3 supports interleaved thinking, a reasoning pattern where the model reflects between each round of tool interactions. Before every tool call, it analyzes the current environment and tool outputs to decide its next action. This matters for long-horizon agent tasks because the model builds a running mental model rather than calling tools blindly.
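A bare-bones tool loop that preserves the model's reasoning between rounds could be sketched as follows. It assumes M3's Anthropic-compatible endpoint mirrors the Anthropic Messages API response shape (a `stop_reason` of `"tool_use"`, content blocks with `type`, `id`, `name`, and `input`), and that the full assistant turn, thinking blocks included, should be sent back each round. `run_agent` and the `handlers` mapping are hypothetical names.

```python
def run_agent(create, tools, handlers, user_prompt, max_turns=10):
    """Run a tool-use loop until the model stops requesting tools.

    create:   callable taking the same keyword arguments as
              client.messages.create (passed in so it can be stubbed).
    handlers: dict mapping tool name -> Python function for that tool.
    """
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        response = create(
            model="MiniMax-M3",
            max_tokens=4000,
            tools=tools,
            messages=messages,
        )
        # Keep the whole assistant turn, including any thinking blocks,
        # so the next round can build on the earlier reasoning.
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return response

        # Execute every requested tool and return the results in one user turn.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = handlers[block.name](**block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(output),
                })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_turns")
```

With the real SDK this would be called as `run_agent(client.messages.create, tools, handlers, prompt)`, where each entry in `tools` follows the Anthropic tool schema (`name`, `description`, `input_schema`).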
The model achieved state-of-the-art results on SWE-Bench, BrowseComp, and xBench - all benchmarks that test both coding and agentic reasoning under multi-step, tool-heavy conditions.
On BrowseComp, M3 scored 83.5, surpassing Claude Opus 4.7 (79.3). On MCP Atlas, it hit 74.2%. These aren’t toy numbers - they reflect real autonomous browsing and tool orchestration capability.
You can toggle thinking on or off. With thinking enabled, M3 is suited for complex reasoning, agentic tasks, and long-horizon collaboration. With thinking disabled, it responds faster, making it better for conversation and code completion scenarios where latency matters. Both modes share the same pricing.
4. Coding and Agentic Performance
Coding is where M3 puts up its strongest numbers:
| Benchmark | MiniMax M3 | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 59.0% | 64.3% | 58.6% | 54.2% |
| Terminal-Bench 2.1 | 66.0% | - | - | - |
| SWE-fficiency | 34.8% | - | - | - |
| KernelBench Hard | 28.8% | - | - | - |
| MCP Atlas | 74.2% | - | - | - |
On SWE-Bench Pro - the extended, harder version of the software engineering benchmark - M3 edges past GPT-5.5 and Gemini 3.1 Pro, sitting just behind Opus 4.7.
The real test came from MiniMax’s internal evaluations. They tasked M3 with autonomously optimizing an FP8 GEMM CUDA kernel on NVIDIA Hopper GPUs - one of the hardest optimization problems in LLM inference. Starting from a non-runnable Triton skeleton with no reference implementation, M3 ran for roughly 24 hours, completed 147 benchmark submissions and 1,959 tool calls, and pushed hardware peak utilization from 7.6% to 71.3% - a 9.4x speedup with zero human intervention.
They also gave M3 an ICLR 2025 Outstanding Paper and asked it to reproduce the results independently. It ran for nearly 12 hours, produced 18 commits and 23 experimental figures, and successfully replicated the core experiments. That’s not just code generation - it’s research-grade experimental design and execution.
MiniMax M3 API Integration: Getting Started
M3’s API is designed for zero-friction integration. You can use either the Anthropic SDK (recommended) or the OpenAI SDK - both work with minimal configuration changes.
Anthropic SDK (Recommended)
```bash
pip install anthropic
export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
export ANTHROPIC_API_KEY=<your-api-key>
```

```python
import anthropic

# The client reads ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

message = client.messages.create(
    model="MiniMax-M3",
    max_tokens=1000,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Hi, how are you?"}]}
    ],
)
```
The Anthropic-compatible endpoint supports text, image (type="image"), video (type="video"), tool use (type="tool_use"), tool results (type="tool_result"), and thinking blocks. This is the recommended path because it gives you full access to interleaved thinking and streaming responses.
OpenAI SDK
```bash
pip install openai
export OPENAI_BASE_URL=https://api.minimax.io/v1
export OPENAI_API_KEY=<your-api-key>
```

```python
from openai import OpenAI

# Explicit arguments take precedence over the environment variables above.
client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key="<your-api-key>",
)

response = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[{"role": "user", "content": "Hi, how are you?"}],
)
```
With the OpenAI-compatible endpoint, you can pass extra_body={"reasoning_split": True} to separate thinking content into a dedicated reasoning_details field - cleaner than parsing <think> tags from the content string. Image input uses image_url content parts, and video input uses video_url content parts.
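For responses fetched without `reasoning_split`, the thinking has to be pulled out of the content string by hand. The sketch below does that with a regular expression. It assumes the tags appear literally as `<think>` and `</think>` and are never nested, and `split_thinking` is a hypothetical helper name.

```python
import re

# Non-greedy match so several <think> sections are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content):
    """Return (list of thinking sections, answer text with the tags removed)."""
    thoughts = [section.strip() for section in THINK_RE.findall(content)]
    answer = THINK_RE.sub("", content).strip()
    return thoughts, answer


if __name__ == "__main__":
    raw = "<think>The user greeted me.\nReply briefly.</think>Hi! I'm doing well."
    thoughts, answer = split_thinking(raw)
    print(thoughts)
    print(answer)
```

Passing `extra_body={"reasoning_split": True}` avoids this parsing step entirely, which is why it is the tidier option when the endpoint supports it.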
Video Input via API
Sending a video for analysis works like this in the Anthropic-compatible format:
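```python
import base64

import anthropic

client = anthropic.Anthropic()  # uses the ANTHROPIC_* environment variables

# "demo.mp4" is a placeholder path; inline base64 uploads are capped at 50 MB.
with open("demo.mp4", "rb") as f:
    base64_encoded_video = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="MiniMax-M3",
    max_tokens=4000,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize what happens in this video."},
            {"type": "video", "source": {
                "type": "base64",
                "media_type": "video/mp4",
                "data": base64_encoded_video,
            }},
        ],
    }],
)
```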
URL-based and Files API (mm_file://{file_id}) approaches are also supported. For videos over 50 MB, upload through the Files API first and reference by file ID.
Supported Coding Tools
M3 works with Claude Code (native support), Cursor (via custom OpenAI endpoint), Kilo Code, OpenCode, OpenClaw, TRAE, Droid, Codex CLI, and Hermes Agent. Full configuration guides are available in the MiniMax platform docs for each tool. Setting up Claude Code with M3 takes roughly two minutes - just point ANTHROPIC_BASE_URL at the MiniMax endpoint and set the model to MiniMax-M3.
Best Use Cases for MiniMax M3
Based on the benchmarks, real-world demos, and architecture, here’s where M3 actually shines in practice.
1. Long-Form Video Analysis
If you work with surveillance footage, lecture recordings, product demos, or user research sessions, M3’s video input changes the workflow. Instead of watching hours of footage yourself, you can ask the model: “At what timestamps does the presenter switch slides?” or “Find every moment the user hesitates or shows confusion.” Because M3 was trained on multimodal data from the start, its video understanding isn’t brittle - it handles real-world video content with decent accuracy at up to 1,024 frames.
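At 1 FPS, the 1,024-frame cap works out to roughly 17 minutes of video per request (1,024 seconds ≈ 17 minutes). The Files API accepts larger files, up to 512 MB, but file size and frame count are separate limits, so longer recordings will likely need to be split into segments.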
2. Long-Document Research and Legal Work
Dropping entire legal contracts, regulatory filings, or academic paper collections into context and asking targeted questions is a genuine superpower. M3’s MSA architecture means you can query across a million tokens of source material without the model losing track of details buried in the middle - a problem that still afflicts some models at extreme context lengths, even if they technically support big windows.
The prompt caching system also means repeated queries against the same document set get progressively cheaper and faster. Upload a 400K-token corpus once, ask 50 questions, and the cache-hit rate on that corpus drives your effective cost way down.
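To put rough numbers on that scenario, the sketch below totals the corpus-side input cost of 50 questions against a 400K-token corpus at the discounted Standard rates. It assumes the first request pays the full input rate, every later request reads the whole corpus from cache, and the cache never expires in between. It also ignores the question and output tokens. The function name is hypothetical.

```python
def corpus_cost(corpus_tokens, n_queries, input_rate=0.30, cache_rate=0.06,
                cached=True):
    """USD spent on the shared corpus across n_queries requests.

    Rates are USD per 1M tokens. Assumption: with caching, only the first
    request pays input_rate; all later ones pay cache_rate for the corpus.
    """
    first = corpus_tokens * input_rate / 1_000_000
    if not cached:
        return first * n_queries
    repeat = corpus_tokens * cache_rate / 1_000_000
    return first + repeat * (n_queries - 1)


if __name__ == "__main__":
    with_cache = corpus_cost(400_000, 50)
    without_cache = corpus_cost(400_000, 50, cached=False)
    print(f"with caching:    ${with_cache:.3f}")
    print(f"without caching: ${without_cache:.3f}")
    print(f"saving:          {1 - with_cache / without_cache:.1%}")
```

Under these assumptions the corpus costs about $1.30 across all 50 questions instead of $6.00. In practice, cache expiry between questions would push the real figure somewhere in between.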
3. Autonomous Coding Agents
M3’s sweet spot is long-running, autonomous coding sessions. Claude Code with M3 as the backend can handle multi-hour refactoring sessions, test suite generation, or dependency upgrades across an entire monorepo without the model forgetting what it changed in file #47 by the time it reaches file #312.
The CUDA kernel optimization demo (9.4x speedup over 24 hours with no human input) and the ICLR paper reproduction (12 hours, 18 commits, 23 figures) aren’t cherry-picked demos - they demonstrate a real capability for multi-step, self-correcting code generation that goes well beyond autocomplete.
4. Enterprise RAG and Knowledge Base Applications
For enterprise teams building on top of internal documentation, M3’s combination of long context, prompt caching, and competitive pricing makes it a strong candidate for retrieval-augmented generation pipelines. You can stuff an entire product knowledge base into context rather than relying solely on chunked retrieval, which reduces retrieval failures for questions that span multiple documents.
The Priority service tier also gives enterprise deployments more predictable latency, which is critical for customer-facing applications.
5. Computer Use and GUI Automation
M3 scored 70.06% on OSWorld-Verified - a benchmark that tests a model’s ability to navigate desktop interfaces, click buttons, fill forms, and complete multi-step tasks using only visual input. Combined with MiniMax Code’s computer-use mode, this means you can ask M3 to “open the ERP client and batch-enter these invoice numbers from the spreadsheet” and it will actually navigate the UI across applications.
Where M3 Isn’t the Right Fit
Be realistic about the trade-offs. M3’s rate limits (200 RPM, 10M TPM) are lower than what you’d get with larger cloud providers. If you need to serve thousands of concurrent users, you might hit those caps. The >512K context pricing ($1.20/M input, $4.80/M output) is also meaningfully more expensive - only use it when you genuinely need the extra context, not as a default.
And while M3 scores well on coding benchmarks, Claude Opus 4.7 still leads on pure SWE-Bench Pro (64.3% vs. 59.0%). If you’re doing nothing but software engineering and budget isn’t a concern, Opus 4.7 is still the stronger option. M3’s advantage is the multimodal + long context + lower price combination.
Comparison: MiniMax M3 vs. Major Alternatives
| Feature | MiniMax M3 | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Context Window | 1M tokens | 1M tokens | 200K tokens (est.) | 1M tokens |
| Multimodal Input | Text, image, video | Text, image | Text, image, video | Text, image, video |
| Open Weights | Yes (releasing soon) | No | No | No |
| API Input Price | $0.30–$1.20/M | $5/M | Pricing varies | Pricing varies |
| API Output Price | $1.20–$4.80/M | $25/M | Pricing varies | Pricing varies |
| SWE-Bench Pro | 59.0% | 64.3% | 58.6% | 54.2% |
| BrowseComp | 83.5 | 79.3 | - | - |
| Prompt Caching | Yes (automatic) | Yes (explicit) | Yes | Yes |
| Thinking Control | On/off toggle | Adaptive thinking | - | - |
MiniMax M3 API Access Guide (2026)
Here’s the quick path to start using M3:
- Create an account at platform.minimax.io
- Choose your billing model: Token Plan (subscription) or Pay-As-You-Go (API key)
- For Token Plan: Subscribe at platform.minimax.io/subscribe/token-plan, grab your Subscription Key from Account → Token Plan
- For Pay-As-You-Go: Get your API Key from Account → API Keys, top up your balance
- Set up your SDK: Use the Anthropic SDK with base_url=https://api.minimax.io/anthropic or the OpenAI SDK with base_url=https://api.minimax.io/v1
- Start calling: Your first request takes about two minutes from signup to response
For users in China, use api.minimaxi.com endpoints instead of api.minimax.io.
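If your code has to run in both regions, a tiny helper can keep the endpoint choice in one place. This sketch assumes the China host exposes the same `/anthropic` and `/v1` paths as the global one, which the source implies but does not state outright. `base_url` and the two lookup tables are illustrative names.

```python
HOSTS = {
    "global": "api.minimax.io",
    "china": "api.minimaxi.com",
}
# Assumption: both hosts serve the same paths for each SDK flavor.
PATHS = {
    "anthropic": "/anthropic",
    "openai": "/v1",
}


def base_url(sdk, region="global"):
    """Build the base URL for the given SDK flavor and region."""
    if sdk not in PATHS:
        raise ValueError(f"unknown sdk: {sdk}")
    if region not in HOSTS:
        raise ValueError(f"unknown region: {region}")
    return f"https://{HOSTS[region]}{PATHS[sdk]}"
```

The result can be passed as the `base_url` argument of either SDK client, or exported as `ANTHROPIC_BASE_URL` or `OPENAI_BASE_URL`.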
The Bottom Line
MiniMax M3 pricing is competitive - aggressively so during the launch discount period. At $0.30/M input, it’s roughly 17x cheaper on input than Claude Opus 4.7 ($5/M). The Token Plan at $20–120/month delivers usable monthly token quotas for serious development work. Video input works, the 1M context window is backed by real architectural innovation (not just marketing), and the coding benchmarks place it solidly in frontier territory.
The model’s real differentiator is doing all three things - coding, long context, and multimodal - in one open-weight release. Until M3, you had to choose: pick an open model with good coding but mediocre context, or a multimodal model that couldn’t code, or a long-context model that only handled text. M3 is the first open-weight model that doesn’t force that trade-off.
The open weights release is expected within days of the June 1 launch, which will make M3 the strongest locally-deployable option for teams that need multimodal + coding in a single model.
Sources
- MiniMax M3 Official Blog Post: minimax.io/blog/minimax-m3
- MiniMax Platform API Docs - Pay as You Go Pricing: platform.minimax.io/docs/guides/pricing-paygo
- MiniMax Platform - Token Plan Pricing: platform.minimax.io/docs/guides/pricing-token-plan
- MiniMax M3 Anthropic SDK Documentation: platform.minimax.io/docs/api-reference/text-anthropic-api
- MiniMax M3 OpenAI-Compatible API Documentation: platform.minimax.io/docs/api-reference/text-chat-openai
- MiniMax Models Release Notes: platform.minimax.io/docs/release-notes/models
- MiniMax M3 Product Page: minimax.io/models/text/m3
- MiniMax M3 Prompt Caching Documentation: platform.minimax.io/docs/api-reference/text-prompt-caching
- MiniMax M3 Rate Limits: platform.minimax.io/docs/guides/rate-limits
- Anthropic Claude Models Overview: docs.anthropic.com/en/docs/about-claude/models/overview