MiniMax M3 Pricing & Features Guide 2026: Video Input AI

AIUnpacker Editorial

Published Jun 5, 2026 · Updated Jun 5, 2026 · 14 min read (3,126 words)

Key Takeaways

Everything about MiniMax M3: how much it costs, what its video input can do, how the 1M context window works, and where it actually performs best.


Editorial Disclosure & Affiliate Notice

This content is published for informational and educational purposes only. It is not intended as a substitute for professional, legal, financial, or medical advice. AIUnpacker is funded by sponsorships, affiliate commissions, and display advertising — nothing here is free to produce. When you buy through our links, we may earn a commission at no extra cost to you. Our editorial picks are never influenced by compensation.

For educational purposes only. Nothing here should be taken as a guarantee or a professional recommendation.
AI-assisted editing. Drafts are produced with AI assistance and reviewed by our human editorial team.
Opinions are our own. Also, we are not affiliated with most tools we cover unless explicitly stated.
Information may be outdated. Verify pricing, features, and policies directly with the vendor.
Last reviewed: June 5, 2026. Published June 5, 2026.

Read more on our About page, Terms and Editorial Policy.

MiniMax dropped M3 on June 1, 2026, and if you’re trying to figure out MiniMax M3 pricing, what its video input actually does, or whether that 1,000,000-token context window is worth paying for - you’re in the right place. I’ve read the entire technical report, combed through the API docs, and cross-checked the pricing tiers so you don’t have to.

Here’s the short version: M3 is the first open-weight model that combines frontier coding, native multimodal input (text + image + video), and a million-token context window in one package. It uses a new sparse attention architecture called MSA that makes long-context inference actually feasible. And its pricing - especially with the launch-week 50% discount - seriously undercuts comparable closed-source models.

Let’s get into the details.

MiniMax M3 Pricing: How Much Does It Actually Cost?

MiniMax M3 pricing splits into two tracks: pay-as-you-go API pricing for enterprises and Token Plan subscriptions for individuals and small teams. Here’s exactly what you’ll pay.

Pay-As-You-Go API Pricing

M3’s API pricing uses a two-tier structure based on context length. For calls with 512K or fewer input tokens - which covers the vast majority of conversation and coding use cases - you get the standard rate. Push past 512K (think full-repo code understanding or all-day video analysis sessions), and you hit the long-context rate.

Tier	Input (per 1M tokens)	Output (per 1M tokens)	Cache Read (per 1M tokens)
Standard ≤512K (7-day 50% off)	$0.30 (normally $0.60)	$1.20 (normally $2.40)	$0.06 (normally $0.12)
Standard >512K	$1.20	$4.80	$0.24
Priority ≤512K (7-day 50% off)	$0.45 (normally $0.90)	$1.80 (normally $3.60)	$0.09 (normally $0.18)
Priority >512K	$1.80	$7.20	$0.36

Priority tier gets you scheduling priority and more stable latency under high concurrency - useful if you’re building a production service with SLA requirements. For most developers, Standard tier is the right starting point.
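To make the two-tier structure concrete, here is a minimal cost estimator built from the table above. It rests on two assumptions that the pricing summary doesn't spell out: the whole request is billed at the rate of the bracket its input length falls into, and "512K" means 512,000 tokens. `estimate_cost` and `RATES` are hypothetical names for illustration.

```python
# Per-1M-token rates (USD) from the pricing table above, at normal (non-launch) prices.
# Assumption: the whole request is billed at the rate of the bracket its input falls into.
RATES = {
    ("standard", "short"): {"input": 0.60, "output": 2.40},
    ("standard", "long"): {"input": 1.20, "output": 4.80},
    ("priority", "short"): {"input": 0.90, "output": 3.60},
    ("priority", "long"): {"input": 1.80, "output": 7.20},
}
LONG_CONTEXT_THRESHOLD = 512_000  # assumption: "512K" read as 512,000 tokens
LAUNCH_DISCOUNT = 0.5             # per the table, applies to the <=512K brackets only


def estimate_cost(input_tokens, output_tokens, tier="standard", launch_week=False):
    """Estimated USD cost of one request, ignoring prompt caching."""
    bracket = "long" if input_tokens > LONG_CONTEXT_THRESHOLD else "short"
    rate = RATES[(tier, bracket)]
    multiplier = LAUNCH_DISCOUNT if (launch_week and bracket == "short") else 1.0
    cost = (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
    return cost * multiplier


# A 20K-token coding prompt with a 2K-token answer, Standard tier, normal pricing:
print(round(estimate_cost(20_000, 2_000), 4))  # 0.0168
```

If MiniMax instead bills only the tokens above 512K at the higher rate, the long-context figures from this sketch overstate the cost.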

At the launch-week discounted rate of $0.30/M input and $1.20/M output, M3 is cheaper than Claude Opus 4.7 ($5/$25), GPT-5.x, and Gemini 3.1 Pro. Even at the normal $0.60/$2.40, it stays well below Opus 4.7 on both input (about 8x cheaper) and output (about 10x cheaper), though against the other big closed-source models the output-price gap is narrower than the input gap.

Prompt caching brings costs down further. M3 supports automatic prompt caching for requests with 512+ input tokens. Cache-hit tokens bill at just $0.06/M (discounted) - that’s an 80% discount versus the standard input rate. If you’re building a chatbot or an agent that reuses system prompts, tool definitions, or conversation prefixes, the savings stack up fast. MiniMax’s docs show a real example where caching reduced total cost by roughly 67% on a request with 45,000 cached tokens out of 50,000 total.
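Here is the arithmetic behind that cache example as a minimal sketch, using the discounted Standard rates quoted above ($0.30/M uncached input, $0.06/M cache read). It covers input cost only and ignores any cache-write surcharge, which isn't mentioned here; `input_cost` is a hypothetical helper.

```python
def input_cost(total_input, cached, uncached_rate=0.30, cache_read_rate=0.06):
    """USD input cost for one request; rates are per 1M tokens."""
    fresh = total_input - cached  # tokens billed at the full input rate
    return (fresh * uncached_rate + cached * cache_read_rate) / 1_000_000


no_cache = input_cost(50_000, 0)          # all 50K tokens at the full input rate
with_cache = input_cost(50_000, 45_000)   # 45K of the 50K tokens served from cache
savings = 1 - with_cache / no_cache
print(f"${no_cache:.4f} -> ${with_cache:.4f} ({savings:.0%} less on input)")
```

On input alone this comes to a 72% saving. The roughly 67% figure MiniMax cites is for total request cost, which also includes output tokens that caching doesn't discount, so it lands a little lower.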

Token Plan Subscriptions

If you’d rather pay a flat monthly fee than meter every token, the Token Plan covers all MiniMax models - text, image, speech, video, and music - under one quota.

Plan	Price	Estimated M3 Token Capacity	Best For
Plus	$20/month	~1.7B tokens/month	Personal projects, prototyping
Max	$50/month	~5.1B tokens/month	Daily coding with agents, multimodal work
Ultra	$120/month	~9.8B tokens/month	Heavy agent workflows, extended sessions

All three tiers share the same rate limits: 200 requests per minute and 10 million tokens per minute. Usage draws from a shared credit pool with 5-hour rolling and weekly quota windows. If you blow through your subscription quota, purchased Credits ($1 = 1,000 credits) cover the overflow automatically.

For context: $50/month for roughly 5 billion tokens of frontier-model access is aggressive pricing. Comparable coding-focused subscriptions from other providers typically deliver fewer tokens at a higher price point.
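A quick way to compare the plans is dollars per billion tokens. The sketch below uses the estimated M3 capacities from the table; those are MiniMax's estimates, and real capacity depends on your model mix and on using the full quota.

```python
# Monthly price (USD) and estimated M3 token capacity (billions) from the Token Plan table.
PLANS = {"Plus": (20, 1.7), "Max": (50, 5.1), "Ultra": (120, 9.8)}

# Effective cost per billion tokens, assuming the whole estimated quota is used.
per_billion = {name: price / billions for name, (price, billions) in PLANS.items()}
for name, cost in per_billion.items():
    print(f"{name}: ${cost:.2f} per billion tokens")
```

On these estimates Max has the lowest unit price (about $9.80 per billion tokens); Ultra's higher fee buys headroom rather than cheaper tokens.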

Token Plan vs. Pay-As-You-Go: Which Should You Choose?

Go Token Plan if you’re a solo developer or small team using M3 daily across multiple tools (Claude Code, Cursor, OpenClaw). The flat fee caps your cost and the credit pool covers speech, image, and video generation too.
Go Pay-As-You-Go if you need programmatic API access with no usage windows, want the Priority service tier, or have unpredictable burst workloads that don’t fit neatly into subscription quotas.

Key Features of MiniMax M3

M3 isn’t a single-feature model. It’s designed to do three hard things simultaneously - and that’s what makes it different from most open-weight releases.

1. Native Multimodal: Text, Image, and Video Input

MiniMax rebuilt its entire data pipeline to train M3 on text, images, and video from step zero - not as a post-hoc fine-tuning bolt-on. The training corpus exceeds 100 trillion tokens with a heavy emphasis on interleaved multimodal data (documents where text and images naturally mix within sequences).

Image input supports JPEG, PNG, GIF, and WEBP formats. You can pass images via URL or base64 encoding, with files up to 10 MB. At the “high” detail setting, a single image can consume up to roughly 15K tokens. At “low” detail, it’s usually a few hundred tokens. The model handles charts, diagrams, photographs, screenshots, and document scans.
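As a minimal sketch, the helper below turns a local image into a base64 image content block after checking the format list and 10 MB limit described above. The `{"type": "image", "source": {...}}` layout follows Anthropic's Messages API image format, which we assume MiniMax's Anthropic-compatible endpoint accepts unchanged. Reading "10 MB" as 10 MiB of raw file size is also an assumption, and `image_block` is a hypothetical helper.

```python
import base64
from pathlib import Path

# Supported formats as listed above, mapped to MIME types.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB of raw file


def image_block(path):
    """Build a base64 image content block for the Anthropic-compatible endpoint."""
    p = Path(path)
    media_type = MEDIA_TYPES.get(p.suffix.lower())
    if media_type is None:
        raise ValueError(f"unsupported image format: {p.suffix}")
    data = p.read_bytes()
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"{p.name} is {len(data)} bytes; the limit is 10 MB")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```

Pass the block alongside a text part, for example `[image_block("chart.png"), {"type": "text", "text": "What trend does this chart show?"}]`. The "high"/"low" detail setting isn't shown because its parameter name isn't given here.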

Video input is where M3 genuinely stands out. Supported formats include MP4, AVI, MOV, and MKV. You can send videos via URL, base64, or the Files API; URL and base64 uploads are capped at 50 MB, while the Files API accepts files up to 512 MB. M3 processes video at 1 frame per second, with support for up to 1,024 frames at resolutions between 336 and 1,008 pixels on the long edge.
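Those limits translate into a simple pre-flight check. The sketch below picks an upload path (inline URL/base64 up to 50 MB, Files API up to 512 MB) and estimates how many frames get sampled at 1 FPS, capped at 1,024. Reading MB as MiB and counting one frame per started second are our assumptions; `plan_video` is a hypothetical helper.

```python
import math
from pathlib import Path

VIDEO_FORMATS = {".mp4", ".avi", ".mov", ".mkv"}
INLINE_LIMIT = 50 * 1024 * 1024      # URL/base64 cap (assumption: MB read as MiB)
FILES_API_LIMIT = 512 * 1024 * 1024  # Files API cap
MAX_FRAMES = 1024
FPS = 1


def plan_video(filename, size_bytes, duration_s):
    """Return (upload_path, frames_sampled, seconds_covered) for one video."""
    if Path(filename).suffix.lower() not in VIDEO_FORMATS:
        raise ValueError(f"unsupported video format: {filename}")
    if size_bytes > FILES_API_LIMIT:
        raise ValueError("file exceeds the 512 MB Files API limit")
    upload = "inline" if size_bytes <= INLINE_LIMIT else "files_api"
    frames = min(math.ceil(duration_s * FPS), MAX_FRAMES)
    return upload, frames, frames / FPS


# A 120 MB, 45-minute recording:
print(plan_video("demo.mp4", 120 * 1024 * 1024, 45 * 60))
# ('files_api', 1024, 1024.0)
```

In that example only 1,024 of the recording's 2,700 seconds fit at 1 FPS, which is worth knowing before you ask about a specific late moment.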

On Video-MMMU, a challenging multimodal video understanding benchmark, M3 scores competitively against closed-source models. On the more widely used Video-MME benchmark, it scores 84.6 with 512 frames.

The practical implication: you can upload a product demo, a security camera clip, a lecture recording, or a gameplay video and ask M3 to describe what’s happening, answer questions about specific moments, extract timestamps, or summarize the content. No separate vision pipeline needed.

2. 1M Token Context Window via MSA Architecture

A million-token context window isn’t just a spec-sheet flex - it changes what you can build. M3 achieves this with MiniMax Sparse Attention (MSA), a sparse attention mechanism designed from scratch to avoid the quadratic compute scaling of full attention.

Here’s what MSA actually does:

KV-block-based sparse routing. The key-value cache gets partitioned into blocks, and queries only route to the most relevant blocks. MSA’s partitioning is finer-grained than earlier approaches like DSA or MoBA, giving it better effective context coverage.
Operator-level optimization. They use a “KV outer gather Q” approach where KV blocks act as the outer loop, aggregating queries that hit each block. Each block is read exactly once with contiguous memory access. Under M3’s head configuration, this runs over 4x faster than open-source Flash-Sparse-Attention and flash-moba.
Real throughput. At 1M context length, per-token compute drops to roughly 1/20th of the previous generation. Prefilling speeds up by over 9x, decoding by over 15x.
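MSA's actual kernel isn't published in the material summarized here, so the following is a generic block-sparse attention sketch in the spirit of the description above. It is closer to MoBA-style mean-key routing than to MSA's own scheme. Keys are grouped into fixed-size blocks, each query routes to its `top_k` blocks by dot product with the block's mean key, and the main loop then runs over KV blocks on the outside, gathering the queries routed to each one and accumulating with an online softmax. The code is pure Python for readability, omits causal masking and batching, and `sparse_attention` and its parameters are hypothetical.

```python
import math
from collections import defaultdict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_summaries(keys, block_size):
    """Mean key vector of each contiguous KV block, used only for routing."""
    summaries = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        summaries.append([sum(col) / len(block) for col in zip(*block)])
    return summaries


def route(queries, summaries, top_k):
    """For each query, the indices of its top_k blocks by query-summary score."""
    return [
        sorted(range(len(summaries)), key=lambda b: dot(q, summaries[b]), reverse=True)[:top_k]
        for q in queries
    ]


def sparse_attention(queries, keys, values, block_size=4, top_k=2):
    routes = route(queries, block_summaries(keys, block_size), top_k)

    # Invert query->blocks into block->queries: KV blocks become the outer loop,
    # so each block is read once for all the queries routed to it.
    gathered = defaultdict(list)
    for qi, blocks in enumerate(routes):
        for b in blocks:
            gathered[b].append(qi)

    scale = 1.0 / math.sqrt(len(queries[0]))
    # Online-softmax state per query: running max, normalizer, weighted value sum.
    run_max = [-math.inf] * len(queries)
    run_den = [0.0] * len(queries)
    run_num = [[0.0] * len(values[0]) for _ in queries]

    for b, qis in gathered.items():
        start = b * block_size
        for k, v in zip(keys[start:start + block_size], values[start:start + block_size]):
            for qi in qis:
                s = dot(queries[qi], k) * scale
                new_max = max(run_max[qi], s)
                fix = math.exp(run_max[qi] - new_max)  # rescales earlier terms; 0.0 on first hit
                w = math.exp(s - new_max)
                run_den[qi] = run_den[qi] * fix + w
                run_num[qi] = [n * fix + w * x for n, x in zip(run_num[qi], v)]
                run_max[qi] = new_max

    return [[n / d for n in num] for num, d in zip(run_num, run_den)]
```

Setting `top_k` to the number of blocks recovers exact dense attention, which makes a handy correctness check; the compute savings come from keeping `top_k` small relative to the total number of blocks.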

The team’s internal testing showed MSA matched full attention on the vast majority of capability dimensions - reasoning, retrieval, multi-hop QA - without the performance degradation that historically plagued sparse attention methods.

What the 1M window enables in practice:

Dropping an entire 500-page technical specification, its test suite, and the full codebase into a single prompt
Processing 12+ hours of agent conversation history with full recall of every decision, error, and tool call
Analyzing long video recordings in one request (up to the 1,024-frame cap, about 17 minutes at 1 FPS) without splitting them into chunks
Running multi-day autonomous coding sessions where the agent never forgets what it did three hours ago

The default window is 512K tokens. The full 1M is enabled through API configuration, but input beyond 512K currently requires contacting sales, with public availability expected shortly after launch.
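Before sending a very large prompt, it helps to know which regime it lands in. The sketch below uses a rough 4-characters-per-token heuristic (an assumption; use a real tokenizer count when you need billing-grade numbers) and reads 512K and 1M as 512,000 and 1,000,000 tokens. `context_regime` is a hypothetical helper.

```python
STANDARD_LIMIT = 512_000    # default window and pricing-tier boundary (assumed 512,000)
EXTENDED_LIMIT = 1_000_000  # full window; input above 512K currently goes through sales
CHARS_PER_TOKEN = 4         # rough heuristic, not MiniMax's tokenizer


def estimate_tokens(texts):
    """Very rough token estimate for a list of strings."""
    return sum(len(t) for t in texts) // CHARS_PER_TOKEN


def context_regime(texts):
    """Classify a prompt as 'standard', 'extended', or 'too_large' by estimated input size."""
    tokens = estimate_tokens(texts)
    if tokens <= STANDARD_LIMIT:
        return "standard"
    if tokens <= EXTENDED_LIMIT:
        return "extended"
    return "too_large"
```

An "extended" result means long-context pricing applies and, for now, access goes through sales; "too_large" means splitting the material or falling back to retrieval.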

3. Interleaved Thinking and Tool Use

M3 supports interleaved thinking, a reasoning pattern where the model reflects between each round of tool interactions. Before every tool call, it analyzes the current environment and tool outputs to decide its next action. This matters for long-horizon agent tasks because the model builds a running mental model rather than calling tools blindly.
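In client code, interleaved thinking mostly shows up as a loop: call the model, run any requested tools, send the results back, and repeat until the model stops asking for tools. The sketch below assumes MiniMax's Anthropic-compatible endpoint follows Anthropic's Messages API conventions (`stop_reason == "tool_use"`, `tool_use` blocks with `id`, `name`, and `input`, and `tool_result` replies). It appends the assistant's full content, thinking blocks included, back into the history unchanged so the model keeps its reasoning between rounds. `run_agent` and the `handlers` mapping are hypothetical.

```python
def run_agent(client, messages, tools, handlers, model="MiniMax-M3", max_rounds=20):
    """Drive a tool-use loop until the model returns a final answer.

    client:   an anthropic.Anthropic() pointed at the MiniMax endpoint
    tools:    tool definitions in Anthropic's JSON-schema format
    handlers: hypothetical mapping of tool name -> Python callable
    """
    for _ in range(max_rounds):
        response = client.messages.create(
            model=model, max_tokens=4000, tools=tools, messages=messages
        )
        # Keep the whole assistant turn, including thinking blocks, unchanged.
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return response

        # Run each requested tool and send every result back in one user turn.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = handlers[block.name](**block.input)
                results.append(
                    {"type": "tool_result", "tool_use_id": block.id, "content": str(output)}
                )
        messages.append({"role": "user", "content": results})
    raise RuntimeError(f"no final answer after {max_rounds} rounds")
```

The `max_rounds` cap is a simple guard against runaway loops; long-horizon agents would typically raise it and add logging per round.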

MiniMax reports state-of-the-art results on SWE-Bench, BrowseComp, and xBench, all benchmarks that test both coding and agentic reasoning under multi-step, tool-heavy conditions.

On BrowseComp, M3 scored 83.5, surpassing Claude Opus 4.7 (79.3). On MCP Atlas, it hit 74.2%. These aren’t toy numbers - they reflect real autonomous browsing and tool orchestration capability.

You can toggle thinking on or off. With thinking enabled, M3 is suited for complex reasoning, agentic tasks, and long-horizon collaboration. With thinking disabled, it responds faster, making it better for conversation and code completion scenarios where latency matters. Both modes share the same pricing.

4. Coding and Agentic Performance

Coding is where M3 puts up its strongest numbers:

Benchmark	MiniMax M3	Claude Opus 4.7	GPT-5.5	Gemini 3.1 Pro
SWE-Bench Pro	59.0%	64.3%	58.6%	54.2%
Terminal-Bench 2.1	66.0%	-	-	-
SWE-fficiency	34.8%	-	-	-
KernelBench Hard	28.8%	-	-	-
MCP Atlas	74.2%	-	-	-

On SWE-Bench Pro, the extended and harder version of the software engineering benchmark, M3 edges past GPT-5.5 (59.0% vs. 58.6%), leads Gemini 3.1 Pro (54.2%) by nearly five points, and trails Opus 4.7 (64.3%) by about five points.

The real test came from MiniMax’s internal evaluations. They tasked M3 with autonomously optimizing an FP8 GEMM CUDA kernel on NVIDIA Hopper GPUs - one of the hardest optimization problems in LLM inference. Starting from a non-runnable Triton skeleton with no reference implementation, M3 ran for roughly 24 hours, completed 147 benchmark submissions and 1,959 tool calls, and pushed hardware peak utilization from 7.6% to 71.3% - a 9.4x speedup with zero human intervention.

They also gave M3 an ICLR 2025 Outstanding Paper and asked it to reproduce the results independently. It ran for nearly 12 hours, produced 18 commits and 23 experimental figures, and successfully replicated the core experiments. That’s not just code generation - it’s research-grade experimental design and execution.

MiniMax M3 API Integration: Getting Started

M3’s API is designed for zero-friction integration. You can use either the Anthropic SDK (recommended) or the OpenAI SDK - both work with minimal configuration changes.

Anthropic SDK (Recommended)

pip install anthropic
export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
export ANTHROPIC_API_KEY=<your-api-key>

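import anthropic

# Reads ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the environment set above.
client = anthropic.Anthropic()

message = client.messages.create(
    model="MiniMax-M3",
    max_tokens=1000,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Hi, how are you?"}]}
    ],
)

# The response may also contain thinking blocks; print only the text blocks.
for block in message.content:
    if block.type == "text":
        print(block.text)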

The Anthropic-compatible endpoint supports text, image (type="image"), video (type="video"), tool use (type="tool_use"), tool results (type="tool_result"), and thinking blocks. This is the recommended path because it gives you full access to interleaved thinking and streaming responses.

OpenAI SDK

pip install openai
export OPENAI_BASE_URL=https://api.minimax.io/v1
export OPENAI_API_KEY=<your-api-key>

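import os
from openai import OpenAI

# Use the key exported above instead of hard-coding it in source.
client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)

response = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[{"role": "user", "content": "Hi, how are you?"}],
)
print(response.choices[0].message.content)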

With the OpenAI-compatible endpoint, you can pass extra_body={"reasoning_split": True} to separate thinking content into a dedicated reasoning_details field - cleaner than parsing <think> tags from the content string. Image input uses image_url content parts, and video input uses video_url content parts.

Video Input via API

Sending a video for analysis works like this in the Anthropic-compatible format:

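import base64
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_BASE_URL / ANTHROPIC_API_KEY from above

# Inline base64 is capped at 50 MB; "demo.mp4" is a placeholder path.
with open("demo.mp4", "rb") as f:
    base64_encoded_video = base64.b64encode(f.read()).decode("ascii")

message = client.messages.create(
    model="MiniMax-M3",
    max_tokens=4000,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize what happens in this video."},
            {"type": "video", "source": {
                "type": "base64",
                "media_type": "video/mp4",
                "data": base64_encoded_video,
            }},
        ],
    }],
)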

URL-based and Files API (mm_file://{file_id}) approaches are also supported. For videos over 50 MB, upload through the Files API first and reference by file ID.

Supported Coding Tools

M3 works with Claude Code (native support), Cursor (via custom OpenAI endpoint), Kilo Code, OpenCode, OpenClaw, TRAE, Droid, Codex CLI, and Hermes Agent. Full configuration guides are available in the MiniMax platform docs for each tool. Setting up Claude Code with M3 takes roughly two minutes - just point ANTHROPIC_BASE_URL at the MiniMax endpoint and set the model to MiniMax-M3.

Best Use Cases for MiniMax M3

Based on the benchmarks, real-world demos, and architecture, here’s where M3 actually shines in practice.

1. Long-Form Video Analysis

If you work with surveillance footage, lecture recordings, product demos, or user research sessions, M3’s video input changes the workflow. Instead of watching hours of footage yourself, you can ask the model: “At what timestamps does the presenter switch slides?” or “Find every moment the user hesitates or shows confusion.” Because M3 was trained on multimodal data from the start, its video understanding isn’t brittle - it handles real-world video content with decent accuracy at up to 1,024 frames.

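At 1 FPS, the 1,024-frame cap works out to roughly 17 minutes (1,024 seconds) of footage per request, so the frame cap, not the 1M context window, is the practical limit on clip length. The Files API raises the upload limit to 512 MB, but that is a file-size cap, not a frame cap; for longer recordings, split the video into segments or check how the API samples frames beyond 1,024.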
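If a recording runs past the 1,024-frame cap, one option is to cut it into segments that each fit and query them separately. The sketch below only computes segment boundaries in seconds at 1 FPS; actually cutting the file (for example with ffmpeg) and merging the per-segment answers are left out, and `segment_plan` is a hypothetical helper.

```python
def segment_plan(duration_s, max_frames=1024, fps=1, overlap_s=0.0):
    """Split a video of duration_s seconds into (start, end) windows.

    Each window holds at most max_frames frames at the given fps; overlap_s
    lets consecutive segments share a few seconds of context.
    """
    window = max_frames / fps
    if not 0 <= overlap_s < window:
        raise ValueError("overlap must be non-negative and shorter than one window")
    duration = float(duration_s)
    segments, start = [], 0.0
    while start < duration:
        end = min(start + window, duration)
        segments.append((start, end))
        if end >= duration:
            break
        start = end - overlap_s  # step back so segments can overlap
    return segments


# A one-hour lecture at 1 FPS splits into four segments.
print(segment_plan(3600))
```

A small overlap (say 20 to 30 seconds) helps when an event straddles a boundary, at the cost of a few duplicated frames per segment.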

2. Long-Document Research and Legal Work

Dropping entire legal contracts, regulatory filings, or academic paper collections into context and asking targeted questions is a genuine superpower. M3’s MSA architecture means you can query across a million tokens of source material without the model losing track of details buried in the middle - a problem that still afflicts some models at extreme context lengths, even if they technically support big windows.

The prompt caching system also means repeated queries against the same document set get progressively cheaper and faster. Upload a 400K-token corpus once, ask 50 questions, and the cache-hit rate on that corpus drives your effective cost way down.
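To put a number on that, here is a minimal sketch of the 400K-token, 50-question scenario at the discounted Standard rates ($0.30/M input, $0.06/M cache read). It assumes the first request pays the full input rate and every later request hits the cache for the whole corpus, and it ignores question and answer tokens and any cache-write charge; `corpus_cost` is a hypothetical helper.

```python
def corpus_cost(corpus_tokens, questions, input_rate=0.30, cache_rate=0.06, cached=True):
    """USD spent re-sending the same corpus with each question (rates per 1M tokens)."""
    if not cached:
        return questions * corpus_tokens * input_rate / 1_000_000
    first = corpus_tokens * input_rate                   # cold request populates the cache
    rest = (questions - 1) * corpus_tokens * cache_rate  # later requests read from cache
    return (first + rest) / 1_000_000


print(f"uncached: ${corpus_cost(400_000, 50, cached=False):.2f}")
print(f"cached:   ${corpus_cost(400_000, 50):.2f}")
```

That works out to $6.00 versus about $1.30 for the corpus tokens, roughly a 78% cut, before counting the questions and answers themselves.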

3. Autonomous Coding Agents

M3’s sweet spot is long-running, autonomous coding sessions. Claude Code with M3 as the backend can handle multi-hour refactoring sessions, test suite generation, or dependency upgrades across an entire monorepo without the model forgetting what it changed in file #47 by the time it reaches file #312.

The CUDA kernel optimization demo (9.4x speedup over 24 hours with no human input) and the ICLR paper reproduction (12 hours, 18 commits, 23 figures) aren’t cherry-picked demos - they demonstrate a real capability for multi-step, self-correcting code generation that goes well beyond autocomplete.

4. Enterprise RAG and Knowledge Base Applications

For enterprise teams building on top of internal documentation, M3’s combination of long context, prompt caching, and competitive pricing makes it a strong candidate for retrieval-augmented generation pipelines. You can stuff an entire product knowledge base into context rather than relying solely on chunked retrieval, which reduces retrieval failures for questions that span multiple documents.

The Priority service tier also gives enterprise deployments more predictable latency, which is critical for customer-facing applications.

5. Computer Use and GUI Automation

M3 scored 70.06% on OSWorld-Verified - a benchmark that tests a model’s ability to navigate desktop interfaces, click buttons, fill forms, and complete multi-step tasks using only visual input. Combined with MiniMax Code’s computer-use mode, this means you can ask M3 to “open the ERP client and batch-enter these invoice numbers from the spreadsheet” and it will actually navigate the UI across applications.

Where M3 Isn’t the Right Fit

Be realistic about the trade-offs. M3’s rate limits (200 RPM, 10M TPM) are lower than what you’d get with larger cloud providers. If you need to serve thousands of concurrent users, you might hit those caps. The >512K context pricing ($1.20/M input, $4.80/M output) is also meaningfully more expensive - only use it when you genuinely need the extra context, not as a default.
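If you do run close to those caps, throttling on the client side is cheaper than handling rejected requests. Below is a minimal sliding-window limiter sketch that tracks both requests per minute and tokens per minute against the 200 RPM / 10M TPM figures above. It is single-threaded, relies on token estimates you supply, and `RateLimiter` is a hypothetical class, not part of any MiniMax SDK.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding 60-second window over request count and token count."""

    def __init__(self, rpm=200, tpm=10_000_000, clock=time.monotonic, sleep=time.sleep):
        self.rpm, self.tpm = rpm, tpm
        self.clock, self.sleep = clock, sleep
        self.events = deque()  # (timestamp, tokens) for requests in the last 60 s
        self.tokens_in_window = 0

    def _expire(self, now):
        # Drop requests that have aged out of the 60-second window.
        while self.events and now - self.events[0][0] >= 60:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def acquire(self, tokens):
        """Block until a request of `tokens` estimated tokens fits both limits."""
        if tokens > self.tpm:
            raise ValueError("single request exceeds the per-minute token limit")
        while True:
            now = self.clock()
            self._expire(now)
            if len(self.events) < self.rpm and self.tokens_in_window + tokens <= self.tpm:
                self.events.append((now, tokens))
                self.tokens_in_window += tokens
                return
            # Wait until the oldest request leaves the window, then re-check.
            self.sleep(60 - (now - self.events[0][0]))
```

Call `limiter.acquire(estimated_tokens)` before each API request; a multi-threaded server would also need a lock around `acquire`. The injectable `clock` and `sleep` make the limiter easy to test without real waiting.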

And while M3 scores well on coding benchmarks, Claude Opus 4.7 still leads on pure SWE-Bench Pro (64.3% vs. 59.0%). If you’re doing nothing but software engineering and budget isn’t a concern, Opus 4.7 is still the stronger option. M3’s advantage is the multimodal + long context + lower price combination.

Comparison: MiniMax M3 vs. Major Alternatives

Feature	MiniMax M3	Claude Opus 4.7	GPT-5.5	Gemini 3.1 Pro
Context Window	1M tokens	1M tokens	200K tokens (est.)	1M tokens
Multimodal Input	Text, image, video	Text, image	Text, image, video	Text, image, video
Open Weights	Yes (releasing soon)	No	No	No
API Input Price	$0.30–$1.20/M	$5/M	Pricing varies	Pricing varies
API Output Price	$1.20–$4.80/M	$25/M	Pricing varies	Pricing varies
SWE-Bench Pro	59.0%	64.3%	58.6%	54.2%
BrowseComp	83.5	79.3	-	-
Prompt Caching	Yes (automatic)	Yes (explicit)	Yes	Yes
Thinking Control	On/off toggle	Adaptive thinking	-	-

MiniMax M3 API Access Guide (2026)

Here’s the quick path to start using M3:

Create an account at platform.minimax.io
Choose your billing model: Token Plan (subscription) or Pay-As-You-Go (API key)
For Token Plan: Subscribe at platform.minimax.io/subscribe/token-plan, grab your Subscription Key from Account → Token Plan
For Pay-As-You-Go: Get your API Key from Account → API Keys, top up your balance
Set up your SDK: Use the Anthropic SDK with base_url=https://api.minimax.io/anthropic or the OpenAI SDK with base_url=https://api.minimax.io/v1
Start calling: Your first request takes about two minutes from signup to response

For users in China, use api.minimaxi.com endpoints instead of api.minimax.io.
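If you deploy in both regions, a small helper keeps the base URL in one place. This sketch assumes the China host exposes the same `/anthropic` and `/v1` paths as the global one (the text above only says to swap the hostname), and `base_url` is a hypothetical helper.

```python
HOSTS = {"global": "api.minimax.io", "china": "api.minimaxi.com"}
PATHS = {"anthropic": "/anthropic", "openai": "/v1"}  # assumption: same paths on both hosts


def base_url(region="global", sdk="anthropic"):
    """Return the MiniMax base URL for a region and SDK flavor."""
    return f"https://{HOSTS[region]}{PATHS[sdk]}"
```

For example, `anthropic.Anthropic(base_url=base_url("china"))` or `OpenAI(base_url=base_url("global", "openai"), api_key=...)`.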

The Bottom Line

MiniMax M3 pricing is competitive, aggressively so during the launch discount period. At $0.30/M input, it's nearly 17x cheaper on input than Claude Opus 4.7 ($5/M). The Token Plan at $20–$120/month delivers usable monthly token quotas for serious development work. Video input works, the 1M context window is backed by real architectural innovation (not just marketing), and the coding benchmarks place it solidly in frontier territory.

The model’s real differentiator is doing all three things - coding, long context, and multimodal - in one open-weight release. Until M3, you had to choose: pick an open model with good coding but mediocre context, or a multimodal model that couldn’t code, or a long-context model that only handled text. M3 is the first open-weight model that doesn’t force that trade-off.

The open weights release is expected within days of the June 1 launch, which will make M3 the strongest locally-deployable option for teams that need multimodal + coding in a single model.

Sources

MiniMax M3 Official Blog Post: minimax.io/blog/minimax-m3
MiniMax Platform API Docs - Pay as You Go Pricing: platform.minimax.io/docs/guides/pricing-paygo
MiniMax Platform - Token Plan Pricing: platform.minimax.io/docs/guides/pricing-token-plan
MiniMax M3 Anthropic SDK Documentation: platform.minimax.io/docs/api-reference/text-anthropic-api
MiniMax M3 OpenAI-Compatible API Documentation: platform.minimax.io/docs/api-reference/text-chat-openai
MiniMax Models Release Notes: platform.minimax.io/docs/release-notes/models
MiniMax M3 Product Page: minimax.io/models/text/m3
MiniMax M3 Prompt Caching Documentation: platform.minimax.io/docs/api-reference/text-prompt-caching
MiniMax M3 Rate Limits: platform.minimax.io/docs/guides/rate-limits
Anthropic Claude Models Overview: docs.anthropic.com/en/docs/about-claude/models/overview


AIUnpacker Editorial Team


A collective of engineers, journalists, and AI practitioners dedicated to providing hands-on, transparently disclosed analysis of the AI tools shaping tomorrow.
