MiniMax M3 Review: 1M Context Multimodal AI Model for Agents, Coding, and Research
MiniMax M3 launched on June 1, 2026, and it’s not just another checkpoint drop. It’s a model that genuinely tries to do three hard things at once: frontier coding, million-token context, and native multimodality - all in an open-weight package. I’ve spent the last few days digging through the technical report, benchmarks, and API docs. Here’s my take.
What Is MiniMax M3?
M3 is the latest large language model from MiniMax, the Shanghai-based AI company founded in early 2022. It’s the successor to the M2, M2.1, M2.5, and M2.7 series. The headline numbers:
- 1 million token context window (512K guaranteed minimum)
- Native multimodal - text, image, and video input from step zero of training
- Open-weight - the first model to combine all three “frontier essentials” (frontier coding, 1M context, native multimodality) in one open release
- 100T+ training tokens with interleaved multimodal data
- New MSA architecture - MiniMax Sparse Attention replaces full attention
MiniMax has a real track record here. Their M2.7 open model already hit 2.4 million downloads on Hugging Face, and the company serves 236 million individual users across 200+ countries. They’re not a fly-by-night lab.
MSA Architecture: How the 1M Context Actually Works
Context scaling isn’t just about cranking up a parameter. Full attention has quadratic computational complexity - double the context, quadruple the compute. M3 sidesteps this with MSA (MiniMax Sparse Attention), a new sparse attention mechanism designed from scratch.
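To make the "double the context, quadruple the compute" point concrete, here is a back-of-the-envelope sketch that counts query-key pairs for full attention versus a block-sparse scheme where each query only looks at a fixed number of KV blocks. The `block_size` and `blocks_per_query` values are placeholders I picked for illustration, not M3's configuration, and the cost of routing itself is ignored.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full attention over n_tokens."""
    return n_tokens * n_tokens


def block_sparse_pairs(n_tokens: int, block_size: int, blocks_per_query: int) -> int:
    """Pairs scored when each query attends to a fixed number of KV blocks."""
    attended = min(n_tokens, block_size * blocks_per_query)
    return n_tokens * attended


# Hypothetical settings: 64-token blocks, 128 blocks selected per query.
for n in (256_000, 512_000, 1_000_000):
    full = full_attention_pairs(n)
    sparse = block_sparse_pairs(n, block_size=64, blocks_per_query=128)
    print(f"{n:>9,} tokens: full={full:.2e}  sparse={sparse:.2e}  ratio={full / sparse:.0f}x")
```

With a fixed block budget per query, the sparse count grows linearly with context length while the full count grows quadratically, which is the whole argument for sparse attention at 1M tokens.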
Here’s what makes it interesting:
- KV-block-based sparse routing. MSA partitions the key-value cache into blocks and routes queries only to the most relevant ones. Compared with earlier sparse approaches such as DSA or MoBA, MSA achieves finer-grained partitioning for better effective context coverage.
- Operator-level optimization. They use a “KV outer gather Q” approach - KV blocks serve as the outer loop, aggregating queries that hit each block. Each block is read exactly once with contiguous memory access. Under M3’s head configuration, this runs over 4x faster than open-source Flash-Sparse-Attention and flash-moba.
- Real throughput gains. At 1M context length, M3’s per-token compute is roughly 1/20th of the previous generation. Prefilling is 9x faster, decoding is 15x faster.
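The routing and "KV outer gather Q" ideas above can be sketched in plain Python. This is a toy reconstruction under my own assumptions, not MiniMax's kernel: blocks are scored by the dot product between a query and the block's mean key (a stand-in I chose, since the routing function isn't described here), causal masking is left out, and all function names are hypothetical. The part worth looking at is the loop order: the outer loop walks KV blocks, so each selected block is sliced once and reused for every query routed to it.

```python
import math
from collections import defaultdict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_queries(queries, keys, block_size, top_k):
    """For each query, pick the top_k KV blocks by similarity to the block's mean key."""
    n_blocks = math.ceil(len(keys) / block_size)
    summaries = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        dim = len(block[0])
        summaries.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    routes = []
    for q in queries:
        ranked = sorted(range(n_blocks), key=lambda b: dot(q, summaries[b]), reverse=True)
        routes.append(sorted(ranked[:top_k]))
    return routes


def sparse_attention_kv_outer(queries, keys, values, block_size, top_k):
    """Block-sparse softmax attention with KV blocks as the outer loop."""
    routes = route_queries(queries, keys, block_size, top_k)

    # Invert the routing table: block index -> queries that selected it.
    hits = defaultdict(list)
    for qi, blocks in enumerate(routes):
        for b in blocks:
            hits[b].append(qi)

    scale = 1.0 / math.sqrt(len(queries[0]))
    # Online-softmax state per query: running max, normalizer, weighted value sum.
    run_max = [-math.inf] * len(queries)
    denom = [0.0] * len(queries)
    acc = [[0.0] * len(values[0]) for _ in queries]

    for b in sorted(hits):  # each selected block is read exactly once
        k_block = keys[b * block_size:(b + 1) * block_size]
        v_block = values[b * block_size:(b + 1) * block_size]
        for qi in hits[b]:  # gather the queries that hit this block
            for k, v in zip(k_block, v_block):
                s = dot(queries[qi], k) * scale
                new_max = max(run_max[qi], s)
                rescale = math.exp(run_max[qi] - new_max)
                w = math.exp(s - new_max)
                denom[qi] = denom[qi] * rescale + w
                acc[qi] = [a * rescale + w * x for a, x in zip(acc[qi], v)]
                run_max[qi] = new_max

    return [[a / d for a in row] for row, d in zip(acc, denom)]
```

Because a query's blocks arrive at different points in the outer loop, the softmax has to be accumulated incrementally (running max plus normalizer) rather than computed in one pass. When `top_k` covers every block, the result is the same as dense attention, which makes a handy sanity check.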
The team tested MSA extensively and confirmed it matches full attention on the vast majority of capability dimensions. That’s a big deal - many sparse attention methods degrade on reasoning or retrieval. M3 doesn’t seem to have that trade-off.
Coding Benchmarks: Where M3 Actually Lands
Let’s cut to what developers care about. Here are M3’s numbers on the major coding benchmarks, based on MiniMax’s published evaluation methodology (tested on internal infrastructure with Claude Code as scaffolding, 4-run averages for key tests):
| Benchmark | MiniMax M3 | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 59.0% | 58.6% | 64.3% | 54.2% |
| Terminal-Bench 2.1 | 66.0% | 78.2% | 66.1% | 70.3% |
| SWE-fficiency | 34.8% | - | - | - |
| KernelBench Hard | 28.8% | - | - | - |
| MCP Atlas | 74.2% | 75.3% | 79.1% | 78.2% |
| BrowseComp | 83.5 | - | 79.3 | - |
Sources: MiniMax M3 technical report (minimax.io/blog/minimax-m3), Gemini 3.5 Flash model page (deepmind.google/models/gemini/flash). Terminal-Bench 2.1 scores for GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7 from official leaderboard; M3 tested on same infrastructure.
On SWE-Bench Pro, M3 edges past GPT-5.5 and Gemini 3.1 Pro. It’s behind Claude Opus 4.7 by about 5 percentage points - but Opus 4.7 costs $5/$25 per million input/output tokens through Anthropic’s API. M3’s token plan pricing starts at $20/month for ~1.7 billion tokens across all modalities. That’s a fundamentally different value proposition.
On Claw-Eval (a 161-task agent evaluation), M3 claims the highest score among all tested models. On SVG-Bench (SVG generation quality), M3 surpasses Claude Opus 4.7.
The pattern is clear: M3 doesn’t lead every benchmark, but it’s competitive with the frontier across the board - and frequently beats models that cost 10-50x more per token.
Agent Performance: Where M3 Shines
MiniMax built M3 specifically for agentic workloads, and it shows. The PostTrainBench result is the clearest signal:
PostTrainBench: M3 Trains Models Autonomously
MiniMax gave M3 four pretrained base models (no downstream capabilities) and 12 hours to autonomously complete data synthesis, training, evaluation, and iteration across five benchmarks (AIME2025, BFCL, GPQA Main, GSM8K, HumanEval). No human intervention.
The scores:
- Claude Opus 4.7: 42.4
- GPT-5.5: 39.3
- MiniMax M3: 37.1
- All other models: significantly lower
M3 landed at #3 overall - behind only Opus 4.7 and GPT-5.5. That puts it ahead of Gemini 3.1 Pro and every other model on an open-ended agentic research task. For an open-weight model, this is unprecedented.
12-Hour Paper Reproduction
The team asked M3 to independently reproduce an ICLR 2025 Outstanding Paper - “Learning Dynamics of LLM Finetuning.” Over 12 hours, M3:
- Produced 18 commits and 23 experimental figures
- Successfully replicated core experiments
- Used multimodal capabilities to parse charts, data, and formulas from the paper
- Fit paper + code + experiment logs into a single context window
- Ran concurrent execution threads
This isn’t cherry-picking. It’s a real reproduction of published research, and it worked.
CUDA Kernel Optimization: 9.4x Speedup
This is the most impressive demo. M3 was asked to optimize an FP8 GEMM kernel on NVIDIA Hopper GPUs, starting with only a task description and a non-runnable Triton skeleton. Over ~24 hours:
- 147 benchmark submissions
- 1,959 tool calls
- Hardware peak utilization: 7.6% → 71.3%
- 9.4x speedup with zero human intervention
What’s striking is persistence. Most models stopped making progress within 30 submissions and exited; Opus 4.7 and M3 were the exceptions. M3’s best solution came on submission 145 - it pushed through multiple performance plateaus.
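A quick cross-check of the reported figures, as plain arithmetic. I'm assuming the 9.4x speedup is measured as the ratio of the two peak-utilization numbers; the write-up doesn't say how it was computed, but the numbers line up.

```python
start_util = 7.6     # % of hardware peak at the start (from the list above)
final_util = 71.3    # % of hardware peak at the end
submissions = 147
tool_calls = 1_959

speedup = final_util / start_util
calls_per_submission = tool_calls / submissions

print(f"utilization ratio: {speedup:.1f}x")
print(f"tool calls per benchmark submission: {calls_per_submission:.1f}")
```

The ratio rounds to 9.4x, matching the headline, and the run averages roughly 13 tool calls per benchmark submission.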
Video Input: Native Multimodal Done Right
M3 is multimodal from step zero of training, not a vision encoder bolted onto a text model. MiniMax rebuilt their entire data pipeline to scale to 100T+ tokens with interleaved text-image-video data.
Video input specs:
- Up to 1,024 frames, sampled at 1 FPS
- Single-frame resolution: 336-672 pixels (long edge), configurable to 1,008
- Subtitles: Automatically interleaved into frames every 30 seconds
- Output tokens: Up to 16K-32K depending on benchmark config
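At 1 FPS, a 1,024-frame cap covers about 17 minutes of footage. The specs don't say what happens to longer videos, so the sketch below shows one client-side option: sample at 1 FPS while the clip fits, and fall back to evenly spaced frames once it doesn't. The function name and the uniform-sampling fallback are my assumptions, not documented M3 behavior.

```python
FPS = 1.0          # sampling rate from the specs above
MAX_FRAMES = 1024  # frame cap from the specs above


def sample_timestamps(duration_s: float, fps: float = FPS, max_frames: int = MAX_FRAMES) -> list:
    """Timestamps (in seconds) of the frames to extract from a video.

    Short clips are sampled at `fps`; longer clips are spread evenly
    across `max_frames` so the whole video is still covered.
    """
    if duration_s <= 0:
        return []
    wanted = int(duration_s * fps)
    n = max(1, min(wanted, max_frames))
    step = duration_s / n
    return [i * step for i in range(n)]


print(f"longest clip at native rate: {MAX_FRAMES / FPS / 60:.1f} minutes")
print(len(sample_timestamps(10 * 60)))      # 10-minute clip: one frame per second
print(len(sample_timestamps(2 * 60 * 60)))  # 2-hour clip: capped at MAX_FRAMES
```

For a two-hour recording this works out to one frame every seven seconds or so, which is worth knowing before you point it at footage where things happen quickly.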
On Video-MME (a standard video understanding benchmark), M3 scored 84.6 at 512 frames. That’s in the same tier as native multimodal models like Gemini. On OmniDocBench (document understanding), M3 scored above Gemini 3.1 Pro.
For developers building agents that need to watch screen recordings, parse video tutorials, or analyze dashcam footage, this is genuinely useful - and it’s available through a standard OpenAI-compatible API.
Pricing: The Token Plan Model
M3’s pricing is unusual. Instead of per-token API billing as the primary model, MiniMax pushes their Token Plan subscription:
| Tier | Price | Tokens/Month (M3) | Equivalent Documents |
|---|---|---|---|
| Plus | $20/mo | ~1.7B | ~110K long documents |
| Max | $50/mo | ~5.1B | ~330K long documents |
| Ultra | $120/mo | ~9.8B | ~640K long documents |
All tiers include text, image, speech, and music models in one shared token pool. M3 API access is also available with pay-per-token pricing:
- ≤512K input tokens: standard rate
- >512K input tokens: higher long-context rate (for full-repo code understanding, ultra-long document parsing)
Thinking mode (extended reasoning) and non-thinking mode share the same pricing. You toggle it at request time.
For comparison, Claude Opus 4.8 is $5/M input tokens + $25/M output tokens. GPT-5.5 is $5/M input + $30/M output. If you’re doing heavy agentic coding with thousands of tool calls per session, M3’s subscription model could save you hundreds of dollars a month.
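To put rough numbers on that, the sketch below derives an effective per-million-token price for each Token Plan tier from the table above and prices a sample workload at the per-token rates quoted here. The 90M-input / 10M-output monthly workload is a made-up example, and I'm assuming plan tokens and API tokens are counted the same way, which the pricing page may not guarantee.

```python
PLANS = {  # tier: (USD per month, approximate M3 tokens per month)
    "Plus": (20, 1.7e9),
    "Max": (50, 5.1e9),
    "Ultra": (120, 9.8e9),
}


def plan_cost_per_million(price_usd: float, tokens: float) -> float:
    """Effective USD per million tokens if the whole allowance is used."""
    return price_usd / (tokens / 1e6)


def per_token_bill(input_tokens: float, output_tokens: float,
                   in_rate: float, out_rate: float) -> float:
    """USD for a workload billed at per-million-token input/output rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate


for tier, (price, tokens) in PLANS.items():
    print(f"{tier}: ${plan_cost_per_million(price, tokens):.4f} per million tokens")

# Hypothetical monthly workload: 90M input tokens, 10M output tokens.
print(per_token_bill(90e6, 10e6, in_rate=5, out_rate=25))  # at $5/$25
print(per_token_bill(90e6, 10e6, in_rate=5, out_rate=30))  # at $5/$30
```

That workload fits comfortably inside the Plus allowance. One detail that falls out of the table's rounded numbers: Max is the cheapest tier per token, and Ultra comes out slightly more expensive per token than Plus.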
API Access & Tool Compatibility
M3 supports both Anthropic-compatible and OpenAI-compatible APIs. The recommended path is Anthropic-style (supports thinking blocks and interleaved thinking):
```python
import anthropic

# Point the Anthropic SDK at MiniMax's Anthropic-compatible endpoint.
client = anthropic.Anthropic(
    base_url="https://api.minimax.io/anthropic",
    api_key="<MINIMAX_API_KEY>",  # replace with your MiniMax API key
)

message = client.messages.create(
    model="MiniMax-M3",
    max_tokens=1000,
    messages=[{"role": "user", "content": "Write a Python quicksort"}],
)
```
M3 is officially supported in: Claude Code, Cursor, Cline, Kilo Code, Roo Code, Codex CLI, OpenCode, Droid, TRAE, and Grok CLI. MiniMax provides setup guides for each tool on their platform docs.
MiniMax Code
MiniMax also ships their own desktop agent app, MiniMax Code (codenamed “Mavis” - MiniMax as a Jarvis). Key features:
- Agent Teams: Leader-Worker-Verifier architecture for multi-agent collaboration. Tasks split into parallel sub-tasks with adversarial quality gates.
- Computer Use: M3’s multimodal capabilities power desktop automation. You can say “open my ERP client and batch-enter invoice information from this spreadsheet” from your phone, and MiniMax Code executes it on your computer.
- Long-running execution: Can run autonomously for days with the Producer + Verifier harness loop continuously producing, reflecting, and self-correcting.
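MiniMax hasn't published the Agent Teams internals in what I've read, so treat the following as a generic illustration of a Leader-Worker-Verifier loop rather than how MiniMax Code works. `run_team` and its callback parameters are hypothetical names: a leader splits the task, workers handle sub-tasks in parallel, and a verifier gates each result, sending rejected sub-tasks back for another attempt.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_team(task: str,
             split: Callable[[str], list],
             work: Callable[[str], str],
             verify: Callable[[str, str], bool],
             max_attempts: int = 3) -> dict:
    """Split a task, run sub-tasks in parallel, and accept only verified results."""
    pending = list(split(task))  # leader: break the task into sub-tasks
    accepted = {}
    for _ in range(max_attempts):
        if not pending:
            break
        with ThreadPoolExecutor() as pool:  # workers: run sub-tasks concurrently
            results = list(pool.map(work, pending))
        retry = []
        for sub, result in zip(pending, results):
            if verify(sub, result):  # verifier: quality gate on each result
                accepted[sub] = result
            else:
                retry.append(sub)  # rejected work goes back to a worker
        pending = retry
    if pending:
        raise RuntimeError(f"unverified sub-tasks after {max_attempts} attempts: {pending}")
    return accepted


# Toy usage: "work" upper-cases each word and the verifier checks it.
print(run_team("fix tests docs",
               split=str.split,
               work=str.upper,
               verify=lambda sub, result: result == sub.upper()))
```

In a real harness `work` and `verify` would each be a model call, ideally with the verifier given a different prompt (or model) from the worker so it isn't grading its own homework.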
MiniMax Code is built on OpenCode and Pi (both open-source projects) and will itself be open-sourced alongside M3’s model weights.
MiniMax M3 vs the Competition
vs Gemini
Gemini 3.5 Flash was announced around the same time (June 2026) and leads on several benchmarks - 76.2% on Terminal-Bench 2.1 vs M3’s 66.0%. But Gemini 3.5 Flash is available only through Google’s API with per-token pricing. M3’s subscription model and open-weight status give it different strengths. For developers who need an open model they can fine-tune or deploy on private infrastructure, M3 is the clear choice. For those who just want the highest benchmark numbers and don’t mind API lock-in, Gemini 3.5 Flash remains formidable.
vs Claude
Claude Opus 4.7 / 4.8 remain the benchmark kings for agentic coding (64.3% on SWE-Bench Pro). But M3 is genuinely competitive - and on some agent-specific evaluations like Claw-Eval, M3 actually wins. The pricing gap is enormous: Opus 4.7 at $5/$25 per million tokens vs M3’s $20/month for ~1.7B tokens. For high-volume coding agents, M3 could be 10-50x cheaper.
vs GPT
GPT-5.5 ($5/$30 per million tokens) leads on several reasoning and coding benchmarks. But again, M3 matches or approaches GPT-5.5 on many agent-specific tasks (PostTrainBench 37.1 vs 39.3). The gap is small enough that cost and open-weight access become the deciding factors.
vs Kimi
Kimi K2.6 (Moonshot AI) supports a 256K context window with multimodal input (text, image, video). M3’s 1M context is 4x larger, and M3’s open-weight release plan gives developers more flexibility. Kimi K2.6’s API is China-focused with Chinese-language documentation, while M3 is positioned globally with English-first docs and tooling integrations.
Context Window Comparison
| Model | Context Window | Multimodal | Open Weights |
|---|---|---|---|
| MiniMax M3 | 1,000,000 tokens | ✅ Text, Image, Video | ✅ (announced) |
| Claude Opus 4.8 | 1,000,000 tokens | ✅ Text, Image | ❌ |
| Gemini 3.5 Flash | 1,000,000 tokens | ✅ Text, Image, Video, Audio | ❌ |
| GPT-5.5 | 1,000,000 tokens | ✅ Text, Image | ❌ |
| Kimi K2.6 | 256,000 tokens | ✅ Text, Image, Video | ❌ |
| DeepSeek-V4-pro | 128,000 tokens | ❌ (text only) | ✅ |
Long-Context Stability
M3’s 1M context isn’t just a spec sheet number. The MSA architecture was specifically designed to avoid the degradation that plagues many long-context models. MiniMax ran ablations showing MSA matches full attention on most capabilities, which means the context quality stays high even at extreme lengths.
On LOCA-Bench 256K (an environment description length benchmark), M3 was tested with the official react mode. The BrowseComp score of 83.5 (beating Claude Opus 4.7’s 79.3) also demonstrates strong autonomous browsing with effective long-context utilization - when token usage exceeds 64K, M3 discards history efficiently and continues.
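The "discard history past 64K" behavior is easy to approximate in your own agent loop. The sketch below is a generic oldest-first trimming strategy, not a description of what MiniMax's BrowseComp harness does: `trim_history` is a hypothetical helper, "64K" is read as 64,000 tokens, and the default token counter is a crude four-characters-per-token estimate that you would swap for a real tokenizer.

```python
def trim_history(messages, budget_tokens=64_000,
                 count_tokens=lambda m: len(m["content"]) // 4):
    """Drop the oldest non-system turns until the estimated total fits the budget.

    System messages are always kept, and at least the most recent turn survives
    even if it alone exceeds the budget.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    total = sum(count_tokens(m) for m in system + turns)
    while len(turns) > 1 and total > budget_tokens:
        total -= count_tokens(turns.pop(0))  # discard the oldest turn first
    return system + turns


history = [
    {"role": "system", "content": "You are a browsing agent."},
    {"role": "user", "content": "Find the launch date."},
    {"role": "assistant", "content": "Searching... " + "page text " * 50_000},
    {"role": "user", "content": "Now summarize what you found."},
]
print([m["role"] for m in trim_history(history)])
```

A production version would more likely summarize the dropped turns than delete them outright, but the budget check is the same.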
For developers building agents that need to maintain context over thousands of tool calls, this matters. You can keep entire codebases, full documentation sets, and extensive conversation histories in context without the model losing the plot.
What’s Missing
No model is perfect. Here’s what I’d flag:
- Third-party benchmarks are thin. Almost all the published numbers come from MiniMax’s own evaluation infrastructure. Independent verification on standard benchmarks (like the LMSys Chatbot Arena or Artificial Analysis) isn’t available yet. The model launched four days ago, so this will change.
- Model weights aren’t public yet. MiniMax says M3 will be open-sourced on Hugging Face and GitHub “in the coming days.” Until the weights drop, “open-weight” is a promise, not a reality. A full technical report beyond the launch write-up is also pending.
- Video input has limits. 1 FPS and 512-1,024 frame caps mean you can’t do real-time video analysis. For dashcam or surveillance applications that need higher frame rates, this is a constraint.
- No audio input. Unlike Gemini, M3 doesn’t process audio natively. You’ll need a separate speech-to-text pipeline.
- Thinking mode toggles at request time but lacks granularity. You can flip thinking on or off, but there’s no effort dial like Claude’s adaptive thinking or GPT’s reasoning effort levels.
Who Should Use MiniMax M3?
- AI coding agent developers who run thousands of tool calls per session and need the economics of a subscription model rather than per-token billing
- Researchers and labs that need an open-weight frontier model they can fine-tune or deploy on private infrastructure
- Video understanding applications - dashcam analysis, tutorial parsing, screen recording interpretation - where native multimodal input eliminates complex preprocessing pipelines
- Long-document analysis where the 1M context window means you can fit entire books, codebases, or regulatory filings without chunking
- Anyone running Claude Code, Cursor, or other AI coding tools who wants a high-capability alternative to Claude or GPT at a fraction of the cost
Who Should Skip It?
- Latency-sensitive chat applications - M3 isn’t optimized for sub-second responses in conversational UIs
- Audio-heavy workflows - use Gemini or a dedicated speech model instead
- Teams that need a proven, battle-tested API - M3 launched four days ago, and MiniMax’s API infrastructure hasn’t been stress-tested at the scale of OpenAI or Anthropic
The Bottom Line
MiniMax M3 is the most ambitious open-weight model released in 2026. It’s not the best at everything - but it doesn’t need to be. What makes M3 compelling is the combination: frontier-competitive coding, genuinely useful 1M context, and native multimodal input - all in a model you’ll be able to run on your own hardware.
The pricing is aggressive. $20/month for ~1.7 billion tokens of a model that competes with Claude Opus and GPT-5.5 on agentic tasks is a deal that fundamentally changes the economics of AI coding agents. The token plan model - where you subscribe once and get access to LLM, video, speech, and music models - is a refreshing break from per-token metering.
If the open-source release delivers on MiniMax’s promises, M3 could become the default foundation for a generation of AI coding tools. It’s already supported in Claude Code, Cursor, Cline, and Kilo Code - the most popular agentic coding platforms. That integration footprint alone makes it worth testing.
M3 isn’t a ChatGPT-killer or a Claude-replacement. It’s a different thing: a capable, open, affordable model optimized for the workflows developers actually run. In a market of closed-source black boxes with opaque per-token pricing, that’s genuinely refreshing.
Frequently Asked Questions
Is MiniMax M3 open source? MiniMax has announced M3 will be released as open-weight on Hugging Face and GitHub. As of June 5, 2026, the weights are not yet public but are expected within days.
How does M3 compare to GPT-5? M3 competes closely with GPT-5.5 on agentic tasks (PostTrainBench: 37.1 vs 39.3) and beats it on SWE-Bench Pro (59.0 vs 58.6). Pricing is dramatically different: M3 starts at $20/month for ~1.7B tokens vs GPT-5.5 at $5/$30 per million tokens.
Can M3 process video? Yes. M3 supports video input at 1 FPS with up to 1,024 frames, single-frame resolution of 336-1,008 pixels. It scored 84.6 on Video-MME at 512 frames.
What’s the context window on M3? 1 million tokens with a guaranteed minimum of 512K tokens. The MSA (MiniMax Sparse Attention) architecture enables this without quadratic compute scaling.
How do I access MiniMax M3? Through the MiniMax API (api.minimax.io) using either Anthropic-compatible or OpenAI-compatible endpoints, or through a Token Plan subscription. M3 is also available in Claude Code, Cursor, Cline, Kilo Code, and other AI coding tools.
Sources:
- MiniMax M3 Official Technical Report - minimax.io/blog/minimax-m3
- MiniMax M3 Model Page - minimax.io/models/text/m3
- MiniMax API Documentation - platform.minimax.io/docs/guides/text-generation
- Gemini 3.5 Flash Model Page - deepmind.google/models/gemini/flash
- Anthropic Claude Models Overview - docs.anthropic.com/en/docs/about-claude/models
- OpenAI Models Page - platform.openai.com/docs/models
- Kimi K2.6 API Documentation - platform.kimi.com/docs
- MiniMax Agent Team Blog Post - minimax.io/blog/minimax-agent-team-long-running-1779893953
- MiniMax Hugging Face Organization - huggingface.co/MiniMaxAI
- MiniMax Token Plan - platform.minimax.io/subscribe/token-plan