NVIDIA Nemotron 3 Ultra Review: Pricing, 1M Context, Benchmarks, and Use Cases
NVIDIA dropped Nemotron 3 Ultra on June 4, 2026, and it’s already the most capable open-weight AI model from a US company. This isn’t just another LLM launch. It’s a 550-billion-parameter monster with a million-token context window, a hybrid Mamba-Transformer architecture, and genuinely open weights under the permissive OpenMDW-1.1 license.
But here’s what I wanted to know: does it actually deliver on the hype? I spent the last day digging through everything - official model cards, third-party benchmarks from Artificial Analysis, deployment guides, and hands-on reports. Here’s what I found.
What Is NVIDIA Nemotron 3 Ultra?
Nemotron 3 Ultra (full designation: NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4) is NVIDIA’s flagship LLM. It sits at the top of the Nemotron 3 family, above the Super (120B total / 12.7B active) and Nano (31.6B total / 3.6B active) tiers.
The headline numbers:
- 550 billion total parameters - but only 55 billion active per forward pass thanks to its Mixture-of-Experts (MoE) design
- 1 million token context window - that’s enough to ingest entire codebases, multi-day conversations, or 800-page technical manuals in one go
- Hybrid LatentMoE architecture - interleaves Mamba-2 state-space layers, MoE transformer layers, and standard attention layers
- Multi-Token Prediction (MTP) - predicts multiple future tokens at each step, which enables native speculative decoding and faster inference
- NVFP4 training - quantization-aware pre-training that keeps the model computationally efficient without cratering accuracy
NVIDIA trained it on approximately 20 trillion tokens across crawled and synthetic data spanning code, math, science, and general knowledge. The post-training data cutoff is May 2026, making it one of the freshest frontier models available right now.
The Nemotron 3 Family at a Glance
| Model | Total Params | Active Params | Context | Best For |
|---|---|---|---|---|
| Nemotron 3 Ultra | 550B | 55B | 1M tokens | Frontier reasoning, agentic workflows, datacenter |
| Nemotron 3 Super | 120.6B | 12.7B | 1M tokens | Production deployments, single-GPU (NVFP4 on B200) |
| Nemotron 3 Nano | 31.6B | 3.6B | 1M tokens | Resource-constrained environments, edge deployment |
| Nemotron 3 Nano Omni | 30B | 3B | 262K | Multimodal (text, image, video, audio) |
Pricing and Availability: Free Tier, Self-Hosted, and Partner Endpoints
This is where things get interesting - and surprisingly accessible for a model this size.
Free API Access
NVIDIA offers a free prototyping tier through build.nvidia.com. You sign up, grab an API key, and start hitting the endpoint at https://integrate.api.nvidia.com/v1 using an OpenAI-compatible client. The free tier is rate-limited and subject to NVIDIA’s API Trial Terms of Service, but it’s enough to kick the tires and prototype seriously.
Here’s what a basic call looks like:
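import os
from openai import OpenAI

# Read the key from the environment instead of hard-coding it.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    temperature=1,
    top_p=0.95,
    max_tokens=16384,
    # Reasoning mode is switched on through the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    stream=True,
)

# stream=True returns an iterator of chunks; print the text as it arrives.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")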
The model supports reasoning mode on/off via enable_thinking, a medium-effort reasoning toggle, and a reasoning_budget parameter to cap thinking tokens. More on that later.
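To make the reasoning controls a little more concrete, here is a minimal sketch of a helper that builds the `extra_body` argument for a call like the one above. Only `enable_thinking` inside `chat_template_kwargs` is confirmed by that example. Passing `reasoning_budget` in the same dictionary is an assumption on my part, and `build_extra_body` is a hypothetical helper name, so check the model card for the exact placement before relying on it.

```python
def build_extra_body(enable_thinking=True, reasoning_budget=None):
    """Build the extra_body dict for an OpenAI-compatible chat call.

    ASSUMPTION: reasoning_budget is passed next to enable_thinking in
    chat_template_kwargs; the review only confirms enable_thinking there.
    """
    kwargs = {"enable_thinking": enable_thinking}
    if reasoning_budget is not None:
        if not enable_thinking:
            raise ValueError("reasoning_budget only applies when thinking is enabled")
        if reasoning_budget <= 0:
            raise ValueError("reasoning_budget must be a positive token count")
        kwargs["reasoning_budget"] = reasoning_budget
    return {"chat_template_kwargs": kwargs}


fast = build_extra_body(enable_thinking=False)
capped = build_extra_body(enable_thinking=True, reasoning_budget=2048)
print(fast)
print(capped)
```

The returned dictionary would be passed as `extra_body=...` to `client.chat.completions.create`.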
Partner Endpoints
If you need production-grade throughput, partner providers offer hosted inference. On DeepInfra, Nemotron 3 Ultra reportedly delivers 300+ tokens per second - fast enough for interactive use with a model this large. By comparison, similarly sized models from DeepSeek and Moonshot AI typically manage 50-100 tok/s on the same provider.
Other partner endpoints include OpenRouter and NVIDIA’s own production NIM offerings, which scale to multi-node configurations for enterprise deployments.
Self-Hosting Requirements
Self-hosting is where the hardware appetite becomes real. The minimum recommended setup:
- Single-node: 4× NVIDIA B200 GPUs (the NVFP4 checkpoint fits weights plus KV cache with headroom)
- Multi-node: 4+ GPUs across GB200 or GB300 systems
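- Alternative: 8× H100 GPUs per node for the BF16 flavor (550B parameters at 2 bytes each is about 1.1 TB of weights, more than the 640 GB on a single 8× H100 80GB node, so plan for more than one node)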
For the NVFP4 quantized variant, NVIDIA publishes deployment commands for vLLM (v0.22.0), SGLang (v0.5.11), and TensorRT-LLM (release 1.3.0rc17). Multi-node deployments use Ray for distributed orchestration. The NVFP4 checkpoint with 4× B200 can serve up to 256K context out of the box; bump it to 1M by setting VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and --max-model-len 1048576.
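A quick back-of-envelope check helps put the hardware list in perspective. The sketch below estimates weight storage only; it ignores KV cache, activations, and any layers kept in higher precision. The 4.5 bits-per-weight figure for NVFP4 and the per-GPU memory sizes are assumptions, not numbers from NVIDIA's deployment guide.

```python
GB = 1e9  # decimal gigabytes, as used on GPU spec sheets


def weight_footprint_gb(total_params, bits_per_weight):
    """Approximate weight storage in GB for a given average bit width."""
    return total_params * bits_per_weight / 8 / GB


TOTAL_PARAMS = 550e9  # 550B total parameters, from the model name

# ASSUMPTION: NVFP4 stores 4-bit values plus one 8-bit scale per 16-value
# block, which averages out to about 4.5 bits per weight.
nvfp4_gb = weight_footprint_gb(TOTAL_PARAMS, 4.5)
bf16_gb = weight_footprint_gb(TOTAL_PARAMS, 16)

# ASSUMPTION: per-GPU memory sizes; check the spec sheet for your exact SKU.
b200_node_gb = 4 * 180
h100_node_gb = 8 * 80

print(f"NVFP4 weights: {nvfp4_gb:.0f} GB vs {b200_node_gb} GB on 4x B200")
print(f"BF16 weights:  {bf16_gb:.0f} GB vs {h100_node_gb} GB on 8x H100")
```

Under these assumptions the NVFP4 weights leave a few hundred gigabytes of headroom on a 4× B200 node, while BF16 weights alone exceed what one 8× H100 node holds.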
License
The model is released under the OpenMDW License Agreement, version 1.1 - a permissive, model-first open license developed alongside the Linux Foundation’s Model Openness Framework (MOF). It explicitly grants rights to use, modify, and redistribute the model weights, training data, and associated software for both commercial and non-commercial purposes. This is genuinely open source, not “open weights with restrictions.”
The 1M Token Context Window: What It Actually Means
A million-token context window sounds cool, but does it work in practice? NVIDIA published two key long-context benchmarks that suggest it does:
- RULER 1M: 94.7% (BF16) / 94.0% (NVFP4) - this is the standard needle-in-a-haystack-style test for long-context retrieval accuracy. Scores above 90% mean the model can reliably find and reason about information buried deep in a million-token document.
- AA-LCR (Artificial Analysis Long-Context Reasoning): 65.4% / 65.5% - this test evaluates reasoning quality over long contexts, not just retrieval. A score of 65 is solid for frontier models.
I want to flag something important: the 1M context capability relies on the Mamba-2 state-space layers in the hybrid architecture. In a pure transformer, every layer keeps a KV cache that grows linearly with sequence length (while attention compute grows quadratically), so memory balloons at extreme lengths. Mamba’s fixed-size state representation keeps memory manageable, and only the interleaved attention layers pay the KV-cache cost. This is the architectural secret sauce that makes 1M context feasible on 4 GPUs.
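The memory argument can be made concrete with the standard KV-cache size formula (keys and values, per attention layer, per token). This is a rough sketch: the layer counts, head count, and head dimension below are hypothetical values chosen for illustration, not Nemotron 3 Ultra's real configuration.

```python
def kv_cache_gb(seq_len, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size in GB for one sequence across all attention layers."""
    values = 2 * attn_layers * kv_heads * head_dim * seq_len  # 2 = keys and values
    return values * bytes_per_value / 1e9


# HYPOTHETICAL shapes for illustration only.
KV_HEADS, HEAD_DIM = 8, 128
PURE_TRANSFORMER_LAYERS = 96   # every layer keeps a KV cache
HYBRID_ATTENTION_LAYERS = 12   # only the interleaved attention layers do

for seq_len in (128_000, 1_000_000):
    pure = kv_cache_gb(seq_len, PURE_TRANSFORMER_LAYERS, KV_HEADS, HEAD_DIM)
    hybrid = kv_cache_gb(seq_len, HYBRID_ATTENTION_LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{seq_len:>9} tokens: pure {pure:7.1f} GB, hybrid {hybrid:6.1f} GB")
```

The cache scales linearly with both sequence length and the number of attention layers, which is why swapping most attention layers for state-space layers matters so much at 1M tokens.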
Here’s what 1M tokens enables in practice:
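- Full codebase analysis - load a whole small-to-mid-sized repository and ask questions about cross-file dependencies (1M tokens is only a few megabytes of source text, so a 50,000-file monorepo will not fit in one prompt)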
- Multi-document legal review - ingest hundreds of contracts simultaneously and find conflicts across them
- Long-running agent sessions - maintain full conversation history plus tool output traces without summarization or truncation
- Scientific literature synthesis - load dozens of full-text papers and reason across them holistically
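Before committing to any of these, it is worth checking whether your material actually fits. The sketch below estimates a directory's token count from its character count. The four-characters-per-token ratio is a rough heuristic that varies by tokenizer and content, and both function names are hypothetical helpers rather than part of any NVIDIA tooling.

```python
import os

CHARS_PER_TOKEN = 4          # ASSUMPTION: rough heuristic, varies by tokenizer
CONTEXT_LIMIT = 1_000_000    # tokens, from the headline spec


def estimate_repo_tokens(root, extensions=(".py", ".md", ".txt")):
    """Walk a directory and estimate tokens from the total character count."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(token_estimate, reserve_for_output=16_384):
    """Leave room for the reply when checking against the context limit."""
    return token_estimate + reserve_for_output <= CONTEXT_LIMIT


print(fits_in_context(estimate_repo_tokens(".")))
```

For a real decision, swap the heuristic for the model's own tokenizer, since code and non-English text can tokenize quite differently.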
For coding agents specifically, NVIDIA ships an OpenCode integration config that maps the 1M context limit directly into terminal-based development workflows, working with vLLM, SGLang, and TRT-LLM backends.
Detailed Benchmarks: Coding, Reasoning, Math, and Agentic Tasks
NVIDIA released a comprehensive benchmark suite using their NeMo Evaluator SDK. Here’s the full breakdown for both BF16 precision and the more practical NVFP4 variant:
Agentic Benchmarks
| Benchmark | BF16 Score | NVFP4 Score | What It Measures |
|---|---|---|---|
| SWE-Bench Verified | 71.9% | 69.7% | Real-world GitHub issue resolution |
| SWE-Bench Multilingual | 67.7% | 65.8% | Multi-language code fixes |
| Terminal Bench 2.1 | 56.4% | 53.9% | Terminal-based agent tasks |
| PinchBench | 90.0% | 89.8% | Pinch-point agent reasoning |
| GDPVal | 46.7% | 47.9% | Office deliverable generation |
| ProfBench (Search) | 56.0% | 56.4% | Professional search tasks |
| TauBench V3 (avg) | 70.9% | 70.3% | Multi-domain tool use |
| BrowseComp | 44.4% | 41.4% | Web browsing comprehension |
The SWE-Bench Verified score of 71.9% is the standout here. For context, this means the model can independently resolve nearly 72% of the benchmark’s real GitHub issues end-to-end when deployed as a coding agent using the OpenHands scaffold. The NVFP4 quantized version loses only about 2 percentage points (71.9% to 69.7%) - remarkably small degradation for a checkpoint that runs on half as many GPUs.
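To see how the quantization cost varies across the whole agentic table rather than just SWE-Bench, here is a small sketch that computes the NVFP4-minus-BF16 difference per benchmark. The scores are copied from the table above; the helper name is my own.

```python
# (BF16, NVFP4) scores in percent, copied from the agentic benchmark table.
AGENTIC_SCORES = {
    "SWE-Bench Verified": (71.9, 69.7),
    "SWE-Bench Multilingual": (67.7, 65.8),
    "Terminal Bench 2.1": (56.4, 53.9),
    "PinchBench": (90.0, 89.8),
    "GDPVal": (46.7, 47.9),
    "ProfBench (Search)": (56.0, 56.4),
    "TauBench V3 (avg)": (70.9, 70.3),
    "BrowseComp": (44.4, 41.4),
}


def quantization_deltas(scores):
    """Return NVFP4 minus BF16 in percentage points for each benchmark."""
    return {name: round(nvfp4 - bf16, 1) for name, (bf16, nvfp4) in scores.items()}


deltas = quantization_deltas(AGENTIC_SCORES)
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:24s} {delta:+.1f} pp")

mean_delta = sum(deltas.values()) / len(deltas)
print(f"mean change: {mean_delta:+.2f} pp")
```

The spread is uneven: a couple of benchmarks tick up slightly under NVFP4 while BrowseComp takes the largest hit, so it is worth checking the row closest to your workload rather than the average.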
Reasoning and Knowledge
| Benchmark | BF16 Score | NVFP4 Score |
|---|---|---|
| IOI 2025 | 570.0 | 564.7 |
| GPQA Diamond (no tools) | 87.0% | 87.9% |
| SciCode (subtask) | 44.6% | 43.5% |
| HLE (no tools) | 26.7% | 26.1% |
| CritPt (no tools) | 3.1% | 3.4% |
| OmniScience Accuracy | 24.1% | 24.6% |
The GPQA Diamond score of 87% is genuinely competitive with the best closed models. GPQA (Graduate-Level Google-Proof Q&A) evaluates PhD-level science reasoning across biology, physics, and chemistry, and anything above 80% puts you in frontier territory. The International Olympiad in Informatics (IOI) score of 570 is also impressive - these are genuinely hard competitive programming problems.
Humanity’s Last Exam (HLE) at 26.7% and CritPt at 3.1% show that there’s still significant room for improvement on the hardest reasoning benchmarks. These are designed to be extremely difficult, and even the best models rarely crack 30% on HLE.
Chat, Instruction Following, and Long Context
| Benchmark | BF16 | NVFP4 |
|---|---|---|
| IFBench (prompt) | 81.7% | 82.3% |
| AA-LCR | 65.4% | 65.5% |
| RULER 1M | 94.7% | 94.0% |
IFBench at 81.7% confirms strong instruction-following capabilities - the model reliably does what you ask, even with complex, nuanced prompts.
Third-Party Rankings
Independent benchmark aggregator Artificial Analysis scores Nemotron 3 Ultra at 48 points on their intelligence index, making it the highest-rated open US model. It outpaces:
- Google Gemma 4 31B: 39 points
- Nemotron 3 Super: 36 points
- OpenAI GPT-OSS-120B: 33 points
However, it is not the top-rated open model overall. Kimi K2.6, from China’s Moonshot AI, scores 54 points on the same index, and Anthropic’s closed Claude Opus 4.8 hits 61 points.
Artificial Analysis also places Nemotron 3 Ultra in their “most attractive quadrant” - combining high intelligence with fast output speed (300+ tok/s on DeepInfra), where competing models at similar capability levels typically deliver 50-100 tok/s.
Architecture Deep-Dive: LatentMoE, Mamba-2, and MTP
The hybrid architecture deserves a closer look because it’s genuinely novel and explains much of the performance profile.
LatentMoE
Traditional MoE models route tokens to experts in the original high-dimensional token space. LatentMoE first projects tokens into a smaller latent dimension, routes them to experts there, then projects back. This improves accuracy-per-compute-byte - you get better expert selection because the routing happens in a more meaningful representation space.
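The down-project, route, up-project flow is easier to follow in code. The following is a toy sketch in plain Python under heavy simplifications: the experts are single linear maps rather than MLPs, the weights are random, and the dimensions, class name, and top-k value are all hypothetical. It illustrates the data flow described above, not NVIDIA's implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random weights scaled so outputs stay at a reasonable magnitude."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(cols)) for _ in range(cols)]
            for _ in range(rows)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class ToyLatentMoE:
    """Down-project, route and run experts in latent space, then up-project."""

    def __init__(self, d_model=16, d_latent=4, n_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.down = random_matrix(d_latent, d_model, rng)   # d_model -> d_latent
        self.up = random_matrix(d_model, d_latent, rng)     # d_latent -> d_model
        self.router = random_matrix(n_experts, d_latent, rng)
        # Toy experts: one linear map each (real experts are small MLPs).
        self.experts = [random_matrix(d_latent, d_latent, rng)
                        for _ in range(n_experts)]

    def route(self, latent):
        """Pick the top-k experts from scores computed on the latent vector."""
        scores = matvec(self.router, latent)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        weights = softmax([scores[i] for i in chosen])
        return chosen, weights

    def forward(self, token):
        latent = matvec(self.down, token)
        chosen, weights = self.route(latent)
        mixed = [0.0] * len(latent)
        for index, weight in zip(chosen, weights):
            expert_out = matvec(self.experts[index], latent)
            mixed = [m + weight * e for m, e in zip(mixed, expert_out)]
        return matvec(self.up, mixed), chosen


layer = ToyLatentMoE()
token = [0.1 * i for i in range(16)]
output, chosen = layer.forward(token)
print(len(output), chosen)
```

Note that the router and expert matrices are sized by `d_latent` rather than `d_model`, which is where the savings in this toy version come from.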
Mamba-2 + Attention Interleaving
Most of the model uses Mamba-2 state-space layers, which process sequences with a fixed-size recurrent state, so their memory use does not grow with context length. Select attention layers are interleaved for tasks where full quadratic attention helps - like long-range dependency tracking and complex reasoning. This hybrid design is what makes the 1M context window practical on finite hardware.
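A toy recurrence shows why a state-space layer's memory stays flat. The sketch below is a heavily simplified diagonal linear recurrence, not the Mamba-2 kernel; the decay values and state size are arbitrary choices for illustration.

```python
def ssm_scan(inputs, decay=0.9, state_size=4):
    """Run a toy diagonal linear recurrence: h = a * h + x, y = sum(h).

    The state holds `state_size` numbers no matter how long `inputs` is.
    This is a simplification for illustration, not the Mamba-2 kernel.
    """
    state = [0.0] * state_size
    outputs = []
    for x in inputs:
        # Each state channel k uses its own decay rate, decay ** (k + 1).
        state = [(decay ** (k + 1)) * h + x for k, h in enumerate(state)]
        outputs.append(sum(state))
    return outputs, state


def attention_cache_entries(seq_len):
    """An attention layer keeps one key/value entry per past token."""
    return seq_len


for seq_len in (10, 1_000, 100_000):
    _, final_state = ssm_scan([1.0] * seq_len)
    print(seq_len, len(final_state), attention_cache_entries(seq_len))
```

The middle column stays constant while the last one grows with the sequence, which is the trade the hybrid design exploits by keeping only a few attention layers.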
Multi-Token Prediction (MTP)
MTP predicts the next 5 tokens at each step using shared-weight prediction heads (not independently trained offset heads). At inference time, this enables native speculative decoding - the model drafts multiple tokens in parallel, then verifies them in a single forward pass. The result: roughly 2-3× faster token generation compared to standard autoregressive decoding, with no separate draft model needed.
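The draft-then-verify loop can be sketched with a toy model. Here `target_next` stands in for the full model's greedy choice and `draft_tokens` stands in for the MTP heads; both are hypothetical stand-ins built from simple arithmetic, with errors injected on purpose. The sketch uses greedy verification and omits the bonus token that real implementations usually emit when a whole draft is accepted.

```python
def target_next(context):
    """Stand-in for the full model's greedy next token (toy arithmetic rule)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_tokens(context, k):
    """Stand-in for MTP heads: guesses k tokens, deliberately wrong now and then."""
    guesses, ctx = [], list(context)
    for _ in range(k):
        guess = target_next(ctx)
        if len(ctx) % 4 == 3:          # inject an error at every fourth position
            guess = (guess + 1) % 50
        guesses.append(guess)
        ctx.append(guess)
    return guesses


def speculative_generate(prompt, n_tokens, k=5):
    """Greedy speculative decoding: keep the longest verified prefix of each draft."""
    context, passes = list(prompt), 0
    while len(context) - len(prompt) < n_tokens:
        draft = draft_tokens(context, k)
        passes += 1                     # one verification pass scores all k drafts
        for guess in draft:
            expected = target_next(context)
            context.append(expected)    # the target's token is always what is kept
            if guess != expected:
                break                   # first mismatch: discard the rest of the draft
    return context[len(prompt):len(prompt) + n_tokens], passes


def plain_generate(prompt, n_tokens):
    """Standard autoregressive decoding: one target pass per token."""
    context = list(prompt)
    for _ in range(n_tokens):
        context.append(target_next(context))
    return context[len(prompt):]


prompt = [1, 2, 3]
spec_out, passes = speculative_generate(prompt, 20)
print(spec_out == plain_generate(prompt, 20), passes)
```

The output matches plain decoding token for token, since every kept token is the target's own choice; the gain is purely in needing fewer target passes, and it depends on how often the drafts are right.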
Training Pipeline
The full training recipe - released openly - follows four stages:
- Pre-training: ~20T tokens on crawled + synthetic code, math, science, and general knowledge data. Uses NVFP4 quantization-aware training for efficiency
- Supervised Fine-Tuning: Synthetic code, math, science, tool calling, instruction following, and long-context retrieval data
- Reinforcement Learning: Multi-environment asynchronous GRPO (Group Relative Policy Optimization) across math, code, science, instruction following, multi-step tool use, multi-turn conversations, and structured output environments. Uses decoupled training/inference on separate GPU sets
- Multi-Domain On-Policy Distillation (MOPD): Strong teacher models guide training on the model’s own generated rollouts, improving reasoning across domains while staying efficient
All four stages are documented and reproducible using the NVIDIA Nemotron Developer Repository and NeMo Evaluator SDK.
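The "group relative" part of GRPO in the reinforcement learning stage is simple to illustrate: several rollouts are sampled for the same prompt, and each one's reward is normalized against its own group instead of a learned value function. The sketch below shows one common formulation of that normalization; the reward values are made up, and NVIDIA's released recipe may differ in details such as the choice of standard deviation or epsilon.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# HYPOTHETICAL rewards for 8 rollouts of one prompt (1.0 = test passed).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward {reward:.1f} -> advantage {advantage:+.3f}")
```

Rollouts that beat their group average get a positive advantage and the rest get a negative one; if every rollout scores the same, all advantages are zero and that prompt contributes no learning signal.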
How Nemotron 3 Ultra Compares: Worth It vs Other Options?
Let me give you the straight answer based on what I’ve seen.
vs. Closed Frontier Models (Opus 4.8, GPT-5.x)
If raw capability is your only metric, closed models still win. Opus 4.8 scores 61 on the Artificial Analysis intelligence index (abbreviated AIQ below) vs Nemotron’s 48. That’s a meaningful gap for the hardest reasoning tasks.
But here’s the trade-off: Nemotron 3 Ultra’s weights are free to download and self-host, there is no per-token pricing, and it runs on hardware you can buy. For enterprises with sensitive data, compliance requirements, or long-running workloads where API costs would be astronomical, the TCO math flips fast. A single 4× B200 node amortized over a year running 24/7 can come out cheaper than API calls once daily token volume is high enough.
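Where that break-even sits depends entirely on your own numbers, so here is a minimal sketch of the arithmetic. The GPU cost is the midpoint of the $120,000-160,000 range quoted later in this review; the operating cost and the API price are hypothetical placeholders, not quotes from any provider.

```python
def breakeven_tokens_per_day(hardware_cost, amortization_years,
                             yearly_opex, api_price_per_million):
    """Daily token volume at which self-hosting costs the same as a paid API."""
    yearly_self_host = hardware_cost / amortization_years + yearly_opex
    # Yearly API spend for each token generated per day, every day.
    yearly_api_cost_per_daily_token = 365 * api_price_per_million / 1_000_000
    return yearly_self_host / yearly_api_cost_per_daily_token


tokens = breakeven_tokens_per_day(
    hardware_cost=140_000,        # midpoint of the GPU range quoted in this review
    amortization_years=1,         # matches the one-year framing above
    yearly_opex=30_000,           # HYPOTHETICAL: power, hosting, ops time
    api_price_per_million=2.00,   # HYPOTHETICAL: blended $ per 1M tokens
)
print(f"break-even: {tokens / 1e6:.0f}M tokens per day")
```

Below the break-even volume the API is cheaper; above it, the self-hosted node wins, provided it can actually sustain that throughput. Stretching the amortization period or facing a higher API price both lower the bar.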
vs. Other Open Models
| Model | AIQ Score | Context | Active Params | License |
|---|---|---|---|---|
| Nemotron 3 Ultra | 48 | 1M | 55B | OpenMDW-1.1 (permissive) |
| Kimi K2.6 | 54 | 1M | MoE | Open weights |
| DeepSeek V4 Pro | ~50 (est.) | 1M | MoE | Open weights |
| Gemma 4 31B | 39 | 128K | 31B | Apache 2.0 |
| GPT-OSS-120B | 33 | 128K | 120B | Apache 2.0 |
Against Kimi K2.6, Nemotron 3 Ultra trades some raw intelligence for much faster inference (300+ tok/s vs 50-100 tok/s) and a cleaner licensing story. Against Gemma 4 31B and GPT-OSS-120B, Nemotron 3 Ultra is simply in a different league - the AIQ gap of 9-15 points is massive.
vs. Nemotron 3 Super
If you’re deciding between Ultra and Super within the NVIDIA family: Super (120B total / 12.7B active) runs on a single B200 GPU in NVFP4, making it dramatically more accessible for smaller teams. Ultra delivers roughly a 30-40% lift on most benchmarks for roughly 4× the hardware. For coding agents and enterprise RAG, Ultra justifies the cost. For general-purpose chatbots, Super is probably enough.
The Verdict
Nemotron 3 Ultra is worth using if:
- You need a genuinely open, self-hostable frontier model with a permissive license
- Your workloads are agentic (SWE-Bench 71.9% is no joke)
- You have long-context needs where 1M tokens changes your architecture
- You want fast inference speed alongside frontier capabilities
It’s probably not worth it if:
- You’re chasing the absolute highest benchmark scores and can use closed APIs
- You don’t have access to 4× B200 or equivalent hardware
- Your use case is simple chat - Nemotron 3 Super or even Nano will be more than enough
Real-World Use Cases
Based on the architecture and benchmarks, here’s where Nemotron 3 Ultra shines:
1. Autonomous Coding Agents
NVIDIA explicitly ships OpenCode integration configs for this model. With SWE-Bench Verified at 71.9%, it’s genuinely competitive for autonomous code repair, feature implementation, and codebase-wide refactoring. The 1M context means it can ingest entire repositories without summarization losses.
2. Enterprise RAG and Document Intelligence
The combination of 1M context, strong instruction following (IFBench 81.7%), and multilingual support (at least 10 languages) makes it ideal for analyzing massive document collections - legal contracts, financial reports, scientific papers - in a single pass.
3. Complex Multi-Step Agent Workflows
The TauBench V3 scores (70.9% average across airline, retail, telecom, and banking domains) point to strong tool-use capabilities overall, although the banking domain lags far behind the others. This model can orchestrate API calls, database queries, and computational tools across long-running agent sessions without losing context.
4. Frontier Research and Scientific Reasoning
GPQA Diamond at 87% and IOI 2025 at 570 are serious scores. For researchers working on hard math, physics, or CS problems, this is a capable open reasoning engine.
5. Multilingual Applications
Support for English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, and Chinese means global deployment without juggling multiple models.
What’s Missing: Honest Limitations
No review is complete without the downsides.
Hardware appetite. You need 4× B200 GPUs minimum for the NVFP4 variant - that’s roughly $120,000-160,000 in GPU hardware alone. Yes, the free API tier exists, but self-hosting at scale isn’t cheap.
Still behind closed models. An AIQ score of 48 vs Opus 4.8 at 61 is a real gap. On the hardest reasoning tasks (HLE, CritPt), Nemotron struggles alongside every other model.
Banking domain weak spot. TauBench V3 Banking score of 22.6% (BF16) / 19.2% (NVFP4) stands out as a notable weakness. If your use case involves financial reasoning over banking-specific data, this deserves caution.
New model, sparse ecosystem. Released on June 4, 2026, the model has had almost no time to build a community tooling ecosystem. Expect rough edges in server frameworks, quantization tools, and agent harness integrations for the first few weeks.
Final Thoughts
NVIDIA Nemotron 3 Ultra is the most capable genuinely open AI model from a US company, period. It delivers frontier-level agentic and reasoning benchmarks, a genuinely useful 1M token context window, and remarkably fast inference thanks to MTP-based speculative decoding. The free API tier and permissive OpenMDW license make it accessible to anyone.
It doesn’t beat the best closed models from Anthropic or the best open models from China. But it occupies a compelling sweet spot - open, fast, powerful, and backed by NVIDIA’s engineering muscle. For teams building serious AI applications that need to run on their own hardware, it’s the best option available right now.
I’ll be keeping a close eye on how the ecosystem develops around this model. If the community rallies behind it the way it did around Llama, we could be looking at a new default for open-weight frontier AI.