NVIDIA Nemotron 3 Ultra vs Free Version: Cost, Performance, and Best Use Cases
If you’ve been following the AI model race, you know NVIDIA just dropped something big. Nemotron 3 Ultra landed on June 4, 2026, and it’s already shaking up how developers think about the NVIDIA Nemotron 3 Ultra vs free version tradeoff. The short version? There isn’t just one “free version” - there are three Nemotron 3 models you can use for zero dollars, and one you pay for. They’re all the same family, same architecture, but dramatically different in scale, cost, and what they can do.
I spent the last few days pulling every benchmark, pricing page, and technical report I could find. Here’s what you actually need to know before you pick one.
The Nemotron 3 Family: A Quick Map
NVIDIA released the Nemotron 3 family in three tiers. Think of it like car trims - same engineering philosophy, wildly different horsepower:
| Model | Total Params | Active Params | Context Window | Release Date |
|---|---|---|---|---|
| Nano 30B A3B | 31.6B | ~3.6B | 262K (up to 1M) | Dec 14, 2025 |
| Nano Omni 30B A3B | 30B | ~3B | 256K | Apr 28, 2026 |
| Super 120B A12B | 120B | 12B | 1M | Mar 11, 2026 |
| Ultra 550B A55B | 550B | 55B | 1M | Jun 4, 2026 |
All four share the same hybrid Mamba-Transformer Mixture-of-Experts architecture. They’re all fully open - weights, training data, and recipes are downloadable from Hugging Face under the OpenMDW-1.1 license. That’s rare at this scale, and it matters because you can self-host any of them, even Ultra, if you’ve got the hardware.
The key distinction: Nano and Super have free API access on OpenRouter and build.nvidia.com. Ultra does not have a persistent free tier on third-party providers - it costs money per token on OpenRouter, DeepInfra, and everywhere else. But NVIDIA does offer a free prototyping endpoint on build.nvidia.com with rate limits.
Cost Breakdown: What “Free” Actually Means
Let’s get specific about pricing. Here’s what each model costs on the major hosted platforms as of June 2026:
| Model | OpenRouter (Input/Output per 1M tokens) | DeepInfra | build.nvidia.com |
|---|---|---|---|
| Nano 30B | $0.05 / $0.20 (free tier: 18.6B tokens/week) | - | Free endpoint |
| Nano Omni | Free (189M tokens/week limit) | - | Free endpoint |
| Super 120B | $0.09 / $0.45 (free tier: 16.6B tokens/week) | $0.10 / $0.50 | Free endpoint |
| Ultra 550B | $0.50 / $2.50 (1.04B tokens/week free) | $0.50 / $2.50 ($0.15 cached) | Free endpoint (rate-limited) |
A few things jump out immediately.
Ultra costs 5.5x more per input token and 5.5x more per output token than Super. On a million-token output run, Super costs you $0.45 while Ultra costs $2.50. Over a month of heavy agent use - where multi-agent workflows can generate 15x more tokens than standard chat - the gap compounds fast. NVIDIA’s own research notes that agentic workflows cause “context explosion” where history, tool outputs, and reasoning steps get re-sent at every turn. If your agent pipeline churns through a billion tokens a week, Ultra costs $500 input + $2,500 output vs Super at $90 input + $450 output. That’s a $2,460 weekly difference.
But here’s the thing: Ultra is actually cheaper per task than comparable frontier models. NVIDIA reports that Ultra lowers cost-to-task-completion by 30% compared to models in its class on SWE-Bench Verified and Terminal Bench 2.0. It uses fewer total tokens and fewer tokens per turn to get the job done. So while the per-token price looks high, the per-task price can work out lower than alternatives like Kimi K2.6 or GLM 5.1.
The free tiers are genuinely usable. OpenRouter gives you 18.6 billion free tokens per week for Nano and 16.6 billion for Super. That’s enough for individual developers to build, test, and even run modest production workloads without spending a dime. The Nano Omni free tier is tighter at 189M tokens/week - fine for prototyping but not for volume.
build.nvidia.com offers free endpoints for all models including Ultra. These are rate-limited prototyping endpoints, not production-grade. Think of them as a sandbox: you can test Ultra’s reasoning quality before committing to paid infrastructure.
Performance Benchmarks: Where Ultra Earns Its Price
Numbers don’t lie. Here’s how Ultra stacks up against the competition and against its own free-tier siblings:
Ultra vs Other Frontier Models
| Benchmark | Nemotron 3 Ultra (550B) | GLM 5.1 (744B) | Kimi K2.6 (1T) | Qwen3.5 (397B) |
|---|---|---|---|---|
| PinchBench (Agent Productivity) | 91% | 84% | 91% | 89% |
| EnterpriseOps-Gym (Long-horizon Planning) | 33% | 40% | 29% | 30% |
| Terminal-Bench 2.0 (Coding) | 54% | 64% | 67% | 53% |
| IFBench (Instruction Following) | 82% | 77% | 74% | 78% |
| GDPVal-AA (Knowledge Work) | 1,448 | 1,594 | 1,508 | 1,192 |
| ProfBench (Search) | 56% | 46% | 56% | 53% |
| RULER @1M (Long Context) | 95% | N/A (max 256K) | N/A (max 256K) | 90% |
Sources: NVIDIA Nemotron 3 Ultra Technical Report & NVIDIA Developer Blog, June 2026
Ultra leads or ties on PinchBench (agent productivity), IFBench (instruction following), ProfBench (professional search), and RULER (long context). It’s competitive but not dominant on coding benchmarks, where Kimi K2.6 and GLM 5.1 edge ahead. The 1M-token context window with 95% RULER accuracy is a standout - neither GLM 5.1 nor Kimi K2.6 even support 1M tokens. For enterprises doing compliance analysis, long-document reasoning, or monolithic codebase understanding, that alone could be the deciding factor.
Ultra’s Standalone Benchmark Scores
| Benchmark | Ultra BF16 | Ultra NVFP4 |
|---|---|---|
| SWE-Bench Verified | 71.9% | 69.7% |
| Terminal Bench 2.1 | 56.4% | 53.9% |
| PinchBench | 90.0% | 89.8% |
| GPQA (no tools) | 87.0% | 87.9% |
| IFBench | 81.7% | 82.3% |
| RULER @1M | 94.7% | 94.0% |
| TauBench V3 (avg) | 70.9% | 70.3% |
| BrowseComp | 44.4% | 41.4% |
| HLE (no tools) | 26.7% | 26.1% |
Source: NVIDIA NIM API Reference, Nemotron 3 Ultra Model Card
The NVFP4 quantized version (the one most providers serve) loses roughly 1-3 percentage points across most benchmarks compared to BF16. That’s an impressively small gap for a 4-bit model - NVIDIA trained it natively in NVFP4 from the first gradient update, so the model learned to be accurate within 4-bit constraints rather than getting compressed after the fact.
Where Does Super Fit?
Super 120B is the middle child that punches above its weight. On PinchBench, it scores 85.6% - making it the best open model in its class. NVIDIA reports it delivers over 5x higher throughput than the previous Nemotron Super generation and 4x improved memory efficiency. With 12B active parameters and a 1M context window, it handles most agentic tasks without breaking a sweat. For most development teams, Super is the price-performance sweet spot.
Nano 30B: The Efficiency Champ
Nano 30B A3B achieves 3.3x higher throughput than Qwen3-30B and 2.2x higher than GPT-OSS-20B in an 8K input / 16K output configuration on a single H200 GPU. It scored 52 on the Artificial Analysis Intelligence Index - leading among similarly sized models. It’s 4x faster than the previous Nemotron Nano 2. For sub-agents handling targeted, high-volume tasks (tool calling, validation, simple code generation), Nano is unbeatable on cost-efficiency.
Throughput and Speed: The Hidden Cost Driver
Raw benchmark accuracy is one thing. In production, what actually matters is how fast the model generates tokens and how many tokens it burns per task. Slow models mean longer wait times for users and higher compute bills.
NVIDIA claims Ultra achieves 5x higher throughput compared to other open frontier models in its class on Artificial Analysis benchmarks. It’s the only model occupying the “high accuracy + high speed” quadrant. MTP (Multi-Token Prediction) generates up to 5 speculative tokens per forward pass, dramatically reducing wall-clock time for long sequences. Mamba layers handle sequence processing in linear time (vs quadratic for pure Transformers), making the 1M-token context window practical rather than theoretical.
Super with its LatentMoE architecture calls on 4x more expert specialists for the cost of one by compressing tokens before routing. This keeps per-token latency low even for complex multi-turn workflows.
Nano, with only 3.6B active parameters per token, is the speed demon. It’s ideal for high-volume, low-latency sub-tasks where a 550B model would be overkill.
Features and Architectural Differences
Not all Nemotron 3 models are created equal. Here’s what each tier brings:
| Feature | Nano 30B | Nano Omni | Super 120B | Ultra 550B |
|---|---|---|---|---|
| Architecture | Mamba-Transformer MoE | Mamba-Transformer MoE | Mamba-Transformer LatentMoE + MTP | Mamba-Transformer LatentMoE + MTP |
| Precision | BF16/FP8 | BF16 | NVFP4 (native) | NVFP4 (native) |
| Reasoning ON/OFF | Yes | Yes | Yes | Yes |
| Thinking Budget Control | Yes | Yes | Yes | Yes |
| MTP (Multi-Token Prediction) | No | No | Yes | Yes |
| LatentMoE | No | No | Yes | Yes |
| MOPD (Multi-Teacher Distillation) | No | No | No | Yes |
| Modalities | Text | Text, Image, Video, Audio | Text | Text |
| Min GPU (self-host) | 1× H100 / DGX Spark | 1× H100 / DGX Spark | 1× H100 / B200 | 4× B200 or 8× H100 |
MOPD is Ultra’s secret weapon. Multi-Teacher On-Policy Distillation uses 10+ specialized teacher models, each an expert in its own domain. During training, Ultra generates its own responses (on-policy rollouts), and each teacher scores those responses in its area of expertise. This co-evolution between students and teachers means Ultra continuously improves across domains without the typical accuracy-efficiency tradeoff. It’s why Ultra can match or beat models twice its size on certain benchmarks.
NVFP4 training means both Super and Ultra were born in 4-bit precision. Most quantized models are compressed after training, which introduces accuracy loss. Super and Ultra learned to think in 4-bit from the start. This is why the BF16-to-NVFP4 gap on benchmarks is so small - and why deployment on NVIDIA Blackwell GPUs is 4x faster than FP8 on Hopper.
Reasoning ON/OFF with thinking budgets is present across the entire family. You can toggle whether the model produces chain-of-thought reasoning and cap how many “thinking” tokens it generates. This is critical for controlling costs in agentic pipelines where you might want deep reasoning for planning steps but fast, direct responses for tool execution.
Supported Platforms: Where Can You Actually Run These?
Free API Access
- OpenRouter: Nano and Super have
:freevariants. Ultra is paid only. - build.nvidia.com (NVIDIA NIM): All models including Ultra have free prototyping endpoints with rate limits.
- Ollama: Download and run Nano and Super locally on consumer GPUs. Ultra requires enterprise hardware.
Paid API Access
- OpenRouter: All models with no free-tier limits
- DeepInfra: Super ($0.10/$0.50), Ultra ($0.50/$2.50)
- Together AI, Fireworks AI, Baseten, CoreWeave, DigitalOcean, Nebius, Vultr: Various pricing
- Perplexity Pro: Super and Ultra available with subscription
Self-Hosted (Download Weights)
- vLLM, SGLang, TRT-LLM: Cookbooks available for all models
- Ollama, LM Studio, llama.cpp: Nano and Super on consumer hardware
- Unsloth: Fine-tuning support for all models
- Hugging Face: Full weights under OpenMDW-1.1 license
Ultra’s minimum hardware requirement is steep: 4× B200 or 8× H100 GPUs. That’s easily $100K+ in hardware for self-hosting. Super can run on a single H100 or B200 - far more accessible. Nano runs on a DGX Spark or even high-end RTX workstation GPUs.
Real-World Use Case Recommendations
Individual Developer / Hobbyist: Stick With Free
If you’re building personal projects, learning agentic AI, or prototyping a startup idea, you don’t need Ultra. Start with Nano 30B on the free OpenRouter tier. It gives you 18.6 billion free tokens per week, which is more than enough for solo development. When you hit a task that needs more reasoning depth, switch to Super (also free, 16.6B tokens/week). Both are fast, both have 1M-token context (with the right config), and both outperform anything in their weight class.
Only consider Ultra’s paid tier if you’re consistently hitting reasoning ceilings that Super can’t break through - and honestly, for most individual projects, that won’t happen.
Recommendation: Nano (free) → Super (free) for harder tasks. Skip Ultra.
Startup / Small Team (3-20 devs): Super Is Your Default
For startups building agentic products - coding assistants, customer support automation, document intelligence - the economics tilt heavily toward Super. At $0.09/$0.45 per million tokens, you can run production workloads on a reasonable budget. Super’s 1M context window handles RAG, long conversations, and multi-document reasoning. The free tier covers development and testing; scale to paid when you go live.
If your product requires frontier-level reasoning (autonomous coding agents that need to sustain architectural decisions across sessions, for example), bolt on Ultra for the orchestration layer while keeping Super for execution. This “Super + Nano” deployment pattern is literally what NVIDIA recommends: Super plans, Nano executes.
Recommendation: Super as primary. Add Ultra for orchestration if your margins support it. Run Nano for high-volume sub-tasks.
Mid-Market / Growth-Stage Company: Tiered Deployment
At this stage, you’re running multi-agent systems at scale. You have agents planning, calling tools, delegating to sub-agents, and handling error recovery across hundreds of concurrent sessions. The token math gets real fast: multi-agent workflows generate 15x the tokens of standard chat. You need a tiered strategy.
Recommendation: Ultra for orchestration and the “hard calls” (10-20% of requests). Super for most agentic reasoning (50-60%). Nano for tool execution, validation, and high-volume sub-tasks (20-30%). Run self-hosted if you have GPU capacity to avoid per-token pricing. Use NVFP4 quantization for maximum throughput.
Enterprise: Ultra + Self-Hosting
Large enterprises running customer service automation, supply chain management, IT security triage, or compliance analysis will benefit most from Ultra’s frontier reasoning. The 30% cost-to-task-completion savings on agentic benchmarks add up at enterprise volume. Self-hosting on NVIDIA Blackwell hardware (4× B200 minimum) eliminates per-token API costs, and the OpenMDW-1.1 license allows full commercial deployment with data sovereignty.
Pair Ultra with Nemotron 3.5 Content Safety (4B parameter guardrail model covering 23 safety categories, 12 languages) and Nemotron 3.5 ASR (multilingual speech recognition with sub-100ms latency) for a complete voice-enabled, safety-guarded agentic stack.
Recommendation: Self-host Ultra on Blackwell. Layer in NVIDIA’s safety, speech, and RAG models. Fine-tune with LoRA for domain specialization using NeMo RL and NeMo Gym.
Academic Research: Free Tier + Open Weights
The Nemotron 3 family’s openness is a gift to researchers. Full training recipes, pre-training data (10T+ tokens), post-training data (50M+ SFT samples), RL environments (55+), and evaluation pipelines are all public. You can inspect, reproduce, and extend everything. The free API tiers handle experimentation; download weights for deeper work.
Recommendation: Use free API for initial experiments. Download weights from Hugging Face for fine-tuning and architecture research. Leverage NeMo Gym for RL experimentation.
The “Super + Nano” Pattern: NVIDIA’s Own Recommendation
One of the most useful insights from NVIDIA’s technical blog is what they call the “Super + Nano deployment pattern.” The idea: use Super for complex planning and multi-step reasoning, and Nano for executing individual steps within the workflow.
“Simple merge requests can be addressed by Nemotron 3 Nano while complex coding tasks that require deeper understanding of the code base can be handled by Nemotron 3 Super.”
This routing strategy gives you the best of both worlds - depth where you need it, speed and cost-efficiency everywhere else. It’s how NVIDIA itself recommends deploying the Nemotron family in production.
Verdict: Which Nemotron 3 Version Should You Use?
| Your Situation | Best Pick | Why |
|---|---|---|
| Solo dev, learning, prototyping | Nano 30B (free) | 18.6B free tokens/week, fast, good enough for most tasks |
| Solo dev hitting reasoning limits | Super 120B (free) | 16.6B free tokens/week, 1M context, strong agentic performance |
| Startup building an AI product | Super 120B (paid, ~$0.10/$0.50/M) | Best price-performance, handles production workloads |
| Startup needing frontier reasoning | Ultra 550B for orchestration + Super for execution | 30% lower cost-to-task-completion on agentic benchmarks |
| Growth-stage with multi-agent systems | Tiered: Ultra + Super + Nano | Route by task complexity, minimize total token spend |
| Enterprise, compliance, security | Self-hosted Ultra 550B | Data sovereignty, frontier accuracy, 1M context for document analysis |
| Academic research | Free API + downloaded weights | Full reproducibility, open data, open recipes |
| Multimodal applications | Nano Omni 30B (free) | Text + image + video + audio in one model, 2x video throughput vs separate pipelines |
The Bottom Line
The NVIDIA Nemotron 3 Ultra vs free version question has a surprisingly clear answer for most people: you probably don’t need Ultra. The free-tier Super 120B handles the vast majority of agentic workloads with a 1M context window, strong benchmark scores, and excellent throughput - all for zero dollars if you stay within the weekly token limit. Even at paid scale, it’s 5.5x cheaper per token than Ultra.
Ultra earns its price tag when you need frontier-level reasoning for complex, long-running autonomous agents. If your application requires sustaining architectural decisions across long coding sessions, synthesizing evidence across hundreds of research sources, or verifying outputs against thousands of constraints, Ultra’s 5x throughput advantage and 30% lower cost-to-task-completion make it worth the investment. For everyone else, the free Nemotron 3 models are already really, really good.
Sources
-
NVIDIA Developer Blog - “NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents” (June 4, 2026). developer.nvidia.com/blog/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running-agents/
-
NVIDIA Developer Blog - “Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning” (March 11, 2026). developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/
-
NVIDIA Developer Blog - “Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make It Efficient and Accurate” (December 15, 2025). developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/
-
Hugging Face Blog - “Nemotron 3 Nano - A New Standard for Efficient, Open, and Intelligent Agentic Models” (December 15, 2025). huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models
-
OpenRouter - Nemotron 3 Ultra Model Page. openrouter.ai/nvidia/nemotron-3-ultra-550b-a55b
-
OpenRouter - Nemotron 3 Super Model Page. openrouter.ai/nvidia/nemotron-3-super-120b-a12b
-
OpenRouter - Nemotron 3 Nano Model Page. openrouter.ai/nvidia/nemotron-3-nano-30b-a3b
-
NVIDIA NIM API Reference - Nemotron 3 Ultra 550B A55B Model Card. docs.api.nvidia.com/nim/reference/nvidia-nemotron-3-ultra-550b-a55b
-
DeepInfra - Nemotron 3 Ultra Pricing. deepinfra.com/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B
-
DeepInfra - Nemotron 3 Super Pricing. deepinfra.com/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B
-
NVIDIA Developer - Nemotron Model Family Page. developer.nvidia.com/nemotron
-
NVIDIA build - Nemotron 3 Ultra Endpoint. build.nvidia.com/nvidia/nemotron-3-ultra-550b-a55b
-
NVIDIA build - Nemotron 3 Super Endpoint. build.nvidia.com/nvidia/nemotron-3-super-120b-a12b
-
Hugging Face - Nemotron 3 Ultra NVFP4 Weights. huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
-
Hugging Face - Nemotron 3 Super FP8 Weights. huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8