NVIDIA Nemotron 3 Ultra Free: Pricing, Features, Context Window, and Best Use Cases
NVIDIA dropped something massive on June 4, 2026 - the NVIDIA Nemotron 3 Ultra free tier. And I’m not talking about some stripped-down demo with a 5-message limit. This is a genuine 550-billion-parameter frontier model with 55 billion active parameters, a native 1M-token context window, and a free API endpoint you can start using right now. I’ve spent the last day digging into the docs, testing the endpoint, and comparing it against every alternative I could find. Here’s everything you need to know.
What Exactly Is NVIDIA Nemotron 3 Ultra?
Nemotron 3 Ultra is NVIDIA’s flagship reasoning model in the Nemotron 3 family. It sits above the Nano (30B total, 3B active) and the Super (120B total, 12B active) as the heavyweight option for the hardest agentic workloads.
The architecture is what makes it interesting. It’s a hybrid Mamba-Transformer with Latent Mixture of Experts (LatentMoE). In plain English: instead of running every token through every parameter like a dense model, it activates only 55 billion of its 550 billion total parameters per token. The Mamba-2 layers handle efficient long-sequence modeling, while selective attention layers handle precise reasoning. It also uses Multi-Token Prediction (MTP) - predicting multiple future tokens in a single forward pass - which speeds up inference significantly.
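To make the "only some parameters run per token" idea concrete, here’s a minimal sketch of generic top-k expert routing in plain Python. To be clear, this is not NVIDIA’s LatentMoE implementation: the expert count, the value of k, the router scores, and the toy experts are all assumptions I picked for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, experts, x, k=2):
    """Run only the k highest-scoring experts and blend their outputs."""
    # Indices of the k experts with the largest router scores.
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalize the scores of the selected experts so the weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top

# Toy setup (assumption): ten "experts", each just scales its input.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 11)]
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0, 0.2, 0.3]

output, used = route_top_k(router_logits, experts, x=1.0, k=2)
print(f"experts used: {used} of {len(experts)}, output: {output:.3f}")
```

The other eight experts never execute for this token, which is where the compute savings of a sparse model come from.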
Here are the specs at a glance:
| Spec | Details |
|---|---|
| Total Parameters | 550B |
| Active Parameters | 55B |
| Architecture | Hybrid Mamba-2 + MoE + Attention with MTP |
| Context Window | Up to 1M tokens |
| Supported Languages | English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese |
| Training Data Cutoff | May 2026 (post-training), September 2025 (pre-training) |
| License | OpenMDW-1.1 (open weights, open data, commercial use allowed) |
| Release Date | June 4, 2026 |
| Minimum GPU (Self-Host) | 4x B200 / 4x GB200 / 8x H100 |
NVIDIA Nemotron 3 Ultra Pricing Explained
Here’s where things get genuinely surprising. NVIDIA is offering Nemotron 3 Ultra with a free API endpoint on build.nvidia.com. This isn’t a time-limited trial - it’s a free tier for prototyping and development.
The Free Tier
- Cost: $0 (requires an NVIDIA API key - free to generate)
- Access: https://integrate.api.nvidia.com/v1 via OpenAI-compatible API
- Rate Limits: NVIDIA’s API Trial Terms of Service govern usage. While NVIDIA doesn’t publish hard rate limits publicly, the free tier is designed for prototyping. Expect throttling if you’re hammering it with production traffic.
- Max Output Tokens: 32,768
- Reasoning Toggle: Configurable on/off via the enable_thinking flag
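Since the free tier can throttle you, it’s worth wrapping calls in a retry with exponential backoff. Here’s a minimal sketch, not an official client feature. `call_with_backoff` and `ThrottledError` are hypothetical names of mine; in real code you’d pass the rate-limit exception your SDK actually raises via `retry_on`.

```python
import random
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for the SDK's rate-limit exception."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0,
                      retry_on=(ThrottledError,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter when throttled."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # Out of retries: surface the error to the caller.
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The `sleep` parameter is injectable only so the helper can be exercised in tests without actually waiting.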
Paid Options
If you need production throughput, you’ve got choices:
| Provider | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Context |
|---|---|---|---|
| NVIDIA Free Endpoint | $0 | $0 | 1M |
| OpenRouter | $0.50 | $2.50 | 1M |
| Self-Hosted (NIM) | Infrastructure cost only | Infrastructure cost only | 1M |
OpenRouter routes requests to multiple providers, automatically selecting the best available backend. Self-hosting requires substantial hardware - minimum 4x B200 GPUs - but gives you unlimited usage at your own infrastructure cost.
How the Nemotron 3 Family Pricing Compares
| Model | Total Params | Active Params | Input (per 1M) | Output (per 1M) | Context |
|---|---|---|---|---|---|
| Nano 30B | 30B | 3B | $0.05 | $0.20 | 262K |
| Super 120B | 120B | 12B | $0.09 | $0.45 | 1M |
| Ultra 550B | 550B | 55B | $0.50 (or free) | $2.50 (or free) | 1M |
The Super model is the sweet spot for most production agent workloads - 1M context at roughly 1/5 the cost of Ultra. But the free Ultra endpoint changes the calculus entirely for prototyping.
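If you want to see what that gap looks like for your own traffic, here’s a quick back-of-the-envelope calculator. The prices are copied from the tables above; the monthly token volumes are a made-up workload, so swap in your own numbers.

```python
# USD per 1M tokens as (input, output), taken from the pricing tables above.
PRICES = {
    "nano": (0.05, 0.20),
    "super": (0.09, 0.45),
    "ultra": (0.50, 2.50),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated USD cost for a given token volume."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical workload: 200M input tokens and 20M output tokens per month.
for model in PRICES:
    cost = estimate_cost(model, 200_000_000, 20_000_000)
    print(f"{model}: ${cost:,.2f}/month")
```

At these list prices Super comes out at 18% of Ultra’s bill regardless of the input/output mix, since both of its rates are 18% of Ultra’s.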
The 1M Context Window: Why It Matters
A 1M-token context window means you can feed the model roughly 750,000 words in a single prompt. That’s the entirety of War and Peace plus The Great Gatsby with room to spare.
Most practical uses are less literary:
- Entire codebases - drop in a whole repository (on the order of 100K lines of code, depending on line length and the tokenizer) and ask targeted questions
- Multi-hour agent sessions - keep conversation history, tool outputs, and planning state all in context
- Full document corpora - analyze 10-K filings, legal contracts, or research papers without chunking
- Aggregated RAG retrieval - stuff dozens of retrieved passages into a single reasoning pass
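Before you paste a pile of documents into one prompt, it helps to sanity-check whether they’ll fit. Here’s a rough sketch using the ~0.75 words-per-token ratio implied above. It’s a crude heuristic: real counts depend on the tokenizer, and code tokenizes very differently from prose. I’m also assuming the prompt and the 32,768-token output cap share the same 1M window.

```python
import math

WORDS_PER_TOKEN = 0.75       # Implied by "1M tokens is roughly 750,000 words".
CONTEXT_WINDOW = 1_000_000   # Tokens.
MAX_OUTPUT_TOKENS = 32_768   # Output cap listed for the free tier.

def estimate_tokens(text):
    """Very rough token estimate from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_context(documents, reserve=MAX_OUTPUT_TOKENS):
    """Return (estimated_tokens, fits) while reserving room for the reply."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total, total <= CONTEXT_WINDOW - reserve
```

For anything close to the limit, count tokens with the model’s actual tokenizer instead of trusting the estimate.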
The key benchmark here is RULER 1M, where Nemotron 3 Ultra scores 94.0 (NVFP4) to 94.7 (BF16). That means it’s retrieving and using information accurately even at the full million-token mark - not just technically supporting long context but actually performing well with it.
The 1M context is enabled by the hybrid Mamba-Transformer architecture. Mamba layers track long-range dependencies with minimal memory overhead, while the attention layers handle precision reasoning where needed. The MoE routing keeps per-token compute manageable even at extreme lengths.
Key Features of Nemotron 3 Ultra
Configurable Reasoning Mode
You get three reasoning levels:
- High (reasoning_effort: "high") - Full chain-of-thought reasoning trace before the final answer. Best for complex math, coding, planning.
- Medium (reasoning_effort: "medium") - Efficient reasoning with significantly fewer tokens. Good starting point before tuning explicit budgets.
- Off (reasoning_effort: "none") - No reasoning trace. Fast responses for simple queries.
You can also set a hard reasoning_budget in tokens. The model will attempt to close its reasoning trace before hitting that ceiling.
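Here’s a small helper I’d use to keep those knobs consistent across requests. Treat it as a sketch: `reasoning_options` is my own name, and sending `reasoning_effort` and `reasoning_budget` through the OpenAI SDK’s `extra_body` is an assumption, so check the API reference for where your endpoint expects them.

```python
VALID_EFFORTS = ("high", "medium", "none")

def reasoning_options(effort="medium", reasoning_budget=None, max_tokens=16384):
    """Build keyword arguments for chat.completions.create()."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}")
    extra_body = {"reasoning_effort": effort}
    if reasoning_budget is not None:
        if effort == "none":
            raise ValueError("a reasoning budget makes no sense with reasoning off")
        # Reasoning tokens count against max_tokens, so a larger budget
        # could never be honored.
        if reasoning_budget > max_tokens:
            raise ValueError("reasoning_budget must not exceed max_tokens")
        extra_body["reasoning_budget"] = reasoning_budget
    return {"max_tokens": max_tokens, "extra_body": extra_body}
```

You’d then splat the result into the call, for example `client.chat.completions.create(model=..., messages=..., **reasoning_options("high", reasoning_budget=8192))`.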
Tool Calling
Nemotron 3 Ultra supports native function calling with reasoning intertwined. When you enable both enable_thinking: true and force_nonempty_content: true, the model reasons about which tool to call, then outputs a properly formatted tool call. This is critical for agent workflows where the model needs to think before acting.
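Here’s roughly what the tool-calling plumbing looks like on the client side. This is a sketch under a few assumptions: `get_weather` is a made-up tool with a canned response, the schema follows the OpenAI-style `tools` format, and I’m guessing that `force_nonempty_content` sits next to `enable_thinking` inside `chat_template_kwargs`.

```python
import json

# OpenAI-style tool schema. get_weather is a hypothetical example tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the forecast for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# The two flags named above; their placement here is an assumption.
EXTRA_BODY = {
    "chat_template_kwargs": {
        "enable_thinking": True,
        "force_nonempty_content": True,
    }
}

def get_weather(city):
    """Canned response so the sketch runs offline."""
    return {"city": city, "forecast": "sunny"}

REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls):
    """Execute each tool call and build the 'tool' messages for the next turn."""
    results = []
    for call in tool_calls:
        handler = REGISTRY[call["function"]["name"]]
        # Arguments arrive as a JSON string, not a dict.
        arguments = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(handler(**arguments)),
        })
    return results
```

In a real loop you’d pass `tools=TOOLS` and `extra_body=EXTRA_BODY` to `client.chat.completions.create(...)`, convert `response.choices[0].message.tool_calls` to dicts (the SDK objects expose `model_dump()`), run them through `run_tool_calls`, and append the results to `messages` before calling the model again.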
Streaming with Reasoning Visibility
The streaming API exposes reasoning tokens separately from content tokens via reasoning_content in the delta. This means you can show users a “thinking” indicator while the model works through a problem, then display the final answer when it’s ready. It’s a much better UX than staring at a blank screen for 20 seconds.
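A minimal sketch of how you might separate the two token streams on the client. The `reasoning_content` field name comes from the description above; `split_stream` and `fake_chunk` are hypothetical helpers of mine, and the fake chunks exist only so the snippet runs without a network call.

```python
from types import SimpleNamespace

def split_stream(chunks, on_reasoning=None, on_content=None):
    """Consume streamed chunks and return (reasoning_text, answer_text)."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # Some chunks carry only metadata.
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        content = getattr(delta, "content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
            if on_reasoning:
                on_reasoning(reasoning)  # e.g. update a "thinking" indicator
        if content:
            content_parts.append(content)
            if on_content:
                on_content(content)  # e.g. append to the visible answer
    return "".join(reasoning_parts), "".join(content_parts)

def fake_chunk(reasoning=None, content=None):
    """Build an object shaped like a streamed chunk, for offline testing."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

demo = [
    fake_chunk(reasoning="Let me think. "),
    fake_chunk(reasoning="2+2=4."),
    fake_chunk(content="The answer "),
    fake_chunk(content="is 4."),
]
reasoning, answer = split_stream(demo)
```

With a live client you’d pass the iterator returned by `client.chat.completions.create(..., stream=True)` in place of `demo`.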
Multi-Token Prediction
MTP predicts multiple future tokens per forward pass, reducing latency for long generations. Combined with the MoE architecture, this means Ultra can sustain high throughput despite its 550B parameter scale. The MTP implementation uses shared-weight prediction heads, which improves training signal quality and supports native speculative decoding at inference time.
Open Weights and Open Data
This is NVIDIA’s big differentiator. The model weights are freely downloadable from Hugging Face under the OpenMDW-1.1 license. The pre-training data (nearly 10 trillion tokens) and post-training datasets are also openly available. You can inspect, customize, fine-tune, or deploy however you want.
Best Use Cases for Nemotron 3 Ultra
1. Coding Agents
This is Ultra’s strongest suit. On SWE-Bench Verified, it hits 71.9% (BF16) - meaning it can autonomously resolve real GitHub issues nearly 72% of the time. On SWE-Bench Multilingual, it scores 67.7%.
The 1M context window means you can feed an entire repository into context and let the model reason across files. With tool calling and the reasoning toggle, it can plan multi-file edits, write the code, and verify correctness in one agentic loop. NVIDIA even ships an OpenCode configuration specifically for Nemotron 3 Ultra, so you can wire it up as a terminal-based coding agent with zero friction.
Where it beats alternatives: Other models on NVIDIA’s API, such as the open-weight GPT-OSS-120B, don’t give you the 1M context. Ultra lets you reason across entire monorepos without chunking.
2. Multi-Agent Orchestration
Ultra was explicitly designed as an orchestrator for multi-agent systems. On agentic benchmarks, it scores:
| Benchmark | BF16 Score |
|---|---|
| Terminal Bench 2.1 | 56.4 |
| TauBench V3 (Average) | 70.9 |
| PinchBench | 90.0 |
| ProfBench (Search) | 56.0 |
| BrowseComp | 44.4 |
These evaluate multi-step planning, tool use, verification, and recovery - exactly the skills needed for an orchestrator agent that delegates to sub-agents.
The configurable reasoning budget is particularly important here. For routine delegation decisions, you can use medium or no reasoning. For complex planning requiring synthesis across multiple agent outputs, you crank it up to high.
3. Deep Research and Document Analysis
With 1M-token context and strong long-context benchmarks, Ultra excels at research tasks. The AA-LCR (long context reasoning) score of 65.4 and the OmniScience Non-Hallucination rate of 78.7 indicate it stays grounded in its sources rather than confabulating.
Practical applications:
- Load a 200-page PDF and ask cross-referenced questions
- Analyze entire legal contracts with precedent comparison
- Summarize multi-document research collections
- Compare financial filings across multiple quarters
The NVFP4 quantized version performs nearly identically to the BF16 version on long-context tasks (RULER 1M: 94.0 vs 94.7), so you’re not sacrificing quality by running the more efficient checkpoint.
4. RAG and Enterprise Knowledge Systems
NVIDIA designed the entire Nemotron ecosystem around RAG. Ultra pairs with NVIDIA’s Nemotron Retriever models (embed, rerank, parse) to form a complete retrieval pipeline.
A typical setup:
- Nemotron Parse extracts clean text from PDFs, preserving tables and reading order
- Nemotron Retriever embeds documents and retrieves relevant passages
- Nemotron 3 Ultra reasons over retrieved context and generates the final answer
The 1M context means you can retrieve 50+ passages and let Ultra synthesize them without losing track. Compare that to a 128K model where you’re cramming things in and hoping the attention mechanism doesn’t lose the thread.
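The last step of that pipeline, packing retrieved passages into one prompt, can be sketched like this. It doesn’t call the Nemotron Retriever or Parse models: the `(score, text)` pairs stand in for whatever your retriever returns, `build_rag_prompt` is a hypothetical helper, and the word budget is a crude proxy for a real token count.

```python
def build_rag_prompt(question, passages, max_prompt_words=600_000):
    """Pack the highest-scoring passages into a single numbered prompt.

    passages: list of (score, text) pairs; a higher score means more relevant.
    """
    ranked = sorted(passages, key=lambda pair: pair[0], reverse=True)
    selected = []
    used_words = len(question.split())
    for _score, text in ranked:
        words = len(text.split())
        if used_words + words > max_prompt_words:
            continue  # Skip passages that would blow the budget.
        selected.append(text)
        used_words += words
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(selected, start=1))
    return (
        "Answer using only the numbered passages below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Numbering the passages gives the model something concrete to cite, which makes grounded answers easier to verify afterwards.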
5. High-Stakes Enterprise Workflows
Ultra targets enterprise use cases where accuracy matters more than cost:
- Customer service automation - with safety guardrails via Nemotron Safety models
- Supply chain management - multi-step planning with tool integration
- IT security analysis - reasoning over logs, alerts, and playbooks
- Financial analysis - cross-document reasoning over filings, earnings calls, and market data
The GPQA score of 87.0 (no tools) demonstrates graduate-level reasoning capability. Combined with the 78.7 non-hallucination rate on OmniScience, it’s a solid choice when wrong answers cost real money.
How to Access NVIDIA Nemotron 3 Ultra Free
There are three main access paths:
1. NVIDIA Free API Endpoint (Easiest)
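from openai import OpenAI

# The NVIDIA endpoint is OpenAI-compatible, so the stock SDK works as-is.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",  # Free from build.nvidia.com
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[{"role": "user", "content": "Explain quantum entanglement to a 12-year-old."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=16384,
    # Fields the SDK doesn't know about are forwarded via extra_body.
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "reasoning_budget": 16384,
    },
    stream=True,
)

# With stream=True the call returns an iterator; print the answer as it arrives.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)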
Generate your free API key at build.nvidia.com, swap it in, and you’re running. The endpoint is OpenAI-compatible, so any existing OpenAI SDK code works with a base URL change.
2. OpenRouter (Paid, Production-Ready)
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
    # Optional attribution header; default_headers is sent with every request.
    default_headers={"HTTP-Referer": "https://your-site.com"},
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[{"role": "user", "content": "Your prompt"}],
)
print(response.choices[0].message.content)
OpenRouter charges $0.50/M input tokens and $2.50/M output tokens with automatic provider failover.
3. Self-Hosted (Maximum Control)
Pull the NIM container and run on your own hardware:
docker login nvcr.io
# Username: $oauthtoken, Password: <NVIDIA_API_KEY>
docker run -it --rm --gpus all --shm-size=16GB \
-e NGC_API_KEY -p 8000:8000 \
nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:latest
Minimum requirements: 4x B200 GPUs or 8x H100 GPUs. Use vLLM, SGLang, or TensorRT-LLM as your serving backend for production deployments.
Additional Access Points
- Hugging Face: Download weights directly and run with Transformers, vLLM, or SGLang
- LM Studio: Desktop app with built-in model browser
- Ollama: CLI-based local inference
- Partner Endpoints: Available through providers like Together AI, DeepInfra, Fireworks AI, and 20+ others
How Ultra Compares to Alternatives
Vs. NVIDIA Nemotron 3 Super
Super (120B total, 12B active) is the more practical choice for most production workloads. It also has a 1M context window and costs 80-90% less. Use Super when you need efficient multi-agent coordination at scale. Use Ultra when you need frontier-level reasoning accuracy for the hardest calls in an agent workflow.
Vs. Closed-Source Frontier Models
Ultra’s open-weight nature is its main competitive advantage against closed models. You can:
- Inspect the training data
- Fine-tune on proprietary data
- Deploy on-premises with full data sovereignty
- Audit for compliance and safety
Closed models can’t match that transparency. And with the free API endpoint, Ultra has a zero-cost entry point that proprietary alternatives can’t touch.
Vs. Other Open Models
The hybrid Mamba-Transformer architecture gives Ultra a meaningful efficiency advantage for long-context workloads. Pure Transformer models struggle at 1M tokens because self-attention cost grows quadratically with sequence length. Mamba layers scale linearly instead, so only the hybrid’s attention layers pay that cost. Combined with MoE routing (only 55B of 550B active), Ultra delivers frontier performance at lower effective compute cost than dense models of comparable quality.
Limitations to Know About
It’s text-only. No vision, no audio, no multimodal. For those, you’d want Nemotron 3 Nano Omni (30B, multimodal).
Self-hosting is expensive. 4x B200 GPUs is a serious hardware commitment. Most developers will use the free endpoint or OpenRouter.
Free tier is for prototyping. NVIDIA’s API Trial Terms govern the free endpoint. It’s not designed for production throughput. If you’re building a customer-facing app, budget for OpenRouter or self-hosting.
Reasoning tokens count against output budget. When enable_thinking is on, the reasoning trace eats into your max_tokens limit. Set reasoning_budget explicitly to control this.
Banking domain weakness. TauBench V3 Banking scores only 19.2-22.6, suggesting domain-specific fine-tuning is needed for financial services deployment.
The Bottom Line
NVIDIA Nemotron 3 Ultra is one of the most capable open-weight reasoning models available as of June 2026. The free API endpoint removes the cost barrier for prototyping. The 1M context window and configurable reasoning make it well suited for coding agents, deep research, and multi-agent orchestration.
If you’re building agentic systems in 2026, start with the free endpoint. Graduate to Super when you need production throughput at lower cost. Reserve Ultra for the hardest tasks where accuracy trumps everything else.
Sources
- NVIDIA Docs - Nemotron 3 Ultra API Reference - Official API documentation with model specifications, quick start guide, and benchmark tables.
- Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make It Efficient and Accurate - NVIDIA Technical Blog (Dec 15, 2025) covering hybrid architecture, multi-environment RL, and 1M context.
- Nemotron 3 Ultra on build.nvidia.com - Free API endpoint with code examples and model card.
- Nemotron 3 Ultra on OpenRouter - Commercial pricing, provider status, benchmarks, and weekly token availability.
- Nemotron 3 Super on OpenRouter - Pricing comparison ($0.09/$0.45 per 1M tokens) and specifications.
- NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 on Hugging Face - Full model card with all benchmark results (BF16 and NVFP4 variants).
- NVIDIA Nemotron Developer Page - Family overview, model comparisons, deployment options, and ecosystem tools.
- NVIDIA API Documentation - Chat Completions - Full API specification with reasoning_effort, reasoning_budget, and other parameters.
- Nemotron 3 Model Collection on Hugging Face - All Nemotron 3 model variants with weights, datasets, and deployment guides.
- NVIDIA Nemotron Retriever Models - Embedding, reranking, and parsing models for RAG pipelines.