NVIDIA Nemotron 3.5 Content Safety Review: Free AI Safety Model for LLM Guardrails
Here’s the short version: NVIDIA Nemotron 3.5 Content Safety is a free, open-weights 4B-parameter AI safety model that classifies both user prompts and AI responses as safe or unsafe across text, images, and custom policies. It was released on June 4, 2026. It’s built on Google’s Gemma-3-4B-it, handles 12 languages out of the box, covers 23 safety categories, supports custom policy enforcement, and runs on a single GPU. And it costs zero dollars.
I’ve spent the last 24 hours digging into the model card, benchmarks, deployment options, and what actual developers are saying. Here’s everything you need to know.
What Is NVIDIA Nemotron 3.5 Content Safety?
NVIDIA Nemotron 3.5 Content Safety is a compact multimodal guardrail model - a small language model (SLM) purpose-built for one thing: catching harmful content before it reaches your users. It evaluates user prompts, assistant responses, and attached images in a single inference pass. Then it spits out a safe / unsafe verdict, the violated safety categories, and optionally a step-by-step reasoning trace.
The model is the successor to Nemotron 3 Content Safety, which launched in March 2026 and was itself a 4B-parameter multimodal safety classifier. Nemotron 3.5 adds custom policy reasoning, auditable think-mode traces, and deeper multimodal integration - all while keeping the same compact footprint and inference speed.
Think of it as a bouncer for your LLM. It sits in front of whatever model you’re running (GPT, Claude, Llama, your own fine-tuned model) and screens everything going in and coming out.
How It Works (The Architecture, Explained Simply)
Nemotron 3.5 Content Safety is fine-tuned from Google’s Gemma-3-4B-it using a LoRA (Low-Rank Adaptation) adapter. This means NVIDIA didn’t train a model from scratch. They took a capable vision-language base model, trained a lightweight safety adapter on top, and the result is a model that inherits Gemma 3’s 128K context window, multilingual chops, and image understanding - all redirected toward safety classification.
What you get:
- 4 billion parameters - small enough for a single 8GB+ VRAM GPU
- 128K token context window - handles long documents and massive conversation histories
- Multimodal input - text, images, or both in a single request
- Text output - safe/unsafe judgment, safety categories, and optional reasoning
The model supports three output modes you can toggle depending on your latency needs:
- Binary verdict - fastest mode. Just `safe` or `unsafe`.
- Binary + categories - slightly more detail. Tells you which of the 23 Aegis v2 safety categories were violated.
- THINK mode - sends back a full reasoning trace explaining why content was flagged, followed by the verdict and categories. This is the mode enterprises will use for compliance and audit logging.
The safety taxonomy follows the Aegis 2.0 framework - 13 core categories aligned with the MLCommons safety taxonomy, plus 10 fine-grained subcategories.
Here’s what a THINK mode output looks like:
<think>
The user prompt asks for guidance on acquiring a controlled substance without a prescription.
The assistant response provides specific sourcing steps and references an online marketplace.
This interaction violates the Criminal Planning/Confessions and Controlled Substances categories.
The image (a pharmacy exterior) provides locational context but does not alter the verdict.
</think>
User Safety: unsafe
Response Safety: unsafe
Safety Categories: Criminal Planning/Confessions, Controlled Substances
This reasoning trace is what separates Nemotron 3.5 from simpler guard models. It doesn’t just say “no.” It tells you why, in plain English, with enough detail that a human reviewer or an automated compliance system can act on it.
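If you want to feed that output into a review queue, you need to split it into fields first. Here is a minimal sketch of a parser, assuming the model returns exactly the plain-text layout shown above (an optional `<think>` block followed by `Key: value` lines). The `GuardVerdict` class and `parse_guard_output` function are hypothetical names for illustration, not part of any NVIDIA library.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GuardVerdict:
    reasoning: str = ""
    user_safety: Optional[str] = None
    response_safety: Optional[str] = None
    categories: List[str] = field(default_factory=list)


def parse_guard_output(text: str) -> GuardVerdict:
    """Parse the verdict layout from the example (assumed format, not an official spec)."""
    verdict = GuardVerdict()
    # Pull out the optional reasoning trace and collapse it to one line.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        verdict.reasoning = " ".join(match.group(1).split())
        text = text[match.end():]
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value"
        key, value = key.strip().lower(), value.strip()
        if key == "user safety":
            verdict.user_safety = value.lower()
        elif key == "response safety":
            verdict.response_safety = value.lower()
        elif key == "safety categories":
            verdict.categories = [c.strip() for c in value.split(",") if c.strip()]
    return verdict
```

Fields that are missing from the output stay at their defaults, so the same function should also cope with the shorter binary and binary + categories modes.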
What’s New in Version 3.5 (vs. Nemotron 3)
If you used the earlier Nemotron 3 Content Safety (March 2026), here’s what changed:
1. Unified Multimodal Evaluation
Nemotron 3 could handle images and text. Nemotron 3.5 evaluates them together. It takes your user prompt, an optional image, and an optional assistant response in one context window - and delivers a verdict on the combined interaction. This closes a subtle but important gap: policy violations that only emerge from the combination of text and image. A picture that looks benign paired with a prompt that makes it dangerous. Nemotron 3.5 catches those.
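To make the "one context window" idea concrete, here is a rough sketch of how a combined request could be assembled. It assumes your serving endpoint accepts OpenAI-style chat messages with `image_url` content parts, which is common for OpenAI-compatible servers but is not confirmed here for this model; check the model card for the exact template. `build_guard_messages` is a hypothetical helper name.

```python
import base64


def build_guard_messages(prompt, image_bytes=None, image_mime="image/png", assistant_reply=None):
    """Build one chat payload: user prompt, optional image, optional assistant reply."""
    user_content = [{"type": "text", "text": prompt}]
    if image_bytes is not None:
        # Inline the image as a base64 data URL (assumed to be accepted by the endpoint).
        encoded = base64.b64encode(image_bytes).decode("ascii")
        user_content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{image_mime};base64,{encoded}"},
        })
    messages = [{"role": "user", "content": user_content}]
    if assistant_reply is not None:
        # Including the reply lets the guard judge the whole interaction at once.
        messages.append({"role": "assistant", "content": assistant_reply})
    return messages
```

The point of the design is that text, image, and response travel together, so the guard can judge the combination rather than each piece in isolation.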
2. Custom Policy Enforcement
This is the headline feature. Instead of relying on a fixed taxonomy, you can hand Nemotron 3.5 your own safety policy - written in natural language - and it reasons over that policy at inference time. A healthcare chatbot will have different rules than a financial services bot or a children’s education app. Nemotron 3.5 adapts without retraining.
You can even suppress specific categories (for example, telling it to ignore “violence” flags when a DevOps tool uses phrases like “terminate a process”) or inject your own proprietary risk categories.
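As an illustration of what "a policy written in natural language" might look like in practice, the sketch below assembles one from structured settings. The wording of the policy and the `build_policy_text` helper are assumptions made for this example. How the policy text is actually passed to the model (system message, template field, or something else) is not covered here, so follow the model card for that part.

```python
def build_policy_text(product, allowed_exceptions=(), extra_categories=None):
    """Compose a plain-English policy string (hypothetical wording, not an official format)."""
    lines = [f"Safety policy for {product}:"]
    # Categories to suppress in specific, clearly described situations.
    for category, reason in allowed_exceptions:
        lines.append(f"- Do not flag '{category}' when {reason}.")
    # Product-specific risk categories that the default taxonomy lacks.
    for name, description in (extra_categories or {}).items():
        lines.append(f"- Also flag '{name}': {description}.")
    return "\n".join(lines)


policy = build_policy_text(
    "an internal DevOps assistant",
    allowed_exceptions=[("Violence", "the text is about stopping software, such as 'terminate a process'")],
    extra_categories={"Credential Leak": "requests to reveal API keys or passwords"},
)
```

Keeping the policy in a small function like this makes it easy to version and review alongside the rest of your configuration.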
3. Reasoning Traces (THINK Mode)
Verdicts come with an auditable paper trail. This matters for:
- Compliance - regulated industries (finance, healthcare, government) need documented justifications
- Human review - moderators can audit why something was flagged
- Policy iteration - teams can see how the model interprets edge cases and refine policy language
NVIDIA compressed these reasoning traces to 3 sentences or fewer using a two-step process: a large teacher model (Qwen 397B) generates the chain-of-thought, then a smaller model (Qwen 80B) condenses it. The result is reasoning that’s both useful and fast.
4. Safety Dataset Released
Most open-source safety models don’t release their training data. Nemotron 3.5 ships with its training dataset - multimodal, multilingual, and including the reasoning traces used during training - minus some images that could not be redistributed for licensing reasons. NVIDIA claims 99% of training images are real photographs, not synthetic generations, which directly addresses a known weakness in safety benchmarks that rely on SDXL-generated images.
Benchmarks: How Well Does It Actually Work?
NVIDIA’s published benchmarks tell a strong story. Here’s what they reported on the official HuggingFace blog:
| Benchmark | Score | What It Measures |
|---|---|---|
| Multilingual Aegis (12 languages) | 96.5% avg | Harmful-content classification accuracy |
| RTP-LX (12 languages) | 88.8% avg | Multilingual prompt safety classification |
| Combined Aegis + RTP-LX | 92.7% avg | Overall multilingual text safety |
| Multimodal + multilingual avg | ~85% | Cross-benchmark harmful-content detection |
| VLGuard | Leading harmful-F1 | Multimodal safety (text + image) |
| Latency vs. alternative multimodal safety model | 3x lower | End-to-end inference speed |
The language-level consistency is the most impressive part. On Multilingual Aegis, Nemotron 3.5 averages 96.5% across 12 languages (English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Portuguese, Italian). If you’re deploying AI globally, you don’t want a safety model that only works well in English. This one delivers.
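For multimodal benchmarks, Nemotron 3.5 reportedly leads on VLGuard’s harmful-F1 score. F1 on the harmful class balances recall (how many real violations get caught) against precision (how few safe items get flagged), so a higher score means a better trade-off between missed violations and false positives than competing guard models.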
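For readers who want the metric spelled out, here is the standard F1 computation applied to the harmful class. This is the textbook definition, not VLGuard’s exact evaluation script, and the counts in the example are made up for illustration.

```python
def harmful_f1(true_positives, false_positives, false_negatives):
    """F1 on the harmful class: harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # nothing harmful was correctly caught
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 80 harmful items caught, 20 safe items
# wrongly flagged, 20 harmful items missed.
score = harmful_f1(true_positives=80, false_positives=20, false_negatives=20)
```

Because it is a harmonic mean, a guard cannot score well by flagging everything (high recall, poor precision) or by flagging almost nothing (the reverse).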
The latency numbers deserve attention too. Compared to another reasoning safety model, Nemotron 3.5 generates up to 50% fewer tokens when reasoning is enabled. In the default (no THINK) mode, latency is unchanged from Nemotron 3 - which was already roughly half the latency of LlamaGuard-4-12B.
The Benchmark Gap (And Why It Matters)
NVIDIA is refreshingly honest about a problem most model releases ignore: the benchmark gap. Here’s the deal:
- Most widely cited safety benchmarks are text-only (WildGuard, XSTest, HarmBench). You can’t infer multimodal safety performance from text-benchmark scores.
- Multimodal benchmarks use AI-generated images (SDXL mostly). Real production content is harder to classify - it has cultural texture, adversarial subtlety, and edge cases that synthetic images miss.
- Real-image licensing prevents dataset release. Stock photo licenses typically prohibit redistribution in AI training datasets, meaning benchmark creators have to choose between realistic evaluation and legal compliance.
NVIDIA addressed this for training by using 99% real photographs. But the evaluation gap is still an open problem for the broader safety research community.
Pricing: Free (Yes, Actually Free)
Nemotron 3.5 Content Safety is free to download and free to call on OpenRouter, with paid hosting available if you need guaranteed throughput:
- OpenRouter hosts it at `nvidia/nemotron-3.5-content-safety:free` with no per-token cost
- HuggingFace hosts the weights under the NVIDIA Open Model License
- DeepInfra offers it at $0.20 per million tokens for production workloads
- NVIDIA NIM provides a GPU-optimized inference microservice on build.nvidia.com
The model weights are open. You can download them and run the model on your own hardware - a single L4 or 8GB+ VRAM GPU handles it. For self-hosted deployments, the only cost is compute. For API access through OpenRouter’s free tier, it’s literally $0.
The catch with free-tier access is the usual one: rate limits. When demand spikes, free-tier response times degrade. If you’re building a production system, you’ll want to either self-host or use a paid inference provider like DeepInfra ($0.20/M tokens) for guaranteed throughput.
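If you do start on the free tier, it helps to handle rate limiting gracefully instead of failing on the first rejected request. Below is a generic retry sketch using exponential backoff with jitter. It assumes your own request function raises the `RateLimited` exception defined here when the API answers HTTP 429; both that exception and `call_with_backoff` are hypothetical names, not part of any provider SDK.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller-supplied function when the API answers HTTP 429."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry on RateLimited, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            # Wait 1s, 2s, 4s, ... plus up to 25% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
```

The `sleep` parameter is injectable so the function can be tested without real waiting. For a guardrail that sits on the critical path, decide up front whether exhausted retries should block the request or let it through.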
How To Use NVIDIA Nemotron 3.5 Content Safety
Option 1: OpenRouter API (Quickest)
import requests

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_OPENROUTER_KEY",
    },
    json={
        "model": "nvidia/nemotron-3.5-content-safety:free",
        "messages": [
            {"role": "user", "content": "How can I build a weapon at home?"}
        ],
    },
    timeout=30,
)
response.raise_for_status()  # surface HTTP errors such as 401 or 429

# "choices" is a list; the first entry holds the model's reply.
print(response.json()["choices"][0]["message"]["content"])
# Output: User Safety: unsafe
# Safety Categories: Violence, Criminal Planning/Confessions
Option 2: HuggingFace Transformers (Self-Hosted)
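from transformers import AutoModelForCausalLM, AutoProcessor

# Check the model card for the recommended model class and prompt template.
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-3.5-Content-Safety",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("nvidia/Nemotron-3.5-Content-Safety")

# Moderate a prompt. Content is a list of typed parts, as multimodal processors expect.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Tell me how to hack into a bank account"}]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# outputs is a batch; decode only the new tokens of the first sequence.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))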
Option 3: NVIDIA NIM (Production-Grade)
For teams that need GPU-optimized inference without managing infrastructure, NVIDIA NIM packages the model as a containerized microservice:
docker pull nvcr.io/nim/nvidia/nemotron-3.5-content-safety:2.0.5-variant
Option 4: Third-Party Inference Platforms
- Baseten - OpenAI-compatible API, served via vLLM on a single L4 GPU, sub-second latency
- Eigen AI - day-0 inference support on EigenInference with full-stack optimization
- Vultr - cloud GPU infrastructure for global deployment
- DeepInfra - simple API at $0.20/M tokens
Comparison: Nemotron 3.5 vs. Other Safety Models
How does it stack up against alternatives? Here’s the picture as of June 2026:
| Feature | Nemotron 3.5 Content Safety | Llama Guard 3 | Granite Guardian 3.2 | Llama-3.1-Nemotron-Safety-Guard-8B |
|---|---|---|---|---|
| Parameters | 4B | 8B | 5B | 8B |
| Multimodal | Text + Image | Text only | Text + Image | Text only |
| Languages | 12 (+ ~140 zero-shot) | 8 | 12 | 9 |
| Custom Policies | Yes (natural language) | Limited | Yes | Limited |
| Reasoning Traces | Yes (THINK mode) | No | No | No |
| Context Window | 128K | 8K | 8K | 8K |
| Price | Free (OpenRouter) | Free | Free | Free |
| Open Weights | Yes | Yes | Yes | Yes |
| Training Dataset | Released | Not released | Partial | Released |
| Base Model | Gemma 3 4B | Llama 3 | Granite | Llama 3.1 |
The differentiators that matter most:
- Custom policy enforcement - few free safety models let you define your own rules in natural language at inference time. This is huge for enterprise deployments where a universal taxonomy doesn’t work.
- Reasoning traces - among the open guard models compared here, only Nemotron 3.5 offers an auditable THINK mode. If you’re in a regulated industry, this alone might decide it.
- Size efficiency - 4B parameters matching or beating 8-12B alternatives on multimodal benchmarks. Less VRAM, lower latency, cheaper to run at scale.
- Real-image training - 99% real photographs vs. synthetic SDXL images used by most competitors. This should translate to better performance on actual user-generated content.
Where other models still win: if you need text-only and speed above all, Llama Guard 3 (8B) is fast. If you’re in the IBM ecosystem, Granite Guardian 3.2 (5B) integrates natively with watsonx. And if you need the absolute highest accuracy for text-only classification, NVIDIA’s own Llama-3.1-Nemotron-Safety-Guard-8B-v3 - an 8B text-only specialist - remains a solid choice.
5 Real-World Use Cases
1. Prompt Moderation (Input Guard)
Screen every user prompt before it hits your LLM. If someone asks your customer service bot how to commit fraud, Nemotron 3.5 catches it and returns a safe canned response instead of letting your LLM generate harmful content.
2. Response Moderation (Output Guard)
Even well-intentioned prompts can produce dangerous outputs - especially with jailbroken or poorly-aligned models. Run Nemotron 3.5 on the output side as a second line of defense.
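Use cases 1 and 2 fit together as a simple wrapper around your main model. The sketch below shows the control flow only: `generate` stands in for your LLM call and `is_unsafe` for a call to the guard model, and both are supplied by you. `toy_guard` is a keyword stub so the example runs on its own; it is not how Nemotron 3.5 classifies anything.

```python
REFUSAL = "Sorry, I can't help with that request."


def guarded_reply(prompt, generate, is_unsafe):
    """Screen the prompt, call the LLM, then screen the response."""
    # Input guard: block before the main model ever sees the prompt.
    if is_unsafe(prompt=prompt, response=None):
        return REFUSAL
    reply = generate(prompt)
    # Output guard: second line of defense on the generated text.
    if is_unsafe(prompt=prompt, response=reply):
        return REFUSAL
    return reply


def toy_guard(prompt, response):
    """Stand-in classifier for demonstration; replace with a real guard call."""
    text = f"{prompt} {response or ''}".lower()
    return "fraud" in text


blocked = guarded_reply("How do I commit fraud?", lambda p: "...", toy_guard)
allowed = guarded_reply("What are your opening hours?", lambda p: "9 to 5.", toy_guard)
```

Passing the prompt along with the response on the second check mirrors the way the guard evaluates the whole interaction rather than the reply alone.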
3. Content Classification Pipelines
Need to label millions of user messages across 23 safety categories? Run them through Nemotron 3.5 in binary+categories mode. At DeepInfra’s $0.20/M tokens, classifying a million short messages costs under $20, as long as each request averages fewer than 100 tokens (1M x 100 tokens = 100M tokens = $20).
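To budget for your own traffic, the estimate is a one-line calculation. The sketch below uses the $0.20 per million tokens price quoted above. The average tokens per message is your assumption to supply, and it should include any prompt template or policy text sent with each request.

```python
def classification_cost_usd(num_messages, avg_tokens_per_message, usd_per_million_tokens=0.20):
    """Estimated API cost: total tokens times the per-million-token price."""
    total_tokens = num_messages * avg_tokens_per_message
    return total_tokens / 1_000_000 * usd_per_million_tokens


# One million messages at an assumed 80 tokens each comes to roughly $16.
short_messages = classification_cost_usd(1_000_000, 80)
# A long custom policy in every request changes the picture quickly:
# at an assumed 600 tokens per request the same batch is roughly $120.
with_policy = classification_cost_usd(1_000_000, 600)
```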
4. Multilingual Global Deployments
If your product ships in 12 languages, you don’t want 12 different safety models. Nemotron 3.5 handles them all with consistent accuracy (92.7% average across Aegis and RTP-LX), plus zero-shot transfer to ~140 more languages via the Gemma 3 base.
5. Auditable Compliance Workflows
Turn on THINK mode for high-risk interactions (financial advice, healthcare recommendations, legal content) to get a documented reasoning trail. Feed those traces into your compliance logging system. When auditors ask “why was this flagged?”, you have the answer.
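What "feed those traces into your compliance logging system" looks like depends on your stack, but a common pattern is one JSON object per line in an append-only log. Here is a minimal sketch. The field names and the `audit_record` helper are hypothetical, so adapt them to whatever schema your auditors expect.

```python
import json
from datetime import datetime, timezone


def audit_record(request_id, verdict, categories, reasoning):
    """Serialize one guard decision as a single JSON line for an append-only log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC, ISO 8601
        "request_id": request_id,
        "verdict": verdict,
        "categories": list(categories),
        "reasoning": reasoning,
    }
    return json.dumps(record, ensure_ascii=False)


line = audit_record(
    "req-0001",
    "unsafe",
    ["Controlled Substances"],
    "The prompt asks how to obtain a controlled substance without a prescription.",
)
```

Storing the reasoning next to the verdict means the answer to "why was this flagged?" is a log query rather than a re-run of the model.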
What Real Developers Are Saying
User feedback is still early - the model is barely 24 hours old as I write this. But initial signals from Eigen AI’s day-0 deployment announcement and early OpenRouter activity (159M weekly tokens processed already) suggest strong adoption.
A few patterns from developer discussions:
- The custom policy feature is the most-touted addition. Teams that couldn’t use fixed-taxonomy guard models are suddenly interested.
- The 4B footprint makes self-hosting practical for startups that couldn’t afford to run an 8B or 12B safety model as a sidecar.
- Some developers wish there were more published comparisons against closed commercial guard APIs (like OpenAI’s moderation endpoint). That gap will likely fill as independent benchmarks emerge.
The Open Dataset: Why It Matters
One thing that sets Nemotron 3.5 apart is the released training dataset. Here’s what’s in it:
- Multilingual text safety data from Nemotron Safety Guard Dataset v3 - culturally nuanced, proportionally sampled across safety categories
- Human-annotated multimodal data - 99% real photographs, translated into 12 languages
- Safe multimodal data from Nemotron VLM Dataset v2 - scanned documents, charts, papers, diagrams (to prevent over-flagging benign content)
- Reasoning traces generated by Qwen 397B and condensed by Qwen 80B
- Topic following data from the CantTalkAboutThis dataset - policy/verdict pairs across healthcare, finance, banking, education scenarios
- Synthetic data - roughly 10% of training volume, used for jailbreak patterns and rare violation examples
This matters because most safety models ship weights but not data. If you want to fine-tune for your domain, audit the training distribution, or reproduce results - you can. The only notable omission: not all images could be released due to licensing constraints, though a subset from Wikimedia and synthetic generation is included.
Should You Use It? A Quick Checklist
Use Nemotron 3.5 Content Safety if:
- You need a free, self-hostable safety guardrail for text and image inputs
- You deploy across multiple languages and want consistent accuracy
- You need custom policies - your safety rules don’t fit a fixed taxonomy
- You operate in a regulated industry where audit trails matter
- You’re budget-conscious and 4B parameters is the right size for your infrastructure
Consider alternatives if:
- You need text-only classification and speed is your only metric (try Llama Guard 3)
- You’re already in a closed ecosystem with a native moderation API (OpenAI, Azure, etc.)
- You need to moderate video or audio content (Nemotron 3.5 is text + image only)
- You need the highest possible text accuracy at any cost (try the 8B Llama-3.1-Nemotron-Safety-Guard)
The Bottom Line
NVIDIA Nemotron 3.5 Content Safety is arguably the most capable open-weights safety model released so far in 2026. It’s multimodal, multilingual, supports custom policies, generates auditable reasoning traces, and costs nothing. For teams building AI products that can’t afford to get safety wrong - and can’t afford a $10K/month moderation API bill - it’s hard to beat.
The custom policy enforcement is the killer feature. Most guard models force you into their taxonomy. Nemotron 3.5 lets you write your own rules in plain English. That’s the difference between a generic safety filter and one that actually fits your product.
Grab the weights on HuggingFace, hit the free API on OpenRouter, or deploy through NVIDIA NIM. It works. It’s real. And it’s free.
Sources
- NVIDIA HuggingFace Blog - Nemotron 3.5 Content Safety Announcement (June 4, 2026)
- NVIDIA NIM Model Card - Nemotron 3.5 Content Safety
- OpenRouter - NVIDIA: Nemotron 3.5 Content Safety (free)
- CloudPrice - Nemotron 3.5 Content Safety Pricing & Specs
- LM Market Cap - Nemotron 3.5 Content Safety Rankings
- Vultr Blog - Nemotron 3.5 Content Safety on Vultr
- Eigen AI - Day-0 Inference for Nemotron 3 Family
- Baseten - Nemotron 3.5 Content Safety Model Library
- DeepInfra - Nemotron Content Safety 3.5
- HuggingFace - Nemotron 3.5 Content Safety Model Weights
- HuggingFace - Nemotron 3.5 Content Safety Dataset
- arXiv - CultureGuard: Culturally-Aware Dataset and Guard Model (NVIDIA’s Nemotron Safety Guard v3 research)
- arXiv - Evaluating Robustness of LLM Safety Guardrails Against Adversarial Attacks (Independent guard model comparison)
- GitHub - NVIDIA-NeMo/Nemotron Developer Repository
- PromptCost.org - Free LLM Models 2026 Guide