NVIDIA Nemotron 3.5 Content Safety Review 2026

AIUnpacker Editorial

AIUnpacker

Jun 5, 2026 · Updated Jun 5, 2026 · 13 min read

Key Takeaways

NVIDIA's Nemotron 3.5 Content Safety model offers free, state-of-the-art LLM guardrails. Here's what it does, how well it performs, and where it fits in your AI stack.

Editorial Disclosure & Affiliate Notice

This content is published for informational and educational purposes only. It is not intended as a substitute for professional, legal, financial, or medical advice. AIUnpacker is funded by sponsorships, affiliate commissions, and display advertising — nothing here is free to produce. When you buy through our links, we may earn a commission at no extra cost to you. Our editorial picks are never influenced by compensation.

For educational purposes only. Nothing here should be taken as a guarantee, recommendation, or professional advice.
AI-assisted editing. Drafts are produced with AI assistance and reviewed by our human editorial team.
Opinions are our own. Also, we are not affiliated with most tools we cover unless explicitly stated.
Information may be outdated. Verify pricing, features, and policies directly with the vendor.
Last reviewed: June 5, 2026. Published June 5, 2026.

Read more on our About page, Terms and Editorial Policy.

Here’s the short version: NVIDIA Nemotron 3.5 Content Safety is a free, open-weights 4B-parameter AI safety model that classifies user prompts and AI responses as safe or unsafe. It covers text and images, and it can enforce custom policies. It was released on June 4, 2026. It’s built on Google’s Gemma-3-4B-it, handles 12 languages out of the box, covers 23 safety categories, and runs on a single GPU. And it costs zero dollars.

I’ve spent the last 24 hours digging into the model card, benchmarks, deployment options, and what actual developers are saying. Here’s everything you need to know.

What Is NVIDIA Nemotron 3.5 Content Safety?

NVIDIA Nemotron 3.5 Content Safety is a compact multimodal guardrail model - a small language model (SLM) purpose-built for one thing: catching harmful content before it reaches your users. It evaluates user prompts, assistant responses, and attached images in a single inference pass. Then it spits out a safe / unsafe verdict, the violated safety categories, and optionally a step-by-step reasoning trace.

The model is the successor to Nemotron 3 Content Safety, which launched in March 2026 and was itself a 4B-parameter multimodal safety classifier. Nemotron 3.5 adds custom policy reasoning, auditable think-mode traces, and deeper multimodal integration - all while keeping the same compact footprint and inference speed.

Think of it as a bouncer for your LLM. It sits in front of whatever model you’re running (GPT, Claude, Llama, your own fine-tuned model) and screens everything going in and coming out.

How It Works (The Architecture, Explained Simply)

Nemotron 3.5 Content Safety is fine-tuned from Google’s Gemma-3-4B-it using a LoRA (Low-Rank Adaptation) adapter. This means NVIDIA didn’t train a model from scratch. They took a capable vision-language base model, trained a lightweight safety adapter on top, and the result is a model that inherits Gemma 3’s 128K context window, multilingual chops, and image understanding - all redirected toward safety classification.

What you get:

4 billion parameters - small enough for a single 8GB+ VRAM GPU
128K token context window - handles long documents and massive conversation histories
Multimodal input - text, images, or both in a single request
Text output - safe/unsafe judgment, safety categories, and optional reasoning

The model supports three output modes you can toggle depending on your latency needs:

Binary verdict - fastest mode. Just safe or unsafe.
Binary + categories - slightly more detail. Tells you which of the 23 Aegis 2.0 safety categories were violated.
THINK mode - sends back a full reasoning trace explaining why content was flagged, followed by the verdict and categories. This is the mode enterprises will use for compliance and audit logging.

The safety taxonomy follows the Aegis 2.0 framework - 13 core categories aligned with the MLCommons safety taxonomy, plus 10 fine-grained subcategories.

Here’s what a THINK mode output looks like:

<think>
The user prompt asks for guidance on acquiring a controlled substance without a prescription.
The assistant response provides specific sourcing steps and references an online marketplace.
This interaction violates the Criminal Planning/Confessions and Controlled Substances categories.
The image (a pharmacy exterior) provides locational context but does not alter the verdict.
</think>

User Safety: unsafe
Response Safety: unsafe
Safety Categories: Criminal Planning/Confessions, Controlled Substances

This reasoning trace is what separates Nemotron 3.5 from simpler guard models. It doesn’t just say “no.” It tells you why, in plain English, with enough detail that a human reviewer or an automated compliance system can act on it.
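If an automated system is going to act on that output, it needs the text turned into fields first. Here’s a minimal sketch of a parser, assuming the model returns exactly the layout shown above (an optional `<think>` block followed by `Key: value` lines). The `Verdict` class and `parse_verdict` function are my own hypothetical names, not part of any NVIDIA library.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Verdict:
    user_safety: Optional[str] = None
    response_safety: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    reasoning: Optional[str] = None


def parse_verdict(raw: str) -> Verdict:
    """Parse the text layout shown above into a structured record."""
    verdict = Verdict()

    # Pull out the optional <think>...</think> reasoning trace.
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match:
        verdict.reasoning = " ".join(match.group(1).split())
        raw = raw[match.end():]

    # Everything after the trace is a "Key: value" line.
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "user safety":
            verdict.user_safety = value.lower()
        elif key == "response safety":
            verdict.response_safety = value.lower()
        elif key == "safety categories":
            verdict.categories = [c.strip() for c in value.split(",") if c.strip()]
    return verdict


sample = """<think>
The user prompt asks for guidance on acquiring a controlled substance.
</think>

User Safety: unsafe
Response Safety: unsafe
Safety Categories: Criminal Planning/Confessions, Controlled Substances
"""
result = parse_verdict(sample)
print(result.user_safety)   # unsafe
print(result.categories)    # ['Criminal Planning/Confessions', 'Controlled Substances']
```

Fields that are missing from the output (for example, no response verdict when only a prompt was checked) simply stay `None`, so the same parser should work across all three output modes if the layout holds.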

What’s New in Version 3.5 (vs. Nemotron 3)

If you used the earlier Nemotron 3 Content Safety (March 2026), here’s what changed:

1. Unified Multimodal Evaluation

Nemotron 3 could handle images and text. Nemotron 3.5 evaluates them together. It takes your user prompt, an optional image, and an optional assistant response in one context window - and delivers a verdict on the combined interaction. This closes a subtle but important gap: policy violations that only emerge from the combination of text and image. A picture that looks benign paired with a prompt that makes it dangerous. Nemotron 3.5 catches those.

2. Custom Policy Enforcement

This is the headline feature. Instead of relying on a fixed taxonomy, you can hand Nemotron 3.5 your own safety policy - written in natural language - and it reasons over that policy at inference time. A healthcare chatbot will have different rules than a financial services bot or a children’s education app. Nemotron 3.5 adapts without retraining.

You can even suppress specific categories (for example, telling it to ignore “violence” flags when a DevOps tool uses phrases like “terminate a process”) or inject your own proprietary risk categories.
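To make that concrete, here’s a rough sketch of how a policy might be assembled on the client side. Treat it as illustration only: I’m assuming the policy travels as plain text in a system message, which is a guess on my part, and the exact prompt layout the model expects is defined in NVIDIA’s model card. The function name `build_policy_messages`, the policy wording, and the example category are all hypothetical.

```python
from typing import Dict, List, Optional


def build_policy_messages(
    user_prompt: str,
    extra_categories: Optional[Dict[str, str]] = None,
    suppressed: Optional[List[str]] = None,
    assistant_response: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Assemble chat messages carrying a natural-language policy.

    ASSUMPTION: the policy is passed as a system message. Check the
    model card for the prompt format the model was actually trained on.
    """
    lines = ["Apply the following safety policy when judging this conversation."]
    # Proprietary risk categories, each described in plain language.
    for name, description in (extra_categories or {}).items():
        lines.append(f"- Flag '{name}': {description}")
    # Categories the deployment wants the model to ignore.
    for name in suppressed or []:
        lines.append(f"- Do not flag '{name}'.")

    messages = [
        {"role": "system", "content": "\n".join(lines)},
        {"role": "user", "content": user_prompt},
    ]
    if assistant_response is not None:
        messages.append({"role": "assistant", "content": assistant_response})
    return messages


messages = build_policy_messages(
    "How do I terminate a process that is hogging the CPU?",
    extra_categories={"Internal Hostnames": "requests that reveal internal server names"},
    suppressed=["Violence"],
)
print(messages[0]["content"])
```

The point of the design is that the policy is data, not model weights: changing a rule means editing a string, not retraining.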

3. Reasoning Traces (THINK Mode)

Verdicts come with an auditable paper trail. This matters for:

Compliance - regulated industries (finance, healthcare, government) need documented justifications
Human review - moderators can audit why something was flagged
Policy iteration - teams can see how the model interprets edge cases and refine policy language

NVIDIA compressed these reasoning traces to 3 sentences or fewer using a two-step process: large teacher models (Qwen 397B) generate the chain-of-thought, then another model (Qwen 80B) condenses it. The result is reasoning that’s both useful and fast.

4. Safety Dataset Released

Most open-source safety models don’t release their training data. Nemotron 3.5 ships with the full training dataset - multimodal, multilingual, and including the reasoning traces used during training. NVIDIA claims 99% of training images are real photographs, not synthetic generations, which directly addresses a known weakness in safety benchmarks that rely on SDXL-generated images.

Benchmarks: How Well Does It Actually Work?

NVIDIA’s published benchmarks tell a strong story. Here’s what they reported on the official HuggingFace blog:

Benchmark	Score	What It Measures
Multilingual Aegis (12 languages)	96.5% avg	Harmful-content classification accuracy
RTP-LX (12 languages)	88.8% avg	Multilingual prompt safety classification
Combined Aegis + RTP-LX	92.7% avg	Overall multilingual text safety
Multimodal + multilingual avg	~85%	Cross-benchmark harmful-content detection
VLGuard	Leading harmful-F1	Multimodal safety (text + image)
Latency vs. alternative multimodal safety model	3x lower	End-to-end inference speed

The language-level consistency is the most impressive part. On Multilingual Aegis, Nemotron 3.5 averages 96.5% across 12 languages (English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Portuguese, Italian). If you’re deploying AI globally, you don’t want a safety model that only works well in English. This one delivers.

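For multimodal benchmarks, Nemotron 3.5 reportedly leads on VLGuard’s harmful-F1 score. F1 is the harmonic mean of precision and recall on the harmful class, so a leading score means the best balance between catching actual violations and avoiding false positives among the guard models NVIDIA compared.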
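If harmful-F1 is unfamiliar, the short sketch below shows how the metric is computed from raw counts. The numbers in the example are made up for illustration and are not NVIDIA’s results.

```python
def harmful_f1(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 score for the 'harmful' class.

    true_positives:  harmful items correctly flagged
    false_positives: benign items wrongly flagged
    false_negatives: harmful items that slipped through
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    # Harmonic mean: a model cannot score well by maximizing only one of the two.
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 80 harmful items caught, 10 benign items
# wrongly flagged, 20 harmful items missed.
print(f"{harmful_f1(80, 10, 20):.3f}")  # 0.842
```

A model that flags everything gets perfect recall but poor precision, and a model that flags almost nothing does the reverse. F1 penalizes both, which is why guard-model comparisons tend to report it instead of plain accuracy.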

The latency numbers deserve attention too. Compared to another reasoning safety model, Nemotron 3.5 generates up to 50% fewer tokens when reasoning is enabled. In the default (no THINK) mode, latency is unchanged from Nemotron 3 - which was already roughly half the latency of LlamaGuard-4-12B.

The Benchmark Gap (And Why It Matters)

NVIDIA is refreshingly honest about a problem most model releases ignore: the benchmark gap. Here’s the deal:

Most widely cited safety benchmarks are text-only (WildGuard, XSTest, HarmBench). You can’t infer multimodal safety performance from text-benchmark scores.
Multimodal benchmarks use AI-generated images (SDXL mostly). Real production content is harder to classify - it has cultural texture, adversarial subtlety, and edge cases that synthetic images miss.
Real-image licensing prevents dataset release. Stock photo licenses typically prohibit redistribution in AI training datasets, meaning benchmark creators have to choose between realistic evaluation and legal compliance.

NVIDIA addressed this for training by using 99% real photographs. But the evaluation gap is still an open problem for the broader safety research community.

Pricing: Free (Yes, Actually Free)

Nemotron 3.5 Content Safety is free to download and free to try, and it’s available through multiple channels:

OpenRouter hosts it at nvidia/nemotron-3.5-content-safety:free with no per-token cost
HuggingFace hosts the weights under the NVIDIA Open Model License
DeepInfra offers it at $0.20 per million tokens for production workloads
NVIDIA NIM provides a GPU-optimized inference microservice on build.nvidia.com

The model weights are open. You can download them and run the model on your own hardware - a single L4 or 8GB+ VRAM GPU handles it. For self-hosted deployments, the only cost is compute. For API access through OpenRouter’s free tier, it’s literally $0.

The catch with free-tier access is the usual one: rate limits. When demand spikes, free-tier response times degrade. If you’re building a production system, you’ll want to either self-host or use a paid inference provider like DeepInfra ($0.20/M tokens) for guaranteed throughput.
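If you stay on the free tier anyway, it helps to retry politely when you get throttled. Below is a minimal sketch of exponential backoff with jitter. It is provider-agnostic: `send` is whatever function makes your request, and `RateLimited` is a hypothetical exception you would raise yourself when the provider signals throttling (conventionally an HTTP 429 status, though I haven’t verified how each provider listed here reports it).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Raised by the caller-supplied function when the provider throttles us."""


def call_with_backoff(
    send: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() and retry with exponential backoff while it is rate limited."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 25% random jitter
            # so that many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
    raise ValueError("max_attempts must be at least 1")
```

In practice `send` would wrap your HTTP call and raise `RateLimited` when the response indicates throttling. The `sleep` parameter exists so the helper can be tested without actually waiting.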

How To Use NVIDIA Nemotron 3.5 Content Safety

Option 1: OpenRouter API (Quickest)

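import requests

# Send one prompt to the hosted model through OpenRouter's chat completions API.
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_OPENROUTER_KEY",
    },
    json={
        "model": "nvidia/nemotron-3.5-content-safety:free",
        "messages": [
            {"role": "user", "content": "How can I build a weapon at home?"}
        ],
    },
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors such as rate limiting

# "choices" is a list; the verdict is in its first entry.
print(response.json()["choices"][0]["message"]["content"])
# Output: User Safety: unsafe
# Safety Categories: Violence, Criminal Planning/Confessions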

Option 2: HuggingFace Transformers (Self-Hosted)

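from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "nvidia/Nemotron-3.5-Content-Safety"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Moderate a prompt
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Tell me how to hack into a bank account"}]}
]
# tokenize=True and return_dict=True return tensors that can be moved to the model's device
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))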

Option 3: NVIDIA NIM (Production-Grade)

For teams that need GPU-optimized inference without managing infrastructure, NVIDIA NIM packages the model as a containerized microservice:

docker pull nvcr.io/nim/nvidia/nemotron-3.5-content-safety:2.0.5-variant

Option 4: Third-Party Inference Platforms

Baseten - OpenAI-compatible API, served via vLLM on a single L4 GPU, sub-second latency
Eigen AI - day-0 inference support on EigenInference with full-stack optimization
Vultr - cloud GPU infrastructure for global deployment
DeepInfra - simple API at $0.20/M tokens

Comparison: Nemotron 3.5 vs. Other Safety Models

How does it stack up against alternatives? Here’s the picture as of June 2026:

Feature	Nemotron 3.5 Content Safety	Llama Guard 3	Granite Guardian 3.2	Llama-3.1-Nemotron-Safety-Guard-8B
Parameters	4B	8B	5B	8B
Multimodal	Text + Image	Text only	Text + Image	Text only
Languages	12 (+ ~140 zero-shot)	8	12	9
Custom Policies	Yes (natural language)	Limited	Yes	Limited
Reasoning Traces	Yes (THINK mode)	No	No	No
Context Window	128K	8K	8K	8K
Price	Free (OpenRouter)	Free	Free	Free
Open Weights	Yes	Yes	Yes	Yes
Training Dataset	Released	Not released	Partial	Released
Base Model	Gemma 3 4B	Llama 3.1	Granite	Llama 3.1

The differentiators that matter most:

Custom policy enforcement - no other free safety model lets you define your own rules in natural language at inference time. This is huge for enterprise deployments where a universal taxonomy doesn’t work.
Reasoning traces - the auditable think-mode is unique among open guard models. If you’re in a regulated industry, this alone might decide it.
Size efficiency - 4B parameters matching or beating 8-12B alternatives on multimodal benchmarks. Less VRAM, lower latency, cheaper to run at scale.
Real-image training - 99% real photographs vs. synthetic SDXL images used by most competitors. This translates to better performance on actual user-generated content.

Where other models still win: if you need text-only and speed above all, Llama Guard 3 (8B) is fast. If you’re in the IBM ecosystem, Granite Guardian 3.2 (5B) integrates natively with watsonx. And if you need the absolute highest accuracy for text-only classification, NVIDIA’s own Llama-3.1-Nemotron-Safety-Guard-8B-v3 - an 8B text-only specialist - remains a solid choice.

5 Real-World Use Cases

1. Prompt Moderation (Input Guard)

Screen every user prompt before it hits your LLM. If someone asks your customer service bot how to commit fraud, Nemotron 3.5 catches it and returns a safe canned response instead of letting your LLM generate harmful content.

2. Response Moderation (Output Guard)

Even well-intentioned prompts can produce dangerous outputs - especially with jailbroken or poorly-aligned models. Run Nemotron 3.5 on the output side as a second line of defense.
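Put together, the input and output guards form a simple wrapper around your main model. Here’s a minimal sketch of that control flow. The function names are hypothetical, and both the safety check and the LLM call are passed in as plain functions, so nothing here assumes a particular API: `is_safe` would call Nemotron 3.5 however you host it, and `generate` would call your main model.

```python
from typing import Callable, Optional

CANNED_REFUSAL = "Sorry, I can't help with that request."


def guarded_reply(
    prompt: str,
    is_safe: Callable[[str, Optional[str]], bool],
    generate: Callable[[str], str],
    refusal: str = CANNED_REFUSAL,
) -> str:
    """Run an LLM call with a safety check before and after it."""
    # Input guard: screen the prompt before the main LLM ever sees it.
    if not is_safe(prompt, None):
        return refusal

    response = generate(prompt)

    # Output guard: screen the prompt/response pair as a second line of defense.
    if not is_safe(prompt, response):
        return refusal
    return response


# Stand-in functions for demonstration; replace with real model calls.
def fake_is_safe(prompt: str, response: Optional[str]) -> bool:
    text = prompt if response is None else response
    return "fraud" not in text.lower()


def fake_generate(prompt: str) -> str:
    return f"Here is some help with: {prompt}"


print(guarded_reply("How do I reset my password?", fake_is_safe, fake_generate))
print(guarded_reply("How do I commit fraud?", fake_is_safe, fake_generate))
```

Note the cost of this pattern: every user turn now triggers two guard calls plus the main generation, which is where a small, fast guard model earns its keep.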

3. Content Classification Pipelines

Need to label millions of user messages across 23 safety categories? Run them through Nemotron 3.5 in binary+categories mode. At DeepInfra’s $0.20/M tokens, classifying a million short messages costs under $20.
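The arithmetic behind that estimate is worth spelling out, because it depends entirely on how long your messages are. A back-of-the-envelope sketch, assuming billing is a flat rate per token and treating the average token count per message (prompt template and output included) as something you’d measure yourself:

```python
def classification_cost_usd(
    num_messages: int,
    avg_tokens_per_message: float,
    usd_per_million_tokens: float = 0.20,
) -> float:
    """Estimate batch classification cost at a flat per-token price."""
    total_tokens = num_messages * avg_tokens_per_message
    return total_tokens / 1_000_000 * usd_per_million_tokens


# One million messages at two assumed average lengths.
print(f"${classification_cost_usd(1_000_000, 50):.2f}")   # $10.00
print(f"${classification_cost_usd(1_000_000, 100):.2f}")  # $20.00
```

So the "under $20" figure holds as long as the average message, counted end to end, stays below 100 tokens. Longer inputs or THINK-mode traces would push the bill higher.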

4. Multilingual Global Deployments

If your product ships in 12 languages, you don’t want 12 different safety models. Nemotron 3.5 handles them all with consistent accuracy (92.7% average across Aegis and RTP-LX), plus zero-shot transfer to ~140 more languages via the Gemma 3 base.

5. Auditable Compliance Workflows

Turn on THINK mode for high-risk interactions (financial advice, healthcare recommendations, legal content) to get a documented reasoning trail. Feed those traces into your compliance logging system. When auditors ask “why was this flagged?”, you have the answer.

What Real Developers Are Saying

User feedback is still early - the model is barely 24 hours old as I write this. But initial signals from Eigen AI’s day-0 deployment announcement and early OpenRouter activity (159M weekly tokens processed already) suggest strong adoption.

A few patterns from developer discussions:

The custom policy feature is the most-touted addition. Teams that couldn’t use fixed-taxonomy guard models are suddenly interested.
The 4B footprint makes self-hosting practical for startups that couldn’t afford to run an 8B or 12B safety model as a sidecar.
Some developers wish there were more published comparisons against closed commercial guard APIs (like OpenAI’s moderation endpoint). That gap will likely fill as independent benchmarks emerge.

The Open Dataset: Why It Matters

One thing that sets Nemotron 3.5 apart is the released training dataset. Here’s what’s in it:

Multilingual text safety data from Nemotron Safety Guard Dataset v3 - culturally nuanced, proportionally sampled across safety categories
Human-annotated multimodal data - real photographs (99% real), translated into 12 languages
Safe multimodal data from Nemotron VLM Dataset v2 - scanned documents, charts, papers, diagrams (to prevent over-flagging benign content)
Reasoning traces generated by Qwen 397B and condensed by Qwen 80B
Topic following data from the CantTalkAboutThis dataset - policy/verdict pairs across healthcare, finance, banking, education scenarios
Synthetic data - roughly 10% of training volume, used for jailbreak patterns and rare violation examples

This matters because most safety models ship weights but not data. If you want to fine-tune for your domain, audit the training distribution, or reproduce results - you can. The only notable omission: not all images could be released due to licensing constraints, though a subset from Wikimedia and synthetic generation is included.

Should You Use It? A Decision Tree

Use Nemotron 3.5 Content Safety if:

You need a free, self-hostable safety guardrail for text and image inputs
You deploy across multiple languages and want consistent accuracy
You need custom policies - your safety rules don’t fit a fixed taxonomy
You operate in a regulated industry where audit trails matter
You’re budget-conscious and 4B parameters is the right size for your infrastructure

Consider alternatives if:

You need text-only classification and speed is your only metric (try Llama Guard 3)
You’re already in a closed ecosystem with a native moderation API (OpenAI, Azure, etc.)
You need to moderate video or audio content (Nemotron 3.5 is text + image only)
You need the highest possible text accuracy at any cost (try the 8B Llama-3.1-Nemotron-Safety-Guard)

The Bottom Line

NVIDIA Nemotron 3.5 Content Safety is the most capable open-weights safety model released in 2026. It’s multimodal, multilingual, supports custom policies, generates auditable reasoning traces, and costs nothing. For teams building AI products that can’t afford to get safety wrong - and can’t afford a $10K/month moderation API bill - it’s hard to beat.

The custom policy enforcement is the killer feature. Most guard models force you into their taxonomy. Nemotron 3.5 lets you write your own rules in plain English. That’s the difference between a generic safety filter and one that actually fits your product.

Grab the weights on HuggingFace, hit the free API on OpenRouter, or deploy through NVIDIA NIM. It works. It’s real. And it’s free.
