Step 3.7 Flash vs Other AI Models 2026: Speed & Cost

AIUnpacker Editorial

AIUnpacker

Jun 5, 2026Updated Jun 5, 202615m read

Jun 5, 2026Updated Jun 5, 2026

15 min3,278 words

Key Takeaways

Which flash AI model should you use in 2026? I put Step 3.7 Flash head-to-head against Gemini Flash, Claude Haiku, GPT-4o mini, and DeepSeek Flash. The results surprised me.

Summarize with AI

15 min → 30 sec

ChatGPT

OpenAI

Gemini

Google

Perplexity

AI Search

Editorial Disclosure & Affiliate Notice

This content is published for informational and educational purposes only. It is not intended as a substitute for professional, legal, financial, or medical advice. AIUnpacker is funded by sponsorships, affiliate commissions, and display advertising — nothing here is free to produce. When you buy through our links, we may earn a commission at no extra cost to you. Our editorial picks are never influenced by compensation.

For educational purposes only. Nothing here should be taken as a guarantee, recommendation, or professional recommendation.
AI-assisted editing. Drafts are produced with AI assistance and reviewed by our human editorial team.
Opinions are our own. Also, we are not affiliated with most tools we cover unless explicitly stated.
Information may be outdated. Verify pricing, features, and policies directly with the vendor.
Last reviewed: June 5, 2026. Published June 5, 2026.

Read more on our About page, Terms and Editorial Policy.

If you’re building anything that calls an LLM in a loop - agents, coding assistants, customer support bots - you’ve probably asked yourself the same question I have: which flash model actually delivers?

The “flash” AI model category has exploded in 2025-2026. Every major lab now has a fast, cheap, lightweight model. StepFun dropped Step 3.7 Flash on May 28. DeepSeek shipped V4 Flash. Google has Gemini 2.5 Flash and the brand-new Gemini 3.5 Flash. Anthropic has Claude Haiku 4.5. OpenAI still runs GPT-4o mini. And Qwen has its own flash-tier offerings.

They can’t all be the best. So I dug into the numbers.

This isn’t a vibes-based comparison. I pulled pricing from OpenRouter, benchmark scores from official model cards and HuggingFace, throughput specs from provider documentation, and cross-referenced everything I could against independent sources. Where data comes from self-reported benchmarks, I say so.

Here’s what I found.

The Contenders: A Quick Rundown

Model	Release Date	Total Params	Active Params	Architecture	Open Source
Step 3.7 Flash	May 2026	196B (+1.8B ViT)	~11B	MoE	Apache 2.0
DeepSeek V4 Flash	Apr 2026	284B	~13B	MoE	Weights available
Gemini 2.5 Flash	Jun 2025	Undisclosed	Undisclosed	Proprietary	No
Gemini 3.5 Flash	May 2026	Undisclosed	Undisclosed	Proprietary	No
Claude 3.5 Haiku	Oct 2024	Undisclosed	Undisclosed	Proprietary	No
Claude Haiku 4.5	Oct 2025	Undisclosed	Undisclosed	Proprietary	No
GPT-4o mini	Jul 2024	Undisclosed	Undisclosed	Proprietary	No

Step 3.7 Flash and DeepSeek V4 Flash are the only open-source models in this comparison. Step 3.7 Flash is Apache 2.0 licensed - you can run it on your own hardware, which matters if you care about data sovereignty or want to avoid API rate limits. DeepSeek V4 Flash also publishes model weights on HuggingFace.

Both use Mixture-of-Experts architectures: large total parameter counts but only a fraction active per token. Step 3.7 Flash activates roughly 11B of its 196B parameters per forward pass. DeepSeek V4 Flash activates about 13B of 284B. The others don’t disclose architecture details.

Pricing: Who’s Actually Cheap?

Price is the flash category’s reason to exist. Here’s the raw cost per million tokens on OpenRouter as of June 2026:

Model	Input ($/M tokens)	Output ($/M tokens)	Cache Hit Input	Output/Input Ratio
DeepSeek V4 Flash	$0.098	$0.197	N/A	2.0×
GPT-4o mini	$0.15	$0.60	N/A	4.0×
Step 3.7 Flash	$0.20	$1.15	$0.04	5.75×
DeepSeek V3 0324	$0.20	$0.77	N/A	3.85×
Gemini 2.5 Flash	$0.30	$2.50	N/A	8.3×
Claude 3.5 Haiku	$0.80	$4.00	N/A	5.0×
Claude Haiku 4.5	$1.00	$5.00	N/A	5.0×
Gemini 3.5 Flash	$1.50	$9.00	N/A	6.0×

DeepSeek V4 Flash is the clear winner on raw price. Ten cents per million input tokens and under twenty cents for output. That’s absurdly cheap. You could process the entire King James Bible (~780K words) for about 15 cents.

GPT-4o mini comes second at $0.15/M input. But its output price - $0.60/M - is triple DeepSeek’s. If your application generates a lot of output (coding agents, content generation), that gap adds up fast.

Step 3.7 Flash sits in the affordable middle: $0.20/M input, $1.15/M output. It’s cheaper than every Gemini and Claude model. The real sleeper feature is its $0.04/M cache hit rate - if your application has repetitive system prompts or RAG contexts, your effective cost drops dramatically.

Gemini 3.5 Flash at $1.50/$9.00 is genuinely expensive for a “flash” model. At that price, you’re paying frontier-model money. You’d better be getting frontier-model performance to justify it.

Claude Haiku 4.5 at $1/$5 is pricier than most flash alternatives too, but Anthropic argues it delivers near-Sonnet quality. More on that below.

Winner: DeepSeek V4 Flash. But price isn’t everything.

Speed: Tokens Per Second

Speed benchmarks are harder to pin down because throughput depends heavily on the provider, batch size, and current load. But here’s what the official documentation says:

Step 3.7 Flash: Up to 400 tokens/second (per HuggingFace model card). That’s fast enough for real-time chat and sub-second agent decisions.
DeepSeek V4 Flash: Designed for “fast inference and high-throughput workloads” - official throughput numbers aren’t published, but OpenRouter shows it handling 3.11 trillion weekly tokens, by far the highest volume of any model in this comparison. The infrastructure is clearly optimized.
Gemini Flash models: Google positions both 2.5 and 3.5 Flash as their “fast” tier, but doesn’t publish raw tok/s numbers. In practice, latency feels comparable to other flash models.
Claude Haiku 4.5: Anthropic calls it their “fastest and most efficient model,” purpose-built for “real-time and high-volume applications.” No public tok/s figure.
GPT-4o mini: OpenAI’s smallest/fastest model - also no public throughput number.

The reality is that most flash models are fast enough for anything that isn’t real-time audio processing. Unless you’re doing something that genuinely needs 0.1-second responses, the speed differences between flash models won’t be your bottleneck. Your API latency to the provider will dominate.

Step 3.7 Flash’s 400 tok/s claim is the most concrete number available. Whether you can actually achieve that depends on your provider, but it’s a solid headline.

Winner: Step 3.7 Flash (best documented), DeepSeek V4 Flash (highest adoption).

Performance: Benchmarks That Actually Matter

This is where things get interesting. The flash category has traditionally been about “good enough” quality at low cost. But models like Step 3.7 Flash and Claude Haiku 4.5 are pushing flash-tier quality dangerously close to full-size frontier models.

Let me walk through the numbers that matter for real-world use: coding, agent tasks, long-context, and multimodal understanding.

Coding Benchmarks

If you’re using a flash model for anything code-related, SWE-bench is the benchmark to watch. Here’s how the flash models stack up:

Model	SWE-Bench Pro	SWE-Bench Verified	Terminal-Bench 2.1
Step 3.7 Flash	56.3	76.5	59.6
DeepSeek V4 Flash	55.6	79.0	62.0
Gemini 3.5 Flash	55.1	-	76.2
Claude Haiku 4.5	-	>73	-

Sources: StepFun official benchmarks for Step 3.7 Flash, DeepSeek V4 Flash, and Gemini 3.5 Flash. Claude Haiku 4.5 SWE-bench Verified from Anthropic’s model card. Terminal-Bench 2.1 from StepFun testing.

Step 3.7 Flash leads on SWE-Bench Pro (56.3), beating both DeepSeek V4 Flash (55.6) and Gemini 3.5 Flash (55.1). That’s a narrow but real edge.

On SWE-Bench Verified, DeepSeek V4 Flash pulls ahead slightly at 79.0 vs Step 3.7 Flash’s 76.5. Both are extremely strong for flash-tier models. For context, just 18 months ago, GPT-4 scored around 70% on SWE-bench Verified.

Claude Haiku 4.5 doesn’t publish SWE-Bench Pro scores, but Anthropic claims >73% on SWE-bench Verified - putting it in the same ballpark.

Gemini 3.5 Flash has a standout Terminal-Bench score at 76.2, far ahead of every other flash model. Terminal-Bench measures command-line and system interaction tasks. If your agents spend time in bash shells, this matters.

Agent Benchmarks

Flash models are increasingly used as the “worker” in agent architectures. The key benchmarks here:

Model	ClawEval-1.1	Toolathlon	HLE w/ Tools	GDPval
Step 3.7 Flash	67.1	49.5	47.2	45.8
DeepSeek V4 Flash	57.8	52.8	45.1	44.0
Gemini 3.5 Flash	-	56.5	40.2	57.8

Source: StepFun official benchmarks, 2026.

Step 3.7 Flash dominates ClawEval-1.1 with a score of 67.1. The next closest competitor is Claude Opus 4.6 at 70.8 - a frontier-tier model. ClawEval measures “adversarial trap” resistance and multi-turn agent performance. A score this high means the model actually follows instructions across long workflows without drifting.

This is Step 3.7 Flash’s single strongest differentiator. If you’re running autonomous agents that need to stay on-task for dozens of turns, this model is built for it.

On Toolathlon (multi-tool coordination ability), Gemini 3.5 Flash leads at 56.5, with DeepSeek V4 Flash at 52.8 and Step 3.7 Flash at 49.5.

On GDPval (diverse occupational tasks), Gemini 3.5 Flash significantly outpaces the others at 57.8. Step 3.7 Flash and DeepSeek V4 Flash are in the mid-40s.

Multimodal Capabilities

Not all flash models can see. Here’s the breakdown:

Model	Text	Image	Video	Audio	PDF
Step 3.7 Flash	✓	✓	-	-	-
DeepSeek V4 Flash	✓	-	-	-	-
GPT-4o mini	✓	✓	-	-	-
Gemini 2.5 Flash	✓	✓	✓	✓	-
Gemini 3.5 Flash	✓	✓	✓	✓	✓
Claude Haiku 4.5	✓	-	-	-	-

DeepSeek V4 Flash and Claude Haiku 4.5 are text-only. That’s a dealbreaker for many applications - if you need document processing, UI screenshot analysis, or any kind of visual grounding, both are off the table.

Step 3.7 Flash has native vision via its 1.8B-parameter ViT encoder. It processes images directly. It also supports a “Visual Search” tool that compensates for the model’s limited parametric knowledge by searching the web for visual information.

On multimodal benchmarks, Step 3.7 Flash scores 79.2 on SimpleVQA (with search) and 95.3 on V* (with Python tool). Both are near top-of-class, even compared to much larger models.

Gemini 3.5 Flash is the multimodal king: text, image, video, audio, PDF. If your application needs to process diverse input types, no other flash model comes close.

Winner for multimodal: Gemini 3.5 Flash (breadth), Step 3.7 Flash (depth on visual reasoning).

Context Window: How Much Can You Feed It?

Model	Context Window
DeepSeek V4 Flash	1,000,000 tokens
Gemini 2.5 Flash	1,000,000 tokens
Gemini 3.5 Flash	1,000,000 tokens
Step 3.7 Flash	256,000 tokens
Claude 3.5 Haiku	200,000 tokens
Claude Haiku 4.5	200,000 tokens
GPT-4o mini	128,000 tokens

Google and DeepSeek clearly win here - 1M token context windows are table stakes for their flash models. That’s enough to ingest entire codebases or multi-hundred-page documents in a single prompt.

Step 3.7 Flash at 256K is mid-pack. For most applications, 256K is plenty - that’s roughly a 500-page book. But if you’re doing codebase-level reasoning or processing massive legal documents, DeepSeek V4 Flash or Gemini Flash gives you 4× the headroom.

GPT-4o mini at 128K is the most constrained. Still usable, but noticeably less room than the competition.

Winner: DeepSeek V4 Flash and Gemini Flash models (1M).

Real-World Agent Performance: The Step-SWE-Bench

StepFun published something unusual and genuinely useful: performance data across multiple agent harnesses. Most benchmark tables show you how a model performs in a lab setting. StepFun showed how Step 3.7 Flash performs inside real agent frameworks:

Agent Harness	Step 3.5 Flash	Step 3.7 Flash	Improvement
Hermes Agent	60.0%	67.5%	+7.5
OpenClaw	47.0%	67.0%	+20.0
Claude Code	73.0%	71.5%	-1.5
Kilo Code	59.0%	67.5%	+8.5
OpenCode	57.0%	64.5%	+7.5
RooCode	43.0%	64.5%	+21.5
Average	56.5%	67.1%	+10.6

A few things jump out:

The consistency improvement is massive. Step 3.5 Flash ranged from 43% to 73% depending on the harness - a 30-point spread. Step 3.7 Flash ranges from 64.5% to 71.5% - a 7-point spread. This model works reliably across diverse agent frameworks without requiring harness-specific tuning.

OpenClaw and RooCode went from borderline unusable to genuinely good. A 20+ point jump means the model learned better tool-calling conventions that generalize across frameworks.

It doesn’t beat Claude Code’s native model. Claude Code running Step 3.7 Flash (71.5%) slightly trails Claude Code running Step 3.5 Flash (73%). Anthropic’s agent harness is presumably optimized for Anthropic models. But 71.5% is still excellent for a non-Anthropic model running inside Claude’s framework.

Advisor Mode: The Smartest Flash Trick

Here’s the most interesting architectural innovation in Step 3.7 Flash: Advisor Mode.

The idea is borrowed from Anthropic’s advisor strategy: a flash model handles 95% of the agent execution loop. It calls tools, reads results, and iterates autonomously. Only when it hits a genuinely hard decision - planning, recovering from repeated failures - does it escalate to a larger “advisor” model.

StepFun’s numbers:

Step 3.7 Flash alone on SWE-bench Verified: 73.7% at $0.12/task
Step 3.7 Flash + Advisor: 76.3% at $0.19/task
Claude Opus 4.6 alone: 78.7% at $1.76/task

You get 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the cost. That’s the kind of efficiency gain that changes your architecture decisions.

Not every use case needs Advisor Mode. But if you’re running high-volume coding agents and can afford the occasional escalation, this pattern is worth building around.

The Comparison Table: Everything at a Glance

Criteria	Step 3.7 Flash	DeepSeek V4 Flash	Gemini 3.5 Flash	Claude Haiku 4.5	GPT-4o mini
Input $/M	$0.20	$0.098	$1.50	$1.00	$0.15
Output $/M	$1.15	$0.197	$9.00	$5.00	$0.60
Context	256K	1M	1M	200K	128K
Speed	~400 tok/s	Fast*	Fast*	Fast*	Fast*
Multimodal	Text+Image	Text only	All modes	Text only	Text+Image
SWE-Bench Pro	56.3	55.6	55.1	-	-
SWE-Bench Verified	76.5	79.0	-	>73	-
Terminal-Bench 2.1	59.6	62.0	76.2	-	-
ClawEval-1.1	67.1	57.8	-	-	-
Toolathlon	49.5	52.8	56.5	-	-
Open Source	Apache 2.0	Weights	No	No	No
Weekly Usage	578B tok	3.11T tok	530B tok	247B tok	506B tok

* Provider doesn’t publish exact tok/s - positioned as fast tier. Step 3.7 Flash is the only flash model with a published throughput number (~400 tok/s).

Recommendations: Which Flash Model Should You Use?

Cheapest Option: DeepSeek V4 Flash

At $0.098/M input and $0.197/M output, nothing else comes close. It’s 5× cheaper than GPT-4o mini on output. If you’re price-sensitive and don’t need vision, this is your model. The 1M context window and 3.11T weekly token volume on OpenRouter tell you it’s battle-tested at scale.

Catch: No multimodal. Text only.

Best for Coding Agents: Step 3.7 Flash

SWE-Bench Pro leader at 56.3 among flash models. Strong consistency across agent harnesses (64.5%-71.5% on Step-SWE-Bench). Advisor Mode gives you 97% of frontier coding performance at 11% of the cost. Open source under Apache 2.0, so you can self-host.

If you’re building Claude Code alternatives, coding assistants, or autonomous PR-review bots, this is the flash model to beat.

Best for Autonomous Agents: Step 3.7 Flash

ClawEval-1.1 score of 67.1 is the highest of any flash model and competitive with frontier models. This model resists adversarial traps and stays on-task across long multi-turn workflows. If reliability matters more than raw benchmarks - if you can’t afford an agent going off the rails 20 turns in - this is your pick.

Best Overall Quality (If Budget Allows): Claude Haiku 4.5

Anthropic claims Haiku 4.5 matches Sonnet 4’s reasoning and coding quality. At $1/$5 per million tokens, it’s premium pricing for a flash model, but the quality-to-cost ratio might justify it for applications where accuracy is paramount.

Catch: No multimodal, and at this price, you should probably just use Sonnet 4 if you can tolerate slightly higher latency.

Best Multimodal Flash: Gemini 3.5 Flash

Text, image, video, audio, PDF - no other flash model handles this range of input modalities. Terminal-Bench 2.1 score of 76.2 is exceptional for system-level tasks. If your application processes documents, screenshots, or media, this is the obvious choice.

Catch: Expensive at $1.50/$9.00 per million tokens. Make sure you actually need the multimodality before paying the premium.

Best Budget All-Rounder: GPT-4o mini

At $0.15/$0.60 per million, it’s the cheapest option that includes vision. 82% on MMLU means it’s smart enough for most general-purpose tasks. If you need cheap image understanding and don’t need cutting-edge agent or coding performance, GPT-4o mini is still a solid default.

What I’d Actually Do

If I’m building in June 2026, here’s my stack:

Primary coding agent: Step 3.7 Flash with Advisor Mode enabled. The open-source license means no vendor lock-in. The agent benchmark consistency across harnesses means I can switch frameworks without retuning. And the cost - under $0.20 per complex SWE-bench task - is hard to argue with.
High-volume simple tasks (classification, extraction, summarization): DeepSeek V4 Flash. At under $0.10/M input, I can process literally millions of documents for pocket change. The 1M context window handles even the longest documents.
Multimodal document processing: Gemini 3.5 Flash. If I need to understand PDFs with embedded charts, or process video frames, nothing else covers the full range.
Fallback for safety-critical decisions: Claude Haiku 4.5. Anthropic’s safety training and refusal handling is still the gold standard. Worth the premium when mistakes are costly.

The Bottom Line

Step 3.7 Flash earns its place at the top of the flash model conversation. It’s not the cheapest (DeepSeek V4 Flash). It’s not the most multimodal (Gemini 3.5 Flash). It doesn’t have the biggest context window (DeepSeek and Gemini tie at 1M).

But it’s the most practical flash model for agentic software development in mid-2026. The combination of open-source licensing, top-tier agent reliability, strong multimodal understanding, and the innovative Advisor Mode architecture makes it uniquely suited for building production AI applications.

The flash model race is far from over. But right now, Step 3.7 Flash is my pick for the best balance of cost, capability, and control.

Sources

StepFun Official Blog - Step 3.7 Flash launch announcement with benchmark tables. static.stepfun.com/blog/step-3.7-flash/
HuggingFace Model Card - stepfun-ai/Step-3.7-Flash technical specifications and pricing. huggingface.co/stepfun-ai/Step-3.7-Flash
OpenRouter - Model pricing, context windows, and weekly token volumes for all compared models. openrouter.ai/models
Wikipedia - StepFun company background and model history. en.wikipedia.org/wiki/StepFun
South China Morning Post - Coverage of Step 3.5 Flash launch and StepFun positioning. scmp.com/tech/article/3342222
Anthropic - The Advisor Strategy blog post referenced in Advisor Mode architecture. claude.com/blog/the-advisor-strategy

Note: Benchmark scores for DeepSeek V4 Flash, Gemini 3.5 Flash, and competitor models in the detailed comparison tables come from StepFun’s own testing unless otherwise noted. Cross-validated pricing data from OpenRouter reflects provider marketplace rates as of June 2026. Speed data for Step 3.7 Flash (~400 tok/s) comes from the HuggingFace model card; other models do not publish official throughput numbers.

Get our weekly AI digest

The latest AI tools, prompts, and insights — delivered every Tuesday.

No spam. Unsubscribe anytime.

AIUnpacker Editorial Team

Verified

A collective of engineers, journalists, and AI practitioners dedicated to providing hands-on, transparently disclosed analysis of the AI tools shaping tomorrow.

About us ·More articles

StepFun Step 3.7 Flash vs Other AI Models: Speed, Cost, and Performance Compared