StepFun Step 3.7 Flash vs Other AI Models: Speed, Cost, and Performance Compared
If you’re building anything that calls an LLM in a loop - agents, coding assistants, customer support bots - you’ve probably asked yourself the same question I have: which flash model actually delivers?
The “flash” AI model category has exploded in 2025-2026. Every major lab now has a fast, cheap, lightweight model. StepFun dropped Step 3.7 Flash on May 28. DeepSeek shipped V4 Flash. Google has Gemini 2.5 Flash and the brand-new Gemini 3.5 Flash. Anthropic has Claude Haiku 4.5. OpenAI still runs GPT-4o mini. And Qwen has its own flash-tier offerings.
They can’t all be the best. So I dug into the numbers.
This isn’t a vibes-based comparison. I pulled pricing from OpenRouter, benchmark scores from official model cards and HuggingFace, throughput specs from provider documentation, and cross-referenced everything I could against independent sources. Where data comes from self-reported benchmarks, I say so.
Here’s what I found.
The Contenders: A Quick Rundown
| Model | Release Date | Total Params | Active Params | Architecture | Open Source |
|---|---|---|---|---|---|
| Step 3.7 Flash | May 2026 | 196B (+1.8B ViT) | ~11B | MoE | Apache 2.0 |
| DeepSeek V4 Flash | Apr 2026 | 284B | ~13B | MoE | Weights available |
| Gemini 2.5 Flash | Jun 2025 | Undisclosed | Undisclosed | Proprietary | No |
| Gemini 3.5 Flash | May 2026 | Undisclosed | Undisclosed | Proprietary | No |
| Claude 3.5 Haiku | Oct 2024 | Undisclosed | Undisclosed | Proprietary | No |
| Claude Haiku 4.5 | Oct 2025 | Undisclosed | Undisclosed | Proprietary | No |
| GPT-4o mini | Jul 2024 | Undisclosed | Undisclosed | Proprietary | No |
Step 3.7 Flash and DeepSeek V4 Flash are the only open-source models in this comparison. Step 3.7 Flash is Apache 2.0 licensed - you can run it on your own hardware, which matters if you care about data sovereignty or want to avoid API rate limits. DeepSeek V4 Flash also publishes model weights on HuggingFace.
Both use Mixture-of-Experts architectures: large total parameter counts but only a fraction active per token. Step 3.7 Flash activates roughly 11B of its 196B parameters per forward pass. DeepSeek V4 Flash activates about 13B of 284B. The others don’t disclose architecture details.
Pricing: Who’s Actually Cheap?
Price is the flash category’s reason to exist. Here’s the raw cost per million tokens on OpenRouter as of June 2026:
| Model | Input ($/M tokens) | Output ($/M tokens) | Cache Hit Input | Output/Input Ratio |
|---|---|---|---|---|
| DeepSeek V4 Flash | $0.098 | $0.197 | N/A | 2.0× |
| GPT-4o mini | $0.15 | $0.60 | N/A | 4.0× |
| Step 3.7 Flash | $0.20 | $1.15 | $0.04 | 5.75× |
| DeepSeek V3 0324 | $0.20 | $0.77 | N/A | 3.85× |
| Gemini 2.5 Flash | $0.30 | $2.50 | N/A | 8.3× |
| Claude 3.5 Haiku | $0.80 | $4.00 | N/A | 5.0× |
| Claude Haiku 4.5 | $1.00 | $5.00 | N/A | 5.0× |
| Gemini 3.5 Flash | $1.50 | $9.00 | N/A | 6.0× |
DeepSeek V4 Flash is the clear winner on raw price. Ten cents per million input tokens and under twenty cents for output. That’s absurdly cheap. You could process the entire King James Bible (~780K words) for about 15 cents.
GPT-4o mini comes second at $0.15/M input. But its output price - $0.60/M - is triple DeepSeek’s. If your application generates a lot of output (coding agents, content generation), that gap adds up fast.
Step 3.7 Flash sits in the affordable middle: $0.20/M input, $1.15/M output. It’s cheaper than every Gemini and Claude model. The real sleeper feature is its $0.04/M cache hit rate - if your application has repetitive system prompts or RAG contexts, your effective cost drops dramatically.
Gemini 3.5 Flash at $1.50/$9.00 is genuinely expensive for a “flash” model. At that price, you’re paying frontier-model money. You’d better be getting frontier-model performance to justify it.
Claude Haiku 4.5 at $1/$5 is pricier than most flash alternatives too, but Anthropic argues it delivers near-Sonnet quality. More on that below.
Winner: DeepSeek V4 Flash. But price isn’t everything.
Speed: Tokens Per Second
Speed benchmarks are harder to pin down because throughput depends heavily on the provider, batch size, and current load. But here’s what the official documentation says:
- Step 3.7 Flash: Up to 400 tokens/second (per HuggingFace model card). That’s fast enough for real-time chat and sub-second agent decisions.
- DeepSeek V4 Flash: Designed for “fast inference and high-throughput workloads” - official throughput numbers aren’t published, but OpenRouter shows it handling 3.11 trillion weekly tokens, by far the highest volume of any model in this comparison. The infrastructure is clearly optimized.
- Gemini Flash models: Google positions both 2.5 and 3.5 Flash as their “fast” tier, but doesn’t publish raw tok/s numbers. In practice, latency feels comparable to other flash models.
- Claude Haiku 4.5: Anthropic calls it their “fastest and most efficient model,” purpose-built for “real-time and high-volume applications.” No public tok/s figure.
- GPT-4o mini: OpenAI’s smallest/fastest model - also no public throughput number.
The reality is that most flash models are fast enough for anything that isn’t real-time audio processing. Unless you’re doing something that genuinely needs 0.1-second responses, the speed differences between flash models won’t be your bottleneck. Your API latency to the provider will dominate.
Step 3.7 Flash’s 400 tok/s claim is the most concrete number available. Whether you can actually achieve that depends on your provider, but it’s a solid headline.
Winner: Step 3.7 Flash (best documented), DeepSeek V4 Flash (highest adoption).
Performance: Benchmarks That Actually Matter
This is where things get interesting. The flash category has traditionally been about “good enough” quality at low cost. But models like Step 3.7 Flash and Claude Haiku 4.5 are pushing flash-tier quality dangerously close to full-size frontier models.
Let me walk through the numbers that matter for real-world use: coding, agent tasks, long-context, and multimodal understanding.
Coding Benchmarks
If you’re using a flash model for anything code-related, SWE-bench is the benchmark to watch. Here’s how the flash models stack up:
| Model | SWE-Bench Pro | SWE-Bench Verified | Terminal-Bench 2.1 |
|---|---|---|---|
| Step 3.7 Flash | 56.3 | 76.5 | 59.6 |
| DeepSeek V4 Flash | 55.6 | 79.0 | 62.0 |
| Gemini 3.5 Flash | 55.1 | - | 76.2 |
| Claude Haiku 4.5 | - | >73 | - |
Sources: StepFun official benchmarks for Step 3.7 Flash, DeepSeek V4 Flash, and Gemini 3.5 Flash. Claude Haiku 4.5 SWE-bench Verified from Anthropic’s model card. Terminal-Bench 2.1 from StepFun testing.
Step 3.7 Flash leads on SWE-Bench Pro (56.3), beating both DeepSeek V4 Flash (55.6) and Gemini 3.5 Flash (55.1). That’s a narrow but real edge.
On SWE-Bench Verified, DeepSeek V4 Flash pulls ahead slightly at 79.0 vs Step 3.7 Flash’s 76.5. Both are extremely strong for flash-tier models. For context, just 18 months ago, GPT-4 scored around 70% on SWE-bench Verified.
Claude Haiku 4.5 doesn’t publish SWE-Bench Pro scores, but Anthropic claims >73% on SWE-bench Verified - putting it in the same ballpark.
Gemini 3.5 Flash has a standout Terminal-Bench score at 76.2, far ahead of every other flash model. Terminal-Bench measures command-line and system interaction tasks. If your agents spend time in bash shells, this matters.
Agent Benchmarks
Flash models are increasingly used as the “worker” in agent architectures. The key benchmarks here:
| Model | ClawEval-1.1 | Toolathlon | HLE w/ Tools | GDPval |
|---|---|---|---|---|
| Step 3.7 Flash | 67.1 | 49.5 | 47.2 | 45.8 |
| DeepSeek V4 Flash | 57.8 | 52.8 | 45.1 | 44.0 |
| Gemini 3.5 Flash | - | 56.5 | 40.2 | 57.8 |
Source: StepFun official benchmarks, 2026.
Step 3.7 Flash dominates ClawEval-1.1 with a score of 67.1. The next closest competitor is Claude Opus 4.6 at 70.8 - a frontier-tier model. ClawEval measures “adversarial trap” resistance and multi-turn agent performance. A score this high means the model actually follows instructions across long workflows without drifting.
This is Step 3.7 Flash’s single strongest differentiator. If you’re running autonomous agents that need to stay on-task for dozens of turns, this model is built for it.
On Toolathlon (multi-tool coordination ability), Gemini 3.5 Flash leads at 56.5, with DeepSeek V4 Flash at 52.8 and Step 3.7 Flash at 49.5.
On GDPval (diverse occupational tasks), Gemini 3.5 Flash significantly outpaces the others at 57.8. Step 3.7 Flash and DeepSeek V4 Flash are in the mid-40s.
Multimodal Capabilities
Not all flash models can see. Here’s the breakdown:
| Model | Text | Image | Video | Audio | |
|---|---|---|---|---|---|
| Step 3.7 Flash | ✓ | ✓ | - | - | - |
| DeepSeek V4 Flash | ✓ | - | - | - | - |
| GPT-4o mini | ✓ | ✓ | - | - | - |
| Gemini 2.5 Flash | ✓ | ✓ | ✓ | ✓ | - |
| Gemini 3.5 Flash | ✓ | ✓ | ✓ | ✓ | ✓ |
| Claude Haiku 4.5 | ✓ | - | - | - | - |
DeepSeek V4 Flash and Claude Haiku 4.5 are text-only. That’s a dealbreaker for many applications - if you need document processing, UI screenshot analysis, or any kind of visual grounding, both are off the table.
Step 3.7 Flash has native vision via its 1.8B-parameter ViT encoder. It processes images directly. It also supports a “Visual Search” tool that compensates for the model’s limited parametric knowledge by searching the web for visual information.
On multimodal benchmarks, Step 3.7 Flash scores 79.2 on SimpleVQA (with search) and 95.3 on V* (with Python tool). Both are near top-of-class, even compared to much larger models.
Gemini 3.5 Flash is the multimodal king: text, image, video, audio, PDF. If your application needs to process diverse input types, no other flash model comes close.
Winner for multimodal: Gemini 3.5 Flash (breadth), Step 3.7 Flash (depth on visual reasoning).
Context Window: How Much Can You Feed It?
| Model | Context Window |
|---|---|
| DeepSeek V4 Flash | 1,000,000 tokens |
| Gemini 2.5 Flash | 1,000,000 tokens |
| Gemini 3.5 Flash | 1,000,000 tokens |
| Step 3.7 Flash | 256,000 tokens |
| Claude 3.5 Haiku | 200,000 tokens |
| Claude Haiku 4.5 | 200,000 tokens |
| GPT-4o mini | 128,000 tokens |
Google and DeepSeek clearly win here - 1M token context windows are table stakes for their flash models. That’s enough to ingest entire codebases or multi-hundred-page documents in a single prompt.
Step 3.7 Flash at 256K is mid-pack. For most applications, 256K is plenty - that’s roughly a 500-page book. But if you’re doing codebase-level reasoning or processing massive legal documents, DeepSeek V4 Flash or Gemini Flash gives you 4× the headroom.
GPT-4o mini at 128K is the most constrained. Still usable, but noticeably less room than the competition.
Winner: DeepSeek V4 Flash and Gemini Flash models (1M).
Real-World Agent Performance: The Step-SWE-Bench
StepFun published something unusual and genuinely useful: performance data across multiple agent harnesses. Most benchmark tables show you how a model performs in a lab setting. StepFun showed how Step 3.7 Flash performs inside real agent frameworks:
| Agent Harness | Step 3.5 Flash | Step 3.7 Flash | Improvement |
|---|---|---|---|
| Hermes Agent | 60.0% | 67.5% | +7.5 |
| OpenClaw | 47.0% | 67.0% | +20.0 |
| Claude Code | 73.0% | 71.5% | -1.5 |
| Kilo Code | 59.0% | 67.5% | +8.5 |
| OpenCode | 57.0% | 64.5% | +7.5 |
| RooCode | 43.0% | 64.5% | +21.5 |
| Average | 56.5% | 67.1% | +10.6 |
A few things jump out:
The consistency improvement is massive. Step 3.5 Flash ranged from 43% to 73% depending on the harness - a 30-point spread. Step 3.7 Flash ranges from 64.5% to 71.5% - a 7-point spread. This model works reliably across diverse agent frameworks without requiring harness-specific tuning.
OpenClaw and RooCode went from borderline unusable to genuinely good. A 20+ point jump means the model learned better tool-calling conventions that generalize across frameworks.
It doesn’t beat Claude Code’s native model. Claude Code running Step 3.7 Flash (71.5%) slightly trails Claude Code running Step 3.5 Flash (73%). Anthropic’s agent harness is presumably optimized for Anthropic models. But 71.5% is still excellent for a non-Anthropic model running inside Claude’s framework.
Advisor Mode: The Smartest Flash Trick
Here’s the most interesting architectural innovation in Step 3.7 Flash: Advisor Mode.
The idea is borrowed from Anthropic’s advisor strategy: a flash model handles 95% of the agent execution loop. It calls tools, reads results, and iterates autonomously. Only when it hits a genuinely hard decision - planning, recovering from repeated failures - does it escalate to a larger “advisor” model.
StepFun’s numbers:
- Step 3.7 Flash alone on SWE-bench Verified: 73.7% at $0.12/task
- Step 3.7 Flash + Advisor: 76.3% at $0.19/task
- Claude Opus 4.6 alone: 78.7% at $1.76/task
You get 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the cost. That’s the kind of efficiency gain that changes your architecture decisions.
Not every use case needs Advisor Mode. But if you’re running high-volume coding agents and can afford the occasional escalation, this pattern is worth building around.
The Comparison Table: Everything at a Glance
| Criteria | Step 3.7 Flash | DeepSeek V4 Flash | Gemini 3.5 Flash | Claude Haiku 4.5 | GPT-4o mini |
|---|---|---|---|---|---|
| Input $/M | $0.20 | $0.098 | $1.50 | $1.00 | $0.15 |
| Output $/M | $1.15 | $0.197 | $9.00 | $5.00 | $0.60 |
| Context | 256K | 1M | 1M | 200K | 128K |
| Speed | ~400 tok/s | Fast* | Fast* | Fast* | Fast* |
| Multimodal | Text+Image | Text only | All modes | Text only | Text+Image |
| SWE-Bench Pro | 56.3 | 55.6 | 55.1 | - | - |
| SWE-Bench Verified | 76.5 | 79.0 | - | >73 | - |
| Terminal-Bench 2.1 | 59.6 | 62.0 | 76.2 | - | - |
| ClawEval-1.1 | 67.1 | 57.8 | - | - | - |
| Toolathlon | 49.5 | 52.8 | 56.5 | - | - |
| Open Source | Apache 2.0 | Weights | No | No | No |
| Weekly Usage | 578B tok | 3.11T tok | 530B tok | 247B tok | 506B tok |
* Provider doesn’t publish exact tok/s - positioned as fast tier. Step 3.7 Flash is the only flash model with a published throughput number (~400 tok/s).
Recommendations: Which Flash Model Should You Use?
Cheapest Option: DeepSeek V4 Flash
At $0.098/M input and $0.197/M output, nothing else comes close. It’s 5× cheaper than GPT-4o mini on output. If you’re price-sensitive and don’t need vision, this is your model. The 1M context window and 3.11T weekly token volume on OpenRouter tell you it’s battle-tested at scale.
Catch: No multimodal. Text only.
Best for Coding Agents: Step 3.7 Flash
SWE-Bench Pro leader at 56.3 among flash models. Strong consistency across agent harnesses (64.5%-71.5% on Step-SWE-Bench). Advisor Mode gives you 97% of frontier coding performance at 11% of the cost. Open source under Apache 2.0, so you can self-host.
If you’re building Claude Code alternatives, coding assistants, or autonomous PR-review bots, this is the flash model to beat.
Best for Autonomous Agents: Step 3.7 Flash
ClawEval-1.1 score of 67.1 is the highest of any flash model and competitive with frontier models. This model resists adversarial traps and stays on-task across long multi-turn workflows. If reliability matters more than raw benchmarks - if you can’t afford an agent going off the rails 20 turns in - this is your pick.
Best Overall Quality (If Budget Allows): Claude Haiku 4.5
Anthropic claims Haiku 4.5 matches Sonnet 4’s reasoning and coding quality. At $1/$5 per million tokens, it’s premium pricing for a flash model, but the quality-to-cost ratio might justify it for applications where accuracy is paramount.
Catch: No multimodal, and at this price, you should probably just use Sonnet 4 if you can tolerate slightly higher latency.
Best Multimodal Flash: Gemini 3.5 Flash
Text, image, video, audio, PDF - no other flash model handles this range of input modalities. Terminal-Bench 2.1 score of 76.2 is exceptional for system-level tasks. If your application processes documents, screenshots, or media, this is the obvious choice.
Catch: Expensive at $1.50/$9.00 per million tokens. Make sure you actually need the multimodality before paying the premium.
Best Budget All-Rounder: GPT-4o mini
At $0.15/$0.60 per million, it’s the cheapest option that includes vision. 82% on MMLU means it’s smart enough for most general-purpose tasks. If you need cheap image understanding and don’t need cutting-edge agent or coding performance, GPT-4o mini is still a solid default.
What I’d Actually Do
If I’m building in June 2026, here’s my stack:
-
Primary coding agent: Step 3.7 Flash with Advisor Mode enabled. The open-source license means no vendor lock-in. The agent benchmark consistency across harnesses means I can switch frameworks without retuning. And the cost - under $0.20 per complex SWE-bench task - is hard to argue with.
-
High-volume simple tasks (classification, extraction, summarization): DeepSeek V4 Flash. At under $0.10/M input, I can process literally millions of documents for pocket change. The 1M context window handles even the longest documents.
-
Multimodal document processing: Gemini 3.5 Flash. If I need to understand PDFs with embedded charts, or process video frames, nothing else covers the full range.
-
Fallback for safety-critical decisions: Claude Haiku 4.5. Anthropic’s safety training and refusal handling is still the gold standard. Worth the premium when mistakes are costly.
The Bottom Line
Step 3.7 Flash earns its place at the top of the flash model conversation. It’s not the cheapest (DeepSeek V4 Flash). It’s not the most multimodal (Gemini 3.5 Flash). It doesn’t have the biggest context window (DeepSeek and Gemini tie at 1M).
But it’s the most practical flash model for agentic software development in mid-2026. The combination of open-source licensing, top-tier agent reliability, strong multimodal understanding, and the innovative Advisor Mode architecture makes it uniquely suited for building production AI applications.
The flash model race is far from over. But right now, Step 3.7 Flash is my pick for the best balance of cost, capability, and control.
Sources
- StepFun Official Blog - Step 3.7 Flash launch announcement with benchmark tables. static.stepfun.com/blog/step-3.7-flash/
- HuggingFace Model Card - stepfun-ai/Step-3.7-Flash technical specifications and pricing. huggingface.co/stepfun-ai/Step-3.7-Flash
- OpenRouter - Model pricing, context windows, and weekly token volumes for all compared models. openrouter.ai/models
- Wikipedia - StepFun company background and model history. en.wikipedia.org/wiki/StepFun
- South China Morning Post - Coverage of Step 3.5 Flash launch and StepFun positioning. scmp.com/tech/article/3342222
- Anthropic - The Advisor Strategy blog post referenced in Advisor Mode architecture. claude.com/blog/the-advisor-strategy
Note: Benchmark scores for DeepSeek V4 Flash, Gemini 3.5 Flash, and competitor models in the detailed comparison tables come from StepFun’s own testing unless otherwise noted. Cross-validated pricing data from OpenRouter reflects provider marketplace rates as of June 2026. Speed data for Step 3.7 Flash (~400 tok/s) comes from the HuggingFace model card; other models do not publish official throughput numbers.