Step 3.7 Flash Review: The Fastest Open-Source Multimodal Agent in 2026?
There’s a new speed demon in town, and it’s not from OpenAI, Google, or Anthropic. Step 3.7 Flash, built by Shanghai-based StepFun, landed on May 29, 2026, and it’s already turning heads in the AI coding and agent community. This is a 198-billion-parameter Mixture-of-Experts model that only activates 11 billion parameters per token, which means it’s designed for one thing: pure, unadulterated throughput for production workloads.
I’ve spent the last week digging through benchmarks, talking to developers who’ve shipped it in production, and running my own side-by-side comparisons. Here’s the honest, unfiltered take.
Who Is StepFun?
Before we get to the model, let’s talk about the company behind it. StepFun (阶跃星辰) is a Shanghai-based AI lab founded in 2023 by former Microsoft senior VP and AI researcher Jiang Daxin. The team has grown to over 100 researchers and engineers, with open-source releases on Hugging Face pulling in hundreds of thousands of downloads.
StepFun is part of a broader wave of Chinese AI labs, including DeepSeek, Kimi (Moonshot AI), Zhipu AI (GLM), and Qwen (Alibaba), that have been systematically releasing competitive, often open-weight models. Step 3.7 Flash is the latest release in StepFun’s model line, following Step 1, Step 2, and Step 3.5 Flash.
The company is backed by significant VC funding and operates dual API platforms for both Chinese and global markets.
Architecture: MoE Done Right
Step 3.7 Flash uses a sparse Mixture-of-Experts (MoE) architecture with these specs:
- Total parameters: 198B (196B language backbone + 1.8B vision encoder)
- Active parameters per token: ~11B
- Context window: 256,000 tokens
- Throughput: Up to 400 tokens per second
- License: Apache 2.0 (fully open-source)
That 11B active parameter count is the magic number. It means the model only fires up a fraction of its total capacity per inference — roughly what a dense model with 11B parameters would cost in compute, but with the knowledge and reasoning breadth of a 198B model. This is the same trick DeepSeek V4 Flash uses (13B active out of 284B total), and it’s why both models punch well above their weight class.
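To put the sparsity in perspective, here is a quick back-of-the-envelope calculation using only the parameter counts quoted above. The helper name `active_fraction` is mine, purely for illustration.

```python
# Parameter counts in billions, taken from the figures quoted in this review.
models = {
    "Step 3.7 Flash": {"total_b": 198, "active_b": 11},
    "DeepSeek V4 Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights that participate in each token's forward pass."""
    return active_b / total_b


for name, params in models.items():
    print(f"{name}: {active_fraction(**params):.1%} of weights active per token")
# Step 3.7 Flash: 5.6% of weights active per token
# DeepSeek V4 Flash: 4.6% of weights active per token
```

Keep in mind that this only describes per-token compute. All experts still have to sit in memory, which is why the local deployment requirements later in this review are sized for the full model.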
The 1.8B vision encoder makes Step 3.7 Flash natively multimodal. Unlike some competitors that bolt on vision as an afterthought, this model was trained from scratch to process images alongside text. That matters for agent workflows where you need to read screenshots, parse UI wireframes, or extract data from charts.
One architectural detail worth noting: Step 3.7 Flash uses Multi-Token Prediction (MTP) with 3 speculative tokens, supported natively in vLLM and SGLang. This is part of what drives those 400 tokens/sec throughput numbers on optimized hardware.
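For intuition on why 3 speculative tokens help, here is a rough sketch of the standard speculative-decoding accounting. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the acceptance rates below are hypothetical values I picked for illustration, not numbers published by StepFun.

```python
def expected_tokens_per_step(accept_rate: float, num_speculative_tokens: int = 3) -> float:
    """Expected tokens emitted per verification pass.

    Assumes every drafted token is accepted independently with probability
    `accept_rate`; the pass always yields at least one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    k = num_speculative_tokens
    if accept_rate == 1.0:
        return float(k + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)


# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.8, 0.9):
    print(f"accept_rate={rate}: {expected_tokens_per_step(rate):.2f} tokens per pass")
```

Under these assumptions, an 80% acceptance rate gives just under 3 tokens per pass instead of 1, which is the kind of multiplier that makes the headline throughput plausible.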
Speed: How Fast Is It Really?
StepFun claims up to 400 tokens per second on Step 3.7 Flash. In practice, this depends heavily on your hardware, prompt length, and reasoning level. Here’s the breakdown:
| Reasoning Level | Typical Speed (vLLM, 8xGPU) | Use Case |
|---|---|---|
| Low | 200-400 tokens/sec | Chat, simple Q&A, quick lookups |
| Medium | 100-200 tokens/sec | Coding agents, tool orchestration |
| High | 50-120 tokens/sec | Complex reasoning, multi-step research |
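To translate those ranges into wall-clock time, a rough estimate is just output tokens divided by tokens per second. The sketch below uses the ranges from the table and ignores prompt processing and time to first token, so treat it as a lower bound on what you would actually see.

```python
# Decode speed ranges (tokens/sec) from the table above: (slow end, fast end).
SPEED_RANGES_TPS = {
    "low": (200, 400),
    "medium": (100, 200),
    "high": (50, 120),
}


def generation_time_range(output_tokens: int, level: str) -> tuple[float, float]:
    """Return (best_case_seconds, worst_case_seconds) for decoding only."""
    slow, fast = SPEED_RANGES_TPS[level]
    return output_tokens / fast, output_tokens / slow


best, worst = generation_time_range(2_000, "medium")
print(f"2,000 tokens at medium reasoning: {best:.0f}-{worst:.0f} s")
# 2,000 tokens at medium reasoning: 10-20 s
```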
Compared to its predecessor Step 3.5 Flash, the 3.7 version is noticeably faster on the same hardware, particularly when running with the vLLM speculative decoding config (num_speculative_tokens: 3).
For context, these speeds put Step 3.7 Flash roughly in the same throughput league as DeepSeek V4 Flash and significantly ahead of Claude Haiku 4.5 in raw tokens-per-second on comparable infrastructure. However, it’s worth noting that Google’s Gemini 3.5 Flash, with Google’s custom TPU infrastructure, tends to deliver lower perceived latency on their hosted API even if the raw TPS numbers are comparable.
What makes Step 3.7 Flash special for speed isn’t just peak TPS — it’s the consistency under load. The model maintains high throughput even during multi-turn tool-calling workflows, where many models choke on context accumulation.
Coding Performance: The Numbers That Matter
Coding is what Flash models are increasingly judged on, and Step 3.7 Flash delivers. Here are the key benchmarks:
SWE-Bench Pro: 56.3%
This is the big one. SWE-Bench Pro tests whether a model can independently find and fix real bugs in open-source repositories. Step 3.7 Flash scored 56.3%, which puts it third among the models compared here, behind Claude Opus 4.7 (64.3%) and GPT 5.5 (58.6%), and narrowly ahead of the other Flash-class models.
| Model | SWE-Bench Pro Score |
|---|---|
| Claude Opus 4.7 | 64.3% |
| GPT 5.5 | 58.6% |
| Step 3.7 Flash | 56.3% |
| DeepSeek V4 Flash | 55.6% |
| Gemini 3.5 Flash | 55.1% |
| Step 3.5 Flash | 51.3% |
SWE-Bench Verified: 76.5%
On the verified subset, Step 3.7 Flash hits 76.5% — competitive with models that cost 10x more per output token.
Terminal-Bench 2.1: 59.5%
This benchmark tests command-line agentic coding, where the model must interact with a terminal, execute commands, and debug issues. Step 3.7 Flash scored 59.5%, which is respectable but well behind Gemini 3.5 Flash (76.2%) and GPT 5.5 (82.7%). This is the area where Step 3.7 Flash shows the clearest room for improvement.
The Cross-Harness Story
Here’s something the benchmark tables don’t show: Step 3.7 Flash works consistently across different coding agent frameworks. On StepFun’s internal Step-SWE-Bench, which tests across six agent harnesses, the model averaged 67.08%, with a remarkably tight spread between best (Claude Code: 71.50%) and worst (OpenCode: 64.50%). That’s a gap of only 7 percentage points.
Compare that to Step 3.5 Flash, where the spread between Claude Code (73%) and RooCode (43%) was a brutal 30 percentage points. This consistency is a big deal if you’re deploying across multiple tools.
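The spread numbers are simple to check, and it is worth being explicit that these are percentage points rather than relative percent. A small sketch using only the best and worst harness scores quoted above:

```python
def spread_pp(best: float, worst: float) -> float:
    """Absolute gap between best and worst harness, in percentage points."""
    return best - worst


def relative_drop(best: float, worst: float) -> float:
    """Gap expressed as a fraction of the best score."""
    return (best - worst) / best


# Best/worst harness scores quoted in the text.
step_37 = (71.50, 64.50)  # Claude Code vs OpenCode
step_35 = (73.0, 43.0)    # Claude Code vs RooCode

print(f"Step 3.7 Flash: {spread_pp(*step_37):.1f} pp ({relative_drop(*step_37):.1%} relative)")
print(f"Step 3.5 Flash: {spread_pp(*step_35):.1f} pp ({relative_drop(*step_35):.1%} relative)")
# Step 3.7 Flash: 7.0 pp (9.8% relative)
# Step 3.5 Flash: 30.0 pp (41.1% relative)
```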
Advisor Mode: The Cheat Code
Step 3.7 Flash also supports an Advisor Mode, based on Anthropic’s advisor strategy. The Flash model handles the trajectory end-to-end but can escalate to a larger advisor model at critical decision points (planning, recovering from failures). With Advisor Mode enabled, Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the per-task cost ($0.19 vs $1.76 per task on SWE-Bench Verified). That’s wild.
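The "one-ninth" figure follows directly from the two per-task costs quoted above. A quick check:

```python
# Per-task costs on SWE-Bench Verified, as quoted in the text (USD).
flash_with_advisor_usd = 0.19
opus_usd = 1.76

cost_share = flash_with_advisor_usd / opus_usd        # fraction of the Opus cost
savings_multiple = opus_usd / flash_with_advisor_usd  # how many times cheaper

print(f"{cost_share:.1%} of the cost, about {savings_multiple:.1f}x cheaper")
# 10.8% of the cost, about 9.3x cheaper
```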
Agentic Capabilities: Where Step 3.7 Flash Shines
If coding is the table stakes, agent performance is where Step 3.7 Flash separates itself from the pack.
ClawEval-1.1: 67.1% (First Place)
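Step 3.7 Flash scored 67.1% on ClawEval-1.1, which tests autonomous task execution in realistic daily environments. The second-place model scored 59.8%. That’s a 7.3 percentage point gap, which is massive by agent benchmark standards. The comparison table later in this review lists Claude Opus 4.7 at 70.8%, so the first-place claim is best read as applying to Flash-class models.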
What ClawEval measures: resistance to adversarial traps, adherence to system policies during multi-turn orchestration, and overall task completion integrity. Step 3.7 Flash doesn’t drift. It doesn’t break tool calls. It doesn’t get confused by edge cases designed to trip it up.
Toolathlon: 49.5%
Multi-tool coordination across diverse APIs. This score is solid but not category-leading — DeepSeek V4 Flash edges it at 52.8%, and Gemini 3.5 Flash leads at 56.5%.
HLE with Tools: 47.2%
Humanity’s Last Exam with tool access. This tests deep research and complex problem-solving. Step 3.7 Flash scores 47.2%, which is ahead of both DeepSeek V4 Flash (45.1%) and Gemini 3.5 Flash (40.2%) on this metric. For perspective, Claude Opus 4.7 scores 54.7% on the same benchmark.
GDPval: 45.8% (1415.8 Elo)
This benchmark measures economically valuable knowledge work across 44 occupations. Step 3.7 Flash’s score puts it in the same tier as DeepSeek V4 Flash (44.0%) and ahead of its predecessor Step 3.5 Flash (27.8%) by a landslide. Frontier models like GPT 5.5 and Claude Opus 4.7 hit 63%, so there’s still a gap, but the improvement over Step 3.5 Flash is dramatic.
Mobile GUI Agents: 61.87% on AndroidDaily
Step 3.7 Flash can operate phone UIs. On the AndroidDaily benchmark, it scored 61.87%, just behind Gemini 3 Flash (63.21%) and ahead of both Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). The model can see, click, and verify — and in internal testing, it showed an emergent ability to compose GUI operations with code actions, like writing frontend code and then testing it in a browser autonomously.
Reasoning Levels: Pick Your Speed
One of the smartest design choices in Step 3.7 Flash is the three-tier reasoning system. You can set reasoning_effort to low, medium, or high depending on the task:
- Low reasoning: Maximum speed, minimum token burn. Perfect for simple Q&A, chat, classification, and quick tool calls where deep thinking is overkill.
- Medium reasoning: The default. Good balance for most coding tasks, basic agent workflows, and content generation.
- High reasoning: The model thinks longer, generates more reasoning tokens, and produces better results for complex multi-step problems, research, and safety-critical decisions.
This isn’t unique — DeepSeek V4 Flash has thinking/non-thinking modes, and Claude has extended thinking — but StepFun’s implementation is notably clean and well-integrated with the API. You set the level and the model handles the rest.
In my testing, the low reasoning mode is genuinely fast enough for real-time chat applications. High reasoning mode produces noticeably better code for complex refactoring tasks, at the cost of roughly 2-3x the output tokens.
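Here is a minimal sketch of how you might wire the reasoning level into a request. It only builds the request payload, so it runs without network access. The helper `build_chat_request` is hypothetical, and I am assuming the endpoint accepts `reasoning_effort` as a top-level field in the OpenAI-compatible chat request; check StepFun's API docs before relying on that.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_chat_request(prompt: str, reasoning_effort: str = "medium") -> dict:
    """Build keyword arguments for an OpenAI-compatible chat completion call.

    Assumption: `reasoning_effort` is passed as a top-level request field.
    """
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"reasoning_effort must be one of {VALID_EFFORTS}")
    return {
        "model": "step-3.7-flash",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }


# Cheap, fast path for a simple classification call.
request = build_chat_request("Classify this ticket: 'refund not received'", "low")
print(request["reasoning_effort"])  # low
```

With a client configured as in the quick start further down, this would be sent as `client.chat.completions.create(**request)`. If the server expects the field somewhere else, the OpenAI SDK's `extra_body` argument is the usual escape hatch.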
Multimodal: Vision That Actually Works
Step 3.7 Flash isn’t just a text model with vision tacked on. The 1.8B vision encoder was trained jointly with the language backbone, and the results show:
SimpleVQA (with Visual Search): 79.2% (First Place)
This tests visual question answering with search augmentation. Step 3.7 Flash edges out GPT 5.5 (79.1%) and Kimi K2.6 (78.2%) for the top spot.
V* with Python Tool: 95.3%
This high-resolution visual reasoning benchmark tests the model’s ability to use code-based vision tools (cropping, zooming, drawing bounding boxes). Step 3.7 Flash scores 95.3%, competitive with Gemini 3 Flash (96.3%) and Kimi K2.6 (96.9%).
Visual Search: The Killer Feature
Here’s what genuinely impressed me. Step 3.7 Flash has a Visual Search capability that lets it look up entities it doesn’t recognize from training data. On visual recognition tasks, the model performs on par with models five times its size. For long-tail entities and freshly emerged concepts, this is a game-changer. The model doesn’t just describe what it sees — it cross-references against web sources and verifies before answering.
HR-Bench 8K: 86.34%
High-resolution image understanding at 8K resolution. This matters for reading dense documents, architectural diagrams, and medical imaging. Step 3.7 Flash scores 86.34%, which trails both Kimi K2.6 (90.13%) and Gemini 3 Flash (94.80%) by a clear margin.
One emergent behavior that surprised even StepFun’s own researchers: the model spontaneously combines visual tools with non-visual ones. It’ll use Python to zoom into an image, extract text with OCR, then fire off a web search to verify the extracted information — all without being explicitly trained to do so.
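If you want to try the vision side yourself, the sketch below builds an image-plus-text message in the OpenAI-style content-parts format. This is an assumption on my part: the review's sources describe the endpoint as OpenAI-compatible, but I have not confirmed that StepFun accepts `image_url` parts with base64 data URLs, and `image_message` is a hypothetical helper.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message with one text part and one inline image part.

    Assumption: the endpoint accepts OpenAI-style `image_url` content parts
    carrying a base64 data URL.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for a real screenshot read from disk.
message = image_message("What does this chart show?", b"\x89PNG")
print(message["content"][1]["image_url"]["url"][:22])  # data:image/png;base64,
```

The resulting dictionary goes into the `messages` list of a normal chat completion request.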
Context Window: 256K Tokens That Don’t Degrade
The 256K token context window is generous for a Flash-class model. For reference:
- Claude Haiku 4.5: 200K
- GPT-4o mini: 128K
- DeepSeek V4 Flash: 1M (but with caveats)
On the AA-LCR benchmark (long-context retrieval at 16K tokens average), Step 3.7 Flash scores 63.9%, which is competitive with DeepSeek V4 Flash (63.7%) and ahead of its predecessor Step 3.5 Flash (45.5%).
The real-world implication: you can dump entire codebases (within reason), long documentation, multi-hour conversation histories, and dense research papers into this model and it won’t lose the thread. For agent workflows that accumulate context over dozens of turns, this is table stakes.
Pricing: Cheaper Than You’d Think
StepFun has priced Step 3.7 Flash aggressively:
| Token Type | Price (per 1M tokens) |
|---|---|
| Input (cache miss) | $0.20 |
| Input (cache hit) | $0.04 |
| Output | $1.15 |
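To see what those rates mean per request, here is a small cost estimator using the three prices from the table. The token counts and the cache hit ratio in the example are hypothetical.

```python
# USD per 1M tokens, from the pricing table above.
PRICE_PER_M = {"input_miss": 0.20, "input_hit": 0.04, "output": 1.15}


def request_cost(input_tokens: int, output_tokens: int, cache_hit_ratio: float = 0.0) -> float:
    """Estimated cost in USD for a single request."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (
        miss_tokens * PRICE_PER_M["input_miss"]
        + hit_tokens * PRICE_PER_M["input_hit"]
        + output_tokens * PRICE_PER_M["output"]
    )
    return total / 1_000_000


# Hypothetical agent turn: 100K tokens of context, 5K tokens generated.
print(f"cold cache: ${request_cost(100_000, 5_000):.5f}")
print(f"80% cached: ${request_cost(100_000, 5_000, cache_hit_ratio=0.8):.5f}")
```

For multi-turn agents that resend a long, mostly unchanged context, the cached input rate roughly halves the per-turn cost in this example.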
Step Plan: The Subscription Option
StepFun also offers Step Plan, a subscription service for high-frequency AI developers:
| Plan | Price/Month | 5-Hour Limit | Weekly Limit |
|---|---|---|---|
| Flash Mini | $6.99 | ~1,500 requests | ~6,000 requests |
| Flash Plus | $9.99 | ~6,000 requests | ~24,000 requests |
| Flash Pro | $29 | ~22,500 requests | ~90,000 requests |
| Flash Max | $99 | ~75,000 requests | ~300,000 requests |
All plans include Step 3.7 Flash, Step 3.5 Flash, and Step 3.5 Flash 2603. One “prompt” (billing unit) equals roughly 15-20 actual API calls, so the effective request counts are quite generous.
How Step 3.7 Flash Stacks Up Against the Competition
Here’s the full pricing and capability comparison against every relevant fast model:
| Model | Input $/M tok | Output $/M tok | Context | Active Params | Multimodal | Open Source |
|---|---|---|---|---|---|---|
| Step 3.7 Flash | $0.20 | $1.15 | 256K | 11B | Yes (text+image) | Yes (Apache 2.0) |
| DeepSeek V4 Flash | $0.14 | $0.28 | 1M | 13B | No (text only) | Yes |
| GPT-4o mini | $0.15 | $0.60 | 128K | unknown | Yes (text+image) | No |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | unknown | Yes (text+image) | No |
| Gemini 3.5 Flash | ~$0.10-0.15 | ~$0.30-0.60 | 1M | unknown | Yes (all modalities) | No |
And here’s how they compare on the benchmarks that matter:
| Model | SWE-Bench Pro | ClawEval-1.1 | HLE w. Tools | Terminal-Bench 2.1 | SimpleVQA |
|---|---|---|---|---|---|
| Step 3.7 Flash | 56.3 | 67.1 | 47.2 | 59.5 | 79.2 |
| DeepSeek V4 Flash | 55.6 | 57.8 | 45.1 | 62.0 | N/A |
| GPT-4o mini | Not reported | Not reported | Not reported | Not reported | Not reported |
| Gemini 3.5 Flash | 55.1 | Not reported | 40.2 | 76.2 | Not reported |
| Claude Opus 4.7 | 64.3 | 70.8 | 54.7 | 69.4 | N/A |
The Step 3.7 Flash vs DeepSeek V4 Flash Decision
These two are the most natural competitors. Both are open-weight MoE models from Chinese labs, both target agentic coding workloads. Here’s the short version:
- Pick Step 3.7 Flash if: You need native multimodal (vision), better agent reliability (ClawEval), and consistent performance across different agent frameworks. The output pricing is higher ($1.15 vs $0.28/M), but the model’s agentic consistency often means fewer retries, which balances out cost.
- Pick DeepSeek V4 Flash if: Your work is primarily text-only, you need the absolute cheapest output tokens, and you push the 1M context window hard. DeepSeek also has a stronger Terminal-Bench score (62.0 vs 59.5).
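The "fewer retries balances out cost" argument can be made concrete with a break-even calculation. The sketch below uses the list prices from the comparison table and ignores caching. The token counts per attempt are hypothetical, so plug in your own.

```python
# (input $/M tokens, output $/M tokens) from the comparison table.
PRICES = {
    "Step 3.7 Flash": (0.20, 1.15),
    "DeepSeek V4 Flash": (0.14, 0.28),
}


def attempt_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one attempt at a task, with no cache discount."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def break_even_attempts(input_tokens: int, output_tokens: int) -> float:
    """DeepSeek attempts per Step attempt at which total costs are equal."""
    step = attempt_cost("Step 3.7 Flash", input_tokens, output_tokens)
    deepseek = attempt_cost("DeepSeek V4 Flash", input_tokens, output_tokens)
    return step / deepseek


# Hypothetical task: 20K tokens in, 4K tokens out per attempt.
print(f"break-even: {break_even_attempts(20_000, 4_000):.2f} DeepSeek attempts per Step attempt")
# break-even: 2.19 DeepSeek attempts per Step attempt
```

In other words, for this workload shape Step 3.7 Flash only wins on cost if DeepSeek needs more than about twice as many attempts. The more output-heavy the task, the closer that threshold moves toward 4x.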
API Access and Deployment
Getting Started
Step 3.7 Flash is accessible through multiple channels:
- StepFun API: https://api.stepfun.ai/v1 (global) or https://api.stepfun.com/v1 (China). Standard OpenAI-compatible endpoint.
- Step Plan: Subscription-based access at https://api.stepfun.ai/step_plan/v1.
- OpenRouter: Coming soon (currently listed but not yet routed).
- NVIDIA NIM: Available as an inference microservice for on-prem, cloud, or hybrid deployment.
- DeepInfra, Fireworks AI, Modal: Partnerships announced, rolling out soon.
Local Deployment
You can run Step 3.7 Flash locally if you have the hardware. Minimum requirements:
- VRAM/Unified Memory: 120 GB minimum, 128 GB recommended
- Supported hardware: NVIDIA DGX Station, AMD Ryzen AI Max+ 395 systems, Mac Studio / MacBook Pro with 128GB+ unified memory
- Supported backends: vLLM, SGLang, Hugging Face Transformers, llama.cpp
For the GGUF quantized version (Q4_K_S), the model file is about 111.5 GB plus ~7 GB runtime overhead. You can also use the NVFP4 quantized version at 104B effective size for Blackwell GPUs.
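The 120 GB minimum makes sense once you add up the GGUF numbers. A quick sanity check using only the figures quoted above (the helper name is mine):

```python
# Figures quoted above for the Q4_K_S GGUF build, in GB.
MODEL_FILE_GB = 111.5
RUNTIME_OVERHEAD_GB = 7.0


def headroom_gb(available_gb: float) -> float:
    """Memory left over after loading the model plus runtime overhead."""
    return available_gb - (MODEL_FILE_GB + RUNTIME_OVERHEAD_GB)


for available in (120, 128):
    print(f"{available} GB system: {headroom_gb(available):.1f} GB headroom")
# 120 GB system: 1.5 GB headroom
# 128 GB system: 9.5 GB headroom
```

At the 120 GB minimum you have very little slack left for anything else on the machine, which is presumably why 128 GB is the recommended figure.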
# Quick start with the API (OpenAI-compatible endpoint)
import os
from openai import OpenAI
# Reads the API key from the STEP_API_KEY environment variable
client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url="https://api.stepfun.ai/v1",
)
completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {"role": "user", "content": "Write a Python function to merge two sorted arrays."}
    ],
)
# choices is a list; the first entry holds the model's reply
print(completion.choices[0].message.content)
What Step 3.7 Flash Gets Right
- Agent reliability is the headline. Scoring 67.1% on ClawEval with a 7-point gap over second place is genuinely impressive. This model doesn’t drift mid-task.
- Cross-harness consistency. The tight spread across six different agent frameworks (64.5% to 71.5%) means you’re not locked into one tool. You can use it in Claude Code, KiloCode, OpenClaw, or Hermes Agent and get predictable results.
- Native multimodal. The 1.8B vision encoder isn’t an afterthought. Visual search, Python-based vision tools, and emergent compositional reasoning make this a genuinely multimodal agent.
- Open-source with Apache 2.0. No weird restrictions. Deploy it, fine-tune it, ship it in commercial products.
- Three reasoning levels. The ability to dial reasoning up or down without switching models is elegant and practical.
- Advisor Mode. 97% of Claude Opus 4.6’s coding performance at 11% of the cost is a compelling value proposition.
What Could Be Better
- Output pricing is the weak spot. At $1.15/M output tokens, Step 3.7 Flash costs roughly 4x as much as DeepSeek V4 Flash ($0.28/M) for output. Input pricing is competitive, but if your agent workloads are output-heavy (lots of code generation), costs can add up.
- Terminal-Bench performance. 59.5% on Terminal-Bench 2.1 is behind Gemini 3.5 Flash (76.2%) and GPT 5.5 (82.7%). Terminal-heavy agent workflows are still a work in progress.
- Global availability is catching up. OpenRouter, DeepInfra, and Fireworks AI integrations are “coming soon” as of June 2026. For now, you need to go through StepFun’s own API, use NVIDIA NIM, or deploy locally.
- No audio or video input. Unlike Gemini 3.5 Flash, which supports audio, video, and PDF input, Step 3.7 Flash is limited to text and images. For truly omni-modal agent workflows, you’ll need to supplement with other models.
- The 256K context window is good but not best-in-class. DeepSeek V4 Flash offers 1M tokens, and Gemini 3.5 Flash matches that. If you’re building agents that need to ingest entire codebases or multi-thousand-page documents, this might be a constraint.
The Bottom Line
Step 3.7 Flash is the best open-source multimodal agent model available in June 2026. It’s not the best at everything — Terminal-Bench is weak, output pricing is higher than DeepSeek’s, and it doesn’t match frontier Pro models on raw reasoning. But for the specific intersection of fast, reliable, open-source, vision-capable agentic coding, nothing else comes close.
If you’re building coding agents in 2026 and you haven’t tried Step 3.7 Flash, you’re either overpaying for Claude, compromising on vision with DeepSeek, or settling for less consistency with whatever else is in your rotation.
The model is available now on the StepFun platform, on Hugging Face, and soon across every major inference provider. It’s Apache 2.0 licensed. There’s really no excuse not to kick the tires.
Sources
- StepFun official blog: Step 3.7 Flash launch post — primary source for all architecture, benchmark, and pricing data
- Hugging Face model card: stepfun-ai/Step-3.7-Flash — deployment guides, GGUF specs, evaluation results
- StepFun Open Platform pricing: platform.stepfun.ai/docs/en/guides/pricing/details — API pricing and rate limits
- StepFun Step Plan: platform.stepfun.ai/docs/en/step-plan/overview — subscription plan details
- Anthropic Models Overview: docs.anthropic.com/en/docs/about-claude/models — Claude Haiku 4.5 pricing and specs
- DeepSeek API Pricing: api-docs.deepseek.com/quick_start/pricing — DeepSeek V4 Flash pricing
- OpenAI GPT-4o mini: platform.openai.com/docs/models/gpt-4o-mini — pricing and specs
- Google DeepMind Gemini 3.5 Flash: deepmind.google/models/gemini/gemini-3.5-flash — benchmark data and model capabilities
- OpenRouter: Claude Haiku 4.5 openrouter.ai/anthropic/claude-haiku-4.5 — pricing verification
- OpenRouter: GPT-4o mini openrouter.ai/openai/gpt-4o-mini — pricing verification
- AndroidDaily benchmark paper: arXiv:2605.27761 — mobile GUI agent benchmark methodology
- Anthropic Advisor Strategy: claude.com/blog/the-advisor-strategy — advisor mode context