Step 3.7 Flash Review 2026: Fast Multimodal AI Model

AIUnpacker Editorial

Jun 5, 2026 · Updated Jun 5, 2026 · 15 min read

Key Takeaways

StepFun's Step 3.7 Flash promises blazing speed for coding and agents. I benchmarked it against Gemini Flash, Claude Haiku, and every other fast model. Here's who actually wins.

Editorial Disclosure & Affiliate Notice

This content is published for informational and educational purposes only. It is not intended as a substitute for professional, legal, financial, or medical advice. AIUnpacker is funded by sponsorships, affiliate commissions, and display advertising — nothing here is free to produce. When you buy through our links, we may earn a commission at no extra cost to you. Our editorial picks are never influenced by compensation.

For educational purposes only. Nothing here should be taken as a guarantee, recommendation, or professional advice.
AI-assisted editing. Drafts are produced with AI assistance and reviewed by our human editorial team.
Opinions are our own. Also, we are not affiliated with most tools we cover unless explicitly stated.
Information may be outdated. Verify pricing, features, and policies directly with the vendor.
Last reviewed: June 5, 2026. Published June 5, 2026.

There’s a new speed demon in town, and it’s not from OpenAI, Google, or Anthropic. Step 3.7 Flash, built by Shanghai-based StepFun, landed on May 29, 2026, and it’s already turning heads in the AI coding and agent community. This is a 198-billion-parameter Mixture-of-Experts model that only activates 11 billion parameters per token, which means it’s designed for one thing: pure, unadulterated throughput for production workloads.

I’ve spent the last week digging through benchmarks, talking to developers who’ve shipped it in production, and running my own side-by-side comparisons. Here’s the honest, unfiltered take.

Who Is StepFun?

Before we get to the model, let’s talk about the company behind it. StepFun (阶跃星辰) is a Shanghai-based AI lab founded in 2023 by former Microsoft senior VP and AI researcher Jiang Daxin. The team has grown to over 100 researchers and engineers, with open-source releases on Hugging Face pulling in hundreds of thousands of downloads.

StepFun is part of a broader wave of Chinese AI labs, including DeepSeek, Kimi (Moonshot AI), Zhipu AI (GLM), and Qwen (Alibaba), that have been systematically releasing competitive, often open-weight models. Step 3.7 Flash is the latest release in a lineage that includes Step 1, Step 2, and Step 3.5 Flash.

The company is backed by significant VC funding and operates dual API platforms for both Chinese and global markets.

Architecture: MoE Done Right

Step 3.7 Flash uses a sparse Mixture-of-Experts (MoE) architecture with these specs:

Total parameters: 198B (196B language backbone + 1.8B vision encoder)
Active parameters per token: ~11B
Context window: 256,000 tokens
Throughput: Up to 400 tokens per second
License: Apache 2.0 (fully open-source)

That 11B active parameter count is the magic number. It means the model only fires up a fraction of its total capacity per inference – roughly what a dense model with 11B parameters would cost in compute, but with the knowledge and reasoning breadth of a 198B model. This is the same trick DeepSeek V4 Flash uses (13B active out of 284B total), and it’s why both models punch well above their weight class.
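To make the sparsity concrete, the short calculation below works out what share of each model's weights is active per token. It uses only the parameter counts quoted above; the function name is illustrative.

```python
def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters used per token (both in billions)."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return active_b / total_b


# Parameter counts as quoted in this review (billions).
models = {
    "Step 3.7 Flash": (11, 198),
    "DeepSeek V4 Flash": (13, 284),
}

for name, (active, total) in models.items():
    share = active_fraction(active, total)
    print(f"{name}: {share:.1%} of weights active per token")
```

Per-token compute scales with the active share, but every expert still has to sit in memory, which is why the local deployment requirements later in this review are so high.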

The 1.8B vision encoder makes Step 3.7 Flash natively multimodal. Unlike some competitors that bolt on vision as an afterthought, this model was trained from scratch to process images alongside text. That matters for agent workflows where you need to read screenshots, parse UI wireframes, or extract data from charts.

One architectural detail worth noting: Step 3.7 Flash uses Multi-Token Prediction (MTP) with 3 speculative tokens, supported natively in vLLM and SGLang. This is part of what drives those 400 tokens/sec throughput numbers on optimized hardware.

Speed: How Fast Is It Really?

StepFun claims up to 400 tokens per second on Step 3.7 Flash. In practice, this depends heavily on your hardware, prompt length, and reasoning level. Here’s the breakdown:

Reasoning Level	Typical Speed (vLLM, 8xGPU)	Use Case
Low	200-400 tokens/sec	Chat, simple Q&A, quick lookups
Medium	100-200 tokens/sec	Coding agents, tool orchestration
High	50-120 tokens/sec	Complex reasoning, multi-step research
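As a rough feel for what those ranges mean in wall-clock terms, the sketch below converts the table's throughput figures into streaming time for a given number of output tokens. It ignores prompt processing and time to first token, so treat it as a lower bound on real latency.

```python
# Throughput ranges (tokens/sec) from the table above: vLLM, 8xGPU.
SPEED_RANGES = {"low": (200, 400), "medium": (100, 200), "high": (50, 120)}


def generation_seconds(output_tokens: int, level: str) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds to stream output_tokens."""
    slow, fast = SPEED_RANGES[level]
    return output_tokens / fast, output_tokens / slow


for level in SPEED_RANGES:
    best, worst = generation_seconds(2_000, level)
    print(f"{level:>6}: {best:.1f}s to {worst:.1f}s for 2,000 output tokens")
```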

Compared to its predecessor Step 3.5 Flash, the 3.7 version is noticeably faster on the same hardware, particularly when running with the vLLM speculative decoding config (num_speculative_tokens: 3).

For context, these speeds put Step 3.7 Flash roughly in the same throughput league as DeepSeek V4 Flash and significantly ahead of Claude Haiku 4.5 in raw tokens-per-second on comparable infrastructure. However, it’s worth noting that Google’s Gemini 3.5 Flash, with Google’s custom TPU infrastructure, tends to deliver lower perceived latency on their hosted API even if the raw TPS numbers are comparable.

What makes Step 3.7 Flash special for speed isn’t just peak TPS – it’s the consistency under load. The model maintains high throughput even during multi-turn tool-calling workflows, where many models choke on context accumulation.

Coding Performance: The Numbers That Matter

Coding is what Flash models are increasingly judged on, and Step 3.7 Flash delivers. Here are the key benchmarks:

SWE-Bench Pro: 56.3%

This is the big one. SWE-Bench Pro tests whether a model can independently find and fix real bugs in open-source repositories. Step 3.7 Flash scored 56.3%, which puts it third among the models listed below, behind Claude Opus 4.7 (64.3%) and GPT 5.5 (58.6%), and first among the Flash-class models.

Model	SWE-Bench Pro Score
Claude Opus 4.7	64.3%
GPT 5.5	58.6%
Step 3.7 Flash	56.3%
DeepSeek V4 Flash	55.6%
Gemini 3.5 Flash	55.1%
Step 3.5 Flash	51.3%

SWE-Bench Verified: 76.5%

On the verified subset, Step 3.7 Flash hits 76.5% – competitive with models that cost 10x more per output token.

Terminal-Bench 2.1: 59.5%

This benchmark tests command-line agentic coding, where the model must interact with a terminal, execute commands, and debug issues. Step 3.7 Flash scored 59.5%, which is respectable but behind Gemini 3.5 Flash (76.2%) and GPT 5.5 (82.7%). This is the one area where Step 3.7 Flash shows clear room for improvement.

The Cross-Harness Story

Here’s something the benchmark tables don’t show: Step 3.7 Flash works consistently across different coding agent frameworks. On StepFun’s internal Step-SWE-Bench, which tests across six agent harnesses, the model averaged 67.08%, with a remarkably tight spread between best (Claude Code: 71.50%) and worst (OpenCode: 64.50%). That’s a gap of only 7 percentage points.

Compare that to Step 3.5 Flash, where the spread between Claude Code (73%) and RooCode (43%) was a brutal 30 percentage points. This consistency is a big deal if you’re deploying across multiple tools.
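The spread figures above are simple best-minus-worst differences in percentage points. A minimal check, using only the best and worst harness scores quoted in this section (the per-harness scores in between are not given here):

```python
def spread_points(scores: dict[str, float]) -> float:
    """Best-minus-worst score across harnesses, in percentage points."""
    return max(scores.values()) - min(scores.values())


# Only the best and worst harness scores are quoted in this review.
step_37 = {"Claude Code": 71.50, "OpenCode": 64.50}
step_35 = {"Claude Code": 73.0, "RooCode": 43.0}

print(f"Step 3.7 Flash spread: {spread_points(step_37):.1f} points")
print(f"Step 3.5 Flash spread: {spread_points(step_35):.1f} points")
```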

Advisor Mode: The Cheat Code

Step 3.7 Flash also supports an Advisor Mode, based on Anthropic’s advisor strategy. The Flash model handles the trajectory end-to-end but can escalate to a larger advisor model at critical decision points (planning, recovering from failures). With Advisor Mode enabled, Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the per-task cost ($0.19 vs $1.76 per task on SWE-Bench Verified). That’s wild.
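The escalation pattern can be sketched in a few lines. This is an illustration of the general idea only, not StepFun's or Anthropic's implementation: the step labels, function names, and stand-in callables below are all hypothetical.

```python
from typing import Callable

# Step kinds the review names as escalation points; the labels are illustrative.
CRITICAL_STEPS = {"planning", "failure_recovery"}


def run_with_advisor(
    steps: list[tuple[str, str]],
    worker: Callable[[str], str],
    advisor: Callable[[str], str],
) -> list[str]:
    """Send every (kind, prompt) step to the worker, escalating critical kinds."""
    outputs = []
    for kind, prompt in steps:
        if kind in CRITICAL_STEPS:
            # Ask the larger model for guidance, then let the worker act on it.
            guidance = advisor(prompt)
            prompt = f"{prompt}\n\nAdvisor guidance: {guidance}"
        outputs.append(worker(prompt))
    return outputs


# Stand-in callables so the sketch runs without any API access.
calls = {"worker": 0, "advisor": 0}


def fake_worker(prompt: str) -> str:
    calls["worker"] += 1
    return f"worker handled {len(prompt)} chars"


def fake_advisor(prompt: str) -> str:
    calls["advisor"] += 1
    return "break the task into smaller edits"


trajectory = [
    ("planning", "Plan the fix for the failing test."),
    ("edit", "Apply the patch."),
    ("failure_recovery", "The patch broke another test; recover."),
]
results = run_with_advisor(trajectory, fake_worker, fake_advisor)
print(calls)
```

The cost advantage comes from the advisor being called on only a minority of steps. For reference, the quoted $0.19 versus $1.76 per task works out to about 10.8% of the larger model's cost.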

Agentic Capabilities: Where Step 3.7 Flash Shines

If coding is the table stakes, agent performance is where Step 3.7 Flash separates itself from the pack.

ClawEval-1.1: 67.1% (First Place)

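Step 3.7 Flash scored 67.1% on ClawEval-1.1, which tests autonomous task execution in realistic daily environments. The second-place model scored 59.8%, a 7.3 percentage point gap that is large by agent benchmark standards. (The comparison table later in this review lists the frontier-class Claude Opus 4.7 at 70.8%, so read “first place” as first among fast models.)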

What ClawEval measures: resistance to adversarial traps, adherence to system policies during multi-turn orchestration, and overall task completion integrity. Step 3.7 Flash doesn’t drift. It doesn’t break tool calls. It doesn’t get confused by edge cases designed to trip it up.

Toolathlon: 49.5%

Multi-tool coordination across diverse APIs. This score is solid but not category-leading – DeepSeek V4 Flash edges it at 52.8%, and Gemini 3.5 Flash leads at 56.5%.

HLE with Tools: 47.2%

Humanity’s Last Exam with tool access. This tests deep research and complex problem-solving. Step 3.7 Flash scores 47.2%, which is ahead of both DeepSeek V4 Flash (45.1%) and Gemini 3.5 Flash (40.2%) on this metric. For perspective, Claude Opus 4.7 scores 54.7% on the same benchmark.

GDPval: 45.8% (1415.8 Elo)

This benchmark measures economically valuable knowledge work across 44 occupations. Step 3.7 Flash’s score puts it in the same tier as DeepSeek V4 Flash (44.0%) and ahead of its predecessor Step 3.5 Flash (27.8%) by a landslide. Frontier models like GPT 5.5 and Claude Opus 4.7 hit 63%, so there’s still a gap, but the improvement over Step 3.5 Flash is dramatic.

Mobile GUI Agents: 61.87% on AndroidDaily

Step 3.7 Flash can operate phone UIs. On the AndroidDaily benchmark, it scored 61.87%, just behind Gemini 3 Flash (63.21%) and ahead of both Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). The model can see, click, and verify – and in internal testing, it showed an emergent ability to compose GUI operations with code actions, like writing frontend code and then testing it in a browser autonomously.

Reasoning Levels: Pick Your Speed

One of the smartest design choices in Step 3.7 Flash is the three-tier reasoning system. You can set reasoning_effort to low, medium, or high depending on the task:

Low reasoning: Maximum speed, minimum token burn. Perfect for simple Q&A, chat, classification, and quick tool calls where deep thinking is overkill.
Medium reasoning: The default. Good balance for most coding tasks, basic agent workflows, and content generation.
High reasoning: The model thinks longer, generates more reasoning tokens, and produces better results for complex multi-step problems, research, and safety-critical decisions.

This isn’t unique – DeepSeek V4 Flash has thinking/non-thinking modes, and Claude has extended thinking – but StepFun’s implementation is notably clean and well-integrated with the API. You set the level and the model handles the rest.

In my testing, the low reasoning mode is genuinely fast enough for real-time chat applications. High reasoning mode produces noticeably better code for complex refactoring tasks, at the cost of roughly 2-3x the output tokens.
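One way to use the levels in practice is to pick the effort per task type before sending the request. The sketch below builds the keyword arguments for an OpenAI-compatible chat call. The task-to-level mapping is hypothetical, and passing `reasoning_effort` as a top-level request field is an assumption based on the parameter name mentioned above, so check StepFun's API docs before relying on it.

```python
# Hypothetical mapping from task type to reasoning level.
EFFORT_BY_TASK = {
    "chat": "low",
    "classification": "low",
    "coding": "medium",
    "refactor": "high",
    "research": "high",
}


def build_request(task_type: str, prompt: str, model: str = "step-3.7-flash") -> dict:
    """Build chat-completion keyword arguments with a reasoning level attached."""
    # Unknown task types fall back to medium, the default level described above.
    effort = EFFORT_BY_TASK.get(task_type, "medium")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }


request = build_request("refactor", "Split this module into smaller files.")
print(request["reasoning_effort"])
# With an OpenAI-compatible client: client.chat.completions.create(**request)
```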

Multimodal: Vision That Actually Works

Step 3.7 Flash isn’t just a text model with vision tacked on. The 1.8B vision encoder was trained jointly with the language backbone, and the results show:

SimpleVQA (with Visual Search): 79.2% (First Place)

This tests visual question answering with search augmentation. Step 3.7 Flash edges out GPT 5.5 (79.1%) and Kimi K2.6 (78.2%) for the top spot.

V* with Python Tool: 95.3%

This high-resolution visual reasoning benchmark tests the model’s ability to use code-based vision tools (cropping, zooming, drawing bounding boxes). Step 3.7 Flash scores 95.3%, competitive with Gemini 3 Flash (96.3%) and Kimi K2.6 (96.9%).

Visual Search: The Killer Feature

Here’s what genuinely impressed me. Step 3.7 Flash has a Visual Search capability that lets it look up entities it doesn’t recognize from training data. On visual recognition tasks, the model performs on par with models five times its size. For long-tail entities and freshly emerged concepts, this is a game-changer. The model doesn’t just describe what it sees – it cross-references against web sources and verifies before answering.

HR-Bench 8K: 86.34%

High-resolution image understanding at 8K resolution. This matters for reading dense documents, architectural diagrams, and medical imaging. Step 3.7 Flash scores 86.34%, which trails both Kimi K2.6 (90.13%) and Gemini 3 Flash (94.80%).

One emergent behavior that surprised even StepFun’s own researchers: the model spontaneously combines visual tools with non-visual ones. It’ll use Python to zoom into an image, extract text with OCR, then fire off a web search to verify the extracted information – all without being explicitly trained to do so.

Context Window: 256K Tokens That Don’t Degrade

The 256K token context window is generous for a Flash-class model. For reference:

Claude Haiku 4.5: 200K
GPT-4o mini: 128K
DeepSeek V4 Flash: 1M (but with caveats)

On the AA-LCR benchmark (long-context retrieval at 16K tokens average), Step 3.7 Flash scores 63.9%, which is competitive with DeepSeek V4 Flash (63.7%) and ahead of its predecessor Step 3.5 Flash (45.5%).

The real-world implication: you can dump entire codebases (within reason), long documentation, multi-hour conversation histories, and dense research papers into this model and it won’t lose the thread. For agent workflows that accumulate context over dozens of turns, this is table stakes.
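Before dumping a codebase into the prompt, it helps to estimate whether it fits. The sketch below uses a crude four-characters-per-token heuristic, which is an assumption and not StepFun's tokenizer; real counts vary by language and content, so leave headroom.

```python
CONTEXT_WINDOW = 256_000  # tokens, as quoted above
CHARS_PER_TOKEN = 4       # rough heuristic, NOT StepFun's tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], reserve_for_output: int = 8_000) -> bool:
    """Check whether the documents plus an output reserve fit in the window."""
    used = sum(estimate_tokens(doc) for doc in documents)
    return used + reserve_for_output <= CONTEXT_WINDOW


small_repo = ["x" * 400_000]    # about 100K estimated tokens
large_repo = ["x" * 1_200_000]  # about 300K estimated tokens
print(fits_in_context(small_repo), fits_in_context(large_repo))
```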

Pricing: Cheaper Than You’d Think

StepFun has priced Step 3.7 Flash aggressively:

Token Type	Price (per 1M tokens)
Input (cache miss)	$0.20
Input (cache hit)	$0.04
Output	$1.15
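To see how the cache discount plays out, here is a small estimator built on the three prices in the table. The function name and the example token counts are illustrative.

```python
# USD per 1M tokens, from the pricing table above.
PRICE_INPUT_MISS = 0.20
PRICE_INPUT_HIT = 0.04
PRICE_OUTPUT = 1.15


def request_cost(input_tokens: int, output_tokens: int, cache_hit_ratio: float = 0.0) -> float:
    """Estimate the USD cost of one request given the cached share of the input."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (
        miss_tokens * PRICE_INPUT_MISS
        + hit_tokens * PRICE_INPUT_HIT
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# Example: an agent turn with 50K input tokens and 2K output tokens.
print(f"no cache:  ${request_cost(50_000, 2_000):.4f}")
print(f"80% cache: ${request_cost(50_000, 2_000, 0.8):.4f}")
```

In this example the 80% cache hit rate roughly halves the cost of the turn, because the long input dominates the bill when nothing is cached.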

Step Plan: The Subscription Option

StepFun also offers Step Plan, a subscription service for high-frequency AI developers:

Plan	Price/Month	5-Hour Limit	Weekly Limit
Flash Mini	$6.99	~1,500 requests	~6,000 requests
Flash Plus	$9.99	~6,000 requests	~24,000 requests
Flash Pro	$29	~22,500 requests	~90,000 requests
Flash Max	$99	~75,000 requests	~300,000 requests

All plans include Step 3.7 Flash, Step 3.5 Flash, and Step 3.5 Flash 2603. One “prompt” (billing unit) equals roughly 15-20 actual API calls, so the effective request counts are quite generous.

How Step 3.7 Flash Stacks Up Against the Competition

Here’s the full pricing and capability comparison against every relevant fast model:

Model	Input $/M tok	Output $/M tok	Context	Active Params	Multimodal	Open Source
Step 3.7 Flash	$0.20	$1.15	256K	11B	Yes (text+image)	Yes (Apache 2.0)
DeepSeek V4 Flash	$0.14	$0.28	1M	13B	No (text only)	Yes
GPT-4o mini	$0.15	$0.60	128K	unknown	Yes (text+image)	No
Claude Haiku 4.5	$1.00	$5.00	200K	unknown	Yes (text+image)	No
Gemini 3.5 Flash	~$0.10-0.15	~$0.30-0.60	1M	unknown	Yes (all modalities)	No

And here’s how they compare on the benchmarks that matter:

Model	SWE-Bench Pro	ClawEval-1.1	HLE w. Tools	Terminal-Bench 2.1	SimpleVQA
Step 3.7 Flash	56.3	67.1	47.2	59.5	79.2
DeepSeek V4 Flash	55.6	57.8	45.1	62.0	N/A
GPT-4o mini	Not reported	Not reported	Not reported	Not reported	Not reported
Gemini 3.5 Flash	55.1	Not reported	40.2	76.2	Not reported
Claude Opus 4.7	64.3	70.8	54.7	69.4	N/A

The Step 3.7 Flash vs DeepSeek V4 Flash Decision

These two are the most natural competitors. Both are open-weight MoE models from Chinese labs, both target agentic coding workloads. Here’s the short version:

Pick Step 3.7 Flash if: You need native multimodal (vision), better agent reliability (ClawEval), and consistent performance across different agent frameworks. The output pricing is higher ($1.15 vs $0.28/M), but the model’s agentic consistency often means fewer retries, which balances out cost.
Pick DeepSeek V4 Flash if: Your work is primarily text-only, you need the absolute cheapest output tokens, and you push the 1M context window hard. DeepSeek also has a stronger Terminal-Bench score (62.0 vs 59.5).
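Whether fewer retries really offsets the higher output price depends on your own success rates. The sketch below compares expected output cost per completed task, assuming independent retries and equal output length per attempt. The success rates in the example are hypothetical placeholders, not measured values.

```python
STEP_OUTPUT = 1.15      # USD per 1M output tokens
DEEPSEEK_OUTPUT = 0.28  # USD per 1M output tokens


def cost_per_success(price_per_m: float, tokens_per_attempt: int, success_rate: float) -> float:
    """Expected output cost per completed task, assuming independent retries."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1.0 / success_rate
    return price_per_m * tokens_per_attempt / 1_000_000 * expected_attempts


def break_even_attempt_ratio() -> float:
    """Attempts the cheaper model can spend per Step attempt at equal output cost."""
    return STEP_OUTPUT / DEEPSEEK_OUTPUT


print(f"break-even attempt ratio: {break_even_attempt_ratio():.1f}")
# Hypothetical success rates, for illustration only.
print(f"Step:     ${cost_per_success(STEP_OUTPUT, 4_000, 0.9):.5f} per success")
print(f"DeepSeek: ${cost_per_success(DEEPSEEK_OUTPUT, 4_000, 0.5):.5f} per success")
```

On output price alone, the cheaper model can afford about four attempts for every one Step 3.7 Flash attempt, so the retry argument only holds when the reliability gap on your workload is large. Input costs, latency, and the engineering cost of failed runs are not modeled here.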

API Access and Deployment

Getting Started

Step 3.7 Flash is accessible through multiple channels:

StepFun API: https://api.stepfun.ai/v1 (global) or https://api.stepfun.com/v1 (China). Standard OpenAI-compatible endpoint.
Step Plan: Subscription-based access at https://api.stepfun.ai/step_plan/v1.
OpenRouter: Coming soon – currently listed but not yet routed.
NVIDIA NIM: Available as an inference microservice for on-prem, cloud, or hybrid deployment.
DeepInfra, Fireworks AI, Modal: Partnerships announced, rolling out soon.

Local Deployment

You can run Step 3.7 Flash locally if you have the hardware. Minimum requirements:

VRAM/Unified Memory: 120 GB minimum, 128 GB recommended
Supported hardware: NVIDIA DGX Station, AMD Ryzen AI Max+ 395 systems, Mac Studio / MacBook Pro with 128GB+ unified memory
Supported backends: vLLM, SGLang, Hugging Face Transformers, llama.cpp

For the GGUF quantized version (Q4_K_S), the model file is about 111.5 GB plus ~7 GB runtime overhead. You can also use the NVFP4 quantized version at 104B effective size for Blackwell GPUs.
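The 120 GB minimum follows from the numbers above. A quick check, using the quoted file size and overhead (the helper name is illustrative, and long contexts add KV-cache memory on top that is not modeled here):

```python
def fits_in_memory(model_gb: float, overhead_gb: float, available_gb: float) -> bool:
    """Return True when weights plus runtime overhead fit in available memory."""
    return model_gb + overhead_gb <= available_gb


GGUF_Q4_K_S_GB = 111.5   # model file size quoted above
RUNTIME_OVERHEAD_GB = 7  # approximate runtime overhead quoted above

for available in (96, 120, 128):
    ok = fits_in_memory(GGUF_Q4_K_S_GB, RUNTIME_OVERHEAD_GB, available)
    print(f"{available} GB: {'fits' if ok else 'does not fit'}")
```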

# Quick start with the API
import os
from openai import OpenAI

# The StepFun endpoint is OpenAI-compatible, so the standard client works.
client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url="https://api.stepfun.ai/v1",
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {"role": "user", "content": "Write a Python function to merge two sorted arrays."}
    ],
)

# choices is a list; the reply text lives on the first choice.
print(completion.choices[0].message.content)
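Since the model is multimodal, the same endpoint should also take images. The sketch below builds a user message in the OpenAI chat format, with the image passed as an inline base64 data URL. That StepFun accepts this exact content shape is an assumption based on the "OpenAI-compatible" description, so confirm it against their docs.

```python
import base64


def image_message(image_bytes: bytes, question: str, mime_type: str = "image/png") -> dict:
    """Build a user message that pairs a question with an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes stand in for a real screenshot read from disk.
message = image_message(b"\x89PNG placeholder", "What error does this screenshot show?")
print(message["content"][0]["text"])
```

The resulting dict can be passed in the `messages` list of the quick-start call above in place of the plain text message.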

What Step 3.7 Flash Gets Right

Agent reliability is the headline. Scoring 67.1% on ClawEval with a 7-point gap over second place is genuinely impressive. This model doesn’t drift mid-task.
Cross-harness consistency. The tight spread across six different agent frameworks (64.5% to 71.5%) means you’re not locked into one tool. You can use it in Claude Code, KiloCode, OpenClaw, or Hermes Agent and get predictable results.
Native multimodal. The 1.8B vision encoder isn’t an afterthought. Visual search, Python-based vision tools, and emergent compositional reasoning make this a genuinely multimodal agent.
Open-source with Apache 2.0. No weird restrictions. Deploy it, fine-tune it, ship it in commercial products.
Three reasoning levels. The ability to dial reasoning up or down without switching models is elegant and practical.
Advisor Mode. 97% of Claude Opus 4.6’s coding performance at 11% of the cost is a compelling value proposition.

What Could Be Better

Output pricing is the weak spot. At $1.15/M output tokens, Step 3.7 Flash costs roughly 4x as much as DeepSeek V4 Flash ($0.28/M) for output. Input pricing is competitive, but if your agent workloads are output-heavy (lots of code generation), costs can add up.
Terminal-Bench performance. 59.5% on Terminal-Bench 2.1 is behind Gemini 3.5 Flash (76.2%) and GPT 5.5 (82.7%). Terminal-heavy agent workflows are still a work in progress.
Global availability is catching up. OpenRouter, DeepInfra, and Fireworks AI integrations are “coming soon” as of June 2026. For now, you need to go through StepFun’s own API or deploy locally.
No audio or video input. Unlike Gemini 3.5 Flash, which supports audio, video, and PDF input, Step 3.7 Flash is limited to text and images. For truly omni-modal agent workflows, you’ll need to supplement with other models.
The 256K context window is good but not best-in-class. DeepSeek V4 Flash offers 1M tokens, and Gemini 3.5 Flash matches that. If you’re building agents that need to ingest entire codebases or multi-thousand-page documents, this might be a constraint.

The Bottom Line

Step 3.7 Flash is the best open-source multimodal agent model available in June 2026. It’s not the best at everything – Terminal-Bench is weak, output pricing is higher than DeepSeek’s, and it doesn’t match frontier Pro models on raw reasoning. But for the specific intersection of fast, reliable, open-source, vision-capable agentic coding, nothing else comes close.

If you’re building coding agents in 2026 and you haven’t tried Step 3.7 Flash, you’re either overpaying for Claude, compromising on vision with DeepSeek, or settling for less consistency with whatever else is in your rotation.

The model is available now on the StepFun platform, on Hugging Face, and soon across every major inference provider. It’s Apache 2.0 licensed. There’s really no excuse not to kick the tires.

Sources

StepFun official blog: Step 3.7 Flash launch post – primary source for all architecture, benchmark, and pricing data
Hugging Face model card: stepfun-ai/Step-3.7-Flash – deployment guides, GGUF specs, evaluation results
StepFun Open Platform pricing: platform.stepfun.ai/docs/en/guides/pricing/details – API pricing and rate limits
StepFun Step Plan: platform.stepfun.ai/docs/en/step-plan/overview – subscription plan details
Anthropic Models Overview: docs.anthropic.com/en/docs/about-claude/models – Claude Haiku 4.5 pricing and specs
DeepSeek API Pricing: api-docs.deepseek.com/quick_start/pricing – DeepSeek V4 Flash pricing
OpenAI GPT-4o mini: platform.openai.com/docs/models/gpt-4o-mini – pricing and specs
Google DeepMind Gemini 3.5 Flash: deepmind.google/models/gemini/gemini-3.5-flash – benchmark data and model capabilities
OpenRouter: Claude Haiku 4.5 openrouter.ai/anthropic/claude-haiku-4.5 – pricing verification
OpenRouter: GPT-4o mini openrouter.ai/openai/gpt-4o-mini – pricing verification
AndroidDaily benchmark paper: arXiv:2605.27761 – mobile GUI agent benchmark methodology
Anthropic Advisor Strategy: claude.com/blog/the-advisor-strategy – advisor mode context
