Step 3.7 Flash is a high-efficiency, multimodal Mixture-of-Experts (MoE) vision-language model designed for real-world agentic workflows, developed by StepFun.
overview
Step 3.7 Flash is a high-efficiency, multimodal Mixture-of-Experts (MoE) vision-language model developed by StepFun for AI developers and enterprise users who build and deploy AI agents. It provides perception, search, and multi-step reasoning at production scale for agentic workflows across digital environments. Released on May 28, 2026, the model is a 198-billion-parameter sparse MoE that activates approximately 11 billion parameters per token during inference, which keeps throughput high. It pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image and video understanding. The model supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) to balance speed, cost, and reasoning depth.
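The sparsity figures in the overview can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts quoted above; the constant and function names are illustrative and are not part of any StepFun API.

```python
# Parameter counts quoted in the overview, in billions.
TOTAL_PARAMS_B = 198.0       # total parameters in the sparse MoE model
ACTIVE_PARAMS_B = 11.0       # approximate parameters activated per token
LANGUAGE_BACKBONE_B = 196.0  # language backbone
VISION_ENCODER_B = 1.8       # vision encoder


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters that is active for each token."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return active_b / total_b


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    components = LANGUAGE_BACKBONE_B + VISION_ENCODER_B
    print(f"Active share per token: {share:.1%}")
    print(f"Backbone + vision encoder: {components:.1f}B of {TOTAL_PARAMS_B:.0f}B")
```

Roughly 5.6% of the parameters are active per token, and the two quoted components sum to 197.8B, which is consistent with the rounded 198B total.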
quick facts
| Attribute | Value |
|---|---|
| Developer | StepFun |
| Business Model | Freemium, Usage-based |
| Pricing | Free tier, then usage-based: $0.00020 per 1k input tokens, $0.00115 per 1k output tokens |
| Platforms | API, Web (StepFun Open Platform) |
| API Available | Yes |
| Integrations | NVIDIA NIM, SGLang, NVIDIA TensorRT-LLM, vLLM, Hugging Face, OpenRouter, ModelScope |
| Founded | 2023 |
| HQ | Shanghai, China |
features
Step 3.7 Flash incorporates a suite of technical features designed for high-performance agentic AI applications, leveraging a multimodal Mixture-of-Experts architecture. These capabilities enable advanced perception, reasoning, and action across diverse data types and operational environments.
use cases
Step 3.7 Flash is engineered for professionals and organizations requiring advanced multimodal AI capabilities for agentic workflows, particularly those focused on automation, complex data interpretation, and application development.
pricing
Step 3.7 Flash uses a freemium, usage-based pricing model: users can start on a free tier and then pay according to token consumption. Rate limits apply to concurrency, requests per minute (RPM), and tokens per minute (TPM), and each request times out after 10 minutes. Users who need higher limits can contact platform@stepfun.com.
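A client can stay under RPM and TPM limits by tracking its own usage over a rolling 60-second window. The sketch below is one generic way to do that and is not StepFun's implementation. The limit values passed to the class are hypothetical, since the actual quotas are not listed here, and `SlidingWindowLimiter` is an illustrative name. Only the 10-minute timeout comes from the text above.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple

# The 10-minute request timeout stated above, in seconds.
REQUEST_TIMEOUT_SECONDS = 10 * 60


class SlidingWindowLimiter:
    """Client-side tracker for RPM and TPM over a rolling 60-second window."""

    def __init__(self, rpm: int, tpm: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rpm < 1 or tpm < 1:
            raise ValueError("rpm and tpm must be at least 1")
        self.rpm = rpm  # hypothetical requests-per-minute quota
        self.tpm = tpm  # hypothetical tokens-per-minute quota
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _prune(self, now: float) -> None:
        # Drop events that have left the 60-second window.
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()

    def wait_time(self, tokens: int) -> float:
        """Return the seconds to wait before a request of `tokens` fits."""
        if tokens > self.tpm:
            raise ValueError("request exceeds the per-minute token budget")
        now = self.clock()
        self._prune(now)
        used = sum(t for _, t in self.events)
        if len(self.events) < self.rpm and used + tokens <= self.tpm:
            return 0.0
        # Walk the oldest events until enough of them would have expired.
        requests_left, tokens_left = len(self.events), used
        for timestamp, event_tokens in self.events:
            requests_left -= 1
            tokens_left -= event_tokens
            if requests_left < self.rpm and tokens_left + tokens <= self.tpm:
                return max(0.0, timestamp + 60.0 - now)
        return 60.0  # not reached when rpm >= 1 and tokens <= tpm

    def record(self, tokens: int) -> None:
        """Register a request that was just sent."""
        self.events.append((self.clock(), tokens))
```

A caller would sleep for `wait_time(n)` seconds, send the request with `REQUEST_TIMEOUT_SECONDS` as its client timeout, and then call `record(n)`. Concurrency limits are not modelled in this sketch.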
competitors
Step 3.7 Flash is positioned as a leading multimodal agentic model, competing in the 'Flash' model market against established and emerging AI solutions. Its strengths lie in native multimodal perception, robust tool orchestration, and competitive performance in coding and visual intelligence benchmarks.
Gemini is a multimodal AI model capable of understanding and operating across various data types, including images, video, and text, enabling sophisticated reasoning and direct UI control.
Similar to Step 3.7 Flash, Gemini offers real-time perception and action capabilities, particularly strong in multimodal understanding and complex decision-making. Its freemium access is typically via API for developers, allowing for the creation of custom agents.
AskUI Vision Agent specializes in automating desktop and mobile workflows by visually understanding and interacting with graphical user interfaces at the operating system level.
AskUI Vision Agent is a direct competitor that focuses on the 'see and act' side of digital interfaces, translating visual data into low-level commands. Its specialization in GUI automation makes it a focused alternative to a general-purpose 'flash-speed' agent model.
Skygen is an AI desktop automation agent that provides real-time visibility and runs tasks across various applications, websites, and cloud computers.
Skygen aligns closely with Step 3.7 Flash's description of a 'flash-speed agent model that can see and act' within digital environments, emphasizing real-time operation and broad application interaction. It offers a freemium model, similar to the described pricing of Step 3.7 Flash.
OpenAI Operator is designed to execute multi-step actions directly within a web browser, enabling autonomous completion of complex web tasks.
While its pricing is listed as a paid 'Pro' tier rather than freemium, OpenAI Operator offers a direct functional comparison by focusing on agents that 'see' (perceive web interfaces) and 'act' (perform tasks) at speed within a browser environment.
Agno AI Agents is a framework built for performance, enabling the creation of lightning-fast, production-ready AI agents with minimal startup times and a tiny footprint.
Agno directly addresses the 'flash-speed' aspect, offering a framework to build agents that are exceptionally fast and efficient. While its 'see' capability is more about perceiving digital states for action rather than explicit visual recognition, its emphasis on rapid, production-grade agent deployment makes it a strong competitor for high-performance autonomous tasks.
Step 3.7 Flash operates on a freemium model, offering a free tier. For usage beyond the free tier, it is usage-based, with input tokens priced at $0.00020 per 1k tokens and output tokens at $0.00115 per 1k tokens.
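The per-token rates above translate into a simple cost estimate. The sketch below is a minimal example that uses only the two listed prices. It ignores the free tier, whose size is not stated here, and `estimate_cost_usd` is a hypothetical helper name, not part of any StepFun SDK.

```python
from decimal import Decimal

# Listed usage-based rates, in USD per 1,000 tokens.
INPUT_USD_PER_1K = Decimal("0.00020")
OUTPUT_USD_PER_1K = Decimal("0.00115")


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the usage-based cost in USD, ignoring any free-tier allowance."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total_per_1k = (Decimal(input_tokens) * INPUT_USD_PER_1K
                    + Decimal(output_tokens) * OUTPUT_USD_PER_1K)
    return total_per_1k / Decimal(1000)


if __name__ == "__main__":
    # One million input tokens plus one million output tokens.
    print(estimate_cost_usd(1_000_000, 1_000_000))
```

At these rates, one million input tokens cost $0.20 and one million output tokens cost $1.15, for a total of $1.35. `Decimal` is used so that the small per-token prices do not pick up binary floating-point rounding error.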
Key features of Step 3.7 Flash include its 198-billion-parameter sparse MoE architecture, native image and video understanding via a 1.8B-parameter vision encoder, a 256k context window, three selectable reasoning levels, and reliable interaction with external APIs and tools. It also supports NVIDIA inference stacks and offers an Advisor Mode for cost-efficient agentic operations.
Step 3.7 Flash is primarily intended for AI Developers, Enterprise Users, Engineers/Researchers, and Content Creators who require advanced multimodal AI agents for tasks such as building AI applications, automating complex workflows, agentic coding, and processing diverse data types.
Step 3.7 Flash distinguishes itself with native multimodal support (images and video), outperforming competitors like DeepSeek V4 Flash in this aspect. It demonstrates strong coding performance, scoring 56.3 on SWE-Bench PRO, and leads the ClawEval-1.1 benchmark for tool orchestration. Its Advisor Mode offers a cost-effective alternative to models like Claude Opus 4.6 for similar performance levels.