DeepSeek V4 Flash: the fast model of the V4 series
Overview
DeepSeek V4 Flash: 284B parameters, 13B active, 1M token context
Flash is the compact and fast variant of the DeepSeek V4 series. With 284B total parameters and 13B activated per token, it retains the same 1M token context window as the Pro model while offering lighter inference. Available on OpenRouter at $0.14/M input tokens and $0.28/M output tokens.
Flash Architecture
284B total parameters, 13B activated per token
Flash uses DeepSeek's MoE (Mixture of Experts) architecture with hybrid attention and diversity-constrained hyper-connections. Only 13B parameters are activated per token, significantly reducing inference cost compared to Pro (49B active).
Ideal for daily tasks, quick summaries, and high-volume workflows.
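To make the "only 13B of 284B" point concrete, here is a toy sketch of top-k expert routing, the core MoE mechanism. The expert count, hidden size, and k below are illustrative placeholders, not Flash's actual configuration, which is not spelled out here.

```python
import numpy as np

def moe_layer(x, experts, router_w, k=2):
    """Toy MoE layer: route one token to its top-k experts.

    x: (d,) token hidden state; experts: list of (d, d) weight
    matrices; router_w: (n_experts, d) router weights. All sizes
    are made up for illustration.
    """
    logits = router_w @ x                      # score every expert
    top = np.argsort(logits)[-k:]              # keep the k best
    gates = np.exp(logits[top])
    gates /= gates.sum()                       # softmax over the k winners
    # Only k experts run; the rest of the parameters stay idle,
    # which is why active params << total params.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

d, n_experts = 64, 8
rng = np.random.default_rng(0)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
y = moe_layer(rng.normal(size=d), experts, router_w, k=2)
print(y.shape)  # (64,) -- but only 2 of the 8 experts did any work
```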
1M Token Context
Same context window as Pro: 1 million tokens
Despite its smaller size, Flash retains the 1M token context window. DeepSeek reports that V4-Flash uses only 10% of DeepSeek-V3.2's KV cache in the 1M token scenario, thanks to hybrid attention and architecture optimizations.
Test Flash on your long documents first; switch to Pro only if the task proves more complex.
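A back-of-the-envelope calculation shows why a 90% KV-cache reduction matters at 1M tokens. The layer count, KV-head count, and head dimension below are placeholder values for illustration, not Flash's published dimensions.

```python
# Rough KV-cache sizing. All architecture numbers below are
# placeholders, not Flash's real dimensions.
def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
    return seq_len * per_token / 2**30

full = kv_cache_gib(1_000_000, n_layers=60, n_kv_heads=8, head_dim=128)
print(f"dense-attention cache at 1M tokens: ~{full:.0f} GiB")
print(f"at the reported 10%:                ~{full * 0.10:.0f} GiB")
```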
Model choice
Flash vs Pro
Flash: 284B total / 13B active. Pro: 1.6T total / 49B active. Same 1M token context.
Flash is the default free entry point. Pro is reserved for unlimited subscriptions and tasks requiring deeper reasoning.
Technical
MoE Architecture
Mixture of Experts with hybrid attention, diversity-constrained hyper-connections, and Muon optimizer.
The MoE architecture activates only a fraction of parameters per token, enabling a large model while keeping inference cost reasonable.
Usage
Reasoning modes
Non-think, Think High, and Think Max to adjust analysis depth.
Non-think prioritizes speed. Think High improves accuracy. Think Max pushes reasoning to the maximum and is recommended with at least 384K tokens of context.
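A hypothetical sketch of selecting a mode over the OpenAI-compatible API. The `reasoning` field and the endpoint URL are assumptions, not documented names; check the official API reference for the actual parameter.

```python
from openai import OpenAI

# The field that selects Non-think / Think High / Think Max is not
# documented on this page, so "reasoning" below is an assumed name.
client = OpenAI(base_url="https://api.deepseek.com/v1", api_key="...")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"reasoning": "think-high"},  # assumed parameter name
)
print(resp.choices[0].message.content)
```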
Evaluation
Benchmarks
MMLU-Pro, HumanEval, GSM8K, LongBench-V2, LiveCodeBench, SWE Verified, Toolathlon, MCPAtlas.
Official tables cover general knowledge, reasoning, code, math, long context, and agentic tasks.
Integration
OpenAI-compatible API
API identifier: deepseek-v4-flash. OpenAI- and Anthropic-compatible format.
Use deepseek-v4-flash in your existing API integrations. Recommended temperature: 1.0, top_p: 1.0.
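A minimal sketch of a chat call through the OpenAI Python SDK with the recommended sampling settings. The base_url assumes an OpenRouter-style endpoint, and the provider may prefix the model id; adjust both for your setup.

```python
from openai import OpenAI

# OpenAI-compatible call; base_url and api_key are placeholders.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",   # provider-side id may carry a vendor prefix
    messages=[{"role": "user", "content": "Summarize this changelog: ..."}],
    temperature=1.0,             # recommended settings from the model card
    top_p=1.0,
)
print(resp.choices[0].message.content)
```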
Deployment
Open weights
Weights available on Hugging Face for local or cloud deployment.
Flash can be run locally. The model card includes encoding instructions, sampling settings, and compatibility notes. FP8 is supported.
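A sketch of local serving with vLLM, one plausible reading of "standard inference frameworks". The Hugging Face repo id below is a guess at the naming pattern; take the exact id and settings from the official model card.

```python
from vllm import LLM, SamplingParams

# Local inference sketch. The repo id is an assumption, not confirmed.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # assumed repo id
    quantization="fp8",                     # FP8 to cut the memory footprint
    trust_remote_code=True,
)
params = SamplingParams(temperature=1.0, top_p=1.0)  # recommended defaults
out = llm.generate(["Explain MoE routing in two sentences."], params)
print(out[0].outputs[0].text)
```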
Why Flash
Flash is designed for speed without sacrificing long context
With 13B active parameters and 1M token context, Flash offers a good balance between cost, speed, and capability for everyday tasks.
Lightweight inference
Only 13B parameters activated per token. DeepSeek reports Flash uses 27% of DeepSeek-V3.2's single-token inference FLOPs.
Optimized KV cache
10% of DeepSeek-V3.2's KV cache in the 1M token scenario, thanks to hybrid attention.
Adjustable reasoning
Non-think for maximum speed, Think High for more accuracy, Think Max for difficult tasks.
Code and agents
Evaluated on LiveCodeBench, SWE Verified, Toolathlon, and MCPAtlas for developer and agentic workflows.
Resources
Official DeepSeek V4 Flash links
Access weights, source code, and official documentation to deploy or evaluate Flash.
Weights and model card
- Official model card with benchmarks and deployment instructions.
- Weights available for local and cloud inference.
- FP8 instructions, encoding, and recommended sampling parameters.
Source code
- GitHub repository with integration examples and scripts.
- Compatible with standard inference frameworks.
- Documented prompt examples and use cases.
Recommended usage
- Temperature 1.0, top_p 1.0 for local deployment.
- Minimum 384K tokens of context for Think Max.
- Test your own documents before choosing between Flash and Pro.
Official data
DeepSeek V4 Flash benchmarks: what the numbers say
The official model card publishes results on knowledge, reasoning, code, math, long context, and agentic tasks. Here are the key points.
Compare Flash and Pro on the benchmarks that match your real use cases, not just general rankings.


- Flash: 284B total parameters, 13B active. Pro: 1.6T total parameters, 49B active. Same 1M token context.
- Benchmarks covered: MMLU-Pro, HumanEval, GSM8K, LongBench-V2, LiveCodeBench, SWE Verified, Toolathlon, MCPAtlas.
- Flash uses 27% of DeepSeek-V3.2's single-token inference FLOPs and 10% of its KV cache in the 1M token scenario.
- Reasoning modes: Non-think (speed), Think High (accuracy), Think Max (maximum reasoning, min. 384K tokens of context).
Speed and cost
Flash for fast tasks and high-volume workflows
With 13B active parameters and a cost of $0.14/M input tokens, Flash is the natural choice for daily use, summaries, and high-throughput API integrations; a concurrency sketch follows the list below.
- Document summaries, emails, everyday writing.
- API integrations with high request volume.
- Quick comparison of multiple responses before switching to Pro.
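For high request volumes, here is a minimal async sketch that caps in-flight requests with a semaphore. The endpoint, key, and cap value are placeholders to tune against your provider's rate limits.

```python
import asyncio
from openai import AsyncOpenAI

# High-throughput pattern: many concurrent Flash calls, bounded.
client = AsyncOpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
limit = asyncio.Semaphore(16)  # placeholder cap

async def summarize(doc: str) -> str:
    async with limit:
        resp = await client.chat.completions.create(
            model="deepseek-v4-flash",
            messages=[{"role": "user", "content": f"Summarize:\n\n{doc}"}],
            temperature=1.0,
            top_p=1.0,
        )
        return resp.choices[0].message.content

async def main(docs: list[str]) -> list[str]:
    return await asyncio.gather(*(summarize(d) for d in docs))

summaries = asyncio.run(main(["doc one ...", "doc two ..."]))
```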

Long context
1M token context even on the Flash model
Flash retains the same context window as Pro. Test it on your long documents, codebases, or multi-step analyses before deciding if Pro is necessary; see the sketch after this list.
- Contracts, manuals, long technical documentation.
- Large codebases for review or refactoring.
- Multi-layer analyses in a single context.
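A sketch of the "test it on your long documents" advice: estimate whether the file fits in the 1M window, then send it in a single request instead of chunking. The 4-chars-per-token ratio is a rough heuristic, and the file path and endpoint are placeholders.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

text = Path("contract.txt").read_text()   # placeholder document
approx_tokens = len(text) // 4            # crude estimate for English prose
assert approx_tokens < 1_000_000, "document too large for one request"

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user",
               "content": f"List every termination clause:\n\n{text}"}],
)
print(resp.choices[0].message.content)
```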

Local deployment
Deploy Flash locally or via API
Flash's open weights are available on Hugging Face. The model card includes encoding instructions, recommended sampling parameters, and compatibility notes.
- Weights available on Hugging Face for local deployment.
- FP8 supported to reduce memory footprint.
- Compatible with standard inference frameworks.

FAQ
DeepSeek V4 Flash: basics and architecture
Answers to the most common questions about the Flash model.
What is DeepSeek V4 Flash?
Flash is the compact variant of the DeepSeek V4 series: 284B total parameters, 13B activated per token, 1M token context. It's the default free entry point.
How does Flash differ from Pro?
Pro: 1.6T total parameters, 49B active. Flash: 284B total, 13B active. Both have a 1M token context. Pro is more powerful; Flash is faster and cheaper.
What architecture does Flash use?
MoE (Mixture of Experts) with hybrid attention, diversity-constrained hyper-connections, and the Muon optimizer. Same architectural family as Pro.
Are the weights open?
Yes, weights are available on Hugging Face. Check the license on the official model card.
FAQ
Performance and reasoning modes
What the benchmarks and reasoning modes mean in practice.
Which benchmarks does the model card report?
MMLU-Pro, HumanEval, GSM8K, LongBench-V2, LiveCodeBench, SWE Verified, Toolathlon, MCPAtlas. They cover knowledge, code, math, long context, and agents.
What do the reasoning modes do?
Non-think: fast response without extended reasoning. Think High: more accuracy. Think Max: maximum reasoning, requires at least 384K tokens of context.
Is Flash efficient at long context?
Yes. DeepSeek reports Flash uses 10% of V3.2's KV cache in the 1M token scenario thanks to hybrid attention.
Is Flash as capable as Pro?
No. Flash is more compact. For complex tasks, Pro remains more suitable. Test your own workflows to decide.
FAQ
Deployment, API, and resources
How to use Flash in production or locally.
How do I call Flash over the API?
API identifier: deepseek-v4-flash, in an OpenAI- and Anthropic-compatible format. Available on OpenRouter at $0.14/M input tokens.
Which sampling settings are recommended?
Temperature 1.0 and top_p 1.0 for local deployment, per the official model card.
Can I run Flash locally?
Yes. Weights are available on Hugging Face, and FP8 is supported to reduce the memory footprint.
Where is the source code?
The GitHub repository contains integration scripts, prompt examples, and technical documentation.
Resources
Everything you need to know about DeepSeek V4 Flash
Architecture, benchmarks, reasoning modes, API, local deployment, and comparison with Pro.
Get started
Test DeepSeek V4 Flash on a real task
Start with a summary, code review, or long document. Compare Flash and Pro on the same workflow to choose the right model.