DeepSeek V4 Just Reset the Price Floor for Agentic AI
DeepSeek dropped the V4 series this week and the paper is one of the more interesting reads of the year so far. There are two models in the family. V4 Pro at 1.6T parameters with 49B activated. V4 Flash at 284B parameters with 13B activated. Both natively handle a one million token context window and both ship as open weights on Hugging Face.
The headline is not the parameter count. It is the efficiency. At a 1M token context, V4 Pro uses roughly 27% of the inference FLOPs and 10% of the KV cache of V3.2. V4 Flash pushes that further, landing at around 10% of the FLOPs and 7% of the KV cache. That is an order of magnitude swing, and it is the kind of swing that flips entire categories of product from impossible to obvious.
This post walks through the genuinely new ideas in the paper, then connects them to what they mean in practice for agent frameworks like Hermes and OpenClaw, and finally to what they mean for NVIDIA and the CUDA stack.
The architecture changes that actually matter
DeepSeek kept the bones from V3. DeepSeekMoE for feed forward layers. Multi Token Prediction. The Transformer skeleton. What changed sits in three big places, plus one smaller routing tweak.
Hybrid attention with CSA and HCA
The big one. Vanilla attention is quadratic in sequence length, which is the structural reason long context costs so much. V4 splits attention into two regimes that work side by side.
Compressed Sparse Attention (CSA) compresses the KV cache along the sequence dimension and then runs a sparse attention pass on top. It is the descendant of the DeepSeek Sparse Attention work from the V3.2 era, but the compression layer cuts the per token state dramatically before sparsity even kicks in.
Heavily Compressed Attention (HCA) goes the other way. It compresses the KV cache more aggressively but keeps attention dense. The team mixes these two regimes across layers so the model gets the best of both worlds. CSA layers do the long range scanning cheaply. HCA layers preserve dense attention behavior on a much smaller compressed state.
A small Sliding Window Attention path runs alongside both for the most recent tokens, since recency is where dense attention earns its keep. The KV cache then has a layered structure: a classical KV cache for CSA and HCA blocks, and a state cache for SWA plus the trailing uncompressed tokens that are not yet ready for compression.
The net effect is what the efficiency numbers say. You can keep a million tokens around without paying a million tokens of FLOPs per step.
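To make the two regimes concrete, here is a toy sketch of the idea, not the paper's implementation. It assumes mean pooling as a stand-in for the learned compressor, and the block sizes, `top_k`, and function names are all made up for illustration. The "CSA like" path compresses lightly and attends to only the best scoring entries; the "HCA like" path compresses heavily and attends to everything that is left.

```python
import math

def compress_kv(keys, values, block):
    """Mean-pool each full block of `block` tokens into one cache entry.

    Mean pooling is a stand-in for a learned compressor. Tokens left over
    after the last full block are dropped here; in the design described
    above they would stay uncompressed next to the sliding window.
    """
    ck, cv = [], []
    for start in range(0, len(keys) - block + 1, block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        ck.append([sum(col) / block for col in zip(*kb)])
        cv.append([sum(col) / block for col in zip(*vb)])
    return ck, cv

def attend(query, keys, values, top_k=None):
    """Softmax attention; with top_k set, only the best-scoring entries are used."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    chosen = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        chosen = chosen[:top_k]
    peak = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, chosen)) / total
           for d in range(dim)]
    return out, sorted(chosen)

# Toy sequence: 16 tokens with 2-dimensional keys and 1-dimensional values.
keys = [[float(i), 1.0] for i in range(16)]
values = [[float(i)] for i in range(16)]
query = [1.0, 0.0]

csa_k, csa_v = compress_kv(keys, values, block=2)          # light compression
csa_out, csa_used = attend(query, csa_k, csa_v, top_k=2)   # sparse pass

hca_k, hca_v = compress_kv(keys, values, block=8)          # heavy compression
hca_out, hca_used = attend(query, hca_k, hca_v)            # dense pass
```

In this toy run the sparse path scores 8 compressed entries but mixes only 2 of them, while the dense path mixes both of its 2 entries. Either way, the per step work scales with the compressed cache, not with the raw sequence length.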
Manifold Constrained Hyper Connections (mHC)
The paper calls this an upgrade to residual connections. The intuition is that vanilla residuals just add the previous activation back, which lets information flow but also lets representational drift compound across many layers. Hyper Connections from the Zhu et al. work generalize residuals to multiple parallel paths. mHC, from Xie et al., then constrains these paths to live on a manifold, which keeps the representational geometry well behaved as you stack hundreds of layers and trillions of training tokens.
In practice this means the network gets more capacity for routing information across depth without the training instability that usually shows up when you start messing with residual streams. The team also reports cost effective implementations using recomputation and fused kernels so the wall clock cost is small.
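This post does not spell out which manifold is used, so treat the following as one plausible reading, not the paper's code. The sketch assumes the matrix that mixes the parallel residual streams is constrained to be doubly stochastic (every row and every column sums to 1), enforced with Sinkhorn-Knopp normalization. The logits, stream count, and function names are hypothetical.

```python
import math

def sinkhorn(logits, iters=50):
    """Turn an n x n matrix of logits into an approximately doubly stochastic matrix."""
    m = [[math.exp(x) for x in row] for row in logits]   # make every entry positive
    for _ in range(iters):
        row_sums = [sum(row) for row in m]
        m = [[x / row_sums[i] for x in row] for i, row in enumerate(m)]    # rows sum to 1
        col_sums = [sum(col) for col in zip(*m)]
        m = [[x / col_sums[j] for j, x in enumerate(row)] for row in m]    # columns sum to 1
    return m

def mix_streams(streams, mixing):
    """new_stream[i] = sum_j mixing[i][j] * streams[j], applied per feature."""
    dim = len(streams[0])
    return [[sum(mixing[i][j] * streams[j][d] for j in range(len(streams)))
             for d in range(dim)]
            for i in range(len(mixing))]

logits = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]  # hypothetical learned parameters
mixing = sinkhorn(logits)
streams = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                # three parallel residual streams
mixed = mix_streams(streams, mixing)
```

The point of such a constraint is that mixing cannot blow up or shrink the signal: because the columns of `mixing` sum to 1, the total across streams is the same before and after mixing, which is the kind of well behaved geometry the paragraph above is describing.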
The Muon optimizer
V4 swaps Adam for Muon. Muon is a relatively new optimizer that operates on matrix valued parameters by orthogonalizing updates rather than treating each weight as a scalar. The selling point in the paper is faster convergence and better training stability. The team uses a hybrid ZeRO strategy to make Muon memory efficient at scale, since naive Muon needs additional state that does not fit cleanly into the standard distributed training playbook.
This is a small line in the abstract but a big deal in practice. Optimizer choice usually does not move the needle. When it does, you notice immediately in training curves and final benchmark numbers. The fact that the team committed to Muon for a flagship run is itself a signal that the technique has graduated from research curiosity to production tool.
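For readers who have not met Muon, the core step is "orthogonalize the update matrix". A common way to do that without an SVD is a Newton-Schulz iteration. The sketch below uses the textbook cubic variant on a tiny matrix; it is an illustration of the idea only. Production Muon implementations use tuned polynomial coefficients and momentum, and nothing here is taken from DeepSeek's training code.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(update, iters=15):
    """Approximate the nearest orthogonal matrix to a nonzero, full-rank `update`."""
    norm = math.sqrt(sum(x * x for row in update for x in row))
    x = [[v / norm for v in row] for row in update]   # singular values now <= 1
    for _ in range(iters):
        # Cubic Newton-Schulz step: X <- 1.5 X - 0.5 (X X^T) X
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x

grad = [[3.0, 1.0], [0.0, 2.0]]        # stand-in for a momentum-averaged gradient matrix
step = orthogonalize(grad)
gram = matmul(step, transpose(step))   # close to the identity matrix
```

Each iteration pushes every singular value of the matrix toward 1, so the resulting step keeps the directions of the gradient but gives them equal scale. That is what "operating on the matrix" means here, in contrast to Adam's per weight scaling.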
A new MoE affinity score
A subtle change worth mentioning. V4 changes the activation that computes routing affinity scores from sigmoid to sqrt(softplus). It also drops the constraint on the number of routing target nodes and replaces dense FFN layers in the early Transformer blocks with hash routed MoE layers. Small changes individually, but they compound. The first few layers being hash routed instead of dense is a nice efficiency trick that does not seem to have been widely tried in flagship models before this.
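A minimal sketch of what those two routing ideas look like in code. The sqrt(softplus) activation comes from the description above; everything else is an assumption for illustration: the top k selection, the normalization of the selected scores into gates, and the modulo "hash" are placeholders, not DeepSeek's actual router.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def affinity(logit):
    """Routing affinity score: sqrt(softplus(logit)), always positive."""
    return math.sqrt(softplus(logit))

def route_top_k(logits, k):
    """Pick the k experts with the highest affinity; gates are normalized scores."""
    scores = [affinity(z) for z in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def hash_route(token_id, num_experts):
    """Hash routing needs no learned router: the token id alone picks the expert.

    The modulo is a placeholder hash chosen for readability.
    """
    return token_id % num_experts

gates = route_top_k([2.0, -1.0, 0.5, 3.0], k=2)   # learned routing over 4 experts
early_layer_expert = hash_route(10, 4)            # hash routing for an early block
```

Unlike sigmoid, sqrt(softplus) is not capped at 1, so strongly preferred experts can keep separating from the rest as their logits grow.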
The infrastructure work is half the paper
If you only read the architecture section you miss the point. DeepSeek spends nearly as many pages on systems work as on the model itself, and that is where a lot of the practical efficiency lives.
MegaMoE: a single fused kernel for the whole MoE block
The team built a single fused kernel that handles dispatch, expert compute, and combine in one shot, fully overlapping compute, communication, and memory access. They validated it on both NVIDIA GPUs and Huawei Ascend NPUs. Reported speedups are 1.5x to 1.73x for general inference and up to 1.96x on latency sensitive workloads like RL rollouts and high speed agent serving. They open sourced the CUDA implementation as part of DeepGEMM under the name MegaMoE.
That last sentence matters. High speed agent serving is the workload most people running agent frameworks actually care about, and a near 2x improvement on that specific path is enormous.
TileLang for kernel development
Hand writing CUDA for every fused operator does not scale, and Triton has limits when you need bitwise reproducibility. The team adopted TileLang, a domain specific language that aligns its lowering rules with NVCC so the same source can compile to bit identical CUDA output when needed. They used it to replace what would otherwise have been hundreds of fine grained PyTorch ATen operators.
This is the kind of choice that pays dividends for years. A team that can prototype a new kernel in TileLang and validate it against a hand written CUDA reference in hours has a different iteration velocity than a team that has to write everything by hand.
Batch invariant deterministic kernels
Most attention kernels use split KV tricks to balance load across streaming multiprocessors. That is fast but not bitwise reproducible across batch shapes. V4 uses a dual kernel strategy where one kernel computes the full attention for a sequence inside a single SM, and a second kernel handles the partially filled tail with multiple SMs while being carefully designed to keep the same accumulation order. The team also replaced cuBLAS end to end with DeepGEMM since cuBLAS itself is not batch invariant.
The payoff is reproducibility across pre training, post training, and inference. If you have ever debugged a model that behaves differently in evaluation than in training, you know how valuable bitwise alignment between those pipelines actually is.
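The underlying problem is that floating point addition is not associative, so any kernel that changes how a reduction is split can change the bits of the result. A three number example is enough to show it. The function names are ours, and plain Python floats stand in for GPU accumulators.

```python
def fold(values):
    """Add left to right with a plain float accumulator."""
    total = 0.0
    for v in values:
        total += v
    return total

def split_reduce(values, split):
    """Reduce two pieces independently, then add the partial sums."""
    return fold(values[:split]) + fold(values[split:])

vals = [0.1, 0.2, 0.3]
one_pass = fold(vals)              # (0.1 + 0.2) + 0.3
two_way = split_reduce(vals, 1)    # 0.1 + (0.2 + 0.3)
print(one_pass == two_way)         # False: same numbers, different grouping
```

A split KV kernel is effectively choosing the `split` based on batch shape and load. Fixing the accumulation order, as the dual kernel design does, is what makes the same input produce the same bits regardless of what else is in the batch.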
FP4 quantization aware training
The MoE expert weights and the indexer QK path are trained in FP4, not just quantized down at the end. On current hardware FP4 times FP8 has the same peak FLOPs as FP8 times FP8, so the win today is mostly memory and bandwidth. The team notes that on future hardware the same code path could be 1x to 3x faster. It is a forward looking bet on where the silicon is going.
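Quantization aware training usually means the forward pass sees weights that have already been rounded to the low precision grid, so the model learns to live with the rounding. A minimal sketch of that rounding step follows. It assumes the common E2M1 FP4 value grid and a simple per block scale; the post does not say which FP4 format or scaling scheme DeepSeek uses, so both are assumptions.

```python
import math

# Magnitudes representable in E2M1 FP4 (assumed format).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_fp4(block):
    """Round a block of weights to the FP4 grid and map back to floats.

    One scale per block maps the largest magnitude onto 6.0, the top of the grid.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 6.0
    out = []
    for x in block:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(nearest * scale, x))
    return out

weights = [6.0, 3.1, -0.4, 0.0]
quantized = fake_quant_fp4(weights)
print(quantized)   # [6.0, 3.0, -0.5, 0.0]
```

In training, the rounded values would feed the forward pass while gradients update a higher precision copy of the weights. That last part is standard practice for this technique in general, not a detail confirmed here.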
Heterogeneous KV cache with on disk storage
This is the unsexy infrastructure trick that makes the agent economics work. Shared prefixes across requests (system prompts, tool definitions, conversation history) are written to disk in a format that lets the inference server pull them back in without re prefilling. CSA, HCA, and SWA each get their own storage strategy because their cache shapes are different.
If you run agents, the prefix is almost everything. A 12k token system prompt that stays the same across turns is the dominant input cost. On disk shared prefix reuse is what makes cache hit pricing work, and DeepSeek priced it accordingly.
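The mechanics of shared prefix reuse can be sketched in a few lines. This is a toy, not DeepSeek's storage format: the class name, the block granularity, the SHA-256 keying, and the JSON payload are all hypothetical, and a real server would store serialized cache tensors, not a small dictionary.

```python
import hashlib
import json
import os
import tempfile

class DiskPrefixCache:
    """Store per-prefix state on disk, keyed by a hash of the token prefix."""

    def __init__(self, root, block=4):
        self.root = root
        self.block = block   # prefixes are cached at multiples of this length
        os.makedirs(root, exist_ok=True)

    def _path(self, tokens):
        digest = hashlib.sha256(json.dumps(tokens).encode("utf-8")).hexdigest()
        return os.path.join(self.root, digest + ".json")

    def put(self, tokens, state):
        usable = len(tokens) - len(tokens) % self.block
        with open(self._path(tokens[:usable]), "w", encoding="utf-8") as f:
            json.dump(state, f)
        return usable

    def longest_prefix(self, tokens):
        """Return (length, state) for the longest cached prefix, or (0, None)."""
        length = len(tokens) - len(tokens) % self.block
        while length > 0:
            path = self._path(tokens[:length])
            if os.path.exists(path):
                with open(path, encoding="utf-8") as f:
                    return length, json.load(f)
            length -= self.block
        return 0, None

cache = DiskPrefixCache(tempfile.mkdtemp())
system_prompt = list(range(100, 108))    # 8 hypothetical token ids
cache.put(system_prompt, {"note": "placeholder for serialized cache state"})
hit_len, state = cache.longest_prefix(system_prompt + [1, 2, 3])
```

Here a new request that starts with the same 8 token system prompt gets `hit_len == 8`, so only the 3 new tokens would need prefill. Those reused tokens are the ones billed at the cache hit rate.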
The post training pipeline
V4 uses a two stage post training paradigm. First, train independent expert models for each target domain (math, coding, agent, instruction following) using SFT followed by GRPO reinforcement learning. Then consolidate them into a single unified model via on policy distillation, where the unified model acts as the student and learns by minimizing a reverse KL loss against the domain experts.
This is one of the cleaner ways anyone has put on policy distillation into a flagship release. It side steps the usual issue where multi domain RL ends up with experts that fight each other inside one set of weights.
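For anyone rusty on the direction of the KL term: reverse KL is KL(student || teacher), weighted by the student's own probabilities, which is why it pairs naturally with on policy sampling from the student. A small sketch on a single next token distribution, with made up numbers:

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def reverse_kl(student, teacher):
    """Weighted by the student's probabilities: KL(student || teacher)."""
    return kl(student, teacher)

def forward_kl(student, teacher):
    """Weighted by the teacher's probabilities: KL(teacher || student)."""
    return kl(teacher, student)

teacher = [0.9, 0.1]   # hypothetical domain expert distribution
student = [0.5, 0.5]   # hypothetical unified model distribution
print(round(reverse_kl(student, teacher), 4))   # 0.5108
print(round(forward_kl(student, teacher), 4))   # 0.3681
```

Reverse KL penalizes the student most where it puts probability on tokens the teacher considers unlikely, so the student is pushed to drop its own bad habits instead of spreading mass over everything the teacher might say.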
What this means for agentic workloads
Now the practical part. DeepSeek published the V4 pricing alongside the paper, with V4 Pro under a temporary 75% discount until May 5. It is worth quoting in full because the gap to Western flagships is not small.
DeepSeek V4 Flash: $0.14 input on cache miss, $0.0028 input on cache hit, $0.28 output per million tokens.
DeepSeek V4 Pro: $0.435 input on cache miss, $0.003625 on cache hit, $0.87 output per million tokens at promo rates. Regular rates land around 4x higher.
For comparison:
- Claude Opus 4.7: $5.00 input, $25.00 output, roughly $0.50 cached input.
- GPT 5.5: $5.00 input, $30.00 output, roughly $0.50 cached input.
- GPT 5.4 mid tier: about $2.50 input, $15.00 output.
- Claude Sonnet 4.5/4.6: about $3.00 input, $15.00 output.
On raw output tokens alone V4 Flash is 89x to 107x cheaper than Opus 4.7 or GPT 5.5, and still about 54x cheaper than GPT 5.4 or Sonnet. Once you factor in cache hit rates of 70 to 90%, which is normal for any agent that runs with a fixed system prompt, persistent tool definitions, and an evolving conversation history, the effective cost gap widens further. Flash’s cache hit input is essentially free.
For agent frameworks like Hermes and OpenClaw, the difference is transformative. These frameworks generate far more output tokens per step than chat does, often 2x to 5x the input per step, because of internal reasoning traces and function calls. A persistent agent session typically consumes 8k to 12k input tokens (mostly cached after warmup) and 3k to 6k output tokens per turn across 20 to 100 turns.
On V4 Flash, a full day of heavy agent use at 50 turns costs pennies to a few dollars. A persistent Hermes style agent running 24/7 with healthy caching can stay in the $2 to $5 per month range at production scale. The same workload on Opus 4.7 or GPT 5.5 easily runs $150 to $400 per month per agent instance, before you account for Opus’s new tokenizer inflating token counts 15 to 35% on English and agent traces. Even GPT 5.4 and Sonnet land in the $80 to $200 per month range for equivalent usage. Flash turns running fleets of agents from a budget conversation into something you can host on a single cheap VPS.
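The monthly figures are easy to sanity check from the quoted prices. The sketch below uses the per million token prices listed above and picks one point inside the stated ranges as an assumption: 10k input tokens and 4.5k output tokens per turn, an 80% cache hit rate, 50 turns a day, 30 days a month. The model keys and function names are just labels.

```python
# USD per million tokens, as quoted above.
PRICES = {
    "v4-flash": {"miss": 0.14, "hit": 0.0028, "out": 0.28},
    "opus-4.7": {"miss": 5.00, "hit": 0.50, "out": 25.00},
}

def turn_cost(model, input_tokens, output_tokens, hit_rate):
    """Cost in USD of one agent turn, splitting input into cached and fresh tokens."""
    p = PRICES[model]
    cached = input_tokens * hit_rate
    fresh = input_tokens - cached
    return (fresh * p["miss"] + cached * p["hit"] + output_tokens * p["out"]) / 1e6

def monthly_cost(model, turns_per_day=50, days=30):
    # Assumed workload: 10k input, 4.5k output, 80% of input served from cache.
    return turn_cost(model, 10_000, 4_500, 0.8) * turns_per_day * days

flash = monthly_cost("v4-flash")
opus = monthly_cost("opus-4.7")
print(f"V4 Flash: ${flash:.2f}/month, Opus 4.7: ${opus:.2f}/month")
```

With these assumptions Flash lands a little above $2 a month and Opus 4.7 a little under $190, both inside the ranges quoted above. On Flash almost all of the bill is output tokens, because cached input is close to free.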
V4 Pro sits in a sweet middle ground for the harder agent tasks. In thinking mode it benchmarks ahead of Sonnet 4.5 and approaches Opus 4.6/4.7 on many agent and coding suites, while still costing roughly 29x to 34x less on output than Opus 4.7 or GPT 5.5 at promo rates, and about 7x to 9x less once the May 5 promo ends and rates land around 4x higher. For OpenClaw or Hermes setups that need deeper multi hop reasoning, tool orchestration, or complex state management, Pro is often the practical pick. Flash wins when you want maximum throughput, lowest cost, or many lightweight agents running in parallel. Both models support the full 1M context that agent frameworks love for loading entire codebases or long histories.
The net result is that DeepSeek V4 makes sophisticated agentic AI economically viable at scales that simply were not viable a quarter ago. Startups and researchers can run fleets of specialized agents (research, coding, data analysis, customer support) without burning through thousands of dollars a month. The combination of near free cached input, sub thirty cent output, and competitive agent benchmark performance flips the old tradeoff. You no longer have to choose between intelligent and affordable. Western frontier models still earn their keep on the absolute hardest closed ended reasoning, or where maximum reliability on long tail edge cases is non negotiable. But for most production and experimental agent deployments in 2026, V4 Flash and Pro have reset the price floor by an order of magnitude.
The threat to NVIDIA and to American frontier dominance
There is a second thread running through the V4 paper that is easy to miss. Day zero support for the Huawei Ascend 950 series. Native optimization for CloudMatrix supernodes. Partial training on domestic silicon. The MegaMoE kernel was validated on both NVIDIA GPUs and Ascend NPUs. The hardware design proposals at the end of the systems section read as if they are written for chip designers in Shenzhen as much as for Santa Clara.
In a scenario where open weight models broadly shift to Huawei Ascend chips for both training and inference, the AI industry undergoes a profound rebalancing of costs, accessibility, and supply chains. Ascend already delivers roughly 60% of H100 class inference performance per chip in tuned workloads. Cluster scale systems like CloudMatrix close the gap further through volume and efficiency. Open models become far cheaper to host and fine tune at global scale, which accelerates deployment in emerging markets, research labs, and cost sensitive applications. Training runs that once required scarce sanctioned NVIDIA GPUs can leverage abundant domestic Ascend production. Huawei is targeting hundreds of thousands of 950PR chips in 2026 alone, which removes the export control bottleneck for any lab that wants to iterate.
NVIDIA would face its most serious long term challenge yet to its hardware and software dominance. China was previously 17 to 22% of NVIDIA revenue, and that share has already shrunk under export controls. Full open model migration to Ascend would largely eliminate the remainder as ByteDance, Alibaba, and others scale domestic clusters. More importantly, the strategic moat erodes. Every frontier open weight model that natively tunes for CANN and MindSpore (Huawei’s CUDA and PyTorch analogs) no longer depends on the NVIDIA ecosystem for peak performance. Per chip training throughput and memory bandwidth still trail Blackwell class chips, but the combination of lower unit cost, domestic supply security, and improving software maturity lets Chinese labs train and serve competitive models at unprecedented scale and speed. The valuation premium built on perpetual lock in comes under real pressure.
The CUDA moat itself, long considered nearly unassailable, would face meaningful fragmentation. CUDA’s dominance comes from 15+ years of optimized libraries, deep developer familiarity, and tight PyTorch integration. Switching costs are real. But as DeepSeek V4 and GLM 5 demonstrate (the latter trained entirely on Ascend 910B via MindSpore with zero NVIDIA GPUs in the loop), native kernel optimization and translation layers are rapidly lowering those costs. CANN is maturing from early instability complaints toward practical parity for Transformer era workloads, with claims of high CUDA compatibility on optimized paths. If the majority of open weight releases follow this path, developers worldwide face a bifurcated reality. CUDA stays the default for proprietary Western frontier work and the high end Western clouds, but a parallel high performance stack thrives for open models, Global South deployments, and anyone prioritizing cost or sovereignty. This does not kill CUDA overnight. The installed base and tooling depth are enormous. But it transforms AI compute from a near monopoly into a competitive multipolar market and spurs innovation in compilers, kernels, and cross platform frameworks.
Geopolitically the shift accelerates China’s trajectory toward AI self sufficiency and could tilt the balance of applied AI capability. Freed from reliance on the latest sanctioned NVIDIA GPUs, Chinese labs can iterate faster, at larger scale, and at lower marginal cost, helped by more chips available domestically through SMIC, shorter supply chains, and policy driven procurement. Open weight models serve as a global multiplier. Anyone can download the weights and run them on cheap Ascend clusters, which lets China set de facto standards for efficient inference, long context handling, and domain specific fine tunes. The US and allies likely retain leadership in closed source frontier models, advanced data curation, and talent pipelines, but China could dominate the picks and shovels of widespread deployment and the economic value created by ubiquitous affordable AI. The result is a more multipolar AI future. Less US centric chokepoint risk, more resilience through competition, but also more fragmentation and a real need for new governance around dual use hardware and software stacks.
The caveats temper the speed of this transition. Ascend chips still lag in peak training throughput, interconnect fabric maturity, and software polish compared with NVIDIA’s best in class offerings. Early clusters required heroic engineering effort. Production volumes are ramping aggressively but face constraints in advanced packaging and high bandwidth memory. Western organizations may continue preferring NVIDIA for trust, security certifications, and ecosystem support, which limits global adoption. Even so, the directional signal from 2026 is unmistakable. DeepSeek V4’s deliberate pivot, surging domestic market share for Chinese accelerators (41% in 2025), and explicit policy pressure for self reliance all point the same way. If open weight models broadly embrace Ascend class hardware, the industry moves from NVIDIA centric scarcity to distributed abundance, NVIDIA’s software moat springs leaks, and AI capability becomes less the property of any single nation or company and more a contested, rapidly evolving global commons. The long term winners will be those who best exploit this new economics of scale and openness rather than those defending yesterday’s hardware lock in.
What we are doing with this
We are already running V4 Flash behind a few of our internal agents and the cost telemetry confirms the pricing math. Caching is the difference between an interesting research toy and something you can leave on. We will also be testing V4 Pro for the deeper reasoning paths in our memory products, where the 1M context pairs nicely with the way OAMP knowledge stores are exported and replayed.
A few things in the V4 paper that we think will quietly become standard practice across the field:
- Hybrid attention regimes where different layers use different cache shapes. The era of one attention pattern fits all is ending.
- Manifold constrained residual connections. Once one frontier model proves the training stability story, the rest will follow.
- Bitwise reproducible kernels across training and inference. This is one of those quality of life upgrades that nobody asks for until they have it, and then they cannot live without it.
- On disk shared prefix caches. If you sell agent infrastructure and you do not have this, you will.
If you are building agents in 2026 and you have not yet tried running them on V4 Flash, the next afternoon you spend doing it will be a good one.
The paper is at huggingface.co/collections/deepseek-ai/deepseek-v4 along with the model checkpoints. The MegaMoE kernel ships as part of DeepGEMM.