Chinese AI firm DeepSeek has released an experimental model, V3.2-exp, which it says can significantly reduce inference costs for long-context operations through a novel “sparse attention” mechanism.
The company announced the model Monday on Hugging Face, linking to a supporting research paper on GitHub. V3.2-exp introduces DeepSeek Sparse Attention, which uses a “lightning indexer” to prioritize key excerpts from a large context window and a “fine-grained token selection system” to further filter relevant tokens. Together, these modules enable long-context processing with lower server demands.
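To make the idea concrete, the sketch below shows what an index-then-select attention pass can look like in principle. It is a toy illustration rather than DeepSeek's implementation: the names (`indexed_sparse_attention`, `W_idx`, `k_top`) are invented for this example, and the single linear projection used as a scorer is merely a stand-in for the paper's lightning indexer.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def indexed_sparse_attention(q, K, V, W_idx, k_top=32):
    """Attention for one query over a long cached context.

    Step 1: a cheap indexer scores every cached token (a single
            projection here, standing in for the "lightning indexer").
    Step 2: only the k_top highest-scoring tokens survive (standing in
            for the "fine-grained token selection system").
    Step 3: ordinary scaled dot-product attention runs over just those
            tokens, so the expensive part scales with k_top rather than
            with the full context length.
    """
    idx_scores = K @ (W_idx @ q)                # (n,) cheap relevance scores
    keep = np.argsort(idx_scores)[-k_top:]      # indices of the top-k tokens
    K_sel, V_sel = K[keep], V[keep]
    d = q.shape[-1]
    weights = softmax(K_sel @ q / np.sqrt(d))   # attention over k_top tokens only
    return weights @ V_sel

# Toy usage: a 4,096-token context with 64-dimensional heads.
rng = np.random.default_rng(0)
n, d = 4096, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
W_idx = rng.standard_normal((d, d))             # hypothetical indexer weights
out = indexed_sparse_attention(q, K, V, W_idx)
print(out.shape)                                # (64,)
```

The design point is that the scoring step is far cheaper than full attention, so the quadratic attention cost applies only to the selected tokens instead of the entire context.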
Early testing suggests the model could cut the cost of API calls in half for extended-context use cases. Since V3.2-exp is open-weight and freely available, independent researchers are expected to validate those claims soon.
DeepSeek’s work comes amid industry-wide efforts to lower inference costs—the ongoing expense of running pre-trained AI models, which remains a major barrier to scaling AI services. The company says its approach shows there are still untapped efficiency gains to be found in the transformer architecture.
The Beijing-based startup drew global attention earlier this year with its R1 model, trained primarily with reinforcement learning at a fraction of U.S. competitors’ budgets. While R1 fell short of predictions that it would reshape training economics, DeepSeek’s latest release could influence how providers tackle the cost problem.
Although less headline-grabbing than R1, the sparse attention technique may give AI providers in the U.S. and elsewhere new ways to reduce infrastructure strain and make large-scale deployment more affordable.