DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
... Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models ... 2.1.3 Decoupled Rotary Position Embedding; 2.1.4 Comparison of Key-Value Cache; 2.2 DeepSeekMoE: Training Strong Models at Economical Costs; 2.2.1 Basic Architecture; 2.2.2 Device-Limited ... Contributions and Acknowledgments; B DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE; B.1 Model Description; B.2 Performance Evaluation; C Full Formulas of MLA
0 points | 52 pages | 1.23 MB | 2 years ago
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
Compared with the DeepSeek-V3 architecture (DeepSeek-AI, 2024), DeepSeek-V4 series retain the DeepSeekMoE framework (Dai et al., 2024) and Multi-Token Prediction (MTP) strategy, while introducing several ... CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention) for attention layers, DeepSeekMoE for feed-forward layers, and strengthen conventional residual connections with mHC. ... al., 2025) as the optimizer. For the Mixture-of-Experts (MoE) components, we still adopt the DeepSeekMoE (Dai et al., 2024) architecture, with only minor adjustments from DeepSeek-V3. The Multi-Token ...
0 points | 58 pages | 4.27 MB | 1 month ago
2 results in total













