DeepSeekMoE - IT文库 (IT Library)

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

... Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models ... 2.1.3 Decoupled Rotary Position Embedding ... 2.1.4 Comparison of Key-Value Cache ... 2.2 DeepSeekMoE: Training Strong Models at Economical Costs ... 2.2.1 Basic Architecture ... 2.2.2 Device-Limited ... Contributions and Acknowledgments ... B DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE ... B.1 Model Description ... B.2 Performance Evaluation ... C Full Formulas of MLA ...

0 credits | 52 pages | 1.23 MB | 2 years ago
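
The excerpt above says that MLA compresses the Key-Value (KV) cache into a latent vector. A minimal sketch of that general idea, assuming a plain low-rank projection, might look like the following. The dimensions, the weight names (`w_down`, `w_up_k`, `w_up_v`), and the helper functions are hypothetical and chosen for illustration; the sketch ignores multiple heads and the decoupled rotary position embedding, and it is not the paper's exact formulation.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of Gaussian samples (illustrative weights only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_latent = 16, 64, 8  # hypothetical sizes

hidden = random_matrix(seq_len, d_model, rng)   # one hidden state per token
w_down = random_matrix(d_model, d_latent, rng)  # shared down-projection (assumed)
w_up_k = random_matrix(d_latent, d_model, rng)  # up-projection for keys (assumed)
w_up_v = random_matrix(d_latent, d_model, rng)  # up-projection for values (assumed)

# Only the latent vectors are cached: one d_latent vector per token.
latent_cache = matmul(hidden, w_down)

# Keys and values are rebuilt from the cached latents when attention needs them.
keys = matmul(latent_cache, w_up_k)
values = matmul(latent_cache, w_up_v)

# Number of cached floats: separate K and V caches versus the latent cache.
full_cache_floats = 2 * seq_len * d_model
latent_cache_floats = seq_len * d_latent
compression_ratio = full_cache_floats // latent_cache_floats
```

With these assumed sizes, the cache holds `seq_len * d_latent` values instead of `2 * seq_len * d_model`; the actual saving depends on the latent width a given model chooses.
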
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Compared with the DeepSeek-V3 architecture (DeepSeek-AI, 2024), the DeepSeek-V4 series retains the DeepSeekMoE framework (Dai et al., 2024) and the Multi-Token Prediction (MTP) strategy, while introducing several ... CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention) for attention layers, DeepSeekMoE for feed-forward layers, and strengthen conventional residual connections with mHC. ... performance ... al., 2025) as the optimizer. For the Mixture-of-Experts (MoE) components, we still adopt the DeepSeekMoE (Dai et al., 2024) architecture, with only minor adjustments from DeepSeek-V3. The Multi-Token ...

0 credits | 58 pages | 4.27 MB | 3 months ago

2 results in total

Multi-head Latent Attention (MLA), DeepSeekMoE, Mixture-of-Experts (MoE), Transformer architecture, training efficiency, DeepSeek-V4, Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), hybrid attention