Multi-head Latent Attention (MLA) - IT文库


DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

... architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE ... Introduction; 2 Architecture; 2.1 Multi-Head Latent Attention: Boosting Inference Efficiency; 2.1.1 Preliminaries: Standard Multi-Head Attention; 2.1.2 Low-Rank Key-Value Joint Compression ... 16B Model Equipped with MLA and DeepSeekMoE; B.1 Model Description; B.2 Performance Evaluation; C Full Formulas of MLA; D Ablation of Attention Mechanisms; D.1

0 credits | 52 pages | 1.23 MB | 2 years ago
Machine Learning Course - Wenzhou University - 13 Deep Learning - Transformer

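1. Introduction to Transformer: the attention mechanism. Before introducing what the attention mechanism is, take a look at a picture. When you see the picture below, what do you notice first? When an overload of information comes into view, our brain focuses its attention on the main information; this is the brain's attention mechanism. ... 1. Introduction to Transformer: attention computation for each word. Each word's Q is scored against every K in the whole sequence, and the features are then distributed according to those scores. Q: query, what is being looked up. K: key, what waits to be matched. V: value, the actual feature information. ... Introduction to Transformer: advantages of attention. 1. Fewer parameters: compared with CNNs and RNNs, its complexity is lower and it has fewer parameters, so it demands less compute. 2. Fast: attention solves the problem that RNNs and their variants cannot be computed in parallel. Each step of the attention mechanism does not depend on the result of the previous step, so it can be processed in parallel like a CNN. 3. Good results: before the attention mechanism was introduced, there was a problem that everyone had always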

0 credits | 60 pages | 3.51 MB | 2 years ago
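The per-word computation described in the snippet above (each word's Q is scored against every K, then the V features are mixed according to those scores) can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the softmax normalization and the division by the square root of the key dimension are assumptions taken from the standard scaled dot-product form, and the function names `softmax` and `attention` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Assumed scaled dot-product attention over plain Python lists.

    queries: list of query vectors (one per word)
    keys:    list of key vectors (one per word in the sequence)
    values:  list of value vectors (same length as keys)
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this word's Q against every K in the sequence.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Distribute the V features according to the weights.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(q, k, v))
```

Because each query row is handled independently of the others, the loop over `queries` could run in parallel, which is the property the snippet contrasts with RNNs.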
2024 China Open Source Developer Report

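del-rel Among them, the Qwen series has won high praise from community developers for its flexible range of model sizes, strong multilingual support, and friendly model licensing. DeepSeek, by introducing Multi-head Latent Attention (MLA), achieved a revolutionary breakthrough in performance and cost, ushering in a new era of cost-effective AI. Zhipu's CogVideoX series of text-to-video models became one of the first open-source text-to-video models in the world, not only in terms of technology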

0 credits | 111 pages | 11.44 MB | 1 year ago
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

... upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold- ... Manifold-Constrained Hyper-Connections; 2.3 Hybrid Attention with CSA and HCA; 2.3.1 Compressed Sparse Attention; 2.3.2 Heavily Compressed Attention; 2.3.3 Other Details; 2.3.4 Efficiency Discussion ... Cost-Effective and Memory-Efficient Implementation of mHC; 3.5.3 Contextual Parallelism for Long-Context Attention; 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing; 3.6 Inference

0 credits | 58 pages | 4.27 MB | 3 months ago
2020 Meituan Tech Annual Collection: Algorithms

Transformer is the model Google proposed in the paper "Attention is all you need" $ ^{[1]} $ to solve Sequence to Sequence problems. It is essentially an Encoder-Decoder structure: the Encoder consists of 6 encoding blocks, and each block in the Encoder contains Multi-Head Attention and an FFN (Feed-Forward Network); likewise, the Decoder consists of 6 decoding blocks, each containing Multi-Head Attention, Encoder-Decoder Attention, and an FFN. The structure is shown in Figure 1, and a detailed introduction can be found in references $ [1,6] $. ... the Transformer encoding layer, so here is a brief introduction to it. It mainly consists of the following two parts: Multi-Head Attention. Multi-Head Attention is actually an ensemble of h Self-Attention modules, where h is the number of heads. Self-Attention is computed as follows: $$ Attention\left(\mathbf{K},\mathbf{Q},\mathbf{V}\r

0 credits | 317 pages | 16.57 MB | 2 years ago
《Efficient Deep Learning Book》[EDL] Chapter 4 - Efficient Architectures

In the first chapter, we briefly introduced architectures like depthwise separable convolution, attention mechanism and the hashing trick. In this chapter, we will deep dive into their architectures and ... architectures, and how they help us outperform baseline methods. Another example in this domain is the attention mechanism, which forms the backbone of the state-of-the-art NLP model architectures such as the ... showing great promise in computer vision applications as well! ... Learn Long-Term Dependencies Using Attention: Imagine yourself in your favorite buffet restaurant. A variety of food items from savory to sweet

0 credits | 53 pages | 3.92 MB | 2 years ago
vLLM v0.4.0.post1 Documentation

... serving. vLLM is fast with: - State-of-the-art serving throughput - Efficient management of attention key and value memory with PagedAttention - Continuous batching of incoming requests - Fast model ... tested version ... 1. Install flash attention for ROCm: Install ROCm's flash attention (v2.0.4) following the instructions from ROCmSoftwarePlatform/flash-attention. Note: - If you are using rocm5.7 ... ROCm flash attention directly. - If you fail to install ROCmSoftwarePlatform/flash-attention, try cloning from the commit 6fd2f8e572805681cd67ef8596c7e2ce521ed3c6. - ROCm's Flash-attention-2 (v2.0.4)

0 credits | 68 pages | 810.15 KB | 5 months ago
2022 Meituan Tech Annual Collection

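AutoHensGNN framework. Multi-class hierarchical graph model optimization: (1) Candidate graph model generation: real-world graphs are usually a combination of multiple attributes, and this attribute information is hard to capture fully with a single method. We therefore use several different types of models, based on the spectral domain, the spatial domain, the attention mechanism, and so on, to capture the various attribute relations. Model performance varies widely across datasets; to prevent poorly performing models from being added during the later model fusion, GCN, GAT, APPNP, TAGC, DNA, GraphSAGE, ... network focuses on fine-grained characterization of spatial information. Because different spherical distances and different block positions have a large influence, multiple kinds of information need deep modeling. For more details, see the team's CIKM paper: Trilateral Spatiotemporal Attention Network for User Behavior Modeling in Location-based Search $ ^{[23]} $. ... M. xgboost: Extreme Gradient Boosting[J]. 2016. [23] Qi, Yi, et al. "Trilateral Spatiotemporal Attention Network for User Behavior Modeling in Location-based Search", CIKM 2021. [24] Breakthroughs and Prospects of Advertising Deep Estimation Technology in Meituan's In-Store Scenario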

0 credits | 1356 pages | 45.90 MB | 2 years ago
vLLM v0.5.4 Documentation

... serving. vLLM is fast with: - State-of-the-art serving throughput - Efficient management of attention key and value memory with PagedAttention - Continuous batching of incoming requests - Fast model ... BUILD_FA: specifies whether to build CK flash-attention. The default is 1. For Radeon RX 7900 series (gfx1100), this should be set to 0 before flash-attention supports this target. - FA_GFX_ARCHS: specifies ... flash-attention, for example, gfx90a;gfx942 for MI200 and MI300. The default is gfx90a;gfx942. - FA_BRANCH: specifies the branch used to build the CK flash-attention in ROCm's flash-attention repo.

0 credits | 152 pages | 1.10 MB | 5 months ago
4_Yang Liu_Building a Highly Stable and Scalable Automated Testing Cluster with Python

Succeeded ✓ Succeeded ✓ Succeeded ✓ Succeeded ... HUAWEI HUAWEI MLA-AL10 ✓ Succeeded ✓ Succeeded ✓ Succeeded ✓ Succeeded ✓ Suc ... 30.99°C ... ✗ Vivo vivo Y66 522s | ✗ OPPO OPPO A59m 3.25s | ✗ HUAWEI HUAWEI MLA-AL10 6.21% | ✗ MI MI 6X 264.29MB | ✗ Samsung SM-N9500 0fps | ✗ MI Redmi Note 4X 2,605.5KB | ✗ OPPO OPPO R11s ...
|Vivo|vivo Y66|1.4|3,072|522|
|Vivo|vivo Y75s|1.8|4,096|503|
|MEIZU|m3 note|1.8|2,048|503|
|HUAWEI|HUAWEI MLA-AL10|2|4,096|497|
|Smartisan|OS103|1.6|6,144|496|

0 credits | 62 pages | 25.29 MB | 2 years ago

1000 results in total
