FP16 Mixed-Precision Training - IT文库 (IT Library): free downloads of programming and IT e-books and documents for developers


Ubuntu Desktop Training 2009

Ubuntu Desktop Training. Written by and attributed to Canonical Ltd. and the Ubuntu Training community 2008-2009. This license is bound by the Creative Commons: CC by NC SA ... 3. Ubuntu Session Plan; 4. Instructor Responsibilities; 4.1. Pre-Training Preparation/Checks; 4.2. Instructional Methods; 4.3. Instructional Tips/Guidelines ... Target Audience and Pre-requisites: This course provides both home and office users with hands-on training on Ubuntu. No prior knowledge of Ubuntu is required, although computer literacy is assumed and is ...

0 points | 428 pages | 57.45 MB | 2 years ago
PAI & TVM Meetup - Shanghai 20191116

AutoCodeGen and Mixed-Precision Training/Inference. PAI (Platform of AI), Alibaba Cloud Intelligence ## Outline • TensorCore AutoCodeGen in TVM • FP16 Mixed-Precision Training on PAI • INT8 Inference ... TensorCore • A revolutionary technology that delivers groundbreaking AI performance. • Performs mixed-precision matrix multiply and accumulate in a single operation. [Figure: 4×4 operand matrices A and B for the matrix multiply-accumulate, with precision labels "FP16 or FP32" and "FP16"; truncated in the source] ...

0 points | 26 pages | 5.82 MB | 1 year ago
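
As a rough illustration of why a mixed-precision multiply-accumulate keeps FP16 inputs but often accumulates in FP32, the sketch below emulates a dot product in both settings. It uses only Python's standard `struct` module to round values. The helper names (`to_fp16`, `to_fp32`, `dot`) are hypothetical and do not come from the slides, and this emulation says nothing about how TVM or PAI generate TensorCore code.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, round_acc):
    """Dot product with FP16 inputs; round_acc sets the accumulator precision."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two FP16 values is exact in a Python float (double).
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return acc

ones = [1.0] * 4096
fp16_acc = dot(ones, ones, to_fp16)  # accumulator rounded to FP16 after each add
fp32_acc = dot(ones, ones, to_fp32)  # accumulator rounded to FP32 after each add
print(fp16_acc, fp32_acc)  # 2048.0 4096.0
```

The FP16 accumulator stalls at 2048 because 2049 is not representable in half precision and rounds back to 2048, while the FP32 accumulator reaches the exact sum.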
PyTorch Release Notes

... It includes support for 8-bit floating point (FP8) precision on Hopper GPUs, which provides better training and inference performance with lower memory utilization. Transformer Engine also includes a collection ... users to try mixed precision training by adding only three lines of Python to an existing FP32 (default) script. AMP will select an optimal set of operations to cast to FP16. FP16 operations require 2X reduced ... (reducing the overall memory consumption of your model). Additionally, GEMMs and convolutions with FP16 inputs can run on Tensor Cores, which provide an 8X increase in computational throughput over FP32 ...

0 points | 365 pages | 2.94 MB | 2 years ago
Dive into Deep Learning (动手学深度学习) v2.0

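Typically, the loss function is defined in terms of the model parameters and depends on the dataset. On a given dataset, we learn the best values of the model parameters by minimizing the total loss. This dataset consists of samples collected for training and is called the training dataset (or training set). However, a model that performs well on the training data does not necessarily perform equally well on "new data"; here the "new data" is usually called the test dataset (or test ... estimate house prices (in US dollars). To develop a model that predicts house prices, we need to collect a real dataset. This dataset includes the sale price, area, and age of each house. In machine learning terminology, this dataset is called the training dataset or training set. Each row of data (for example, the data for one house sale) is called a sample, also known as a data point or data instance. We call what we are trying to ... training epochs, the model can eventually reach perfect accuracy on the training set, while accuracy on the test set declines. #### 4.4.1 Training Error and Generalization Error. To discuss this phenomenon further, we need to understand training error and generalization error. Training error is the error of the model computed on the training dataset. Generalization error is the expectation of the model's error when it is applied to an infinite stream of additional samples drawn from the same distribution as the original samples. The problem ...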

0 points | 797 pages | 29.45 MB | 2 years ago
Qwen (千问) Large AI Model: Chinese Documentation

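For quantized models, we recommend AWQ together with AutoAWQ. AWQ (Activation-aware Weight Quantization) is a hardware-friendly method for low-bit weight quantization of LLMs. AutoAWQ is an easy-to-use toolkit dedicated to 4-bit quantized models. Compared with FP16, AutoAWQ can speed up a model by 3x and reduce its memory requirements to one third. AutoAWQ implements the AWQ algorithm for quantizing LLMs. In this document, we show you how to ... Before quantizing, make sure you have followed the instructions to get started with llama.cpp. The guide below does not cover installation or build steps. Now suppose you want to quantize the Qwen1.5-7B-Chat model. First create a GGUF file for the fp16 model as follows: python convert-hf-to-gguf.py Qwen/Qwen1.5-7B-Chat --outfile models/7B/qwen1_5-7b-chat-fp16 ... or the name of the HF model, and the second argument is the path of the GGUF file you want to generate (here I put it under the models/7B directory). Remember to create this directory before running the command. This gives you a GGUF file for your fp16 model, which you then quantize to a lower bit width according to your needs. Here is a concrete example that quantizes the model to 4 bits: ./quantize models/7B/qwen1_5-7b-chat-fp16 ...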

0 points | 56 pages | 835.78 KB | 2 years ago
2 Training and Deploying Low-Precision Models with Python (张校捷)

Training and Deploying Low-Precision Models with Python (TensorFlow edition), 张校捷, 2019/9/21 ## Contents >> Concept and significance of low precision >> FP16 models in TensorFlow >> FP16/Int8 models in TensorRT >> Summary ... E8M23 (tf.float32); BF16: E8M7 (TPU, tf.bfloat16); FP16: E5M10 (GPU, tf.float16); Int8 ## Advantages of low-precision floating point 1. Lower memory/GPU memory use (FP16 takes 1/2 of the original, int8 takes 1/4) 2. Dedicated hardware accelerates low-precision floating-point computation (TensorCore) FP16 storage/input ... SSD-RN50-FPN-640 ## Representable range of FP16 floating point (E5M10) [Figure omitted]

0 points | 24 pages | 981.45 KB | 2 years ago
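
The E5M10 layout mentioned above (1 sign bit, 5 exponent bits, 10 mantissa bits) can be inspected with Python's standard `struct` module, whose `e` format code packs IEEE 754 half precision. This is a minimal sketch rather than code from the slides; the function name `fp16_fields` is hypothetical.

```python
import struct

def fp16_fields(x):
    """Split a value, rounded to FP16, into (sign, exponent, mantissa) bit fields."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

# Storage: FP16 takes half the bytes of FP32.
print(struct.calcsize("<e"), struct.calcsize("<f"))  # 2 4

print(fp16_fields(1.0))         # (0, 15, 0): biased exponent 15, zero mantissa
print(fp16_fields(65504.0))     # (0, 30, 1023): largest finite FP16 value
print(fp16_fields(2.0 ** -24))  # (0, 0, 1): smallest positive subnormal

try:
    struct.pack("<e", 70000.0)  # beyond the FP16 range
except OverflowError as exc:
    print("overflow:", exc)
```

The narrow range, roughly 6e-8 to 65504, is the price FP16 pays for halving storage, whereas the E8M7 layout keeps the FP32 exponent range and gives up mantissa bits instead.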
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

... for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further ... High-Performance Batch-Invariant and Deterministic Kernel Libraries; 3.4 FP4 Quantization-Aware Training; 3.5 Training Framework; 3.5.1 Efficient Implementation of Muon; 3.5.2 Cost-Effective and Memory-Efficient On-Disk KV Cache Storage; 4 Pre-Training; 4.1 Data Construction; 4.2 Pre-Training Setups; 4.2.1 Model Setups; 4.2.2 Training Setups; 4.2.3 Mitigating Training Instability; 4.3 Evaluations ...

0 points | 58 pages | 4.27 MB | 3 months ago
Data Science in Cloud Native (云原生中的数据科学), KubeCon Asia 2018, Final

Data Science Pipeline: Data Ingestion; Feature Transforms; Data Cleaning; Production ML/AI; Model Training; Feature Engineering; Production Model Testing; Model Selection, Parameter Search; Model Export ...

0 points | 47 pages | 14.91 MB | 2 years ago
vLLM v0.4.0.post1 Documentation

... supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently supported in ROCm are FP16 and BF16. ## 1.2.1 Requirements - OS: Linux - Python: 3.8-3.11 - GPU: MI200s (gfx90a), MI300 (gfx942) ... continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16. ## 1.3.1 Requirements - OS: Linux - Python: 3.8-3.11 - Accelerator: NeuronCore_v2 (in ... Note: - BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support. - AVX512_BF16 is an extension ...

0 points | 68 pages | 810.15 KB | 5 months ago
2022 Meituan Technology Annual Collection (2022年美团技术年货合辑)

... ablation results for YOLOv6-nano. The results show that our in-house detection network design brings substantial gains in both accuracy and speed.

|NO.|Method|mAP(0.5:0.95)|Speed (T4) TRT fp16 bs32 (FPS)|
|---|---|---|---|
|A|YOLOv5-nano|28.0|671.5|
|B|A + Decoupled Head|29.4(+1.4)|636.8|
|C|B + Rep-PAN Neck|34.3(+6.3)|1162.6|
|E|D + Efficient Decoupled Head|34.5(+6.5)|1242.2|
|F|E + More training epochs (400 epoch)|35.0(+7.0)|1242.2|

Table 1: YOLOv6-nano ablation results. Table 2 below compares YOLOv6 with other current mainstream YOLO-series ... Table 2 column headers (flattened in the source): Speed (V100) bs32 (ms): fp16, fp32 | Speed (T4) TRT fp16 (FPS): bs1, bs32 | Params (M) | FLOPs (G)

0 points | 1356 pages | 45.90 MB | 2 years ago
