FP16 - IT文库 (free e-book and document downloads for programmers)


2 Training and Deploying Low-Precision Models with Python (张校捷)

Training and Deploying Low-Precision Models with Python (TensorFlow edition), 张校捷, 2019/9/21

## Contents
- Concept and significance of low precision
- FP16 models in TensorFlow
- FP16/Int8 models in TensorRT
- Summary

… E8M23 (tf.float32); FP16: E8M7 (TPU, tf.bfloat16); FP16: E5M10 (GPU, tf.float16); Int8

## Advantages of low-precision floating point
1. Lower memory / GPU memory usage (FP16 takes 1/2 of the original, int8 takes 1/4)
2. Dedicated hardware that accelerates low-precision floating-point computation (TensorCore)

FP16 storage/input ![Image](/uploads/documents/a/3/b/b/a3bbe1f6675c3cec959e1f224b976c60/p5_3.jpg) SSD-RN50-FPN-640

## Representable range of FP16 (E5M10) floating-point numbers
![Image](/uploads/documents/a/3/b/b/a3bbe1f6675c3cec959e1f224b976c60/p6_2.jpg)

0 points | 24 pages | 981.45 KB | 2 years ago
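The E/M notation in the excerpt above (E8M23, E8M7, E5M10) fixes each format's range and precision. The following is a minimal sketch of that relationship, assuming an IEEE-754-style encoding (bias of 2^(E-1) - 1, top exponent code reserved for inf/NaN). The helper name `float_format_limits` is hypothetical and does not come from the slides.

```python
def float_format_limits(exp_bits: int, man_bits: int) -> dict:
    """Limits of a binary float format, assuming IEEE-754-style encoding."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        # Largest finite value: all-ones mantissa at the largest normal exponent.
        "max": (2 - 2.0 ** -man_bits) * 2.0 ** bias,
        # Smallest positive normal value.
        "min_normal": 2.0 ** (1 - bias),
        # Gap between 1.0 and the next representable value.
        "epsilon": 2.0 ** -man_bits,
        # Storage relative to a 32-bit float (sign + exponent + mantissa bits).
        "size_vs_fp32": (1 + exp_bits + man_bits) / 32,
    }


formats = {
    "E8M23 (tf.float32)": (8, 23),
    "E8M7 (tf.bfloat16)": (8, 7),
    "E5M10 (tf.float16)": (5, 10),
}
for name, (e, m) in formats.items():
    print(name, float_format_limits(e, m))
```

Under these assumptions E5M10 tops out at 65504, while E8M7 keeps the exponent range of E8M23 and gives up mantissa bits instead. Both 16-bit formats take half the storage of a 32-bit float.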
PAI & TVM Meetup - Shanghai 20191116

Training/Inference, PAI (Platform of AI), Alibaba Cloud Intelligence

## Outline
• TensorCore AutoCodeGen in TVM
• FP16 Mixed-Precision Training on PAI
• INT8 Inference on PAI-Blade

## TensorCore AutoCodeGen

## Background
[Figure: three 4×4 matrices with entries $A_{i,j}$, $B_{i,j}$, $C_{i,j}$ ($i, j = 0, \dots, 3$), labeled "FP16" and "FP16 or FP32"] …

0 points | 26 pages | 5.82 MB | 1 year ago
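The Background figure in the PAI & TVM excerpt shows 4×4 matrices A, B and C tagged FP16 and "FP16 or FP32". Assuming it depicts the usual Tensor Core multiply-accumulate D = A·B + C with FP16 inputs and an FP32 accumulator (the excerpt itself does not spell this out), the arithmetic can be sketched in plain Python. The function names are hypothetical, and real hardware may order and round the accumulation differently.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest FP16 (E5M10) value; raises OverflowError if out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest FP32 (E8M23) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """D = A @ B + C on 4x4 lists: FP16 inputs A and B, FP32 accumulator (assumed semantics)."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # Inputs are rounded to FP16; each partial sum is rounded to FP32.
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
print(mma_4x4(identity, b, ones))
```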
PyTorch Release Notes

… to an existing FP32 (default) script. AMP will select an optimal set of operations to cast to FP16. FP16 operations require 2X reduced memory bandwidth (resulting in a 2X speedup for bandwidth-bound operations … (reducing the overall memory consumption of your model). Additionally, GEMMs and convolutions with FP16 inputs can run on Tensor Cores, which provide an 8X increase in computational throughput over FP32 …

0 points | 365 pages | 2.94 MB | 2 years ago
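The 2X figures in the release-notes excerpt follow from element size: an FP16 value occupies 2 bytes against 4 for FP32. A minimal sketch using only the standard library's `struct` module (format codes `e` for half precision, `f` for single precision). The sample values are made up for illustration.

```python
import struct

# Illustrative data only: 1000 values in a range FP16 can represent.
values = [0.1 * i for i in range(1, 1001)]

fp32_bytes = struct.pack(f"<{len(values)}f", *values)
fp16_bytes = struct.pack(f"<{len(values)}e", *values)
print(len(fp32_bytes), len(fp16_bytes))  # prints: 4000 2000

# Round-trip through FP16 to see the precision given up for the smaller footprint.
fp16_values = struct.unpack(f"<{len(values)}e", fp16_bytes)
max_rel_err = max(abs(h - v) / v for h, v in zip(fp16_values, values))
print(max_rel_err)
```

With a 10-bit mantissa, the relative rounding error for values in FP16's normal range stays at or below 2^-11.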
vLLM v0.4.0.post1 Documentation

… supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently supported in ROCm are FP16 and BF16.

## 1.2.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- GPU: MI200s (gfx90a), MI300 (gfx942)

… continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16.

## 1.3.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- Accelerator: NeuronCore_v2 (in …

Note:
- BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support.
- AVX512_BF16 is an extension …

0 points | 68 pages | 810.15 KB | 5 months ago
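The note that the CPU backend casts FP16 to BF16 means trading mantissa bits for exponent range. A minimal sketch of such a cast, assuming BF16 is taken as the upper 16 bits of the float32 encoding with round-to-nearest-even. The helper names are hypothetical, NaN handling is omitted, and this is not vLLM's implementation.

```python
import struct


def fp16_round(x: float) -> float:
    """Round to the nearest FP16 (E5M10) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def bf16_round(x: float) -> float:
    """Round to BF16 (E8M7) by keeping the top 16 bits of the float32 bit pattern.

    Uses round-to-nearest-even; NaN inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding increment, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for v in (1.0, 1.0 + 2.0 ** -10, 65504.0):
    h = fp16_round(v)
    print(h, "->", bf16_round(h))
```

The value 1 + 2^-10 is exact in FP16 but collapses to 1.0 in BF16, and the largest FP16 value 65504 rounds up to 65536, which BF16 can hold because it has the wider exponent.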
vLLM v0.4.1 Documentation

… supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently supported in ROCm are FP16 and BF16.

## 1.2.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- GPU: MI200s (gfx90a), MI300 (gfx942)

… continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16.

## 1.3.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- Accelerator: NeuronCore_v2 (in …

Note:
- BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support.
- AVX512_BF16 is an extension …

0 points | 101 pages | 894.09 KB | 5 months ago
vLLM v0.4.3 Documentation

… continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16.

## 1.3.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- Accelerator: NeuronCore_v2 (in …

Note:
- BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support.
- AVX512_BF16 is an extension …

… float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended …

0 points | 121 pages | 1.02 MB | 5 months ago
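The data-type rule quoted in the excerpt ("auto" picks FP16 for FP32 and FP16 models and BF16 for BF16 models; "half" means FP16) can be written as a small lookup. This is a sketch only: `resolve_dtype` is a hypothetical helper, not vLLM's code, and treating "float16" as a synonym for "half" and "float" as a synonym for "float32" is an assumption based on the option names visible in the excerpt.

```python
# Assumed aliases; only "half", "float" and "float32" appear in the excerpt itself.
_ALIASES = {
    "half": "float16",
    "float16": "float16",
    "bfloat16": "bfloat16",
    "float": "float32",
    "float32": "float32",
}


def resolve_dtype(requested: str, checkpoint_dtype: str) -> str:
    """Return the dtype to run with, following the documented "auto" rule."""
    if requested == "auto":
        if checkpoint_dtype in ("float32", "float16"):
            return "float16"   # FP32 and FP16 models run in FP16
        if checkpoint_dtype == "bfloat16":
            return "bfloat16"  # BF16 models stay in BF16
        raise ValueError(f"no 'auto' rule for checkpoint dtype {checkpoint_dtype!r}")
    if requested not in _ALIASES:
        raise ValueError(f"unknown dtype option {requested!r}")
    return _ALIASES[requested]


print(resolve_dtype("auto", "float32"))   # prints: float16
print(resolve_dtype("auto", "bfloat16"))  # prints: bfloat16
print(resolve_dtype("half", "float32"))   # prints: float16
```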
vLLM v0.5.3 Documentation

… pattern of users.
- VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
- VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON to enable U8 weights …

Note:
- BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support.
- AVX512_BF16 is an extension …

… continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16.

## 1.5.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- Accelerator: NeuronCore_v2 (in …

0 points | 143 pages | 1.07 MB | 5 months ago
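The two OpenVINO settings in the excerpt are environment variables, so they must be in place before the serving process starts. A minimal sketch that builds such an environment. The helper name `openvino_env` is hypothetical, and the command that would consume the environment is left out because the excerpt does not give it.

```python
import os


def openvino_env(base=None) -> dict:
    """Copy an environment and add the two settings quoted in the excerpt."""
    env = dict(os.environ if base is None else base)
    # Store the KV cache as u8 instead of the platform default (FP16 / BF16).
    env["VLLM_OPENVINO_CPU_KV_CACHE_PRECISION"] = "u8"
    # Enable U8 weights.
    env["VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS"] = "ON"
    return env


env = openvino_env()
# Pass `env` to the process that launches the server, e.g. subprocess.run(cmd, env=env),
# where `cmd` is whatever launch command your deployment uses.
print({k: v for k, v in env.items() if k.startswith("VLLM_OPENVINO_")})
```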
vLLM v0.5.0.post1 Documentation

Note:
- BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support.
- AVX512_BF16 is an extension …

… continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16.

## 1.4.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- Accelerator: NeuronCore_v2 (in …

… float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended …

0 points | 144 pages | 1.09 MB | 5 months ago
vLLM v0.5.3.post1 Documentation

… pattern of users.
- VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
- VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON to enable U8 weights …

Note:
- BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support.
- AVX512_BF16 is an extension …

… continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16.

## 1.5.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- Accelerator: NeuronCore_v2 (in …

0 points | 143 pages | 1.07 MB | 5 months ago
vLLM v0.5.0 Documentation

… continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16.

## 1.3.1 Requirements
- OS: Linux
- Python: 3.8-3.11
- Accelerator: NeuronCore_v2 (in …

Note:
- BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support.
- AVX512_BF16 is an extension …

… bfloat16, float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended for AWQ quantization …

0 points | 132 pages | 1.05 MB | 5 months ago
