Training and Deploying Low-Precision Models with Python (TensorFlow Edition)
张校捷, 2019/9/21 ... Contents: the concept and significance of low precision; FP16 models in TensorFlow; FP16/Int8 models in TensorRT; summary ... 16-bit floating point: E8M7 (TPU, tf.bfloat16) and E5M10 (GPU, tf.float16); Int8 ... Advantages of low-precision floating point: 1. Lower memory and GPU memory use (FP16 takes 1/2 of the original size, Int8 takes 1/4). 2. Dedicated hardware (TensorCore) accelerates low-precision floating-point computation. FP16 storage/input ... SSD-RN50-FPN-640 ... Representable range of FP16 (E5M10) floating-point numbers
24 pages | 981.45 KB | 2 years ago
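The field widths quoted in the snippet above (E5M10 for tf.float16, E8M7 for tf.bfloat16) determine each format's representable range, and the element sizes give the 1/2 and 1/4 memory ratios. The following is a minimal sketch using only the Python standard library, not code from the slides; the helper names `max_finite` and `min_normal` are hypothetical, and the formulas assume IEEE-754-style encoding in which the top exponent code is reserved for infinity and NaN.

```python
import struct


def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value for the given exponent/mantissa widths (hypothetical helper)."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN, so the largest
    # usable code is 2**exp_bits - 2.
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2.0 ** -man_bits) * 2.0 ** max_exp


def min_normal(exp_bits: int) -> float:
    """Smallest positive normal value for the given exponent width (hypothetical helper)."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias)


fp16_max = max_finite(5, 10)       # E5M10, tf.float16
bf16_max = max_finite(8, 7)        # E8M7, tf.bfloat16
fp16_min_normal = min_normal(5)

# Bytes per element: "f" is IEEE binary32, "e" is IEEE binary16, "b" is a signed 8-bit integer.
sizes = {
    "fp32": struct.calcsize("<f"),
    "fp16": struct.calcsize("<e"),
    "int8": struct.calcsize("<b"),
}

# The computed maximum survives a round trip through a real binary16 encoding.
roundtrip = struct.unpack("<e", struct.pack("<e", fp16_max))[0]

print(fp16_max, fp16_min_normal, bf16_max, sizes, roundtrip)
```

The trade-off is visible in the two maxima: E5M10 keeps more mantissa bits but tops out at 65504, while E8M7 covers roughly the same range as FP32 at lower precision.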
PAI & TVM Meetup - Shanghai 20191116
Training/Inference, PAI (Platform of AI), Alibaba Cloud Intelligence ... Outline: TensorCore AutoCodeGen in TVM; FP16 Mixed-Precision Training on PAI; INT8 Inference on PAI-Blade ... TensorCore AutoCodeGen: Background ... [Figure: 4×4 matrices A, B and C, with operands labeled FP16 or FP32]
26 pages | 5.82 MB | 1 year ago
PyTorch Release Notes
... to an existing FP32 (default) script. AMP will select an optimal set of operations to cast to FP16. FP16 operations require 2X reduced memory bandwidth (resulting in a 2X speedup for bandwidth-bound operations ... (reducing the overall memory consumption of your model). Additionally, GEMMs and convolutions with FP16 inputs can run on Tensor Cores, which provide an 8X increase in computational throughput over FP32 ...
365 pages | 2.94 MB | 2 years ago
vLLM v0.4.0.post1 Documentation
... supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently supported in ROCm are FP16 and BF16. 1.2.1 Requirements: OS: Linux; Python: 3.8-3.11; GPU: MI200s (gfx90a), MI300 (gfx942) ... continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16. 1.3.1 Requirements: OS: Linux; Python: 3.8-3.11; Accelerator: NeuronCore_v2 (in ... Note: BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support. AVX512_BF16 is an extension ...
68 pages | 810.15 KB | 3 months ago
vLLM v0.4.1 Documentation
... supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently supported in ROCm are FP16 and BF16. 1.2.1 Requirements: OS: Linux; Python: 3.8-3.11; GPU: MI200s (gfx90a), MI300 (gfx942) ... continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16. 1.3.1 Requirements: OS: Linux; Python: 3.8-3.11; Accelerator: NeuronCore_v2 (in ... Note: BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support. AVX512_BF16 is an extension ...
101 pages | 894.09 KB | 3 months ago
vLLM v0.4.3 Documentation
... continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16. 1.3.1 Requirements: OS: Linux; Python: 3.8-3.11; Accelerator: NeuronCore_v2 (in ... Note: BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support. AVX512_BF16 is an extension ... float, float32. Data type for model weights and activations. "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. "half" for FP16. Recommended ...
121 pages | 1.02 MB | 3 months ago
vLLM v0.5.0.post1 Documentation
Note: BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support. AVX512_BF16 is an extension ... continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16. 1.4.1 Requirements: OS: Linux; Python: 3.8-3.11; Accelerator: NeuronCore_v2 (in ... float, float32. Data type for model weights and activations. "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. "half" for FP16. Recommended ...
144 pages | 1.09 MB | 3 months ago
vLLM v0.5.3.post1 Documentation
... pattern of users. VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 to control KV cache precision. By default, FP16 / BF16 is used depending on platform. VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON to enable U8 weights ... Note: BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support. AVX512_BF16 is an extension ... continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16. 1.5.1 Requirements: OS: Linux; Python: 3.8-3.11; Accelerator: NeuronCore_v2 (in ...
143 pages | 1.07 MB | 3 months ago
vLLM v0.5.3 Documentation
... pattern of users. VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 to control KV cache precision. By default, FP16 / BF16 is used depending on platform. VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON to enable U8 weights ... Note: BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support. AVX512_BF16 is an extension ... continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16. 1.5.1 Requirements: OS: Linux; Python: 3.8-3.11; Accelerator: NeuronCore_v2 (in ...
143 pages | 1.07 MB | 3 months ago
vLLM v0.5.0 Documentation
... continuous batching is supported in transformers-neuronx. Data types currently supported in Neuron SDK are FP16 and BF16. 1.3.1 Requirements: OS: Linux; Python: 3.8-3.11; Accelerator: NeuronCore_v2 (in ... Note: BF16 is the default data type in the current CPU backend (that means the backend will cast FP16 to BF16), and is compatible with all CPUs with AVX512 ISA support. AVX512_BF16 is an extension ... bfloat16, float, float32. Data type for model weights and activations. "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. "half" for FP16. Recommended for AWQ quantization ...
132 pages | 1.05 MB | 3 months ago
37 results in total













