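PAI & TVM Meetup - Shanghai 20191116

Mixed-Precision Training/Inference
PAI (Platform of AI), Alibaba Cloud Intelligence

## Outline

• TensorCore AutoCodeGen in TVM
• FP16 Mixed-Precision Training on PAI
• INT8 Inference on PAI-Blade

## TensorCore PAI-TF

![](/v2/image?src=cache%2Fimages%2Fd527ccd89e9743fc0d78ae1ec1aa0ec1%2F20a8a2e6a8a3b6411fe5feeb05a81e8c22ba51b05ea4b4af25b3e1d3b710ca17.png)

## INT8 Inference on PAI-Blade

## PAI-Blade

![](/v2/image?src=cache%2Fimages%2Fd527ccd89e9743fc0d78ae1ec1aa0ec1%2F91a72f1d03c3fcf90d3b29c5f52fc6a4d94a5bfa5e0b57fd7a0923af9d8bb5a2.png)

PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.

## DOCUMENTATION

## 1.1 Installation

vLLM is …

0 credits | 68 pages | 810.15 KB | 3 months ago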
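Training and Deploying Low-Precision Models with Python (TensorFlow Edition)
Zhang Xiaojie (张校捷), 2019/9/21

## Contents

- Concept and significance of low precision
- FP16 models in TensorFlow
- FP16/Int8 models in TensorRT
- Summary

![](/v2/image?src=cache%2Fimages%2Fa1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6%2F3e7a9c1f5b2d8e4a6c0f9b3d7e1a5c8f2b6d0e4a9c3f7b1d5e8a2c6f0b4d9e3a.png)

Low-precision formats:

- bfloat16: E8M7 (TPU, tf.bfloat16)
- FP16: E5M10 (GPU, tf.float16)
- Int8

## Advantages of low-precision numbers

1. Lower host and GPU memory use (FP16 takes 1/2 and Int8 takes 1/4 of the original size).
2. Dedicated hardware accelerates low-precision arithmetic (TensorCore).

![](/v2/image?src=cache%2Fimages%2Fa1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6%2F8c2f6a0d4e9b3f7a1c5e8b2d6f0a4c9e3b7d1f5a8c2e6b0d4f9a3c7e1b5d8f2a.png)

Figure labels: FP16 storage/input; Full precision. Slide title: The significance of using low precision.

## When TensorCores apply

1. Convolution: K (input channels) and C (output channels).
2. General matrix multiplication (GEMM): MxK times KxN, with dimensions (M, N, K).

- FP16: sizes are multiples of 8 (8x).
- Int8: sizes are multiples of 16 (16x).

To use TensorCores with FP32 data (converted to FP16 internally), set:

TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=1
TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=1

0 credits | 24 pages | 981.45 KB | 2 years ago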
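The memory ratios and the size conditions above can be checked with a few lines of NumPy. This is only a sketch: it assumes the "original" baseline is FP32, the helper names `nbytes_for` and `tensorcore_friendly` are hypothetical, and the multiple-of-8/16 test simply restates the slide's rule of thumb. Whether a TensorCore kernel is actually selected depends on the cuBLAS/cuDNN version in use.

```python
import numpy as np


def nbytes_for(shape, dtype):
    """Bytes needed to store a dense tensor of `shape` in `dtype`."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize


def tensorcore_friendly(m, n, k, dtype):
    """Slide's rule of thumb: M, N, K are multiples of 8 (FP16) or 16 (Int8)."""
    # Hypothetical helper; the multiples come from the slide, not from a library.
    multiple = {np.dtype(np.float16): 8, np.dtype(np.int8): 16}[np.dtype(dtype)]
    return all(dim % multiple == 0 for dim in (m, n, k))


shape = (1024, 1024)
fp32_bytes = nbytes_for(shape, np.float32)  # assumed baseline
fp16_bytes = nbytes_for(shape, np.float16)
int8_bytes = nbytes_for(shape, np.int8)

print("FP16 / FP32:", fp16_bytes / fp32_bytes)
print("Int8 / FP32:", int8_bytes / fp32_bytes)
print("FP16 GEMM 512x1024x64 eligible:", tensorcore_friendly(512, 1024, 64, np.float16))
print("FP16 GEMM 512x1001x64 eligible:", tensorcore_friendly(512, 1001, 64, np.float16))
```

A common workaround when a dimension fails the check is to pad it up to the next multiple, at the cost of some wasted compute.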
vLLM v0.5.1 Documentation

Python Module Index, Index

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management … including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs and AMD GPUs
- (Experimental) … PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
- vLLM Meetups.

## DOCUMENTATION

## 1.1 Installation

0 credits | 162 pages | 1.14 MB | 3 months ago
vLLM v0.5.2 Documentation

Python Module Index, Index

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management … sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs and AMD GPUs
- (Experimental) … PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
- vLLM Meetups.

## DOCUMENTATION

## 1.1 Installation

0 credits | 166 pages | 1.15 MB | 3 months ago
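TVM@AliOS

TVM@ARM CPU

• Support TFLite (Open Source and Upstream Master)
• Optimize on INT8 & FP32

## AliOS TVM @ ARM CPU INT8 Convolution

• NHWC layout
• im2col + pack
• Tensorize GEMM

![](/v2/image?src=cache%2Fimages%2Fcd1a3a4a5f7b8c6d2e9f0a1b2c3d4e5f%2F7f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e8d7c6b5a4f3e.png)

## AliOS TVM @ ARM CPU INT8 Depthwise Convolution

… instruction if your ARM does not have dot
3. compute_at is very important

## AliOS TVM @ ARM CPU INT8

TVM / QNNPACK Speed Up @ Mobilenet V2 @ rasp 3b+ AARCH64

![](/v2/image?src=cache%2Fimages%2Fcd1a3a4a5f7b8c6d2e9f0a1b2c3d4e5f%2F1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b.png)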
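The "NHWC layout, im2col, GEMM" recipe for INT8 convolution can be illustrated in plain NumPy. This is a minimal sketch and not the AliOS or TVM implementation: it assumes stride 1 and no padding, omits the packing and tensorize steps (which are hardware-specific), and the function names `im2col_nhwc`, `conv2d_int8_im2col`, and `conv2d_direct` are hypothetical. The one detail it does mirror is accumulating INT8 products in INT32.

```python
import numpy as np


def im2col_nhwc(x, kh, kw):
    """Unfold an NHWC tensor into a (N*OH*OW, KH*KW*C) matrix (stride 1, no padding)."""
    n, h, w, c = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((n, oh, ow, kh, kw, c), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # Each kernel offset contributes one shifted view of the input.
            cols[:, :, :, i, j, :] = x[:, i:i + oh, j:j + ow, :]
    return cols.reshape(n * oh * ow, kh * kw * c), (n, oh, ow)


def conv2d_int8_im2col(x, w):
    """INT8 convolution as a single GEMM; `w` has shape (KH, KW, C, OC)."""
    kh, kw, c, oc = w.shape
    a, (n, oh, ow) = im2col_nhwc(x, kh, kw)
    b = w.reshape(kh * kw * c, oc)
    # Widen to int32 before multiplying so int8 products cannot overflow.
    out = a.astype(np.int32) @ b.astype(np.int32)
    return out.reshape(n, oh, ow, oc)


def conv2d_direct(x, w):
    """Reference convolution (sum over kernel offsets), used only for checking."""
    kh, kw, c, oc = w.shape
    n, h, wd, _ = x.shape
    oh, ow = h - kh + 1, wd - kw + 1
    out = np.zeros((n, oh, ow, oc), dtype=np.int32)
    for i in range(kh):
        for j in range(kw):
            patch = x[:, i:i + oh, j:j + ow, :].astype(np.int32)
            out += np.einsum("nhwc,co->nhwo", patch, w[i, j].astype(np.int32))
    return out


rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=(1, 6, 6, 4), dtype=np.int8)
w = rng.integers(-128, 128, size=(3, 3, 4, 8), dtype=np.int8)
y = conv2d_int8_im2col(x, w)
print(y.shape, y.dtype)
```

With NHWC input, the channel axis is innermost, so each im2col row is a contiguous run of kernel-window values, which is what makes the subsequent GEMM a natural target for a dot-product instruction.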













