AWQ - IT文库_程序员IT互联网编程电子书和文档免费下载，助您码力十足！

首页文库资料文章资讯上传文档发布文章登录账户

AI大模型千问 qwen 中文文档

### 1.7 AWQ 对于量化模型，我们推荐使用 AWQ 结合 AutoAWQ。AWQ 即激活感知权重量化，是一种针对 LLM 的低比特权重量化的硬件友好方法。而 AutoAWQ 是一个易于使用的工具包，专门用于 4 比特量化模型。相较于 FP16，AutoAWQ 能够将模型的运行速度提升 3 倍，并将内存需求降低至原来的 1/3。AutoAWQ 实现了激活感知权重量化（AWQ）算法，可用于 LLM 框架下使用量化模型，以及如何对您自己的模型进行量化。 #### 1.7.1 如何在 Transformers 中使用 AWQ 量化模型现在，Transformers 已经正式支持 AutoAWQ，这意味着您可以直接在 Transformers 中使用量化模型。以下是一个非常简单的代码片段，展示如何运行量化模型 Qwen1.5-7B-Chat-AWQ： from transformers import AutoModelForCausalLM from_pretrained( "Qwen/Qwen1.5-7B-Chat-AWQ", # the quantized model device_map="auto" ) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat-AWQ") prompt = "Give me a short

0 码力 | 56 页 | 835.78 KB | 2 年前
3
vLLM v0.6.1.post2 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt, →marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes, →qqq,experts_int8

0 码力 | 215 页 | 1.29 MB | 5 月前
3
vLLM v0.6.1.post1 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt, →marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes, →qqq,experts_int8

0 码力 | 215 页 | 1.28 MB | 5 月前
3
vLLM v0.6.2 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt, →marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes, →qqq,experts_int8

0 码力 | 227 页 | 1.33 MB | 5 月前
3
vLLM v0.6.0 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin, →gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors, →bitsandbytes

0 码力 | 201 页 | 1.26 MB | 5 月前
3
vLLM v0.6.1 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt, →marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes, →qqq,experts_int8

0 码力 | 215 页 | 1.29 MB | 5 月前
3
vLLM v0.4.0.post1 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache - Optimized CUDA kernels vLLM is flexible and easy to use with: - Seamless ROCm vLLM 0.2.4 onwards supports model inferencing and serving on AMD GPUs with ROCm. At the moment AWQ quantization is not supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats] [--quantization {awq,gptq,squeezellm,None}] [--enforce-eager] [--max-context-len-to-capture MAX_CONTEXT_LEN_TO_CAPTURE]

0 码力 | 68 页 | 810.15 KB | 5 月前
3
vLLM v0.5.5 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin, →gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors, →bitsandbytes

0 码力 | 193 页 | 1.22 MB | 5 月前
5
vLLM v0.5.0 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache - Optimized CUDA kernels vLLM is flexible and easy to use with: - Seamless "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", ``` (continues [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,fp8, →marlin,gptq_marlin_24,gptq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes

0 码力 | 132 页 | 1.05 MB | 5 月前
3
vLLM v0.5.0.post1 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache - Optimized CUDA kernels vLLM is flexible and easy to use with: - Seamless "name": "AWQ_inference_with_lora_example", "model": 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', "quantization": "awq", "lora_repo": [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,fp8, →marlin,gptq_marlin_24,gptq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes

0 码力 | 144 页 | 1.09 MB | 5 月前
3

共 18 条前往

页

分类

语言

格式

AI大模型千问 qwen 中文文档

vLLM v0.6.1.post2 Documentation

vLLM v0.6.1.post1 Documentation

vLLM v0.6.2 Documentation

vLLM v0.6.0 Documentation

vLLM v0.6.1 Documentation

vLLM v0.4.0.post1 Documentation

vLLM v0.5.5 Documentation

vLLM v0.5.0 Documentation

vLLM v0.5.0.post1 Documentation

搜索

分类

语言

格式