AI大模型千问 qwen 中文文档### 1.7 AWQ 对于量化模型,我们推荐使用 AWQ 结合 AutoAWQ。AWQ 即激活感知权重量化,是一种针对 LLM 的低比特权重量化的硬件友好方法。而 AutoAWQ 是一个易于使用的工具包,专门用于 4 比特量化模型。相较于 FP16,AutoAWQ 能够将模型的运行速度提升 3 倍,并将内存需求降低至原来的 1/3。AutoAWQ 实现了激活感知权重量化(AWQ)算法,可用于 LLM 框架下使用量化模型,以及如何对您自己的模型进行量化。 #### 1.7.1 如何在 Transformers 中使用 AWQ 量化模型 现在,Transformers 已经正式支持 AutoAWQ,这意味着您可以直接在 Transformers 中使用量化模型。以下是一个非常简单的代码片段,展示如何运行量化模型 Qwen1.5-7B-Chat-AWQ: from transformers import AutoModelForCausalLM from_pretrained( "Qwen/Qwen1.5-7B-Chat-AWQ", # the quantized model device_map="auto" ) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat-AWQ") prompt = "Give me a short0 码力 | 56 页 | 835.78 KB | 2 年前3
vLLM v0.6.1.post2 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt, →marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes, →qqq,experts_int80 码力 | 215 页 | 1.29 MB | 3 月前3
vLLM v0.6.1.post1 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt, →marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes, →qqq,experts_int80 码力 | 215 页 | 1.28 MB | 3 月前3
vLLM v0.6.2 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt, →marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes, →qqq,experts_int80 码力 | 227 页 | 1.33 MB | 3 月前3
vLLM v0.6.0 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin, →gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors, →bitsandbytes0 码力 | 201 页 | 1.26 MB | 3 月前3
vLLM v0.6.1 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt, →marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes, →qqq,experts_int80 码力 | 215 页 | 1.29 MB | 3 月前3
vLLM v0.4.0.post1 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache - Optimized CUDA kernels vLLM is flexible and easy to use with: - Seamless ROCm vLLM 0.2.4 onwards supports model inferencing and serving on AMD GPUs with ROCm. At the moment AWQ quantization is not supported in ROCm, but SqueezeLLM quantization has been ported. Data types currently [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats] [--quantization {awq,gptq,squeezellm,None}] [--enforce-eager] [--max-context-len-to-capture MAX_CONTEXT_LEN_TO_CAPTURE]0 码力 | 68 页 | 810.15 KB | 3 月前3
vLLM v0.5.5 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", 'lora_repo': MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin, →gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors, →bitsandbytes0 码力 | 193 页 | 1.22 MB | 3 月前5
vLLM v0.5.0 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache - Optimized CUDA kernels vLLM is flexible and easy to use with: - Seamless "name": "AWQ_inference_with_lora_example", 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', 'quantization': "awq", ``` (continues [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,fp8, →marlin,gptq_marlin_24,gptq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes0 码力 | 132 页 | 1.05 MB | 3 月前3
vLLM v0.5.0.post1 DocumentationContinuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache - Optimized CUDA kernels vLLM is flexible and easy to use with: - Seamless "name": "AWQ_inference_with_lora_example", "model": 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', "quantization": "awq", "lora_repo": [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats] [--quantization {aqlm,awq,deepspeedfp,fp8, →marlin,gptq_marlin_24,gptq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes0 码力 | 144 页 | 1.09 MB | 3 月前3
共 18 条
- 1
- 2
相关搜索词
Qwen大模型AWQ模型部署多语言支持上下文窗口vLLMLoRA AdapterVision Language ModelsPerformance TuningSampling ParametersLoRA adapterVision Language Models (VLMs)量化模型多模态模型分布式推理OpenAI兼容服务器PagedAttentionKV缓存管理量化分批处理KV cacheLoRApaged attentioncontinuous batchingLLM inferencequantization性能基准测试模型集成参数配置模型支持策略使用统计收集LLM推理与服务VLM支持LLM模型支持多模态推理引擎性能监控













