vLLM v0.4.2 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 99 pages | 982.83 KB | 3 months ago

vLLM v0.4.1 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 101 pages | 894.09 KB | 3 months ago

vLLM v0.6.0 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 201 pages | 1.26 MB | 3 months ago

vLLM v0.4.3 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 121 pages | 1.02 MB | 3 months ago

vLLM v0.6.1 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 215 pages | 1.29 MB | 3 months ago

vLLM v0.6.2 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 227 pages | 1.33 MB | 3 months ago

vLLM v0.5.0 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 132 pages | 1.05 MB | 3 months ago

vLLM v0.5.1 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 162 pages | 1.14 MB | 3 months ago

vLLM v0.5.3 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 143 pages | 1.07 MB | 3 months ago

vLLM v0.5.4 Documentation
vLLM, the vLLM Team
Contents: Getting Started; 1 Documentation; 2 Indices and tables; Python Module Index; Index
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in ...
0 credits | 152 pages | 1.10 MB | 3 months ago













