vLLM - IT文库 (IT Library): free e-books and documents on programming and IT


vLLM v0.6.2 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 217)
- Python Module Index (p. 219)
- Index (p. 221)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 227 pages | 1.33 MB | 5 months ago
vLLM v0.5.0 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 123)
- Python Module Index (p. 125)
- Index (p. 127)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 132 pages | 1.05 MB | 5 months ago
vLLM v0.5.1 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 153)
- Python Module Index (p. 155)
- Index (p. 157)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 162 pages | 1.14 MB | 5 months ago
vLLM v0.5.3 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 135)
- Python Module Index (p. 137)
- Index (p. 139)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 143 pages | 1.07 MB | 5 months ago
vLLM v0.5.4 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 143)
- Python Module Index (p. 145)
- Index (p. 147)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 152 pages | 1.10 MB | 5 months ago
vLLM v0.4.2 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 91)
- Python Module Index (p. 93)
- Index (p. 95)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 99 pages | 982.83 KB | 5 months ago
vLLM v0.4.1 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 93)
- Python Module Index (p. 95)
- Index (p. 97)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 101 pages | 894.09 KB | 5 months ago
vLLM v0.6.0 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 191)
- Python Module Index (p. 193)
- Index (p. 195)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 201 pages | 1.26 MB | 5 months ago
vLLM v0.4.3 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 113)
- Python Module Index (p. 115)
- Index (p. 117)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 121 pages | 1.02 MB | 5 months ago
vLLM v0.6.1 Documentation

vLLM, the vLLM Team

## GETTING STARTED
- 1 Documentation (p. 3)
- 2 Indices and tables (p. 205)
- Python Module Index (p. 207)
- Index (p. 209)

## LLM

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM ... including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill

vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput ... Multi-lora support

For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in

0 码力 (points) | 215 pages | 1.29 MB | 5 months ago

20 results in total


