夏歌: Building LLM Applications with Rust
## RUST CHINA CONF 2023, the 3rd China Rust Developers Conference, 6.17-6.18 @Shanghai ## Building LLM Applications with Rust, 夏歌 … ## Bojan Tunguz … PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. ## DOCUMENTATION ## 1.1 Installation vLLM is …
68 pages | 810.15 KB | 3 months ago
vLLM v0.5.2 Documentation
Contents: Documentation, Indices and tables, Python Module Index, Index. ## LLM vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput … sampling, beam search, and more - Tensor parallelism and pipeline parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs and AMD GPUs - (Experimental) … PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. - vLLM Meetups. ## DOCUMENTATION ## 1.1 Installation …
166 pages | 1.15 MB | 3 months ago
OpenAI, "A practical guide to building agents"
… multi-step tasks. Advances in reasoning, multimodality, and tool use have unlocked a new category of LLM-powered systems known as agents. This guide is designed for product and engineering teams exploring … characteristics that allow it to act reliably and consistently on behalf of a user: 01 It leverages an LLM to manage workflow execution and make decisions. It recognizes when a workflow is complete and can … rules engine works like a checklist, flagging transactions based on preset criteria. In contrast, an LLM agent functions more like a seasoned investigator, evaluating context, considering subtle patterns …
34 pages | 7.00 MB | 1 year ago
vLLM v0.5.1 Documentation
Contents: Documentation, Indices and tables, Python Module Index, Index. ## LLM vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput … including parallel sampling, beam search, and more - Tensor parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs and AMD GPUs - (Experimental) … PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. - vLLM Meetups. ## DOCUMENTATION ## 1.1 Installation …
162 pages | 1.14 MB | 3 months ago
vLLM v0.5.3 Documentation
Contents: Documentation, Indices and tables, Python Module Index, Index. ## LLM vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput … sampling, beam search, and more - Tensor parallelism and pipeline parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs and AMD GPUs - (Experimental) … PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. - vLLM Meetups. ## DOCUMENTATION ## 1.1 Installation …
143 pages | 1.07 MB | 3 months ago
vLLM v0.6.1.post2 Documentation
Contents: Documentation, Indices and tables, Python Module Index, Index. ## LLM vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput … sampling, beam search, and more - Tensor parallelism and pipeline parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs, AMD CPUs and GPUs, … PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. - vLLM Meetups. ## DOCUMENTATION ## 1.1 Installation …
215 pages | 1.29 MB | 3 months ago
vLLM v0.6.2 Documentation
Contents: Documentation, Indices and tables, Python Module Index, Index. ## LLM vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput … sampling, beam search, and more - Tensor parallelism and pipeline parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs, AMD CPUs and GPUs, … PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. - vLLM Meetups. ## DOCUMENTATION ## 1.1 Installation …
227 pages | 1.33 MB | 3 months ago
vLLM v0.6.1.post1 Documentation
Contents: Documentation, Indices and tables, Python Module Index, Index. ## LLM vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput … sampling, beam search, and more - Tensor parallelism and pipeline parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs, AMD CPUs and GPUs, … PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. - vLLM Meetups. ## DOCUMENTATION ## 1.1 Installation …
215 pages | 1.28 MB | 3 months ago













