Boosting Software Efficiency
Boosting Software Efficiency: A Case Study of 100% Performance Improvement in an Embedded C++ System. GILI KAMMA. 2024, September 15 - 20. The talk today is about software development
0 credits | 180 pages | 1.65 MB | 1 year ago
HUAWEI CLOUD Microservice Tool Improves Development Efficiency
HUAWEI CLOUD Microservice Tool Improves Development Efficiency. Department: Application Platform Service. Author: Wang Qijun. Date: 2019-09-20. Contents: 1. Tool for Splitting Monolithic Applications ... ss-level|
| Overall availability | Low | High |
| Continuous evolution | Difficult | Easy |
| Communication efficiency | Low | High |
| Technology stack selection | Restricted | Flexible |
| Scalable | Restricted | Flexible |
| Reusability | Low | High |
... increases. Tool for Splitting Monolithic Applications into Microservices Improves Development Efficiency ✓ Distributed
0 credits | 14 pages | 795.42 KB | 2 years ago
Balancing Efficiency and Flexibility: Cost of Abstractions in Embedded Systems
Balancing Efficiency and Flexibility: Cost of Abstractions in Embedded Systems. MARCELL JUHASZ, zühlke. whoami
0 credits | 75 pages | 2.12 MB | 1 year ago
"Efficient Deep Learning Book" [EDL] Chapter 1 - Introduction
... rapid growth. We will establish our motivation behind seeking efficiency in deep learning models. We will also introduce core areas of efficiency techniques (compression techniques, learning techniques, automation ... Our hope is that even if you just read this chapter, you would be able to appreciate why we need efficiency in deep learning models today, how to think about it in terms of metrics that you care about, and ... models is rate-limited by their efficiency. While efficiency can be an overloaded term, let us investigate two primary aspects: Training Efficiency. Training Efficiency involves benchmarking the model
0 credits | 21 pages | 3.17 MB | 2 years ago
vLLM v0.4.0.post1 Documentation
vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput - Efficient management ... including parallel sampling, beam search, and more - Tensor parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs and AMD GPUs - (Experimental) ... PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. DOCUMENTATION. 1.1 Installation. vLLM is
0 credits | 68 pages | 810.15 KB | 3 months ago
vLLM v0.5.1 Documentation
vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput - Efficient management ... including parallel sampling, beam search, and more - Tensor parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs and AMD GPUs - (Experimental) ... PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. - vLLM Meetups. DOCUMENTATION. 1.1 Installation
0 credits | 162 pages | 1.14 MB | 3 months ago
vLLM v0.5.2 Documentation
vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput - Efficient management ... sampling, beam search, and more - Tensor parallelism and pipeline parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs and AMD GPUs - (Experimental) ... PagedAttention) - vLLM paper (SOSP 2023) - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. - vLLM Meetups. DOCUMENTATION. 1.1 Installation
0 credits | 166 pages | 1.15 MB | 3 months ago
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
... strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context ... architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE ... costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2. Contents: 1 Introduction; 2 Architecture; 2.1 Multi-Head Latent Attention: Boosting Inference Efficiency; 2.1.1 Preliminaries:
0 credits | 52 pages | 1.23 MB | 2 years ago
PAI & TVM Meetup - Shanghai 20191116
Mixed-Precision Training/Inference. PAI (Platform of AI), Alibaba Cloud Intelligence. Outline: • TensorCore AutoCodeGen in TVM • FP16 Mixed-Precision Training on PAI • INT8 Inference on PAI-Blade. TensorCore. PAI-TF. INT8 Inference on PAI-Blade. PAI-Blade

... and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; ... scenarios. In the one-million-token context setting, DeepSeek V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token ... Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts. Right: inference FLOPs and KV cache size of DeepSeek-V4 series and DeepSeek-V3.2. Contents: 1 Introduction; 2 Architecture
0 credits | 58 pages | 4.27 MB | 1 month ago
927 results in total
Related search terms
software efficiency, performance improvement, automated testing, monitoring, release cycle, microservice architecture, monolithic application splitting tool, table association rules, database splitting, abstraction, embedded systems, C++, code bloat, zero-cost abstractions, efficient deep learning, compression techniques, training efficiency, inference efficiency, neural architecture search, vLLM, paged attention, continuous batching, LLM inference, quantization, Vision Language Models, Offline Batched Inference, Preemption, Chunked Prefill, MultiModalDataDict, production metrics, usage statistics, multi-modal models, Multi-head Latent Attention (MLA), DeepSeekMoE, Mixture-of-Experts (MoE), Transformer architecture, TensorCore AutoCodeGen, FP16 Mixed-Precision Training, INT8 Inference, PAI Platform, TVM Framework, DeepSeek-V4, Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), hybrid attention













