CUDA - IT Library: free downloads of programming and IT e-books and documents for programmers


Bringing Existing Code to CUDA Using constexpr and std::pmr

## Bringing Existing Code to CUDA Using constexpr and std::pmr Bowie Owens, October 24-29, 2021 ## Outline • Introduction • Memory • Host vs Device Functions • Return on Investment • Concluding principles ... from introductory CUDA examples to an existing project that has a meaningful amount of non-trivial code. • Provide some guidance to people about to embark on using CUDA to speed up existing software ... Easier Introduction to CUDA: void add_cpu(int n, float* x, float* y) { for (int i = 0; i < n; i++) y[i] = x[i] + y[i]; } TEST_CASE("cppcon-0", "[CUDA]") { int N = 1 <<
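The CPU baseline quoted in this excerpt can be checked on the host before any porting work. The following is a minimal sketch, not the talk's own test: `check_add_cpu` is a hypothetical helper introduced here for illustration, and the fill values (1.0f and 2.0f) and the array size passed by the caller are assumptions, since the excerpt's test case is truncated.

```cpp
#include <cassert>
#include <vector>

// CPU baseline as quoted in the excerpt: y[i] = x[i] + y[i].
void add_cpu(int n, float* x, float* y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

// Hypothetical helper (not from the talk): fills x with 1.0f and y with 2.0f,
// runs add_cpu, and reports whether every element of y equals 3.0f.
bool check_add_cpu(int n) {
    assert(n >= 0);  // a negative element count makes no sense here
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    add_cpu(n, x.data(), y.data());
    for (float v : y)
        if (v != 3.0f) return false;  // 1.0f + 2.0f is exact in float
    return true;
}
```

A test case or a `main` function would call `check_add_cpu` with whatever element count suits the experiment. Keeping a host-side check like this around gives a reference result to compare against once a device version exists.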

0 points | 51 pages | 3.68 MB | 1 year ago
C++ High-Performance Parallel Programming and Optimization - Slides - 08 GPU Programming with CUDA

## GPU Programming with CUDA by 彭于斌 (@archibate) Past recordings: https://www.bilibili.com/video/BV1fa411r7zp Course slides and code: https://github.com/parallel101/course ## Prerequisites • Have studied C/C++ programming. - Understand concepts such as malloc/free. • Be familiar with the STL ... an NVIDIA GTX 900 series or later graphics card. • CUDA 11 or later. • CMake 3.18 or later. ... I am in charge of supervising your study of Chapter 0: Hello, world! ## Enabling CUDA support in CMake - The latest versions of CMake (3 ... adding CUDA after it is all it takes to enable it. - Then add your .cu files directly in add_executable, just as you would .cpp files. cmake_minimum_required(VERSION 3.10) set(CMAKE_CXX_STANDARD 17) set(CMAKE_BUILD_TYPE Release) project(hellocuda LANGUAGES CXX CUDA) add_executable(main

0 points | 142 pages | 13.52 MB | 2 years ago
C++ High-Performance Parallel Programming and Optimization - Slides - 09 CUDA C++ Fluid Simulation in Practice

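## CUDA C++ Fluid Simulation in Practice by 彭于斌 (@archibate) Past recordings: https://www.bilibili.com/video/BV16b4y1E74f Course slides and code: https://github.com/parallel101/course ## CUDA texture objects https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#texture-and-surface-memory ## CUDA multidimensional arrays: wrapper • cudaMalloc3DArray allocates a three-dimensional array. The size along each dimension is specified through cudaExtent; for convenience, our C++ wrapper class uses uint3 to represent the size. - GPU multidimensional arrays use a special data layout to keep memory access efficient, unlike the simple row-major or column-major layout used on the CPU (such as ... return m_cuArray; } ~CudaArray() { checkCudaErrors(cudaFreeArray(m_cuArray)); } ## CUDA surface objects: wrapper - To access a multidimensional array, you must first create a surface object (cudaSurfaceObject_t). - Since a multidimensional array always has to be accessed through a surface object, here we make the surface object class inherit from the multidimensional array class.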

0 points | 58 pages | 14.90 MB | 2 years ago
PyTorch Release Notes

container image. The container also includes the following: Ubuntu 22.04 including Python 3.10, NVIDIA CUDA® 12.1.1, NVIDIA cuBLAS 12.1.3.1, NVIDIA cuDNN 8.9.3, NVIDIA NCCL 2.18.3, NVIDIA RAPIDS ... 0.10.0+96ed6fc PyTorch quantization wheel 2.1.2 ## Driver Requirements Release 23.07 is based on CUDA 12.1.1, which requires NVIDIA Driver release 530 or later. However, if you are running on a data center ... The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, R460, and R520 drivers, which are not forward-compatible with CUDA 12.1.

0 points | 365 pages | 2.94 MB | 2 years ago
Deep Learning and PyTorch: A Hands-On Introduction - 02. Setting Up the Development Environment

ANACONDA, NVIDIA CUDA® ## ANACONDA Anaconda3 5.3.1 (64-bit) Setup: Advanced Installation ## CUDA 10.0 NVIDIA graphics card, NVIDIA installer ## CUDA Visual Studio Integration No supported version of Visual Studio was found. Some components of the CUDA Toolkit will not work properly ... functionality. I understand, and wish to continue the installation regardless. ## Confirming the CUDA installation ## Adding the path to PATH

0 points | 14 pages | 729.50 KB | 2 years ago
4 Python Machine Learning Performance Optimization

## 4 Hands-on optimization ## Problems with a multithreaded server · Each request goes to the GPU on its own, so utilization is low · With a large number of requests in parallel, CUDA blows up ## service-streamer ... ## Model inference optimization · Finally, the part we intuitively think of as optimization - First brushed up on GPU and CUDA fundamentals · Several options to choose from: 1. Buy more, and more expensive, machines: fp16, V100, moving to CPU 2. Optimize the algorithm: knowledge distillation 3. Optimize the implementation: JIT/TensorRT ## PyTorch JIT ... ## CUDA optimization • More efficient kernel function implementations that replace the default exported operators

0 points | 38 pages | 2.25 MB | 2 years ago
Bridging the Gap: Writing Portable Programs for CPU and GPU

the Gap: Writing Portable Programs for CPU and GPU using CUDA. Thomas Mejstrik, Sebastian Woblistin ## Content 1 Motivation: Audience etc., Cuda crash course, Quiz time. 2 Patterns: Oldschool, __host__, Conditional function body, constexpr everything, Disable Cuda warnings, __host__ __device__ template. 3 The dark path: Function dispatch triple. 4 Cuda proposal: Conditional __host__ __device__, Forbid cross function calls. ## Motivation: Audience etc. Ask questions

0 points | 124 pages | 4.10 MB | 1 year ago
Taro: Task graph-based Asynchronous Programming Using C++ Coroutine

#include <...hpp> #include <...cuda.hpp> taro::Taro taro{NUM_THREADS}; auto cuda = taro.cuda_scheduler(NUM_STREAMS); auto task_a = taro.emplace([&]() { cuda.wait([&](cudaStream_t stream) ... synchronize }; auto task_b = taro.emplace([&]() -> taro::Coro { cpu_work_b1(); co_await cuda.suspend_callback([&](cudaStream_t stream) { kernel_b1<<<32, 256, 0, stream>>>();

0 points | 84 pages | 8.82 MB | 1 year ago
POCOAS in C++: A Portable Abstraction for Distributed Data Structures

libraries offer GPU-initiated communication - NVSHMEM will utilize both GPUDirect RDMA and NVLink. CUDA-Aware MPI, GASNet-EX ... __global__ void kernel(BCL::cuda::ptr ptr) { size_t tid = ...; ptr[tid] = tid; } BCL::cuda::ptr ptr = nullptr; if (BCL::rank() == 0) { ptr = BCL::cuda::alloc(BCL::nprocs());

0 points | 128 pages | 2.03 MB | 1 year ago
vLLM v0.5.5 Documentation

Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: GPTQ, AWQ, INT4, INT8, and FP8 - Optimized CUDA kernels, including integration with FlashAttention and FlashInfer ... ## 1.1 Installation vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries. ## 1.1.1 Requirements - OS: Linux - Python: 3.8-3.12 - GPU: compute capability ... 10 -y $ conda activate myenv $ # Install vLLM with CUDA 12.1. $ pip install vllm Note: As of now, vLLM's binaries are compiled with CUDA 12.1 and public PyTorch release versions by default

0 points | 193 pages | 1.22 MB | 5 months ago

297 results in total.
