Bringing Existing Code to CUDA Using constexpr and std::pmr
Bowie Owens, October 24-29, 2021. Outline: Introduction • Memory • Host vs Device Functions • Return on Investment • Concluding ... principles from introductory CUDA examples to an existing project that has a meaningful amount of non-trivial code. • Provide some guidance to people about to embark on using CUDA to speed up existing software ... Easier Introduction to CUDA: void add_cpu(int n, float* x, float* y) { for (int i = 0; i < n; i++) y[i] = x[i] + y[i]; } TEST_CASE("cppcon-0", "[CUDA]") { int N = 1 <<
0 credits | 51 pages | 3.68 MB | 1 year ago | 3

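The add_cpu loop quoted in this entry can be tried on the host with standard C++ before any CUDA is involved. The following is a minimal sketch, assuming a std::pmr::monotonic_buffer_resource as the memory source; the helper name run_add_example and the fill values 1.0f and 2.0f are hypothetical and do not come from the talk.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Element-wise y[i] = x[i] + y[i], with the signature shown in the snippet.
void add_cpu(int n, float* x, float* y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

// Hypothetical helper (an assumption for illustration): allocates both
// arrays from one polymorphic memory resource, runs add_cpu, and returns
// the first element of the result.
float run_add_example(int n) {
    std::pmr::monotonic_buffer_resource pool;  // falls back to the default upstream resource
    std::pmr::vector<float> x(static_cast<std::size_t>(n), 1.0f, &pool);
    std::pmr::vector<float> y(static_cast<std::size_t>(n), 2.0f, &pool);
    add_cpu(n, x.data(), y.data());
    return y[0];
}
```

Because the containers only see a std::pmr::memory_resource pointer, the resource could in principle be swapped for a different one without changing add_cpu or the container types; which resource the talk actually uses is not visible in the snippet.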
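High-Performance Parallel Programming and Optimization in C++ - Slides - 08 GPU Programming with CUDA
GPU Programming with CUDA, by 彭于斌 (@archibate). Past recordings: https://www.bilibili.com/video/BV1fa411r7zp Course slides and code: https://github.com/parallel101/course Prerequisites: • Have studied C/C++ programming. - Understand concepts such as malloc/free. • Familiar with the STL ... An NVIDIA GTX 900 series or later graphics card. • CUDA 11 or later. • CMake 3.18 or later. ![](images/2e3a30f2481fd78e46f4b8c0ee1d7f0f8e5db50dc97ede2c05bc0fa2ac9f6f4b/p2_1.jpg) I am in charge of supervising your studies. Chapter 0: Hello, world! Enabling CUDA support in CMake: - In the latest versions of CMake (3 ... adding CUDA after it is enough to enable support. - Then add your .cu files directly in add_executable, just as you would .cpp files. cmake_minimum_required(VERSION 3.10) set(CMAKE_CXX_STANDARD 17) set(CMAKE_BUILD_TYPE Release) project(hellocuda LANGUAGES CXX CUDA) add_executable(main
0 credits | 142 pages | 13.52 MB | 2 years ago | 3
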
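High-Performance Parallel Programming and Optimization in C++ - Slides - 09 CUDA C++ Fluid Simulation in Practice
CUDA C++ Fluid Simulation in Practice, by 彭于斌 (@archibate). Past recordings: https://www.bilibili.com/video/BV16b4y1E74f Course slides and code: https://github.com/parallel101/course CUDA texture objects: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#texture-and-surface-memory CUDA multidimensional arrays, wrapper: • cudaMalloc3DArray allocates a three-dimensional array. The size along each dimension is specified through cudaExtent; for convenience, our C++ wrapper class uses uint3 to represent the size. - GPU multidimensional arrays use a special data layout to keep memory access efficient, unlike the simple row-major or column-major order used on the CPU (such as ... return m_cuArray; } ~CudaArray() { checkCudaErrors(cudaFreeArray(m_cuArray)); } CUDA surface objects, wrapper: - To access a multidimensional array, you must first create a surface object (cudaSurfaceObject_t). - Since a multidimensional array always has to be accessed through a surface object, here the surface object class inherits from the multidimensional array class.
0 credits | 58 pages | 14.90 MB | 2 years ago | 3
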
PyTorch Release Notes
... container image. The container also includes the following: Ubuntu 22.04 including Python 3.10; NVIDIA CUDA® 12.1.1; NVIDIA cuBLAS 12.1.3.1; NVIDIA cuDNN 8.9.3; NVIDIA NCCL 2.18.3; NVIDIA RAPIDS ... 0.10.0+96ed6fc; PyTorch quantization wheel 2.1.2. Driver Requirements: Release 23.07 is based on CUDA 12.1.1, which requires NVIDIA Driver release 530 or later. However, if you are running on a data center ... The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, R460, and R520 drivers, which are not forward-compatible with CUDA 12.1.
0 credits | 365 pages | 2.94 MB | 2 years ago | 3

Deep Learning and PyTorch: A Hands-On Introduction - 02. Development Environment Setup
ANACONDA ![](images/8c91ed75b1df8d83b0cb6e9f3f7e4efb2e41d2f28a0ed2dabb2ac1a1ab5ae3d0/p1_1.jpg) NVIDIA CUDA® ... ANACONDA: Anaconda3 5.3.1 (64-bit) Setup, Advanced Installation. CUDA 10.0: NVIDIA graphics card, NVIDIA installer. CUDA Visual Studio Integration: No supported version of Visual Studio was found. Some components of the CUDA Toolkit will not work properly ... functionality. I understand, and wish to continue the installation regardless. CUDA installation confirmation. Adding the path to PATH.
0 credits | 14 pages | 729.50 KB | 2 years ago | 3

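4 Python Machine Learning Performance Optimization
4 Hands-on optimization. Problems with a multithreaded server: · Each request goes to the GPU on its own, so utilization is low · With many requests in parallel, CUDA blows up • wrk screenshot ![wrk screenshot](images/6fa2d9ee8c0f8fe5a4cfb5c8fddc6d41e19e156c39f0055537d7f9feafcf0f03f1/p26_2.jpg) service-streamer ... model inference optimize: · Finally we reach the part that matches our intuition of optimization - First brushed up on GPU and CUDA knowledge · Several options to choose from: 1. Buy more, and more expensive, machines: fp16, V100, moving to CPU 2. Optimize the algorithm: knowledge distillation 3. Optimize the implementation: jit/TensorRT. PyTorch jit ... ![](images/6fa2d9ee8c0f8fe5a4cfb5c8fddc6d41e19e156c39f0055537d7f9feafcf0f03f1/p34_3.jpg) CUDA optimization: • More efficient kernel function implementations that replace the default exported operators

... is the real anchor of the valuation; migrating a single trillion-parameter model costs $50-200M plus 12-18 months.

|Dimension|Layer|Score|Core rationale|
|---|---|---|---|
|Computing Die|Silicon|8|AMD MI400/450 is competitive but lacks system integration|
|Networking (Spectrum-X)|Silicon|9|NVDA dominates training workloads|
|Rack-scale (NVL72)|System|10|AMD Helios being 1-2 quarters late means missing a whole generation|
|CUDA+cuDNN+TensorRT|Software|10|Ecosystem lock-in has no answer|
|Framework integration (PyTorch/JAX)|Software|9|New models launch with NVDA support first|

AI Factory (NIM/NeMo) ... Structural inflection point. BOTTOM LINE: The monetary value of the CUDA ecosystem = 4 million developers × 5 million maintained applications × $30-40B of third-party investment per year. This software moat is the anchor of NVDA's valuation: chips can be caught up with, the ecosystem almost cannot. Risk matrix: priorities of the five major risks. The current share price reflects "after FY30, inference ASIC migration may pull the score to 7.5-8.0
0 credits | 14 pages | 795.19 KB | 1 month ago | 3
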
Bridging the Gap: Writing Portable Programs for CPU and GPU
Bridging the Gap: Writing Portable Programs for CPU and GPU using CUDA. Thomas Mejstrik, Sebastian Woblistin. Content: 1 Motivation (Audience etc., CUDA crash course, Quiz time); 2 Patterns (Oldschool, __host__, Conditional function body, constexpr everything, Disable CUDA warnings, __host__ __device__ template); 3 The dark path (Function dispatch triple); 4 CUDA proposal (Conditional __host__ __device__, Forbid cross function calls). Motivation: Audience etc. ... Ask questions
0 credits | 124 pages | 4.10 MB | 1 year ago | 3

Taro: Task graph-based Asynchronous Programming Using C++ Coroutine
#include <...hpp> #include <...cuda.hpp> taro::Taro taro{NUM_THREADS}; auto cuda = taro.cuda_scheduler(NUM_STREAMS); auto task_a = taro.emplace([&]() { cuda.wait([&](cudaStream_t stream) ... synchronize }; auto task_b = taro.emplace([&]() -> taro::Coro { cpu_work_b1(); co_await cuda.suspend_callback([&](cudaStream_t stream) { kernel_b1<<<32, 256, 0, stream>>>(); ...
0 credits | 84 pages | 8.82 MB | 1 year ago | 3

POCOAS in C++: A Portable Abstraction for Distributed Data Structures
... libraries offer GPU-initiated communication - NVSHMEM will utilize both GPUDirect RDMA and NVLink. CUDA-Aware MPI ![](images/dd893da5d87bbef47a87fa04f57c4abb7d7c96ee58e8bdc1bd8db0e49acd6a94/p108_1.jpg) GASNet-EX ... __global__ void kernel(BCL::cuda::ptr<...> ptr) { size_t tid = ...; ptr[tid] = tid; } BCL::cuda::ptr<...> ptr = nullptr; if (BCL::rank() == 0) { ptr = BCL::cuda::alloc<...>(BCL::nprocs());
0 credits | 128 pages | 2.03 MB | 1 year ago | 3

287 results in total