CUDA - IT Library: free downloads of programming e-books and documents for programmers


Bringing Existing Code to CUDA Using constexpr and std::pmr

... principles from introductory CUDA examples to an existing project that has a meaningful amount of non-trivial code. • Provide some guidance to people about to embark on using CUDA to speed up existing software ... float* y) { for (int i = 0; i < n; i++) y[i] = x[i] + y[i]; } TEST_CASE("cppcon-0", "[CUDA]") { int N = 1 << 20; float* x = new float[N]; float* y = new float[N]; for (int ... add_cpu(N, x, y); delete[] x; delete[] y; } An Even Easier Introduction to CUDA ... TEST_CASE("cppcon-1", "[CUDA]") { int N = 1 << 20; float* x; float* y; cudaMallocManaged(&x, N*sizeof(float)); ...

0 points | 51 pages | 3.68 MB | 6 months ago
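
The CPU baseline quoted in the excerpt above can be sketched in plain C++ as follows. This is a minimal sketch, not the speaker's code: the `const` qualifier on `x`, the use of `std::vector` in place of `new[]`/`delete[]`, and the helper `max_error_after_add` with its fill values are all assumptions, because the excerpt elides the initialization loop.

```cpp
#include <cmath>
#include <vector>

// Element-wise y[i] = x[i] + y[i], matching the loop body in the excerpt.
// Assumption: x is taken as const here; the excerpt only shows "float* y".
void add_cpu(int n, const float* x, float* y) {
    for (int i = 0; i < n; i++) y[i] = x[i] + y[i];
}

// Hypothetical helper (not in the excerpt): fill two arrays with the given
// values, run add_cpu, and return the largest deviation from the expected sum.
float max_error_after_add(int n, float x_value, float y_value) {
    std::vector<float> x(n, x_value), y(n, y_value);
    add_cpu(n, x.data(), y.data());
    float max_err = 0.0f;
    for (float v : y) {
        max_err = std::fmax(max_err, std::fabs(v - (x_value + y_value)));
    }
    return max_err;
}
```

Calling `max_error_after_add(1 << 20, 1.0f, 2.0f)` exercises the same array size as the excerpt's test case. A check of this kind is useful to keep around when the loop is later moved to the GPU, since the same assertion can be run against the device result.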
C++ High-Performance Parallel Programming and Optimization - Slides - 08 GPU Programming with CUDA

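GPU Programming with CUDA, by 彭于斌 (@archibate). Past recordings: https://www.bilibili.com/video/BV1fa411r7zp Course slides and code: https://github.com/parallel101/course Prerequisites • Have studied C/C++ programming. • Understand concepts such as malloc/free. • Familiar with STL containers, function templates, and so on. • An NVIDIA GTX 900-series or newer graphics card. • CUDA 11 or later. • CMake 3.18 or later. I am in charge of supervising your studies. Chapter 0: Hello, world! Enabling CUDA support in CMake • With a recent CMake (3.18 or later), adding CUDA after LANGUAGES is all it takes to enable it. • Then, in add_executable, directly add your ... cn/docs/IO/51635/NVIDIA_CUDA_Programming_Guide_1.1_chs.pdf The CUDA compiler is compatible with C++17 • CUDA syntax is almost fully compatible with C++, and the new C++17 features can all be used. You can even rename every source file of any C++ project to .cu and it will still compile. • This is a major advantage of CUDA: the relationship between CUDA and C++ is like that between C++ and ...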

0 points | 142 pages | 13.52 MB | 1 year ago
C++ High-Performance Parallel Programming and Optimization - Slides - 09 CUDA C++ Fluid Simulation in Practice

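CUDA C++ Fluid Simulation in Practice, by 彭于斌 (@archibate). Past recordings: https://www.bilibili.com/video/BV16b4y1E74f Course slides and code: https://github.com/parallel101/course CUDA texture objects https://docs.nvidia.com/cuda/cuda-c-programming-guide/index ... g-guide/index.html#texture-and-surface-memory CUDA multidimensional arrays: wrapper • cudaMalloc3DArray allocates a three-dimensional array. The size along each dimension is specified with cudaExtent; for convenience, our C++ wrapper class uses uint3 to represent the size. • GPU multidimensional arrays use a special data layout to keep memory access efficient, unlike CPU multidimensional arrays, which are simply row-major or column-major (such as a[x + nx * y]). • cudaMemcpy3D can then be used to copy data between a GPU three-dimensional array and a CPU three-dimensional array. CUDA surface objects: wrapper • To access a multidimensional array, you must first create a surface object (cudaSurfaceObject_t). • Since a multidimensional array always has to be accessed through a surface object, here we make the surface object inherit from the multidimensional array. • ...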

0 points | 58 pages | 14.90 MB | 1 year ago
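
The excerpt above contrasts the GPU's opaque array layout with the simple CPU-side indexing `a[x + nx * y]`. As a rough illustration of the CPU side only, the sketch below extends that formula to three dimensions. The names `linear_index` and `HostGrid3D` are hypothetical and do not come from the course code; the layout assumed here has x varying fastest.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: flatten (x, y, z) into one offset with x varying
// fastest, i.e. the 3D extension of the excerpt's a[x + nx * y].
inline std::size_t linear_index(std::size_t x, std::size_t y, std::size_t z,
                                std::size_t nx, std::size_t ny) {
    return x + nx * (y + ny * z);
}

// Hypothetical host-side 3D grid stored as one contiguous buffer.
struct HostGrid3D {
    std::size_t nx, ny, nz;
    std::vector<float> data;

    HostGrid3D(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nz(nz_), data(nx_ * ny_ * nz_, 0.0f) {}

    // Element access through the linear index; no bounds checking.
    float& at(std::size_t x, std::size_t y, std::size_t z) {
        return data[linear_index(x, y, z, nx, ny)];
    }
};
```

A contiguous host buffer like this is the kind of CPU array that would be copied to or from a GPU three-dimensional array; on the device, the element order is chosen by the driver, which is why access goes through a surface or texture object instead of index arithmetic.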
Bridging the Gap: Writing Portable Programs for CPU and GPU

... the Gap: Writing Portable Programs for CPU and GPU using CUDA. Thomas Mejstrik, Sebastian Woblistin. Content: 1 Motivation: Audience etc., Cuda crash course, Quiz time. 2 Patterns: Oldschool, host device everywhere, Conditional function body, constexpr everything, Disable Cuda warnings, host device template. 3 The dark path: Function dispatch triple. 4 Cuda proposal: Conditional host device, Forbid bad cross function ... Thank you

0 points | 124 pages | 4.10 MB | 6 months ago
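
The outline above lists "host device everywhere" and "constexpr everything" among its patterns. A common way to write such portable functions is sketched below; it is not taken from the talk. The macro name `PORTABLE_HD` and the function `lerp_portable` are assumptions made for illustration. `__CUDACC__` is the macro that nvcc defines when compiling CUDA source files, so the same header also builds with an ordinary C++ compiler.

```cpp
// Assumed macro name: expands to the CUDA qualifiers only under nvcc,
// and to nothing when compiled as ordinary C++.
#if defined(__CUDACC__)
#define PORTABLE_HD __host__ __device__
#else
#define PORTABLE_HD
#endif

// Hypothetical example function, callable from host code and, when built
// with nvcc, from device code as well. constexpr additionally permits
// compile-time evaluation.
PORTABLE_HD constexpr float lerp_portable(float a, float b, float t) {
    return a + t * (b - a);
}

// Compile-time check that the function is usable in constant expressions.
static_assert(lerp_portable(0.0f, 10.0f, 0.5f) == 5.0f,
              "lerp_portable should be evaluable at compile time");
```

Keeping the qualifiers behind one macro means the CPU-only build never sees CUDA keywords, at the cost of annotating every shared function by hand.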
Taro: Task graph-based Asynchronous Programming Using C++ Coroutine

... #include <...> #include <...cuda.hpp> taro::Taro taro{NUM_THREADS}; auto cuda = taro.cuda_scheduler(NUM_STREAMS); ... Taro’s Programming Model – Example ... com/dian-lun-lin/taro ... auto task_a = taro.emplace([&]() { cuda.wait([&](cudaStream_t stream) { kernel_a1<<<32, 256, 0, stream>>>(); }); // synchronize }); CUDA stream for offloading GPU kernels ...

0 points | 84 pages | 8.82 MB | 6 months ago
POCOAS in C++: A Portable Abstraction for Distributed Data Structures

... very fast intra-node transfers ... Fast Intra-Node Fabric ... GPU Communication Libraries: CUDA-Aware MPI, NVSHMEM, ROC_SHMEM - Communication libraries offering increasing support for GPU-to-GPU ... will utilize both GPUDirect RDMA and NVLink ... GASNet-EX Memory Kinds ... BCL::cuda::ptr ptr = nullptr; if (BCL::rank() == 0) { ptr = BCL::cuda::alloc(BCL::nprocs()); } ptr = BCL::broadcast(ptr, 0); ptr[BCL::rank()] = BCL::rank(); ...

0 points | 128 pages | 2.03 MB | 6 months ago
An Editor Can Do That?

Visual Studio Code: What’s new? 1. GitHub Codespaces (coding from your ... CMake Presets support 3. ARM and ARM64 support (Raspberry Pi, Surface Pro X, Apple Silicon) 4. CUDA IntelliSense and GPU debugging 5. Disassembly View while debugging (Preview!) ...

0 points | 71 pages | 2.53 MB | 6 months ago
Heterogeneous Modern C++ with SYCL 2020

Gordon Brown, Principal Product Owner, oneAPI & Automotive. Currently leading team developing HIP & CUDA backends for DPC++. Background in C++ programming models for heterogeneous systems. Worked on ComputeCpp ... pointers can work naturally without buffers or accessors • Simplifies porting from most code (e.g. CUDA, C++) • Parallel Reductions • Added built-in reduction operation to avoid boilerplate code and achieve ... LLVM/Clang, Part of oneAPI; ComputeCpp, Multiple Backends; triSYCL, Open source test bed; hipSYCL, CUDA and HIP/ROCm ...

0 points | 114 pages | 7.94 MB | 6 months ago
Khronos APIs for Heterogeneous Compute and Safety: SYCL and SYCL SC

... vec_add<<<64, 64>>>(a, b, c); cudaMemcpy(d_a, h_a, size, cudaMemcpyDeviceToHost); Examples: OpenCL, CUDA, OpenMP, SYCL 2020. Implementation: data is moved to the device via explicit copy APIs. Here ... is moved to the device implicitly via cross host CPU / device data structures. Here we’re using CUDA as an example. • Unified shared memory provides an alternative pointer-based data management model to ... learn. Starting from C++: easy to add SYCL to existing C++ software. Starting from CUDA: easy to port from CUDA to SYCL, keep performance on GPUs. Starting from another language: SPIR-V standard enables ...

0 points | 82 pages | 3.35 MB | 6 months ago
C++ High-Performance Parallel Programming and Optimization - Slides - 11 Advanced Guide to Modern CMake

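When Makefile starts, it checks every file, which wastes a lot of time. The speedup from Ninja is most noticeable when there are many files but only a small part actually needs to be built, so that the build is I/O bound. However, the CUDA toolkit from a certain patent-minded company only allows building with MSBuild on Windows, not with Ninja (perhaps there is some deal with Bill Gates). Chapter 1: Adding source files. A .cpp source file for testing ... specifies which programming languages the project uses. • Currently supported languages include: • C: the C language • CXX: the C++ language • ASM: assembly language • Fortran: the programming language of the older generation • CUDA: NVIDIA's CUDA (added in version 3.8) • OBJC: Apple's Objective-C (added in version 3.16) • OBJCXX: Apple's Objective-C++ (added in version 3.16) ... CXX_STANDARD or the global variable CMAKE_CXX_STANDARD to set the -std=c++17 flag; CMake checks at configure time whether the compiler supports C++17. The same applies to CUDA's -arch=sm_75: use the CUDA_ARCHITECTURES property instead. Besides, -std=c++17 is only a GCC option and cannot be used across platforms with the MSVC compiler. If you really must use dynamic link libraries (Windows ...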

0 points | 166 pages | 6.54 MB | 1 year ago
