Bridging the Gap: Writing Portable Programs for CPU and GPU using CUDA — Thomas Mejstrik, Sebastian Woblistin
Contents: 1. Motivation (audience, CUDA crash course, quiz time); 2. Patterns (old-school patterns, the dark path, the CUDA proposal). Why write programs for CPU and GPU: the two differ in design targets — latency vs. throughput, memory bandwidth, number of cores — so algorithms are designed differently. Why it makes sense: for library/framework developers and for embarrassingly parallel algorithms.
0 credits | 124 pages | 4.10 MB | 6 months ago
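A minimal sketch of the portability idea in the talk's title (my illustration, not the speakers' code): in CUDA, a function marked __host__ __device__ is compiled for both targets, so one definition serves the CPU path and the GPU path.

#include <cstdio>

// One definition, two targets: __host__ __device__ builds this for CPU and GPU.
__host__ __device__ float saxpy_element(float a, float x, float y) {
    return a * x + y;
}

__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) y[i] = saxpy_element(a, x[i], y[i]);  // GPU path
}

int main() {
    printf("%f\n", saxpy_element(2.f, 3.f, 4.f));    // CPU path, same function
    return 0;
}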
C++高性能并行编程与优化 — 课件 08: CUDA 开启的 GPU 编程 (C++ High-Performance Parallel Programming and Optimization — Lecture 08: GPU Programming with CUDA) — by 彭于斌 (@archibate)
Recorded lectures: https://www.bilibili.com/video/BV1fa411r7zp; slides and code: https://github.com/parallel101/course. Prerequisites: C/C++ programming, concepts like malloc/free, familiarity with STL containers and function templates. Writing code that runs on the GPU: add the __global__ qualifier to a function definition and it executes on the GPU; functions qualified with __global__ are called kernels. A kernel cannot be invoked as kernel() — it must be launched with the triple-angle-bracket syntax kernel<<<1, 1>>>() (what the two 1s mean is explained later); the printf then runs on the GPU. No output? Synchronize! Compiled and run as-is, that code prints no Hello, world!, because CPU–GPU communication is asynchronous for efficiency: kernel<<<1, 1>>>() returns immediately instead of waiting for the GPU to finish — in fact it merely …
0 credits | 142 pages | 13.52 MB | 1 year ago
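A minimal runnable version of exactly what this excerpt describes, assembled from its fragments: a __global__ kernel launched with <<<1, 1>>>, plus the synchronization call without which the greeting never appears.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() {
    printf("Hello, world!\n");   // executes on the GPU
}

int main() {
    kernel<<<1, 1>>>();          // asynchronous launch: returns before the GPU runs
    cudaDeviceSynchronize();     // wait for the GPU, or the output is lost
    return 0;
}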
POCOAS in C++: A Portable Abstraction for Distributed Data Structures
GPUs as a first-class computing resource: historically, network communication was CPU-centric — data moved between the GPU and the NIC through the CPU over the PCI bus (or other fabric). Now (1) direct GPU access to InfiniBand allows GPU-to-GPU network transfers, and (2) fast in-node fabrics like NVLink and Infinity Fabric allow very fast intra-node transfers.
0 credits | 128 pages | 2.03 MB | 6 months ago
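To make the model concrete, here is a purely illustrative, single-process sketch of the one-sided put/get interface PGAS-style libraries expose — every name below is hypothetical, not the POCOAS API, and a real implementation would back put/get with RDMA or NVLink transfers rather than a local vector.

#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for a partitioned global array: each "rank" owns one
// contiguous segment, and any rank may read or write any segment directly.
struct distributed_array {
    int nranks;
    std::size_t per_rank;
    std::vector<double> storage;  // simulates the ranks' memory segments
    distributed_array(int r, std::size_t n)
        : nranks(r), per_rank(n), storage(static_cast<std::size_t>(r) * n, 0.0) {}
    void put(int rank, std::size_t i, double v) { storage[rank * per_rank + i] = v; }  // one-sided write
    double get(int rank, std::size_t i) { return storage[rank * per_rank + i]; }       // one-sided read
};

int main() {
    distributed_array a(4, 8);    // 4 "ranks", 8 doubles each
    a.put(2, 3, 42.0);            // write into rank 2's segment without involving rank 2's CPU
    printf("%f\n", a.get(2, 3));  // 42.0
    return 0;
}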
Taro: Task Graph-Based Asynchronous Programming Using C++ Coroutine
Existing task-graph programming systems (TGPSs) on heterogeneous computing — the challenge: in a task graph with tasks A, B, C, and D, where task B has a CPU part B1 and a GPU part B2, the task is written as task_b = sched.emplace([&](){ /* CPU code */ /* GPU code */ }); and each task executes atomically, so the CPU thread blocks until the GPU finishes. Assuming one CPU and one GPU (B1: CPU operation, B2: GPU operation), the CPU sits idle at runtime while B2 runs.
0 credits | 84 pages | 8.82 MB | 6 months ago
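The blocking anti-pattern the excerpt describes, reproduced in plain CUDA (my sketch, not Taro code): the worker thread running task B parks inside cudaStreamSynchronize until the GPU half finishes, so it cannot pick up other ready tasks — the idleness that Taro's coroutine-based suspension is designed to remove.

#include <cuda_runtime.h>

__global__ void b2_kernel() { /* GPU half of task B */ }

void task_b(cudaStream_t stream) {
    // B1: the CPU half of task B would run here ...
    b2_kernel<<<1, 1, 0, stream>>>();  // B2: launched asynchronously
    cudaStreamSynchronize(stream);     // the CPU thread now blocks, doing no useful work
}

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    task_b(s);
    cudaStreamDestroy(s);
    return 0;
}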
Heterogeneous Modern C++ with SYCL 2020
Speakers: Michael Wong (http://wongmichael.com/about; C++11 book in Chinese: https://www.amazon.cn/dp/B00ETOV2OQ; "We build GPU compilers for some of the most powerful supercomputers in the world") and Nevin ":-)" Liber (nliber@anl…). Slides under an Attribution 4.0 International License. SYCL: single-source C++ parallel programming — standard C++ application code, C++ libraries, and ML frameworks targeting CPU, GPU, FPGA, DSP, AI/tensor hardware, custom hardware, and other backends — aiming to give better performance on complex apps and libs than hand-coding. SYCL 2020 is here! An open standard for …
0 credits | 114 pages | 7.94 MB | 6 months ago
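A minimal single-source SYCL 2020 example in the spirit of the slide (my sketch, not code from the talk): the same standard C++ translation unit runs on whatever device the default queue selects — CPU, GPU, or another backend.

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f);
    sycl::queue q;  // default selector: CPU, GPU, accelerator, ...
    {
        sycl::buffer<float> ba(a), bb(b);
        q.submit([&](sycl::handler& h) {
            sycl::accessor xa(ba, h, sycl::read_write);
            sycl::accessor xb(bb, h, sycl::read_only);
            h.parallel_for(sycl::range<1>(1024),
                           [=](sycl::id<1> i) { xa[i] += xb[i]; });
        });
    }  // buffers destroyed here: results are copied back into the vectors
    std::cout << a[0] << "\n";  // prints 3
    return 0;
}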
Bringing Existing Code to CUDA Using constexpr and std::pmr
Following "An Even Easier Introduction to CUDA": a kernel __global__ void add_gpu(int n, float* x, float* y) { for (int i = 0; i < n; i++) y[i] = x[i] + y[i]; } is exercised from a Catch2-style TEST_CASE("cppcon-1", "[CUDA]"), which sets int N = 1 << 20, allocates float* x and float* y, launches add_gpu<<<1, 1>>>(N, x, y), and releases the buffers with cudaFree(x); cudaFree(y);.
0 credits | 51 pages | 3.68 MB | 6 months ago
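The fragments above assembled into one runnable unit (my reconstruction: unified-memory allocation via cudaMallocManaged follows the cited tutorial, and the Catch2 test wrapper is dropped for brevity).

#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_gpu(int n, float* x, float* y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];        // one GPU thread does all the work, as on the slides
}

int main() {
    int N = 1 << 20;
    float* x;
    float* y;
    cudaMallocManaged(&x, N * sizeof(float));  // unified memory, visible to CPU and GPU
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    add_gpu<<<1, 1>>>(N, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);               // 3.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}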
Distributed Ranges: A Model for Building Distributed Data Structures, Algorithms, and Views
(The talk involves experimental prototypes and early research.) Problem: writing parallel programs is hard — multi-GPU, multi-CPU systems require partitioning data, and users must manually split data amongst GPUs and manually orchestrate execution where necessary. Multi-GPU systems have many NUMA regions — 4+ GPUs and 2+ CPUs per node, connected by Xe Link and a NIC — and GPUs split into tiles (Tile 0, Tile 1) add even more memory domains, so software is needed to reduce the complexity. Project goals: offer high-level, standard C++ …
0 credits | 127 pages | 2.06 MB | 6 months ago
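For flavor, the manual bookkeeping the project aims to hide — a plain C++ sketch (mine, not the library's API) that block-partitions one logical array into per-device segments; the actual device allocations and transfers are omitted.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct segment { std::size_t begin, size; int device; };

// Split n elements over ndevices in contiguous blocks, as a user must do by hand today.
std::vector<segment> partition(std::size_t n, int ndevices) {
    std::vector<segment> segs;
    std::size_t chunk = (n + ndevices - 1) / ndevices;  // ceiling division
    for (int d = 0; d < ndevices; ++d) {
        std::size_t b = static_cast<std::size_t>(d) * chunk;
        if (b >= n) break;
        segs.push_back({b, std::min(chunk, n - b), d});
    }
    return segs;
}

int main() {
    for (const segment& s : partition(1000, 3))
        std::printf("device %d owns [%zu, %zu)\n", s.device, s.begin, s.begin + s.size);
    return 0;
}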
C++高性能并行编程与优化 — 课件 06: TBB 开启的并行编程之旅 (C++ High-Performance Parallel Programming and Optimization — Lecture 06: A Journey into Parallel Programming with TBB)
Course outline (excerpt): 4. How compilers optimize automatically: C++ from the assembly point of view; 5. Multithreading since C++11: from mutex to lock-free parallelism; 6. Common parallel-programming frameworks: OpenMP and Intel TBB; 7. The overlooked memory-access optimizations: memory bandwidth and the CPU cache mechanism; 8. GPU topics: warp scheduling, shared memory, barriers; 9. Parallel algorithms in practice: reduce, scan, matrix multiplication, etc.; 10. The key to storing large-scale 3D data: sparse data structures; 11. Physics simulation in practice: implementing a neighbor-search table. Hardware requirements: … (the …-bit era is over), at least 2 cores / 4 threads (this is a parallelism course…), an NVIDIA GPU (GPU topics). Software requirements: Visual Studio 2019 (Windows users), GCC 9 or later (Linux users), CMake 3.12 or later (cross-platform assignments), Git 2.x (assignments uploaded to GitHub), CUDA Toolkit 10.0 or later (GPU topics). Chapter 0: From concurrency to parallelism. Moore's law: has it stopped growing? … where n is the number of elements. Improved parallel reduction (GPU): the previous scheme is unfriendly when c is large, because the final serial for loop still consumes a lot of time; instead, use a recursive pattern that shrinks the data by half each time, so essentially every pass is a parallel for and only log2(n) parallel-for passes complete the reduction. This is common where core counts are very high, e.g. reductions on a GPU. Conclusion: the time complexity of the improved parallel reduction is …
0 credits | 116 pages | 15.85 MB | 1 year ago
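The halving reduction this excerpt describes, simulated serially in plain C++ (my sketch): each pass folds the upper half of the array into the lower half, so log2(n) passes finish the reduction — and on a GPU every pass is a single fully parallel step.

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> a(8, 1.0f);  // n is a power of two for simplicity
    for (std::size_t stride = a.size() / 2; stride >= 1; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)  // a parallel for on a GPU
            a[i] += a[i + stride];                // fold the upper half into the lower
    printf("%f\n", a[0]);  // 8.0: the sum of all elements
    return 0;
}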
An Editor Can Do That?
Visual Studio Code — what's new? 1. GitHub Codespaces (coding from your browser!); 2. CMake support; 3. ARM and ARM64 support (Raspberry Pi, Surface Pro X, Apple Silicon); 4. CUDA IntelliSense and GPU debugging; 5. Disassembly View while debugging (preview!).
0 credits | 71 pages | 2.53 MB | 6 months ago
Powered by AI: A Cambrian Explosion for C++ Software Development Tools
[Profiler screenshots: per-line CPU time split into Python, native, and system %; memory usage over time (Python vs. native, average and peak, % of memory allocated); copy volume (MB/s); GPU utilization % and peak GPU memory.] Writing your code in Python, profiling your … AI-powered optimizations!
0 credits | 128 pages | 23.40 MB | 6 months ago













