ARM32 CPU - IT文库


Bridging the Gap: Writing Portable Programs for CPU and GPU

Bridging the Gap: Writing Portable Programs for CPU and GPU using CUDA. Thomas Mejstrik, Sebastian Woblistin.
Content: 1 Motivation (Audience etc., Cuda crash course, Quiz time); 2 Patterns (Oldschool … afterwards …)
Outline: Motivation, Patterns, The dark path, Cuda proposal, Thank you
Why write programs for CPU and GPU
Difference CPU/GPU: Algorithms are designed differently; Latency/Throughput; Memory bandwidth; Number of …
Why it makes sense? Library/Framework developers; Embarrassingly parallel algorithms

0 points | 124 pages | 4.10 MB | 6 months ago
C++ High-Performance Parallel Programming and Optimization - Slides - 08 GPU Programming with CUDA

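• This is because communication between the GPU and the CPU is asynchronous for efficiency. When the CPU calls kernel<<<1, 1>>>(), the call does not wait for the kernel to finish on the GPU. It only pushes the kernel onto the GPU's execution queue and returns immediately.
• You can therefore call cudaDeviceSynchronize() to make the CPU wait until the GPU has completed every task in the queue. This ensures the kernel finishes on the GPU before main exits.
Defining device functions that run on the GPU
• __global__ defines a kernel function. It runs on the GPU, is called from the CPU with the triple-angle-bracket syntax, may take parameters, and cannot return a value.
• __device__ defines a device function. It runs on the GPU but is also called from the GPU, needs no triple angle brackets, and is used like an ordinary function … symbol, and has nothing to do with inlining in the performance-optimization sense.
• Inlining in the optimization sense means placing the function body directly at the call site.
• The CUDA compiler therefore provides a vendor-specific keyword, __inline__, to declare a function inline. It works for both CPU and GPU functions, as long as you compile with the CUDA compiler. The GCC counterpart is __attribute__((“inline”)).
• Note that declaring a function __inline__ does not guarantee that it is inlined; if the function is too large, the compiler …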
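The launch-then-synchronize behaviour described above can be modelled on the CPU alone. The sketch below is a toy, single-threaded stand-in and not the CUDA runtime: `ToyDeviceQueue`, `launch`, `synchronize`, and `pending` are hypothetical names invented for illustration. It only shows the ordering guarantee: a launch returns before the work runs, and the work is certain to be finished only after the synchronize call.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Toy stand-in for a device execution queue (hypothetical, NOT the CUDA API).
class ToyDeviceQueue {
public:
    // Analogous to kernel<<<...>>>(): enqueue the task and return at once.
    void launch(std::function<void()> task) {
        assert(task && "task must be callable");
        pending_.push(std::move(task));
    }

    // Analogous to cudaDeviceSynchronize(): return only after every
    // queued task has run. In this toy model the caller runs them itself.
    void synchronize() {
        while (!pending_.empty()) {
            std::function<void()> task = std::move(pending_.front());
            pending_.pop();
            task();
        }
    }

    // Number of tasks that have been launched but not yet executed.
    std::size_t pending() const { return pending_.size(); }

private:
    std::queue<std::function<void()>> pending_;
};
```

A caller that reads a result right after `launch` sees the old value; reading it after `synchronize` sees the new one, which mirrors why main should synchronize before it exits.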

0 points | 142 pages | 13.52 MB | 1 year ago
Au Units

Example: “CPU ticks” time units
constexpr uint64_t CPU_CLOCK_HZ = 400'000'000;
// API to implement:
std::chrono::nanoseconds elapsed_time(uint64_t num_cpu_ticks);
std::chrono::nanoseconds elapsed_time(uint64_t num_cpu_ticks) {
… 0, CPU_CLOCK_HZ>;
return std::chrono::nanoseconds{ num_cpu_ticks * NS_PER_TICK::num / NS_PER_TICK::den };
}
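A minimal, compilable sketch of the tick conversion shown in the snippet. The snippet's definition of `NS_PER_TICK` is truncated, so the `std::ratio` definition below is an assumption, and the overflow guard is an addition for illustration rather than part of the original slide.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>
#include <ratio>

constexpr std::uint64_t CPU_CLOCK_HZ = 400'000'000;

// Assumed definition: nanoseconds per tick as a compile-time ratio.
// std::ratio reduces 1'000'000'000 / 400'000'000 to 5 / 2.
using NS_PER_TICK = std::ratio<1'000'000'000, CPU_CLOCK_HZ>;

std::chrono::nanoseconds elapsed_time(std::uint64_t num_cpu_ticks) {
    // Added guard (not in the source): the multiplication below must not
    // overflow a signed 64-bit integer.
    assert(num_cpu_ticks <=
           static_cast<std::uint64_t>(
               std::numeric_limits<std::int64_t>::max() / NS_PER_TICK::num));

    // Multiply before dividing so that the 5/2 ratio keeps whole nanoseconds;
    // the integer division truncates any remaining half nanosecond.
    const std::int64_t ns =
        static_cast<std::int64_t>(num_cpu_ticks) * NS_PER_TICK::num / NS_PER_TICK::den;
    return std::chrono::nanoseconds{ns};
}
```

With a 400 MHz clock each tick lasts 2.5 ns, so an odd tick count loses half a nanosecond to truncation.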

0 points | 191 pages | 22.37 MB | 6 months ago
Taro: Task graph-based Asynchronous Programming Using C++ Coroutine

Existing TGPSs on Heterogeneous Computing - Challenge
B! : CPU operation; B" : GPU operation
[Figure: task graph with tasks A, B (B! and B"), C, D]
task_b = sched.emplace([](&){
// CPU code; // GPU code;
}); // CPU thread finishes
Atomic execution per task
[Figure: runtime timeline of A, B!, C, D, B" on CPU and GPU, including idle time]
Assume one CPU and one …

0 points | 84 pages | 8.82 MB | 6 months ago
POCOAS in C++: A Portable Abstraction for Distributed Data Structures

[Figure: two nodes, each with CPU, NIC, and DRAM]
Advantages of PGAS
- Asynchronous
- RDMA operations executed by NIC
- Allows irregular, one-sided access
- Maps well to data structure ops

0 points | 128 pages | 2.03 MB | 6 months ago
How Meta Made Debugging Async Code Easier with Coroutines and Senders

Walking the stack
[Figure: CPU registers frame* and instr*; a chain of stack frames, each holding ret*, prev*, and data; functions process_file, coro::resume, ...]
… () at main.cpp:70

0 points | 131 pages | 907.41 KB | 6 months ago
C++ High-Performance Parallel Programming and Optimization - Slides - 07 Memory Access Optimization, In Depth and Made Simple

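… Chapter: Memory bandwidth
cpu-bound vs. memory-bound
• Generally speaking, parallelism can only speed up the computation, not the memory reads and writes.
• So for a loop body such as fill, which does no computation and only accesses memory, parallelism gives no speedup. This is called memory-bound.
• A loop body such as sine, which is evaluated internally with a Taylor expansion and does a lot of work per iteration, parallelizes well. This is called cpu-bound … the computation is too simple and the data set is large, so parallelism only adds multithreaded scheduling overhead.
• Teacher Xiaopeng's rule of thumb: 1 floating-point read/write ≈ 8 floating-point additions
• If vectorization succeeds (SSE): 1 floating-point read/write ≈ 32 floating-point additions
• If the CPU has 4 cores and vectorization succeeds: 1 floating-point read/write ≈ 128 floating-point additions
Time taken by common operations
• In the figure, add and mul both refer to integer operations.
• The difference is that floating-point multiplication and addition run at essentially the same speed. …
• funcA saturates with 2 cores.
• funcB saturates only with 4 cores.
• funcC saturates only with 6 cores.
• Conclusion: to use all CPU cores and avoid being mem-bound, func must contain enough computation.
• The more cores there are, the stronger the CPU's compute capability, and the less the memory can keep up with reads and writes, so the program becomes mem-bound more easily.
[Figure: plot residue, axis ticks only]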
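The two loop shapes contrasted above can be sketched as follows. The function names `fill_loop`, `sine_loop`, and `seconds` are placeholders chosen here, not the course's code, and the loops are serial; the point is only to show which loop has almost no arithmetic per element and which has a lot.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Memory-bound shape: one store per element and no real arithmetic,
// so extra cores mostly wait on memory bandwidth.
void fill_loop(std::vector<float>& a, float value) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = value;
    }
}

// Compute-bound shape: std::sin costs many floating-point operations
// per element, so the arithmetic dominates the memory traffic.
void sine_loop(std::vector<float>& a) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = std::sin(static_cast<float>(i));
    }
}

// Small timing helper: runs f once and returns the wall-clock seconds.
template <class F>
double seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Timing both loops on a large vector, for example with `seconds([&] { fill_loop(a, 1.0f); })`, gives a rough feel for the gap on a given machine; the actual numbers depend on the hardware and compiler flags.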

0 points | 147 pages | 18.88 MB | 1 year ago
Branchless Programming in C++

… and performance
● Understanding the hardware and using it efficiently
– Computing resources of a CPU
– Pipelining
– Branch prediction and hardware loop unrolling
● Conditional code vs efficiency … evaluations/second
● Optimized: 570M evaluations/second
USE ALL OF THE CPU HARDWARE ALL THE TIME
● What determines performance?
● Optimal algorithm:
– get the result with …
COMPUTING RESOURCES OF A CPU
unsigned long v1[N], v2[N];
unsigned long a = 0;
for (size_t i = 0; i < N; ++i) {
a += v1[i]*v2[i];
}
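To make "conditional code vs efficiency" concrete, here is a hedged sketch of one common branchless rewrite: selecting between two values through a small lookup table instead of an if/else. The function names and the flag array `b` are assumptions made for illustration and are not taken from the talk. Whether a compiler actually emits a jump for either version depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchy version: the flag b[i] decides which array contributes.
unsigned long sum_branchy(const std::vector<unsigned long>& v1,
                          const std::vector<unsigned long>& v2,
                          const std::vector<unsigned char>& b) {
    assert(v1.size() == v2.size() && v1.size() == b.size());
    unsigned long a = 0;
    for (std::size_t i = 0; i < v1.size(); ++i) {
        if (b[i]) {
            a += v1[i];
        } else {
            a += v2[i];
        }
    }
    return a;
}

// Branchless version: both candidates are loaded, and the flag is used
// as an index (0 or 1) to pick one, so the source has no if/else.
unsigned long sum_branchless(const std::vector<unsigned long>& v1,
                             const std::vector<unsigned long>& v2,
                             const std::vector<unsigned char>& b) {
    assert(v1.size() == v2.size() && v1.size() == b.size());
    unsigned long a = 0;
    for (std::size_t i = 0; i < v1.size(); ++i) {
        const unsigned long candidates[2] = {v2[i], v1[i]};
        a += candidates[b[i] != 0];
    }
    return a;
}
```

The branchless form does slightly more work per iteration, so it tends to pay off only when the flags are hard to predict; measuring both on real data is the only reliable guide.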

0 points | 61 pages | 9.08 MB | 6 months ago
C++ Exceptions for Smaller Firmware

0x3FD4 obj[1/3]
0x3FD0 obj[2/3]
0x3FCC obj[3/3]
0x3FC8 bar()+14
0x3FC4 [bar] r4
Virtual ARM CPU
Reg Value
R0 XXXX
R1 XXXX
R2 XXXX
R3 XXXX
R4 ?
R5 ?
R6 –
R7 –
R8 –
R9 –
R10 –
R11 –
R12 …
… exception runtime needs to remember during unwinding) Thrown Object Returned address
struct cortex_m_cpu { register_t r0; // Remove? register_t r1; // Remove? register_t r2; // Remove? register_t …

0 points | 237 pages | 6.74 MB | 6 months ago
C++ High-Performance Parallel Programming and Optimization - Slides - Performance Optimization: Branchless Programming

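… more efficient? And if branchless code is more efficient, how do I optimize my program to make it branchless? Find out in this installment of the performance optimization series!
How branch prediction success and failure affect performance
Why sorting affects the branchy version so much
Why a pipeline is needed
• For efficiency, the inside of a CPU is actually a pipeline. The purpose of the pipeline is to parallelize a sequence of instructions that would otherwise run serially. To understand why a pipeline is needed, let us first turn it around and suppose there is no pipeline, and see what goes wrong.
• For example, take your task list for this morning, shown on the right. … + 15 + 30 + 20 = 85 minutes, right? Yes, but only if you do one thing at a time: for example, you stand next to the kettle and stare while the water boils, doing nothing, when you could wash your face and brush your teeth at the same time! The earliest CPUs worked like this too: while the ALU was computing, the instruction decode unit sat idle, and it resumed work only after the ALU had finished and written its result back to the registers. Very inefficient.
Task | Time | Resources used
Wash face | 5 min | eyes, mouth, hands
… | … | eyes
Eat | 30 min | mouth, hands
Use the toilet | 20 min | bottom
[Figure: timeline of overlapping tasks: wash face, brush teeth, boil water, eat, watch Bilibili, use the toilet]
Conditional jump instructions
• Running tasks that do not use the same resources at the same time is the original idea behind the CPU pipeline. But the ideal is one thing and reality is another: for a program, instructions are not just a series of simple tasks. Sometimes we need to make a decision that determines which task to execute. This is a branch, and in assembly language it appears as a conditional jump instruction.
• …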
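As a small illustration of the conditional jump discussed above, the sketch below sums the elements at or above a threshold in two ways: with an if statement, and with a bit mask that removes the decision from the control flow. The function names are hypothetical and the code is not taken from the course; a compiler may also turn the first version into branchless machine code on its own.

```cpp
#include <cstdint>
#include <vector>

// Branchy version: the if typically compiles to a conditional jump, which
// the CPU has to predict. Sorted input makes the outcome highly predictable
// (a long run of "no" followed by a long run of "yes"); random input does not.
std::int64_t sum_if_branchy(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int x : data) {
        if (x >= threshold) {
            sum += x;
        }
    }
    return sum;
}

// Branchless version: the comparison yields 0 or 1, and negating it gives a
// mask of all zero bits or all one bits. ANDing keeps x or replaces it by 0.
std::int64_t sum_if_branchless(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int x : data) {
        const std::int64_t mask = -static_cast<std::int64_t>(x >= threshold);
        sum += static_cast<std::int64_t>(x) & mask;
    }
    return sum;
}
```

Both functions return the same sum for any input; only the way the decision is expressed differs, which is what makes the second version insensitive to the order of the data.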

0 points | 47 pages | 8.45 MB | 1 year ago
