-
## How and When You Should Measure CPU Overhead of eBPF Programs
eBPF Summit
## Why should I profile eBPF programs?
## CI variance tracking
name
TCPLatency/eBPF/kprobe/sys_bind
TCPLatency/eB
- Benchmarking + CI/CD
- Sampling profiler in production
## How does it work?
- Adds ~20ns of overhead per run
// pseudo-code
if (bpf_stats_enabled) {
    u64 start = sched_clock();
    run_ebpf_program();
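The pseudo-code above is cut off in this excerpt. What follows is a minimal user-space sketch of the same idea, assuming the accounting simply adds the elapsed nanoseconds and a run counter to per-program statistics. The names `prog_stats`, `now_ns`, `run_with_stats` and `sample_prog` are hypothetical, and `clock_gettime(CLOCK_MONOTONIC)` is only a stand-in for the kernel's `sched_clock()`.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-program counters: number of runs and total run time. */
struct prog_stats {
    uint64_t run_cnt;
    uint64_t run_time_ns;
};

/* User-space stand-in for sched_clock(): monotonic time in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Run prog(ctx); read the clock only when stats collection is enabled. */
int run_with_stats(int (*prog)(void *), void *ctx,
                   bool stats_enabled, struct prog_stats *stats)
{
    if (!stats_enabled)
        return prog(ctx);              /* fast path: no clock reads */

    uint64_t start = now_ns();
    int ret = prog(ctx);
    stats->run_time_ns += now_ns() - start;
    stats->run_cnt++;
    return ret;
}

/* Hypothetical workload used only to exercise the wrapper. */
int sample_prog(void *ctx)
{
    int *counter = ctx;
    return ++*counter;
}
```

In this sketch the average cost per run is `run_time_ns / run_cnt`. The two clock reads around the call are the kind of fixed per-run cost the slide's "~20ns" figure refers to, and they are skipped entirely when stats are disabled.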
-
## Hidden Overhead of a Function API
## OLEKSANDR BACHERIKOV
## What we do at Snap with C++

Neural style transfer
“... modern CPU instruction cache. As a result, the hardware spends a considerable amount of processing time, nearly 30 percent in many cases, getting an instruction stream from memory to the CPU.”
## Disclaimer: rbx
| ldp x29, x30, [sp], #32 | | ret 0 |
## Negative-overhead abstraction!
## C++ Core Guidelines
F.20: For “out” output values, prefer return values to output parameters
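F.20 is a C++ guideline, but the contrast it draws can be sketched in plain C. The example below is an illustration under assumptions, not code from the talk: the `minmax` type and both functions are hypothetical, and both require `n >= 1`.

```c
#include <stdint.h>

/* Hypothetical small result type: 8 bytes, returned in registers
 * on common 64-bit ABIs. */
struct minmax {
    int32_t min;
    int32_t max;
};

/* Out-parameter style: results are written through pointers. Because
 * min and max may alias elements of v, the compiler may have to store
 * and reload them inside the loop. */
void minmax_out(const int32_t *v, int n, int32_t *min, int32_t *max)
{
    *min = *max = v[0];
    for (int i = 1; i < n; i++) {
        if (v[i] < *min) *min = v[i];
        if (v[i] > *max) *max = v[i];
    }
}

/* Return-value style: results live in a local and are returned by value,
 * so they can stay in registers for the whole loop. */
struct minmax minmax_ret(const int32_t *v, int n)
{
    struct minmax r = { v[0], v[0] };
    for (int i = 1; i < n; i++) {
        if (v[i] < r.min) r.min = v[i];
        if (v[i] > r.max) r.max = v[i];
    }
    return r;
}
```

Whether the return-value form is actually faster depends on the compiler, the ABI and the size of the result, so compare the generated assembly for your target before relying on it.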
-
## GCN
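## How to Emulate a CPU in Go

Meng Zhuo (蒙卓)
Huawei 2012 Laboratories
Engineer
## Becoming Pangu?
Make the inhabitants of this world (the programs) unable to notice
that their world was created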
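## Contents
• A history of computing: from hardware computation to the von Neumann architecture
• Building a virtual world
• A 6502 assembler and linker
• Future goals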
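Programmer in 1970
CPU: 80 kHz, single core
Memory: 64 KB, hand-woven magnetic core

I'm the one who sent you to the moon
Programmer in 2021
CPU: 2,400,000 kHz, 4 cores
Memory: 8,000,000 KB, DDR3
Why do programmers seem weaker nowadays?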
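· Because we live in the best of times and the worst of times
• Abstractions are numerous and nested layer upon layer
• Hardware is overly complex
• Software rests on complex concepts such as operating systems
· Genuinely fast and cheap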
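## Emulating a CPU in Go
• How do you implement a von Neumann architecture CPU in Go?
• Simple: one loop + one big array
Fetch the current instruction, execute it, move on to the next instruction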
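The "one loop + one big array" idea can be sketched as follows. This is a minimal illustration in C, not the talk's Go implementation: the `cpu` struct and `cpu_run` are hypothetical, only five 6502 opcodes are handled, BRK is treated as a plain halt, and all status flags except carry are ignored.

```c
#include <stdint.h>

/* Hypothetical machine state: one flat 64 KB memory plus a few registers. */
struct cpu {
    uint8_t  mem[65536];
    uint16_t pc;     /* program counter */
    uint8_t  a;      /* accumulator */
    uint8_t  carry;  /* carry flag (0 or 1); other flags are omitted */
};

/* Fetch-decode-execute until BRK. Returns the number of instructions
 * executed, or -1 on an opcode outside the tiny subset handled here. */
int cpu_run(struct cpu *c)
{
    int executed = 0;
    for (;;) {
        uint8_t opcode = c->mem[c->pc++];          /* fetch */
        executed++;
        switch (opcode) {                          /* decode + execute */
        case 0x18:                                 /* CLC */
            c->carry = 0;
            break;
        case 0xA9:                                 /* LDA #imm */
            c->a = c->mem[c->pc++];
            break;
        case 0x69: {                               /* ADC #imm (binary mode) */
            unsigned sum = c->a + c->mem[c->pc++] + c->carry;
            c->a = (uint8_t)sum;
            c->carry = sum > 0xFF;
            break;
        }
        case 0x85:                                 /* STA zero page */
            c->mem[c->mem[c->pc++]] = c->a;
            break;
        case 0x00:                                 /* BRK: used here as halt */
            return executed;
        default:
            return -1;
        }
    }
}
```

Loading the bytes `18 A9 02 69 03 85 10 00` at some address, pointing `pc` at it and calling `cpu_run` clears carry, computes 2 + 3 in the accumulator and stores the result at zero-page address `0x10`.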
## Emulation target - MOS 6502
• Introduced in 1975
• The MOS 6502 was widely used
· Documentation is plentiful and easy to obtain
-
Programs for CPU and GPU
## THOMAS MEJSTRIK
## DIMETOR

FWF
## Bridging the Gap: Writing Portable Programs for CPU and GPU SYCL, ROCm, Vulkan, ...
☐ You can tell me about afterwards
## Why write programs for CPU and GPU
☐ Difference CPU/GPU
Algorithms are designed differently
☐ Latency/Throughput
☐ Memory bandwidth
☐ radar” - Problem
☐ Why it makes sense?
☐ Scope of the talk
## Why write programs for CPU and GPU
☐ Difference CPU/GPU
☐ Why it makes sense?
Library/Framework developers
☐ Embarrassingly parallel
-
## Designing an ultra low-overhead multithreading runtime for Nim
Mamy Ratsimbazafy
mamy@numforge.co
## Hello!
## I am Mamy Ratsimbazafy
During the day blockchain/Ethereum 2 developer (in Nim)
During ...
Sources of overhead and runtime design
Minimum viable runtime plan in a weekend
## Understanding the design space
Concurrency vs parallelism, latency vs throughput
Cooperative vs preemptive, IO vs CPU
## Parallelism
- Atomics
- Transactional memory
- Message-passing
## IO-tasks vs CPU-tasks
## IO-tasks:
- Latency optimized
- async/await
## CPU-tasks:
- Throughput optimized
- spawn/sync
Doing both in the same
-
## Is std::mdspan a Zero-overhead Abstraction?
## OLEKSANDR BACHERIKOV
## Is std::mdspan a Zero-overhead Abstraction?
Oleksandr Bacherikov
Snap Inc
## What is std::mdspan?
It's a view
Wrong!
std::layout_stride only supports the case where all strides are specified at runtime.
If we target zero overhead, we have to specify one of the strides as 1 at compile time.
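The stride point can be illustrated without mdspan at all. The sketch below is a plain C analogue, assuming a row-major 2D view over a flat array; both functions are hypothetical, and in C++ the unit stride would be fixed by the layout type instead of being hard-coded.

```c
#include <stddef.h>

/* All strides supplied at run time (the layout_stride situation):
 * every access multiplies by a value the compiler cannot see. */
double sum_row_runtime(const double *data, size_t row, size_t cols,
                       size_t row_stride, size_t col_stride)
{
    double s = 0.0;
    for (size_t j = 0; j < cols; j++)
        s += data[row * row_stride + j * col_stride];
    return s;
}

/* Innermost stride fixed to 1 in the source: the elements of a row
 * are known to be contiguous, and one multiply disappears. */
double sum_row_unit(const double *data, size_t row, size_t cols,
                    size_t row_stride)
{
    double s = 0.0;
    for (size_t j = 0; j < cols; j++)
        s += data[row * row_stride + j];
    return s;
}
```

Both functions return the same result when `col_stride` is 1. The difference is only in what the optimizer is allowed to assume, which is why a stride known at compile time generally gives it more room.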
What does the Standard offer us instead?
-
2.7 (Old) Startup
CPU Usage

2.8 (New) Startup
CPU Usage
## Startup Breakdown
Enumerate asset
## High CPU Time
Single threaded code
Inefficient algorithms
Branch misprediction, cache misses
Spin locks
## High CPU Time
Single threaded code
Inefficient algorithms
Grouping: Function / Call Stack
| Function / Call Stack | CPU Time | Wait Time by Utilization | Wait Count | Module |
-
Accelerate Istio-CNI with eBPF
Xu Yizhou & Guo Ruijing
## Agenda
• Istio-CNI
• TCP/IP stack overhead between sidecar and service
• Background knowledge of eBPF
• Acceleration for Inbound/Outbound/Envoy
## TCP/IP stack overhead between sidecar and service
Sidecar traffic incurs overhead in 3 scopes
• Inbound
• Outbound
• Envoy to Envoy(same host)
-
technical details and surprising conclusions that virtual functions can actually be faster. Since CPU architectures are mentioned, I'd expect to see deep assembly profiling.
## Ok, some assembly is
## But I have another computer
## Different CPUs
## Laptop:
Model name: Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
Thread(s) per core: 2
Core(s) per socket: 4
Stepping: 12
## Desktop:
Thread(s) per core:
## Conclusions
## Relevant factors
• CPU manufacturer
• CPU version
• Precise code path
• Temperature(?)
• OS interrupts(?)
• Compiler optimization
-
TVM@AliOS
## PRESENTATION AGENDA
☑ TVM @ AliOS Overview
TVM @ AliOS ARM CPU
TVM @ AliOS Hexagon DSP
TVM @ AliOS Intel GPU
☑ Misc
## PART ONE TVM @ AliOS Overview
## AliOS Overview
• AliOS (www.alios
Driving intelligence in all things
## PART TWO AliOS TVM @ ARM CPU
## AliOS TVM @ ARM CPU
• Support TFLite (Open Source and Upstream Master)
• Optimize on INT8 & FP32
## AliOS TVM @ ARM CPU INT8
Convolution
• NHWC layout
• AliOS TVM @ ARM CPU INT8
TVM / QNNPACK Speed Up @ Mobilenet V2 @ rasp 3b+ AARCH64

## AliOS TVM @ ARM CPU INT8
Depthwise