## Intel Intrinsics Guide
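• The _mm family of intrinsics is declared in a header file.
• The documentation for these intrinsics is available on this site:
• https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
• It describes in detail the assembly instruction each intrinsic maps to, easy-to-follow pseudocode, the latency and cost in clock cycles, and so on.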
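As a rough illustration of what calling these intrinsics looks like, here is a minimal sketch, assuming an x86 target with SSE enabled. <xmmintrin.h> is one header that declares the SSE _mm_*_ps intrinsics used here; add_arrays is a hypothetical helper name, not something taken from the slides.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#include <cassert>
#include <cstddef>

// Hypothetical helper: element-wise sum of two float arrays, four lanes at a time.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // unaligned store of 4 sums
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];  // scalar tail for the remaining 0 to 3 elements
    }
}
```

The unaligned load and store variants are used here so the function works for any pointer; the guide linked above lists the aligned counterparts and their costs.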
"std: " << is_aligned << std::endl;
}
}
## The main benefit of tbbmalloc: guaranteed 64-byte alignment
• The biggest benefit of tbb::cache_aligned_allocator is that the memory it allocates is always aligned to a cache line (64 bytes), so SIMD code can use _mm_load_ps instead of _mm_loadu_ps.
5; i++) {
std::vector<…, tbb::cache_aligned_allocator<…>> arr(n);
bool is_aligned = (uintptr_t)arr.data() % 64 == 0;
std::cout << "tbb: " << is_aligned <<
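The alignment check above can be tried without TBB installed. The sketch below swaps in a hypothetical CacheAlignedAllocator, built on C++17 aligned operator new, as a stand-in for tbb::cache_aligned_allocator (it is not TBB's implementation). It assumes a C++17 compiler and uses _mm_load_ps only when the compiler reports SSE support; is_aligned_64 and sum_aligned are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical stand-in for tbb::cache_aligned_allocator (assumption, not TBB code):
// every allocation is requested with 64-byte alignment.
template <class T>
struct CacheAlignedAllocator {
    using value_type = T;
    CacheAlignedAllocator() = default;
    template <class U>
    CacheAlignedAllocator(const CacheAlignedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(64)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(64));
    }
};
template <class T, class U>
bool operator==(const CacheAlignedAllocator<T>&, const CacheAlignedAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CacheAlignedAllocator<T>&, const CacheAlignedAllocator<U>&) { return false; }

// Same test as in the fragment above: is the address a multiple of 64?
bool is_aligned_64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Sums a 64-byte-aligned vector; with SSE available it uses the aligned load.
float sum_aligned(const std::vector<float, CacheAlignedAllocator<float>>& v) {
    assert(is_aligned_64(v.data()));
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= v.size(); i += 4) {
        // v.data() is 64-byte aligned and i is a multiple of 4 floats (16 bytes),
        // so the 16-byte alignment that _mm_load_ps requires always holds.
        acc = _mm_add_ps(acc, _mm_load_ps(v.data() + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < v.size(); ++i) total += v[i];  // scalar tail (or whole loop without SSE)
    return total;
}
```

With a default std::allocator the base address is not guaranteed to be 64-byte aligned, which is why code that cannot control the allocator falls back to _mm_loadu_ps.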
spall https://handmade.network/p/333/spall/
geiger https://github.com/david-grs/geiger
Intel IACA https://www.intel.com/content/www/us/en/developer/articles/tool/architecture-code-analyzer.html
## The Obligatory
• browsable log!
• filterable log!
• no thread sync needed!
[Screenshot: log viewer with per-thread checkboxes (tbb worker 52 to 56), a "Filter messages" box with a Clear button, a Time column, and log lines such as "TBB engage" and "reading runtime_texture_binary_header_v4015"]
## Plots
Xpedite https://github.com/morganstanley/Xpedite
PRESENTATION AGENDA
☑ TVM @ AliOS Overview
TVM @ AliOS ARM CPU
TVM @ AliOS Hexagon DSP
TVM @ AliOS Intel GPU
☑ Misc
## PART ONE TVM @ AliOS Overview
## AliOS Overview
• AliOS (www.alios.cn) is a newly …
r0 = #0;
jumpr r31
}
## PART FOUR AliOS TVM @ Intel GPU
## AliOS TVM @ Intel GPU
• Implement the schedule from scratch
• Leverage Intel Subgroup Extension
## Subgroups

## AliOS TVM @ Intel GPU
GEMM Hardware Efficiency @ Intel Apollo Lake GPU
… for example, the TBB package contains three components: tbb, tbbmalloc, and tbbmalloc_proxy.
• To avoid conflicts, each package has its own namespace, with names separated by :: (rather like C++).
• You can specify which components to use:
• find_package(TBB REQUIRED COMPONENTS tbb tbbmalloc)
• target_link_libraries(myexec PUBLIC TBB::tbb TBB::tbbmalloc)
## Third-party libraries - list of commonly used packages
1. fmt::fmt
2. spdlog::spdlog
3. range-v3::range-v3
4. TBB::tbb
5. OpenVDB::openvdb
6. Boost::iostreams
## Benchmarks:
## Test Environment
• Ubuntu 24.04
• GCC 13.2.0
• 13th Gen Intel(R) Core(TM) i9-13900HX
• Hyperthreading disabled
• All 16 “efficient” cores isolated
## Three established MPMC queues were benchmarked along with Work Contracts:
• Boost “lock free”
• TBB "concurrent_queue"
• MoodyCamel "ConcurrentQueue"
• Work Contracts
## Caution: Slideware
bench<…>(std::forward<…>(task), numThreads, numTasks);
bench<tbb_queue>(std::forward<…>(task), numThreads, numTasks);
bench(std::
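For readers who want to see what a harness of this shape might look like, here is a minimal sketch. It is not the benchmark from the slides: MutexQueue, BenchResult, and this bench signature are hypothetical, and a mutex-guarded std::queue stands in for the Boost, TBB, and MoodyCamel queues so that the example needs only the standard library.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical baseline MPMC queue: a std::queue guarded by one mutex.
template <class T>
class MutexQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(std::move(value));
    }
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return std::nullopt;
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }
private:
    std::mutex mutex_;
    std::queue<T> items_;
};

struct BenchResult {
    std::size_t tasksRun;
    std::chrono::nanoseconds elapsed;
};

// Hypothetical harness: numThreads workers each act as producer and consumer.
// Tickets 0..numTasks-1 are pushed through Queue and passed to task exactly once.
template <class Queue, class Task>
BenchResult bench(Task&& task, std::size_t numThreads, std::size_t numTasks) {
    Queue queue;
    std::atomic<std::size_t> produced{0};
    std::atomic<std::size_t> consumed{0};

    auto worker = [&] {
        while (consumed.load() < numTasks) {
            std::size_t ticket = produced.fetch_add(1);
            if (ticket < numTasks) queue.push(ticket);  // produce while tickets remain
            if (auto item = queue.try_pop()) {          // consume whatever is available
                task(*item);
                consumed.fetch_add(1);
            }
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < numThreads; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    auto stop = std::chrono::steady_clock::now();

    assert(numThreads == 0 || consumed.load() == numTasks);
    return {consumed.load(),
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

A call such as bench<MutexQueue<std::size_t>>(task, 4, 1000) runs 1000 tasks on 4 threads. Swapping the Queue template argument is what lets one harness compare several queue implementations, provided each is wrapped to expose the same push and try_pop interface.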