Parallel Algorithms Course, Lecture 07
Tensor Cores
math limited, memory limited
GPU
L1 caches are independent (one per SM); the L2 cache is shared by all SMs
expected case (math-limited): \(T_\text{math}\ge T_\text{mem}\), i.e.
\[ \frac{\#\text{ops}}{\text{math bandwidth}}\ge \frac{\#\text{bytes}}{\text{mem bandwidth}} \]
Arithmetic intensity \(= \frac{\#\text{ops}}{\#\text{bytes}}\)
math-limited when arithmetic intensity \(\ge\frac{\text{math bandwidth}}{\text{mem bandwidth}}\)
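Worked example (a sketch; the hardware numbers are representative, roughly a V100: \(125\) TFLOP/s of fp16 Tensor Core math and \(900\) GB/s of memory bandwidth). For an \(n\times n\times n\) fp16 GEMM,
\[ \#\text{ops} = 2n^3, \qquad \#\text{bytes} \approx 2\cdot 3n^2 = 6n^2, \]
so the arithmetic intensity is \(\approx \frac{2n^3}{6n^2} = \frac{n}{3}\) ops/byte. The machine ratio is \(\frac{125\cdot 10^{12}}{900\cdot 10^9}\approx 139\) ops/byte, so the GEMM becomes math-limited roughly when \(n/3\ge 139\), i.e. \(n \gtrsim 417\).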
key ideas
split the work into many independent pieces (more parallelism)
use the cache: shared memory acts as a user-managed cache for data reuse
Tensor Cores: accelerate dot products and matrix multiplies; operand sizes should ideally be a multiple of 128 bits (e.g. a multiple of 8 fp16 elements)
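A minimal Tensor Core sketch using CUDA's WMMA API (`nvcuda::wmma` is the public interface; the \(16\times16\times16\) fp16 shape is one of the natively supported fragment sizes). A single warp multiplies one \(16\times16\) tile pair and accumulates in fp32:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes C = A * B for a single 16x16 tile (fp16 in, fp32 out).
// A, B, C are device pointers to row-major 16x16 matrices.
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);          // zero the accumulator
    wmma::load_matrix_sync(aFrag, A, 16);      // 16 = leading dimension
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);   // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(C, cFrag, 16, wmma::mem_row_major);
}
```

Launch as `wmma_16x16x16<<<1, 32>>>(dA, dB, dC);` with one warp; requires compute capability 7.0+ (compile with `-arch=sm_70` or newer).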
how to split
key: tiling
for matrix multiplication, read one tile at a time into the L1 cache (or shared memory) and reuse it (see the sketch below)
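A minimal tiling sketch in CUDA (names and the \(16\times16\) tile size are illustrative), assuming square \(N\times N\) row-major matrices with \(N\) a multiple of the tile size. Each block stages one tile of A and one tile of B in shared memory, then reuses every staged element TILE times:

```cuda
#define TILE 16

// C = A * B for N x N row-major matrices; assumes N % TILE == 0.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];     // tile of A, staged once per phase
    __shared__ float Bs[TILE][TILE];     // tile of B, staged once per phase

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // cooperative load: every thread brings in one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                 // tiles fully staged before use

        for (int k = 0; k < TILE; ++k)   // each staged element is read TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done with the tiles before overwriting
    }
    C[row * N + col] = acc;
}
```

Each global-memory element now serves TILE multiply-adds instead of one, raising the kernel's arithmetic intensity by roughly a factor of TILE.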
quantization: sizes should match the hardware's tile sizes
running time increases stepwise as the size increases (it jumps whenever an extra tile or wave of work is needed; see the sketch below)
small batch \(\to\) low GPU utilization
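A small sketch of why the running time is stepwise (TILE = 128 is a hypothetical tile width): work is issued in whole tiles, so the tile count jumps exactly at multiples of the tile size:

```cuda
#include <cstdio>

int main() {
    const int TILE = 128;                     // hypothetical tile width
    int sizes[] = {127, 128, 129, 256, 257};
    for (int m : sizes) {
        int tiles = (m + TILE - 1) / TILE;    // ceil(m / TILE): whole tiles only
        printf("M = %3d -> %d tile(s), %3d padded columns wasted\n",
               m, tiles, tiles * TILE - m);
    }
    return 0;
}
```

Going from M = 128 to M = 129 doubles the tile count (and roughly the time), while everything between 129 and 256 costs the same.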
post-order traversal
DFS, post-order
how to parallelize
hint: list-ranking
tree \(\to\) list: Euler Tour; assign a value to each tour edge
initialization: upward edge \(-1\), downward edge \(+1\)
list scan (prefix sum along the linked list, i.e. list ranking; see the pointer-jumping sketch below)
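A sketch of the scan step as pointer jumping (Wyllie's list ranking), assuming the tour is stored as a successor array `next` (\(-1\) marks the tail) with a weight `val[i]` per element; all names here are illustrative. Each round lets every element absorb its successor's value and then skip over it, so \(O(\log n)\) fully parallel rounds cover the whole list; this variant produces, for each element, the sum of the weights from it to the tail:

```cuda
#include <cuda_runtime.h>
#include <utility>

// One pointer-jumping round: add in the successor's value, then skip past it.
// Double buffering (separate in/out arrays) avoids read/write races.
__global__ void jumpRound(const int *nextIn, const int *valIn,
                          int *nextOut, int *valOut, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int nx = nextIn[i];
    if (nx >= 0) {                    // successor exists: absorb its value
        valOut[i]  = valIn[i] + valIn[nx];
        nextOut[i] = nextIn[nx];      // jump: point past the successor
    } else {                          // already at the tail: copy through
        valOut[i]  = valIn[i];
        nextOut[i] = -1;
    }
}

// Runs ceil(log2 n) rounds; returns the buffer holding the final sums.
int *listScan(int *next, int *val, int *nextTmp, int *valTmp, int n) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int span = 1; span < n; span *= 2) {
        jumpRound<<<blocks, threads>>>(next, val, nextTmp, valTmp, n);
        cudaDeviceSynchronize();
        std::swap(next, nextTmp);     // this round's output feeds the next round
        std::swap(val, valTmp);
    }
    return val;
}
```

These are suffix sums along the list; the prefix sums the initialization above calls for can be recovered as (total) \(-\) (suffix sum) \(+\) (own weight), or by reversing the successor pointers before scanning.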
list \(\to\) tree: map the scan results back to the tree nodes
adjacency list
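An end-to-end sequential sketch of the pipeline, assuming the rooted tree arrives as adjacency lists (the 5-node example tree is illustrative). It uses the textbook tour rule \(\text{succ}(u\to v) = v \to (\text{node after } u \text{ in } v\text{'s adjacency list, cyclically})\); the single walk at the end stands in for the list scans, which would run as pointer jumping on a GPU. For reference: scanning the \(\pm1\) weights above yields node depths, while post-order numbers come from counting upward (child \(\to\) parent) edges, which is what the walk does:

```cuda
#include <cstdio>
#include <vector>
using std::vector;

int main() {
    // Example tree rooted at 0:  0-1, 0-2, 1-3, 1-4
    int n = 5;
    vector<vector<int>> adj = {{1, 2}, {0, 3, 4}, {0}, {1}, {1}};

    // Number the 2(n-1) directed edges: id off[u]+i means u -> adj[u][i].
    vector<int> off(n + 1, 0);
    for (int u = 0; u < n; ++u) off[u + 1] = off[u] + (int)adj[u].size();
    int m = off[n];
    vector<int> eu(m), ev(m);
    for (int u = 0; u < n; ++u)
        for (int i = 0; i < (int)adj[u].size(); ++i) {
            eu[off[u] + i] = u;
            ev[off[u] + i] = adj[u][i];
        }

    // Euler-tour successor: succ(u->v) = v -> (node after u in adj[v], cyclic).
    // Every edge is handled independently, hence trivially in parallel.
    vector<int> succ(m);
    for (int e = 0; e < m; ++e) {
        int u = eu[e], v = ev[e], j = 0;
        while (adj[v][j] != u) ++j;              // position of u in adj[v]
        succ[e] = off[v] + (j + 1) % (int)adj[v].size();
    }

    // Walk the circuit once from the root's first outgoing edge. This loop
    // stands in for the list scans: a downward edge enters a new node, an
    // upward edge closes a subtree and emits a post-order number.
    vector<bool> seen(n, false);
    vector<int> post(n, -1);
    seen[0] = true;
    int counter = 0, start = off[0], e = start;
    do {
        int u = eu[e], v = ev[e];
        if (!seen[v]) seen[v] = true;            // downward: enter v
        else post[u] = counter++;                // upward: done with u
        e = succ[e];
    } while (e != start);
    post[0] = counter;                           // root finishes last

    for (int v = 0; v < n; ++v) printf("post[%d] = %d\n", v, post[v]);
    return 0;
}
```

For the example tree this prints post = [4, 2, 3, 0, 1]: the leaves 3 and 4 finish first and the root finishes last, as a DFS post-order requires.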