Parallel Algorithms Course, Lecture 07
Tensor Cores
math limited, memory limited
GPU
L1 caches are independent (one per SM); the L2 cache is shared by all SMs
expected case (math-limited): \(T_\text{math}\ge T_\text{mem}\), i.e.
\[ \frac{\#\text{ops}}{\text{math bandwidth}}\ge \frac{\#\text{bytes}}{\text{mem bandwidth}} \]
Arithmetic intensity \(= \frac{\#\text{ops}}{\#\text{bytes}}\)
math-limited when arithmetic intensity \(\ge\frac{\text{math bandwidth}}{\text{mem bandwidth}}\)
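Worked example (a sketch; the hardware numbers are representative, roughly a V100: \(125\) TFLOP/s of fp16 Tensor Core math and \(900\) GB/s of memory bandwidth). For an \(n\times n\times n\) fp16 GEMM,
\[ \#\text{ops} = 2n^3, \qquad \#\text{bytes} \approx 2\cdot 3n^2 = 6n^2, \]
so the arithmetic intensity is \(\approx \frac{2n^3}{6n^2} = \frac{n}{3}\) ops/byte. The machine ratio is \(\frac{125\cdot 10^{12}}{900\cdot 10^9}\approx 139\) ops/byte, so the GEMM becomes math-limited roughly when \(n/3\ge 139\), i.e. \(n \gtrsim 417\).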
key ideas
split the work into many independent pieces (more parallelism)
use the cache: shared memory acts as a user-managed cache for data reuse
Tensor Cores: accelerate dot products and matrix multiplies; operand sizes should ideally be a multiple of 128 bits (e.g. a multiple of 8 fp16 elements)
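A minimal Tensor Core sketch using CUDA's WMMA API (`nvcuda::wmma` is the public interface; the \(16\times16\times16\) fp16 shape is one of the natively supported fragment sizes). A single warp multiplies one \(16\times16\) tile pair and accumulates in fp32:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes C = A * B for a single 16x16 tile (fp16 in, fp32 out).
// A, B, C are device pointers to row-major 16x16 matrices.
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);          // zero the accumulator
    wmma::load_matrix_sync(aFrag, A, 16);      // 16 = leading dimension
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);   // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(C, cFrag, 16, wmma::mem_row_major);
}
```

Launch as `wmma_16x16x16<<<1, 32>>>(dA, dB, dC);` with one warp; requires compute capability 7.0+ (compile with `-arch=sm_70` or newer).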
how to split
key: tiling
for matrix multiplication, read one tile at a time into the L1 cache (or shared memory) and reuse it (see the sketch below)
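A minimal tiling sketch in CUDA (names and the \(16\times16\) tile size are illustrative), assuming square \(N\times N\) row-major matrices with \(N\) a multiple of the tile size. Each block stages one tile of A and one tile of B in shared memory, then reuses every staged element TILE times:

```cuda
#define TILE 16

// C = A * B for N x N row-major matrices; assumes N % TILE == 0.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];     // tile of A, staged once per phase
    __shared__ float Bs[TILE][TILE];     // tile of B, staged once per phase

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // cooperative load: every thread brings in one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                 // tiles fully staged before use

        for (int k = 0; k < TILE; ++k)   // each staged element is read TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done with the tiles before overwriting
    }
    C[row * N + col] = acc;
}
```

Each global-memory element now serves TILE multiply-adds instead of one, raising the kernel's arithmetic intensity by roughly a factor of TILE.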
quantization: sizes should match the hardware's tile sizes
running time increases stepwise as the size increases (it jumps whenever an extra tile or wave of work is needed; see the sketch below)
small batch \(\to\) low GPU utilization
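A small sketch of why the running time is stepwise (TILE = 128 is a hypothetical tile width): work is issued in whole tiles, so the tile count jumps exactly at multiples of the tile size:

```cuda
#include <cstdio>

int main() {
    const int TILE = 128;                     // hypothetical tile width
    int sizes[] = {127, 128, 129, 256, 257};
    for (int m : sizes) {
        int tiles = (m + TILE - 1) / TILE;    // ceil(m / TILE): whole tiles only
        printf("M = %3d -> %d tile(s), %3d padded columns wasted\n",
               m, tiles, tiles * TILE - m);
    }
    return 0;
}
```

Going from M = 128 to M = 129 doubles the tile count (and roughly the time), while everything between 129 and 256 costs the same.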
post-order traversal
DFS, post-order
how to parallelize
hint: list-ranking
tree \(\to\) list: Euler Tour; assign a value to each tour edge
initialization: upward edge \(-1\), downward edge \(+1\)
list scan (prefix sum along the linked list, i.e. list ranking; see the pointer-jumping sketch below)
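A sketch of the scan step as pointer jumping (Wyllie's list ranking), assuming the tour is stored as a successor array `next` (\(-1\) marks the tail) with a weight `val[i]` per element; all names here are illustrative. Each round lets every element absorb its successor's value and then skip over it, so \(O(\log n)\) fully parallel rounds cover the whole list; this variant produces, for each element, the sum of the weights from it to the tail:

```cuda
#include <cuda_runtime.h>
#include <utility>

// One pointer-jumping round: add in the successor's value, then skip past it.
// Double buffering (separate in/out arrays) avoids read/write races.
__global__ void jumpRound(const int *nextIn, const int *valIn,
                          int *nextOut, int *valOut, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int nx = nextIn[i];
    if (nx >= 0) {                    // successor exists: absorb its value
        valOut[i]  = valIn[i] + valIn[nx];
        nextOut[i] = nextIn[nx];      // jump: point past the successor
    } else {                          // already at the tail: copy through
        valOut[i]  = valIn[i];
        nextOut[i] = -1;
    }
}

// Runs ceil(log2 n) rounds; returns the buffer holding the final sums.
int *listScan(int *next, int *val, int *nextTmp, int *valTmp, int n) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int span = 1; span < n; span *= 2) {
        jumpRound<<<blocks, threads>>>(next, val, nextTmp, valTmp, n);
        cudaDeviceSynchronize();
        std::swap(next, nextTmp);     // this round's output feeds the next round
        std::swap(val, valTmp);
    }
    return val;
}
```

These are suffix sums along the list; the prefix sums the initialization above calls for can be recovered as (total) \(-\) (suffix sum) \(+\) (own weight), or by reversing the successor pointers before scanning.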
list \(\to\) tree: map the scan results back to the tree nodes
adjacency list
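An end-to-end sequential sketch of the pipeline, assuming the rooted tree arrives as adjacency lists (the 5-node example tree is illustrative). It uses the textbook tour rule \(\text{succ}(u\to v) = v \to (\text{node after } u \text{ in } v\text{'s adjacency list, cyclically})\); the single walk at the end stands in for the list scans, which would run as pointer jumping on a GPU. For reference: scanning the \(\pm1\) weights above yields node depths, while post-order numbers come from counting upward (child \(\to\) parent) edges, which is what the walk does:

```cuda
#include <cstdio>
#include <vector>
using std::vector;

int main() {
    // Example tree rooted at 0:  0-1, 0-2, 1-3, 1-4
    int n = 5;
    vector<vector<int>> adj = {{1, 2}, {0, 3, 4}, {0}, {1}, {1}};

    // Number the 2(n-1) directed edges: id off[u]+i means u -> adj[u][i].
    vector<int> off(n + 1, 0);
    for (int u = 0; u < n; ++u) off[u + 1] = off[u] + (int)adj[u].size();
    int m = off[n];
    vector<int> eu(m), ev(m);
    for (int u = 0; u < n; ++u)
        for (int i = 0; i < (int)adj[u].size(); ++i) {
            eu[off[u] + i] = u;
            ev[off[u] + i] = adj[u][i];
        }

    // Euler-tour successor: succ(u->v) = v -> (node after u in adj[v], cyclic).
    // Every edge is handled independently, hence trivially in parallel.
    vector<int> succ(m);
    for (int e = 0; e < m; ++e) {
        int u = eu[e], v = ev[e], j = 0;
        while (adj[v][j] != u) ++j;              // position of u in adj[v]
        succ[e] = off[v] + (j + 1) % (int)adj[v].size();
    }

    // Walk the circuit once from the root's first outgoing edge. This loop
    // stands in for the list scans: a downward edge enters a new node, an
    // upward edge closes a subtree and emits a post-order number.
    vector<bool> seen(n, false);
    vector<int> post(n, -1);
    seen[0] = true;
    int counter = 0, start = off[0], e = start;
    do {
        int u = eu[e], v = ev[e];
        if (!seen[v]) seen[v] = true;            // downward: enter v
        else post[u] = counter++;                // upward: done with u
        e = succ[e];
    } while (e != start);
    post[0] = counter;                           // root finishes last

    for (int v = 0; v < n; ++v) printf("post[%d] = %d\n", v, post[v]);
    return 0;
}
```

For the example tree this prints post = [4, 2, 3, 0, 1]: the leaves 3 and 4 finish first and the root finishes last, as a DFS post-order requires.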