parallel algorithm course 07

parallel algorithm, note

Author: Furyton
Published: April 27, 2022

tensor core

a kernel is either math limited (compute bound) or memory limited (bandwidth bound)


GPU

each SM has its own (independent) L1 cache; the L2 cache is shared across all SMs

expected (to keep the GPU busy): \(T_\text{math}\ge T_\text{mem}\)

\[ T_\text{math}=\frac{\#\text{ops}}{\text{math bandwidth}}\ge \frac{\#\text{bytes}}{\text{mem bandwidth}}=T_\text{mem} \]

Arithmetic intensity \(=\frac{\#\text{ops}}{\#\text{bytes}}\)

a kernel is math limited when its arithmetic intensity \(\ge\frac{\text{math bandwidth}}{\text{mem bandwidth}}\)
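A minimal sketch of this check; the bandwidth numbers are illustrative assumptions (roughly A100-class), not figures from the course:

```python
# Sketch: classify a kernel as math- or memory-limited via arithmetic intensity.
MATH_BW = 312e12   # math bandwidth: FP16 tensor-core FLOP/s (assumed)
MEM_BW = 2.0e12    # memory bandwidth: bytes/s (assumed)

def is_math_limited(num_ops: float, num_bytes: float) -> bool:
    """Math limited when ops/bytes >= math_bw/mem_bw, i.e. T_math >= T_mem."""
    arithmetic_intensity = num_ops / num_bytes
    return arithmetic_intensity >= MATH_BW / MEM_BW

# M x N x K FP16 GEMM: 2*M*N*K FLOPs, moves roughly 2*(M*K + K*N + M*N) bytes
# (2 bytes per FP16 element, each matrix touched once).
M = N = K = 4096
ops = 2 * M * N * K
bytes_moved = 2 * (M * K + K * N + M * N)
print(ops / bytes_moved, is_math_limited(ops, bytes_moved))  # large GEMM: math limited
```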

key ideas

split the work into independent pieces (more independence \(\to\) more parallelism)

user-managed cache (shared memory)

Tensor Cores: accelerate dot products and matrix multiplies; sizes should preferably be a multiple of 128 bits
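A minimal sketch of the alignment idea, assuming FP16 (so 128 bits = 8 elements); `pad_for_tensor_cores` is a hypothetical helper, not a library call:

```python
import numpy as np

# Sketch: pad the inner dimension so each row is a multiple of 128 bits.
# For FP16 (16 bits per element), 128 bits = 8 elements; the choice of 8
# follows from that arithmetic, not from a specific library requirement.
def pad_for_tensor_cores(x: np.ndarray, multiple: int = 8) -> np.ndarray:
    rows, cols = x.shape
    padded_cols = ((cols + multiple - 1) // multiple) * multiple
    out = np.zeros((rows, padded_cols), dtype=x.dtype)
    out[:, :cols] = x
    return out

a = np.ones((4, 13), dtype=np.float16)
print(pad_for_tensor_cores(a).shape)  # (4, 16): 16 elements * 16 bits is a multiple of 128
```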

how to split

key: tile

for matrix multiplication: read one tile at a time into the L1 cache / shared memory
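A minimal CPU sketch of tiled multiplication, assuming a tile size of 32; on a GPU each tile would be staged into shared memory rather than sliced in Python:

```python
import numpy as np

# Sketch of tiled matrix multiplication: process one tile at a time so the
# working set fits in fast memory (L1 / shared memory on a GPU).
# TILE = 32 is an arbitrary assumption.
def tiled_matmul(A: np.ndarray, B: np.ndarray, TILE: int = 32) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, TILE):
        for j0 in range(0, N, TILE):
            for k0 in range(0, K, TILE):
                # Each (TILE x TILE) sub-block is loaded once per pass and
                # reused TILE times, raising arithmetic intensity compared
                # with reading operands element by element.
                C[i0:i0+TILE, j0:j0+TILE] += (
                    A[i0:i0+TILE, k0:k0+TILE] @ B[k0:k0+TILE, j0:j0+TILE]
                )
    return C

A = np.random.rand(64, 96); B = np.random.rand(96, 80)
assert np.allclose(tiled_matmul(A, B), A @ B)
```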

quantization: sizes should match the hardware (tile and wave granularity)

running time increases stagewise (a step function) as the size increases

small batch \(\to\) low GPU utilization
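A small sketch of why runtime is stagewise: work is issued in whole tiles, so cost jumps each time a dimension crosses a tile boundary. `TILE = 128` and the tile-count cost model are assumptions:

```python
import math

# Sketch of tile quantization: a GEMM is executed in TILE-sized blocks, so the
# number of tiles (and hence runtime) is a step function of the problem size.
TILE = 128

def num_tiles(m: int, n: int) -> int:
    return math.ceil(m / TILE) * math.ceil(n / TILE)

for n in [127, 128, 129, 255, 256, 257]:
    print(n, num_tiles(n, n))  # stagewise: n = 129 costs as many tiles as n = 256
```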

post-order traversal

DFS, post-order

how to parallelize

hint: list-ranking

  1. tree \(\to\) list: build an Euler Tour (each tree edge contributes a downward point and an upward point)

  2. initialization: weight each upward point \(-1\) and each downward point \(+1\)

  3. list scan (parallel prefix sum over the weights)

  4. list \(\to\) tree: map the scan results back to the nodes (see the sketch after this list)
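A sequential sketch of steps 1–4, assuming a `children` list per node; the prefix sum in step 3 is the part a parallel machine would do with a scan / list ranking. Note: with the \(+1\)/\(-1\) weights above the scan yields node depths; to get post-order numbers the sketch instead weights upward points \(1\) and downward points \(0\):

```python
def postorder_via_euler_tour(children, root=0):
    # 1. tree -> list: Euler tour as a sequence of (node, direction) points.
    tour = []
    def dfs(u):                       # sequential stand-in for tour construction
        for v in children[u]:
            tour.append((v, "down"))  # edge parent -> v
            dfs(v)
            tour.append((v, "up"))    # edge v -> parent
    dfs(root)                         # the root has no upward point
    # 2. initialization: upward point -> 1, downward point -> 0.
    weights = [1 if d == "up" else 0 for (_, d) in tour]
    # 3. list scan (inclusive prefix sum; a parallel scan in the real algorithm).
    prefix, s = [], 0
    for w in weights:
        s += w
        prefix.append(s)
    # 4. list -> tree: a node's post-order index is the scan value at its
    # upward point; the root comes last.
    n = len(children)
    order = [0] * n
    for (v, d), p in zip(tour, prefix):
        if d == "up":
            order[v] = p - 1          # 0-based post-order index
    order[root] = n - 1
    return order

#        0
#       / \
#      1   2
#     /
#    3
children = [[1, 2], [3], [], []]
print(postorder_via_euler_tour(children))  # [3, 1, 2, 0]: visit 3, 1, 2, then 0
```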

adjacency list
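Presumably the flat-array (CSR-style) adjacency list commonly used in parallel settings; that the course means CSR specifically is my assumption:

```python
# Sketch of a CSR-style (compressed sparse row) adjacency list: two flat
# arrays instead of per-node Python lists, which parallel algorithms prefer.
edges = [(0, 1), (0, 2), (1, 3)]          # directed edges of the example tree
n = 4
deg = [0] * n
for u, _ in edges:
    deg[u] += 1
offsets = [0] * (n + 1)                   # offsets[u]..offsets[u+1] is u's slice
for u in range(n):
    offsets[u + 1] = offsets[u] + deg[u]
targets = [0] * len(edges)
fill = offsets[:n]                        # next free slot per vertex
for u, v in edges:
    targets[fill[u]] = v
    fill[u] += 1
print(offsets, targets)  # [0, 2, 3, 3, 3] [1, 2, 3]
```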