parallel algorithm course 10

parallel algorithm

note

Author

Furyton

Published

June 8, 2022

final exam

review

DAG
work-span analysis
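- ideal: work-efficient, \(T_1(n)=\Theta(T_s(n))\) where \(T_s\) is the best sequential time, and polylogarithmic span, \(T_\infty(n)=O(\log^k n)\) for some constant \(k\)
only the shared-memory model is considered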
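A minimal sketch of the two measures. It assumes unit cost per DAG node and a DAG given as a dict that maps each node to the nodes it depends on; the representation and the name `work_and_span` are illustrative, not from the course. Work \(T_1\) is the total number of nodes, and span \(T_\infty\) is the length of the longest dependency chain.

```python
from functools import lru_cache


def work_and_span(dag):
    """dag maps each node to the list of nodes it depends on (unit cost per node).

    Work T_1 = number of nodes; span T_inf = nodes on the longest dependency chain.
    """

    @lru_cache(maxsize=None)
    def depth(v):
        # longest chain that ends at v, counting v itself
        return 1 + max((depth(u) for u in dag[v]), default=0)

    work = len(dag)
    span = max((depth(v) for v in dag), default=0)
    return work, span


# Reduction tree over 8 inputs: x0..x7 -> 4 partial sums -> 2 partial sums -> total
dag = {f"x{i}": [] for i in range(8)}
dag.update({f"a{i}": [f"x{2 * i}", f"x{2 * i + 1}"] for i in range(4)})
dag.update({f"b{i}": [f"a{2 * i}", f"a{2 * i + 1}"] for i in range(2)})
dag["total"] = ["b0", "b1"]
print(work_and_span(dag))  # (15, 4)
```

For n = 8 inputs the tree has work 2n - 1 = 15 and span log2(n) + 1 = 4. This is the ideal shape: linear work, logarithmic span.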

algorithm:

associative operator (required by reduce and scan)

reduce: n -> 1
scan: n -> n
- compact: n -> m (m <= n), built on scan
list ranking: n -> n, but the input is a linked list; related graph problems: BFS, DFS, finding an independent set
sorting, sample sort
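The reduce/scan/compact primitives can be sketched sequentially. Each parallel round is written as an ordinary loop whose iterations are independent; on real hardware these would be the parallel-for loops. This is a minimal illustration, not the course's code. The names `tree_reduce`, `exclusive_scan` (a Blelloch-style up-sweep/down-sweep scan) and `compact` are hypothetical. The operator is only assumed to be associative, not commutative.

```python
import operator


def tree_reduce(xs, op=operator.add):
    """Pairwise tree reduction: ceil(log2 n) rounds, O(n) work."""
    vals = list(xs)
    if not vals:
        raise ValueError("reduce of empty sequence")
    while len(vals) > 1:
        # all pairs in one round are independent -> one parallel step
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])  # odd element passes through unchanged
        vals = nxt
    return vals[0]


def exclusive_scan(xs, op=operator.add, identity=0):
    """Work-efficient exclusive scan: O(n) work, O(log n) rounds."""
    n = len(xs)
    size = 1
    while size < n:
        size *= 2
    a = list(xs) + [identity] * (size - n)  # pad to a power of two
    d = 1
    while d < size:  # up-sweep: build partial sums in place
        for i in range(2 * d - 1, size, 2 * d):  # parallel-for
            a[i] = op(a[i - d], a[i])
        d *= 2
    a[size - 1] = identity
    d = size // 2
    while d >= 1:  # down-sweep: push prefixes back down the tree
        for i in range(2 * d - 1, size, 2 * d):  # parallel-for
            left = a[i - d]
            a[i - d] = a[i]  # left child gets the prefix of its parent
            a[i] = op(a[i], left)  # right child gets prefix + left subtree
        d //= 2
    return a[:n]


def compact(xs, keep):
    """n -> m: flag, scan the flags to get output slots, then scatter."""
    flags = [1 if keep(x) else 0 for x in xs]  # parallel map
    idx = exclusive_scan(flags)
    m = idx[-1] + flags[-1] if xs else 0
    out = [None] * m
    for i, x in enumerate(xs):  # parallel scatter: every write hits a distinct slot
        if flags[i]:
            out[idx[i]] = x
    return out


data = [3, 1, 7, 0, 4, 1, 6, 3]
print(tree_reduce(data))  # 25
print(exclusive_scan(data))  # [0, 3, 4, 11, 11, 15, 16, 22]
print(compact(data, lambda x: x > 2))  # [3, 7, 4, 6, 3]
```

In the down-sweep, the combine order `op(a[i], left)` keeps the result correct for non-commutative operators such as string concatenation.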

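List ranking can be sketched with pointer jumping (Wyllie's algorithm), which is one standard technique. The course may have used a different, work-optimal variant based on independent-set contraction. The sketch assumes the list is stored as a successor array `succ` with the tail pointing to itself; `list_rank` is a hypothetical name. Each round reads only the previous round's arrays, so every node can update in parallel. It needs O(log n) rounds but O(n log n) work, so it is not work-efficient.

```python
def list_rank(succ):
    """rank[i] = number of links from node i to the tail (tail: succ[i] == i)."""
    n = len(succ)
    nxt = list(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    # keep jumping until every node points directly at the tail
    while any(nxt[nxt[i]] != nxt[i] for i in range(n)):
        # both comprehensions read the previous round's rank/nxt (synchronous step)
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank


# list 0 -> 2 -> 1 -> 3 (tail), stored out of order
print(list_rank([2, 3, 1, 3]))  # [3, 1, 2, 0]
```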
goal:

optimal complexity
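more independence (++), less synchronization (--): synchronization has a cost, so don't use it too much; e.g. printing messages during a parallel section adds hidden synchronization on the output
e.g. histogram: use one lock per bucket instead of a single global lock
less symmetry (--): identical elements repeat the same decisions (repeated occurrences), so break the symmetry, e.g. by randomization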
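The per-bucket locking idea can be sketched with Python threads. The function name, the bucket function `x % num_buckets` and the even split of the data across threads are all assumptions for illustration. Because of CPython's GIL this shows the locking structure, not a real speedup.

```python
import threading


def parallel_histogram(data, num_buckets, num_threads=4):
    """Count elements per bucket; each bucket has its own lock."""
    counts = [0] * num_buckets
    locks = [threading.Lock() for _ in range(num_buckets)]

    def worker(chunk):
        for x in chunk:
            b = x % num_buckets  # hypothetical bucket function
            with locks[b]:  # threads contend only when they hit the same bucket
                counts[b] += 1

    step = (len(data) + num_threads - 1) // num_threads or 1
    threads = [threading.Thread(target=worker, args=(data[i:i + step],))
               for i in range(0, len(data), step)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


print(parallel_histogram(list(range(100)), 10))  # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```

A single global lock would serialize every update. With one lock per bucket, updates to different buckets can proceed at the same time.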

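One common randomized way to break symmetry is coin flipping in a linked list. Without extra information, neighbouring nodes run the same rule on identical-looking data, so a random coin decides which of them acts. The sketch assumes the successor-array representation (tail: `succ[i] == i`); `random_independent_set` is a hypothetical name. Node i joins the set if it flips heads and its successor flips tails. Two neighbours can never both join, because the node between them would have to be heads and tails at once. About a quarter of the nodes are selected in expectation.

```python
import random


def random_independent_set(succ, rng=random):
    """Return nodes chosen by one round of parallel coin flips; no two are adjacent."""
    n = len(succ)
    heads = [rng.random() < 0.5 for _ in range(n)]  # every node flips independently
    return [i for i in range(n)
            if succ[i] != i and heads[i] and not heads[succ[i]]]


succ = list(range(1, 10)) + [9]  # list 0 -> 1 -> ... -> 9 (tail)
print(random_independent_set(succ, random.Random(42)))
```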
thinking:

pattern

OpenACC

profile-driven programming

incremental programming

CPU -> GPU -> Unified memory -> data parallel -> loop -> blocking

blocking:

Analyse

Parallelize

Optimize

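data movement: manage host-device transfers manually
large matrix: sometimes we don't need to load all the elements, so transfer only the part that is used
loop mapping: tell the compiler how each loop maps onto the levels of parallelism (gang, worker, vector), e.g. the vector length
a vector length of 32*n maps to n warps (a warp is 32 threads)
blocking: pipeline parallelism
- split the work into input, compute, and output stages that overlap
- the compute stage can be split further and spread across multiple devices
tile: makes better use of the GPU's two-level cache
collapse: merge nested loops into a single, larger parallel loop
memory access pattern: keep accesses contiguous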
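The tile, collapse and access-pattern points can be illustrated with a blocked matrix multiply. This plain Python sketch shows the loop structure only, since Python gets no cache benefit from it. In OpenACC the corresponding transformations are requested with the `tile(...)` and `collapse(n)` clauses on a loop directive. The function name and the default tile size of 32 are assumptions.

```python
def matmul_tiled(A, B, tile=32):
    """Return C = A x B for row-major lists of lists, one output tile at a time."""
    n, m = len(A), len(B)
    p = len(B[0]) if m else 0
    C = [[0] * p for _ in range(n)]
    ti, tj = -(-n // tile), -(-p // tile)  # number of tiles along rows / columns (ceil)
    for t in range(ti * tj):  # collapsed (ii, jj) loop: one flat loop of independent tiles
        ii, jj = (t // tj) * tile, (t % tj) * tile
        for kk in range(0, m, tile):  # walk the shared dimension tile by tile
            for i in range(ii, min(ii + tile, n)):
                Ci, Ai = C[i], A[i]
                for k in range(kk, min(kk + tile, m)):
                    a, Bk = Ai[k], B[k]
                    for j in range(jj, min(jj + tile, p)):  # contiguous walk along B[k] and C[i]
                        Ci[j] += a * Bk[j]
    return C


print(matmul_tiled([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1))  # [[19, 22], [43, 50]]
```

Each output tile is independent, so the flattened `t` loop is the natural parallel loop. Inside a tile, one block of A and one block of B are reused while they are still in cache. The innermost loop runs along rows, which keeps the accesses contiguous.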

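The input/compute/output pipeline can be sketched with one thread per stage and bounded queues between the stages. While chunk k is being computed, chunk k+1 can be loaded and chunk k-1 written out. This is a CPU-side analogy under stated assumptions: the stage bodies and the name `run_pipeline` are hypothetical. On a GPU the same overlap is usually expressed with asynchronous copies and kernels, for example OpenACC's `async` clause.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(chunks, compute):
    """Three overlapping stages; results come out in input order."""
    q_in, q_out = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    results = []

    def load():  # stage 1: input (stands in for a host -> device copy)
        for c in chunks:
            q_in.put(c)
        q_in.put(_DONE)

    def work():  # stage 2: compute
        while (c := q_in.get()) is not _DONE:
            q_out.put(compute(c))
        q_out.put(_DONE)

    def store():  # stage 3: output (stands in for a device -> host copy)
        while (r := q_out.get()) is not _DONE:
            results.append(r)

    threads = [threading.Thread(target=f) for f in (load, work, store)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline([[1, 2], [3, 4], [5]], sum))  # [3, 7, 5]
```

The bounded queues (`maxsize=2`) keep the input stage from running arbitrarily far ahead of compute, much like a fixed number of device buffers.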
GPU hardware