parallel algorithm course 08
OpenACC
GPU memory is not shared with the CPU; data has to be moved explicitly
openacc.org
OpenACC best programming practices
three levels
- parallelize (express parallelism)
- manage data movement
- optimize loops
unified virtual memory: data movement is automatic, but slow
to manage data yourself, remove the "managed" suboption from the -ta compiler flag
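A sketch of the corresponding compiler invocations, assuming the PGI/NVIDIA HPC compiler (pgcc) and a Tesla target:
pgcc -acc -ta=tesla:managed -Minfo=accel main.c   # unified (managed) memory: automatic but slow
pgcc -acc -ta=tesla -Minfo=accel main.c           # explicit data management with data clauses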
data clauses
allocate with copy
- copyin(list): CPU \(\to\) GPU at region entry
- copyout(list): GPU \(\to\) CPU at region exit
- copy(list): both directions
allocate without copy
- create(list)
- delete(list)
no copy, no allocation
- present(list): the data is assumed to be already present on the GPU, so no copy is performed
pragma
#pragma acc data copyin(a[0:nelem]) copyout(b[s/4:3*s/4])
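Note on the section syntax above (a, b, nelem, s are the names from the example): OpenACC array sections are written [start:length], so a[0:nelem] means nelem elements starting at index 0, and b[s/4:3*s/4] means 3*s/4 elements starting at index s/4.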
running each kernel as a separate function on the device, each doing its own copyin/copyout, is slow
better: do the copyin and copyout once, together, in an enclosing data region
data
#pragma acc data
{
#pragma acc parallel loop
...
#pragma acc parallel loop
...
}
the copyin and copyout needed by both loops are done only once, by the enclosing data region
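A minimal sketch of this pattern (the arrays a, b and the size n are assumed names):
#pragma acc data copyin(a[0:n]) copyout(b[0:n])
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        b[i] = b[i] + a[i];
}
// a is copied to the GPU once before the region; b is copied back once after it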
enter data
: copyin, create
exit data
: copyout, delete, finalize (finalize removes the device copy regardless of its reference count)
class M {
public:
    double *v;
    M() {
        v = new double[N];   // N: number of elements
        #pragma acc enter data create(v[0:N])
    }
    ~M() {
        #pragma acc exit data delete(v[0:N])
        delete[] v;
    }
};
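With this pattern the device copy of v exists for the object's lifetime, so later kernels can use present instead of copying. A sketch, assuming a member function fill() that initializes v:
void M::fill() {
    #pragma acc parallel loop present(v[0:N])
    for (int i = 0; i < N; i++)
        v[i] = 0.0;
    #pragma acc update host(v[0:N])   // only when the host actually needs the values
}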
running with explicit data management
the update directive: keep the CPU and GPU copies of a variable in sync (avoid races on stale data)
do_something_on_device();
#pragma acc update host(a)
do_something_on_host();
#pragma acc update device(a)
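Putting the update pattern together, a sketch (the array a, its size n, and do_something_on_host are assumed):
#pragma acc data copy(a[0:n])
{
    #pragma acc parallel loop          // work on the device
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;
    #pragma acc update host(a[0:n])    // refresh the CPU copy
    do_something_on_host(a, n);        // modifies a on the CPU
    #pragma acc update device(a[0:n])  // push the change back to the GPU
}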
reduce the data traffic between the CPU and the GPU
optimize loops
optimize matvec loops
by default the compiler picks a vector length of 128 for the GPU
three levels of parallelism: gang, worker, vector
one SM (streaming multiprocessor) \(\to\) 64 cores, 16 warps (1 warp = 32 threads), at most 2048 threads
- set the vector length to a multiple of 32 (32·k)
worker = warp
#pragma acc parallel vector_length(32)
#pragma acc loop gang worker
for (int i = 0; i < n; i++)
    #pragma acc loop vector // vector loop
    for (int j = 0; j < i; j++)
vector loop: specify vector_length on the parallel construct (or num_workers for worker loops)
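For the matvec loops mentioned above, a sketch with an explicit gang/vector mapping (A, x, y, n, m are assumed names; the data is assumed to be on the device already):
#pragma acc parallel loop gang vector_length(128) present(A[0:n*m], x[0:m], y[0:n])
for (int i = 0; i < n; i++) {
    double sum = 0.0;
    #pragma acc loop vector reduction(+:sum)
    for (int j = 0; j < m; j++)
        sum += A[i*m + j] * x[j];
    y[i] = sum;
}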
collapse clause
#pragma acc parallel loop collapse(2)
for (i)
for (j)
equiv
#pragma acc parallel loop
for (ij)
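A concrete sketch of collapse (the array c and sizes n, m are assumed):
#pragma acc parallel loop collapse(2)
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        c[i][j] = 0.0;   // the two loops become one iteration space of n*m, exposing more parallelism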
the tile clause
operate on smaller blocks of the iteration space to exploit data locality
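A sketch of tile on a nested loop (arrays a, b and sizes n, m are assumed), e.g. a blocked transpose:
#pragma acc parallel loop tile(32, 32)
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        b[j][i] = a[i][j];   // each 32x32 block is processed together, improving locality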
stride-1 memory access
contiguous access, e.g. over the last index j in a[0][i][j]
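In C the last index varies fastest, so the vector loop should run over it. A sketch (a is an assumed 3D array):
#pragma acc parallel loop gang
for (int i = 0; i < n; i++)
    #pragma acc loop vector
    for (int j = 0; j < m; j++)
        a[0][i][j] = 0.0;   // consecutive j touch consecutive addresses (stride-1)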