parallel algorithm course 08

parallel algorithm
note
Author

Furyton

Published

May 11, 2022

openACC

GPU not sharable

openacc.org

openacc best programming


three levels

data

optimize

Unified virtual memory, auto, slow

remove the “managed” suboption to the -ta compiler

data clauses

allocate with copy

  • copyin {list}: cpu \(\to\) gpu

  • copyout {list}: gpu \(\to\) cpu

  • copy {list}

allocate without copy

  • create {list}

  • delete {list}

only copy

  • present {list}: only copy if present when copyin

pragma

#pragam acc data copyin (a[0:nelem]) copyout(b[s/4:3*s/4])

run functions in each devices, and do copyin, copyout in the functions

slow

we can copyin and copyout together


data


#pragma acc data
{
    #pragma acc parallel loop
    ...
    #pragma acc parallel loop
    ...
}

copyin and copyout in each loops are done in one clause


enter data: copyin, create

exit data: copyout, delete, finalize(destroy a variable not regarding its reference)

class M{
    M() {
        v = new double[N];
        #pragma acc enter data create(v)
    }
    ~M() {
        #pragma acc exit data delete(v)
        {
            free(v);
        }
    }
}

running with explicit data management

data racing update between cpu and gpu

do_somthing_on_device();

#pragma acc update host(a)

do_something_on_host();

#pragma acc update device(a)

reduce IO between cpu and GPU

optimize loops

optimize matvec loops

default iteration is 128 for GPU

three level

Gang, workers, vectors

one SMP \(\to\) 64 cores, 16 warps(1 warp = 32 threads), at most 2048 threads

  • set vector size as 32 k

worker = warp

#pragma acc parallel vector_length(32)
#pragma acc loop gang worker
for (int i = 0; i < n; i++)
    #pragma acc loop vector // vector loop
    for (int j = 0; j < i; j++)

vector loop: has to specify vector_length or num_workers

collapse clause

#pragma acc parallel loop collapse(2)
for (i)
    for (j)

equiv

#pragma acc parallel loop
for (ij)

the tile clause

operate on smaller blocks of the operation to exploit data locality


stride-1 memory access

continus access

a[0][i][j]