parallel algorithm course 04
synchronization has costs
simple list traversal
// linked list
p = head;
while (p) {
    process(p);
    p = p->next;
}compute all p in advance
#pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < count; ++i)
        process(parr[i]);task
#pragma omp parallel {
    // not often used
    #pragma omp task
    foo();
    #pragma omp barrier
    // more common case
    #pragma omp single {
        #pragma omp task
        bar();
    }
}example (with bug)
int fib(int n) {
    int x, y; // private
    if (n < 2) return n;
    #pragma omp task 
    // need `shared(x)`. by default, task copy the whole stack, 
    // x is in different address
    x = fib(n-1);
    #pragma omp task // need `shared(y)`, default is private in task
    y = fib(n-2);
    #pragma omp taskwait
    return x+y; // wrong here!!!
}
#pragma omp parallel single
fib(n);two important points
barrier
data env
example 2 (with bug)
list ml;
Element *e;
#pragma omp parallel
#pragma omp single
{
    for (e=ml->first; e; e=e->next)
    #pragma omp task // need firstprivate(e)
        process(e); // has data racing
}return to first example
#pragma omp parallel
{
    #pragma omp single
    {
        node* p = head;
        while(p) {
            #pragma omp task firstprivate(p)
                process(p);
            p = p->next;
        }
    }
}
merge sort
directly using parallel
\(T_1=2T_1(\frac{n}{2})+O(n)=O(n\cdot log(n)\)
\(T_\infty=T_\infty(\frac{n}{2})+O(n)=O(n)\)
use a differernt merge method
parallel merge
say here we have two sorted seqs (\(\le\))
\(A=a_1,\dots,a_n\) and \(B=b_1,\dots,b_n\)
let \(t=a_{\lceil \frac{n}{2}\rceil}\), split A and B into sub arr \(A_1,A_2,B_1,B_2\) s.t. \(A_1\le t, B_1\le t\) and \(A_2\gt t, B_2\gt t\)
so seq \(A_1,B_1,A_2,B_2\) is semi-merged by t, then go on with \(merge(A_1,B_1)\) and \(merge(A_2,B_2)\)
random number generating problem
problem
example: approximate \(\pi\) using Monto Carlo
seed(SEED);
#pragma omp parallel for private(x,y) reduction (+:n_in_circle)
for(int i = 0; i < num_trials; i++) {
    x = random();
    y = random();
    if (x*x + y*y <= r*r) n_in_circle ++;
}
pi = 4.0 * (1.0 * n_in_circle / num_trials);Linear Congruential Generator(LCG) (has data racing problem)
int random_last = 0;
// need 
// #pragma omp threadprivate(random_last)
int random() {
    random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
    random_last = random_next;
    return random_next;
}parameter: (one possible parameter suite) MULTIPLIER=1366, ADDEND=150889, PMOD=714025
has a period (less than PMOD)
single thread and multi-thread has different accuracy (multi-thread has poor performance)
data racing on var random_last
solution 1
add #pragma omp threadprivate(random_last) under int random_last=0
has a local copy of random_last when creating a thread
similar to firstprivate
problem: still poor than single thread version
different random range in threads maybe overlap
solution 2
Leap Frog method
generate different seeds for each thread
situation for ADDEND=0
#pragma omp single
{
    nthreads = ...
    iseed = PMOD / MULTIPLIER;
    pseed[0] = iseed;
    mult_n = MULTIPLIER;
    for (int i = 0; i < nthreads; i++) {
        iseed = (unsigned long long) ((MULTIPLIER * iseed) % PMOD);
        pseed[i] = iseed;
        mult_n = (mult_n * MULTIPLIER) % PMOD;
    }
    id = ...
    random_last = (unsigned long long) pseed[id];
}