1 An Overview of OpenMP
IWOMP 2011 Chicago, IL, USA
June 13-15, 2011
Ruud van der Pas
Senior Staff Engineer SPARC Microelectronics
Oracle
Santa Clara, CA, USA
2 Outline
• Getting Started with OpenMP
• Using OpenMP
• What's New in OpenMP 3.1
Getting Started With
OpenMP
4
[Figure: multiple processors (0, 1, ..., P) connected to a single shared memory]
5
http://www.openmp.org
http://www.compunity.org
6
http://www.openmp.org
7 Shameless Plug - “Using OpenMP”
“Using OpenMP”
Portable Shared Memory Parallel Programming
Chapman, Jost, van der Pas MIT Press, 2008
ISBN-10: 0-262-53302-2
ISBN-13: 978-0-262-53302-7
List price: 35 $US
8
All 41 examples are available NOW!
As well as a forum on http://www.openmp.org
Download the examples and discuss in forum:
http://www.openmp.org/wp/2009/04/download-book-examples-and-discuss
9 What is OpenMP?
❑ De-facto standard Application Programming Interface (API) to write shared memory parallel applications in C, C++, and Fortran
❑ Consists of Compiler Directives, Run time routines and Environment variables
❑ Specification maintained by the OpenMP
Architecture Review Board (http://www.openmp.org)
❑ Version 3.0 was released in May 2008
● Version 3.1 will be released soon
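The three component types listed above can be seen together in a minimal sketch (not from the slides; the printf layout is illustrative):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      /* OMP_NUM_THREADS (environment variable) determines the team size here */
      #pragma omp parallel                 /* compiler directive */
      {
         printf("Thread %d of %d\n",
                omp_get_thread_num(),      /* runtime routine */
                omp_get_num_threads());    /* runtime routine */
      }
      return 0;
   }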
10
OpenMP is widely supported by industry, as well as the
academic community
11 When to consider OpenMP?
❑ Using an automatically parallelizing compiler:
● It can not find the parallelism
✔ The data dependence analysis is not able to determine whether it is safe to parallelize or not
● The granularity is not high enough
✔ The compiler lacks information to parallelize at the highest possible level
❑ Not using an automatically parallelizing compiler:
● No choice, other than doing it yourself
12 Advantages of OpenMP
❑ Good performance and scalability
● If you do it right ....
❑ De-facto and mature standard
❑ An OpenMP program is portable
● Supported by a large number of compilers
❑ Requires little programming effort
❑ Allows the program to be parallelized incrementally
13 OpenMP and Multicore
OpenMP is ideally suited for multicore architectures
Memory and threading model map naturally
Lightweight
Mature
Widely available and used
14 The OpenMP Memory Model
[Figure: each thread T has its own private memory, and all threads are connected to a common shared memory]
✔ All threads have access to the same, globally shared, memory
✔ Data can be shared or private
✔ Shared data is accessible by all threads
✔ Private data can only be accessed by the thread that owns it
✔ Data transfer is transparent to the programmer
✔ Synchronization takes place, but it is mostly implicit
15
Data-sharing Attributes
• In an OpenMP program, data needs to be “labeled”
• Essentially there are two basic types:
– Shared - There is only one instance of the data
• Threads can read and write the data simultaneously unless protected through a specific construct
• All changes made are visible to all threads
– But not necessarily immediately, unless enforced ...
– Private - Each thread has a copy of the data
• No other thread can access this data
• Changes only visible to the thread owning the data
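A small sketch of these two labels (the function and variable names are made up): 'a' and 'n' are shared, while each thread works on its own copy of 'tmp':

   #include <omp.h>

   void scale(double *a, int n, double factor)
   {
      double tmp;
      int i;
      #pragma omp parallel for shared(a,n,factor) private(tmp,i)
      for (i = 0; i < n; i++) {
         tmp  = factor * a[i];   /* every thread uses its own private tmp       */
         a[i] = tmp;             /* all threads read and write the shared array */
      }
   }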
16 Private and shared clauses
private (list)
✔ No storage association with original object
✔ All references are to the local object
✔ Values are undefined on entry and exit

shared (list)
✔ Data is accessible by all threads in the team
✔ All threads access the same address space
17
About storage association
• Private variables are undefined on entry and exit of the parallel region
• A private variable within a parallel region has no storage association with the same variable outside of the region
• Use the firstprivate and lastprivate clauses to override this behavior
• We illustrate these concepts with an example
18 The firstprivate and lastprivate clauses
firstprivate (list)
✔ All variables in the list are initialized with the value the original object had before entering the parallel construct

lastprivate (list)
✔ The thread that executes the sequentially last iteration or section updates the value of the objects in the list
19 Example firstprivate
n = 2; indx = 4;

#pragma omp parallel default(none) private(i,TID) \
        firstprivate(indx) shared(n,a)
{
   TID = omp_get_thread_num();
   indx = indx + n*TID;
   for (i=indx; i<indx+n; i++)
      a[i] = TID + 1;
} /*-- End of parallel region --*/

Resulting array contents: a[4..5] = 1 (TID = 0), a[6..7] = 2 (TID = 1), a[8..9] = 3 (TID = 2)
20 Example lastprivate
#pragma omp parallel for default(none) shared(n) lastprivate(a)
for (int i=0; i<n; i++)
{
   ...
   a = i + 1;
   ...
} // End of parallel for

b = 2 * a; // value of b is 2*n
21 The default clause
Fortran:  default ( none | shared | private | firstprivate )
C/C++:    default ( none | shared )

none
✔ No implicit defaults; have to scope all variables explicitly

shared
✔ All variables are shared
✔ The default in absence of an explicit "default" clause

private (Fortran)
✔ All variables are private to the thread
✔ Includes common block data, unless THREADPRIVATE

firstprivate (Fortran)
✔ All variables are private to the thread; pre-initialized
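A short sketch of why default(none) is useful (illustrative names): every variable referenced in the region must be scoped explicitly, so an accidentally shared temporary is flagged by the compiler:

   #include <omp.h>

   void init(double *a, int n)
   {
      int i;
      /* forgetting shared(a,n) or private(i) here is a compile-time error */
      #pragma omp parallel default(none) shared(a,n) private(i)
      {
         #pragma omp for
         for (i = 0; i < n; i++)
            a[i] = 0.0;
      }
   }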
22 The OpenMP Execution Model
Fork and Join Model

[Figure: the master thread forks a team of worker threads at the start of each parallel region; the team synchronizes (joins) at the end of the region, and only the master thread continues until the next parallel region]
23 Defining Parallelism in OpenMP
❑ OpenMP Team := Master + Workers
❑ A Parallel Region is a block of code executed by all threads simultaneously
☞ The master thread always has thread ID 0
☞ Thread adjustment (if enabled) is only done before entering a parallel region
☞ Parallel regions can be nested, but support for this is implementation dependent
☞ An "if" clause can be used to guard the parallel region; in
case the condition evaluates to "false", the code is executed
serially
24 The Parallel Region
Fortran:
!$omp parallel [clause[[,] clause] ...]
   "this code is executed in parallel"
!$omp end parallel    (implied barrier)

C/C++:
#pragma omp parallel [clause[[,] clause] ...]
{
   "this code is executed in parallel"
} // End of parallel region (note: implied barrier)

A parallel region is a block of code executed by all threads in the team
25 Parallel Region - An Example/1
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello World\n");
return(0);
}
26 Parallel Region - An Example/1
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
   #pragma omp parallel
   {
      printf("Hello World\n");
   } // End of parallel region

   return(0);
}
27 Parallel Region - An Example/2
$ cc -xopenmp -fast hello.c
$ export OMP_NUM_THREADS=2
$ ./a.out
Hello World
Hello World
$ export OMP_NUM_THREADS=4
$ ./a.out
Hello World
Hello World
Hello World
Hello World
$
28 The if clause
if (scalar expression)

✔ Only execute in parallel if the expression evaluates to true
✔ Otherwise, execute serially

#pragma omp parallel if (n > some_threshold) \
        shared(n,x,y) private(i)
{
   #pragma omp for
   for (i=0; i<n; i++)
      x[i] += y[i];
} /*-- End of parallel region --*/
29 Nested Parallelism
[Figure: the master thread opens a 3-way parallel outer region; each of those threads opens a nested parallel region, giving 9-way parallelism, after which execution drops back to the 3-way outer region]

Note: nesting level can be arbitrarily deep
30 Nested Parallelism Support/1
❑ Environment variable and runtime routines to set/get the maximum number of nested active parallel
regions
OMP_MAX_ACTIVE_LEVELS
omp_set_max_active_levels()
omp_get_max_active_levels()
❑ Environment variable and runtime routine to set/get the maximum number of OpenMP threads available to the program
OMP_THREAD_LIMIT
omp_get_thread_limit()
31 Nested Parallelism Support/2
❑ Per-task internal control variables
● Allow, for example, calling
omp_set_num_threads() inside a parallel
region to control the team size for next level of parallelism
❑ Library routines to determine
● Depth of nesting
omp_get_level()
omp_get_active_level()
● IDs of parent/grandparent etc. threads
omp_get_ancestor_thread_num(level)
● Team sizes of parent/grandparent etc. teams
omp_get_team_size(level)
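As a hedged illustration of these query routines (the two-level nest and the team sizes are made up for the example):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      omp_set_nested(1);
      omp_set_max_active_levels(2);
      #pragma omp parallel num_threads(2)            /* outer team              */
      {
         #pragma omp parallel num_threads(3)         /* nested team             */
         {
            #pragma omp critical
            printf("level %d, active level %d, parent thread %d, outer team size %d\n",
                   omp_get_level(),                  /* nesting depth (2 here)   */
                   omp_get_active_level(),           /* active parallel levels   */
                   omp_get_ancestor_thread_num(1),   /* ID of the level-1 parent */
                   omp_get_team_size(1));            /* size of the level-1 team */
         }
      }
      return 0;
   }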
32 A More Elaborate Example
#pragma omp parallel if (n>limit) default(none) \
        shared(n,a,b,c,x,y,z) private(f,i,scale)
{
   f = 1.0;                               // Statement is executed by all threads

   #pragma omp for nowait                 // parallel loop (work is distributed)
   for (i=0; i<n; i++)
      z[i] = x[i] + y[i];

   #pragma omp for nowait                 // parallel loop (work is distributed)
   for (i=0; i<n; i++)
      a[i] = b[i] + c[i];
   ....
   #pragma omp barrier                    // synchronization

   scale = sum(a,0,n) + sum(z,0,n) + f;   // Statement is executed by all threads
   ....
} /*-- End of parallel region --*/
Using OpenMP
34 Using OpenMP
• We have just seen a glimpse of OpenMP
• To be practically useful, much more functionality is needed
• Covered in this section:
– Many of the language constructs
– Features that may be useful or needed when running an OpenMP application
• Note that the tasking concept is covered in a separate
section
35 Components of OpenMP
Directives:
• Parallel region
• Worksharing constructs
• Tasking
• Synchronization
• Data-sharing attributes

Environment variables:
• Number of threads
• Scheduling type
• Dynamic thread adjustment
• Nested parallelism
• Stacksize
• Idle threads
• Active levels
• Thread limit

Runtime environment:
• Number of threads
• Thread ID
• Dynamic thread adjustment
• Nested parallelism
• Schedule
• Active levels
• Thread limit
• Nesting level
• Ancestor thread
• Team size
• Wallclock timer
• Locking
36 Directive format
❑ Fortran: directives are case insensitive
● Syntax: sentinel directive [clause [[,] clause]...]
● The sentinel is one of the following:
✔ !$OMP or C$OMP or *$OMP (fixed format)
✔ !$OMP (free format)
❑ Continuation: follows the language syntax
❑ Conditional compilation: the !$ or C$ sentinel is replaced by 2 spaces when OpenMP compilation is enabled
❑ C: directives are case sensitive
● Syntax: #pragma omp directive [clause [clause] ...]
❑ Continuation: use \ in pragma
❑ Conditional compilation: _OPENMP macro is set
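On the C side, the _OPENMP macro supports the same kind of conditional compilation; a minimal sketch (the output text is illustrative):

   #include <stdio.h>
   #ifdef _OPENMP
   #include <omp.h>
   #endif

   int main(void)
   {
   #ifdef _OPENMP
      /* _OPENMP expands to the yyyymm date of the supported specification */
      printf("OpenMP enabled (spec %d), max threads = %d\n",
             _OPENMP, omp_get_max_threads());
   #else
      printf("Compiled without OpenMP - running serially\n");
   #endif
      return 0;
   }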
37
The reduction clause - Example

      sum = 0.0
!$omp parallel default(none) &
!$omp          shared(n,x) private(i)
!$omp do reduction (+:sum)
      do i = 1, n
         sum = sum + x(i)
      end do
!$omp end do
!$omp end parallel
      print *, sum

Variable SUM is a shared variable

☞ Care needs to be taken when updating shared variable SUM
☞ With the reduction clause, the OpenMP compiler generates code such that a race condition is avoided
38 The reduction clause
reduction ( operator : list )                          C/C++
reduction ( {operator | intrinsic} : list )            Fortran

✔ Reduction variable(s) must be shared variables
✔ A reduction is defined as:

   Fortran:   x = x operator expr
              x = expr operator x
              x = intrinsic (x, expr_list)
              x = intrinsic (expr_list, x)

   C/C++:     x = x operator expr
              x = expr operator x
              x++, ++x, x--, --x
              x <binop>= expr

✔ Note that the value of a reduction variable is undefined from the moment the first thread reaches the clause till the operation has completed
✔ The reduction can be hidden in a function call
☞ Check the docs for details
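A C counterpart of the Fortran example on the previous slide, written as a sketch (the function name and types are illustrative):

   #include <omp.h>

   double array_sum(const double *x, int n)
   {
      double sum = 0.0;
      int i;
      #pragma omp parallel for default(none) shared(n,x) reduction(+:sum)
      for (i = 0; i < n; i++)
         sum += x[i];      /* each thread accumulates a private partial sum */
      return sum;          /* the partial sums have been combined here      */
   }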
39
Fortran - Allocatable Arrays
• Fortran allocatable arrays whose status is
“currently allocated” are allowed to be specified as
private, lastprivate, firstprivate, reduction, or copyprivate
      integer, allocatable, dimension(:) :: A
      integer i

      allocate (A(n))

!$omp parallel private (A)
      do i = 1, n
         A(i) = i
      end do
      ...
!$omp end parallel
40 Barrier/1
Suppose we run each of these two loops in parallel over i:

   for (i=0; i < N; i++)
      a[i] = b[i] + c[i];

   for (i=0; i < N; i++)
      d[i] = a[i] + b[i];

This may give us a wrong answer (one day). Why?
41 Barrier/2
We need to have updated all of a[ ] first, before using a[ ] *

   for (i=0; i < N; i++)
      a[i] = b[i] + c[i];

   (wait!  barrier)

   for (i=0; i < N; i++)
      d[i] = a[i] + b[i];

All threads wait at the barrier point and only continue when all threads have reached the barrier point

*) If there is a guarantee that the mapping of iterations onto threads is identical for both loops, there will not be a data race in this case
42 Barrier/3
[Figure: threads that reach the barrier early sit idle until the last thread arrives]

Barrier syntax in OpenMP:

!$omp barrier
#pragma omp barrier
43 When to use barriers ?
❑ If data is updated asynchronously and data integrity is at risk
❑ Examples:
● Between parts in the code that read and write the same section of memory
● After one timestep/iteration in a solver
❑ Unfortunately, barriers tend to be expensive and also may not scale to a large number of processors
❑ Therefore, use them with care
44 The nowait clause
❑ To minimize synchronization, some directives support the optional nowait clause
● If present, threads do not synchronize/wait at the end of that particular construct
❑ In C, it is one of the clauses on the pragma
❑ In Fortran, it is appended at the closing part of the construct
Fortran:
!$omp do
   :
!$omp end do nowait

C/C++:
#pragma omp for nowait
{
   :
}
45 The Worksharing Constructs
☞ The work is distributed over the threads
☞ Must be enclosed in a parallel region
☞ Must be encountered by all threads in the team, or none at all
☞ No implied barrier on entry; implied barrier on exit (unless nowait is specified)
☞ A work-sharing construct does not launch any new threads

Loop construct:
   C/C++:   #pragma omp for { .... }
   Fortran: !$OMP DO .... !$OMP END DO

Sections construct:
   C/C++:   #pragma omp sections { .... }
   Fortran: !$OMP SECTIONS .... !$OMP END SECTIONS

Single construct:
   C/C++:   #pragma omp single { .... }
   Fortran: !$OMP SINGLE .... !$OMP END SINGLE
46 The Workshare construct
Fortran has a fourth worksharing construct:

!$OMP WORKSHARE
   <array syntax>
!$OMP END WORKSHARE [NOWAIT]

Example:

!$OMP WORKSHARE
   A(1:M) = A(1:M) + B(1:M)
!$OMP END WORKSHARE NOWAIT
47 The omp for/do directive
Fortran:
!$omp do [clauses]
   do ...
      <code-block>
   end do
!$omp end do [nowait]

C/C++:
#pragma omp for [clauses]
   for (...) {
      <code-block>
   }
The iterations of the loop are distributed over the threads
48 The omp for directive - Example
#pragma omp parallel default(none)\
        shared(n,a,b,c,d) private(i)
{
   #pragma omp for nowait
   for (i=0; i<n-1; i++)
      b[i] = (a[i] + a[i+1])/2;

   #pragma omp for nowait
   for (i=0; i<n; i++)
      d[i] = 1.0/c[i];

} /*-- End of parallel region --*/
  (implied barrier)
49 C++: Random Access Iterator Loops
void iterator_example()
{
   std::vector<int> vec(23);
   std::vector<int>::iterator it;
   #pragma omp parallel for default(none) shared(vec)
   for (it = vec.begin(); it < vec.end(); it++) {
      // do work with *it
   }
}
Parallelization of random access iterator loops is supported
50 Loop Collapse
• Allows parallelization of perfectly nested loops without using nested parallelism
• The collapse clause on for/do loop indicates how many loops should be collapsed
• Compiler forms a single loop and then parallelizes it
!$omp parallel do collapse(2) ...
      do i = il, iu, is
         do j = jl, ju, js
            do k = kl, ku, ks
               ...
            end do
         end do
      end do
!$omp end parallel do
51 The schedule clause/1
schedule ( static | dynamic | guided | auto [, chunk] )
schedule ( runtime )

static [, chunk]
✔ Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
✔ In absence of "chunk", each thread executes approx. N/P chunks for a loop of length N and P threads
   ● Details are implementation defined
✔ Under certain conditions, the assignment of iterations to threads is the same across multiple loops in the same parallel region
52 The schedule clause/2
Example static schedule: loop of length 16, 4 threads

                Thread 0     Thread 1     Thread 2     Thread 3
no chunk*       1-4          5-8          9-12         13-16
chunk = 2       1-2, 9-10    3-4, 11-12   5-6, 13-14   7-8, 15-16

*) The precise distribution is implementation defined
53 The schedule clause/3
dynamic [, chunk]
✔ Fixed portions of work; size is controlled by the value of chunk
✔ When a thread finishes, it starts on the next portion of work

guided [, chunk]
✔ Same dynamic behavior as "dynamic", but size of the portion of work decreases exponentially

runtime
✔ Iteration scheduling scheme is set at runtime through environment variable OMP_SCHEDULE

auto
✔ The compiler (or runtime system) decides what is best to use; choice could be implementation dependent
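A sketch of how the clause is typically used (names and workload are made up): dynamic scheduling with a small chunk helps when iteration costs vary widely:

   #include <omp.h>

   void process_rows(double **row, const int *len, int n)
   {
      int i, j;
      /* rows have very different lengths, so static chunks would load-imbalance */
      #pragma omp parallel for schedule(dynamic,4) private(j)
      for (i = 0; i < n; i++)
         for (j = 0; j < len[i]; j++)
            row[i][j] *= 2.0;
   }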
54 Experiment - 500 iterations, 4 threads
[Figure: mapping of the 500 iterations onto the 4 thread IDs (thread ID vs. iteration number) for the static, "dynamic,5" and "guided,5" schedules]
55 Schedule Kinds Functions
❑ Makes schedule(runtime) more general
❑ Can set/get the schedule with library routines:
   omp_set_schedule()
   omp_get_schedule()
❑ Also allows implementations to add their own schedule kinds
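A small sketch of these two routines (OpenMP 3.0); they only affect loops that use schedule(runtime):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      omp_sched_t kind;
      int chunk;

      omp_set_schedule(omp_sched_guided, 8);    /* overrides OMP_SCHEDULE         */
      omp_get_schedule(&kind, &chunk);          /* query what is currently in use */
      printf("runtime schedule: kind=%d, chunk=%d\n", (int)kind, chunk);

      #pragma omp parallel for schedule(runtime)
      for (int i = 0; i < 100; i++) {
         /* work */
      }
      return 0;
   }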
56 Parallel sections
Individual section blocks are executed in parallel

C/C++:
#pragma omp sections [clauses]
{
   #pragma omp section
      {....}
   #pragma omp section
      {....}
   ....
}

Fortran:
!$omp sections [clauses]
!$omp section
   ....
!$omp section
   ....
....
!$omp end sections [nowait]
57 The Sections Directive - Example
#pragma omp parallel default(none)\
        shared(n,a,b,c,d) private(i)
{
   #pragma omp sections nowait
   {
      #pragma omp section
      for (i=0; i<n-1; i++)
         b[i] = (a[i] + a[i+1])/2;

      #pragma omp section
      for (i=0; i<n; i++)
         d[i] = 1.0/c[i];

   } /*-- End of sections --*/
} /*-- End of parallel region --*/
58 Overlap I/O and Processing/1
[Figure: pipelined execution over time; while the input thread reads block i, the processing thread(s) work on block i-1 and the output thread writes block i-2]
59
Overlap I/O and Processing/2
#pragma omp parallel sections
{
   #pragma omp section                 /*-- Input thread --*/
   {
      for (int i=0; i<N; i++) {
         (void) read_input(i);
         (void) signal_read(i);
      }
   }
   #pragma omp section                 /*-- Processing thread(s) --*/
   {
      for (int i=0; i<N; i++) {
         (void) wait_read(i);
         (void) process_data(i);
         (void) signal_processed(i);
      }
   }
   #pragma omp section                 /*-- Output thread --*/
   {
      for (int i=0; i<N; i++) {
         (void) wait_processed(i);
         (void) write_output(i);
      }
   }
} /*-- End of parallel sections --*/
60 The Single Directive
Only one thread in the team executes the code enclosed

Fortran:
!$omp single [private][firstprivate]
   <code-block>
!$omp end single [copyprivate][nowait]

C/C++:
#pragma omp single [private][firstprivate] \
                   [copyprivate][nowait]
{
   <code-block>
}
61
Single processor region/1

Only one thread executes the single region
This construct is ideally suited for I/O or initializations

Original Code:

   ...
   "read A[0..N-1]";
   ...

Parallel Version:

   #pragma omp parallel shared(A)
   {
      ...
      #pragma omp single nowait
      {"read A[0..N-1]";}
      ...
      #pragma omp barrier
      "use A"
      ...
   }
62 Single processor region/2
[Figure: one thread executes the single processor region while the other threads wait, if there is a barrier at the end of the region]
63 Combined work-sharing constructs
Single PARALLEL loop:

   C/C++:
      #pragma omp parallel
      #pragma omp for
      for (...)
   is equivalent to:
      #pragma omp parallel for
      for (....)

   Fortran:
      !$omp parallel
      !$omp do
        ...
      !$omp end do
      !$omp end parallel
   is equivalent to:
      !$omp parallel do
        ...
      !$omp end parallel do

Single PARALLEL sections:

   C/C++:
      #pragma omp parallel
      #pragma omp sections
      { ... }
   is equivalent to:
      #pragma omp parallel sections
      { ... }

   Fortran:
      !$omp parallel
      !$omp sections
        ...
      !$omp end sections
      !$omp end parallel
   is equivalent to:
      !$omp parallel sections
        ...
      !$omp end parallel sections

Single WORKSHARE loop (Fortran only):

      !$omp parallel
      !$omp workshare
        ...
      !$omp end workshare
      !$omp end parallel
   is equivalent to:
      !$omp parallel workshare
        ...
      !$omp end parallel workshare
64 Orphaning
♦ The OpenMP specification does not restrict worksharing and synchronization directives (omp for, omp single, critical, barrier, etc.) to be within the lexical extent of a parallel region. These directives can be orphaned
♦ That is, they can appear outside the lexical extent of a parallel region

   :
#pragma omp parallel
{
   :
   (void) dowork();
   :
}
   :

void dowork()
{
   :
   #pragma omp for     /* orphaned work-sharing directive */
   for (int i=0; i<n; i++) {
      :
   }
   :
}
65 More on orphaning
♦ When an orphaned worksharing or synchronization directive is encountered in the sequential part of the program (outside the dynamic extent of any parallel region), it is executed by the master thread only. In effect, the directive will be ignored

(void) dowork();       /* Sequential FOR */

#pragma omp parallel
{
   (void) dowork();    /* Parallel FOR */
}

void dowork()
{
   #pragma omp for
   for (i=0; ....) {
      :
   }
}
66 Example - Parallelizing Bulky Loops
for (i=0; i<n; i++) /* Parallel loop */
{
a = ...
b = ... a ..
c[i] = ....
...
for (j=0; j<m; j++) {
<a lot more code in this loop>
}
...
}
67 Step 1: “Outlining”
for (i=0; i<n; i++)   /* Parallel loop */
{
   (void) FuncPar(i,m,c,...);
}

void FuncPar(i,m,c,....)
{
   float a, b;   /* Private data */
   int   j;

   a = ...
   b = ... a ..
   c[i] = ....
   ...
   for (j=0; j<m; j++) {
      <a lot more code in this loop>
   }
   ...
}

Still a sequential program
Should behave identically
Easy to test for correctness
But, parallel by design
68 Step 2: Parallelize
#pragma omp parallel for private(i) shared(m,c,..)
for (i=0; i<n; i++)   /* Parallel loop */
{
   (void) FuncPar(i,m,c,...);
} /*-- End of parallel for --*/

Minimal scoping required; less error prone

void FuncPar(i,m,c,....)
{
   float a, b;   /* Private data */
   int   j;

   a = ...
   b = ... a ..
   c[i] = ....
   ...
   for (j=0; j<m; j++) {
      <a lot more code in this loop>
   }
   ...
}
69 Additional Directives/1
!$omp master
<code-block>
!$omp end master
#pragma omp master {<code-block>}
!$omp atomic
#pragma omp atomic
!$omp critical [(name)]
<code-block>
!$omp end critical [(name)]
#pragma omp critical [(name)]
{<code-block>}
70 The Master Directive
!$omp master
<code-block>
!$omp end master
Only the master thread executes the code block:

#pragma omp master
   {<code-block>}

There is no implied barrier on entry or exit!
71 Critical Region/1
If sum is a shared variable, this loop can not run in parallel by simply using a "#pragma omp for":

   for (i=0; i < n; i++) {
      ...
      sum += a[i];
      ...
   }

With a critical region, all threads execute the update, but only one at a time will do so:

   #pragma omp parallel for
   for (i=0; i < n; i++) {
      ...
      #pragma omp critical
      {sum += a[i];}
      ...
   }
72 Critical Region/2
❑ Useful to avoid a race condition, or to perform I/O (but that still has random order)
❑ Be aware that there is a cost associated with a critical region
[Figure: over time, threads enter the critical region one after another while the others wait]
73 Critical and Atomic constructs
Critical: all threads execute the code, but only one at a time:

   !$omp critical [(name)]
      <code-block>
   !$omp end critical [(name)]

   #pragma omp critical [(name)]
      {<code-block>}

   There is no implied barrier on entry or exit!

Atomic: only the load and store of the updated variable are atomic; this is a lightweight, special form of a critical section:

   !$omp atomic
      <statement>

   #pragma omp atomic
      <statement>

   Example:
   #pragma omp atomic
      a[indx[i]] += b[i];
74 Additional Directives/2
!$omp ordered
<code-block>
!$omp end ordered
#pragma omp ordered {<code-block>}
!$omp flush [(list)]
#pragma omp flush [(list)]
75 Additional Directives/2
The ordered construct: the enclosed block of code is executed in the order in which iterations would be executed sequentially (may introduce serialization; could be expensive):

   !$omp ordered
      <code-block>
   !$omp end ordered

   #pragma omp ordered
      {<code-block>}

The flush construct ensures that all threads in a team have a consistent view of certain objects in memory. In the absence of a list, all visible variables are flushed:

   !$omp flush [(list)]

   #pragma omp flush [(list)]
76 The flush directive
Thread A:
   X = 0
   while (X == 0) {
      "wait"
   }

Thread B:
   X = 1

If shared variable X is kept within a register, the modification may not be made visible to the other thread(s)
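A hedged sketch of this situation (the sections structure is an assumption; a production version would also have to make the accesses to X atomic):

   #include <omp.h>

   int main(void)
   {
      int X = 0;
      #pragma omp parallel sections shared(X)
      {
         #pragma omp section         /* Thread A: spin until X becomes 1      */
         {
            for (;;) {
               #pragma omp flush(X)  /* force a fresh read of X from memory   */
               if (X == 1) break;
            }
         }
         #pragma omp section         /* Thread B: set the flag                */
         {
            X = 1;
            #pragma omp flush(X)     /* make the write to X visible to others */
         }
      }
      return 0;
   }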
77
Implied Flush Regions/1
• During a barrier region
• At exit from worksharing regions, unless a nowait is present
• At entry to and exit from parallel, critical, ordered and parallel worksharing regions
• During omp_set_lock and omp_unset_lock regions
• During omp_test_lock, omp_set_nest_lock, omp_unset_nest_lock and omp_test_nest_lock regions, if the region causes the lock to be set or unset
• Immediately before and after every task scheduling point
78 Implied Flush Regions/2
• At entry to and exit from atomic regions, where the list contains only the variable updated in the atomic construct
• A flush region is not implied at the following locations:
– At entry to a worksharing region
– At entry to or exit from a master region
79
OpenMP and Global Data
80
Global data - An example

file global.h:
      common /work/a(m,n),b(m)

      program global_data
      ....
      include "global.h"
      ....
!$omp parallel do private(j)
      do j = 1, n
         call suba(j)
      end do
!$omp end parallel do
      ...

      subroutine suba(j)
      ...
      include "global.h"
      ...
      do i = 1, m
         b(i) = j
      end do
      do i = 1, m
         a(i,j) = func_call(b(i))
      end do
      return
      end

Data Race!
81 Global data - A Data Race!
Thread 1: call suba(1)                  Thread 2: call suba(2)

   subroutine suba(j=1)                    subroutine suba(j=2)
      ....                                    ....
      do i = 1, m                             do i = 1, m
         b(i) = 1                                b(i) = 2
      end do                                  end do
      do i = 1, m                             do i = 1, m
         a(i,1) = func_call(b(i))                a(i,2) = func_call(b(i))
      end do                                  end do

Array b in common block /work/ is shared: both threads write and read it at the same time
82
Example - Solution

file global_ok.h:
      integer, parameter:: nthreads=4
      common /work/a(m,n)
      common /tprivate/b(m,nthreads)

☞ By expanding array B, we can give each thread unique access to its storage area
☞ Note that this can also be done using dynamic memory (allocatable, malloc, ....)

      program global_data
      ....
      include "global_ok.h"
      ....
!$omp parallel do private(j)
      do j = 1, n
         call suba(j)
      end do
!$omp end parallel do
      ...

      subroutine suba(j)
      ...
      include "global_ok.h"
      ...
      TID = omp_get_thread_num()+1
      do i = 1, m
         b(i,TID) = j
      end do
      do i = 1, m
         a(i,j) = func_call(b(i,TID))
      end do
      return
      end
83
About global data
• Global data is shared and requires special care
• A problem may arise in case multiple threads access the same memory section simultaneously:
– Read-only data is no problem
– Updates have to be checked for race conditions
• It is your responsibility to deal with this situation
• In general one can do the following:
– Split the global data into a part that is accessed in serial parts only and a part that is accessed in parallel
– Manually create thread private copies of the latter
– Use the thread ID to access these private copies
• Alternative: Use OpenMP's threadprivate directive
84 The threadprivate directive
❑ Thread private copies of the designated global variables and common blocks are created
❑ Several restrictions and rules apply when doing this:
● The number of threads has to remain the same for all the parallel regions (i.e. no dynamic threads)
✔ Oracle implementation supports changing the number of threads
● Initial data is undefined, unless copyin is used
● ...
❑ Check the documentation when using threadprivate !
!$omp threadprivate (/cb/ [,/cb/] ...)
#pragma omp threadprivate (list)
85 Example - Solution 2

file global_ok2.h:
      common /work/a(m,n)
      common /tprivate/b(m)
!$omp threadprivate(/tprivate/)

☞ The compiler creates thread private copies of array B, to give each thread unique access to its storage area
☞ Note that the number of copies is automatically adjusted to the number of threads

      program global_data
      ....
      include "global_ok2.h"
      ....
!$omp parallel do private(j)
      do j = 1, n
         call suba(j)
      end do
!$omp end parallel do
      ...
      stop
      end

      subroutine suba(j)
      ...
      include "global_ok2.h"
      ...
      do i = 1, m
         b(i) = j
      end do
      do i = 1, m
         a(i,j) = func_call(b(i))
      end do
      return
      end
86 The copyin clause
copyin (list)
✔ Applies to THREADPRIVATE common blocks only
✔ At the start of the parallel region, data of the master thread is copied to the thread private copies
Example:

      common /cblock/velocity
      common /fields/xfield, yfield, zfield

!     create thread private common blocks
!$omp threadprivate (/cblock/, /fields/)

!$omp parallel            &
!$omp default (private)   &
!$omp copyin ( /cblock/, zfield )
87 C++ and Threadprivate
❑ As of OpenMP 3.0, it has been clarified where/how threadprivate objects are constructed and destructed
❑ Allow C++ static class members to be threadprivate
class T {
   public:
      static int i;
#pragma omp threadprivate(i)
      ...
};
88
OpenMP Runtime Routines
89 OpenMP Runtime Functions/1
Name                        Functionality
omp_set_num_threads         Set number of threads
omp_get_num_threads         Number of threads in team
omp_get_max_threads         Max num of threads for parallel region
omp_get_thread_num          Get thread ID
omp_get_num_procs           Maximum number of processors
omp_in_parallel             Check whether in parallel region
omp_set_dynamic             Activate dynamic thread adjustment
                            (but implementation is free to ignore this)
omp_get_dynamic             Check for dynamic thread adjustment
omp_set_nested              Activate nested parallelism
                            (but implementation is free to ignore this)
omp_get_nested              Check for nested parallelism
omp_get_wtime               Returns wall clock time
omp_get_wtick               Number of seconds between clock ticks

C/C++   : Need to include file <omp.h>
Fortran : Add "use omp_lib" or include file "omp_lib.h"
90 OpenMP Runtime Functions/2
Name                             Functionality
omp_set_schedule                 Set schedule (if "runtime" is used)
omp_get_schedule                 Returns the schedule in use
omp_get_thread_limit             Max number of threads for program
omp_set_max_active_levels        Set number of active parallel regions
omp_get_max_active_levels        Number of active parallel regions
omp_get_level                    Number of nested parallel regions
omp_get_active_level             Number of nested active par. regions
omp_get_ancestor_thread_num      Thread id of ancestor thread
omp_get_team_size (level)        Size of the thread team at this level

C/C++   : Need to include file <omp.h>
Fortran : Add "use omp_lib" or include file "omp_lib.h"
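As a sketch of the timing routines from the first table (the array size and loop body are illustrative):

   #include <stdio.h>
   #include <omp.h>

   #define N 1000000

   static double a[N];

   int main(void)
   {
      double t0 = omp_get_wtime();        /* wall clock time before the region */

      #pragma omp parallel for
      for (int i = 0; i < N; i++)
         a[i] = 2.0 * i;

      double t1 = omp_get_wtime();        /* wall clock time after the region  */
      printf("elapsed %.6f s, timer resolution %g s\n", t1 - t0, omp_get_wtick());
      return 0;
   }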
91 OpenMP locking routines
❑ Locks provide greater flexibility over critical sections and atomic updates:
● Possible to implement asynchronous behavior
● Not block structured
❑ The so-called lock variable is a special variable:
● C/C++: type omp_lock_t and omp_nest_lock_t for nested locks
● Fortran: type INTEGER and of a KIND large enough to hold an address
❑ Lock variables should be manipulated through the API only
❑ It is illegal, and behavior is undefined, in case a lock variable is used without the appropriate initialization
92 Nested locking
❑ Simple locks: may not be locked if already in a locked state
❑ Nestable locks: may be locked multiple times by the same thread before being unlocked
❑ In the remainder, we discuss simple locks only
❑ The interface for functions dealing with nested locks is similar (but using nestable lock variables):
Simple locks            Nestable locks
omp_init_lock           omp_init_nest_lock
omp_destroy_lock        omp_destroy_nest_lock
omp_set_lock            omp_set_nest_lock
omp_unset_lock          omp_unset_nest_lock
omp_test_lock           omp_test_nest_lock
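A C sketch of the simple-lock interface listed above (the Fortran version of the same idea follows on the next slides; the work done under the lock is illustrative):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      omp_lock_t lck;
      int sum = 0;

      omp_init_lock(&lck);
      #pragma omp parallel shared(lck,sum)
      {
         while (!omp_test_lock(&lck)) {   /* try the lock; returns nonzero on success  */
            /* do_something_else();          keep busy while the lock is held elsewhere */
         }
         sum += omp_get_thread_num();     /* protected update of a shared variable     */
         omp_unset_lock(&lck);
      }
      omp_destroy_lock(&lck);
      printf("sum = %d\n", sum);
      return 0;
   }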
93 OpenMP locking example
[Figure: within the parallel region, threads 0 and 1 each do other work, acquire the lock, execute the protected region, and release the lock again]

♦ The protected region contains the update of a shared variable
♦ One thread acquires the lock and performs the update
♦ Meanwhile, the other thread performs some other work
♦ When the lock is released again, the other thread performs the update
94 Locking Example - The Code
      Program Locks
      ....
      Call omp_init_lock (LCK)          ! Initialize lock variable

!$omp parallel shared(LCK)

      Do While ( omp_test_lock (LCK) .EQV. .FALSE. )   ! Check availability of lock
         Call Do_Something_Else()                      ! (also sets the lock)
      End Do

      Call Do_Work()

      Call omp_unset_lock (LCK)         ! Release lock again

!$omp end parallel

      Call omp_destroy_lock (LCK)       ! Remove lock association

      Stop
      End
95 Example output for 2 threads
TID: 1 at 09:07:27 => entered parallel region
TID: 1 at 09:07:27 => done with WAIT loop and has the lock
TID: 1 at 09:07:27 => ready to do the parallel work
TID: 1 at 09:07:27 => this will take about 18 seconds
TID: 0 at 09:07:27 => entered parallel region
TID: 0 at 09:07:27 => WAIT for lock - will do something else for 5 seconds
TID: 0 at 09:07:32 => WAIT for lock - will do something else for 5 seconds
TID: 0 at 09:07:37 => WAIT for lock - will do something else for 5 seconds
TID: 0 at 09:07:42 => WAIT for lock - will do something else for 5 seconds
TID: 1 at 09:07:45 => done with my work
TID: 1 at 09:07:45 => done with work loop - released the lock
TID: 1 at 09:07:45 => ready to leave the parallel region
TID: 0 at 09:07:47 => done with WAIT loop and has the lock
TID: 0 at 09:07:47 => ready to do the parallel work
TID: 0 at 09:07:47 => this will take about 18 seconds
TID: 0 at 09:08:05 => done with my work
TID: 0 at 09:08:05 => done with work loop - released the lock
TID: 0 at 09:08:05 => ready to leave the parallel region
Done at 09:08:05 - value of SUM is 1100

Note: the program has been instrumented to get this information; the value of SUM is used to check the answer
96
OpenMP Environment Variables
97 OpenMP Environment Variables
Note: the names are in uppercase, the values are case insensitive

OpenMP environment variable              Default for Oracle Solaris Studio
OMP_NUM_THREADS n                         1
OMP_SCHEDULE "schedule,[chunk]"           static, "N/P"
OMP_DYNAMIC { TRUE | FALSE }              TRUE
OMP_NESTED { TRUE | FALSE }               FALSE
OMP_STACKSIZE size [B|K|M|G]              4 MB (32-bit) / 8 MB (64-bit)
OMP_WAIT_POLICY [ACTIVE | PASSIVE]        PASSIVE
OMP_MAX_ACTIVE_LEVELS                     4
OMP_THREAD_LIMIT                          1024
98 Implementing the Fork-Join Model
Use the OMP_WAIT_POLICY environment variable to control the behaviour of idle threads

[Figure: the master thread runs the serial parts between parallel regions; after the barrier that ends a parallel region, the worker threads are idle until the next parallel region starts, and the wait policy determines what they do during that time]
99 About the Stack
#pragma omp parallel shared(Aglobal)
{
   (void) myfunc(&Aglobal);
}

void myfunc(float *Aglobal)
{
   int Alocal;
   ...
}

Variable Alocal is in private memory, managed by the thread owning it, and stored on the so-called stack; each thread has its own copy of Alocal, while Aglobal is shared by all threads
Tasking In OpenMP
101