1 An Overview of OpenMP
IWOMP 2011 Chicago, IL, USA
June 13-15, 2011
Ruud van der Pas
Senior Staff Engineer SPARC Microelectronics
Oracle
Santa Clara, CA, USA
2 Outline
• Getting Started with OpenMP
• Using OpenMP
• What's New in OpenMP 3.1
Getting Started With
OpenMP
4
[Figure: multiple processors (0, 1, ..., P) connected to a single shared memory]
5
http://www.openmp.org
http://www.compunity.org
6
http://www.openmp.org
7 Shameless Plug - “Using OpenMP”
“Using OpenMP”
Portable Shared Memory Parallel Programming
Chapman, Jost, van der Pas MIT Press, 2008
ISBN-10: 0-262-53302-2
ISBN-13: 978-0-262-53302-7
List price: 35 $US
8
All 41 examples are available NOW!
As well as a forum on http://www.openmp.org
Download the examples and discuss in forum:
http://www.openmp.org/wp/2009/04/download-book-examples-and-discuss
9 What is OpenMP?
❑ De-facto standard Application Programming Interface (API) to write shared memory parallel applications in C, C++, and Fortran
❑ Consists of Compiler Directives, Run time routines and Environment variables
❑ Specification maintained by the OpenMP
Architecture Review Board (http://www.openmp.org)
❑ Version 3.0 was released in May 2008
● Version 3.1 will be released soon
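The three component types listed above can be seen together in a minimal sketch (not from the slides; the printf layout is illustrative):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      /* OMP_NUM_THREADS (environment variable) determines the team size here */
      #pragma omp parallel                 /* compiler directive */
      {
         printf("Thread %d of %d\n",
                omp_get_thread_num(),      /* runtime routine */
                omp_get_num_threads());    /* runtime routine */
      }
      return 0;
   }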
10
OpenMP is widely supported by industry, as well as the
academic community
11 When to consider OpenMP?
❑ Using an automatically parallelizing compiler:
● It can not find the parallelism
✔ The data dependence analysis is not able to determine whether it is safe to parallelize or not
● The granularity is not high enough
✔ The compiler lacks information to parallelize at the highest possible level
❑ Not using an automatically parallelizing compiler:
● No choice, other than doing it yourself
12 Advantages of OpenMP
❑ Good performance and scalability
● If you do it right ....
❑ De-facto and mature standard
❑ An OpenMP program is portable
● Supported by a large number of compilers
❑ Requires little programming effort
❑ Allows the program to be parallelized incrementally
13 OpenMP and Multicore
OpenMP is ideally suited for multicore architectures
Memory and threading model map naturally
Lightweight
Mature
Widely available and used
14 The OpenMP Memory Model
[Figure: each thread T has its own private memory, and all threads are connected to a common shared memory]
✔ All threads have access to the same, globally shared, memory
✔ Data can be shared or private
✔ Shared data is accessible by all threads
✔ Private data can only be accessed by the thread that owns it
✔ Data transfer is transparent to the programmer
✔ Synchronization takes place, but it is mostly implicit
15
Data-sharing Attributes
• In an OpenMP program, data needs to be “labeled”
• Essentially there are two basic types:
– Shared - There is only one instance of the data
• Threads can read and write the data simultaneously unless protected through a specific construct
• All changes made are visible to all threads
– But not necessarily immediately, unless enforced ...
– Private - Each thread has a copy of the data
• No other thread can access this data
• Changes only visible to the thread owning the data
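A small sketch of these two labels (the function and variable names are made up): 'a' and 'n' are shared, while each thread works on its own copy of 'tmp':

   #include <omp.h>

   void scale(double *a, int n, double factor)
   {
      double tmp;
      int i;
      #pragma omp parallel for shared(a,n,factor) private(tmp,i)
      for (i = 0; i < n; i++) {
         tmp  = factor * a[i];   /* every thread uses its own private tmp       */
         a[i] = tmp;             /* all threads read and write the shared array */
      }
   }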
16 Private and shared clauses
private (list)
✔ No storage association with original object
✔ All references are to the local object
✔ Values are undefined on entry and exit

shared (list)
✔ Data is accessible by all threads in the team
✔ All threads access the same address space
17
About storage association
• Private variables are undefined on entry and exit of the parallel region
• A private variable within a parallel region has no storage association with the same variable outside of the region
• Use the firstprivate and lastprivate clauses to override this behavior
• We illustrate these concepts with an example
18 The firstprivate and lastprivate clauses
firstprivate (list)
✔ All variables in the list are initialized with the value the original object had before entering the parallel construct

lastprivate (list)
✔ The thread that executes the sequentially last iteration or section updates the value of the objects in the list
19 Example firstprivate
n = 2; indx = 4;

#pragma omp parallel default(none) private(i,TID) \
        firstprivate(indx) shared(n,a)
{
   TID = omp_get_thread_num();
   indx = indx + n*TID;
   for (i=indx; i<indx+n; i++)
      a[i] = TID + 1;
} /*-- End of parallel region --*/

Resulting array contents: a[4..5] = 1 (TID = 0), a[6..7] = 2 (TID = 1), a[8..9] = 3 (TID = 2)
20 Example lastprivate
#pragma omp parallel for default(none) shared(n) lastprivate(a)
for (int i=0; i<n; i++)
{
   ...
   a = i + 1;
   ...
} // End of parallel for

b = 2 * a; // value of b is 2*n
21 The default clause
Fortran:  default ( none | shared | private | firstprivate )
C/C++:    default ( none | shared )

none
✔ No implicit defaults; have to scope all variables explicitly

shared
✔ All variables are shared
✔ The default in absence of an explicit "default" clause

private (Fortran)
✔ All variables are private to the thread
✔ Includes common block data, unless THREADPRIVATE

firstprivate (Fortran)
✔ All variables are private to the thread; pre-initialized
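A short sketch of why default(none) is useful (illustrative names): every variable referenced in the region must be scoped explicitly, so an accidentally shared temporary is flagged by the compiler:

   #include <omp.h>

   void init(double *a, int n)
   {
      int i;
      /* forgetting shared(a,n) or private(i) here is a compile-time error */
      #pragma omp parallel default(none) shared(a,n) private(i)
      {
         #pragma omp for
         for (i = 0; i < n; i++)
            a[i] = 0.0;
      }
   }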
22 The OpenMP Execution Model
Fork and Join Model

[Figure: the master thread forks a team of worker threads at the start of each parallel region; the team synchronizes (joins) at the end of the region, and only the master thread continues until the next parallel region]
23 Defining Parallelism in OpenMP
❑ OpenMP Team := Master + Workers
❑ A Parallel Region is a block of code executed by all threads simultaneously
☞ The master thread always has thread ID 0
☞ Thread adjustment (if enabled) is only done before entering a parallel region
☞ Parallel regions can be nested, but support for this is implementation dependent
☞ An "if" clause can be used to guard the parallel region; in
case the condition evaluates to "false", the code is executed
serially
24 The Parallel Region
Fortran:
!$omp parallel [clause[[,] clause] ...]
   "this code is executed in parallel"
!$omp end parallel    (implied barrier)

C/C++:
#pragma omp parallel [clause[[,] clause] ...]
{
   "this code is executed in parallel"
} // End of parallel region (note: implied barrier)

A parallel region is a block of code executed by all threads in the team
25 Parallel Region - An Example/1
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello World\n");
return(0);
}
26 Parallel Region - An Example/1
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
   #pragma omp parallel
   {
      printf("Hello World\n");
   } // End of parallel region

   return(0);
}
27 Parallel Region - An Example/2
$ cc -xopenmp -fast hello.c
$ export OMP_NUM_THREADS=2
$ ./a.out
Hello World
Hello World
$ export OMP_NUM_THREADS=4
$ ./a.out
Hello World
Hello World
Hello World
Hello World
$
28 The if clause
if (scalar expression)

✔ Only execute in parallel if the expression evaluates to true
✔ Otherwise, execute serially

#pragma omp parallel if (n > some_threshold) \
        shared(n,x,y) private(i)
{
   #pragma omp for
   for (i=0; i<n; i++)
      x[i] += y[i];
} /*-- End of parallel region --*/
29 Nested Parallelism
[Figure: the master thread opens a 3-way parallel outer region; each of those threads opens a nested parallel region, giving 9-way parallelism, after which execution drops back to the 3-way outer region]

Note: nesting level can be arbitrarily deep
30 Nested Parallelism Support/1
❑ Environment variable and runtime routines to set/get the maximum number of nested active parallel
regions
OMP_MAX_ACTIVE_LEVELS
omp_set_max_active_levels()
omp_get_max_active_levels()
❑ Environment variable and runtime routine to set/get the maximum number of OpenMP threads available to the program
OMP_THREAD_LIMIT
omp_get_thread_limit()
31 Nested Parallelism Support/2
❑ Per-task internal control variables
● Allow, for example, calling
omp_set_num_threads() inside a parallel
region to control the team size for next level of parallelism
❑ Library routines to determine
● Depth of nesting
omp_get_level()
omp_get_active_level()
● IDs of parent/grandparent etc. threads
omp_get_ancestor_thread_num(level)
● Team sizes of parent/grandparent etc. teams
omp_get_team_size(level)
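As a hedged illustration of these query routines (the two-level nest and the team sizes are made up for the example):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      omp_set_nested(1);
      omp_set_max_active_levels(2);
      #pragma omp parallel num_threads(2)            /* outer team              */
      {
         #pragma omp parallel num_threads(3)         /* nested team             */
         {
            #pragma omp critical
            printf("level %d, active level %d, parent thread %d, outer team size %d\n",
                   omp_get_level(),                  /* nesting depth (2 here)   */
                   omp_get_active_level(),           /* active parallel levels   */
                   omp_get_ancestor_thread_num(1),   /* ID of the level-1 parent */
                   omp_get_team_size(1));            /* size of the level-1 team */
         }
      }
      return 0;
   }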
32 A More Elaborate Example
#pragma omp parallel if (n>limit) default(none) \
        shared(n,a,b,c,x,y,z) private(f,i,scale)
{
   f = 1.0;                               // Statement is executed by all threads

   #pragma omp for nowait                 // parallel loop (work is distributed)
   for (i=0; i<n; i++)
      z[i] = x[i] + y[i];

   #pragma omp for nowait                 // parallel loop (work is distributed)
   for (i=0; i<n; i++)
      a[i] = b[i] + c[i];
   ....
   #pragma omp barrier                    // synchronization

   scale = sum(a,0,n) + sum(z,0,n) + f;   // Statement is executed by all threads
   ....
} /*-- End of parallel region --*/
Using OpenMP
34 Using OpenMP
• We have just seen a glimpse of OpenMP
• To be practically useful, much more functionality is needed
• Covered in this section:
– Many of the language constructs
– Features that may be useful or needed when running an OpenMP application
• Note that the tasking concept is covered in a separate
section
35 Components of OpenMP
Directives:
• Parallel region
• Worksharing constructs
• Tasking
• Synchronization
• Data-sharing attributes

Environment variables:
• Number of threads
• Scheduling type
• Dynamic thread adjustment
• Nested parallelism
• Stacksize
• Idle threads
• Active levels
• Thread limit

Runtime environment:
• Number of threads
• Thread ID
• Dynamic thread adjustment
• Nested parallelism
• Schedule
• Active levels
• Thread limit
• Nesting level
• Ancestor thread
• Team size
• Wallclock timer
• Locking
36 Directive format
❑ Fortran: directives are case insensitive
● Syntax: sentinel directive [clause [[,] clause]...]
● The sentinel is one of the following:
✔ !$OMP or C$OMP or *$OMP (fixed format)
✔ !$OMP (free format)
❑ Continuation: follows the language syntax
❑ Conditional compilation: the !$ or C$ sentinel is replaced by 2 spaces when OpenMP compilation is enabled
❑ C: directives are case sensitive
● Syntax: #pragma omp directive [clause [clause] ...]
❑ Continuation: use \ in pragma
❑ Conditional compilation: _OPENMP macro is set
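On the C side, the _OPENMP macro supports the same kind of conditional compilation; a minimal sketch (the output text is illustrative):

   #include <stdio.h>
   #ifdef _OPENMP
   #include <omp.h>
   #endif

   int main(void)
   {
   #ifdef _OPENMP
      /* _OPENMP expands to the yyyymm date of the supported specification */
      printf("OpenMP enabled (spec %d), max threads = %d\n",
             _OPENMP, omp_get_max_threads());
   #else
      printf("Compiled without OpenMP - running serially\n");
   #endif
      return 0;
   }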
37
The reduction clause - Example

      sum = 0.0
!$omp parallel default(none) &
!$omp          shared(n,x) private(i)
!$omp do reduction (+:sum)
      do i = 1, n
         sum = sum + x(i)
      end do
!$omp end do
!$omp end parallel
      print *, sum

Variable SUM is a shared variable

☞ Care needs to be taken when updating shared variable SUM
☞ With the reduction clause, the OpenMP compiler generates code such that a race condition is avoided
38 The reduction clause
reduction ( operator : list )                          C/C++
reduction ( {operator | intrinsic} : list )            Fortran

✔ Reduction variable(s) must be shared variables
✔ A reduction is defined as:

   Fortran:   x = x operator expr
              x = expr operator x
              x = intrinsic (x, expr_list)
              x = intrinsic (expr_list, x)

   C/C++:     x = x operator expr
              x = expr operator x
              x++, ++x, x--, --x
              x <binop>= expr

✔ Note that the value of a reduction variable is undefined from the moment the first thread reaches the clause till the operation has completed
✔ The reduction can be hidden in a function call
☞ Check the docs for details
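A C counterpart of the Fortran example on the previous slide, written as a sketch (the function name and types are illustrative):

   #include <omp.h>

   double array_sum(const double *x, int n)
   {
      double sum = 0.0;
      int i;
      #pragma omp parallel for default(none) shared(n,x) reduction(+:sum)
      for (i = 0; i < n; i++)
         sum += x[i];      /* each thread accumulates a private partial sum */
      return sum;          /* the partial sums have been combined here      */
   }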
39
Fortran - Allocatable Arrays
• Fortran allocatable arrays whose status is
“currently allocated” are allowed to be specified as
private, lastprivate, firstprivate, reduction, or copyprivate
      integer, allocatable, dimension(:) :: A
      integer i

      allocate (A(n))

!$omp parallel private (A)
      do i = 1, n
         A(i) = i
      end do
      ...
!$omp end parallel
40 Barrier/1
Suppose we run each of these two loops in parallel over i:

   for (i=0; i < N; i++)
      a[i] = b[i] + c[i];

   for (i=0; i < N; i++)
      d[i] = a[i] + b[i];

This may give us a wrong answer (one day). Why?
41 Barrier/2
We need to have updated all of a[ ] first, before using a[ ] *

   for (i=0; i < N; i++)
      a[i] = b[i] + c[i];

   (wait!  barrier)

   for (i=0; i < N; i++)
      d[i] = a[i] + b[i];

All threads wait at the barrier point and only continue when all threads have reached the barrier point

*) If there is a guarantee that the mapping of iterations onto threads is identical for both loops, there will not be a data race in this case
42 Barrier/3
[Figure: threads that reach the barrier early sit idle until the last thread arrives]

Barrier syntax in OpenMP:

!$omp barrier
#pragma omp barrier
43 When to use barriers ?
❑ If data is updated asynchronously and data integrity is at risk
❑ Examples:
● Between parts in the code that read and write the same section of memory
● After one timestep/iteration in a solver
❑ Unfortunately, barriers tend to be expensive and also may not scale to a large number of processors
❑ Therefore, use them with care
44 The nowait clause
❑ To minimize synchronization, some directives support the optional nowait clause
● If present, threads do not synchronize/wait at the end of that particular construct
❑ In C, it is one of the clauses on the pragma
❑ In Fortran, it is appended at the closing part of the construct
Fortran:
!$omp do
   :
!$omp end do nowait

C/C++:
#pragma omp for nowait
{
   :
}
45 The Worksharing Constructs
☞ The work is distributed over the threads
☞ Must be enclosed in a parallel region
☞ Must be encountered by all threads in the team, or none at all
☞ No implied barrier on entry; implied barrier on exit (unless nowait is specified)
☞ A work-sharing construct does not launch any new threads

Loop construct:
   C/C++:   #pragma omp for { .... }
   Fortran: !$OMP DO .... !$OMP END DO

Sections construct:
   C/C++:   #pragma omp sections { .... }
   Fortran: !$OMP SECTIONS .... !$OMP END SECTIONS

Single construct:
   C/C++:   #pragma omp single { .... }
   Fortran: !$OMP SINGLE .... !$OMP END SINGLE
46 The Workshare construct
Fortran has a fourth worksharing construct:

!$OMP WORKSHARE
   <array syntax>
!$OMP END WORKSHARE [NOWAIT]

Example:

!$OMP WORKSHARE
   A(1:M) = A(1:M) + B(1:M)
!$OMP END WORKSHARE NOWAIT
47 The omp for/do directive
Fortran:
!$omp do [clauses]
   do ...
      <code-block>
   end do
!$omp end do [nowait]

C/C++:
#pragma omp for [clauses]
   for (...) {
      <code-block>
   }
The iterations of the loop are distributed over the threads
48 The omp for directive - Example
#pragma omp parallel default(none)\
        shared(n,a,b,c,d) private(i)
{
   #pragma omp for nowait
   for (i=0; i<n-1; i++)
      b[i] = (a[i] + a[i+1])/2;

   #pragma omp for nowait
   for (i=0; i<n; i++)
      d[i] = 1.0/c[i];

} /*-- End of parallel region --*/
  (implied barrier)
49 C++: Random Access Iterator Loops
void iterator_example()
{
   std::vector<int> vec(23);
   std::vector<int>::iterator it;
   #pragma omp parallel for default(none) shared(vec)
   for (it = vec.begin(); it < vec.end(); it++) {
      // do work with *it
   }
}
Parallelization of random access iterator loops is supported
50 Loop Collapse
• Allows parallelization of perfectly nested loops without using nested parallelism
• The collapse clause on for/do loop indicates how many loops should be collapsed
• Compiler forms a single loop and then parallelizes it
!$omp parallel do collapse(2) ...
      do i = il, iu, is
         do j = jl, ju, js
            do k = kl, ku, ks
               ...
            end do
         end do
      end do
!$omp end parallel do
51 The schedule clause/1
schedule ( static | dynamic | guided | auto [, chunk] )
schedule ( runtime )

static [, chunk]
✔ Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
✔ In absence of "chunk", each thread executes approx. N/P chunks for a loop of length N and P threads
   ● Details are implementation defined
✔ Under certain conditions, the assignment of iterations to threads is the same across multiple loops in the same parallel region
52 The schedule clause/2
Example static schedule: loop of length 16, 4 threads

                Thread 0     Thread 1     Thread 2     Thread 3
no chunk*       1-4          5-8          9-12         13-16
chunk = 2       1-2, 9-10    3-4, 11-12   5-6, 13-14   7-8, 15-16

*) The precise distribution is implementation defined
53 The schedule clause/3
dynamic [, chunk]
✔ Fixed portions of work; size is controlled by the value of chunk
✔ When a thread finishes, it starts on the next portion of work

guided [, chunk]
✔ Same dynamic behavior as "dynamic", but size of the portion of work decreases exponentially

runtime
✔ Iteration scheduling scheme is set at runtime through environment variable OMP_SCHEDULE

auto
✔ The compiler (or runtime system) decides what is best to use; choice could be implementation dependent
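A sketch of how the clause is typically used (names and workload are made up): dynamic scheduling with a small chunk helps when iteration costs vary widely:

   #include <omp.h>

   void process_rows(double **row, const int *len, int n)
   {
      int i, j;
      /* rows have very different lengths, so static chunks would load-imbalance */
      #pragma omp parallel for schedule(dynamic,4) private(j)
      for (i = 0; i < n; i++)
         for (j = 0; j < len[i]; j++)
            row[i][j] *= 2.0;
   }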
54 Experiment - 500 iterations, 4 threads
[Figure: mapping of the 500 iterations onto the 4 thread IDs (thread ID vs. iteration number) for the static, "dynamic,5" and "guided,5" schedules]
55 Schedule Kinds Functions
❑ Makes schedule(runtime) more general
❑ Can set/get the schedule with library routines:
   omp_set_schedule()
   omp_get_schedule()
❑ Also allows implementations to add their own schedule kinds
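A small sketch of these two routines (OpenMP 3.0); they only affect loops that use schedule(runtime):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      omp_sched_t kind;
      int chunk;

      omp_set_schedule(omp_sched_guided, 8);    /* overrides OMP_SCHEDULE         */
      omp_get_schedule(&kind, &chunk);          /* query what is currently in use */
      printf("runtime schedule: kind=%d, chunk=%d\n", (int)kind, chunk);

      #pragma omp parallel for schedule(runtime)
      for (int i = 0; i < 100; i++) {
         /* work */
      }
      return 0;
   }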
56 Parallel sections
Individual section blocks are executed in parallel

C/C++:
#pragma omp sections [clauses]
{
   #pragma omp section
      {....}
   #pragma omp section
      {....}
   ....
}

Fortran:
!$omp sections [clauses]
!$omp section
   ....
!$omp section
   ....
....
!$omp end sections [nowait]
57 The Sections Directive - Example
#pragma omp parallel default(none)\
        shared(n,a,b,c,d) private(i)
{
   #pragma omp sections nowait
   {
      #pragma omp section
      for (i=0; i<n-1; i++)
         b[i] = (a[i] + a[i+1])/2;

      #pragma omp section
      for (i=0; i<n; i++)
         d[i] = 1.0/c[i];

   } /*-- End of sections --*/
} /*-- End of parallel region --*/
58 Overlap I/O and Processing/1
[Figure: pipelined execution over time; while the input thread reads block i, the processing thread(s) work on block i-1 and the output thread writes block i-2]
59
Overlap I/O and Processing/2
#pragma omp parallel sections
{
   #pragma omp section                 /*-- Input thread --*/
   {
      for (int i=0; i<N; i++) {
         (void) read_input(i);
         (void) signal_read(i);
      }
   }
   #pragma omp section                 /*-- Processing thread(s) --*/
   {
      for (int i=0; i<N; i++) {
         (void) wait_read(i);
         (void) process_data(i);
         (void) signal_processed(i);
      }
   }
   #pragma omp section                 /*-- Output thread --*/
   {
      for (int i=0; i<N; i++) {
         (void) wait_processed(i);
         (void) write_output(i);
      }
   }
} /*-- End of parallel sections --*/
60 The Single Directive
Only one thread in the team executes the code enclosed

Fortran:
!$omp single [private][firstprivate]
   <code-block>
!$omp end single [copyprivate][nowait]

C/C++:
#pragma omp single [private][firstprivate] \
                   [copyprivate][nowait]
{
   <code-block>
}
61
Single processor region/1

Only one thread executes the single region
This construct is ideally suited for I/O or initializations

Original Code:

   ...
   "read A[0..N-1]";
   ...

Parallel Version:

   #pragma omp parallel shared(A)
   {
      ...
      #pragma omp single nowait
      {"read A[0..N-1]";}
      ...
      #pragma omp barrier
      "use A"
      ...
   }
62 Single processor region/2
[Figure: one thread executes the single processor region while the other threads wait, if there is a barrier at the end of the region]
63 Combined work-sharing constructs
Single PARALLEL loop:

   C/C++:
      #pragma omp parallel
      #pragma omp for
      for (...)
   is equivalent to:
      #pragma omp parallel for
      for (....)

   Fortran:
      !$omp parallel
      !$omp do
        ...
      !$omp end do
      !$omp end parallel
   is equivalent to:
      !$omp parallel do
        ...
      !$omp end parallel do

Single PARALLEL sections:

   C/C++:
      #pragma omp parallel
      #pragma omp sections
      { ... }
   is equivalent to:
      #pragma omp parallel sections
      { ... }

   Fortran:
      !$omp parallel
      !$omp sections
        ...
      !$omp end sections
      !$omp end parallel
   is equivalent to:
      !$omp parallel sections
        ...
      !$omp end parallel sections

Single WORKSHARE loop (Fortran only):

      !$omp parallel
      !$omp workshare
        ...
      !$omp end workshare
      !$omp end parallel
   is equivalent to:
      !$omp parallel workshare
        ...
      !$omp end parallel workshare
64 Orphaning
♦ The OpenMP specification does not restrict worksharing and synchronization directives (omp for, omp single, critical, barrier, etc.) to be within the lexical extent of a parallel region. These directives can be orphaned
♦ That is, they can appear outside the lexical extent of a parallel region

   :
#pragma omp parallel
{
   :
   (void) dowork();
   :
}
   :

void dowork()
{
   :
   #pragma omp for     /* orphaned work-sharing directive */
   for (int i=0; i<n; i++) {
      :
   }
   :
}
65 More on orphaning
♦ When an orphaned worksharing or synchronization directive is encountered in the sequential part of the program (outside the dynamic extent of any parallel region), it is executed by the master thread only. In effect, the directive will be ignored

(void) dowork();       /* Sequential FOR */

#pragma omp parallel
{
   (void) dowork();    /* Parallel FOR */
}

void dowork()
{
   #pragma omp for
   for (i=0; ....) {
      :
   }
}
66 Example - Parallelizing Bulky Loops
for (i=0; i<n; i++) /* Parallel loop */
{
a = ...
b = ... a ..
c[i] = ....
...
for (j=0; j<m; j++) {
<a lot more code in this loop>
}
...
}
67 Step 1: “Outlining”
for (i=0; i<n; i++)   /* Parallel loop */
{
   (void) FuncPar(i,m,c,...);
}

void FuncPar(i,m,c,....)
{
   float a, b;   /* Private data */
   int   j;

   a = ...
   b = ... a ..
   c[i] = ....
   ...
   for (j=0; j<m; j++) {
      <a lot more code in this loop>
   }
   ...
}

Still a sequential program
Should behave identically
Easy to test for correctness
But, parallel by design
68 Step 2: Parallelize
#pragma omp parallel for private(i) shared(m,c,..)
for (i=0; i<n; i++)   /* Parallel loop */
{
   (void) FuncPar(i,m,c,...);
} /*-- End of parallel for --*/

Minimal scoping required; less error prone

void FuncPar(i,m,c,....)
{
   float a, b;   /* Private data */
   int   j;

   a = ...
   b = ... a ..
   c[i] = ....
   ...
   for (j=0; j<m; j++) {
      <a lot more code in this loop>
   }
   ...
}
69 Additional Directives/1
!$omp master
<code-block>
!$omp end master
#pragma omp master {<code-block>}
!$omp atomic
#pragma omp atomic
!$omp critical [(name)]
<code-block>
!$omp end critical [(name)]
#pragma omp critical [(name)]
{<code-block>}
70 The Master Directive
!$omp master
<code-block>
!$omp end master
Only the master thread executes the code block:

#pragma omp master
   {<code-block>}

There is no implied barrier on entry or exit!
71 Critical Region/1
If sum is a shared variable, this loop can not run in parallel by simply using a "#pragma omp for":

   for (i=0; i < n; i++) {
      ...
      sum += a[i];
      ...
   }

With a critical region, all threads execute the update, but only one at a time will do so:

   #pragma omp parallel for
   for (i=0; i < n; i++) {
      ...
      #pragma omp critical
      {sum += a[i];}
      ...
   }
72 Critical Region/2
❑ Useful to avoid a race condition, or to perform I/O (but that still has random order)
❑ Be aware that there is a cost associated with a critical region
[Figure: over time, threads enter the critical region one after another while the others wait]
73 Critical and Atomic constructs
Critical: all threads execute the code, but only one at a time:

   !$omp critical [(name)]
      <code-block>
   !$omp end critical [(name)]

   #pragma omp critical [(name)]
      {<code-block>}

   There is no implied barrier on entry or exit!

Atomic: only the load and store of the updated variable are atomic; this is a lightweight, special form of a critical section:

   !$omp atomic
      <statement>

   #pragma omp atomic
      <statement>

   Example:
   #pragma omp atomic
      a[indx[i]] += b[i];
74 Additional Directives/2
!$omp ordered
<code-block>
!$omp end ordered
#pragma omp ordered {<code-block>}
!$omp flush [(list)]
#pragma omp flush [(list)]
75 Additional Directives/2
The ordered construct: the enclosed block of code is executed in the order in which iterations would be executed sequentially (may introduce serialization; could be expensive):

   !$omp ordered
      <code-block>
   !$omp end ordered

   #pragma omp ordered
      {<code-block>}

The flush construct ensures that all threads in a team have a consistent view of certain objects in memory. In the absence of a list, all visible variables are flushed:

   !$omp flush [(list)]

   #pragma omp flush [(list)]
76 The flush directive
Thread A:
   X = 0
   while (X == 0) {
      "wait"
   }

Thread B:
   X = 1

If shared variable X is kept within a register, the modification may not be made visible to the other thread(s)
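A hedged sketch of this situation (the sections structure is an assumption; a production version would also have to make the accesses to X atomic):

   #include <omp.h>

   int main(void)
   {
      int X = 0;
      #pragma omp parallel sections shared(X)
      {
         #pragma omp section         /* Thread A: spin until X becomes 1      */
         {
            for (;;) {
               #pragma omp flush(X)  /* force a fresh read of X from memory   */
               if (X == 1) break;
            }
         }
         #pragma omp section         /* Thread B: set the flag                */
         {
            X = 1;
            #pragma omp flush(X)     /* make the write to X visible to others */
         }
      }
      return 0;
   }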
77
Implied Flush Regions/1
• During a barrier region
• At exit from worksharing regions, unless a nowait is present
• At entry to and exit from parallel, critical, ordered and parallel worksharing regions
• During omp_set_lock and omp_unset_lock regions
• During omp_test_lock, omp_set_nest_lock, omp_unset_nest_lock and omp_test_nest_lock regions, if the region causes the lock to be set or unset
• Immediately before and after every task scheduling point
78 Implied Flush Regions/2
• At entry to and exit from atomic regions, where the list contains only the variable updated in the atomic construct
• A flush region is not implied at the following locations:
– At entry to a worksharing region
– At entry to or exit from a master region
79
OpenMP and Global Data
80
Global data - An example

file global.h:
      common /work/a(m,n),b(m)

      program global_data
      ....
      include "global.h"
      ....
!$omp parallel do private(j)
      do j = 1, n
         call suba(j)
      end do
!$omp end parallel do
      ...

      subroutine suba(j)
      ...
      include "global.h"
      ...
      do i = 1, m
         b(i) = j
      end do
      do i = 1, m
         a(i,j) = func_call(b(i))
      end do
      return
      end

Data Race!
81 Global data - A Data Race!
Thread 1: call suba(1)                  Thread 2: call suba(2)

   subroutine suba(j=1)                    subroutine suba(j=2)
      ....                                    ....
      do i = 1, m                             do i = 1, m
         b(i) = 1                                b(i) = 2
      end do                                  end do
      do i = 1, m                             do i = 1, m
         a(i,1) = func_call(b(i))                a(i,2) = func_call(b(i))
      end do                                  end do

Array b in common block /work/ is shared: both threads write and read it at the same time
82
Example - Solution

file global_ok.h:
      integer, parameter:: nthreads=4
      common /work/a(m,n)
      common /tprivate/b(m,nthreads)

☞ By expanding array B, we can give each thread unique access to its storage area
☞ Note that this can also be done using dynamic memory (allocatable, malloc, ....)

      program global_data
      ....
      include "global_ok.h"
      ....
!$omp parallel do private(j)
      do j = 1, n
         call suba(j)
      end do
!$omp end parallel do
      ...

      subroutine suba(j)
      ...
      include "global_ok.h"
      ...
      TID = omp_get_thread_num()+1
      do i = 1, m
         b(i,TID) = j
      end do
      do i = 1, m
         a(i,j) = func_call(b(i,TID))
      end do
      return
      end
83
About global data
• Global data is shared and requires special care
• A problem may arise in case multiple threads access the same memory section simultaneously:
– Read-only data is no problem
– Updates have to be checked for race conditions
• It is your responsibility to deal with this situation
• In general one can do the following:
– Split the global data into a part that is accessed in serial parts only and a part that is accessed in parallel
– Manually create thread private copies of the latter
– Use the thread ID to access these private copies
• Alternative: Use OpenMP's threadprivate directive
84 The threadprivate directive
❑ Thread private copies of the designated global variables and common blocks are created
❑ Several restrictions and rules apply when doing this:
● The number of threads has to remain the same for all the parallel regions (i.e. no dynamic threads)
✔ Oracle implementation supports changing the number of threads
● Initial data is undefined, unless copyin is used
● ...
❑ Check the documentation when using threadprivate !
!$omp threadprivate (/cb/ [,/cb/] ...)
#pragma omp threadprivate (list)
85 Example - Solution 2

file global_ok2.h:
      common /work/a(m,n)
      common /tprivate/b(m)
!$omp threadprivate(/tprivate/)

☞ The compiler creates thread private copies of array B, to give each thread unique access to its storage area
☞ Note that the number of copies is automatically adjusted to the number of threads

      program global_data
      ....
      include "global_ok2.h"
      ....
!$omp parallel do private(j)
      do j = 1, n
         call suba(j)
      end do
!$omp end parallel do
      ...
      stop
      end

      subroutine suba(j)
      ...
      include "global_ok2.h"
      ...
      do i = 1, m
         b(i) = j
      end do
      do i = 1, m
         a(i,j) = func_call(b(i))
      end do
      return
      end
86 The copyin clause
copyin (list)
✔ Applies to THREADPRIVATE common blocks only
✔ At the start of the parallel region, data of the master thread is copied to the thread private copies
Example:

      common /cblock/velocity
      common /fields/xfield, yfield, zfield

!     create thread private common blocks
!$omp threadprivate (/cblock/, /fields/)

!$omp parallel            &
!$omp default (private)   &
!$omp copyin ( /cblock/, zfield )
87 C++ and Threadprivate
❑ As of OpenMP 3.0, it has been clarified where/how threadprivate objects are constructed and destructed
❑ Allow C++ static class members to be threadprivate
class T {
   public:
      static int i;
#pragma omp threadprivate(i)
      ...
};
88
OpenMP Runtime Routines
89 OpenMP Runtime Functions/1
Name                        Functionality
omp_set_num_threads         Set number of threads
omp_get_num_threads         Number of threads in team
omp_get_max_threads         Max num of threads for parallel region
omp_get_thread_num          Get thread ID
omp_get_num_procs           Maximum number of processors
omp_in_parallel             Check whether in parallel region
omp_set_dynamic             Activate dynamic thread adjustment
                            (but implementation is free to ignore this)
omp_get_dynamic             Check for dynamic thread adjustment
omp_set_nested              Activate nested parallelism
                            (but implementation is free to ignore this)
omp_get_nested              Check for nested parallelism
omp_get_wtime               Returns wall clock time
omp_get_wtick               Number of seconds between clock ticks

C/C++   : Need to include file <omp.h>
Fortran : Add "use omp_lib" or include file "omp_lib.h"
90 OpenMP Runtime Functions/2
Name                             Functionality
omp_set_schedule                 Set schedule (if "runtime" is used)
omp_get_schedule                 Returns the schedule in use
omp_get_thread_limit             Max number of threads for program
omp_set_max_active_levels        Set number of active parallel regions
omp_get_max_active_levels        Number of active parallel regions
omp_get_level                    Number of nested parallel regions
omp_get_active_level             Number of nested active par. regions
omp_get_ancestor_thread_num      Thread id of ancestor thread
omp_get_team_size (level)        Size of the thread team at this level

C/C++   : Need to include file <omp.h>
Fortran : Add "use omp_lib" or include file "omp_lib.h"
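As a sketch of the timing routines from the first table (the array size and loop body are illustrative):

   #include <stdio.h>
   #include <omp.h>

   #define N 1000000

   static double a[N];

   int main(void)
   {
      double t0 = omp_get_wtime();        /* wall clock time before the region */

      #pragma omp parallel for
      for (int i = 0; i < N; i++)
         a[i] = 2.0 * i;

      double t1 = omp_get_wtime();        /* wall clock time after the region  */
      printf("elapsed %.6f s, timer resolution %g s\n", t1 - t0, omp_get_wtick());
      return 0;
   }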
91 OpenMP locking routines
❑ Locks provide greater flexibility over critical sections and atomic updates:
● Possible to implement asynchronous behavior
● Not block structured
❑ The so-called lock variable is a special variable:
● C/C++: type omp_lock_t and omp_nest_lock_t for nested locks
● Fortran: type INTEGER and of a KIND large enough to hold an address
❑ Lock variables should be manipulated through the API only
❑ It is illegal, and behavior is undefined, in case a lock variable is used without the appropriate initialization
92 Nested locking
❑ Simple locks: may not be locked if already in a locked state
❑ Nestable locks: may be locked multiple times by the same thread before being unlocked
❑ In the remainder, we discuss simple locks only
❑ The interface for functions dealing with nested locks is similar (but using nestable lock variables):
Simple locks            Nestable locks
omp_init_lock           omp_init_nest_lock
omp_destroy_lock        omp_destroy_nest_lock
omp_set_lock            omp_set_nest_lock
omp_unset_lock          omp_unset_nest_lock
omp_test_lock           omp_test_nest_lock
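A C sketch of the simple-lock interface listed above (the Fortran version of the same idea follows on the next slides; the work done under the lock is illustrative):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      omp_lock_t lck;
      int sum = 0;

      omp_init_lock(&lck);
      #pragma omp parallel shared(lck,sum)
      {
         while (!omp_test_lock(&lck)) {   /* try the lock; returns nonzero on success  */
            /* do_something_else();          keep busy while the lock is held elsewhere */
         }
         sum += omp_get_thread_num();     /* protected update of a shared variable     */
         omp_unset_lock(&lck);
      }
      omp_destroy_lock(&lck);
      printf("sum = %d\n", sum);
      return 0;
   }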
93 OpenMP locking example
[Figure: within the parallel region, threads 0 and 1 each do other work, acquire the lock, execute the protected region, and release the lock again]

♦ The protected region contains the update of a shared variable
♦ One thread acquires the lock and performs the update
♦ Meanwhile, the other thread performs some other work
♦ When the lock is released again, the other thread performs the update
94 Locking Example - The Code
      Program Locks
      ....
      Call omp_init_lock (LCK)          ! Initialize lock variable

!$omp parallel shared(LCK)

      Do While ( omp_test_lock (LCK) .EQV. .FALSE. )   ! Check availability of lock
         Call Do_Something_Else()                      ! (also sets the lock)
      End Do

      Call Do_Work()

      Call omp_unset_lock (LCK)         ! Release lock again

!$omp end parallel

      Call omp_destroy_lock (LCK)       ! Remove lock association

      Stop
      End
95 Example output for 2 threads
TID: 1 at 09:07:27 => entered parallel region
TID: 1 at 09:07:27 => done with WAIT loop and has the lock
TID: 1 at 09:07:27 => ready to do the parallel work
TID: 1 at 09:07:27 => this will take about 18 seconds
TID: 0 at 09:07:27 => entered parallel region
TID: 0 at 09:07:27 => WAIT for lock - will do something else for 5 seconds
TID: 0 at 09:07:32 => WAIT for lock - will do something else for 5 seconds
TID: 0 at 09:07:37 => WAIT for lock - will do something else for 5 seconds
TID: 0 at 09:07:42 => WAIT for lock - will do something else for 5 seconds
TID: 1 at 09:07:45 => done with my work
TID: 1 at 09:07:45 => done with work loop - released the lock
TID: 1 at 09:07:45 => ready to leave the parallel region
TID: 0 at 09:07:47 => done with WAIT loop and has the lock
TID: 0 at 09:07:47 => ready to do the parallel work
TID: 0 at 09:07:47 => this will take about 18 seconds
TID: 0 at 09:08:05 => done with my work
TID: 0 at 09:08:05 => done with work loop - released the lock
TID: 0 at 09:08:05 => ready to leave the parallel region
Done at 09:08:05 - value of SUM is 1100

Note: the program has been instrumented to get this information; the value of SUM is used to check the answer
96
OpenMP Environment Variables
97 OpenMP Environment Variables
Note: the names are in uppercase, the values are case insensitive

OpenMP environment variable              Default for Oracle Solaris Studio
OMP_NUM_THREADS n                         1
OMP_SCHEDULE "schedule,[chunk]"           static, "N/P"
OMP_DYNAMIC { TRUE | FALSE }              TRUE
OMP_NESTED { TRUE | FALSE }               FALSE
OMP_STACKSIZE size [B|K|M|G]              4 MB (32-bit) / 8 MB (64-bit)
OMP_WAIT_POLICY [ACTIVE | PASSIVE]        PASSIVE
OMP_MAX_ACTIVE_LEVELS                     4
OMP_THREAD_LIMIT                          1024
98 Implementing the Fork-Join Model
Use the OMP_WAIT_POLICY environment variable to control the behaviour of idle threads

[Figure: the master thread runs the serial parts between parallel regions; after the barrier that ends a parallel region, the worker threads are idle until the next parallel region starts, and the wait policy determines what they do during that time]
99 About the Stack
#pragma omp parallel shared(Aglobal)
{
   (void) myfunc(&Aglobal);
}

void myfunc(float *Aglobal)
{
   int Alocal;
   ...
}

Variable Alocal is in private memory, managed by the thread owning it, and stored on the so-called stack; each thread has its own copy of Alocal, while Aglobal is shared by all threads
Tasking In OpenMP
101