Parallel Computing in Fortran with OpenMP
January 27, 2009
Both the GNU and Intel Fortran compilers have native support for
OpenMP. All you need to do to create
parallel programs is to add some OpenMP directives, specially
formatted comments, to your code and compile the program with a
special flag to enable OpenMP. For gfortran, compile the program with -fopenmp; for ifort, the flag is -openmp.
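For example, assuming a source file named myprog.f90 (a placeholder name used only for illustration), the two compile commands would look like:
% gfortran -fopenmp -o myprog myprog.f90
% ifort -openmp -o myprog myprog.f90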
The trick to getting the most out of parallel processing is to run sections of code in parallel such that the additional overhead of starting new threads is outweighed by the time required to perform the task. That is, it’s ideal to split the code into distinct tasks that are each sufficiently hard to compute on their own, say taking more than a couple of seconds each, rather than simple tasks that execute very quickly.
In many cases you will have sections of code that are embarrassingly parallel. That is, a group of tasks needs to be carried out multiple times and each task is independent of the others. Monte Carlo experiments are good examples of this: each trial is independent of the rest. At the end, all of the trials simply need to be collected, in any order, so that summary statistics such as the mean and standard deviation can be calculated.
A Parallel Monte Carlo Experiment
Here is a short example program, mc1, to get started. Suppose we have a function called monte_carlo which we need to run several times with different data. Each run returns some result and we would like to compute the mean of these results. In the example below, monte_carlo(n) draws n Uniform(0,1) random numbers and returns the sample mean.
program mc1
  implicit none

  integer, parameter :: nmc = 100    ! number of trials
  integer, parameter :: n = 100000   ! sample size
  real, dimension(nmc) :: mu         ! result of each trial
  real :: mean, stdev                ! mean and standard deviation
  integer :: j

  !$OMP PARALLEL DO
  do j = 1, nmc
     print *, 'Experiment', j
     mu(j) = monte_carlo(n)
  end do
  !$OMP END PARALLEL DO

  mean = sum(mu) / dble(nmc)
  stdev = sqrt( sum( (mu - mean)**2 ) ) / dble(nmc)

  print *, 'mean', mean
  print *, 'stdev', stdev

contains

  ! Draws n Uniform(0,1) random numbers and returns the sample mean
  function monte_carlo(n) result(y)
    integer, intent(in) :: n
    integer :: i
    real :: y, u

    y = 0.d0
    do i = 1, n
       call random_number(u)   ! draw from Uniform(0,1)
       y = y + u               ! sum the draws
    end do
    y = y / dble(n)            ! return the sample mean
  end function monte_carlo

end program mc1
Depending on your computer, because of the simplicity of this experiment you will probably need to set n very high in order to see the difference in execution speed. In fact, for small values of n (even 100000 in my case) the parallel code is actually slower. This illustrates the earlier point about the overhead of spawning new threads: the chunks of code that are parallelized should be sufficiently demanding that the computational gain outweighs that overhead.
First, we can compile the program as usual and run it, executing the Monte Carlo experiments in order:
% gfortran -O3 -o mc1 mc1.f90
% ./mc1
Experiment 1
Experiment 2
...
Experiment 99
Experiment 100
mean 0.49996144
stdev 8.68058487E-05
Now, if we compile the program with OpenMP and run it, we see that the experiments seem to be allocated evenly across (in my case) two processors and are executed out of order:
% gfortran -O3 -fopenmp -o mc1 mc1.f90
% ./mc1
Experiment 1
Experiment 51
...
Experiment 49
Experiment 50
mean 0.49996150
stdev 9.25014756E-05
If this doesn’t work immediately on your system, you may need to set the OMP_NUM_THREADS environment variable depending on the number of processors in your system. For example, in bash, zsh, and other compatible shells, for a system with 16 processors:
% export OMP_NUM_THREADS=16
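If you want to confirm how many threads the runtime will actually use, OpenMP’s standard omp_lib module can report it. Below is a minimal sketch (the program name check_threads is just an illustration); omp_get_max_threads is part of the standard OpenMP Fortran interface:
program check_threads
  use omp_lib
  implicit none
  ! Reports the maximum number of threads the OpenMP runtime will use,
  ! which reflects OMP_NUM_THREADS when that variable is set.
  print *, 'Maximum OpenMP threads:', omp_get_max_threads()
end program check_threads
Compile it with -fopenmp (or -openmp for ifort), just like the examples above.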
The Perils of Parallel Pseudo-Random Number Generation
Note that the above answers are different! Even this simple example shows that parallel programs can behave differently than their serial counterparts in unforeseen ways. Why are the results different? In this case we are drawing pseudo-random numbers from a stream that depends on a seed. If the samples are just taken out of order the mean and standard deviation should be identical. However, if the samples themselves are different, because multiple processors are drawing random numbers from the same stream simultaneously, then the results will be different.
Even setting the seed to a predefined value before each experiment will not fix the problem. You can verify this yourself with the following set_seed subroutine:
! Determines the dimension of the seed vector and sets the seed of
! the random number generator in a deterministic way.
subroutine set_seed(index)
  integer, intent(in) :: index
  integer :: i, n
  integer, dimension(:), allocatable :: seed

  call random_seed(SIZE = n)
  allocate(seed(n))
  seed = index + 37 * (/ (i - 1, i = 1, n) /)
  call random_seed(PUT = seed)
end subroutine set_seed
Even after calling set_seed(j) before the call to monte_carlo, the means and standard deviations will differ. The problem is that there is actually only one random number generator (RNG), and if one of the threads sets the seed, it affects the other active threads. The states are not independent.
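Concretely, the attempted fix amounts to something like the following sketch of the parallel loop (otherwise identical to mc1), which still fails because every thread shares the single intrinsic generator:
!$OMP PARALLEL DO
do j = 1, nmc
   call set_seed(j)         ! attempt to give trial j a reproducible seed
   mu(j) = monte_carlo(n)   ! all threads still draw from one shared RNG
end do
!$OMP END PARALLEL DO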
A Simple Parallel-Ready Random Number Generator
To fix this problem, each thread needs to have its own independent RNG. There is no simple way to accomplish this with Fortran’s random_number intrinsic. Instead, we need a more sophisticated RNG that can have multiple “instances.” Fortran’s user-defined types provide a nice solution. If we create a type that will store the state of each stream of random numbers, we can then build an array of RNGs, one for each thread or experiment, each started at a different seed.
For the purposes of illustration, a simple random number generator based on the mzran algorithm of Marsaglia and Zaman (1994) is provided in module rng below. The module provides a type, rng_t, and two procedures, rng_seed and rng_uniform, for setting the seed (with a single integer) and drawing a Uniform(0,1) random number.
module rng
  implicit none

  private
  public :: rng_t, rng_seed, rng_uniform

  ! Dimension of the state
  integer, parameter :: ns = 4

  ! Default seed vector
  integer, parameter, dimension(ns) :: default_seed &
       = (/ 521288629, 362436069, 16163801, 1131199299 /)

  ! A data type for storing the state of the RNG
  type :: rng_t
     integer, dimension(ns) :: state = default_seed
  end type rng_t

contains

  ! Seeds the RNG using a single integer and a default seed vector.
  subroutine rng_seed(self, seed)
    type(rng_t), intent(inout) :: self
    integer, intent(in) :: seed
    self%state(1) = seed
    self%state(2:ns) = default_seed(2:ns)
  end subroutine rng_seed

  ! Draws a uniform real number on [0,1].
  function rng_uniform(self) result(u)
    type(rng_t), intent(inout) :: self
    real :: u
    integer :: imz

    imz = self%state(1) - self%state(3)
    if (imz < 0) imz = imz + 2147483579
    self%state(1) = self%state(2)
    self%state(2) = self%state(3)
    self%state(3) = imz
    self%state(4) = 69069 * self%state(4) + 1013904243
    imz = imz + self%state(4)
    u = 0.5d0 + 0.23283064d-9 * imz
  end function rng_uniform

end module rng
At the beginning of our mc program we now must add a use rng statement to include the rng module. In addition to the array mu, for storing the results, we now create an array of random number generators:
type(rng_t), dimension(nmc) :: rng
There is thus one variable of type rng_t for each experiment. The rng_t type simply stores the state of the random number generator, an integer vector of length four, as can be seen in the source of the rng module above. The state is initialized to the default seed. Calling rng_seed(rng, seed) for an rng_t variable rng and an integer seed will set the first element of the state to seed and the remaining elements to the default values. Each value of seed will thus result in an independent stream of pseudo-random draws obtained via calls to rng_uniform(rng).
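As a quick illustration, here is a small standalone test program (hypothetical, not part of the Monte Carlo code) showing that two generators seeded with different values advance their own states independently:
program rng_demo
  use rng
  implicit none
  type(rng_t) :: a, b

  ! Each rng_t variable carries its own state vector
  call rng_seed(a, 1)
  call rng_seed(b, 2)

  ! Draws from a do not affect draws from b, and vice versa
  print *, rng_uniform(a), rng_uniform(b)
  print *, rng_uniform(a), rng_uniform(b)
end program rng_demo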
Now, before each Monte Carlo trial we set the seed of the j-th RNG to a random prime number, say 932117, plus the index of the experiment, j:
call rng_seed(rng(j), 932117 + j)
We modify the monte_carlo function to accept the rng as an argument and use rng_uniform instead of random_number:
! Draws n Uniform(0,1) random numbers and returns the sample mean
function monte_carlo(rng, n) result(y)
  type(rng_t), intent(inout) :: rng
  integer, intent(in) :: n
  integer :: i
  real :: y, u

  y = 0.d0
  do i = 1, n
     u = rng_uniform(rng)   ! draw from Uniform(0,1)
     y = y + u              ! sum the draws
  end do
  y = y / dble(n)           ! return the sample mean
end function monte_carlo
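Putting the pieces together, the parallel loop of the new program (compiled as mc3 below) looks roughly like this sketch:
!$OMP PARALLEL DO
do j = 1, nmc
   print *, 'Experiment', j
   call rng_seed(rng(j), 932117 + j)   ! independent stream for trial j
   mu(j) = monte_carlo(rng(j), n)      ! pass that trial's generator
end do
!$OMP END PARALLEL DO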
Compiling the program is only slightly more complicated due to the additional file. As expected, the serial and parallel versions now produce exactly the same results:
% gfortran -O3 -o mc3 rng.f90 mc3.f90
% ./mc3
Experiment 1
Experiment 2
...
Experiment 99
Experiment 100
mean 0.50012332
stdev 8.55281323E-05
% gfortran -O3 -fopenmp -o mc3 rng.f90 mc3.f90
% ./mc3
Experiment 51
Experiment 1
...
Experiment 49
Experiment 50
mean 0.50012332
stdev 8.55281323E-05