# Parallel Computing in Fortran with OpenMP

January 27, 2009

Both the GNU and Intel Fortran compilers have native support for
OpenMP. All you need to do to create parallel programs is to add
some OpenMP directives, specially formatted comments, to your code
and compile the program with a special flag to enable OpenMP. For
`gfortran`, compile the program with `-fopenmp`; for `ifort`, the
flag is `-openmp`.

The trick to getting the most out of parallel processing is to run sections of code in parallel such that the additional overhead of starting new threads is outweighed by the time required to perform the task. That is, it's ideal to split the code into distinct tasks that are each sufficiently hard to compute on their own (ones that might take more than a couple of seconds to compute) rather than simple tasks that execute very quickly.

In many cases you will have sections of code that are *embarrassingly
parallel*: a group of tasks needs to be carried out multiple
times and each task is independent of the others. Monte Carlo
experiments are good examples of this. Each trial is independent of
the rest. At the end, all of the trials simply need to be collected,
in any order, so that summary statistics such as the mean and standard
deviation can be calculated.

## A Parallel Monte Carlo Experiment

Here is a short example program, `mc1`, to get started.
Suppose we have a function called `monte_carlo` which we need to run
several times with different data. Each run returns some result and
we would like to compute the mean of these results. In the example
below, `monte_carlo(n)` draws `n` Uniform(0,1) random numbers and
returns the sample mean.

```
program mc1
  implicit none

  integer, parameter :: nmc = 100   ! number of trials
  integer, parameter :: n = 100000  ! sample size
  real, dimension(nmc) :: mu        ! result of each trial
  real :: mean, stdev               ! mean and standard deviation
  integer :: j

  !$OMP PARALLEL DO
  do j = 1, nmc
     print *, 'Experiment', j
     mu(j) = monte_carlo(n)
  end do
  !$OMP END PARALLEL DO

  mean = sum(mu) / dble(nmc)
  stdev = sqrt( sum( (mu - mean)**2 ) ) / dble(nmc)

  print *, 'mean', mean
  print *, 'stdev', stdev

contains

  ! Draws n Uniform(0,1) random numbers and returns the sample mean
  function monte_carlo(n) result(y)
    integer, intent(in) :: n
    integer :: i
    real :: y, u
    y = 0.d0
    do i = 1, n
       call random_number(u)  ! draw from Uniform(0,1)
       y = y + u              ! sum the draws
    end do
    y = y / dble(n)           ! return the sample mean
  end function monte_carlo

end program mc1
```

Depending on your computer, because of the simplicity of this
experiment you will probably need to set `n` very high in order to
see a difference in execution speed. In fact, for small values of
`n` (even 100000 in my case) the parallel code is actually *slower*.
This illustrates the earlier point about the overhead of spawning
new threads: the chunks of code that are parallelized should be
sufficiently hard to keep the computational throughput high.

First, we can compile the program as usual and run it, executing the Monte Carlo experiments in order:

```
% gfortran -O3 -o mc1 mc1.f90
% ./mc1
Experiment 1
Experiment 2
...
Experiment 99
Experiment 100
mean 0.49996144
stdev 8.68058487E-05
```

Now, if we compile the program with OpenMP and run it, we see that the experiments seem to be allocated evenly across (in my case) two processors and are executed out of order:

```
% gfortran -O3 -fopenmp -o mc1 mc1.f90
% ./mc1
Experiment 1
Experiment 51
...
Experiment 49
Experiment 50
mean 0.49996150
stdev 9.25014756E-05
```

If this doesn’t work immediately on your system, you may need to set
the `OMP_NUM_THREADS` environment variable to match the number of
processors in your system. For example, in bash, zsh, and other
compatible shells, for a system with 16 processors:

```
% export OMP_NUM_THREADS=16
```
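
To confirm how many threads the OpenMP runtime will actually use, a small diagnostic program can query the runtime through the standard `omp_lib` module. This is a minimal sketch (the program name `threads` is arbitrary); compile it with the same `-fopenmp` flag:

```
program threads
  use omp_lib
  implicit none
  ! Maximum number of threads available for the next parallel region
  print *, 'max threads:', omp_get_max_threads()
  !$OMP PARALLEL
  ! Each thread reports its own id: 0, 1, ..., nthreads-1
  print *, 'hello from thread', omp_get_thread_num()
  !$OMP END PARALLEL
end program threads
```

Without `-fopenmp` the directives are treated as ordinary comments, so the parallel region runs on a single thread.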

## The Perils of Parallel Pseudo-Random Number Generation

Note that the above answers are different! Even this simple example shows that parallel programs can behave differently than their serial counterparts in unforeseen ways. Why are the results different? In this case we are drawing pseudo-random numbers from a stream that depends on a seed. If the samples are just taken out of order the mean and standard deviation should be identical. However, if the samples themselves are different, because multiple processors are drawing random numbers from the same stream simultaneously, then the results will be different.

Even setting the seed to a predefined value before each experiment
will not fix the problem. You can verify this yourself with the
following `set_seed` subroutine:

```
! Determines the dimension of the seed vector and sets the seed of
! the random number generator in a deterministic way.
subroutine set_seed(index)
  integer, intent(in) :: index
  integer :: i, n
  integer, dimension(:), allocatable :: seed
  call random_seed(SIZE = n)
  allocate(seed(n))
  seed = index + 37 * (/ (i - 1, i = 1, n) /)
  call random_seed(PUT = seed)
end subroutine set_seed
```

Even after calling `set_seed(j)` before the call to `monte_carlo`,
the means and standard deviations will differ. The problem is that
there is actually only one random number generator (RNG), and if one
thread sets the seed, it affects all other active threads. The
streams are not independent.
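
Concretely, the failing attempt looks like the following sketch of the main loop (assuming `set_seed` from above has been added to the `contains` section of `mc1`):

```
!$OMP PARALLEL DO
do j = 1, nmc
   call set_seed(j)        ! re-seeds the single, global RNG ...
   mu(j) = monte_carlo(n)  ! ... so concurrent threads disturb each other's draws
end do
!$OMP END PARALLEL DO
```

In serial execution this is perfectly reproducible; in parallel execution, a seed set by one thread is immediately clobbered by another.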

## A Simple Parallel-Ready Random Number Generator

To fix this problem, each thread needs to have its own independent
RNG. There is no simple way to accomplish this with Fortran’s
`random_number`

intrinsic. Instead, we need a more sophisticated RNG
that can have multiple “instances.” Fortran’s user-defined types
provide a nice solution. If we create a type that will store the
state of each stream of random numbers, we can then build an array of
RNGs–one for each thread or experiment–each started at a different
seed.

For the purposes of illustration, a simple random number generator
based on the `mzran` algorithm of Marsaglia and Zaman (1994) is
provided in the module `rng` below. The module provides a type,
`rng_t`, and two procedures, `rng_seed` and `rng_uniform`, for
setting the seed (with a single integer) and drawing a Uniform(0,1)
random number.

```
module rng
  implicit none
  private
  public :: rng_t, rng_seed, rng_uniform

  ! Dimension of the state
  integer, parameter :: ns = 4

  ! Default seed vector
  integer, parameter, dimension(ns) :: default_seed &
       = (/ 521288629, 362436069, 16163801, 1131199299 /)

  ! A data type for storing the state of the RNG
  type :: rng_t
     integer, dimension(ns) :: state = default_seed
  end type rng_t

contains

  ! Seeds the RNG using a single integer and a default seed vector.
  subroutine rng_seed(self, seed)
    type(rng_t), intent(inout) :: self
    integer, intent(in) :: seed
    self%state(1) = seed
    self%state(2:ns) = default_seed(2:ns)
  end subroutine rng_seed

  ! Draws a uniform real number on [0,1].
  function rng_uniform(self) result(u)
    type(rng_t), intent(inout) :: self
    real :: u
    integer :: imz
    imz = self%state(1) - self%state(3)
    if (imz < 0) imz = imz + 2147483579
    self%state(1) = self%state(2)
    self%state(2) = self%state(3)
    self%state(3) = imz
    self%state(4) = 69069 * self%state(4) + 1013904243
    imz = imz + self%state(4)
    u = 0.5d0 + 0.23283064d-9 * imz
  end function rng_uniform

end module rng
```
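
As a quick sanity check, two generators seeded identically stay in lockstep, while each `rng_t` variable otherwise evolves independently. A minimal test program (assuming the module above is saved as `rng.f90` and compiled alongside it):

```
program test_rng
  use rng
  implicit none
  type(rng_t) :: a, b
  integer :: i
  real :: ua, ub

  ! Seed both generators with the same integer
  call rng_seed(a, 12345)
  call rng_seed(b, 12345)

  do i = 1, 5
     ua = rng_uniform(a)
     ub = rng_uniform(b)
     ! Identically seeded generators produce identical draws
     print *, ua, ub
  end do
end program test_rng
```

Re-seeding `b` with a different integer would produce a distinct stream, which is exactly the property we need for parallel trials.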

At the beginning of our `mc` program we now must add a `use rng`
statement to include the `rng` module. In addition to the array
`mu`, for storing the results, we now create an array of random
number generators:

```
type(rng_t), dimension(nmc) :: rng
```

There is thus one variable of type `rng_t` for each experiment. The
`rng_t` type simply stores the state of the random number generator,
an integer vector of length four, as can be seen in the source of
the `rng` module above.

The state is initialized to the default seed. Calling
`rng_seed(rng, seed)` for an `rng_t` variable `rng` and an integer
`seed` will set the first element of the state to `seed` and the
remaining elements to the default values. Each value of `seed` will
thus result in an independent stream of pseudo-random draws obtained
via calls to `rng_uniform(rng)`.

Now, before each Monte Carlo trial we set the seed of the `j`-th RNG
to an arbitrary prime number, say 932117, plus the index of the
experiment, `j`:

```
call rng_seed(rng(j), 932117 + j)
```

We modify the `monte_carlo` function to accept the `rng` as an
argument and to use `rng_uniform` instead of `random_number`:

```
! Draws n Uniform(0,1) random numbers and returns the sample mean
function monte_carlo(rng, n) result(y)
  type(rng_t), intent(inout) :: rng
  integer, intent(in) :: n
  integer :: i
  real :: y, u
  y = 0.d0
  do i = 1, n
     u = rng_uniform(rng)  ! draw from Uniform(0,1)
     y = y + u             ! sum the draws
  end do
  y = y / dble(n)          ! return the sample mean
end function monte_carlo
```
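
Putting the pieces together, the main loop of the parallel-safe program (saved here as `mc3.f90`) might look like this sketch, where each trial seeds and draws from its own generator:

```
!$OMP PARALLEL DO
do j = 1, nmc
   print *, 'Experiment', j
   call rng_seed(rng(j), 932117 + j)  ! independent seed for trial j
   mu(j) = monte_carlo(rng(j), n)     ! each trial uses its own stream
end do
!$OMP END PARALLEL DO
```

Because each iteration touches only `rng(j)` and `mu(j)`, no state is shared between threads and the results no longer depend on the order of execution.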

Compiling the program is only slightly more complicated due to the additional file. As expected, the serial and parallel versions now produce exactly the same results:

```
% gfortran -O3 -o mc3 rng.f90 mc3.f90
% ./mc3
Experiment 1
Experiment 2
...
Experiment 99
Experiment 100
mean 0.50012332
stdev 8.55281323E-05
```

```
% gfortran -O3 -fopenmp -o mc3 rng.f90 mc3.f90
% ./mc3
Experiment 51
Experiment 1
...
Experiment 49
Experiment 50
mean 0.50012332
stdev 8.55281323E-05
```