Parallel Computing in Fortran with OpenMP

January 27, 2009

Both the GNU and Intel Fortran compilers have native support for OpenMP. All you need to do to create parallel programs is to add some OpenMP directives, specially formatted comments, to your code and compile the program with a special flag to enable OpenMP. For gfortran, compile the program with -fopenmp and for ifort, the flag is -openmp.
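
For example, the following toy program (a minimal sketch, separate from the Monte Carlo example below) contains a single parallel region; when compiled with one of the flags above, each thread prints its own id, obtained from the standard omp_lib module:

program hello
  use omp_lib   ! OpenMP runtime library routines, e.g. omp_get_thread_num
  implicit none

  !$OMP PARALLEL
  print *, 'Hello from thread', omp_get_thread_num()
  !$OMP END PARALLEL
end program hello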

The trick to getting the most out of parallel processing is to run sections of code in parallel such that the additional overhead of starting new threads is outweighed by the time required to perform the task. That is, it’s ideal to split the code into distinct tasks that are each sufficiently expensive to compute on their own, say taking more than a couple of seconds each, rather than many simple tasks that each execute very quickly.

In many cases you will have sections of code that are embarrassingly parallel: a group of tasks needs to be carried out multiple times and each task is independent of the others. Monte Carlo experiments are a good example. Each trial is independent of the rest, and at the end all of the trials simply need to be collected, in any order, so that summary statistics such as the mean and standard deviation can be calculated.

A Parallel Monte Carlo Experiment

Here is a short example program, mc1, to get started. Suppose we have a function called monte_carlo which we need to run several times with different data. Each run returns some result and we would like to compute the mean of these results. In the example below, monte_carlo(n) draws n Uniform(0,1) random numbers and returns the sample mean.

program mc1
  implicit none

  integer, parameter :: nmc = 100       ! number of trials
  integer, parameter :: n = 100000      ! sample size
  real, dimension(nmc) :: mu            ! result of each trial
  real :: mean, stdev                   ! mean and standard deviation
  integer :: j

  !$OMP PARALLEL DO
  do j = 1, nmc
     print *, 'Experiment', j
     mu(j) = monte_carlo(n)
  end do
  !$OMP END PARALLEL DO

  mean = sum(mu) / real(nmc)
  stdev = sqrt( sum( (mu - mean)**2 ) / real(nmc) )

  print *, 'mean', mean
  print *, 'stdev', stdev

contains

  ! Draws n Uniform(0,1) random numbers and returns the sample mean
  function monte_carlo(n) result(y)
    integer, intent(in) :: n
    integer :: i
    real :: y, u

    y = 0.0
    do i = 1, n
       call random_number(u)            ! draw from Uniform(0,1)
       y = y + u                        ! sum the draws
    end do
    y = y / real(n)                     ! return the sample mean
  end function monte_carlo

end program mc1

Depending on your computer, because of the simplicity of this experiment you will probably need to set n very high in order to see a difference in execution speed. In fact, for small values of n (even 100000 in my case) the parallel code is actually slower. This illustrates the earlier point about the overhead of spawning new threads: the chunks of code being parallelized must be sufficiently expensive for the gains from parallel execution to outweigh that overhead.
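
If you want to measure this yourself, one approach (a sketch, assuming the program is compiled with OpenMP so that the omp_lib module is available) is to bracket the loop with the wall-clock timer omp_get_wtime:

  use omp_lib                  ! provides omp_get_wtime
  double precision :: t0, t1

  t0 = omp_get_wtime()
  !$OMP PARALLEL DO
  do j = 1, nmc
     mu(j) = monte_carlo(n)
  end do
  !$OMP END PARALLEL DO
  t1 = omp_get_wtime()
  print *, 'wall time (s):', t1 - t0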

First, we can compile the program as usual and run it, executing the Monte Carlo experiments in order:

% gfortran -O3 -o mc1 mc1.f90
% ./mc1
Experiment           1
Experiment           2
...
Experiment          99
Experiment         100
mean  0.49996144
stdev  8.68058487E-05

Now, if we compile the program with OpenMP and run it, we see that the experiments seem to be allocated evenly across (in my case) two processors and are executed out of order:

% gfortran -O3 -fopenmp -o mc1 mc1.f90
% ./mc1
Experiment           1
Experiment          51
...
Experiment          49
Experiment          50
mean  0.49996150    
stdev  9.25014756E-05

If this doesn’t work immediately on your system, you may need to set the OMP_NUM_THREADS environment variable depending on the number of processors in your system. For example, in bash, zsh, and other compatible shells for a system with 16 processors:

% export OMP_NUM_THREADS=16
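
You can also verify the thread count from within a program. This sketch uses omp_get_max_threads from omp_lib, which reports the number of threads available to the next parallel region and respects OMP_NUM_THREADS:

program nthreads
  use omp_lib
  implicit none
  print *, 'threads available:', omp_get_max_threads()
end program nthreads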

The Perils of Parallel Pseudo-Random Number Generation

Note that the above answers are different! Even this simple example shows that parallel programs can behave differently than their serial counterparts in unforeseen ways. Why are the results different? In this case we are drawing pseudo-random numbers from a stream that depends on a seed. If the samples are just taken out of order the mean and standard deviation should be identical. However, if the samples themselves are different, because multiple processors are drawing random numbers from the same stream simultaneously, then the results will be different.

Even setting the seed to a predefined value before each experiment will not fix the problem. You can verify this yourself with the following set_seed subroutine:

! Determines the dimension of the seed vector and sets the seed of
! the random number generator in a deterministic way.
subroutine set_seed(index)
  integer, intent(in) :: index
  integer :: i, n
  integer, dimension(:), allocatable :: seed

  call random_seed(SIZE = n)
  allocate(seed(n))
  seed = index + 37 * (/ (i - 1, i = 1, n) /)
  call random_seed(PUT = seed)
end subroutine set_seed

Even after calling set_seed(j) before the call to monte_carlo, the means and standard deviations will differ. The problem is that there is actually only one random number generator (RNG), and when one thread sets the seed it affects the other active threads. The streams seen by the threads are not independent.
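
Concretely, the failing variant looks like this (a sketch, with set_seed placed alongside monte_carlo in the contains section of mc1):

  !$OMP PARALLEL DO
  do j = 1, nmc
     call set_seed(j)          ! reseeds the single, global RNG
     mu(j) = monte_carlo(n)    ! other threads may reseed mid-run
  end do
  !$OMP END PARALLEL DO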

A Simple Parallel-Ready Random Number Generator

To fix this problem, each thread needs to have its own independent RNG. There is no simple way to accomplish this with Fortran’s random_number intrinsic. Instead, we need a more sophisticated RNG that can have multiple “instances.” Fortran’s user-defined types provide a nice solution. If we create a type that will store the state of each stream of random numbers, we can then build an array of RNGs, one for each thread or experiment, each started at a different seed.

For the purposes of illustration, a simple random number generator based on the mzran algorithm of Marsaglia and Zaman (1994) is provided in module rng below. The module provides a type, rng_t, and two procedures, rng_seed and rng_uniform, for setting the seed (with a single integer) and drawing a Uniform(0,1) random number, respectively.

module rng
  implicit none

  private
  public :: rng_t, rng_seed, rng_uniform

  ! Dimension of the state
  integer, parameter :: ns = 4

  ! Default seed vector
  integer, parameter, dimension(ns) :: default_seed &
       = (/ 521288629, 362436069, 16163801, 1131199299 /)

  ! A data type for storing the state of the RNG
  type :: rng_t
     integer, dimension(ns) :: state = default_seed
  end type rng_t

contains

  ! Seeds the RNG using a single integer and a default seed vector.
  subroutine rng_seed(self, seed)
    type(rng_t), intent(inout) :: self
    integer, intent(in) :: seed
    self%state(1) = seed
    self%state(2:ns) = default_seed(2:ns)
  end subroutine rng_seed

  ! Draws a uniform real number on [0,1].
  function rng_uniform(self) result(u)
    type(rng_t), intent(inout) :: self
    real :: u
    integer :: imz

    imz = self%state(1) - self%state(3)

    if (imz < 0) imz = imz + 2147483579

    self%state(1) = self%state(2)
    self%state(2) = self%state(3)
    self%state(3) = imz
    self%state(4) = 69069 * self%state(4) + 1013904243
    imz = imz + self%state(4)
    u = 0.5d0 + 0.23283064d-9 * imz
  end function rng_uniform

end module rng

At the beginning of our new program, mc3, we must add a use rng statement to make the module available. In addition to the array mu for storing the results, we now create an array of random number generators, here called rngs:

type(rng_t), dimension(nmc) :: rngs   ! one RNG per trial

There is thus one variable of type rng_t for each experiment. The rng_t type simply stores the state of the generator, an integer vector of length four, as can be seen in the source of the rng module above.

The state is initialized to the default seed. Calling rng_seed(rng, seed) for an rng_t variable rng and an integer seed will set the first element of the state to seed and the remaining elements to the default values. Each value of seed will thus result in an independent stream of pseudo-random draws obtained via calls to rng_uniform(rng).
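
For example, a single stream can be seeded and sampled on its own (a minimal usage sketch; 12345 is an arbitrary seed):

  use rng
  type(rng_t) :: stream
  real :: u

  call rng_seed(stream, 12345)   ! the same seed always yields the same stream
  u = rng_uniform(stream)        ! first Uniform(0,1) draw from this stream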

Now, before each Monte Carlo trial we set the seed of the j-th RNG to an arbitrary prime number, say 932117, plus the index of the experiment, j:

call rng_seed(rngs(j), 932117 + j)

We modify the monte_carlo function to accept the RNG as an argument and use rng_uniform instead of random_number:

! Draws n Uniform(0,1) random numbers and returns the sample mean
function monte_carlo(rng, n) result(y)
  type(rng_t), intent(inout) :: rng
  integer, intent(in) :: n
  integer :: i
  real :: y, u

  y = 0.0
  do i = 1, n
     u = rng_uniform(rng)           ! draw from Uniform(0,1)
     y = y + u                      ! sum the draws
  end do
  y = y / real(n)                   ! return the sample mean
end function monte_carlo
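
Putting the pieces together, the parallel loop in mc3 becomes (a sketch; the declarations are as in mc1 plus the rngs array above):

  !$OMP PARALLEL DO
  do j = 1, nmc
     print *, 'Experiment', j
     call rng_seed(rngs(j), 932117 + j)   ! independent stream per trial
     mu(j) = monte_carlo(rngs(j), n)      ! each trial uses only its own RNG
  end do
  !$OMP END PARALLEL DO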

Compiling the program is only slightly more complicated due to the additional file. As expected, the serial and parallel versions now produce exactly the same results:

% gfortran -O3 -o mc3 rng.f90 mc3.f90
% ./mc3
Experiment           1
Experiment           2
...
Experiment          99
Experiment         100
mean  0.50012332
stdev  8.55281323E-05
% gfortran -O3 -fopenmp -o mc3 rng.f90 mc3.f90
% ./mc3
Experiment          51
Experiment           1
...
Experiment          49
Experiment          50
mean  0.50012332
stdev  8.55281323E-05

References

Marsaglia, G. and A. Zaman (1994). Some portable very-long-period random number generators. Computers in Physics 8 (1), 117–121.