Writing Efficient Code

December 19, 2010

Writing efficient code for scientific applications in languages such as Fortran and C is often a matter of understanding technical details such as storage layout and loop order. I address several common efficiency issues below; however, one should always keep in mind the following quote from Donald E. Knuth (1974) regarding premature optimization:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Thus, knowing when to spend time on optimization is just as important as the optimization itself.

Access Array Elements in Storage Order

The order in which memory is accessed can have performance implications. For example, it is easier to access elements of a two-dimensional array in the order in which they are stored, since this involves incrementing a memory address. It adds complexity to access them out of order, since more calculations must be done to determine the memory location. The storage order depends on the particular compiler or language standard being used.

To make this more concrete, consider the following matrix:

A = [ a11  a12 ]
    [ a21  a22 ]

Array storage in Fortran is column-major. That is to say, when stored in memory as a linear array, the matrix A will be arranged as (a11, a21, a12, a22). Thus, accessing the elements of A column-wise is most efficient. A useful way to remember this is that in Fortran, the first index varies fastest.

On the other hand, array storage in C is row-major, so the second index varies fastest. The matrix A above will be stored in memory as (a11, a12, a21, a22).

Accessing the memory in the order in which it is stored will be faster. In Fortran, this means writing loops like

do j = 1, n
    do i = 1, n
        ! operate on A(i,j) ...
    end do
end do

On the other hand, in C we want:

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        /* operate on A[i][j] ... */
    }
}
Similarly indexed arrays can also be merged for more efficient access. Consider three vectors a, b, and c each of length n. Rather than using

real, dimension(n) :: a, b, c

it can be more efficient to use

real, dimension(3,n) :: abc

Pay careful attention to the order of the dimensions! In C we would want to switch the order:

double abc[n][3];
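A minimal C sketch of why the merged layout helps: each iteration touches one contiguous triple rather than three widely separated addresses. The data and function name here are illustrative:

```c
#include <stddef.h>

/* Sum a_i + b_i + c_i over all i, with the three vectors stored
 * interleaved as abc[i][0..2].  Each iteration reads one contiguous
 * triple, so all three values typically land in the same cache line. */
double sum_interleaved(size_t n, double abc[][3]) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* a_i, b_i, and c_i are adjacent in memory */
        total += abc[i][0] + abc[i][1] + abc[i][2];
    }
    return total;
}
```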

Avoid Conditionals in Loops

In the following example, due to Prof. Dr. Ulrich Rüde, sqrt must be called and the conditional checked on every iteration.

subroutine threshold(a, n, thresh, ic)
  integer, intent(in) :: n
  real, dimension(n), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt
  integer :: j

  ic = 0
  tt = 0.0
  do j = 1, n
     tt = tt + a(j) * a(j)
     if (sqrt(tt) >= thresh) then
        ic = j
     end if
  end do
end subroutine threshold

An alternative approach, which would allow for many optimizations (loop unrolling, CPU pipelining, less time spent evaluating the conditional), would involve adding tt in blocks (e.g., blocks of size 128) and checking the conditional after each block. When the condition is met, the last block can be repeated to determine the value of ic.

Minimize Loop Overhead

There is overhead in setting up loops (do loops in Fortran, for loops in C) that can have significant performance consequences for relatively short loops. Consider the following two loops. For n>m,

do i = 1, n
    do j = 1, m
        ! ...
    end do
end do

is less efficient than

do j = 1, m
    do i = 1, n
        ! ...
    end do
end do

The first variant involves n+1 loop set-ups (one for the outer loop and n for the inner loop), while the second variant involves only m+1. Paying attention to the loop ordering may help improve performance, but locality of reference concerns (i.e., column- or row-major access) are of first-order importance.

Avoid Subroutine Calls in Inner Loops

There is also overhead associated with function and subroutine calls that, when nested in the innermost loops, can become significant. Avoid such subroutine calls by inlining or restructuring code.
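One common restructuring is to move the loop inside the routine rather than calling the routine inside the loop. A C sketch of the idea, using an axpy-style update (names and signatures are illustrative):

```c
#include <stddef.h>

/* Per-element version: calling this from a loop incurs the
 * call overhead n times and hides the loop from the optimizer. */
double axpy_one(double alpha, double x, double y) {
    return alpha * x + y;
}

/* Restructured version: one call, loop inside.  The compiler can
 * now unroll and vectorize the loop body freely. */
void axpy_all(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

(Modern compilers can often inline small functions automatically, but only when the definition is visible at the call site, e.g. in the same translation unit or header.)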