Version Control in Scientific Computing

December 21, 2010

Using a version control system (VCS) might very well be the single most important thing you can do right now to improve your scientific computing project. When used properly, a VCS allows you to organize code revisions in a logical, coherent way and can help ensure that your results will be easily reproducible.

Outside of a VCS, your source code is constantly in flux. Unless you keep very detailed notes or follow a strict filename-based version system, reproducing the exact output your program generated two weeks ago might be nearly impossible. With version control, the key to regenerating those results is simply keeping a record of the revision number of the code base that produced them. This can be automated by writing the program in such a way that it prints this number automatically, say, in the log file. Reproducing any particular table then becomes a matter of “checking out” the revision of the code that generated it.

Fundamentals

Most version control systems, sometimes referred to as revision control systems (RCS) or software configuration management (SCM) systems, share several common features. First, there is a repository which contains the complete history of changes for a particular project. At various times throughout the development process, the developer checks in the code, creating a snapshot of the project. This snapshot is referred to as a revision and it can be checked out or restored at a later date. (Internally, for the purposes of storage efficiency, most systems simply store a delta or change set against the previous checked-in revision.)
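
To make this concrete, here is roughly what one such cycle looks like in Git (introduced below); the file name, commit message, and revision identifier are only placeholders:

$ git add solver.f90                       # stage the modified file
$ git commit -m "Fix boundary condition"   # check in a new revision
$ git checkout 9c8bab3                     # later: restore the project as it was at that revision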

Branches and tags are other common features you’ll find on almost every VCS. A branch is a single path of development and may contain, for example, the latest development version of a project. That development version may be based on another branch containing the stable version. Some projects also contain feature branches for testing new capabilities that are not yet ready to be included in the main branch. Branches can be merged into one another, say, when a feature is stable and ready to release. A tag is simply a short human-readable name for a particular revision. For example, a word or phrase like “Dec21” or “submitted” can be used as a convenient alias for revision number 274.
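
In Git, for example, branching, merging, and tagging each take a single short command; the branch and tag names below are purely illustrative:

$ git branch newsolver       # create a feature branch
$ git checkout newsolver     # switch to it and develop the feature there
$ git checkout master        # return to the main branch
$ git merge newsolver        # merge the feature once it is ready
$ git tag submitted          # give the current revision a human-readable name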

Centralized and Distributed Version Control Systems

This is a very exciting time in the version control world. For many years, things were fairly stable and there was a straightforward progression from RCS to CVS (Concurrent Versions System) and later to Subversion. These systems are centralized in the sense that there is a master server which holds the repository. Potentially many members of a development team may be granted access to the repository. Each developer works on his or her own checked-out working copy of the project. When changes are ready to commit, the developer first updates that working copy with any new changes from the central repository and then commits on top of the most recent version.
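
With Subversion, for instance, that cycle looks roughly like the following; the repository URL is, of course, hypothetical:

$ svn checkout https://svn.example.org/project/trunk project
$ cd project
$ svn update                             # pull in any new changes from the central repository
$ svn commit -m "Add convergence test"   # commit your own changes on top of them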

Recently, several new distributed systems have been created and are currently competing fiercely for market share. One of the major benefits of distributed systems over centralized systems is that there is no central repository to administer. For a scientific project, which may have only a single developer, this is a major advantage: it considerably reduces the cost of setting up and using a repository, both in time and in the amount of system administration knowledge required. Previously, setting up a CVS or Subversion repository and placing a project under version control required several administrative steps involving editing configuration files and issuing commands that are difficult to remember because they are used so infrequently. In a distributed version control system such as Git, the repository for each project is stored in the same directory as the code itself (usually in a hidden subdirectory). Initializing a project usually involves only a single short command. There need not be a “server” in the traditional sense either; your desktop computer is perfectly capable of hosting your own personal Git repository even without network connectivity.

Where to Start?

Unfortunately, the distinction between centralized and distributed systems raises the question of where to start. If you work on a Unix system, you probably already have RCS installed. If not, installing it from your favorite package manager is usually fairly straightforward. Although it is a bit dated, a single-user RCS setup is dead simple to use and is a perfectly good place to start. It’s infinitely better than using no version control system at all.

The primary benefit of using RCS is that the learning curve is fairly shallow. A perfectly good RCS workflow is possible after learning only three or four commands. Furthermore, RCS is available on almost every Unix-based system under the sun. On the other hand, newer version control systems remove many of RCS’s limitations, allowing for much more diverse workflows and easy sharing of files over the internet.
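
Those few commands are essentially ci (check in), co (check out), and rcsdiff; a minimal single-user session might look like this:

$ ci -u hello.f90       # initial check-in; creates the RCS file hello.f90,v
$ co -l hello.f90       # check out and lock the file for editing
$ rcsdiff hello.f90     # review your changes against the last checked-in revision
$ ci -u hello.f90       # check in a new revision, keeping a read-only working copy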

If you are feeling slightly more adventurous, consider a distributed version control system such as Git, written by Linus Torvalds, the creator of the Linux kernel. The important parts are written in C, so it’s extremely fast. In contrast to systems like CVS or Subversion, it’s very simple to start a new Git repository for a project: one simply issues the git init command in the project directory.
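
For example, placing an existing project directory under Git version control amounts to the following; the directory name and commit message are placeholders:

$ cd project
$ git init                         # creates the repository in a hidden .git subdirectory
$ git add .                        # stage every file in the project
$ git commit -m "Initial import"   # record the first revision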

Git also has several other important advantages, even over other distributed version control systems:

Many of Git’s design choices encourage better programming practices by lowering the resistance to carrying them out. For example, by making it easier to create small patches and rearrange them in the most logical way, Git encourages you to keep your revision history clean. A series of very specific, single-topic patches is easier to comprehend than one large patch containing several changes.
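
Two concrete examples of this, with an arbitrary file name and commit count: git add --patch stages selected hunks of a file so that each commit covers a single topic, and an interactive rebase lets you reorder or combine recent commits before sharing them.

$ git add --patch solver.f90    # stage only the hunks that belong to this change
$ git commit -m "Tighten convergence tolerance"
$ git rebase -i HEAD~3          # reorder, edit, or squash the last three commits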

Adding Revision Information to Program Output

In order to ensure that any particular set of results is easily reproducible, it is useful to design the program so that it automatically reports the revision information in the output. With Git, a short version of the commit hash (Git’s form of a revision number) can be obtained using the git rev-parse command:

$ git rev-parse --short HEAD
9c8bab3

This information can be incorporated into the build process and printed in program output and logs so that results can be linked back to a particular revision of the source. One way to do this is to have your Makefile generate a temporary include file, say revision.inc, and include it in your program. A simple Fortran Makefile using this approach might read something like this:

REV = $(shell git rev-parse --short HEAD)

hello: hello.f90
    echo "character(len=7), parameter :: revision = '$(REV)'" > revision.inc
    gfortran -o $@ $^

clean:
    -rm hello revision.inc

The first line obtains the revision information using a shell command and stores it in the make variable REV. The next section defines the hello target, its dependencies (here, only hello.f90), and the commands used to build it. Each time the program is built, the first command creates an include file called revision.inc which contains the relevant revision information. The second command builds the program using gfortran. The automatic variable $@ expands to the target name, hello, which is used as the name of the executable, and $^ expands to the list of dependencies, that is, the source files. When fully expanded, this command would read

gfortran -o hello hello.f90

Now, the program only needs to include revision.inc, at which point it can use the variable revision as if it were defined in the program itself:

program hello
  implicit none
  include 'revision.inc'
  print *, 'Hello, world!'
  print *, 'Revision ', revision
end program hello

Each time hello is executed, the associated revision information is automatically printed, freeing you from remembering to note this information each time.
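
Assuming the Makefile and program above, a build-and-run session should produce output roughly like this (the hash will, of course, reflect your own repository):

$ make hello
echo "character(len=7), parameter :: revision = '9c8bab3'" > revision.inc
gfortran -o hello hello.f90
$ ./hello
 Hello, world!
 Revision 9c8bab3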

Thus, placing your project under version control not only ensures access to previously known good code in the event that an error is introduced, but, when used properly, it can also lead to more organized code, a transparent history of changes, and easily reproducible results.