Version Control in Scientific Computing
December 21, 2010
Using a version control system (VCS) might very well be the single most important thing you can do right now to improve your scientific computing project. When used properly, a VCS allows you to organize code revisions in a logical, coherent way and can help ensure that your results will be easily reproducible.
Outside of a VCS, your source code is constantly in flux. Unless you keep very detailed notes or follow a strict filename-based version system, reproducing the exact output your program generated two weeks ago might be nearly impossible. With version control, the key to regenerating those results is simply keeping a record of the revision number of the code base that produced them. This can be automated by writing the program so that it prints this number itself, say, in the log file. Reproducing any particular table then becomes a matter of “checking out” the revision of the code that generated it.
Fundamentals
Most version control systems, sometimes referred to as revision control systems (RCS) or source code management (SCM) systems, share several common features. First, there is a repository which contains the complete history of changes for a particular project. At various times throughout the development process, the developer checks in the code, creating a snapshot of the project. This snapshot is referred to as a revision and it can be checked out or restored at a later date. (Internally, for the purposes of storage efficiency, most systems simply store a delta or change set against the previous checked-in revision.)
Branches and tags are other common features you’ll find in almost every VCS. A branch is a single path of development and may contain, for example, the latest development version of a project. That development version may be based on another branch containing the stable version. Some projects also contain feature branches for testing new capabilities that are not yet ready to be included in the main branch. Branches can be merged into one another, say, when a feature is stable and ready to release. A tag is simply a short human-readable name for a particular revision. For example, a word or phrase like “Dec21” or “submitted” can be used as a convenient alias for revision number 274.
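As a rough sketch of how branches, merges, and tags look in practice, the following shell session uses Git (one of the distributed systems discussed below) as an assumed example. The file name model.txt, the branch name feature, the tag name, and the commit messages are all hypothetical, and everything happens in a throwaway directory.

```shell
#!/bin/sh
set -eu

# Work in a throwaway repository so nothing existing is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"

# Check in a first revision on the initial branch.
echo "x = 1" > model.txt
git add model.txt
git commit -q -m "Initial stable version"
stable=$(git rev-parse --abbrev-ref HEAD)     # "master" or "main", depending on configuration

# Start a feature branch and check in a change there.
git checkout -q -b feature
echo "y = 2" >> model.txt
git commit -q -a -m "Add y to the model"

# Merge the finished feature back into the stable branch.
git checkout -q "$stable"
git merge -q feature

# Tag the resulting revision with a short human-readable name.
git tag submitted
git rev-parse --short submitted
```

The final command prints the abbreviated hash that the tag submitted now stands for, which is the Git analogue of the "revision number 274" alias described above.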
Centralized and Distributed Version Control Systems
This is a very exciting time in the version control world. For many years, things were fairly stable and there was a straightforward progression from RCS to CVS (Concurrent Versions System) and later to Subversion. CVS and Subversion are centralized in the sense that there is a master server which holds the repository. Potentially many members of a development team may be granted access to the repository. Each developer works on his or her own checked-out working copy of the project. When changes are ready to commit, the developer first updates the working copy with any new changes from the central repository and then commits the changes on top of the most recent version.
Recently, several new distributed systems were created which are currently competing fiercely for market share. One of the major benefits of distributed systems over centralized systems is that there is no central repository to administer. For a scientific project, which may have only a single developer, this is a major advantage which considerably reduces the cost of setting up and using a repository in terms of time and the amount of system administration knowledge required. Previously, setting up a CVS or Subversion repository and placing a project under version control required several administrative steps involving editing configuration files and issuing commands that are difficult to remember because they are used so infrequently. In a distributed version control system such as Git, the repository for each project is stored in the same directory as the code itself (usually in a hidden subdirectory). Initializing a project usually involves only a single short command. There need not be a “server” in the traditional sense either—your desktop computer is perfectly capable of hosting your own personal Git repository even without network connectivity.
Where to Start?
Unfortunately, the distinction between centralized and distributed systems raises the question of where to start. If you work on a Unix system, you probably already have RCS installed. If not, installing it from your favorite package manager is usually fairly straightforward. Although it is a bit dated, a single-user RCS setup is dead simple to use and is a perfectly good place to start. It’s infinitely better than using no version control system at all.
The primary benefit of using RCS is that the learning curve is fairly shallow. A perfectly good RCS workflow is possible after learning only three or four commands. Furthermore, RCS is available on almost every Unix-based system under the sun. On the other hand, newer version control systems lift many of the limitations of RCS, allowing very diverse workflows and easy sharing of files over the internet.
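To give a feel for how small that command set is, here is a minimal sketch of a single-user RCS session. It assumes the standard RCS tools (ci, co, rlog, rcsdiff) are installed, and the file params.txt, its contents, and the log messages are hypothetical.

```shell
#!/bin/sh
set -eu

# Skip gracefully if the RCS tools are not installed.
command -v ci >/dev/null 2>&1 || { echo "RCS is not installed"; exit 0; }

# Work in a scratch directory with a hypothetical parameter file.
cd "$(mktemp -d)"
printf 'x = 1\n' > params.txt

# ci: check in the first revision (1.1); -l keeps a locked, editable working copy.
ci -q -l -t-"Model parameters" params.txt

# Edit the file, then check in again to create revision 1.2.
printf 'x = 2\n' > params.txt
ci -q -l -m"Change x to 2" params.txt

# rlog: show the revision history.
rlog params.txt

# rcsdiff: compare two revisions (it exits with status 1 when they differ).
rcsdiff -q -r1.1 -r1.2 params.txt || true

# co: retrieve an old revision; -p prints it instead of overwriting the file.
co -q -p -r1.1 params.txt
```

The history lives in a single file, params.txt,v, next to the working file (or in an RCS subdirectory if one exists), which is part of why this workflow is easy to adopt but awkward to share.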
If you are feeling slightly more adventurous, consider a distributed version control system such as Git, written by Linus Torvalds, the creator of the Linux kernel. The important parts are written in C, so it’s extremely fast. In contrast to systems like CVS or Subversion, it’s very simple to start a new Git repository for a project: simply issue the git init command in the project directory.
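A minimal sketch of that first step, assuming Git is installed: the session below creates a scratch project directory, initializes a repository in it, and records a first revision. The file contents, the identity settings, and the commit message are placeholders.

```shell
#!/bin/sh
set -eu

# A scratch project directory containing one hypothetical source file.
project=$(mktemp -d)
cd "$project"
printf 'program hello\nend program hello\n' > hello.f90

# One command places the project under version control.
git init -q

# The repository lives in a hidden subdirectory next to the code.
ls -d .git

# Record the first revision; no server or network access is involved.
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
git add hello.f90
git commit -q -m "Initial revision"
git log --oneline
```

Deleting the .git subdirectory would remove the history but leave the working files untouched, which makes experimenting with a repository low-risk.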
Git also has several other important advantages, even over other distributed version control systems:
- It is extremely fast, so using it doesn’t feel like an interruption of the work at hand.
- It has a staging area for preparing commits, which leads to logical, clean, and organized repository histories.
- Easy and space-efficient branching allows quick, safe experimentation with new ideas.
Many of Git’s design choices encourage better programming practices by lowering the resistance to carrying them out. For example, by making it easier to create small patches and rearrange them in the most logical way, Git encourages you to keep your revision history clean. A series of very specific, single-topic patches is easier to comprehend than one large patch containing several changes.
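The staging area is what makes single-topic commits cheap. In the following sketch, two unrelated edits made in the same sitting are recorded as two separate revisions; the file names, contents, and commit messages are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository with two tracked files.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
printf 'solver v1\n' > solver.txt
printf 'notes v1\n' > README.txt
git add solver.txt README.txt
git commit -q -m "Initial revision"

# Two unrelated edits made in the same sitting.
printf 'solver v2\n' > solver.txt
printf 'notes v2\n' > README.txt

# Stage and commit each change on its own.
git add solver.txt
git commit -q -m "Tighten solver tolerance"
git add README.txt
git commit -q -m "Update notes"

# The history now shows one topic per revision.
git log --oneline
```

For edits that touch the same file, git add -p offers the same separation interactively, one hunk at a time.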
Adding Revision Information to Program Output
In order to ensure that any particular set of results is easily reproducible, it is useful to design the program so that it automatically reports the revision information in the output. With Git, a short version of the commit hash (Git’s form of a revision number) can be obtained using the git rev-parse command:
$ git rev-parse --short HEAD
9c8bab3
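One caveat is that the hash describes only committed code. As an aside that goes beyond the command shown above, git describe --always --dirty prints the same kind of abbreviated hash but appends -dirty when tracked files have uncommitted modifications. The sketch below uses a throwaway repository and a hypothetical file code.txt to show the difference.

```shell
#!/bin/sh
set -eu

# Throwaway repository with one committed file.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
printf 'a\n' > code.txt
git add code.txt
git commit -q -m "Initial revision"

# With a clean working tree, only the abbreviated hash is printed.
clean=$(git describe --always --dirty)

# After an uncommitted edit, the same command appends "-dirty".
printf 'b\n' >> code.txt
dirty=$(git describe --always --dirty)

echo "clean: $clean"
echo "dirty: $dirty"
```

Logging this string instead of the bare hash makes it visible when results came from code that was never checked in.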
This information can be incorporated into the build process and printed in program output and logs so that results can be linked back to a particular revision of the source. One way to do this is to have your Makefile generate a temporary include file, say revision.inc, and include it in your program. A simple Fortran Makefile using this approach might read something like this:
REV = $(shell git rev-parse --short HEAD)

hello: hello.f90
	echo "character(len=7), parameter :: revision = '$(REV)'" > revision.inc
	gfortran -o $@ $^

clean:
	-rm hello revision.inc
The first line obtains the revision information using a shell command and stores it in the variable REV (referenced later as $(REV)). The next section defines the hello target, its dependencies (here, only hello.f90), and the commands used to build it. Each time the program is built, the first command creates an include file called revision.inc which contains the relevant revision information. The second command builds the program using gfortran. The variable $@ is short for the target name, hello, which is used as the name of the executable. The variable $^ contains all dependencies, here the source files. When fully expanded, this command would read:
gfortran -o hello hello.f90
Now, the program only needs to include revision.inc, at which point it can use the named constant revision as if it were defined in the program itself:
program hello
  implicit none
  include 'revision.inc'
  print *, 'Hello, world!'
  print *, 'Revision ', revision
end program hello
Each time hello is executed, the associated revision information is automatically printed, freeing you from remembering to note this information each time.
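With the hash recorded in the output, getting back to the code that produced an old result is a single checkout. The following sketch simulates this in a throwaway repository; the file model.txt and its contents stand in for a real code base and are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"

# The revision that produced the results we care about.
printf 'result = 1\n' > model.txt
git add model.txt
git commit -q -m "Version used for the submitted tables"
rev=$(git rev-parse --short HEAD)            # what the program would have logged

# Development continues and the code changes.
printf 'result = 2\n' > model.txt
git commit -q -a -m "Later changes"

# Weeks later: restore the code exactly as it was at the logged revision.
git checkout -q "$rev"
cat model.txt
```

The checkout leaves the repository on a detached HEAD, so it is worth switching back to the working branch before making further changes.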
Thus, placing your project under version control not only ensures access to previously known good code in the event that an error is introduced, but, when used properly, can also lead to more organized code, a transparent history of changes, and easily reproducible results.