Tips for Processing Large Datasets

February 26, 2008

Process files line by line whenever possible, and avoid loading entire files into memory. SAS is structured this way by default, whereas Stata loads the entire dataset into memory. Of course, with any general-purpose programming language you can process files this way as needed.
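
Reading line by line is straightforward in Perl. A minimal sketch (the filename is hypothetical) in which only the current line is ever held in memory:

open (FH, 'data.csv') or die $!;   # hypothetical input file
while (<FH>) {
    chomp;
    # process the current line here; memory use stays constant
}
close (FH);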

If disk space is an issue, keep the raw data compressed and process the compressed files in place using pipes, so they never need to be uncompressed on disk. In most languages, it is possible to read from a pipe (e.g., the output of gzip) just as one reads from a file.

Perl

In Perl, you can read from a pipe just as you would a file:

open (ZIP, 'unzip -p file.zip file.csv |') or die $!;
open (BZIP, 'bzip2 -dc file.csv.bz2 |') or die $!;
open (GZIP, 'gzip -dc file.csv.gz |') or die $!;
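
To consume one of these handles, loop over it as you would an ordinary filehandle. A minimal sketch that tallies rows from the gzip pipe (the naive comma split assumes no quoted fields); since the handle is a pipe, checking the return value of close also catches a command that exited abnormally:

my $rows = 0;
while (<GZIP>) {
    chomp;
    my @fields = split /,/, $_;   # split fields for per-row processing; assumes no quoted commas
    $rows++;
}
close (GZIP) or die "gzip pipe failed (exit status $?)";
print "read $rows rows\n";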

SAS

SAS also supports pipes. Suppose data.zip and data.csv.bz2 both contain a single CSV file:

filename fh1 pipe 'unzip -p data.zip *.csv';
filename fh2 pipe 'bzip2 -dc data.csv.bz2';

In the data step, you can use the fileref as you would any other:

data fh1;
    infile fh1 dsd firstobs=2 lrecl=8192;
    [...]
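
Filled in, a complete step might look like this; the variable names, types, and lengths here are assumptions for illustration, not part of the original:

data fh1;
    infile fh1 dsd firstobs=2 lrecl=8192;
    length id 8 name $ 40 value 8;   /* hypothetical record layout */
    input id name $ value;
run;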

You can also concatenate multiple files into a single pipe by looping in the shell. For example, with bash you can do the following (quoting "*.csv" lets unzip, rather than the shell, expand the pattern against each archive's contents):

filename fh pipe 'for i in *.zip; do
    unzip -p "$i" "*.csv";
done';
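
One caveat: firstobs=2 skips only the first line of the combined stream, so the header row of each subsequent file arrives as data. A possible workaround (a sketch, not from the original) is to strip every header inside the loop, e.g. with tail, and drop firstobs=2 from the infile statement:

filename fh pipe 'for i in *.zip; do
    unzip -p "$i" "*.csv" | tail -n +2;
done';

Here tail -n +2 prints each file from its second line onward, dropping its header.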