Tips for Processing Large Datasets
February 26, 2008
Process files line by line whenever possible, and avoid loading entire files into memory. SAS is structured this way, while Stata loads everything into memory. In any general-purpose programming language, you can process files this way if needed.
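For instance, here is a minimal Perl sketch (data.csv is a hypothetical input file) that holds only the current line in memory:

open (CSV, 'data.csv') or die $!;
while (my $line = <CSV>) {
    chomp $line;                    # strip the trailing newline
    my @fields = split /,/, $line;  # naive split; use a CSV module for quoted fields
    # ... process @fields ...
}
close (CSV);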
If space is an issue, keep the raw data compressed and process the compressed files in place using pipes, removing the need to uncompress them on disk. In most languages, it is possible to read from a pipe (e.g., gzip output) just as one reads from a file.
Perl
In Perl, you can read from a pipe just as you would a file:
open (ZIP, 'unzip -p file.zip file.csv |') or die $!;  # extract a named member of a zip archive
open (BZIP, 'bzip2 -dc file.csv.bz2 |') or die $!;     # decompress bzip2 to stdout
open (GZIP, 'gzip -dc file.csv.gz |') or die $!;       # decompress gzip to stdout
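Once opened, such a handle reads like an ordinary file handle. For example, to walk the gzipped CSV line by line:

while (my $line = <GZIP>) {
    chomp $line;
    # ... process each record ...
}
close (GZIP) or warn "gzip exited with status $?";  # close reports the pipe's exit status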
SAS
SAS also supports pipes. Suppose data.zip and data.csv.bz2 each contain a single CSV file:
filename fh1 pipe 'unzip -p data.zip "*.csv"';
filename fh2 pipe 'bzip2 -dc data.csv.bz2';
In the data step, you can then read from the fileref with an infile statement:
data csv;
infile fh1 dsd firstobs=2 lrecl=8192;
input id name :$32. value;  /* hypothetical variable list; replace with your columns */
run;
You can also concatenate multiple files into a single pipe by looping in the shell. For example, with bash:
filename fh pipe 'for i in *.zip; do
unzip -p "$i" "*.csv";
done';
Note that if every CSV file has its own header row, firstobs=2 skips only the first one; the later headers arrive as ordinary records and must be filtered out in the data step.
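The same shell loop works from Perl as well, since a piped open passes the command string to the shell; a sketch under that assumption:

open (ALL, 'for i in *.zip; do unzip -p "$i" "*.csv"; done |') or die $!;
while (my $line = <ALL>) {
    # each line comes from the concatenated CSV members
}
close (ALL);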