Calgary Corpus - Benchmarks

Benchmarks

The Calgary corpus was a commonly used benchmark for data compression in the 1990s. Results were usually reported in bits per byte (bpb) for each file and then summarized by averaging the per-file figures. More recently, it has become common to simply add up the compressed sizes of all the files. This is called a weighted average because it is equivalent to weighting each file's compression ratio by its original size. The UCLC benchmark by Johan de Bock uses this method.
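The difference between the two summaries can be sketched in a few lines of Python. This is only an illustration: the function names and the two-file example are hypothetical and are not taken from any benchmark.

```python
def bits_per_byte(compressed_size, original_size):
    """Compressed bits spent per byte of original input."""
    return 8.0 * compressed_size / original_size


def unweighted_mean_bpb(sizes):
    """Average the per-file bpb figures; every file counts equally."""
    return sum(bits_per_byte(c, o) for o, c in sizes) / len(sizes)


def weighted_mean_bpb(sizes):
    """Add up the sizes first; larger files count proportionally more."""
    total_original = sum(o for o, _ in sizes)
    total_compressed = sum(c for _, c in sizes)
    return bits_per_byte(total_compressed, total_original)


# Hypothetical (original, compressed) sizes in bytes, not real corpus data.
sizes = [(1000, 250), (9000, 4500)]
print(unweighted_mean_bpb(sizes))  # per-file bpb of 2.0 and 4.0 average to 3.0
print(weighted_mean_bpb(sizes))    # 8 * 4750 / 10000, about 3.8
```

In this made-up example the large, poorly compressed file dominates the weighted figure, which is why the two summaries can rank compressors differently.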

Some data compressors produce a smaller result when the inputs are first combined into an uncompressed archive (such as a tar file) and then compressed, because they can exploit mutual information between the text files. For other compressors the result is larger, because the compressor handles nonuniform statistics poorly. This method was used in a benchmark in the online book Data Compression Explained by Matt Mahoney.
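A minimal sketch of the two measurement methods, using only Python's standard library, might look like the following. The helper names are hypothetical, and `bz2.compress` is used here simply as a convenient default; it is not claimed to reproduce the settings of any program in the benchmarks.

```python
import bz2
import io
import tarfile


def tar_bytes(files):
    """Pack {name: data} into an uncompressed tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def compare_separate_vs_tar(files, compress=bz2.compress):
    """Return (sum of per-file compressed sizes, compressed size of the tar)."""
    separate = sum(len(compress(data)) for data in files.values())
    combined = len(compress(tar_bytes(files)))
    return separate, combined


# Toy inputs standing in for corpus files (assumption: real use would read
# the 14 corpus files from disk instead).
files = {
    "a.txt": b"the quick brown fox " * 200,
    "b.txt": b"jumps over the lazy dog " * 200,
}
separate, combined = compare_separate_vs_tar(files)
print(separate, combined)
```

Passing a different callable as `compress` (for example `zlib.compress` or `lzma.compress`) allows the same comparison for another algorithm. Note that the tar container adds header and padding bytes of its own, so the combined input is slightly larger than the sum of the files.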

The table below shows the compressed sizes, in bytes, of the 14-file Calgary corpus using both methods for some popular compression programs. Options, when used, select the best compression. For a more complete list, see the benchmarks mentioned above.

Compressor	Options	As 14 separate files	As a tar file
Uncompressed		3,141,622	3,152,896
compress		1,272,772	1,319,521
Info-ZIP 2.32	-9	1,020,781	1,023,042
gzip 1.3.5	-9	1,017,624	1,022,810
bzip2 1.0.3	-9	828,347	860,097
7-zip 9.12b		848,687	824,573
ppmd Jr1	-m256 -o16	740,737	754,243
ppmonstr J		675,485	669,497
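To see at a glance which programs in the table benefit from the tar method, the figures can be compared directly. The snippet below is a small sketch that copies the numbers from the table; the variable names are illustrative only.

```python
# (separate files, tar file) compressed sizes in bytes, copied from the table.
results = {
    "compress": (1_272_772, 1_319_521),
    "Info-ZIP 2.32": (1_020_781, 1_023_042),
    "gzip 1.3.5": (1_017_624, 1_022_810),
    "bzip2 1.0.3": (828_347, 860_097),
    "7-zip 9.12b": (848_687, 824_573),
    "ppmd Jr1": (740_737, 754_243),
    "ppmonstr J": (675_485, 669_497),
}

# A negative difference means the tar file compressed smaller.
for name, (separate, tarred) in results.items():
    diff = tarred - separate
    verdict = "smaller as tar" if diff < 0 else "larger as tar"
    print(f"{name:15s} {diff:+8d} bytes  ({verdict})")

better_as_tar = sorted(n for n, (s, t) in results.items() if t < s)
print(better_as_tar)  # ['7-zip 9.12b', 'ppmonstr J']
```

Only 7-zip and ppmonstr come out smaller on the tar file in this table; the other programs produce larger output, even though the tar container itself adds only 11,274 bytes to the uncompressed input.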
