Order in archive and compression ratio (1)
The following example was recently added to the Waf directory to experiment with large numbers of tasks created by make-like rules. The tasks in the example perform the following operations (a sketch of the central step is shown after the list):
- Create compressed archives from the same files taken in random order (shuffled)
- Measure the compressed file sizes and append the values to a main data file
- Compute the file size distribution from the main data file
- Create a gnuplot script to represent the distribution
- Use the gnuplot script to create pictures
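To make the process concrete, here is a minimal standalone Python sketch of that central step, independent of the actual Waf rules from the example. The `shuffled_archive` helper, the `waflib/*.py` input pattern and the `sizes.txt` data file are hypothetical names chosen for illustration:

```python
import glob
import os
import random
import tarfile

def shuffled_archive(index, files, datafile="sizes.txt"):
    """Create one tar.gz archive from the files in a random order
    and append its compressed size to the main data file."""
    order = list(files)
    random.shuffle(order)                     # new random order for each archive
    name = "run_%06d.tar.gz" % index
    with tarfile.open(name, "w:gz") as tar:
        for f in order:
            tar.add(f)
    size = os.path.getsize(name)
    with open(datafile, "a") as out:          # accumulate one size per line
        out.write("%d\n" % size)
    os.remove(name)                           # only the size is kept
    return size

if __name__ == "__main__":
    files = sorted(glob.glob("waflib/*.py"))  # hypothetical input set
    for i in range(1000):                     # the article uses ~300000 runs
        shuffled_archive(i, files)
```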
The files used in the example are the Python files from Waf, and the archives are created in the tar format and compressed with gzip or bzip2. After creating slightly more than 300000 compressed files, the distribution of the gzip file sizes looks like the following:

[figure: size distribution of the gzip archives]
The file sizes range from 80282 to 83605 bytes, a spread of about 4%. A significant size reduction can therefore be obtained by carefully choosing the input file order.
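A distribution plot like this one could be produced from the data file along the lines of the following sketch; the `write_distribution` helper and all file names are again hypothetical, and the emitted gnuplot script only assumes the standard png terminal:

```python
import collections

def write_distribution(datafile="sizes.txt", hist="hist.txt", script="plot.gp"):
    """Count how many archives were produced for each compressed size
    and emit a gnuplot script that plots the resulting distribution."""
    counts = collections.Counter()
    with open(datafile) as f:
        for line in f:
            counts[int(line)] += 1
    with open(hist, "w") as out:              # one "size count" pair per line
        for size in sorted(counts):
            out.write("%d %d\n" % (size, counts[size]))
    with open(script, "w") as out:
        out.write('set xlabel "compressed size (bytes)"\n')
        out.write('set ylabel "number of archives"\n')
        out.write('set terminal png\n')
        out.write('set output "distribution.png"\n')
        out.write('plot "%s" using 1:2 with lines notitle\n' % hist)

# running "gnuplot plot.gp" afterwards produces the picture
```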
For the same number of compressed files, the distribution of the bzip2 file sizes is the following:

[figure: size distribution of the bzip2 archives]
The files created range from 68532 to 68807 bytes, a spread of only about 0.4%, so the size variation is not significant in this example. Yet the shape of the distribution is much more interesting, and its causes remain mysterious. For example, the same shape is obtained by compressing the plain concatenation of the file contents, so the tar file format itself can be ruled out as the cause. The distribution curve also keeps its shape as more data points are added.
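That observation is easy to probe with a few lines of Python: the sketch below compresses the plain concatenation of the shuffled file contents with `bz2` instead of building tar archives. The `concat_size` helper and the input pattern are hypothetical, as before:

```python
import bz2
import glob
import random

def concat_size(files):
    """Compress the plain concatenation of the file contents
    (no tar framing) and return the compressed size."""
    data = b""
    for f in files:
        with open(f, "rb") as fh:
            data += fh.read()
    return len(bz2.compress(data))

if __name__ == "__main__":
    files = glob.glob("waflib/*.py")   # hypothetical input set, as before
    sizes = []
    for _ in range(100):
        random.shuffle(files)
        sizes.append(concat_size(files))
    # if the observation holds, these sizes follow the same distribution
    # shape as the tar.bz2 archives, only shifted by the missing tar headers
    print(min(sizes), max(sizes))
```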
Give it a try yourself by downloading Waf 1.6 (the old branch) and the example.