An rsync friendly gzip


By Martijn van Oosterhout <kleptog@svana.org>

Friday November 19, 1999

Preamble

One thing that has annoyed me for quite a while is that .deb files are not rsync-able. If I have an older version of the deb lying around, I can't use it to speed up the download of the next one. Over a 33.6K modem this is not a good experience. Of course, this is part of a bigger problem: a compressed file simply does not rsync very well.

After I talked to Andrew Tridgell about this, he said he knew about the problem and a way of fixing it, but he'd never got around to doing it. It involved making gzip restart its internal compression table whenever a rolling checksum on the data said so.

This document discusses a very simple (and slow) implementation of this and the results. It consists of a little wrapper program named rgzip which runs gzip to do the real work.

The files

As a test file, I used a tarred up copy of the 'c' directory in my home dir. This was then compressed with both gzip and my gzip wrapper. The compressed files were created as follows:
kleptog/~>tar czvf /tmp/c.tar.gz c 
kleptog/~>tar c --use-compress-program c/rgzip -vf /tmp/c.tar.gz2 c
To use rsync, I had a slightly older version stored on the remote machine, groki. Below you can see the generated files and the fact that the decompression correctly reproduces the same original tar file.
[10:56:52] 1 groki kleptog ~>gzip -d <c.tar.gz | md5sum ; gzip -d <c.tar.gz2 | md5sum ; md5sum <c.tar
ab4cb30ea1e63caa194fde4752ee4af1
ab4cb30ea1e63caa194fde4752ee4af1
ab4cb30ea1e63caa194fde4752ee4af1
[10:57:41] 2 groki kleptog ~>ls -l c.tar*
-rw-rw-r--   1 kleptog  kleptog  25917440 Nov 18 10:51 c.tar
-rw-r-----   1 kleptog  kleptog  12350888 Nov 18 10:46 c.tar.gz
-rw-r-----   1 kleptog  kleptog  12767825 Nov 18 10:43 c.tar.gz2
Note that the rgzipped version is about 3% larger. For this version of rgzip this seems to be typical for large files. Future alterations may improve this.

On the local machine there is the same tar file, but it is newer and has a few updates in it. Here again we have the three files, all containing the same information.

kleptog//tmp>gzip -d <c.tar.gz | md5sum ; gzip -d <c.tar.gz2 | md5sum ; md5sum <c.tar
4323a24976b9a869054037d771b4f1d1
4323a24976b9a869054037d771b4f1d1
4323a24976b9a869054037d771b4f1d1
kleptog//tmp>ls -l c.tar*
-rw-rw----   1 kleptog  kleptog  25917440 Nov 18 10:59 c.tar
-rw-rw----   1 kleptog  kleptog  12350909 Nov 18 10:55 c.tar.gz
-rw-rw----   1 kleptog  kleptog  12768336 Nov 18 10:56 c.tar.gz2

The tests

Here are the results from just rsyncing the uncompressed tar file. As you can see, most of the data matched and the speedup was 233 times.

[11:18:16] 11 groki kleptog ~>./rsync -e ssh -b -v --stats kleptog:/tmp/c.tar .
c.tar

Number of files: 1
Number of files transferred: 1
Total file size: 25917440 bytes
Total transferred file size: 25917440 bytes
Literal data: 10304 bytes
Matched data: 25907136 bytes
File list size: 24
Total bytes written: 60404
Total bytes read: 50632

wrote 60404 bytes  read 50632 bytes  9655.30 bytes/sec
total size is 25917440  speedup is 233.41
Now we do the normally gzipped tar file. As you can see here, only about 1.5M of the 12M file matched, probably the part before the first changed file. The speedup from using rsync was marginal.
[11:18:49] 12 groki kleptog ~>./rsync -e ssh -b -v --stats kleptog:/tmp/c.tar.gz .
c.tar.gz

Number of files: 1
Number of files transferred: 1
Total file size: 12350909 bytes
Total transferred file size: 12350909 bytes
Literal data: 10834317 bytes
Matched data: 1516592 bytes
File list size: 27
Total bytes written: 60188
Total bytes read: 10840768

wrote 60188 bytes  read 10840768 bytes  130550.37 bytes/sec
total size is 12350909  speedup is 1.13
Now we do the file produced by the rsync-friendly gzip wrapper. Here we see that the amount of matched data is much higher: over 95% of the file. Overall, only about one-twentieth of the data needed to be transferred.
[11:22:07] 13 groki kleptog ~>./rsync -e ssh -b -v --stats kleptog:/tmp/c.tar.gz2 .
c.tar.gz2

Number of files: 1
Number of files transferred: 1
Total file size: 12768336 bytes
Total transferred file size: 12768336 bytes
Literal data: 556671 bytes
Matched data: 12211665 bytes
File list size: 28
Total bytes written: 60644
Total bytes read: 595491

wrote 60644 bytes  read 595491 bytes  57055.22 bytes/sec
total size is 12768336  speedup is 19.46
As you can see, even a simple version of this approach produces amazingly good results.

The algorithm

The algorithm used in this version is extremely simple and totally stupid. The design is bad enough that it should not be used in any production system. On the other hand, it gives us some good figures to work with.

The method as it stands is very simple: a separate gzip process is forked to compress each block. The block boundaries are determined by a rolling hash, which is calculated as follows:

for each char
  hash = (hash << 1) ^ char
When the bottom x bits of the hash are zero, start a new block. This is undoubtedly a stupid algorithm, but it gives OK results. I also enforced a minimum block size to stop many tiny blocks from being generated. The above tests were done with x equal to 15 and a minimum block size of 10K.
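The boundary search above can be sketched in C. This is my own rendering of the description, not the actual rgzip source; the function name next_boundary is mine, and the constants match the test run (x = 15, 10K minimum block). Whether the real wrapper resets the hash at each boundary is not stated, so this sketch simply restarts from zero for each call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_MASK ((1u << 15) - 1)   /* bottom x = 15 bits must be zero */
#define MIN_BLOCK (10 * 1024)        /* 10K minimum block size */

/* Scan buf and return the length of the first block: the offset just
 * after the first byte where the rolling hash's bottom 15 bits are
 * zero, subject to the minimum block size. If no boundary is found,
 * the whole buffer is one block. */
size_t next_boundary(const unsigned char *buf, size_t len)
{
    unsigned int hash = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        hash = (hash << 1) ^ buf[i];
        if (i + 1 >= MIN_BLOCK && (hash & HASH_MASK) == 0)
            return i + 1;            /* start a new block after byte i */
    }
    return len;                      /* no boundary: one block */
}
```

Each block found this way would then be handed to its own gzip process, so that a change in one block cannot perturb the compressed output of the blocks after it.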

Future versions will probably use zlib internally rather than forking a new copy of gzip each time. Other possibilities include modifying gzip itself.

Conclusion

This type of system obviously has many applications, but it needs a lot more testing to see how it would work on a real-world system. If it works well on .deb files, it will make the mirror sites happier and reduce people's download time. The 3% larger file size is a concern, though.

What I'd ideally like to see is for apt-get to support this type of download. It would be nice if, when you wanted to download package-1.1-1.deb and had package-1.0-2.deb on a CD, it used the second as a base to rsync the first.

As a final point, I would like to say that this sort of approach would probably not work with bzip2, as it relies on processing extremely large blocks to get good results.

Source code for rgzip


Copyright © 2000-2006 - Last modified 4/05/2001