Friday November 19, 1999
After I talked to Andrew Tridgell about this, he said he knew about this problem and a way of fixing it, but he'd never got around to doing it. It involved making gzip restart its internal compression table whenever a rolling checksum on the data said so.
This document discusses a very simple (and slow) implementation of this idea and the results. It consists of a little wrapper program named rgzip, which runs gzip to do the real work.
kleptog/~>tar czvf /tmp/c.tar.gz c
kleptog/~>tar c --use-compress-program c/rgzip -vf /tmp/c.tar.gz2 c

To use rsync, I had a slightly older version stored on the remote machine, groki. Below you can see the generated files and the fact that the decompression correctly reproduces the same original tar file.
[10:56:52] 1 groki kleptog ~>gzip -d <c.tar.gz | md5sum ; gzip -d <c.tar.gz2 | md5sum ; md5sum <c.tar
ab4cb30ea1e63caa194fde4752ee4af1
ab4cb30ea1e63caa194fde4752ee4af1
ab4cb30ea1e63caa194fde4752ee4af1
[10:57:41] 2 groki kleptog ~>ls -l c.tar*
-rw-rw-r--   1 kleptog  kleptog  25917440 Nov 18 10:51 c.tar
-rw-r-----   1 kleptog  kleptog  12350888 Nov 18 10:46 c.tar.gz
-rw-r-----   1 kleptog  kleptog  12767825 Nov 18 10:43 c.tar.gz2

Note that the rgzipped version is about 3% larger (12767825 vs 12350888 bytes). For this version of rgzip this seems to be typical for large files. Future alterations may improve this.
On the local machine is the same tar file, but it is newer and has a few updates in it. Here again we have the three files, all containing the same information.
kleptog//tmp>gzip -d <c.tar.gz | md5sum ; gzip -d <c.tar.gz2 | md5sum ; md5sum <c.tar
4323a24976b9a869054037d771b4f1d1
4323a24976b9a869054037d771b4f1d1
4323a24976b9a869054037d771b4f1d1
kleptog//tmp>ls -l c.tar*
-rw-rw----   1 kleptog  kleptog  25917440 Nov 18 10:59 c.tar
-rw-rw----   1 kleptog  kleptog  12350909 Nov 18 10:55 c.tar.gz
-rw-rw----   1 kleptog  kleptog  12768336 Nov 18 10:56 c.tar.gz2
Here are the results from just rsyncing the uncompressed tar file. As you can see, most of the data matched, and the speedup (the total file size divided by the number of bytes actually transferred) was 233 times.
[11:18:16] 11 groki kleptog ~>./rsync -e ssh -b -v --stats kleptog:/tmp/c.tar .
c.tar
Number of files: 1
Number of files transferred: 1
Total file size: 25917440 bytes
Total transferred file size: 25917440 bytes
Literal data: 10304 bytes
Matched data: 25907136 bytes
File list size: 24
Total bytes written: 60404
Total bytes read: 50632

wrote 60404 bytes  read 50632 bytes  9655.30 bytes/sec
total size is 25917440  speedup is 233.41

Now we do the normal gzipped tar file. As you can see here, only about 1.5M of the 12M file matched, probably the part up to the first changed file. The speedup from using rsync was marginal (1.13).
[11:18:49] 12 groki kleptog ~>./rsync -e ssh -b -v --stats kleptog:/tmp/c.tar.gz .
c.tar.gz
Number of files: 1
Number of files transferred: 1
Total file size: 12350909 bytes
Total transferred file size: 12350909 bytes
Literal data: 10834317 bytes
Matched data: 1516592 bytes
File list size: 27
Total bytes written: 60188
Total bytes read: 10840768

wrote 60188 bytes  read 10840768 bytes  130550.37 bytes/sec
total size is 12350909  speedup is 1.13

Now we do the special rsync-friendly gzip version. Here we see that the amount of matched data is much higher, over 95% of the file. Overall only about one-twentieth of the data needed to be transferred (roughly 656K of the 12.7M file).
[11:22:07] 13 groki kleptog ~>./rsync -e ssh -b -v --stats kleptog:/tmp/c.tar.gz2 .
c.tar.gz2
Number of files: 1
Number of files transferred: 1
Total file size: 12768336 bytes
Total transferred file size: 12768336 bytes
Literal data: 556671 bytes
Matched data: 12211665 bytes
File list size: 28
Total bytes written: 60644
Total bytes read: 595491

wrote 60644 bytes  read 595491 bytes  57055.22 bytes/sec
total size is 12768336  speedup is 19.46

As you can see, even a simple version of this approach produces amazingly good results.
The method as it stands is very simple. It forks off a separate gzip process to compress each block. The block boundaries are determined by a rolling hash, calculated as follows:
    for each char
        hash = (hash << 1) ^ char

When the bottom x bits of the hash are zero, start a new block. This is undoubtedly a stupid algorithm, but it gives OK results. I also forced a minimum block size to stop lots of tiny blocks from being generated. The above tests were done with x equal to 15 and a minimum block size of 10K.
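To make this concrete, here is a minimal C sketch of that block-splitting loop, using the same constants as the tests above (x equal to 15, 10K minimum block). It only prints where the boundaries fall; in rgzip each block would instead be handed to a freshly forked gzip process. The structure and names here are mine for illustration, not taken from rgzip itself.

    #include <stdio.h>

    #define HASH_BITS 15                       /* "x" in the text above */
    #define HASH_MASK ((1UL << HASH_BITS) - 1)
    #define MIN_BLOCK 10240                    /* 10K minimum block size */

    int main(void)
    {
        unsigned long hash = 0;
        long offset = 0, block_start = 0;
        int c;

        while ((c = getchar()) != EOF) {
            hash = (hash << 1) ^ (unsigned long)c;   /* the rolling hash */
            offset++;

            /* Start a new block when the bottom HASH_BITS bits are all
             * zero, but never emit a block smaller than MIN_BLOCK. */
            if ((hash & HASH_MASK) == 0 && offset - block_start >= MIN_BLOCK) {
                printf("boundary at %ld (block of %ld bytes)\n",
                       offset, offset - block_start);
                block_start = offset;
            }
        }
        printf("final block of %ld bytes\n", offset - block_start);
        return 0;
    }

Note that, because of the left shift, a byte stops affecting the bottom 15 bits of the hash once 15 further bytes have been read, so the boundary test depends only on a short window of recent input. That is what keeps block boundaries stable across insertions and deletions elsewhere in the file. With x equal to 15, a boundary triggers on average once every 2^15 = 32K bytes.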
Future versions will probably use zlib internally rather than forking a new copy of gzip each time. Another possibility is modifying gzip itself.
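As a rough sketch of what the zlib variant could look like (my guess at an approach, not rgzip's actual code): keep a single deflate stream open and issue a Z_FULL_FLUSH at every rolling-hash boundary. Z_FULL_FLUSH empties the pending output and resets the compressor's dictionary, which is precisely the "restart the internal compression table" effect described at the top. This sketch writes a raw zlib stream rather than a proper gzip file (that would also need the gzip header and trailer), ignores error returns, and feeds deflate one byte at a time purely for clarity.

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define HASH_BITS 15
    #define HASH_MASK ((1UL << HASH_BITS) - 1)
    #define MIN_BLOCK 10240
    #define OUTBUF    16384

    /* Feed len bytes to the deflate stream, write whatever comes out. */
    static void push(z_stream *zs, unsigned char *buf, unsigned len, int flush)
    {
        unsigned char out[OUTBUF];

        zs->next_in = buf;
        zs->avail_in = len;
        do {
            zs->next_out = out;
            zs->avail_out = OUTBUF;
            deflate(zs, flush);
            fwrite(out, 1, OUTBUF - zs->avail_out, stdout);
        } while (zs->avail_out == 0);
    }

    int main(void)
    {
        z_stream zs;
        unsigned long hash = 0;
        long in_block = 0;
        int ch;

        memset(&zs, 0, sizeof zs);
        deflateInit(&zs, Z_DEFAULT_COMPRESSION);

        while ((ch = getchar()) != EOF) {
            unsigned char c = (unsigned char)ch;

            hash = (hash << 1) ^ c;
            in_block++;

            if ((hash & HASH_MASK) == 0 && in_block >= MIN_BLOCK) {
                /* Block boundary: flush all pending output and reset the
                 * compression state, so the next block is compressed
                 * independently of everything before it. */
                push(&zs, &c, 1, Z_FULL_FLUSH);
                in_block = 0;
            } else {
                push(&zs, &c, 1, Z_NO_FLUSH);
            }
        }
        push(&zs, NULL, 0, Z_FINISH);   /* end the stream */
        deflateEnd(&zs);
        return 0;
    }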
Ideally, what I'd like to see is for apt-get to support this type of download. It would be nice if, when you wanted to download package-1.1-1.deb and had package-1.0-2.deb on a CD, it used the second as a base to rsync the first.
As a final point, this sort of approach would probably not work with bzip2, since it relies on processing extremely large blocks to get good results.