PostgreSQL table synchroniser

Up: Martijn's Homepage  
Prev: Puzzle solving by computer   Next: Australia Post Barcode Perl Module  

The purpose is to copy one table in one database to another database efficiently. In this sense it is similar to the popular rsync program. It uses the data already there to reduce the amount of data that needs to be sent.

However, more than plain copying, this program allows you to control exactly which data is copied out. For example, you can limit both which rows are copied out as well as which columns. For data that needs to be partially obscured you can apply an expression to modify the data copied.

There are several potential uses for this. They include:

USAGE

pgrsync -s <dbspec> -d <dbspec> -t tablelist [-o option] [-e ssh] [-p prog]

    -s <dbspec>        Database to copy from
    -d <dbspec>        Database to copy to
    -t tablelist       Comma seperated list of tables, may be specified
                       multiple times.
    -o <option>        Option to control what is copied

    -e ssh             Alternative to ssh to use. Mayb behave the same
    -p prog            pgrsync executable on remote machine

  <dbspec> := <login>@<host>:<user>@<dbname>/<port>

    The colon and <dbname> are the only required parts. The usernames are
    optional, in which case the @ symbols must be omitted. The port number
    defaults to UNIX domain.

  <option> is one of:

      map:table=table2             Map tablename for remote machine
      map:table.field=field2       Map fieldname in table for remote machine
      expr:table.field=expr        Replace field with expression
      cond:table=<expr>            Condition for WHERE clause to restrict copying
      fields:table=<tablelist>     List of fields to copy. ! excludes. *=all. Last match
The database specification is very flexible. Examples include:

INSTALLATION

At the moment the installation is a bit awkward. The single external module needs to be found by the main program. The easiest way to acheive this is to place PGDb.pm somewhere where the ssh login will find it. /usr/local/lib/site_perl is usually a good choice. The actual main script needs to be in your path, unless you use the -p switch.

LIMITATIONS

The optimisation in this program relies somewhat on being able to determine what order rows are supposed to be in so that you can easily determine which are missing. The simplest distinction is between numeric and string ordering. This must be detected to work properly.

A more serious problem is that of encodings, as that determines the default order of string output, in particular whether it is case-sensetive or not. Because Perl uses case sensetive comprisons it may sometimes disagree with the database and thus produce a non-optimal change set. However, we can still see that the databases have the rows in the same order. So in most cases this will not be a problem.

Even worse, if the source and destination use different encodings, the order will be different and the program will *never* see them as the same and send the same data each time. It is not clear what can be done about this. Hopefully tables where the primary key is a string whose case varies are not common. At least the end result still has the right data arriving.

Also in the interests of efficiency, this program requires the table to have a single column primary key. This is to avoid a possibly quite expensive sort on a large table. While using the index may be less efficient time-wise, it is much more efficient than an explicit sort. Note this restriction may be lifted in the future.

Similarly, this program could copy the structure of the table also, automatically creating and dropping columns as required. Creating and dropping triggers, functions and indexes would be an obviously useful feature. This program does not attempt to check datatypes. There was an attempt but assessing compatability proved too hard. Instead the transfer is cleanly aborted if any of the updates fail.

Finally, this program only copies between PostgreSQL databases. In principle it could work between any two databases. Only a few basic features are required: (1) support for cursors (DECLARE/FETCH) (sorry mysql) and (2) to be able to extract information about the structure of the database (tables/fields/indexes). If someone wants to extend the PGDb module to another database it would be appreciated.

EFFICIENCY

Since one the points of this program is to copy efficiently obviously this need to be explained. Currently it involves the destination to send about 8-12 bytes per row in the table. The source will then send the data to the destination. No changed data means the source sends nothing at all. This should be addressed soon as it could be improved for very large tables (>1,000,000 rows).

REVISION HISTORY

Version 0.01 - Initial release (01 Oct 2002)

DOWNLOAD

You can download it here (latest version 0.01).

BUGS / LIMITATIONS

Currently the following features are not supported. See above for more details.

AUTHOR

Martijn van Oosterhout October 2002
http://svana.org/kleptog/

LICENCE

This program is distributed under the GPL.


Up: Martijn's Homepage  
Prev: Puzzle solving by computer   Next: Australia Post Barcode Perl Module  
By Martijn van Oosterhout (kleptog (at) svana.org)
Copyright © 2000-2006 - Last modified 3/10/2002