25.2. File-based Log Shipping

Continuous archiving can be used to create a high availability (HA) cluster configuration with one or more standby servers ready to take over operations if the primary server fails. This capability is widely referred to as warm standby or log shipping.

The primary and standby server work together to provide this capability, though the servers are only loosely coupled. The primary server operates in continuous archiving mode, while each standby server operates in continuous recovery mode, reading the WAL files from the primary. No changes to the database tables are required to enable this capability, so it offers low administration overhead compared to some other replication approaches. This configuration also has relatively low performance impact on the primary server.

Directly moving WAL records from one database server to another is typically described as log shipping. PostgreSQL implements file-based log shipping, which means that WAL records are transferred one file (WAL segment) at a time. WAL files (16MB) can be shipped easily and cheaply over any distance, whether it be to an adjacent system, another system at the same site, or another system on the far side of the globe. The bandwidth required for this technique varies according to the transaction rate of the primary server. Record-based log shipping is also possible with custom-developed procedures, as discussed in Section 25.2.3.

It should be noted that the log shipping is asynchronous, i.e., the WAL records are shipped after transaction commit. As a result there is a window for data loss should the primary server suffer a catastrophic failure: transactions not yet shipped will be lost. The length of the window of data loss can be limited by use of the archive_timeout parameter, which can be set as low as a few seconds if required. However such a low setting will substantially increase the bandwidth required for file shipping. If you need a window of less than a minute or so, it's probably better to consider record-based log shipping.

The standby server is not available for access, since it is continually performing recovery processing. Recovery performance is sufficiently good that the standby will typically be only moments away from full availability once it has been activated. As a result, we refer to this capability as a warm standby configuration that offers high availability. Restoring a server from an archived base backup and rollforward will take considerably longer, so that technique only offers a solution for disaster recovery, not high availability.

25.2.1. Planning

It is usually wise to create the primary and standby servers so that they are as similar as possible, at least from the perspective of the database server. In particular, the path names associated with tablespaces will be passed across unmodified, so both primary and standby servers must have the same mount paths for tablespaces if that feature is used. Keep in mind that if CREATE TABLESPACE is executed on the primary, any new mount point needed for it must be created on the primary and all standby servers before the command is executed. Hardware need not be exactly the same, but experience shows that maintaining two identical systems is easier than maintaining two dissimilar ones over the lifetime of the application and system. In any case the hardware architecture must be the same — shipping from, say, a 32-bit to a 64-bit system will not work.

In general, log shipping between servers running different major PostgreSQL release levels is not possible. It is the policy of the PostgreSQL Global Development Group not to make changes to disk formats during minor release upgrades, so it is likely that running different minor release levels on primary and standby servers will work successfully. However, no formal support for that is offered and you are advised to keep primary and standby servers at the same release level as much as possible. When updating to a new minor release, the safest policy is to update the standby servers first — a new minor release is more likely to be able to read WAL files from a previous minor release than vice versa.

There is no special mode required to enable a standby server. The operations that occur on both primary and standby servers are normal continuous archiving and recovery tasks. The only point of contact between the two database servers is the archive of WAL files that both share: primary writing to the archive, standby reading from the archive. Care must be taken to ensure that WAL archives from separate primary servers do not become mixed together or confused. The archive need not be large if it is only required for standby operation.

The magic that makes the two loosely coupled servers work together is simply a restore_command used on the standby that, when asked for the next WAL file, waits for it to become available from the primary. The restore_command is specified in the recovery.conf file on the standby server. Normal recovery processing would request a file from the WAL archive, reporting failure if the file was unavailable. For standby processing it is normal for the next WAL file to be unavailable, so we must be patient and wait for it to appear. For files ending in .backup or .history there is no need to wait, and a non-zero return code must be returned. A waiting restore_command can be written as a custom script that loops after polling for the existence of the next WAL file. There must also be some way to trigger failover, which should interrupt the restore_command, break the loop and return a file-not-found error to the standby server. This ends recovery and the standby will then come up as a normal server.

Pseudocode for a suitable restore_command is:

triggered = false;
while (!NextWALFileReady() && !triggered)
{
    sleep(100000L);         /* wait for ~0.1 sec */
    if (CheckForExternalTrigger())
        triggered = true;
}
if (!triggered)
        CopyWALFileForRecovery();

A working example of a waiting restore_command is provided as a contrib module named pg_standby. It should be used as a reference on how to correctly implement the logic described above. It can also be extended as needed to support specific configurations and environments.

PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server. Many such tools exist and are well integrated with the operating system facilities required for successful failover, such as IP address migration.

The method for triggering failover is an important part of planning and design. One potential option is the restore_command command. It is executed once for each WAL file, but the process running the restore_command is created and dies for each file, so there is no daemon or server process, and we cannot use signals or a signal handler. Therefore, the restore_command is not suitable to trigger failover. It is possible to use a simple timeout facility, especially if used in conjunction with a known archive_timeout setting on the primary. However, this is somewhat error prone since a network problem or busy primary server might be sufficient to initiate failover. A notification mechanism such as the explicit creation of a trigger file is ideal, if this can be arranged.

The size of the WAL archive can be minimized by using the %r option of the restore_command. This option specifies the last archive file name that needs to be kept to allow the recovery to restart correctly. This can be used to truncate the archive once files are no longer required, assuming the archive is writable from the standby server.

25.2.2. Implementation

The short procedure for configuring a standby server is as follows. For full details of each step, refer to previous sections as noted.

  1. Set up primary and standby systems as nearly identical as possible, including two identical copies of PostgreSQL at the same release level.

  2. Set up continuous archiving from the primary to a WAL archive directory on the standby server. Ensure that archive_mode, archive_command and archive_timeout are set appropriately on the primary (see Section 24.3.1).

  3. Make a base backup of the primary server (see Section 24.3.2), and load this data onto the standby.

  4. Begin recovery on the standby server from the local WAL archive, using a recovery.conf that specifies a restore_command that waits as described previously (see Section 24.3.3).

Recovery treats the WAL archive as read-only, so once a WAL file has been copied to the standby system it can be copied to tape at the same time as it is being read by the standby database server. Thus, running a standby server for high availability can be performed at the same time as files are stored for longer term disaster recovery purposes.

For testing purposes, it is possible to run both primary and standby servers on the same system. This does not provide any worthwhile improvement in server robustness, nor would it be described as HA.

25.2.3. Record-based Log Shipping

PostgreSQL directly supports file-based log shipping as described above. It is also possible to implement record-based log shipping, though this requires custom development.

An external program can call the pg_xlogfile_name_offset() function (see Section 9.24) to find out the file name and the exact byte offset within it of the current end of WAL. It can then access the WAL file directly and copy the data from the last known end of WAL through the current end over to the standby servers. With this approach, the window for data loss is the polling cycle time of the copying program, which can be very small, and there is no wasted bandwidth from forcing partially-used segment files to be archived. Note that the standby servers' restore_command scripts can only deal with whole WAL files, so the incrementally copied data is not ordinarily made available to the standby servers. It is of use only when the primary dies — then the last partial WAL file is fed to the standby before allowing it to come up. The correct implementation of this process requires cooperation of the restore_command script with the data copying program.

Starting with PostgreSQL version 8.5, you can use streaming replication (see Section 25.3) to achieve the same with less effort.