WIP patch for parallel pg_dump

From: Joachim Wieland <joe(at)mcknight(dot)de>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP patch for parallel pg_dump
Date: 2010-11-14 23:52:55
Message-ID: AANLkTin27_TOVU5KF90Ou3qGnT+d76JPgaDbrDLZBaxV@mail.gmail.com
Lists: pgsql-hackers

This is the second patch for parallel pg_dump, now the actual part that
parallelizes the whole thing. More precisely, it adds parallel backup/restore
to pg_dump/pg_restore for the directory archive format and keeps the parallel
restore part of the custom archive format. Combined with my directory archive
format patch, which also includes a prototype of liblzf compression, you can
use that compression with any of the backup/restore scenarios just mentioned.
This patch applies on top of the previous directory patch.

You would run a regular parallel dump with

$ pg_dump -j 4 -Fd -f out.dir dbname

In previous discussions there was a request to add support for multiple
directories, which I have done as well, so that you can also run

$ pg_dump -j 4 -Fd -f dir1:dir2:dir3 dbname

to distribute the data equally among those three directories (we can still
discuss the syntax; I am not all that happy with the colon either...)

The dump always starts with the largest objects, as estimated from the
relpages column of pg_class, which should give a good approximation. The
order of the objects to restore is determined by the dependencies among the
objects (the same mechanism that is already used in the parallel restore of
the custom archive format).
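
For illustration, here is a minimal sketch (not code from the patch) of the
kind of catalog query such a size estimate can be based on; note that
relpages is only maintained by VACUUM and ANALYZE, so it is an estimate
rather than an exact size:

SELECT c.oid, c.relname, c.relpages
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY c.relpages DESC;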

The file test.sh includes some example commands that I have run here as a
kind of regression test; they should give you an impression of how to call
it from the command line.

One thing that is currently missing is proper support for Windows; this is
the next thing that I will be working on. Also, this version still emits
quite a bit of debug information about what the processes are doing, so
don't try to pipe the pg_dump output anywhere (even when not run in
parallel); it will probably just not work...

The missing piece that would make parallel pg_dump work with no strings
attached is snapshot synchronization. As long as there are no synchronized
snapshots, you need to stop writing to your database before starting the
parallel pg_dump. However, it turns out that most often, when you are
especially concerned about a fast dump, you have shut down your applications
anyway (which is the reason why you are so concerned about speed in the
first place). These cases are typically database migrations from one
host/platform to another, or database upgrades without pg_migrator.
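
To illustrate what snapshot synchronization would buy us, here is a sketch of
the kind of snapshot-export interface that has been discussed; note that
pg_export_snapshot() and SET TRANSACTION SNAPSHOT do not exist yet, they are
purely an assumption of this sketch:

-- leader connection: open a transaction and export its snapshot
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT pg_export_snapshot();  -- returns some snapshot identifier

-- each worker connection: adopt the leader's snapshot, so that every
-- worker sees exactly the same database state for the whole dump
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET TRANSACTION SNAPSHOT '<identifier from the leader>';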

Joachim

Attachment             Content-Type   Size
pg_dump-parallel.diff  text/x-patch   132.4 KB
