Re: directory archive format for pg_dump

Lists: pgsql-hackers
From: Joachim Wieland <joe(at)mcknight(dot)de>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: directory archive format for pg_dump
Date: 2010-11-14 23:48:07
Message-ID: AANLkTimUELTXwRSQDQNwxik_k1y3YcH1u-9NgHZqpi9e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

This is the first of two patches for parallel pg_dump. In particular, this
patch adds a new pg_dump archive type which can save pg_dump data to a
directory, with each table/blob being a file so that several processes can
write to different files in parallel.

Since the compression is currently all down in the custom format backup
code,
the first thing I've done was refactoring the compression functions into a
separate file. While at it, I have added support for liblzf compression.

Writing the backup to a directory brings the disadvantage that your backup
now
consists of a bunch of files and you should make sure not to lose files or
mix
files of different backup sets. Therefore, I have added a -k switch that
checks if a directory backup set is complete. To do this, every backup has a
different id (basically a random md5sum) which is copied into every file
(both
TOC and data files). The TOC also knows about the size of each data file and
can check if it has been truncated for some reason.

Regarding lzf compression, the last discussion was here:

http://archives.postgresql.org/pgsql-hackers/2010-04/msg00442.php

I have included it to actually have multiple compression algorithms to build
a
framework for and to allow people to just compile and run it and see what
they
get. In my tests, when I run a backup with lzf compression, the postgres
backend is using 100% of one CPU and pg_dump is using 15% of another CPU.
Running with zlib however gives me 100% zlib and 70% postgres. Specifying
the
fastest zlib compression rate of 1 gives me 50% pg_dump and 100% postgres.
zlib
compression can be taken out of the code in like two minutes, it's all in
#ifdef's, so please see lzf just as an optional addition to the directory
patch
instead of as a main feature.

I am also submitting a WIP patch that shows the parallel version of pg_dump
which is a patch on top of this one. It is not completely ready yet but I am
releasing it as a WIP patch so you can see the overall picture and can play
with it already now. And hopefully I can get some feedback if I am going
into
the right direction.

There is a small shellscript included (test.sh) listing some of the
commands,
to give people a quick overview of how to call it.

Joachim

Attachment Content-Type Size
pg_dump-directory.diff text/x-patch 110.2 KB

From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: directory archive format for pg_dump
Date: 2010-11-19 19:44:06
Message-ID: m28w0pm73t.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Sharing some thoughts after a first round of reviewing, where I only had
time to read the patch itself.

Joachim Wieland <joe(at)mcknight(dot)de> writes:
> Since the compression is currently all down in the custom format
> backup code,
> the first thing I've done was refactoring the compression functions
> into a
> separate file. While at it, I have added support for liblzf
> compression.

I think I'd like to see a separate patch for the new compression
support. Sorry about that, I realize that's extra work…

And it could be about personal preferences, but the way you added the
liblzf support strikes me at odd, with all those #ifdefs everywhere. Is
it possible to have a specific file for each supported compression
format, then some routing code in src/bin/pg_dump/compress_io.c?

The routing code already exists but then the file is full of #ifdef
sections to define the right supporting function when I think having a
compress_io_zlib and a compress_io_lzf files would be better.

Then there's the bulk of the new dump format feature in the other part
of the patch, namely src/bin/pg_dump/pg_backup_directory.c. You have to
update the copyright in the file header there, at least :)

I'm yet to devote more time on this part of the patch but it seems like
it's rewriting the full support without using the existing bits. That's
something I have to check, didn't have time to read the existing other
archive formats code there.

I'm hesitant as far as marking the patch "Waiting on author" to get it
split. Joachim, what do you think?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: directory archive format for pg_dump
Date: 2010-11-19 22:40:55
Message-ID: AANLkTinZW2DNTAedTyD-QxDVOiCU5qoFLmjcXQd5OCGv@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Dimitri,

thanks for reviewing my patch!

On Fri, Nov 19, 2010 at 2:44 PM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> I think I'd like to see a separate patch for the new compression
> support. Sorry about that, I realize that's extra work…

I guess it wouldn't be a very big deal but I also doubt that it makes
the review that much easier. Basically the compression refactor patch
would just touch pg_backup_custom.c (because this is the place where
the libz compression is currently burried into) and the two new
compress_io.(c|h) files. Everything else is pretty much the directory
stuff and is on top of these changes.

> And it could be about personal preferences, but the way you added the
> liblzf support strikes me at odd, with all those #ifdefs everywhere. Is
> it possible to have a specific file for each supported compression
> format, then some routing code in src/bin/pg_dump/compress_io.c?

Sure we could. But I wanted to wait with any fancy function pointer
stuff until we have decided if we want to include the liblzf support
at all. The #ifdefs might be a bit ugly but in case we do not include
liblzf support, it's the easiest way to take it out again. As written
in my introduction, this patch is not really about liblzf, liblzf is
just a proof of concept for factoring out the compression part and I
have included it, so that people can use it and see how much speed
improvement they get.

> The routing code already exists but then the file is full of #ifdef
> sections to define the right supporting function when I think having a
> compress_io_zlib and a compress_io_lzf files would be better.

Sure! I completely agree...

> Then there's the bulk of the new dump format feature in the other part
> of the patch, namely src/bin/pg_dump/pg_backup_directory.c. You have to
> update the copyright in the file header there, at least :)

Well, not sure if we can just change the copyright notice, because in
the end the structure was copied from one of the other files which all
have the copyright notice in them, so my work is based on those other
files...

> I'm hesitant as far as marking the patch "Waiting on author" to get it
> split. Joachim, what do you think?

I will see if I can split it.

Joachim


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: directory archive format for pg_dump
Date: 2010-11-19 22:53:03
Message-ID: 17836.1290207183@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr> writes:
> I think I'd like to see a separate patch for the new compression
> support. Sorry about that, I realize that's extra work

That part of the patch is likely to get rejected outright anyway,
so I *strongly* recommend splitting it out. We have generally resisted
adding random compression algorithms to pg_dump because of license and
patent considerations, and I see no reason to suppose this one is going
to pass muster.

regards, tom lane


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: directory archive format for pg_dump
Date: 2010-11-19 23:53:22
Message-ID: AANLkTi=Q+UNRNGiKu+wxKvnMsBBNruNjT4gNiyCUuEKb@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 11:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr> writes:
> > I think I'd like to see a separate patch for the new compression
> > support. Sorry about that, I realize that's extra work…
>
> That part of the patch is likely to get rejected outright anyway,
> so I *strongly* recommend splitting it out. We have generally resisted
> adding random compression algorithms to pg_dump because of license and
> patent considerations, and I see no reason to suppose this one is going
> to pass muster.
>

I was already anticipating that possiblitiy and my inital patch description
is along these lines.

However, liblzf is BSD licensed so on the license side we should be fine.
Regarding patents, your last comment was that you'd like to see if it's
really worth it and so I have included support for lzf for anybody to go
ahead and find that out.

Will send an updated split up patch this weekend (which would actually be
four patches already...).

Joachim


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: directory archive format for pg_dump
Date: 2010-11-22 04:02:10
Message-ID: AANLkTi=EHHfXxUMKqnL-vpyKTczwZ+BaX_QG14xys5Xa@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 2:44 PM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> I think I'd like to see a separate patch for the new compression
> support. Sorry about that, I realize that's extra work…

Attached are two patches building on top of each other. The first one
factors out the I/O routines (esp. libz) of pg_backup_custom.c into a
new file compress_io.c. This patch is without liblzf support now.

The second patch on top implements the new archive format of a directory.

Regarding the parallel part, I have been playing with Windows support
this weekend but I am still facing some issues (if anybody wants to
help who knows more about Windows programming than me, just let me
know). I will send the parallel patch and the liblzf part as two other
separate patches in the next few days.

Joachim

Attachment Content-Type Size
pg_dump-compression-refactor.diff text/x-patch 32.6 KB
pg_dump-directory.diff text/x-patch 66.1 KB