Re: pg_dump directory archive format / parallel pg_dump

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump directory archive format / parallel pg_dump
Date: 2011-01-20 15:22:01
Message-ID: 4D385319.1060005@enterprisedb.com
Lists: pgsql-hackers

On 20.01.2011 15:46, Joachim Wieland wrote:
> On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> The header is there to identify a file, it contains the header that
>>> every other pgdump file contains, including the internal version
>>> number and the unique backup id.
>>>
>>> The tar format doesn't support compression so going from one to the
>>> other would only work for an uncompressed archive and special care
>>> must be taken to get the order of the tar file right.
>>
>> Hmm, tar format doesn't support compression, but looks like the file format
>> issue has been thought of already: there's still code there to add .gz
>> suffix for compressed files. How about adopting that convention in the
>> directory format too? That would make an uncompressed directory format
>> compatible with the tar format.
>
> So what you could do is dump in the tar format, untar and restore in
> the directory format. I see that this sounds nice but still I am not
> sure why someone would dump to the tar format in the first place.

I'm not sure either. Maybe you want to pipe the output of "pg_dump -F t"
via an ssh tunnel to another host, where you untar it, producing a
directory format dump. You can then edit the directory format dump, and
restore it back to the database without having to tar it again.
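
A minimal sketch of that pipeline, assuming ssh access to the other host and a tar there that understands -C. The function name, database name, host and target directory are all placeholders, not anything from the patch:

```shell
# Hypothetical helper: stream a tar-format dump over ssh and unpack it
# into a directory on the remote side, without storing the tar file.
dump_to_remote_dir() {
    db=$1      # database to dump (placeholder)
    host=$2    # remote host reachable via ssh (placeholder)
    dir=$3     # directory to unpack into on the remote host (placeholder)
    pg_dump -F t "$db" | ssh "$host" "mkdir -p '$dir' && tar xf - -C '$dir'"
}

# Example invocation (not run here):
#   dump_to_remote_dir mydb backuphost /var/backups/mydb.dir
```

Whether pg_restore accepts the unpacked directory as-is depends on the two formats being compatible, which is the point under discussion.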

It gives you a lot of flexibility if the formats are compatible, which
is generally good.

> But you still cannot go back from the directory archive to the tar
> archive because the standard command line tar will not respect the
> order of the objects that pg_restore expects in a tar format, right?

Hmm, I didn't realize pg_restore requires the files to be in a certain
order in the tar file. There's no mention of that in the docs either; we
should add that. It doesn't actually require that if you read from a
file, but from stdin it does.

You can put files in the archive in a certain order if you list them
explicitly in the tar command line, like "tar cf backup.tar toc.dat
...". It's hard to know the right order, though. In practice you would
need to do "tar tf backup.tar >files" before untarring, and use "files"
to tar them again in the right order.
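
The round trip described above can be sketched as follows. The member names are stand-ins for a real dump, and the -T (read member names from a file) and -C options are assumed to be available, as they are in GNU tar and bsdtar:

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Stand-in archive members; a real dump would have toc.dat plus data files.
mkdir src
for f in toc.dat 2.dat 1.dat restore.sql; do
    printf 'contents of %s\n' "$f" > "src/$f"
done

# Build the "original" archive in a deliberate, non-alphabetical order.
(cd src && tar cf ../backup.tar toc.dat 2.dat 1.dat restore.sql)

# Record the member order before untarring.
tar tf backup.tar > files

mkdir extracted
tar xf backup.tar -C extracted

# ... edit files under extracted/ here ...

# Re-create the archive, feeding tar the recorded order.
(cd extracted && tar cf ../rebuilt.tar -T ../files)

# Check that the rebuilt archive lists members in the same order.
tar tf rebuilt.tar > files_rebuilt
cmp files files_rebuilt && echo "member order preserved"
```

This only preserves the order that the original archive had; it does not tell you what order pg_restore needs if no original tar file is at hand.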

>> That seems pretty attractive anyway, because you can then dump to a
>> directory, and manually gzip the data files later.
>
> The command line gzip will probably add its own header to the file
> that pg_restore would need to strip off...

Yeah, we should write the header too. That's not hard, e.g. gzopen will
do that automatically, or you can pass a flag to deflateInit2.
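
For reference, the header in question is easy to see from the shell. This sketch gzips a stand-in data file (the file name and contents are made up) and prints the first two bytes, which for any gzip file are the magic number 1f 8b:

```shell
set -eu
tmp=$(mktemp -d)

# Stand-in for an uncompressed data file from a directory dump.
printf 'COPY data would go here\n' > "$tmp/1.dat"

# Compress it by hand; gzip replaces 1.dat with 1.dat.gz.
gzip "$tmp/1.dat"

# The first two bytes of the result are the gzip magic number.
magic=$(od -An -tx1 -N2 "$tmp/1.dat.gz" | tr -d ' \n')
echo "first two bytes: $magic"

# Decompressing gives back the original contents.
gzip -dc "$tmp/1.dat.gz"
```

If pg_dump writes its compressed files with the same gzip framing, a file gzipped by hand and one written by pg_dump would start the same way.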

>>> A tar archive has the advantage that you can postprocess the dump data
>>> with other tools but for this we could also add an option that gives
>>> you only the data part of a dump file (and uncompresses it at the same
>>> time if compressed). Once we have that however, the question is what
>>> anybody would then still want to use the tar format for...
>>
>> I don't know how popular it'll be in practice, but it seems very nice to me
>> if you can do things like parallel pg_dump in directory format first, and
>> then tar it up to a file for archival.
>
> Yes, but you cannot pg_restore the archive then if it was created with
> standard tar, right?

See above: you can, unless you try to pipe it to pg_restore. In fact,
that's listed as an advantage of the tar format over other formats in
the pg_dump documentation.

(I'm working on this, no need to submit a new patch)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
