hot backups: am I doing it wrong, or do we have a problem with pg_clog?

From: Daniel Farina <daniel(at)heroku(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: hot backups: am I doing it wrong, or do we have a problem with pg_clog?
Date: 2011-04-21 11:15:48
Message-ID: BANLkTi=j-k3QOFKpjxUG5m0FtihANz3tOw@mail.gmail.com
Lists: pgsql-hackers

To start at the end of this story: "DETAIL: Could not read from file
"pg_clog/007D" at offset 65536: Success."

This is a message we received on a standby that we were bringing
online as part of a test. The clog file was present, but apparently
too small for Postgres (or at least I think that is what the message
meant), so one could stub in another clog file and then continue
recovery successfully (modulo the general voodoo of stubbing in clog
files). I am unsure whether this is due to an interesting race
condition in Postgres or a result of my somewhat-interesting
hot-backup protocol, which is slightly more involved than the norm. I
will describe what it does here, with rough Python sketches of the
interesting steps interleaved below:

1) Call pg start backup
2) crawl the entire postgres cluster directory structure, except
pg_xlog, taking notes of the size of every file present
3) begin writing TAR files, but *only up to the size noted during the
original crawl of the cluster directory,* so if a file grows between
that size snapshot and the subsequent read() of the file, the extra
bytes will not be added to the TAR.
3a) If a file is truncated partially, I add "\0" bytes to pad the
tarfile member up to the size sampled in step 2, as I am streaming the
tar file and cannot go back in the stream and adjust the tarfile
member size
4) call pg stop backup
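
Roughly, steps 3 and 3a boil down to something like the following
simplified sketch (not the actual WAL-E code; the helper names are
made up), which pins each streamed tar member to the size sampled in
step 2:

    import tarfile

    class CappedFile(object):
        """File-like wrapper that yields exactly `limit` bytes from `f`.

        Bytes beyond the sampled size (concurrent growth) are never
        read; if the file turns out to be shorter (concurrent
        truncation), the remainder is filled with zero bytes so the
        member body matches the size already written into the streamed
        tar header.
        """
        def __init__(self, f, limit):
            self.f = f
            self.remaining = limit

        def read(self, n=-1):
            if self.remaining <= 0:
                return b''
            want = self.remaining if n is None or n < 0 else min(n, self.remaining)
            data = self.f.read(want)
            if len(data) < want:
                data += b'\0' * (want - len(data))  # step 3a: zero padding
            self.remaining -= len(data)
            return data

    def add_member_capped(tar, path, sampled_size):
        """Add `path` to a streaming tarfile, pinned to the step-2 size."""
        info = tar.gettarinfo(path)
        info.size = sampled_size  # the header size is decided up-front
        with open(path, 'rb') as f:
            tar.addfile(info, CappedFile(f, sampled_size))

With tarfile.open(fileobj=..., mode='w|') there is no way to go back
and rewrite a member header afterwards, which is why the member size
has to be fixed before the first byte of its body is written.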

The reason I go to this trouble is that I use many completely
disjoint tar files to do parallel compression, decompression,
uploading, and downloading of the base backup of the database, and I
want to be able to control the size of these files up-front. The need
to stub in \0 comes from a limitation of the tar format when dealing
with streaming archives, and the need to truncate files to the size
snapshotted in step 2 is to enable splitting the files between
volumes even in the presence of possible concurrent growth while I'm
performing the hot backup (for example, a handful of nearly-empty
heap files can rapidly grow due to a concurrent bulk load if I get
unlucky, and I do not intend to rely on not being unlucky).
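
The volume splitting itself is nothing more than grouping the (path,
size) pairs recorded in step 2 until a volume reaches a target
payload size; a trivial sketch of the idea (again, not the real
code):

    def partition_by_size(sampled_files, volume_limit):
        """Greedily group (path, size) pairs, using the sizes sampled
        in step 2, into volumes whose member payload stays under
        volume_limit bytes."""
        volumes, current, current_size = [], [], 0
        for path, size in sampled_files:
            if current and current_size + size > volume_limit:
                volumes.append(current)
                current, current_size = [], 0
            current.append((path, size))
            current_size += size
        if current:
            volumes.append(current)
        return volumes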

Any ideas? Or does it sound like I'm making some bookkeeping errors
and should review my code again? It does work most of the time; I
have not yet gotten a sense of how often this reproduces.

--
fdr


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hot backups: am I doing it wrong, or do we have a problem with pg_clog?
Date: 2011-04-21 14:05:43
Message-ID: BANLkTikKjTNwx+0uGHMcjDFVPMdKHxGgPA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 21, 2011 at 6:15 AM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
> To start at the end of this story: "DETAIL:  Could not read from file
> "pg_clog/007D" at offset 65536: Success."
>
> This is a message we received on a standby that we were bringing
> online as part of a test.  The clog file was present, but apparently
> too small for Postgres (or at least I think that is what the message
> meant), so one could stub in another clog file and then continue
> recovery successfully (modulo the general voodoo of stubbing in clog
> files).  I am unsure whether this is due to an interesting race
> condition in Postgres or a result of my somewhat-interesting
> hot-backup protocol, which is slightly more involved than the norm.
> I will describe what it does here:
>
> 1) Call pg start backup
> 2) crawl the entire postgres cluster directory structure, except
> pg_xlog, taking notes of the size of every file present
> 3) begin writing TAR files, but *only up to the size noted during the
> original crawling of the cluster directory,* so if the file grows
> between the original snapshot and subsequently actually calling read()
> on the file those extra bytes will not be added to the TAR.
>  3a) If a file is truncated partially, I add "\0" bytes to pad the
> tarfile member up to the size sampled in step 2, as I am streaming the
> tar file and cannot go back in the stream and adjust the tarfile
> member size
> 4) call pg stop backup
>
> The reason I go to this trouble is because I use many completely
> disjoint tar files to do parallel compression, decompression,
> uploading, and downloading of the base backup of the database, and I
> want to be able to control the size of these files up-front.  The
> requirement of stubbing in \0 is because of a limitation of the tar
> format when dealing with streaming archives and the requirement to
> truncate the files to the size snapshotted in the step 2 is to enable
> splitting up the files between volumes even in the presence of
> possible concurrent growth while I'm performing the hot backup. (ex: a
> handful of nearly-empty heap files can rapidly grow due to a
> concurrent bulk load if I get unlucky, which I do not intend to allow
> myself to be).
>
> Any ideas?  Or does it sound like I'm making some bookkeeping errors
> and should review my code again?  It does work most of the time.  I
> have not gotten a sense how often this reproduces just yet.

Everyone here is going to assume the problem is in your (too?) fancy
tar/diff delta archiving approach, because we can't see that code and
it just sounds suspicious. A busted clog file is of course very
noteworthy, but to rule out your own code you should try reproducing
the problem using a more standard method of grabbing the base backup.

Have you considered using rsync instead?

merlin


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Daniel Farina <daniel(at)heroku(dot)com>
Subject: Re: hot backups: am I doing it wrong, or do we have a problem with pg_clog?
Date: 2011-04-21 14:31:15
Message-ID: 201104211631.16064.andres@anarazel.de
Lists: pgsql-hackers

Hi,

On Thursday, April 21, 2011 01:15:48 PM Daniel Farina wrote:
> Any ideas? Or does it sound like I'm making some bookkeeping errors
> and should review my code again? It does work most of the time. I
> have not gotten a sense how often this reproduces just yet.
I would suggest taking both your backup and a simpler one for now. When
the error occurs again you can compare...

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hot backups: am I doing it wrong, or do we have a problem with pg_clog?
Date: 2011-04-21 15:19:18
Message-ID: BANLkTikHhTSF_3a4qG9_Fbe7eCm5AX2SPg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 21, 2011 at 7:15 AM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
> To start at the end of this story: "DETAIL:  Could not read from file
> "pg_clog/007D" at offset 65536: Success."
>
> This is a message we received on a standby that we were bringing
> online as part of a test.  The clog file was present, but apparently
> too small for Postgres (or at least I think that is what the message
> meant), so one could stub in another clog file and then continue
> recovery successfully (modulo the general voodoo of stubbing in clog
> files).  I am unsure whether this is due to an interesting race
> condition in Postgres or a result of my somewhat-interesting
> hot-backup protocol, which is slightly more involved than the norm.
> I will describe what it does here:
>
> 1) Call pg start backup
> 2) crawl the entire postgres cluster directory structure, except
> pg_xlog, taking notes of the size of every file present
> 3) begin writing TAR files, but *only up to the size noted during the
> original crawling of the cluster directory,* so if the file grows
> between the original snapshot and subsequently actually calling read()
> on the file those extra bytes will not be added to the TAR.
>  3a) If a file is truncated partially, I add "\0" bytes to pad the
> tarfile member up to the size sampled in step 2, as I am streaming the
> tar file and cannot go back in the stream and adjust the tarfile
> member size
> 4) call pg stop backup

In theory I would expect any defects introduced by the, ahem,
exciting procedure described in steps 3 and 3a to be corrected
automatically by recovery when you start the new cluster. It
shouldn't matter exactly when you read each file: recovery for
unrelated blocks ought to proceed totally independently, and an
all-zeros block should be treated the same way as one that isn't
allocated yet, so it seems like it ought to work. But you may be
stressing some paths in the recovery code that don't get regular
exercise, since the manner in which you are taking the backup can
produce backups that are different from any backup that could be
taken by the normal method, and those paths might have bugs. It's
also possible, as others have said, that you've botched it. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hot backups: am I doing it wrong, or do we have a problem with pg_clog?
Date: 2011-04-21 16:05:50
Message-ID: BANLkTin03DyWL7MPOK5Kq7m5YX6gnArSgA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 21, 2011 at 8:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Apr 21, 2011 at 7:15 AM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
>> To start at the end of this story: "DETAIL:  Could not read from file
>> "pg_clog/007D" at offset 65536: Success."
>>
>> This is a message we received on a standby that we were bringing
>> online as part of a test.  The clog file was present, but apparently
>> too small for Postgres (or at least I think that is what the message
>> meant), so one could stub in another clog file and then continue
>> recovery successfully (modulo the general voodoo of stubbing in clog
>> files).  I am unsure whether this is due to an interesting race
>> condition in Postgres or a result of my somewhat-interesting
>> hot-backup protocol, which is slightly more involved than the norm.
>> I will describe what it does here:
>>
>> 1) Call pg start backup
>> 2) crawl the entire postgres cluster directory structure, except
>> pg_xlog, taking notes of the size of every file present
>> 3) begin writing TAR files, but *only up to the size noted during the
>> original crawling of the cluster directory,* so if the file grows
>> between the original snapshot and subsequently actually calling read()
>> on the file those extra bytes will not be added to the TAR.
>>  3a) If a file is truncated partially, I add "\0" bytes to pad the
>> tarfile member up to the size sampled in step 2, as I am streaming the
>> tar file and cannot go back in the stream and adjust the tarfile
>> member size
>> 4) call pg stop backup
>
> In theory I would expect any defects introduced by the, ahem,
> exciting, procedure described in steps 3 and 3a to be corrected by
> recovery automatically when you start the new cluster.

Neat. This is mostly what I was looking to get out of this thread; I
will start looking for places where I have botched things.

Although some of the frontend interface and some of the mechanism are
embarrassingly rough for several reasons, the other thread posters
can have access to the code if they wish: the code responsible for
these shenanigans can be found at https://github.com/heroku/wal-e
(and https://github.com/fdr/wal-e), in the tar_partition.py file
(https://github.com/heroku/WAL-E/blob/master/wal_e/tar_partition.py).

But I realize that's really too much detail for most people to be
interested in, which is why I didn't post it in the first place.
Given your assessment, I think I have enough to try to reproduce this
case synthetically (taking a very old pg_clog snapshot, committing a
few million xacts while not vacuuming, and then trying to merge the
old clog into an otherwise newer base backup may prove out the
mechanism I have in mind) or to add some more robust logging so I can
catch my (or any, really) problem.
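
For the record, the clog "stubbing" mentioned at the top of the
thread is nothing fancier than padding (or creating) the segment with
zero bytes out to a full segment; a rough sketch, assuming stock 8 KB
blocks and 32-page SLRU segments (so 256 KB per pg_clog file), where
zeroed status bits simply read back as "in progress":

    import os

    CLOG_SEGMENT_SIZE = 32 * 8192  # 32 SLRU pages of 8 KB = 256 KB

    def stub_clog_segment(path, segment_size=CLOG_SEGMENT_SIZE):
        """Pad (or create) a pg_clog segment with zero bytes up to a
        full segment.  This only papers over the missing range -- it
        is the voodoo from the first message, not a fix."""
        existing = os.path.getsize(path) if os.path.exists(path) else 0
        with open(path, 'ab') as f:
            f.write(b'\0' * (segment_size - existing))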

--
fdr