Re: [RFC] What should we do for reliable WAL archiving?

Lists: pgsql-hackers
From: "MauMau" <maumau307(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-16 10:23:58
Message-ID: 7E37040CF3804EA5B018D7A022822984@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
Windows) as an example for archive_command. However, cp/copy does not sync
the copied data to disk. As a result, the completed WAL segments would be
lost in the following sequence:

1. A WAL segment fills up.

2. The archiver process archives the just filled WAL segment using
archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/
and writes to the archive area. At this point, the WAL file is not
persisted to the archive area yet, because cp/copy doesn't sync the writes.

3. The checkpoint processing removes the WAL segment file from pg_xlog/.

4. The OS crashes. The filled WAL segment doesn't exist anywhere any more.

Considering the "reliable" image of PostgreSQL and widespread use in
enterprise systems, I think something should be done. Could you give me
your opinions on the right direction? Although the doc certainly escapes by
saying "(This is an example, not a recommendation, and might not work on all
platforms.)", it seems from pgsql-xxx MLs that many people are following
this example.

* Improve the example in the documentation.
But what command can we use to reliably sync just one file?

* Provide some command, say pg_copy, which copies a file synchronously by
using fsync(), and describes in the doc something like "for simple use
cases, you can use pg_copy as the standard reliable copy command."

Related to this topic, pg_basebackup doesn't fsync the backed up files. I'm
afraid this too is different from what the users expect --- I guess they
would expect the backup is certainly available after pg_basebackup completes
even if the machine crashes.

Regards
MauMau


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-16 11:21:18
Message-ID: CAA4eK1Lcch93G3w4g28qKZxaw_kM36MZ=xKqqOTi3hw-966SAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 16, 2014 at 3:53 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
> Hello,
>
> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> Windows) as an example for archive_command. However, cp/copy does not sync
> the copied data to disk. As a result, the completed WAL segments would be
> lost in the following sequence:
>
> 1. A WAL segment fills up.
>
> 2. The archiver process archives the just filled WAL segment using
> archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/
> and writes to the archive area. At this point, the WAL file is not
> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>
> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>
> 4. The OS crashes. The filled WAL segment doesn't exist anywhere any more.
>
> Considering the "reliable" image of PostgreSQL and widespread use in
> enterprise systems, I think something should be done. Could you give me
> your opinions on the right direction?

How about using pg_receivexlog for archiving purpose?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-16 14:23:26
Message-ID: 4D837AAE66BA4BE980F13FC061E86FC1@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>
> How about using pg_receivexlog for archiving purpose?

pg_receivexlog is good in that it does fsync(). But it seems difficult to
use correctly, and I'm not sure if I can catch all WAL segments without any
loss. pg_receivexlog must be started with postmaster and monitored with
some measures. This won't be very easy at least on Windows.

The pg_receivexlog reference page suggests another difficulty:

Notes
When using pg_receivexlog instead of archive_command, the server will
continue to recycle transaction log files even if the backups are not
properly archived, since there is no command that fails. This can be worked
around by having an archive_command that fails when the file has not been
properly archived yet, for example:
archive_command = 'sleep 5 && test -f /mnt/server/archivedir/%f'

This suggestion is not correct, because it only checks the existence of the
file. What if the file size is less than 16MB? How can we check if the
file is completely archived?

Regards
MauMau


From: Greg Stark <stark(at)mit(dot)edu>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-16 14:47:29
Message-ID: CAM-w4HMcLq4X-MyoQBtXH09Lmg08aZQ6XjLtmxOacGi=DJ-hAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 16, 2014 at 10:23 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> Windows) as an example for archive_command. However, cp/copy does not sync
> the copied data to disk

I'm actually a lot less concerned about fsyncing the backup than I am
about fsyncing the restore. The backup is just a bunch of files for
the user to make use of. They might copy them around, compress them,
move them onto tape or other storage. They need to be aware of the
persistence of whatever storage system they're putting them in.

But when they do a restore they just untar or whatever other
extraction tool and then hand those files to Postgres to maintain. I
bet the number of people who fsync or call sync the data files after
untarring their backups is vanishingly small and problems could
manifest later after Postgres has been running.

WAL-e recently changed to fsync each data file and the directories
containing them after restore. But perhaps Postgres should open and
fsync each file in the database when it starts up?

In most file systems files written to are guaranteed to be synced
within a configurable amount of time (in some systems unless the
filesystem can't keep up). So the practical risk may be small. But in
theory a database that wasn't synced when it was restored could
suddenly lose files days or months later when a crash occurs and some
data files weren't touched by the database in the intervening time.

--
greg


From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-16 18:59:40
Message-ID: 5325F49C.20109@proxel.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/16/2014 03:23 PM, MauMau wrote:
> From: "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>
>> How about using pg_receivexlog for archiving purpose?
>
> pg_receivexlog is good in that it does fsync(). But it seems difficult
> to use correctly, and I'm not sure if I can catch all WAL segments
> without any loss. pg_receivexlog must be started with postmaster and
> monitored with some measures. This won't be very easy at least on Windows.

Replication slots should solve the issue of making sure to catch all of
the WAL.

--
Andreas Karlsson


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-17 01:20:14
Message-ID: CA+Tgmoa4u1nVpkXj6zpy5OUHhNVabLa5uYZsFA4sD__EzfvV1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 16, 2014 at 6:23 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> Windows) as an example for archive_command. However, cp/copy does not sync
> the copied data to disk. As a result, the completed WAL segments would be
> lost in the following sequence:
>
> 1. A WAL segment fills up.
>
> 2. The archiver process archives the just filled WAL segment using
> archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/
> and writes to the archive area. At this point, the WAL file is not
> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>
> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>
> 4. The OS crashes. The filled WAL segment doesn't exist anywhere any more.
>
> Considering the "reliable" image of PostgreSQL and widespread use in
> enterprise systems, I think something should be done. Could you give me
> your opinions on the right direction? Although the doc certainly escapes by
> saying "(This is an example, not a recommendation, and might not work on all
> platforms.)", it seems from pgsql-xxx MLs that many people are following
> this example.
>
> * Improve the example in the documentation.
> But what command can we use to reliably sync just one file?
>
> * Provide some command, say pg_copy, which copies a file synchronously by
> using fsync(), and describes in the doc something like "for simple use
> cases, you can use pg_copy as the standard reliable copy command."

+1. This won't obviate the need for tools to manage replication, but
it would make it possible to get the simplest case right without
guessing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-17 04:10:14
Message-ID: CAA4eK1+eXphtyjgOsxucFCQjros8dmmLNu306YcjLHci2Xrk2A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 16, 2014 at 7:53 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
> From: "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>
>
>> How about using pg_receivexlog for archiving purpose?
>
>
> pg_receivexlog is good in that it does fsync(). But it seems difficult to
> use correctly, and I'm not sure if I can catch all WAL segments without any
> loss. pg_receivexlog must be started with postmaster and monitored with
> some measures. This won't be very easy at least on Windows.
>
> The pg_receivexlog reference page suggests another difficulty:
>
> Notes
> When using pg_receivexlog instead of archive_command, the server will
> continue to recycle transaction log files even if the backups are not
> properly archived, since there is no command that fails. This can be worked
> around by having an archive_command that fails when the file has not been
> properly archived yet, for example:
> archive_command = 'sleep 5 && test -f /mnt/server/archivedir/%f'
>
> This suggestion is not correct, because it only checks the existence of the
> file. What if the file size is less than 16MB? How can we check if the
> file is completely archived?

The most probable reasons for un-successful archiving could be:
1. Disk space got full - pg_receivexlog makes sure while open/create new
segment file that the size of new file should be 16MB (open_walfile()). So due
to this reason there should not be a problem to above command.

2. Permission got denied - I think this will lead to failure of above archive
command mentioned by you.

3. n/w connection broken - This will also lead to failure of above command, but
here I think there is a possibility that it might have checked the existence of
in-complete wal file on archive location and consider it archived, but I think
wal_keep_segments can avoid this problem.

Also if you are on 9.4, then may be --slot parameter can help you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-17 12:12:52
Message-ID: CAHGQGwEO1R+AAP3vuRCoT8Rseq=5OHV=98svwVec1oJU52ac1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 17, 2014 at 10:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Mar 16, 2014 at 6:23 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
>> Windows) as an example for archive_command. However, cp/copy does not sync
>> the copied data to disk. As a result, the completed WAL segments would be
>> lost in the following sequence:
>>
>> 1. A WAL segment fills up.
>>
>> 2. The archiver process archives the just filled WAL segment using
>> archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/
>> and writes to the archive area. At this point, the WAL file is not
>> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>>
>> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>>
>> 4. The OS crashes. The filled WAL segment doesn't exist anywhere any more.
>>
>> Considering the "reliable" image of PostgreSQL and widespread use in
>> enterprise systems, I think something should be done. Could you give me
>> your opinions on the right direction? Although the doc certainly escapes by
>> saying "(This is an example, not a recommendation, and might not work on all
>> platforms.)", it seems from pgsql-xxx MLs that many people are following
>> this example.
>>
>> * Improve the example in the documentation.
>> But what command can we use to reliably sync just one file?
>>
>> * Provide some command, say pg_copy, which copies a file synchronously by
>> using fsync(), and describes in the doc something like "for simple use
>> cases, you can use pg_copy as the standard reliable copy command."
>
> +1. This won't obviate the need for tools to manage replication, but
> it would make it possible to get the simplest case right without
> guessing.

+1, too.

And, what about making pg_copy call posix_fadvise(DONT_NEED) against the
archived file after the copy? Also It might be good idea to support the direct
copy of the file to avoid wasting the file cache.

Regards,

--
Fujii Masao


From: Mitsumasa KONDO <kondo(dot)mitsumasa(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, MauMau <maumau307(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-17 13:44:38
Message-ID: CADupcHV2vmNBvS5hGoK0CC-m=JGFPbX=mF0dtrvN8kRwwQw7Hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2014-03-17 21:12 GMT+09:00 Fujii Masao <masao(dot)fujii(at)gmail(dot)com>:

> On Mon, Mar 17, 2014 at 10:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> > On Sun, Mar 16, 2014 at 6:23 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
> >> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> >> Windows) as an example for archive_command. However, cp/copy does not
> sync
> >> the copied data to disk. As a result, the completed WAL segments would
> be
> >> lost in the following sequence:
> >>
> >> 1. A WAL segment fills up.
> >>
> >> 2. The archiver process archives the just filled WAL segment using
> >> archive_command. That is, cp/copy reads the WAL segment file from
> pg_xlog/
> >> and writes to the archive area. At this point, the WAL file is not
> >> persisted to the archive area yet, because cp/copy doesn't sync the
> writes.
> >>
> >> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
> >>
> >> 4. The OS crashes. The filled WAL segment doesn't exist anywhere any
> more.
> >>
> >> Considering the "reliable" image of PostgreSQL and widespread use in
> >> enterprise systems, I think something should be done. Could you give me
> >> your opinions on the right direction? Although the doc certainly
> escapes by
> >> saying "(This is an example, not a recommendation, and might not work
> on all
> >> platforms.)", it seems from pgsql-xxx MLs that many people are following
> >> this example.
> >>
> >> * Improve the example in the documentation.
> >> But what command can we use to reliably sync just one file?
> >>
> >> * Provide some command, say pg_copy, which copies a file synchronously
> by
> >> using fsync(), and describes in the doc something like "for simple use
> >> cases, you can use pg_copy as the standard reliable copy command."
> >
> > +1. This won't obviate the need for tools to manage replication, but
> > it would make it possible to get the simplest case right without
> > guessing.
>
> +1, too.
>
> And, what about making pg_copy call posix_fadvise(DONT_NEED) against the
> archived file after the copy? Also It might be good idea to support the
> direct
> copy of the file to avoid wasting the file cache.

Use direct_cp.
http://directcp.sourceforge.net/direct_cp.html

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Mitsumasa KONDO" <kondo(dot)mitsumasa(at)gmail(dot)com>, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-21 11:54:52
Message-ID: 9C1EB95CA1F34DAB93DF549A51E3E874@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Mitsumasa KONDO" <kondo(dot)mitsumasa(at)gmail(dot)com>
> 2014-03-17 21:12 GMT+09:00 Fujii Masao <masao(dot)fujii(at)gmail(dot)com>:
>
>> On Mon, Mar 17, 2014 at 10:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>> > On Sun, Mar 16, 2014 at 6:23 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> >> * Improve the example in the documentation.
>> >> But what command can we use to reliably sync just one file?
>> >>
>> >> * Provide some command, say pg_copy, which copies a file synchronously
>> by
>> >> using fsync(), and describes in the doc something like "for simple use
>> >> cases, you can use pg_copy as the standard reliable copy command."
>> >
>> > +1. This won't obviate the need for tools to manage replication, but
>> > it would make it possible to get the simplest case right without
>> > guessing.
>>
>> +1, too.
>>
>> And, what about making pg_copy call posix_fadvise(DONT_NEED) against the
>> archived file after the copy? Also It might be good idea to support the
>> direct
>> copy of the file to avoid wasting the file cache.
>
> Use direct_cp.
> http://directcp.sourceforge.net/direct_cp.html

Thank you all for giving favorable responses and interesting ideas.
Then, I think I'll do:

* Create pg_copy in C so that it can be used on Windows as well as on
UNIX/Linux. It just copies one file. Its source code is located in
src/bin/pg_copy/. Please recommend a better name if you have one in mind.

* Add a reference page for pg_copy in the chapter "Server applications".
Modify the section for continuous archiving to recommend pg_copy for simple
use cases as the standard command.

* pg_copy calls posix_fadvise(DONT_NEED) on the destination file.

* pg_copy passes O_DIRECT flag when opening the destination file
when --directio or -d option is specified. O_DIRECT is not used by default
because it may not be available on some file systems, as well as it might
cause trouble on older platforms such as RHEL4/5. pg_copy does not use
O_DIRECT for the source file so that it can copy the data from the
filesystem cache, which is just written by postgres.

Could you give me your opinions before starting the work, including the
following?

* Should I refactor the functions (copy_file, copydir, etc.) in
src/backend/storage/file/copydir.c so that they can also be used for
frontends? If so, which of src/port or src/common/ is the right place to
put copydir.c in?

* Should I complete the work before 9.4 beta so that it will be available
starting with 9.4? I think so because it is a basic capability to archive
transaction logs safely (although the time may not allow me to do this).

Regards
MauMau


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Mitsumasa KONDO <kondo(dot)mitsumasa(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-21 13:24:08
Message-ID: CAB7nPqSWihB2tbpSTnQhTS-tokd=OWZEX0+vcjVt3=MHxVb=qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 21, 2014 at 8:54 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
> * Create pg_copy in C so that it can be used on Windows as well as on
> UNIX/Linux. It just copies one file. Its source code is located in
> src/bin/pg_copy/. Please recommend a better name if you have one in mind.
I'd rather see that as a part of contrib/ if possible. Is there any
portion of the code you have in mind that makes mandatory putting it
in src/bin?

> * Should I complete the work before 9.4 beta so that it will be available
> starting with 9.4? I think so because it is a basic capability to archive
> transaction logs safely (although the time may not allow me to do this).
Pursing efforts on a utility like that is worth the shot IMO (I would
use it for sure if it has reliable cross-platform support to unify
sets of replication scripts), but including it in 9.4 is out of scope.
A saner target would be the 1st commit fest of 9.5.

Regards,
--
Michael


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Michael Paquier" <michael(dot)paquier(at)gmail(dot)com>
Cc: "Mitsumasa KONDO" <kondo(dot)mitsumasa(at)gmail(dot)com>, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "PostgreSQL mailing lists" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-21 15:02:08
Message-ID: 6A638670F2644973B931E3F276CF2202@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Michael Paquier" <michael(dot)paquier(at)gmail(dot)com>
> On Fri, Mar 21, 2014 at 8:54 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> * Create pg_copy in C so that it can be used on Windows as well as on
>> UNIX/Linux. It just copies one file. Its source code is located in
>> src/bin/pg_copy/. Please recommend a better name if you have one in
>> mind.
> I'd rather see that as a part of contrib/ if possible. Is there any
> portion of the code you have in mind that makes mandatory putting it
> in src/bin?

Archiving transaction logs reliably is a basic responsibility of DBMS, so I
think it should be treated as part of the core. It is not a desirable
feature but actually a mandatory one to persist transaction logs. Even if
it were a "better to have" feature, it can be put in the core like
pg_basebackup and pg_isready, which are not mandatory tools.

>> * Should I complete the work before 9.4 beta so that it will be available
>> starting with 9.4? I think so because it is a basic capability to
>> archive
>> transaction logs safely (although the time may not allow me to do this).
> Pursing efforts on a utility like that is worth the shot IMO (I would
> use it for sure if it has reliable cross-platform support to unify
> sets of replication scripts), but including it in 9.4 is out of scope.
> A saner target would be the 1st commit fest of 9.5.

OK, I don't mind if it should be targeted at 9.4 or 9.5. If someone wants
it for 9.4, I try to hurry.

Regards
MauMau


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Michael Paquier" <michael(dot)paquier(at)gmail(dot)com>
Cc: "Mitsumasa KONDO" <kondo(dot)mitsumasa(at)gmail(dot)com>, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "PostgreSQL mailing lists" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-21 15:03:00
Message-ID: 94F13A624B5F411D8792858B76294015@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Michael Paquier" <michael(dot)paquier(at)gmail(dot)com>
> On Fri, Mar 21, 2014 at 8:54 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> * Create pg_copy in C so that it can be used on Windows as well as on
>> UNIX/Linux. It just copies one file. Its source code is located in
>> src/bin/pg_copy/. Please recommend a better name if you have one in
>> mind.
> I'd rather see that as a part of contrib/ if possible. Is there any
> portion of the code you have in mind that makes mandatory putting it
> in src/bin?

Archiving transaction logs reliably is a basic responsibility of DBMS, so I
think it should be treated as part of the core. It is not a desirable
feature but actually a mandatory one to persist transaction logs. Even if
it were a "better to have" feature, it can be put in the core like
pg_basebackup and pg_isready, which are not mandatory tools.

>> * Should I complete the work before 9.4 beta so that it will be available
>> starting with 9.4? I think so because it is a basic capability to
>> archive
>> transaction logs safely (although the time may not allow me to do this).
> Pursing efforts on a utility like that is worth the shot IMO (I would
> use it for sure if it has reliable cross-platform support to unify
> sets of replication scripts), but including it in 9.4 is out of scope.
> A saner target would be the 1st commit fest of 9.5.

OK, I don't mind if it should be targeted at 9.4 or 9.5. If someone wants
it for 9.4, I try to hurry.

Regards
MauMau


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-21 20:16:08
Message-ID: CAMkU=1zr5mcqZYP0eLXRMN1Le66pRNH-hjpzCHZi8racgW9Y-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 16, 2014 at 3:23 AM, MauMau <maumau307(at)gmail(dot)com> wrote:

> Hello,
>
> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> Windows) as an example for archive_command. However, cp/copy does not sync
> the copied data to disk. As a result, the completed WAL segments would be
> lost in the following sequence:
>
> 1. A WAL segment fills up.
>
> 2. The archiver process archives the just filled WAL segment using
> archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/
> and writes to the archive area. At this point, the WAL file is not
> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>
> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>

Note that it takes two checkpoints for this to happen, at least as
currently coded.

Also, if the system crashed badly enough to need media recovery, rather
than just automatic crash recovery, some lost transactions are expected.
Although this could silently break your PITR chain, of a crash happened
and automatic recover used the copy in pg_xlog (which of course was synced)
, while copy in the archive was not synced.

> 4. The OS crashes. The filled WAL segment doesn't exist anywhere any more.
>
> Considering the "reliable" image of PostgreSQL and widespread use in
> enterprise systems, I think something should be done. Could you give me
> your opinions on the right direction? Although the doc certainly escapes
> by saying "(This is an example, not a recommendation, and might not work on
> all platforms.)", it seems from pgsql-xxx MLs that many people are
> following this example.
>

I use this as an example, kind of, but what I am copying to is a network
mount, so any attempts to fsync it there would probably need unavailable
hooks into the remote file system.

Do people really just copy the files from one directory of local storage to
another directory of local storage? I don't see the point of that. But it
seems like this is an area where there are hundreds of use cases, and often
one doesn't see the point of other people's, making it hard to come up with
good examples.

>
> * Improve the example in the documentation.
> But what command can we use to reliably sync just one file?
>
> * Provide some command, say pg_copy, which copies a file synchronously by
> using fsync(), and describes in the doc something like "for simple use
> cases, you can use pg_copy as the standard reliable copy command."
>

The recommendation is to refuse to overwrite an existing file of the same
name, and exit with failure. Which essentially brings archiving to a halt,
because it keeps trying but it will keep failing. If we make a custom
version, one thing it should do is determine if the existing archived file
is just a truncated version of the attempting-to-be archived file, and if
so overwrite it. Because if the first archival command fails with a
network glitch, it can leave behind a partial file.

Cheers,

Jeff


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-21 21:22:37
Message-ID: AD05352D260B451FB853441F557C3A24@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
> Do people really just copy the files from one directory of local storage
> to
> another directory of local storage? I don't see the point of that.

It makes sense to archive WAL to a directory of local storage for media
recovery. Here, the local storage is a different disk drive which is
directly attached to the database server or directly connected through SAN.

> The recommendation is to refuse to overwrite an existing file of the same
> name, and exit with failure. Which essentially brings archiving to a
> halt,
> because it keeps trying but it will keep failing. If we make a custom
> version, one thing it should do is determine if the existing archived file
> is just a truncated version of the attempting-to-be archived file, and if
> so overwrite it. Because if the first archival command fails with a
> network glitch, it can leave behind a partial file.

What I'm trying to address is just an alternative to cp/copy which fsyncs a
file. It just overwrites an existing file.

Yes, you're right, the failed archive attempt leaves behind a partial file
which causes subsequent attempts to fail, if you follow the PG manual.
That's another undesirable point in the current doc. To overcome this,
someone on this ML recommended me to do "cp %p /archive/dir/%f.tmp && mv
/archive/dir/%f.tmp /archive/dir/%f". Does this solve your problem?

Regards
MauMau


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-21 23:02:21
Message-ID: 20140321230221.GA32632@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 21, 2014 at 01:16:08PM -0700, Jeff Janes wrote:
> On Sun, Mar 16, 2014 at 3:23 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>
> Hello,
>
> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> Windows) as an example for archive_command. However, cp/copy does not sync
> the copied data to disk. As a result, the completed WAL segments would be
> lost in the following sequence:
>
> 1. A WAL segment fills up.
>
> 2. The archiver process archives the just filled WAL segment using
> archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/
> and writes to the archive area. At this point, the WAL file is not
> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>
> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>
>
> Note that it takes two checkpoints for this to happen, at least as currently
> coded.
>
> Also, if the system crashed badly enough to need media recovery, rather than
> just automatic crash recovery, some lost transactions are expected. Although
> this could silently break your PITR chain, of a crash happened and automatic
> recover used the copy in pg_xlog (which of course was synced) , while copy in
> the archive was not synced.

That is one good reason to keep checkpoint_warning=30, so the typical
file system sync that happens every 30 seconds warns that those files
might not on permanent storage.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-22 09:21:06
Message-ID: 20140322092105.GA12234@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 22, 2014 at 06:22:37AM +0900, MauMau wrote:
> From: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
> >Do people really just copy the files from one directory of local
> >storage to
> >another directory of local storage? I don't see the point of that.
>
> It makes sense to archive WAL to a directory of local storage for
> media recovery. Here, the local storage is a different disk drive
> which is directly attached to the database server or directly
> connected through SAN.

I'm one of those peope. They are archived into a local directory in
preparation for an rsync over ssh.

> >The recommendation is to refuse to overwrite an existing file of the same
> >name, and exit with failure. Which essentially brings archiving
> >to a halt,
> >because it keeps trying but it will keep failing. If we make a custom
> >version, one thing it should do is determine if the existing archived file
> >is just a truncated version of the attempting-to-be archived file, and if
> >so overwrite it. Because if the first archival command fails with a
> >network glitch, it can leave behind a partial file.
>
> What I'm trying to address is just an alternative to cp/copy which
> fsyncs a file. It just overwrites an existing file.

I ran into a related problem with cp, where halfway the copy the disk
was full and I was left with half a WAL file. This caused the rsync to
copy only half a file and the replication broke. This is clearly a
recoverable situation, but it didn't recover in this case.

> Yes, you're right, the failed archive attempt leaves behind a
> partial file which causes subsequent attempts to fail, if you follow
> the PG manual. That's another undesirable point in the current doc.
> To overcome this, someone on this ML recommended me to do "cp %p
> /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f".
> Does this solve your problem?

This would probably have handled it, but I find it odd that there's
program to handle restoring of archives properly, but on the archiving
side you have to cobble together your own shell scripts which fail in
various corner cases.

I'd love a program that just Did The Right Thing.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-29 21:38:03
Message-ID: CAMkU=1wM5CvcTQB2DXnt1v_NcmT0e=aiWQCZJ7+Zci4gB-HovQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 21, 2014 at 2:22 PM, MauMau <maumau307(at)gmail(dot)com> wrote:

> From: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
>
> Do people really just copy the files from one directory of local storage
>> to
>> another directory of local storage? I don't see the point of that.
>>
>
> It makes sense to archive WAL to a directory of local storage for media
> recovery. Here, the local storage is a different disk drive which is
> directly attached to the database server or directly connected through SAN.

For a SAN I guess we have different meanings of "local" :)
(I have no doubt yours is correct--the fine art of IT terminology is not my
thing.)

The recommendation is to refuse to overwrite an existing file of the same
>> name, and exit with failure. Which essentially brings archiving to a
>> halt,
>> because it keeps trying but it will keep failing. If we make a custom
>> version, one thing it should do is determine if the existing archived file
>> is just a truncated version of the attempting-to-be archived file, and if
>> so overwrite it. Because if the first archival command fails with a
>> network glitch, it can leave behind a partial file.
>>
>
> What I'm trying to address is just an alternative to cp/copy which fsyncs
> a file. It just overwrites an existing file.
>
> Yes, you're right, the failed archive attempt leaves behind a partial file
> which causes subsequent attempts to fail, if you follow the PG manual.
> That's another undesirable point in the current doc. To overcome this,
> someone on this ML recommended me to do "cp %p /archive/dir/%f.tmp && mv
> /archive/dir/%f.tmp /archive/dir/%f". Does this solve your problem?
>

As written is doesn't solve it, as it just unconditionally overwrites the
file. If you wanted that you could just do the single-statement
unconditional overwrite.

You could make it so that the .tmp gets overwritten unconditionally, but
the move of it will not overwrite an existing permanent file. That would
solve the problem where a glitch in the network leaves in incomplete file
behind that blocks the next attempt, *except* that mv on (at least some)
network file systems is really a copy, and not an atomic rename, so is
still subject to leaving behind incomplete crud.

But, it is hard to tell what the real solution is, because the doc doesn't
explain why it should refuse (and fail) to overwrite an existing file. The
only reason I can think of to make that recommendation is because it is
easy to accidentally configure two clusters to attempt to archive to the
same location, and having them overwrite each others files should be
guarded against. If I am right, it seems like this reason should be added
to the docs, so people know what they are defending against. And if I am
wrong, it seems even more important that the (correct) reason is added to
the docs.

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-29 21:44:14
Message-ID: 9616.1396129454@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> But, it is hard to tell what the real solution is, because the doc doesn't
> explain why it should refuse (and fail) to overwrite an existing file. The
> only reason I can think of to make that recommendation is because it is
> easy to accidentally configure two clusters to attempt to archive to the
> same location, and having them overwrite each others files should be
> guarded against. If I am right, it seems like this reason should be added
> to the docs, so people know what they are defending against. And if I am
> wrong, it seems even more important that the (correct) reason is added to
> the docs.

If memory serves, that is the reason ... and I thought it *was* explained
somewhere in the docs.

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] What should we do for reliable WAL archiving?
Date: 2014-03-29 23:10:10
Message-ID: CAMkU=1zGeZyQ7B4tDwMJ2Q78aWnxUs246wtNLe9=rHGccX4Dzg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Saturday, March 29, 2014, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jeff Janes <jeff(dot)janes(at)gmail(dot)com <javascript:;>> writes:
> > But, it is hard to tell what the real solution is, because the doc
> doesn't
> > explain why it should refuse (and fail) to overwrite an existing file.
> The
> > only reason I can think of to make that recommendation is because it is
> > easy to accidentally configure two clusters to attempt to archive to the
> > same location, and having them overwrite each others files should be
> > guarded against. If I am right, it seems like this reason should be
> added
> > to the docs, so people know what they are defending against. And if I am
> > wrong, it seems even more important that the (correct) reason is added to
> > the docs.
>
> If memory serves, that is the reason ... and I thought it *was* explained
> somewhere in the docs.
>

You are right, and it has been there for a decade. I don't know how I
missed that the last several times I read it. I remember clearly the
paragraph below it, just not that one.

Sorry,

Jeff