Re: [patch] pg_copy - a command for reliable WAL archiving

Lists: pgsql-hackers
From: "MauMau" <maumau307(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-06-17 14:26:37
Message-ID: 718FF758044A42F6B57B7AE41556BB91@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

As I proposed before in the thread below, I've implemented a simple command
for reliable WAL archiving. I would appreciate it if you could review and
test the patch.

http://www.postgresql.org/message-id/9C1EB95CA1F34DAB93DF549A51E3E874@maumau

Regards
MauMau

Attachment Content-Type Size
pg_copy.patch application/octet-stream 17.2 KB

From: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-06-17 14:53:01
Message-ID: 20140617145301.GT5162@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 2014-06-17 23:26:37 +0900, maumau307(at)gmail(dot)com wrote:
>
> Hello,
>
> As I proposed before in the thread below, I've implemented a simple
> command for reliable WAL archiving. I would appreciate it if you
> could review and test the patch.

Please add your patch to the next CF (i.e. 2014-08), so that it gets
review attention at the appropriate time.

-- Abhijit


From: Joe Conway <mail(at)joeconway(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-06-17 15:02:59
Message-ID: 53A058A3.80804@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/17/2014 07:26 AM, MauMau wrote:
> Hello,
>
> As I proposed before in the thread below, I've implemented a simple
> command for reliable WAL archiving. I would appreciate it if you could
> review and test the patch.
>
> http://www.postgresql.org/message-id/9C1EB95CA1F34DAB93DF549A51E3E874@maumau

That first hunk refers to dblink -- I'm guessing it does not belong with
this patch.

Joe

- --
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJToFiiAAoJEDfy90M199hlseMP/Rpwneo5la7mJkHJA0BWUtj/
Hh8yNzzyBVPbKYuTEOZWi2yblFW6yA2dknrYD8RigS1qJFwLw+drFt5673Vi6jKW
Pf7Qn62Ck/U0lZTGXUU9akOhSx7BsKVwH8HvdARIp5DSV2n7/HFDtazi3hTtFq31
GHKA84lPwuynODN42eVez0YXdeRUX7K/s5rCRq154q3BrLPhEEvQwo/kfhet3F0A
utf0ymSPuX3RvpGDDPAZ1oQxTd/xA+qJhX0jrsRoJVaY40rufTgXPmAnNFOdWEds
fXs6ObRm+tHmeLhYVUXhODa1HPHBQMvElVeB3CGxhPUvhP2494sfoLOd7qs8Lblg
oUkYbIJee8Ir7NN34Gc69YW58sekPSVI4vXKu0PwF++Ubs4MYNNd7nP4Wu9N2ISw
p/GPIbx0uR3NbFGCoOLD0K3QHnX/b0FTWHzTTboZ+b69rNIDePpn3eGO2QEOLL/R
P/YkPma8SLyDvNnCzuSU3XDkFmWQsgK7xa7qpsBR1mbdF7zKPfOxCtoby/enSeuk
z7NJxtpHeUQkO7Pb3ZNk6gL+E8UAQihvdBgdBwBzj4qaZyAM5ec29aya3TtBbF+3
UoFX3m4tthR6mMWqizsdadSPvozDjMrhqSRFAjdBSX80Nfs2DVN1Hepp8NXvjRzV
b5RfV7x4yvtr92FFQboj
=+TWh
-----END PGP SIGNATURE-----


From: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-06-17 15:06:13
Message-ID: 20140617150613.GU5162@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 2014-06-17 08:02:59 -0700, mail(at)joeconway(dot)com wrote:
>
> That first hunk refers to dblink -- I'm guessing it does not belong
> with this patch.

Yes, it's a leftover of the dblink memory leak patch that's in this CF.

-- Abhijit


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Joe Conway" <mail(at)joeconway(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-06-17 22:41:43
Message-ID: CC077760AD7242B4B0E8F5CFF1F7659F@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Joe Conway" <mail(at)joeconway(dot)com>
> That first hunk refers to dblink -- I'm guessing it does not belong with
> this patch.

Ouch, what a careless mistake. The attached one is clean. I'll update the
CommitFest entry when I'm back home from work.

Regards
MauMau

Attachment Content-Type Size
pg_copy_v2.patch application/octet-stream 16.7 KB

From: "MauMau" <maumau307(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-14 04:52:37
Message-ID: DAC5903323674CF8A7D9E6770C88B72B@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I fixed some minor mistakes.

Regards
MauMau

Attachment Content-Type Size
pg_copy_v3.patch application/octet-stream 17.1 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-15 14:46:31
Message-ID: CAHGQGwFXvnfPVhUS_9Y5HXqUrxxV4YeNN77NFOZc7-PLRGMxCA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 14, 2014 at 1:52 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
> I fixed some minor mistakes.

What's the main purpose of this tool? If it's for WAL archiving, the tool name
"pg_copy" sounds too generic. We already have pg_archivecleanup, so maybe
"pg_archivecopy" or something is better for the consistency?

pg_copy in the patch copies the file to the destination in a
straightforward way,
i.e., directly copies the file to the dest file with actual name. This can cause
the problem which some people reported. The problem is that, when the server
crashes while WAL file is being archived by cp command, its partially-filled
WAL file remains at the archival area. This half-baked archive file can cause
various troubles. To address this, WAL file needs to be copied to the temporary
file at first, then renamed to the actual name. I think that pg_copy should
copy the WAL file in that way.

Currently pg_copy always syncs the archive file, and there is no way to disable
that. But I'm sure that not everyone want to sync the archive file. So I think
that it's better to add the option specifying whether to sync the file
or not, into
pg_copy.

Some users might want to specify whether to call posix_fadvise or not because
they might need to re-read the archvied files just after the archiving.
For example, network copy of the archived files from the archive area to
remote site for disaster recovery.

Do you recommend to use pg_copy for restore_command? If yes, it also should
be documented. And in the WAL restore case, the restored WAL files are re-read
soon by recovery, so posix_fadvise is not good in that case.

Direct I/O and posix_fadvise are used only for destination file. But why not
source file? That might be useful especially for restore_command case.

At last, the big question is, is there really no OS command which provides
the same functionality as pg_copy does? If there is, I'd like to avoid duplicate
work basically.

Regards,

--
Fujii Masao


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, MauMau <maumau307(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-15 15:47:51
Message-ID: 1408117671.17900.YahooMailNeo@web122304.mail.ne1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> Direct I/O and posix_fadvise are used only for destination file.
> But why not source file? That might be useful especially for
> restore_command case.

That would prevent people from piping the file through a
compression utility.  We should support piped I/O for input (and if
we want to use it on the recovery side, for the output, too), at
least as an option.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-19 13:35:49
Message-ID: 1E31D6AF4FC047F99D53A798D5CA3AD8@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>
> What's the main purpose of this tool? If it's for WAL archiving, the tool
> name
> "pg_copy" sounds too generic. We already have pg_archivecleanup, so maybe
> "pg_archivecopy" or something is better for the consistency?
>
> pg_copy in the patch copies the file to the destination in a
> straightforward way,
> i.e., directly copies the file to the dest file with actual name. This can
> cause
> the problem which some people reported. The problem is that, when the
> server
> crashes while WAL file is being archived by cp command, its
> partially-filled
> WAL file remains at the archival area. This half-baked archive file can
> cause
> various troubles. To address this, WAL file needs to be copied to the
> temporary
> file at first, then renamed to the actual name. I think that pg_copy
> should
> copy the WAL file in that way.

I intended to make pg_copy a straightforward replacement of cp/copy, which
complements the missing sync. Direct I/O and posix_fadvice() feature may be
convenient but not essential for this utility. cp/copy doesn't copy to a
temporary file, and the problem can be solved easily by mv/move. I wanted
to keep pg_copy as generic as cp/copy, so that it can be used by some
advanced features in the future, e.g. comprehensive backup/recovery
management like RMAN (this example may not be best) when it's integrated
into the core.

With that said, copying to a temporary file like <dest>.tmp and renaming it
to <dest> sounds worthwhile even as a basic copy utility. I want to avoid
copying to a temporary file with a fixed name like _copy.tmp, because some
advanced utility may want to run multiple instances of pg_copy to copy
several files into the same directory simultaneously. However, I'm afraid
multiple <dest>.tmp files might continue to occupy disk space after
canceling copy or power failure in some use cases, where the copy of the
same file won't be retried. That's also the reason why I chose to not use a
temporary file like cp/copy.

> Currently pg_copy always syncs the archive file, and there is no way to
> disable
> that. But I'm sure that not everyone want to sync the archive file. So I
> think
> that it's better to add the option specifying whether to sync the file
> or not, into
> pg_copy.

pg_copy is for copying a file reliably by syncing. If sync is not
necessary, people can use cp/copy.

> Some users might want to specify whether to call posix_fadvise or not
> because
> they might need to re-read the archvied files just after the archiving.
> For example, network copy of the archived files from the archive area to
> remote site for disaster recovery.

This sounds reasonable. Do you have an idea on the switch name and the
default behavior?

> Do you recommend to use pg_copy for restore_command? If yes, it also
> should
> be documented. And in the WAL restore case, the restored WAL files are
> re-read
> soon by recovery, so posix_fadvise is not good in that case.
>
> Direct I/O and posix_fadvise are used only for destination file. But why
> not
> source file? That might be useful especially for restore_command case.

No, I don't think it's necessary to use pg_copy for restore_command.

> At last, the big question is, is there really no OS command which provides
> the same functionality as pg_copy does? If there is, I'd like to avoid
> duplicate
> work basically.

If there exists such a command available in the standard OS installation, I
want to use it.

Regards
MauMau


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-19 19:37:19
Message-ID: 53F3A76F.3050801@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 8/19/14 9:35 AM, MauMau wrote:
> pg_copy is for copying a file reliably by syncing. If sync is not
> necessary, people can use cp/copy.

I'm getting mixed messages from this thread.

I think there could be a fair amount of support for a new tool that can
serve as a universal plug-and-play archive_command, with a variety of
options, such as fsync yes/no, fadvise yes/no, direct-io[*] yes/no,
atomic copy yes/no, allow overwrite yes/no, compression yes/no. That
would reduce the need for users to compose adventurous shell commands,
and it would set out the various options more clearly.

This is not that. This is cp+fsync with a hardcoded fadvise policy and
optional direct-io. That is a valid problem to address, but it is one
of many. On the other hand, I fear that the addition of this
single-and-a-half-purpose tool would make the overall landscape more
complicated than it already is. Since it's in the examples, people will
probably use it, even if they don't need to or shouldn't. And not
recommending it for the restore_command is also confusing.

Another example of how confusing all of this is: On Windows, the copy
command by default doesn't overwrite files, which is what we want
(usually). The patch changes those instances of copy to pg_copy, but it
doesn't have that behavior. Should the examples by changed to do a test
&& pg_copy on Windows (what's the Windows shell syntax for that?), or
should pg_copy have an option to not overwrite a file? How do you then
avoid inconsistencies with the Unix behavior? Or what if you want fsync
but allow overwriting on Windows?

On the technical side, I think if you fsync a new file, you also need to
fsync the directory, to make sure the file is certain to be visible
after a crash.

[*] I keep reading "directio" as a typo of "direction", so please
consider putting a hyphen or underscore in the option and variable
names. ;-)


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, MauMau <maumau307(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-19 19:40:07
Message-ID: 53F3A817.2060002@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 8/15/14 10:46 AM, Fujii Masao wrote:
> At last, the big question is, is there really no OS command which provides
> the same functionality as pg_copy does? If there is, I'd like to avoid duplicate
> work basically.

If you look hard enough, you can maybe find an OS command that can fsync
a file after it was copied. Some versions of dd can do that, and some
systems have an fsync program. But it's not clear whether all systems
have that, and it probably won't be simple and consistent.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>, MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-19 19:52:47
Message-ID: 53F3AB0F.8030406@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/19/2014 12:37 PM, Peter Eisentraut wrote:
> On 8/19/14 9:35 AM, MauMau wrote:
>> pg_copy is for copying a file reliably by syncing. If sync is not
>> necessary, people can use cp/copy.
>
> I'm getting mixed messages from this thread.
>
> I think there could be a fair amount of support for a new tool that can
> serve as a universal plug-and-play archive_command, with a variety of
> options, such as fsync yes/no, fadvise yes/no, direct-io[*] yes/no,
> atomic copy yes/no, allow overwrite yes/no, compression yes/no. That
> would reduce the need for users to compose adventurous shell commands,
> and it would set out the various options more clearly.

Yes please!

Although I'm not sold on the idea of using DirectIO for this. Is there
really enough benefit to make it worth the trouble?

>
> This is not that. This is cp+fsync with a hardcoded fadvise policy and
> optional direct-io. That is a valid problem to address, but it is one
> of many. On the other hand, I fear that the addition of this
> single-and-a-half-purpose tool would make the overall landscape more
> complicated than it already is. Since it's in the examples, people will
> probably use it, even if they don't need to or shouldn't. And not
> recommending it for the restore_command is also confusing.

I'm afraid that I agree with Peter here. pg_copy looks like a nice
foundation for the eventual pg_copy utility we need, but it's not there yet.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-19 21:42:06
Message-ID: 20140819214205.GG6817@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

MauMau wrote:

> With that said, copying to a temporary file like <dest>.tmp and
> renaming it to <dest> sounds worthwhile even as a basic copy
> utility. I want to avoid copying to a temporary file with a fixed
> name like _copy.tmp, because some advanced utility may want to run
> multiple instances of pg_copy to copy several files into the same
> directory simultaneously. However, I'm afraid multiple <dest>.tmp
> files might continue to occupy disk space after canceling copy or
> power failure in some use cases, where the copy of the same file
> won't be retried. That's also the reason why I chose to not use a
> temporary file like cp/copy.

Is there a way to create a link to a file which only exists as an open
file descriptor? If there was, you could create a temp file, open an
fd, then delete the file. That would remove the issue with files being
leaked due to failures of various kinds.

Also, it's been mentioned that this utility might be useful for
restore_command. That sounds good I guess, but need to keep the
RECOVERYXLOG trick in mind. I remember a case of stalled replay because
the restore command was writing to RECOVERYXLOG.gz and ungzipping, and
the unlink("RECOVERYXLOG") call failed after a partial copy and so did
the copy from the archive. (Removing the borked RECOVERYXLOG.gz fixed
it.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Stark <stark(at)mit(dot)edu>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 12:58:58
Message-ID: CAM-w4HMdnAEZerVKeSV5YOhN6z1Jk3506DVF+0QzCzY=Gwg5RA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Aug 19, 2014 at 10:42 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Is there a way to create a link to a file which only exists as an open
> file descriptor? If there was, you could create a temp file, open an
> fd, then delete the file. That would remove the issue with files being
> leaked due to failures of various kinds.

Sort of. On recent Linuxen you can create a file with open(O_TMPFILE)
then use linkat(2) to create a link for it only once it's fully
written.

--
greg


From: Greg Stark <stark(at)mit(dot)edu>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 13:02:44
Message-ID: CAM-w4HNaJOjV1f+LhVL+JfaWAj5JNq2H1BqvSzo3bO0fWGa0bw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

c.f.:

O_TMPFILE (since Linux 3.11)
Create an unnamed temporary file. The pathname argument
specifies a directory; an unnamed inode will be created in
that directory's filesystem. Anything written to the
resulting file will be lost when the last file descriptor is
closed, unless the file is given a name.

O_TMPFILE must be specified with one of O_RDWR or O_WRONLY
and, optionally, O_EXCL. If O_EXCL is not specified, then
linkat(2) can be used to link the temporary file into the
filesystem, making it permanent, using code like the
following:

char path[PATH_MAX];
fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
S_IRUSR | S_IWUSR);

/* File I/O on 'fd'... */

snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
AT_SYMLINK_FOLLOW);

In this case, the open() mode argument determines the file
permission mode, as with O_CREAT.

Specifying O_EXCL in conjunction with O_TMPFILE prevents a
temporary file from being linked into the filesystem in the
above manner. (Note that the meaning of O_EXCL in this case
is different from the meaning of O_EXCL otherwise.)

There are two main use cases for O_TMPFILE:

* Improved tmpfile(3) functionality: race-free creation of
temporary files that (1) are automatically deleted when
closed; (2) can never be reached via any pathname; (3) are
not subject to symlink attacks; and (4) do not require the
caller to devise unique names.

* Creating a file that is initially invisible, which is then
populated with data and adjusted to have appropriate
filesystem attributes (chown(2), chmod(2), fsetxattr(2),
etc.) before being atomically linked into the filesystem
in a fully formed state (using linkat(2) as described
above).

O_TMPFILE requires support by the underlying filesystem; only
a subset of Linux filesystems provide that support. In the
initial implementation, support was provided in the ext2,
ext3, ext4, UDF, Minix, and shmem filesystems. XFS support
was added in Linux 3.15.


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 13:27:36
Message-ID: 20140820132736.GA7648@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:

> char path[PATH_MAX];
> fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
> S_IRUSR | S_IWUSR);
>
> /* File I/O on 'fd'... */
>
> snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
> linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
> AT_SYMLINK_FOLLOW);

Hmm, the real trick here is linkat(... "/proc/self/foobar"), not the
O_TMPFILE: you can have an open file descriptor to an "invisible" file
simply by creating a normal file and unlinking it. I looked at linkat()
yesterday but the idea of using /proc/self didn't occur to me. Nasty
trick :-( It seems linkat() is quite a bit more portable than
O_TMPFILE, fortunately ...

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 13:35:32
Message-ID: 20140820133532.GI10041@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-19 17:42:06 -0400, Alvaro Herrera wrote:
> MauMau wrote:
>
> > With that said, copying to a temporary file like <dest>.tmp and
> > renaming it to <dest> sounds worthwhile even as a basic copy
> > utility. I want to avoid copying to a temporary file with a fixed
> > name like _copy.tmp, because some advanced utility may want to run
> > multiple instances of pg_copy to copy several files into the same
> > directory simultaneously. However, I'm afraid multiple <dest>.tmp
> > files might continue to occupy disk space after canceling copy or
> > power failure in some use cases, where the copy of the same file
> > won't be retried. That's also the reason why I chose to not use a
> > temporary file like cp/copy.
>
> Is there a way to create a link to a file which only exists as an open
> file descriptor? If there was, you could create a temp file, open an
> fd, then delete the file. That would remove the issue with files being
> leaked due to failures of various kinds.

Isn't this a solution looking for a problem? We're using tempfiles in
dozens of other places and I really don't see why this is the place to
stop doing so. Just copy to <dest>.tmp and move it into place. If things
crash during that, the command will be repeated shortly afterwards again
*anyway*. Let's not get into platform specific games here.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 13:50:56
Message-ID: 20140820135056.GB7648@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund wrote:
> On 2014-08-19 17:42:06 -0400, Alvaro Herrera wrote:
> > MauMau wrote:
> >
> > > With that said, copying to a temporary file like <dest>.tmp and
> > > renaming it to <dest> sounds worthwhile even as a basic copy
> > > utility. I want to avoid copying to a temporary file with a fixed
> > > name like _copy.tmp, because some advanced utility may want to run
> > > multiple instances of pg_copy to copy several files into the same
> > > directory simultaneously. However, I'm afraid multiple <dest>.tmp
> > > files might continue to occupy disk space after canceling copy or
> > > power failure in some use cases, where the copy of the same file
> > > won't be retried. That's also the reason why I chose to not use a
> > > temporary file like cp/copy.
> >
> > Is there a way to create a link to a file which only exists as an open
> > file descriptor? If there was, you could create a temp file, open an
> > fd, then delete the file. That would remove the issue with files being
> > leaked due to failures of various kinds.
>
> Isn't this a solution looking for a problem? We're using tempfiles in
> dozens of other places and I really don't see why this is the place to
> stop doing so. Just copy to <dest>.tmp and move it into place. If things
> crash during that, the command will be repeated shortly afterwards again
> *anyway*. Let's not get into platform specific games here.

The issue is what happens if there's a crash while the temp file is in
the middle of being filled. I agree there are bigger problems to solve
in this area though. Yes, I agree that a fixed name such as _copy.tmp
is a really bad choice for several reasons.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Stark <stark(at)mit(dot)edu>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 14:08:25
Message-ID: CAM-w4HNQNE7+G4gR=4BbwqarBk2aKgVAXyR-RV+cwD1Hq+mTjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 20, 2014 at 2:27 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Hmm, the real trick here is linkat(... "/proc/self/foobar"), not the
> O_TMPFILE: you can have an open file descriptor to an "invisible" file
> simply by creating a normal file and unlinking it. I looked at linkat()
> yesterday but the idea of using /proc/self didn't occur to me. Nasty
> trick :-( It seems linkat() is quite a bit more portable than
> O_TMPFILE, fortunately ...

Supposedly linkat(2) on Linux refuses to create a link to a file that
was opened normally and had its last link removed with unlink(2) due
to concerns that this would create security holes.

--
greg


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 14:14:09
Message-ID: 20140820141409.GJ10041@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-20 09:50:56 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
> > On 2014-08-19 17:42:06 -0400, Alvaro Herrera wrote:
> > > MauMau wrote:
> > >
> > > > With that said, copying to a temporary file like <dest>.tmp and
> > > > renaming it to <dest> sounds worthwhile even as a basic copy
> > > > utility. I want to avoid copying to a temporary file with a fixed
> > > > name like _copy.tmp, because some advanced utility may want to run
> > > > multiple instances of pg_copy to copy several files into the same
> > > > directory simultaneously. However, I'm afraid multiple <dest>.tmp
> > > > files might continue to occupy disk space after canceling copy or
> > > > power failure in some use cases, where the copy of the same file
> > > > won't be retried. That's also the reason why I chose to not use a
> > > > temporary file like cp/copy.
> > >
> > > Is there a way to create a link to a file which only exists as an open
> > > file descriptor? If there was, you could create a temp file, open an
> > > fd, then delete the file. That would remove the issue with files being
> > > leaked due to failures of various kinds.
> >
> > Isn't this a solution looking for a problem? We're using tempfiles in
> > dozens of other places and I really don't see why this is the place to
> > stop doing so. Just copy to <dest>.tmp and move it into place. If things
> > crash during that, the command will be repeated shortly afterwards again
> > *anyway*. Let's not get into platform specific games here.
>
> The issue is what happens if there's a crash while the temp file is in
> the middle of being filled.

The archive command will be be run again a couple seconds and remove the
half-filled temp file.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 14:18:28
Message-ID: 20140820141828.GD7648@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:
> On Wed, Aug 20, 2014 at 2:27 PM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
> > Hmm, the real trick here is linkat(... "/proc/self/foobar"), not the
> > O_TMPFILE: you can have an open file descriptor to an "invisible" file
> > simply by creating a normal file and unlinking it. I looked at linkat()
> > yesterday but the idea of using /proc/self didn't occur to me. Nasty
> > trick :-( It seems linkat() is quite a bit more portable than
> > O_TMPFILE, fortunately ...
>
> Supposedly linkat(2) on Linux refuses to create a link to a file that
> was opened normally and had its last link removed with unlink(2) due
> to concerns that this would create security holes.

Sigh.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 14:19:33
Message-ID: 9368.1408544373@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2014-08-20 09:50:56 -0400, Alvaro Herrera wrote:
>> Andres Freund wrote:
>>> Isn't this a solution looking for a problem? We're using tempfiles in
>>> dozens of other places and I really don't see why this is the place to
>>> stop doing so. Just copy to <dest>.tmp and move it into place.

>> The issue is what happens if there's a crash while the temp file is in
>> the middle of being filled.

> The archive command will be be run again a couple seconds and remove the
> half-filled temp file.

Alternatively, you could use the process PID as part of the temp file
name; which is probably a good idea anyway.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 14:27:46
Message-ID: 20140820142746.GK10041@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-20 10:19:33 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2014-08-20 09:50:56 -0400, Alvaro Herrera wrote:
> >> Andres Freund wrote:
> >>> Isn't this a solution looking for a problem? We're using tempfiles in
> >>> dozens of other places and I really don't see why this is the place to
> >>> stop doing so. Just copy to <dest>.tmp and move it into place.
>
> >> The issue is what happens if there's a crash while the temp file is in
> >> the middle of being filled.
>
> > The archive command will be be run again a couple seconds and remove the
> > half-filled temp file.
>
> Alternatively, you could use the process PID as part of the temp file
> name; which is probably a good idea anyway.

I think that's actually worse, because nothing will clean up those
unless you explicitly scan for all <whatever>.$pid files, and somehow
kill them.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 14:36:40
Message-ID: 9715.1408545400@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2014-08-20 10:19:33 -0400, Tom Lane wrote:
>> Alternatively, you could use the process PID as part of the temp file
>> name; which is probably a good idea anyway.

> I think that's actually worse, because nothing will clean up those
> unless you explicitly scan for all <whatever>.$pid files, and somehow
> kill them.

True. As long as the copy command is prepared to get rid of a
pre-existing target file, using a fixed .tmp extension should be fine.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 22:58:05
Message-ID: 20140820225805.GA17822@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 20, 2014 at 10:36:40AM -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2014-08-20 10:19:33 -0400, Tom Lane wrote:
> >> Alternatively, you could use the process PID as part of the temp file
> >> name; which is probably a good idea anyway.
>
> > I think that's actually worse, because nothing will clean up those
> > unless you explicitly scan for all <whatever>.$pid files, and somehow
> > kill them.
>
> True. As long as the copy command is prepared to get rid of a
> pre-existing target file, using a fixed .tmp extension should be fine.

Well, then we are back to this comment by MauMau:

> With that said, copying to a temporary file like <dest>.tmp and
> renaming it to <dest> sounds worthwhile even as a basic copy utility.
> I want to avoid copying to a temporary file with a fixed name like
> _copy.tmp, because some advanced utility may want to run multiple
> instances of pg_copy to copy several files into the same directory
> simultaneously. However, I'm afraid multiple <dest>.tmp files might
> continue to occupy disk space after canceling copy or power failure in
> some use cases, where the copy of the same file won't be retried.
> That's also the reason why I chose to not use a temporary file like
> cp/copy.

Do we want cases where the same directory is used multiple pg_copy
processes? I can't imagine how that setup would make sense.

I am thinking pg_copy should emit a warning message when it removes an
old temp file. This might alert people that something odd is happening
if they see the message often.

The pid-extension idea would work as pg_copy can test all pid extension
files to see if the pid is still active. However, that assumes that the
pid is running on the local machine and not on another machines that has
NFS-mounted this directory, so maybe this is a bad idea, but again, we
are back to the idea that only one process should be writing into this
directory.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, MauMau <maumau307(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] pg_copy - a command for reliable WAL archiving
Date: 2014-08-20 23:10:53
Message-ID: 20140820231053.GA9835@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-20 18:58:05 -0400, Bruce Momjian wrote:
> On Wed, Aug 20, 2014 at 10:36:40AM -0400, Tom Lane wrote:
> > Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > > On 2014-08-20 10:19:33 -0400, Tom Lane wrote:
> > >> Alternatively, you could use the process PID as part of the temp file
> > >> name; which is probably a good idea anyway.
> >
> > > I think that's actually worse, because nothing will clean up those
> > > unless you explicitly scan for all <whatever>.$pid files, and somehow
> > > kill them.
> >
> > True. As long as the copy command is prepared to get rid of a
> > pre-existing target file, using a fixed .tmp extension should be fine.
>
> Well, then we are back to this comment by MauMau:

> > With that said, copying to a temporary file like <dest>.tmp and
> > renaming it to <dest> sounds worthwhile even as a basic copy utility.
> > I want to avoid copying to a temporary file with a fixed name like
> > _copy.tmp, because some advanced utility may want to run multiple
> > instances of pg_copy to copy several files into the same directory
> > simultaneously. However, I'm afraid multiple <dest>.tmp files might
> > continue to occupy disk space after canceling copy or power failure in
> > some use cases, where the copy of the same file won't be retried.
> > That's also the reason why I chose to not use a temporary file like
> > cp/copy.
>
> Do we want cases where the same directory is used multiple pg_copy
> processes? I can't imagine how that setup would make sense.

I don't think anybody is proposing the _copy.tmp proposal. We've just
argued about the risk of <dest>.tmp. And I argued - and others seem to
agree - the space usage problem isn't really relevant because archive
commands and such are rerun after failure and can then clean up the temp
file again.

> I am thinking pg_copy should emit a warning message when it removes an
> old temp file. This might alert people that something odd is happening
> if they see the message often.

Don't really see a point in this. If the archive command or such failed,
that will already have been logged. I'd expect this to be implemented by
passing O_CREAT | O_TRUNC to open(), nothing else.

> The pid-extension idea would work as pg_copy can test all pid extension
> files to see if the pid is still active. However, that assumes that the
> pid is running on the local machine and not on another machines that has
> NFS-mounted this directory, so maybe this is a bad idea, but again, we
> are back to the idea that only one process should be writing into this
> directory.

I don't actually think we should assume that. There very well could be
one process running an archive command, using differently prefixed file
names or such.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services