Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69

Lists: pgsql-committerspgsql-hackers
From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-19 02:11:40
Message-ID: E1VtT5Q-0003Cc-Hh@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Upgrade to Autoconf 2.69

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/94b899b829657332bda856ac3f06153d09077bd1

Modified Files
--------------
configure |35943 +++++++++++++-------------------------------
configure.in | 2 +-
src/include/pg_config.h.in | 31 +-
3 files changed, 10253 insertions(+), 25723 deletions(-)


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-20 13:54:36
Message-ID: 20131220135436.GT11006@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Peter Eisentraut wrote:
> Upgrade to Autoconf 2.69
>

Shortly after this patch was committed, buildfarm member locust (running
Mac OS X 10.5 apparently) started failing the pg_upgrade check:

command: "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/install//Users/pgbuildfarm/Documents/workdir//HEAD/inst/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/data" -o "-p 57632 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade'" start >> "pg_upgrade_server.log" 2>&1
waiting for server to start....LOG: database system was shut down at 2013-12-19 12:51:16 CET
LOG: invalid primary checkpoint record
LOG: invalid secondary checkpoint link in control file
PANIC: could not locate a valid checkpoint record

...

command: "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/install//Users/pgbuildfarm/Documents/workdir//HEAD/inst/bin/pg_resetxlog" -l 000000010000000000000009 "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/data" >> "pg_upgrade_utility.log" 2>&1
pg_resetxlog: could not read from directory "pg_xlog": Invalid argument

I don't see how can the pg_upgrade check fail in this way but not the
regular regression test. This patch includes the following hunk to
pg_config.h.in:

+/* Enable large inode numbers on Mac OS X 10.5. */
+#ifndef _DARWIN_USE_64_BIT_INODE
+# define _DARWIN_USE_64_BIT_INODE 1
+#endif

Evidently something is not going well in ReadRecord. It should have
reported the read failure, but didn't. That seems a separate bug that
needs fixed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-20 14:39:33
Message-ID: 20131220143933.GV11006@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Alvaro Herrera wrote:
> Peter Eisentraut wrote:
> > Upgrade to Autoconf 2.69

> +/* Enable large inode numbers on Mac OS X 10.5. */
> +#ifndef _DARWIN_USE_64_BIT_INODE
> +# define _DARWIN_USE_64_BIT_INODE 1
> +#endif

FWIW

http://gnu-autoconf.7623.n7.nabble.com/AC-SYS-LARGEFILE-and-old-OSX-framework-td19154.html

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-21 14:39:22
Message-ID: 1387636762.30013.13.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Fri, 2013-12-20 at 10:54 -0300, Alvaro Herrera wrote:
> I don't see how can the pg_upgrade check fail in this way but not the
> regular regression test. This patch includes the following hunk to
> pg_config.h.in:
>
> +/* Enable large inode numbers on Mac OS X 10.5. */
> +#ifndef _DARWIN_USE_64_BIT_INODE
> +# define _DARWIN_USE_64_BIT_INODE 1
> +#endif
>
>
> Evidently something is not going well in ReadRecord. It should have
> reported the read failure, but didn't. That seems a separate bug that
> needs fixed.

This is enabling large-file support on OS X, so that seems kind of
important. It's not failing with newer versions of OS X, so that leaves
the following possibilities, I think:

- Large files never worked on 10.5. That would be strange because
Autoconf explicitly refers to that version.

- New OS X versions have this on by default (could someone check?), so
it already worked before, and it's something specific to 10.5 that's
broken.

- It's something specific to PowerPC or endianness.

- That build farm member has a strange setup.

If anyone has any of these components, please help isolate the problem.

It's conceivable that inconsistent include file order would cause this
kind of problem, but we're pretty strict about including the right
things first.

It someone can reproduce the problem, it would also help dissecting the
pg_upgrade test script to see exactly where the failure is introduced.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-23 15:46:19
Message-ID: 52B85ACB.4040509@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On 12/21/13, 9:39 AM, Peter Eisentraut wrote:
> This is enabling large-file support on OS X, so that seems kind of
> important. It's not failing with newer versions of OS X, so that leaves
> the following possibilities, I think:
>
> - Large files never worked on 10.5. That would be strange because
> Autoconf explicitly refers to that version.
>
> - New OS X versions have this on by default (could someone check?), so
> it already worked before, and it's something specific to 10.5 that's
> broken.

Current OS X has 64-bit inodes on by default, so this code works in
general. So it's probably one of the other causes.

> - It's something specific to PowerPC or endianness.
>
> - That build farm member has a strange setup.


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-24 15:58:04
Message-ID: 20131224155804.GN22570@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Alvaro Herrera wrote:

Heikki, Andres,

> Shortly after this patch was committed, buildfarm member locust (running
> Mac OS X 10.5 apparently) started failing the pg_upgrade check:
>
> command: "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/install//Users/pgbuildfarm/Documents/workdir//HEAD/inst/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/data" -o "-p 57632 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade'" start >> "pg_upgrade_server.log" 2>&1
> waiting for server to start....LOG: database system was shut down at 2013-12-19 12:51:16 CET
> LOG: invalid primary checkpoint record
> LOG: invalid secondary checkpoint link in control file
> PANIC: could not locate a valid checkpoint record

Any comment on this problem? Somehow ReadRecord is unable to find a
checkpoint, yet there's no error message to be seen anywhere, whereas
pg_resetxlog does report it:

> command: "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/install//Users/pgbuildfarm/Documents/workdir//HEAD/inst/bin/pg_resetxlog" -l 000000010000000000000009 "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/data" >> "pg_upgrade_utility.log" 2>&1
> pg_resetxlog: could not read from directory "pg_xlog": Invalid argument

I cannot but think xlogreader is at fault.

Regardless of the solution to the Mac OS X problem, ISTM this should be
fixed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-24 16:12:52
Message-ID: 20131224161252.GJ26564@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Hi,

On 2013-12-24 12:58:04 -0300, Alvaro Herrera wrote:
> > Shortly after this patch was committed, buildfarm member locust (running
> > Mac OS X 10.5 apparently) started failing the pg_upgrade check:
> >
> > command: "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/install//Users/pgbuildfarm/Documents/workdir//HEAD/inst/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/data" -o "-p 57632 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade'" start >> "pg_upgrade_server.log" 2>&1
> > waiting for server to start....LOG: database system was shut down at 2013-12-19 12:51:16 CET
> > LOG: invalid primary checkpoint record
> > LOG: invalid secondary checkpoint link in control file
> > PANIC: could not locate a valid checkpoint record
>
> Any comment on this problem? Somehow ReadRecord is unable to find a
> checkpoint, yet there's no error message to be seen anywhere, whereas
> pg_resetxlog does report it:
>
> > command: "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/install//Users/pgbuildfarm/Documents/workdir//HEAD/inst/bin/pg_resetxlog" -l 000000010000000000000009 "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/data" >> "pg_upgrade_utility.log" 2>&1
> > pg_resetxlog: could not read from directory "pg_xlog": Invalid argument
>
> I cannot but think xlogreader is at fault.
>
> Regardless of the solution to the Mac OS X problem, ISTM this should be
> fixed.

I didn't look at any code, and I won't today, but it doesn't look
surprising - the report when starting the server above is presumable the
one in ReadCheckpoint() (or similar) and it probably just reports that
ReadRecord() didn't return a record.
pg_resetxlog (which doesn't use xlogreader!) reports that it couldn't
read from directory "pg_xlog", so there's something wonky independently
from xlogreader. I'd guess that xlog.c read_page callback errors out
without reporting an error. IIRC we're logging some failures as DEBUG
there, because they really aren't unexpected, and normally just signal
the end of wal.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-27 18:57:43
Message-ID: 20131227185743.GQ22570@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Andres Freund wrote:
> Hi,
>
> On 2013-12-24 12:58:04 -0300, Alvaro Herrera wrote:
> > > Shortly after this patch was committed, buildfarm member locust (running
> > > Mac OS X 10.5 apparently) started failing the pg_upgrade check:
> > >
> > > command: "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/install//Users/pgbuildfarm/Documents/workdir//HEAD/inst/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/data" -o "-p 57632 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade'" start >> "pg_upgrade_server.log" 2>&1
> > > waiting for server to start....LOG: database system was shut down at 2013-12-19 12:51:16 CET
> > > LOG: invalid primary checkpoint record
> > > LOG: invalid secondary checkpoint link in control file
> > > PANIC: could not locate a valid checkpoint record
> >
> > Any comment on this problem? Somehow ReadRecord is unable to find a
> > checkpoint, yet there's no error message to be seen anywhere, whereas
> > pg_resetxlog does report it:
> >
> > > command: "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/install//Users/pgbuildfarm/Documents/workdir//HEAD/inst/bin/pg_resetxlog" -l 000000010000000000000009 "/Users/pgbuildfarm/Documents/workdir/HEAD/pgsql.82393/contrib/pg_upgrade/tmp_check/data" >> "pg_upgrade_utility.log" 2>&1
> > > pg_resetxlog: could not read from directory "pg_xlog": Invalid argument
> >
> > I cannot but think xlogreader is at fault.
> >
> > Regardless of the solution to the Mac OS X problem, ISTM this should be
> > fixed.
>
> I didn't look at any code, and I won't today, but it doesn't look
> surprising - the report when starting the server above is presumable the
> one in ReadCheckpoint() (or similar) and it probably just reports that
> ReadRecord() didn't return a record.

How is this not surprising? Surely failing to find a checkpoint record
is not a problem to be taken lightly.

> pg_resetxlog (which doesn't use xlogreader!) reports that it couldn't
> read from directory "pg_xlog", so there's something wonky independently
> from xlogreader.

Yes, most likely there is. My point is that the LOG messages above
should have logged the system error that caused the checkpoint record to
be unfindable.

> I'd guess that xlog.c read_page callback errors out without reporting
> an error. IIRC we're logging some failures as DEBUG there, because
> they really aren't unexpected, and normally just signal the end of
> wal.

Hmm? At least, I recall something like a "unexpected pageaddr" message
is sometimes logged when end-of-wal is found. Why would other error
messages be hidden?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-29 07:48:21
Message-ID: 3894.1388303301@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On Fri, 2013-12-20 at 10:54 -0300, Alvaro Herrera wrote:
>> Evidently something is not going well in ReadRecord. It should have
>> reported the read failure, but didn't. That seems a separate bug that
>> needs fixed.

> This is enabling large-file support on OS X, so that seems kind of
> important. It's not failing with newer versions of OS X, so that leaves
> the following possibilities, I think:

[ gets out ancient PPC laptop ... ]

[ man, this thing is slow ... ]

[ hours later ... ]

There are three or four different bugs here, but the key points are:

1. pg_resetxlog is diligently trashing every single WAL file in pg_xlog/,
and then failing (by virtue of some ancient OS X bug in readdir()), so
that it doesn't get to the point of recreating a WAL file with a
checkpoint record.

2. pg_resetxlog already rewrote pg_control, which might not be such a hot
idea; but whether it had or not, pg_control's last-checkpoint pointer
would be pointing to a WAL file that is not in pg_xlog/ now.

3. pg_upgrade ignores the fact that pg_resetxlog failed, and keeps going.

4. The server tries to start, and fails because it can't find a WAL file
containing the last checkpoint record. This is pretty unsurprising given
the facts above. The reason you don't see any "no such file" report is
that XLogFileRead() will report any BasicOpenFile() failure *other than*
ENOENT. And nothing else makes up for that.

Re point 4: the logic, if you can call it that, in xlog.c and xlogreader.c
is making my head spin. There are about four levels of overcomplicated
and undercommented code before you ever get down to XLogFileRead, so I
have no idea which level to blame for the lack of error reporting in this
specific case. But there are pretty clearly some cases in which ignoring
ENOENT in XLogFileRead isn't such a good idea, and XLogFileRead isn't
being told when to do that or not.

Re point 3: I already bitched about that.

Re point 2: I wonder whether pg_resetxlog shouldn't do things in a
different order. Rewriting pg_control to point to a file that doesn't
exist yet doesn't sound great. I'm not sure though if there's any
great improvement to be had here; basically, if we fail partway
through, we're probably screwed no matter what.

Re point 1: there are threads in our archives suggesting that EINVAL
from readdir() after unlinking a directory entry is a known problem
on some versions of OS X, eg
http://www.postgresql.org/message-id/flat/47C45B07-8459-48D8-8FBE-291864019CC2(at)me(dot)com
and I also find stuff on the intertubes suggesting that renaming an entry
can cause it too, eg
http://www.dovecot.org/list/dovecot/2009-July/041009.html

We thought at the time that the readdir bug was new in OS X 10.6, but I'll
bet it was there in 10.5's 64-bit-inode support and was only exposed to
default builds when 10.6 turned on 64-bit-inodes in userland by default.
So Apple fixed it in 10.6.2 but never back-patched the fix into 10.5.x ---
shame on them.

I'm disinclined to contort our stuff to any great extent to fix it,
because all the OS X versions mentioned are several years obsolete.
Perhaps though we should override Autoconf's setting of
_DARWIN_USE_64_BIT_INODE, if we can do that easily? It's clearly
not nearly as problem-free on 10.5 as the Autoconf boys believe,
and it's already enabled by default on the release series where it
does work.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-29 18:05:29
Message-ID: 14711.1388340329@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

I wrote:
> Perhaps though we should override Autoconf's setting of
> _DARWIN_USE_64_BIT_INODE, if we can do that easily? It's clearly
> not nearly as problem-free on 10.5 as the Autoconf boys believe,
> and it's already enabled by default on the release series where it
> does work.

I looked into this and found that _DARWIN_USE_64_BIT_INODE is being turned
on by AC_SYS_LARGEFILE. Quite aside from the wisdom of doing this at all,
it's got nothing to do with the advertised purpose of that macro: the
width of inode_t would affect how many files you can put on one
filesystem, not how large the individual files are. I don't think that is
something that we need to concern ourselves with enabling when it's not
the platform default. And just to add insult to injury, the
implementation technique is such that the #define gets put into
pg_config.h unconditionally, even if AC_SYS_LARGEFILE isn't executed by
the configure script!

So IMO this is brain-dead in at least three different ways, and I've
pushed a patch to revert it.

We still need to address the other issues enumerated in my previous
message, but this should be enough to get buildfarm member locust
happy again.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2013-12-30 12:46:38
Message-ID: 20131230124638.GC12910@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On 2013-12-29 02:48:21 -0500, Tom Lane wrote:
> 4. The server tries to start, and fails because it can't find a WAL file
> containing the last checkpoint record. This is pretty unsurprising given
> the facts above. The reason you don't see any "no such file" report is
> that XLogFileRead() will report any BasicOpenFile() failure *other than*
> ENOENT. And nothing else makes up for that.
>
> Re point 4: the logic, if you can call it that, in xlog.c and xlogreader.c
> is making my head spin. There are about four levels of overcomplicated
> and undercommented code before you ever get down to XLogFileRead, so I
> have no idea which level to blame for the lack of error reporting in this
> specific case. But there are pretty clearly some cases in which ignoring
> ENOENT in XLogFileRead isn't such a good idea, and XLogFileRead isn't
> being told when to do that or not.

Yes, that code is pretty horrid. To Heikki's and my defense, I don't
think the xlogreader.c split had much to do with it tho. I think the
path erroring out essentially is

ReadRecord()->XLogReadRecord()*->ReadPageInternal()*->XLogPageRead()
->WaitForWALToBecomeAvailable()->XLogFileReadAnyTLI()->XLogFileRead()

The *ed functions are new, but it's really code that was in ReadRecord()
before. So I don't think too much has changed since 9.0ish, although the
timeline switch didn't make it simpler.

As far as I can tell XLogFileRead() actually is told when it's ok to
ignore an error - the notfoundOK parameter. It's just that we're always
passing true for it we're not streaming...
I think it might be sufficient to make passing that flag additionally
conditional on fetching_ckpt, that's already passed to
WaitForWALToBecomeAvailable(), so we'd just need to add it to
XLogFileReadAnyTLI().

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2014-01-07 04:07:32
Message-ID: 20140107040732.GB30539@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Sun, Dec 29, 2013 at 02:48:21AM -0500, Tom Lane wrote:
> There are three or four different bugs here, but the key points are:
>
> 1. pg_resetxlog is diligently trashing every single WAL file in pg_xlog/,
> and then failing (by virtue of some ancient OS X bug in readdir()), so
> that it doesn't get to the point of recreating a WAL file with a
> checkpoint record.
>
> 2. pg_resetxlog already rewrote pg_control, which might not be such a hot
> idea; but whether it had or not, pg_control's last-checkpoint pointer
> would be pointing to a WAL file that is not in pg_xlog/ now.
>
> 3. pg_upgrade ignores the fact that pg_resetxlog failed, and keeps going.

Does pg_resetxlog return a non-zero exit status? If so, pg_upgrade
should have caught that and exited.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2014-01-07 04:40:53
Message-ID: 31473.1389069653@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> On Sun, Dec 29, 2013 at 02:48:21AM -0500, Tom Lane wrote:
>> 3. pg_upgrade ignores the fact that pg_resetxlog failed, and keeps going.

> Does pg_resetxlog return a non-zero exit status? If so, pg_upgrade
> should have caught that and exited.

It certainly does:

if (errno)
{
fprintf(stderr, _("%s: could not read from directory \"%s\": %s\n"),
progname, XLOGDIR, strerror(errno));
exit(1);
}

The bug is that pg_upgrade appears to assume (in many places not just this
one) that exec_prog() will abort if the called program fails, but *it
doesn't*, contrary to the claim in its own header comment. This is
because pg_log(FATAL, ...) doesn't call exit(). pg_fatal() does, but
that's not what's being called in the throw_error case.

I imagine that this used to work correctly and got broken by some
ill-advised refactoring, but whatever the origin, it's 100% broken today.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2014-01-07 13:19:49
Message-ID: 95257337-421E-474B-9799-3B807974DDD6@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

That was probably me. I'll look into it.

> On Jan 6, 2014, at 11:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>>> On Sun, Dec 29, 2013 at 02:48:21AM -0500, Tom Lane wrote:
>>> 3. pg_upgrade ignores the fact that pg_resetxlog failed, and keeps going.
>
>> Does pg_resetxlog return a non-zero exit status? If so, pg_upgrade
>> should have caught that and exited.
>
> It certainly does:
>
> if (errno)
> {
> fprintf(stderr, _("%s: could not read from directory \"%s\": %s\n"),
> progname, XLOGDIR, strerror(errno));
> exit(1);
> }
>
> The bug is that pg_upgrade appears to assume (in many places not just this
> one) that exec_prog() will abort if the called program fails, but *it
> doesn't*, contrary to the claim in its own header comment. This is
> because pg_log(FATAL, ...) doesn't call exit(). pg_fatal() does, but
> that's not what's being called in the throw_error case.
>
> I imagine that this used to work correctly and got broken by some
> ill-advised refactoring, but whatever the origin, it's 100% broken today.
>
> regards, tom lane
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2014-01-10 21:02:02
Message-ID: 20140110210202.GG4873@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Tue, Jan 7, 2014 at 08:19:49AM -0500, Peter Eisentraut wrote:
> That was probably me. I'll look into it.
>
>
>
> > On Jan 6, 2014, at 11:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Bruce Momjian <bruce(at)momjian(dot)us> writes:
> >>> On Sun, Dec 29, 2013 at 02:48:21AM -0500, Tom Lane wrote:
> >>> 3. pg_upgrade ignores the fact that pg_resetxlog failed, and keeps going.
> >
> >> Does pg_resetxlog return a non-zero exit status? If so, pg_upgrade
> >> should have caught that and exited.
> >
> > It certainly does:
> >
> > if (errno)
> > {
> > fprintf(stderr, _("%s: could not read from directory \"%s\": %s\n"),
> > progname, XLOGDIR, strerror(errno));
> > exit(1);
> > }
> >
> > The bug is that pg_upgrade appears to assume (in many places not just this
> > one) that exec_prog() will abort if the called program fails, but *it
> > doesn't*, contrary to the claim in its own header comment. This is
> > because pg_log(FATAL, ...) doesn't call exit(). pg_fatal() does, but
> > that's not what's being called in the throw_error case.
> >
> > I imagine that this used to work correctly and got broken by some
> > ill-advised refactoring, but whatever the origin, it's 100% broken today.

I know Peter is looking at this, but I looked at and I can't see the
problem. Every call of exec_prog() that uses pg_resetxlog has
throw_error = true, and the test there is:

result = system(cmd);

if (result != 0)
...
pg_log(FATAL, ...)

and in pg_log_v() I see:

switch (type)
...
case PG_FATAL:
printf("\n%s", _(message));
printf("Failure, exiting\n");
--> exit(1);
break;

so I am not clear how you are seeing the return status of pg_resetxlog
ignored. I tried the attached patch which causes pg_resetxlog -f to
return -1, and got the proper error from pg_upgrade in git head:

Performing Upgrade
------------------
Analyzing all rows in the new cluster ok
Freezing all rows on the new cluster ok
Deleting files from new pg_clog ok
Copying old pg_clog to new server ok
Setting next transaction ID for new cluster
*failure*

Consult the last few lines of "pg_upgrade_utility.log" for
the probable cause of the failure.
Failure, exiting

and the last line in "pg_upgrade_utility.log" is:

command: "/u/pgsql/bin/pg_resetxlog" -f -x 683 "/u/pgsql/data" >>
"pg_upgrade_utility.log" 2>&1

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

Attachment Content-Type Size
reset.diff text/x-diff 377 bytes

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2014-01-10 21:06:06
Message-ID: 20140110210605.GW6840@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Bruce Momjian wrote:

> I know Peter is looking at this, but I looked at and I can't see the
> problem. Every call of exec_prog() that uses pg_resetxlog has
> throw_error = true, and the test there is:
>
> result = system(cmd);
>
> if (result != 0)
> ...
> pg_log(FATAL, ...)
>
> and in pg_log_v() I see:
>
> switch (type)
> ...
> case PG_FATAL:
> printf("\n%s", _(message));
> printf("Failure, exiting\n");
> --> exit(1);
> break;

This was fixed by Peter two days ago in commit
http://git.postgresql.org/pg/commitdiff/ca607b155e86ce529fc9ac322a232f264cda9ab6

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2014-01-10 21:06:11
Message-ID: 16764.1389387971@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> On Tue, Jan 7, 2014 at 08:19:49AM -0500, Peter Eisentraut wrote:
>> That was probably me. I'll look into it.

> and in pg_log_v() I see:

> switch (type)
> ...
> case PG_FATAL:
> printf("\n%s", _(message));
> printf("Failure, exiting\n");
> --> exit(1);
> break;

Peter just fixed that; see commit ca607b155e86ce529fc9ac322a232f264cda9ab6

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Date: 2014-01-10 21:10:58
Message-ID: 20140110211058.GH4873@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Fri, Jan 10, 2014 at 04:06:11PM -0500, Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > On Tue, Jan 7, 2014 at 08:19:49AM -0500, Peter Eisentraut wrote:
> >> That was probably me. I'll look into it.
>
> > and in pg_log_v() I see:
>
> > switch (type)
> > ...
> > case PG_FATAL:
> > printf("\n%s", _(message));
> > printf("Failure, exiting\n");
> > --> exit(1);
> > break;
>
> Peter just fixed that; see commit ca607b155e86ce529fc9ac322a232f264cda9ab6

Oh, I guess I checked git log on the wrong file then. Thanks.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +