Re: 9.0beta2 - server crash when using HS + SR

Lists: pgsql-hackers
From: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
To: pgsql-hackers(at)postgresql(dot)org
Subject: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-12 01:29:04
Message-ID: 4C12E2E0.2090307@usit.uio.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello

I am testing HS + SR in a system running 9.0beta2. What I am doing is
just trying all kind of crazy combinations and see how the system
handles them.

One of the test I knew was going to fail was to create a tablespace in
the master node with the directory used by the tablespace existing in
the master and not in the standby node.

What I didn't expect was such a serious consequence. Postgres crashed in
the standby node and it refused to start until the directory needed by
the tablespace was created also in the standby.

I suppose there is not an easy way of fixing this, but at least it would
be a good idea to update the documentation with some information about
how to fix this error situation (hot-standby.html#HOT-STANDBY-CAVEATS
will be a nice place to have this information)

Another thing is that the HINT message in the logs was a little
misleading. The server is down and it will not start without fixing the
cause of the problem.
- ----------------------------------------------------
FATAL: directory "/var/pgsql/ts_test" does not exist
CONTEXT: xlog redo create ts: 20177 "/var/pgsql/ts_test"
LOG: startup process (PID 10147) exited with exit code 1
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
- ----------------------------------------------------

regards,
- --
Rafael Martinez, <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkwS4ssACgkQBhuKQurGihQSfACePmzdjILYnPzKnk9NuDoB19YT
b3YAn2ufyis1r819ow3KJ46OO0Kv0Hd0
=boIg
-----END PGP SIGNATURE-----


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-13 16:42:49
Message-ID: AANLkTinxVuWOSS4egXx6A6bLmMrNsxuYS7kFfgKPLwkp@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 11, 2010 at 9:29 PM, Rafael Martinez
<r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no> wrote:
> I am testing HS + SR in a system running 9.0beta2. What I am doing is
> just trying all kind of crazy combinations and see how the system
> handles them.

Thanks!

> One of the test I knew was going to fail was to create a tablespace in
> the master node with the directory used by the tablespace existing in
> the master and not in the standby node.
>
> What I didn't expect was such a serious consequence. Postgres crashed in
> the standby node and it refused to start until the directory needed by
> the tablespace was created also in the standby.
>
> I suppose there is not an easy way of fixing this, but at least it would
> be a good idea to update the documentation with some information about
> how to fix this error situation (hot-standby.html#HOT-STANDBY-CAVEATS
> will be a nice place to have this information)
>
> Another thing is that the HINT message in the logs was a little
> misleading. The server is down and it will not start without fixing the
> cause of the problem.
> - ----------------------------------------------------
> FATAL:  directory "/var/pgsql/ts_test" does not exist
> CONTEXT:  xlog redo create ts: 20177 "/var/pgsql/ts_test"
> LOG:  startup process (PID 10147) exited with exit code 1
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back
> the current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.

I think the behavior is correct (what else would we do? we must be
able to replace the subsequent WAL records that use the new
tablespace) but I agree that the hint is a little misleading.
Ideally, it seems like we'd like to issue that hint if we're planning
to restart, but not otherwise. You get that same message, for
example, if the DBA performs an immediate shutdown.

I'm somewhat disinclined to try to address this for 9.0. We've had
this problem for a long time, and I'm not sure that the fact that it
can now happen in a slightly wider set of circumstances is enough
reason to engineer a solution so close to release time, nor am I sure
what that other solution would look like. But I'm open to other
opinions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-13 20:52:10
Message-ID: 20100613205210.GA20580@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 13, 2010 at 12:42:49PM -0400, Robert Haas wrote:
> I think the behavior is correct (what else would we do? we must be
> able to replace the subsequent WAL records that use the new
> tablespace) but I agree that the hint is a little misleading.
> Ideally, it seems like we'd like to issue that hint if we're planning
> to restart, but not otherwise. You get that same message, for
> example, if the DBA performs an immediate shutdown.

A bit of a comment from the sidelines: there's no particular reason why
the tablespaces on the master would need to match the tablespaces on
the slave. For a first cut it would seem to me that you should just be
able to ignore the tablespace commands on the slave. Not sure whether
that's easy or not though.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-13 21:05:08
Message-ID: AANLkTimO21Cp6zsiQ-kk7rDgQf4gaO65ElcxB1JbZUWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 13, 2010 at 4:52 PM, Martijn van Oosterhout
<kleptog(at)svana(dot)org> wrote:
> On Sun, Jun 13, 2010 at 12:42:49PM -0400, Robert Haas wrote:
>> I think the behavior is correct (what else would we do? we must be
>> able to replace the subsequent WAL records that use the new
>> tablespace) but I agree that the hint is a little misleading.
>> Ideally, it seems like we'd like to issue that hint if we're planning
>> to restart, but not otherwise.  You get that same message, for
>> example, if the DBA performs an immediate shutdown.
>
> A bit of a comment from the sidelines: there's no particular reason why
> the tablespaces on the master would need to match the tablespaces on
> the slave. For a first cut it would seem to me that you should just be
> able to ignore the tablespace commands on the slave. Not sure whether
> that's easy or not though.

It's not particularly easy, and it might also not be what you want.

Perhaps in an ideal world we would have some system for mapping
tablespaces on the master to tablespaces on the slave, but I doubt
it's worth the effort: the existing system is not terribly onerous.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-13 21:17:22
Message-ID: 4C154AE2.7080702@usit.uio.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Haas wrote:
> On Fri, Jun 11, 2010 at 9:29 PM, Rafael Martinez

> I'm somewhat disinclined to try to address this for 9.0. We've had
> this problem for a long time, and I'm not sure that the fact that it
> can now happen in a slightly wider set of circumstances is enough
> reason to engineer a solution so close to release time, nor am I sure
> what that other solution would look like. But I'm open to other
> opinions.
>

A minimum and probably the only feasible thing for 9.0 will be to update
the documentation. We need an entry in the hot-standby caveats section
explaining that if you create a tablespace and the directory needed does
not exist in the the standby, the standby will shutdown itself and will
not be able to start until the directory is also created in the standby.

For a DBA point of view, two possible solutions could be:

1) PostgreSQL creates the directory needed for the tablespace if the
user running postgres has privileges to do so at the OS level.

2) The standby discovers that the directory needed does not exist and
pauses the recovering (without shutting down the server) in the WAL
record that creates the tablespace. The standby will check periodically
if the directory is created before starting the recovery process again.

With this the users will be able to continue using and running queries
in the standby node. In very busy systems with many changes, the standby
will fall behind quite a lot if the error is not discovered and fixed
quickly. But in many other systems the delay will not be a problem as
serious as the loss of access to the standby.

regards,
- --
Rafael Martinez, <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEUEARECAAYFAkwVSsgACgkQBhuKQurGihQ1HgCXQKdwOEHLkj7g6FpJG663NUiZ
2gCZAT70aIQZ5Wj3IqsLlY6n+leLruI=
=neA1
-----END PGP SIGNATURE-----


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-14 00:16:48
Message-ID: 4C1574F0.2070304@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rafael Martinez wrote:
> A minimum and probably the only feasible thing for 9.0 will be to update
> the documentation. We need an entry in the hot-standby caveats section
> explaining that if you create a tablespace and the directory needed does
> not exist in the the standby, the standby will shutdown itself and will
> not be able to start until the directory is also created in the standby.
>

This is not a Hot Standby problem, and it's been documented since at
least http://www.postgresql.org/docs/8.2/static/warm-standby.html ; read
25.2.1 "Planning" in the current
http://developer.postgresql.org/pgdocs/postgres/warm-standby.html where
it's spelled out quite clearly.

It's a mixed blessing that it's now possible to actually get a
replicated server up so much more easily that people don't have to read
that particular document quite as carefully now and still get something
going. But if there's a documentation change to made, it should be
highlighting the warning already in that section better; it's not
something appropriate for the Hot Standby caveats. Since this is
clearly documented already, and there are bigger problems to worry about
for the current release, the real minimum action to perform here (and
the only one I would consider reasonable) is to change nothing at this
point for 9.0 here. I'm sorry you missed where this was covered, but
adding redundant documentation for basics like this invariably leads to
the multiple copies becoming out of sync with one another as changes are
made in the future.

> 1) PostgreSQL creates the directory needed for the tablespace if the
> user running postgres has privileges to do so at the OS level.
> 2) The standby discovers that the directory needed does not exist and
> pauses the recovering (without shutting down the server) in the WAL
> record that creates the tablespace. The standby will check periodically
> if the directory is created before starting the recovery process again.
>

Given that the idea behind a tablespace is that you want to relocate it
to a specific storage path, which may not map in the same way on the
standby, your first idea will never get implemented; it's not something
you want the server to guess about. As for the second, I would rather
see the standby go down--and hopefully set off some serious alarms for
the DBA who has screwed up here--than to stay up in a dysfunctional
polling state. The very serious mistake made is far more likely to be
discovered the way it's built right now.

I wouldn't be adverse to improving the error messages emitted when this
happens by the server to make it more obvious what's gone wrong in 9.1.
That's the only genuine improvement I'd see value in here, to cut down
on other people running into what you did and being as confused by it.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-14 06:21:49
Message-ID: AANLkTin8tuhrOML5d4gxZbYOkONQyG2GsTE1QPt9C26_@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 14, 2010 at 9:16 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> I wouldn't be adverse to improving the error messages emitted when this
> happens by the server to make it more obvious what's gone wrong in 9.1.
>  That's the only genuine improvement I'd see value in here, to cut down on
> other people running into what you did and being as confused by it.

What about the attached patch? When we encounter that problem, we get
the following hint message:

FATAL: directory "/path_to/ts" does not exist
HINT: create "/path_to/ts" directory for tablespace before
restarting the server
CONTEXT: xlog redo create ts: 16384 "/path_to/ts"

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
recovery_ts_hint_v1.patch application/octet-stream 752 bytes

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-14 11:36:56
Message-ID: 1276515416.23257.50322.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 2010-06-12 at 03:29 +0200, Rafael Martinez wrote:
> What I didn't expect was such a serious consequence. Postgres crashed
> in the standby node and it refused to start until the directory needed
> by the tablespace was created also in the standby.

> I suppose there is not an easy way of fixing this, but at least it
> would
> be a good idea to update the documentation with some information about
> how to fix this error situation (hot-standby.html#HOT-STANDBY-CAVEATS
> will be a nice place to have this information)

Thanks for testing.

This crash has nothing to do with HS or SR. There is already a
documented caveat and this has been like this for 6 years
http://developer.postgresql.org/pgdocs/postgres/continuous-archiving.html#CONTINUOUS-ARCHIVING-CAVEATS

If you said the docs need some work, I would agree with you. There isn't
anything to say that the caveats in 24.3.6 still apply to SR, though
they still all do.

--
Simon Riggs www.2ndQuadrant.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-30 17:15:57
Message-ID: 201006301715.o5UHFvb25845@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Fujii Masao wrote:
> On Mon, Jun 14, 2010 at 9:16 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> > I wouldn't be adverse to improving the error messages emitted when this
> > happens by the server to make it more obvious what's gone wrong in 9.1.
> > ?That's the only genuine improvement I'd see value in here, to cut down on
> > other people running into what you did and being as confused by it.
>
> What about the attached patch? When we encounter that problem, we get
> the following hint message:
>
> FATAL: directory "/path_to/ts" does not exist
> HINT: create "/path_to/ts" directory for tablespace before
> restarting the server
> CONTEXT: xlog redo create ts: 16384 "/path_to/ts"

This is an interesting patch idea. One problem with the patch is that
create_tablespace_directories() is called both during recovery and when
creating a tablespace, and the hint only makes sense in the first case.

The attached patch shows the hint only during recovery. Unless there
are objections, I will apply this for 9.0. I do think people will be
hit by this more often in 9.0.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +

Attachment Content-Type Size
/pgpatches/tablespace text/x-diff 3.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-30 17:24:34
Message-ID: 13033.1277918674@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> FATAL: directory "/path_to/ts" does not exist
>> HINT: create "/path_to/ts" directory for tablespace before
>> restarting the server
>> CONTEXT: xlog redo create ts: 16384 "/path_to/ts"

> This is an interesting patch idea. One problem with the patch is that
> create_tablespace_directories() is called both during recovery and when
> creating a tablespace, and the hint only makes sense in the first case.

Please make the hint conform to the project message style guidelines.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-30 17:27:27
Message-ID: 13097.1277918847@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> The attached patch shows the hint only during recovery.

BTW, it would be easier and more consistent with the rest of the code to
look at InRecovery, instead of messing around with the function
signature. And the usual way to emit a hint conditionally is

(InRecovery ? errhint(...) : 0)

rather than duplicate a lot of surrounding code.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-30 23:41:53
Message-ID: 201006302341.o5UNfrp07888@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > The attached patch shows the hint only during recovery.
>
> BTW, it would be easier and more consistent with the rest of the code to
> look at InRecovery, instead of messing around with the function
> signature. And the usual way to emit a hint conditionally is
>
> (InRecovery ? errhint(...) : 0)
>
> rather than duplicate a lot of surrounding code.

Thanks for the "hints". I was thinking there was a way to use ? : for
the hint, but couldn't find an example. I see examples now. Updated
patch attached.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +

Attachment Content-Type Size
/pgpatches/tablespace text/x-diff 980 bytes

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-07-02 02:44:48
Message-ID: 201007020244.o622imV20225@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > > The attached patch shows the hint only during recovery.
> >
> > BTW, it would be easier and more consistent with the rest of the code to
> > look at InRecovery, instead of messing around with the function
> > signature. And the usual way to emit a hint conditionally is
> >
> > (InRecovery ? errhint(...) : 0)
> >
> > rather than duplicate a lot of surrounding code.
>
> Thanks for the "hints". I was thinking there was a way to use ? : for
> the hint, but couldn't find an example. I see examples now. Updated
> patch attached.

Applied.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +