Quick Links

Re: Hot standby and b-tree killed items

Lists:	pgsql-hackers

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Hot standby and b-tree killed items
Date:	2008-12-19 08:49:02
Message-ID:	494B5FFE.4090909@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Whenever a B-tree index scan fetches a heap tuple that turns out to be
dead, the B-tree item is marked as killed by calling _bt_killitems. When
the page gets full, all the killed items are removed by calling
_bt_vacuum_one_page.

That's a problem for hot standby. If any of the killed b-tree items
point to a tuple that is still visible to a running read-only
transaction, we have the same situation as with vacuum, and have to
either wait for the read-only transaction to finish before applying the
WAL record or kill the transaction.

It looks like there's some cosmetic changes related to that in the
patch, the signature of _bt_delitems is modified, but there's no actual
changes that would handle that situation. I didn't see it on the TODO on
the hot standby wiki either. Am I missing something, or the patch?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 09:38:44
Message-ID:	1229679524.4793.481.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 10:49 +0200, Heikki Linnakangas wrote:

> Whenever a B-tree index scan fetches a heap tuple that turns out to be
> dead, the B-tree item is marked as killed by calling _bt_killitems. When
> the page gets full, all the killed items are removed by calling
> _bt_vacuum_one_page.
>
> That's a problem for hot standby. If any of the killed b-tree items
> point to a tuple that is still visible to a running read-only
> transaction, we have the same situation as with vacuum, and have to
> either wait for the read-only transaction to finish before applying the
> WAL record or kill the transaction.
>
> It looks like there's some cosmetic changes related to that in the
> patch, the signature of _bt_delitems is modified, but there's no actual
> changes that would handle that situation. I didn't see it on the TODO on
> the hot standby wiki either. Am I missing something, or the patch?

ResolveRedoVisibilityConflicts() describes the current patch's position
on this point, which on review is wrong, I agree.

It looks like I assumed that _bt_delitems is only called during VACUUM,
which I knew it wasn't. I know I was going to split XLOG_BTREE_VACUUM
into two record types at one point, one for delete, one for vacuum. In
the end I didn't. Anyhow, its wrong.

We have infrastructure in place to make this work correctly, just need
to add latestRemovedXid field to xl_btree_vacuum. So that part is easily
solved.

Thanks for spotting it. More like that please!

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 10:24:52
Message-ID:	494B7674.5040804@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> We have infrastructure in place to make this work correctly, just need
> to add latestRemovedXid field to xl_btree_vacuum. So that part is easily
> solved.

That's tricky because there's no xmin/xmax on index tuples. You could
conservatively use OldestXmin as latestRemovedXid, but that could stall
the WAL redo a lot more than necessary. Or you could store
latestRemovedXid in the page header, but that would need to be
WAL-logged to ensure that it's valid after crash. Or you could look at
the heap to fetch the xmin/xmax, but that would be expensive.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 10:52:42
Message-ID:	1229683962.4793.504.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 12:24 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > We have infrastructure in place to make this work correctly, just need
> > to add latestRemovedXid field to xl_btree_vacuum. So that part is easily
> > solved.
>
> That's tricky because there's no xmin/xmax on index tuples.

Doh.

> You could
> conservatively use OldestXmin as latestRemovedXid, but that could stall
> the WAL redo a lot more than necessary. Or you could store
> latestRemovedXid in the page header, but that would need to be
> WAL-logged to ensure that it's valid after crash. Or you could look at
> the heap to fetch the xmin/xmax, but that would be expensive.

Agreed. Probably need to use OldestXmin then.

If I was going to add anything to the btree page header, it would be
latestRemovedLSN, only set during recovery. That way we don't have to
explicitly kill queries, we can do the a wait on OldestXmin then let
them ERROR out when they find a page that has been modified.

I have a suspicion that we may need some modification of that solution
for all data blocks, so we don't kill too many queries.

Hmmm. I wonder if we can track latestRemovedLSN for all of
shared_buffers. That was initially rejected, but if we set the
latestRemovedLSN to be the block's LSN when we read it in, that would be
fairly useful. Either way we use 8 bytes RAM per buffer.

BTW, I noticed the other day that Oracle 11g only allows you to have a
read only slave *or* allows you to continue replaying. You need to
manually switch back and forth between those modes. They can't do
*both*, as Postgres will be able to do. That's because their undo
information is stored off-block in the Undo Tablespace, so is not
available for standby queries. Nice one, Postgres.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 14:22:59
Message-ID:	2F3DEB6C-88E6-4F3A-9F08-C650A228DC8D@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I'm confused shouldn't read-only transactions on the slave just be
hacked to not set any hint bits including lp_delete?

--
Greg

On 19 Dec 2008, at 03:49, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com
> wrote:

> Whenever a B-tree index scan fetches a heap tuple that turns out to
> be dead, the B-tree item is marked as killed by calling
> _bt_killitems. When the page gets full, all the killed items are
> removed by calling _bt_vacuum_one_page.
>
> That's a problem for hot standby. If any of the killed b-tree items
> point to a tuple that is still visible to a running read-only
> transaction, we have the same situation as with vacuum, and have to
> either wait for the read-only transaction to finish before applying
> the WAL record or kill the transaction.
>
> It looks like there's some cosmetic changes related to that in the
> patch, the signature of _bt_delitems is modified, but there's no
> actual changes that would handle that situation. I didn't see it on
> the TODO on the hot standby wiki either. Am I missing something, or
> the patch?
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 15:34:56
Message-ID:	1229700896.4793.544.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 09:22 -0500, Greg Stark wrote:

> I'm confused shouldn't read-only transactions on the slave just be
> hacked to not set any hint bits including lp_delete?

They could be, though I see no value in doing so.

But that is not Heikki's point. He is discussing what happens on the
primary and the effects that must then occur on the standby. He has
rightly pointed out a (pluggable) hole in my logic.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	"PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 17:54:18
Message-ID:	494B8B69.EE98.0025.0@wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>>> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:

> If I was going to add anything to the btree page header, it would be
> latestRemovedLSN, only set during recovery. That way we don't have
to
> explicitly kill queries, we can do the a wait on OldestXmin then let
> them ERROR out when they find a page that has been modified.
>
> I have a suspicion that we may need some modification of that
solution
> for all data blocks, so we don't kill too many queries.

If the failure is caused by the timing of various database
transactions, and the query is likely to run successfully after a
delay and a retry, please use SQLSTATE of '40001'. Some software
(ours, for one) will recognize this and retry the query automatically,
so that the user impact is essentially the same as blocking.

-Kevin

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 18:06:20
Message-ID:	1229709980.4793.564.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 11:54 -0600, Kevin Grittner wrote:
> >>> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
>
> > If I was going to add anything to the btree page header, it would be
> > latestRemovedLSN, only set during recovery. That way we don't have
> to
> > explicitly kill queries, we can do the a wait on OldestXmin then let
> > them ERROR out when they find a page that has been modified.
> >
> > I have a suspicion that we may need some modification of that
> solution
> > for all data blocks, so we don't kill too many queries.
>
> If the failure is caused by the timing of various database
> transactions, and the query is likely to run successfully after a
> delay and a retry, please use SQLSTATE of '40001'. Some software
> (ours, for one) will recognize this and retry the query automatically,
> so that the user impact is essentially the same as blocking.

I understand the need, but we won't be using SQLSTATE = 40001.

That corresponds to ERRCODE_T_R_SERIALIZATION_FAILURE, which that error
would not be.

The error message ought to be "snapshot too old", which could raise a
chuckle, so I called it something else.

The point you raise is a good one and I think we should publish a list
of retryable error messages. I contemplated once proposing a special log
level for a retryable error, but not quite a good idea.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 18:09:37
Message-ID:	1229710177.4793.567.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 10:52 +0000, Simon Riggs wrote:

> > You could
> > conservatively use OldestXmin as latestRemovedXid, but that could stall
> > the WAL redo a lot more than necessary. Or you could store
> > latestRemovedXid in the page header, but that would need to be
> > WAL-logged to ensure that it's valid after crash. Or you could look at
> > the heap to fetch the xmin/xmax, but that would be expensive.
>
> Agreed. Probably need to use OldestXmin then.

Just finished coding this up, plus TODO item to pin every index page.
Will post after some further testing.

Used RecentOldestXmin.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 18:53:28
Message-ID:	494B9948.EE98.0025.0@wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>>> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:

> I understand the need, but we won't be using SQLSTATE = 40001.
>
> That corresponds to ERRCODE_T_R_SERIALIZATION_FAILURE, which that
error
> would not be.

Isn't it a problem with serialization of database transactions? You
hit it in a different way, but if it is a temporary failure due to the
timing of the transactions, I strongly feel that that is the correct
SQLSTATE to use. Perhaps more information to provide any useful
context could be in the info or hint areas?

-Kevin

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 18:59:33
Message-ID:	877i5wkn3e.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> The error message ought to be "snapshot too old", which could raise a
> chuckle, so I called it something else.
>
> The point you raise is a good one and I think we should publish a list
> of retryable error messages. I contemplated once proposing a special log
> level for a retryable error, but not quite a good idea.

I'm a bit concerned about the idea of killing off queries to allow WAL to
proceed. While I have nothing against that being an option I think we should
be aiming to make it not necessary for correctness and not the default. By
default I think WAL replay should stick to stalling WAL replay and only resort
to killing queries if the user specifically requests it.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 19:37:24
Message-ID:	1229715444.4793.584.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 18:59 +0000, Gregory Stark wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>
> > The error message ought to be "snapshot too old", which could raise a
> > chuckle, so I called it something else.
> >
> > The point you raise is a good one and I think we should publish a list
> > of retryable error messages. I contemplated once proposing a special log
> > level for a retryable error, but not quite a good idea.
>
> I'm a bit concerned about the idea of killing off queries to allow WAL to
> proceed. While I have nothing against that being an option I think we should
> be aiming to make it not necessary for correctness and not the default. By
> default I think WAL replay should stick to stalling WAL replay and only resort
> to killing queries if the user specifically requests it.

Increasing the waiting time increases the failover time and thus
decreases the value of the standby as an HA system. Others value high
availability higher than you and so we had agreed to provide an option
to allow the max waiting time to be set.

max_standby_delay is set in recovery.conf, value 0 (forever) - 2,000,000
secs, settable in milliseconds. So think of it like a deadlock detector
for recovery apply.

Also, there is a set of functions to control the way recovery proceeds,
much as you might control an MP3 player (start, stop, pause). There ares
also functions to pause at specific xids, pause at specific time, pause
at the next cleanup record. That allows you to set the max_standby_delay
lower and then freeze the server for longer to run a long query if
required. It also allows you to do PITR by trial and error rather than
one shot specify-in-advance settings. There is a function to manually
end recovery at a useful place if desired.

I hope your needs and wishes are catered for by that?

(I have a Plan B in case we need it during wider user testing, as
explained up thread.)

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 19:47:10
Message-ID:	494BA5DE.EE98.0025.0@wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>>> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:

> max_standby_delay is set in recovery.conf, value 0 (forever) -
2,000,000
> secs, settable in milliseconds. So think of it like a deadlock
detector
> for recovery apply.

Aha! A deadlock is a type of serialization failure. (In fact, on
databases with lock-based concurrency control rather than MVCC, it can
be the ONLY type of serialization failure.)

-Kevin

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 20:09:40
Message-ID:	1229717380.4793.617.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 13:47 -0600, Kevin Grittner wrote:
> >>> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
>
> > max_standby_delay is set in recovery.conf, value 0 (forever) -
> 2,000,000
> > secs, settable in milliseconds. So think of it like a deadlock
> detector
> > for recovery apply.
>
> Aha! A deadlock is a type of serialization failure. (In fact, on
> databases with lock-based concurrency control rather than MVCC, it can
> be the ONLY type of serialization failure.)

The SQL Standard specifically names this error as thrown when "it
detects the inability to guarantee the serializability of two or more
concurrent SQL-transactions". Now that really should only apply when
running with SERIALIZABLE transactions, but I grant you the standard
doesn't explicitly say that.

You give me the strange sense that you want this because of some quirk
in your software, rather than an overwhelming desire to see these two
situations described the same.

I guess making it that SQLSTATE would make it simpler to understand why
the error occurs and also how to handle it (i.e. resubmit). So there
probably is a wide argument for making developers jobs a little easier
by doing it. i.e. usability will be improved if we do that.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 20:17:44
Message-ID:	873agjly1j.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> Increasing the waiting time increases the failover time and thus
> decreases the value of the standby as an HA system. Others value high
> availability higher than you and so we had agreed to provide an option
> to allow the max waiting time to be set.

Sure, it's a nice option to have. But I think the default should be to pause
WAL replay.

The question I had was whether your solution for btree pointers marked dead
and later dropped from the index works when the user hasn't configured a
timeout and doesn't want standby queries killed.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 20:19:01
Message-ID:	87y6ybkjey.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:

>>>> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
>
>> max_standby_delay is set in recovery.conf, value 0 (forever) -
> 2,000,000
>> secs, settable in milliseconds. So think of it like a deadlock
> detector
>> for recovery apply.
>
> Aha! A deadlock is a type of serialization failure. (In fact, on
> databases with lock-based concurrency control rather than MVCC, it can
> be the ONLY type of serialization failure.)

I think the fundamental difference is that a deadlock or serialization failure
can be predicted as a potential problem when writing the code. This is
something that can happen for any query any time, even plain old read-only
select queries.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 20:23:57
Message-ID:	494BAE7D.EE98.0025.0@wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>>> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:

> The SQL Standard specifically names this error as thrown when "it
> detects the inability to guarantee the serializability of two or
more
> concurrent SQL-transactions". Now that really should only apply when
> running with SERIALIZABLE transactions,

I disagree. Database integrity could not be guaranteed without
detection of conflicting modification in READ COMMITTED on up, and
this is the normal means of indicating these problems.

> but I grant you the standard doesn't explicitly say that.

I think that's intentional.

> You give me the strange sense that you want this because of some
quirk
> in your software, rather than an overwhelming desire to see these
two
> situations described the same.

Well, we are very unlikely to ever use this feature, so it's not
really something I care about for us; it just struck me that there may
be others that care about categorizing errors accurately according the
the SQL standard, and that what you were describing sounded like a new
type of serialization failure in the PostgreSQL environment, and
should be classified that way.

The primary quirkiness of our software is that it needs to be able to
run with a number of different database products, and we do want to
take advantage of whatever information is available in a portable
format. This is not the only standard SQLSTATE we look for and handle
appropriately for the documented meaning, but it is an important one,
as it has simplified application programming and reduced the confusing
error messages which reach our end users.

> I guess making it that SQLSTATE would make it simpler to understand
why
> the error occurs and also how to handle it (i.e. resubmit).

Precisely.

-Kevin

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 20:32:46
Message-ID:	494BB08E.EE98.0025.0@wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>>> Gregory Stark <stark(at)enterprisedb(dot)com> wrote:

> I think the fundamental difference is that a deadlock or
serialization
> failure
> can be predicted as a potential problem when writing the code. This
is
> something that can happen for any query any time, even plain old
read-only
> select queries.

I've heard that on Oracle it is (or at least was) possible to get a
serialization failure on a single SELECT statement which was the only
user-requested activity on the system, because it could conflict with
automatic maintenance operations.

In Sybase and Microsoft databases it is definitely possible for a
plain old read-only SELECT statement to be a deadlock victim (reported
as a serialization failure) if some of the data it is referencing is
being updated concurrently. In these (and many other) products, a
lock must be acquired before a row can be read. Imagine, the SELECT
locks a row against updates, another transaction locks some other row
against any access, then the UPDATE tries to change the row locked by
the SELECT while the SELECT tries to read the row locked by the
UPDATE.

PostgreSQL is much less prone to serialization failures, but it is
certainly understandable if hot standby replication introduces new
cases of it.

-Kevin

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 20:54:48
Message-ID:	87tz8zkhrb.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:

> PostgreSQL is much less prone to serialization failures, but it is
> certainly understandable if hot standby replication introduces new
> cases of it.

In this case it will be possible to get this error even if you're just running
a single SELECT query -- and that's the *only* query in the database at all.

A vacuum being replayed -- even in a different database -- could trigger the
error. Or with the btree split issue, a data load -- again even in a different
database -- would be quite likely cause your SELECT to be killed.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-19 21:16:26
Message-ID:	494BBACA.EE98.0025.0@wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>>> Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>
>> PostgreSQL is much less prone to serialization failures, but it is
>> certainly understandable if hot standby replication introduces new
>> cases of it.
>
> In this case it will be possible to get this error even if you're
just
> running
> a single SELECT query -- and that's the *only* query in the database
at all.
>
> A vacuum being replayed -- even in a different database -- could
trigger the
> error. Or with the btree split issue, a data load -- again even in a
different
> database -- would be quite likely cause your SELECT to be killed.

OK. Does that make serialization failure a bad description of the
problem?

If these steps are serialized (run one after the other), is there a
problem? It just seems that the hot standby near-synchronous
replication creates a situation where tasks on multiple, linked
databases might need to be serialized.

It does seem like it will be important to provide as much information
to the user about what's causing the problem, and hints about what to
do. PostgreSQL has nice features for that, though.

Since I have no vested interest here, I'm not inclined to belabor the
point. I was really just trying to make sure the feature was as
useful as possible to others, some of whom might be looking for
standard SQLSTATE values to help the software take the right course.
If others feel the 40001 code would confuse more than enlighten, I'll
respect that.

-Kevin

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 00:29:16
Message-ID:	200812191929.16606.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Friday 19 December 2008 05:52:42 Simon Riggs wrote:
> BTW, I noticed the other day that Oracle 11g only allows you to have a
> read only slave *or* allows you to continue replaying. You need to
> manually switch back and forth between those modes. They can't do
> *both*, as Postgres will be able to do. That's because their undo
> information is stored off-block in the Undo Tablespace, so is not
> available for standby queries. Nice one, Postgres.
>

I think this is true for physical replay, but Oracle also offers the option to
do logical replay (where transaction logs are converted into sql and run
against the standby; i believe this is similar to what continuant is trying
to do with thier latest offering). In that scenario you can do read and
replay at the same time, though I think there are some conflicts possible;
fewer than what postgres will have, since I think most of thier DDL can be
done online. (This might require some extra modules / high end version of
Oracle, please consult your local Oracle wizard for more details)

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 00:36:42
Message-ID:	1229733402.4793.657.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 20:54 +0000, Gregory Stark wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>
> > PostgreSQL is much less prone to serialization failures, but it is
> > certainly understandable if hot standby replication introduces new
> > cases of it.
>
> In this case it will be possible to get this error even if you're just running
> a single SELECT query -- and that's the *only* query in the database at all.
>
> A vacuum being replayed -- even in a different database -- could trigger the
> error. Or with the btree split issue, a data load -- again even in a different
> database -- would be quite likely cause your SELECT to be killed.

Quite likely? "You're all doomed I say!", his eyes rolling wildly. :-)

The standby is an extension of the primary and is quite literally
running the same transactions. This "only query" idea isn't the right
way to think about it. It's fairly easily possible to predict it will
happen and it will happen only in same database as transactions on the
primary. And as we just said, you can control whether and/or after how
long this will happen in some detail.

Industry context: In the worst case this is as bad as Oracle 11g. In
many/most cases it is much better.

Perhaps we should listen to the people that have said they don't want
queries cancelled, even if the alternative is inconsistent answers. That
is easily possible yet is not currently an option. Plus we have the
option I referred to up thread, which is to defer query cancel until the
query reads a modified data block. I'm OK with implementing either of
those, as non-default options. Do we need those options or are we ok?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 00:44:05
Message-ID:	1229733845.4793.660.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 19:29 -0500, Robert Treat wrote:
> On Friday 19 December 2008 05:52:42 Simon Riggs wrote:
> > BTW, I noticed the other day that Oracle 11g only allows you to have a
> > read only slave *or* allows you to continue replaying. You need to
> > manually switch back and forth between those modes. They can't do
> > *both*, as Postgres will be able to do. That's because their undo
> > information is stored off-block in the Undo Tablespace, so is not
> > available for standby queries. Nice one, Postgres.
> >
>
> I think this is true for physical replay, but Oracle also offers the option to
> do logical replay (where transaction logs are converted into sql and run
> against the standby; i believe this is similar to what continuant is trying
> to do with thier latest offering). In that scenario you can do read and
> replay at the same time, though I think there are some conflicts possible;
> fewer than what postgres will have, since I think most of thier DDL can be
> done online.

That is also an option I have argued that we need, BTW.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 07:21:15
Message-ID:	494C9CEB.5000605@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>
>> Increasing the waiting time increases the failover time and thus
>> decreases the value of the standby as an HA system. Others value high
>> availability higher than you and so we had agreed to provide an option
>> to allow the max waiting time to be set.
>
> Sure, it's a nice option to have. But I think the default should be to pause
> WAL replay.

I think I agree that pausing should be the default. If for no other
reason, because I can't think of a good default for max_standby_delay.

It would be nice to have a setting to specify the max. amount of
unapplied WAL before killing queries. When the primary isn't doing much,
you might want wait longer before killing queries, and if you're falling
behind a lot, you might want to kill queries more aggressively to catch
up. I guess that doesn't quite fit the current architecture; you'd need
to peek ahead to see how much unapplied WAL there is.

> The question I had was whether your solution for btree pointers marked dead
> and later dropped from the index works when the user hasn't configured a
> timeout and doesn't want standby queries killed.

Yes, it's not any different from vacuum WAL records.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 07:25:27
Message-ID:	494C9DE7.1060505@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> Gregory Stark wrote:
>> The question I had was whether your solution for btree pointers marked
>> dead
>> and later dropped from the index works when the user hasn't configured a
>> timeout and doesn't want standby queries killed.
>
> Yes, it's not any different from vacuum WAL records.

No wait, there is a nasty corner-case. When an index tuple is marked as
killed, no WAL record is written. Since there's now WAL record, it won't
be killed in the slave yet. But if we take a full-page image of that
page later for some other operation, the LP_DEAD flag is included in the
full-page image. If the flag sneaks into the slave without an explicit
WAL record like that, there's no latestRemovedXid for the slave to wait on.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 09:02:39
Message-ID:	1229763759.4793.663.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> >
> >> Increasing the waiting time increases the failover time and thus
> >> decreases the value of the standby as an HA system. Others value high
> >> availability higher than you and so we had agreed to provide an option
> >> to allow the max waiting time to be set.
> >
> > Sure, it's a nice option to have. But I think the default should be to pause
> > WAL replay.
>
> I think I agree that pausing should be the default. If for no other
> reason, because I can't think of a good default for max_standby_delay.

I would rather err on the side of caution. If we do as you suggest,
somebody will lose their database and start shouting "stupid default".
So I would suggest we set it to say 5 seconds to start with and let
people that read the manual set it higher, or at least read the manual
after they receive their first query cancellation.

> It would be nice to have a setting to specify the max. amount of
> unapplied WAL before killing queries.

Agreed.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 09:10:21
Message-ID:	494CB67D.9070604@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote:
>> Gregory Stark wrote:
>>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>>
>>>> Increasing the waiting time increases the failover time and thus
>>>> decreases the value of the standby as an HA system. Others value high
>>>> availability higher than you and so we had agreed to provide an option
>>>> to allow the max waiting time to be set.
>>> Sure, it's a nice option to have. But I think the default should be to pause
>>> WAL replay.
>> I think I agree that pausing should be the default. If for no other
>> reason, because I can't think of a good default for max_standby_delay.
>
> I would rather err on the side of caution. If we do as you suggest,
> somebody will lose their database and start shouting "stupid default".

Even if we stop applying the WAL, it should still be archived safely,
right? So no data should be lost, although the standby can fall very
much behind, and it can take a while to catch up.

> So I would suggest we set it to say 5 seconds to start with and let
> people that read the manual set it higher, or at least read the manual
> after they receive their first query cancellation.

I don't feel strongly either way...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 20:07:35
Message-ID:	494D5087.3090401@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> A vacuum being replayed -- even in a different database -- could trigger the
> error. Or with the btree split issue, a data load -- again even in a different
> database -- would be quite likely cause your SELECT to be killed.

Hmm, I wonder if we should/could track the "latestRemovedXid" separately
for each database. There's no reason why we need to kill a read-only
query in database X when a table in database Y is vacuumed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-20 23:09:54
Message-ID:	20081220230954.GC3989@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> Gregory Stark wrote:
>> A vacuum being replayed -- even in a different database -- could trigger the
>> error. Or with the btree split issue, a data load -- again even in a different
>> database -- would be quite likely cause your SELECT to be killed.
>
> Hmm, I wonder if we should/could track the "latestRemovedXid" separately
> for each database. There's no reason why we need to kill a read-only
> query in database X when a table in database Y is vacuumed.

What about shared catalogs?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-21 08:19:25
Message-ID:	494DFC0D.4000108@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> Gregory Stark wrote:
>>> A vacuum being replayed -- even in a different database -- could trigger the
>>> error. Or with the btree split issue, a data load -- again even in a different
>>> database -- would be quite likely cause your SELECT to be killed.
>> Hmm, I wonder if we should/could track the "latestRemovedXid" separately
>> for each database. There's no reason why we need to kill a read-only
>> query in database X when a table in database Y is vacuumed.
>
> What about shared catalogs?

True, vacuums on shared catalogs would affect read-only queries on all
databases.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-21 23:40:31
Message-ID:	20081221234031.GA8720@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>> Gregory Stark wrote:
>>>> A vacuum being replayed -- even in a different database -- could trigger the
>>>> error. Or with the btree split issue, a data load -- again even in a different
>>>> database -- would be quite likely cause your SELECT to be killed.
>>> Hmm, I wonder if we should/could track the "latestRemovedXid"
>>> separately for each database. There's no reason why we need to kill
>>> a read-only query in database X when a table in database Y is
>>> vacuumed.
>>
>> What about shared catalogs?
>
> True, vacuums on shared catalogs would affect read-only queries on all
> databases.

Maybe it's possible to track latestRemovedXid for each database, and
additionally another counter that tracks vacuums on shared catalogs.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-23 08:21:15
Message-ID:	1230020475.4793.699.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2008-12-20 at 22:07 +0200, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > A vacuum being replayed -- even in a different database -- could trigger the
> > error. Or with the btree split issue, a data load -- again even in a different
> > database -- would be quite likely cause your SELECT to be killed.
>
> Hmm, I wonder if we should/could track the "latestRemovedXid" separately
> for each database. There's no reason why we need to kill a read-only
> query in database X when a table in database Y is vacuumed.

Already implemented in code. see ResolveRedoVisibilityConflicts()

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-23 08:21:23
Message-ID:	1230020483.4793.701.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2008-12-20 at 20:09 -0300, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
> > Gregory Stark wrote:
> >> A vacuum being replayed -- even in a different database -- could trigger the
> >> error. Or with the btree split issue, a data load -- again even in a different
> >> database -- would be quite likely cause your SELECT to be killed.
> >
> > Hmm, I wonder if we should/could track the "latestRemovedXid" separately
> > for each database. There's no reason why we need to kill a read-only
> > query in database X when a table in database Y is vacuumed.
>
> What about shared catalogs?

Hot Standby already covers this special case. Patch uses
GetCurrentVirtualXIDs(), which treats a dbOid of 0 to match against all
databases.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-23 12:52:37
Message-ID:	1230036757.4793.803.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 14:23 -0600, Kevin Grittner wrote:

> > I guess making it that SQLSTATE would make it simpler to understand
> why
> > the error occurs and also how to handle it (i.e. resubmit).
>
> Precisely.

Just confirming I will implement the SQLSTATE as requested.

I recognize my own and Greg's arguments that the match is not perfect,
but the Standard isn't clear on this and Kevin's interpretation is the
more useful behaviour for developers.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 04:59:19
Message-ID:	200812232359.19863.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Saturday 20 December 2008 04:10:21 Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote:
> >> Gregory Stark wrote:
> >>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> >>>> Increasing the waiting time increases the failover time and thus
> >>>> decreases the value of the standby as an HA system. Others value high
> >>>> availability higher than you and so we had agreed to provide an option
> >>>> to allow the max waiting time to be set.
> >>>
> >>> Sure, it's a nice option to have. But I think the default should be to
> >>> pause WAL replay.
> >>
> >> I think I agree that pausing should be the default. If for no other
> >> reason, because I can't think of a good default for max_standby_delay.
> >
> > I would rather err on the side of caution. If we do as you suggest,
> > somebody will lose their database and start shouting "stupid default".
>
> Even if we stop applying the WAL, it should still be archived safely,
> right? So no data should be lost, although the standby can fall very
> much behind, and it can take a while to catch up.
>

I was thinking the condition Simon was concerned about was that on a very busy
slave with wal delay, you could theoretically fill up the disks and destroy
the slave. With query cancel, you might be annoyed to see the queries
canceled, but theres no way that you would destroy the slave. (That might not
have been what he meant though)

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 04:59:24
Message-ID:	200812232359.24699.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Friday 19 December 2008 19:36:42 Simon Riggs wrote:
> Perhaps we should listen to the people that have said they don't want
> queries cancelled, even if the alternative is inconsistent answers. That
> is easily possible yet is not currently an option. Plus we have the
> option I referred to up thread, which is to defer query cancel until the
> query reads a modified data block. I'm OK with implementing either of
> those, as non-default options. Do we need those options or are we ok?
>

Haven't seen any feed back on this, but I think the two options of cancel
query for replay, and pause replay for queries, are probably enough for a
first go around (especially if you can get the query canceling to work only
when changes are made to the specific database in question)

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 11:11:51
Message-ID:	1230117111.4793.1096.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2008-12-23 at 23:59 -0500, Robert Treat wrote:
> On Friday 19 December 2008 19:36:42 Simon Riggs wrote:
> > Perhaps we should listen to the people that have said they don't want
> > queries cancelled, even if the alternative is inconsistent answers. That
> > is easily possible yet is not currently an option. Plus we have the
> > option I referred to up thread, which is to defer query cancel until the
> > query reads a modified data block. I'm OK with implementing either of
> > those, as non-default options. Do we need those options or are we ok?
> >
>
> Haven't seen any feed back on this, but I think the two options of cancel
> query for replay, and pause replay for queries, are probably enough for a
> first go around (especially if you can get the query canceling to work only
> when changes are made to the specific database in question)

Thanks for picking up on this. This question is the #1 usability issue
for Hot Standby, since at least May 2008. There are many potential
additions and we need to track this carefully over the next few months
to see if we have it just right. I'll take viewpoints at any time on
that; this door is never closed, though tempus fugit.

Greg and Heikki have highlighted in this thread some aspects of btree
garbage collection that will increase the chance of queries being
cancelled in various circumstances. If this is important enough to
trigger additional actions then we need to highlight that now so we have
time to take those corrective actions.

I've listened to many different viewpoints on and off list. Everybody
takes a slightly different angle on it and I'm in favour of giving
everybody what they want with the right set of options.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 11:18:50
Message-ID:	2e78013d0812240318m3b884a41v7e52232865664ff2@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>
> Greg and Heikki have highlighted in this thread some aspects of btree
> garbage collection that will increase the chance of queries being
> cancelled in various circumstances

Even HOT-prune may lead to frequent query cancellations and unlike
VACUUM there is no way user can control the frequency of prune
operations.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 11:56:31
Message-ID:	1230119791.4793.1116.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2008-12-24 at 16:48 +0530, Pavan Deolasee wrote:
> On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >
> >
> > Greg and Heikki have highlighted in this thread some aspects of btree
> > garbage collection that will increase the chance of queries being
> > cancelled in various circumstances
>
> Even HOT-prune may lead to frequent query cancellations and unlike
> VACUUM there is no way user can control the frequency of prune
> operations.

The patch does go to some trouble to handle that case, as I'm sure
you've seen. Are you saying that part of the patch is ineffective and
should be removed, or?

Should/could there be a way to control frequency of prune operations? We
could maintain cleanupxmin as a constant minimum distance from xmax, for
example.

Are we saying we should take further measures, as I asked upthread? If
it is a consensus that I take some action, then I will.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 12:26:02
Message-ID:	2e78013d0812240426m42e03af1s93b9b1d32ba02c92@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>

>
> The patch does go to some trouble to handle that case, as I'm sure
> you've seen. Are you saying that part of the patch is ineffective and
> should be removed, or?
>

Umm.. are you talking about the "wait" mechanism ? That's the only
thing I remember. Otherwise, prune record is pretty much same as any
vacuum cleanup record.

> Should/could there be a way to control frequency of prune operations? We
> could maintain cleanupxmin as a constant minimum distance from xmax, for
> example.
>

Well, there can be. But tuning any such thing might be difficult and
would have implications on the primary. I am not saying we can do
that, but we will need additional tests to see its impact.

> Are we saying we should take further measures, as I asked upthread? If
> it is a consensus that I take some action, then I will.
>

Again, I haven't seen how frequently queries may get canceled. Or if
the delay is set to a large value, how far behind standby may get
during replication, so I can't really comment. Have you done any tests
on a reasonable hardware and checked if moderately long read queries
can be run on standby without standby lagging behind a lot.

I would prefer to have a solution which can be smarter than canceling
all queries as soon as a cleanup record comes and timeout occurs. For
example, if the queries are being run on a completely different set of
tables where as the updates/deletes are happening on another set of
tables, there is no reason why those queries should be canceled. I
think it would be very common to have large history tables which may
receive long read-only queries, but no updates/deletes. Whereas other
frequently updated tables which receive very few queries on the
standby.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 13:48:04
Message-ID:	1230126484.4793.1135.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote:
> On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >
>
> >
> > The patch does go to some trouble to handle that case, as I'm sure
> > you've seen. Are you saying that part of the patch is ineffective and
> > should be removed, or?
> >
>
> Umm.. are you talking about the "wait" mechanism ? That's the only
> thing I remember. Otherwise, prune record is pretty much same as any
> vacuum cleanup record.

With respect, I was hoping you might look in the patch and see if you
agree with the way it is handled. No need to remember. The whole
latestRemovedXid concept is designed to do help.

> > Should/could there be a way to control frequency of prune operations? We
> > could maintain cleanupxmin as a constant minimum distance from xmax, for
> > example.
> >
>
> Well, there can be. But tuning any such thing might be difficult and
> would have implications on the primary. I am not saying we can do
> that, but we will need additional tests to see its impact.
>
> > Are we saying we should take further measures, as I asked upthread? If
> > it is a consensus that I take some action, then I will.
> >
>
> Again, I haven't seen how frequently queries may get canceled. Or if
> the delay is set to a large value, how far behind standby may get
> during replication, so I can't really comment. Have you done any tests
> on a reasonable hardware and checked if moderately long read queries
> can be run on standby without standby lagging behind a lot.

Queries get cancelled if data they need to see if removed and the
max_standby_delay expires. So lag will be max_standby_delay, by
definition.

Not sure what further tests would show. Queries that run for longer than
max_standby delay plus mean time between cleanup records will currently
end up being cancelled.

> I would prefer to have a solution which can be smarter than canceling
> all queries as soon as a cleanup record comes and timeout occurs.

Currently, it was the consensus view that queries should be cancelled,
though there are other options still on the table.

It's discussed in Design Notes on the Wiki. "Simply ignoring WAL removal
has been discussed and rejected (so far).
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php
Explicitly defining the tables a transaction wishes to see has also been
discussed and rejected (so far).
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php"

> For
> example, if the queries are being run on a completely different set of
> tables where as the updates/deletes are happening on another set of
> tables, there is no reason why those queries should be canceled. I
> think it would be very common to have large history tables which may
> receive long read-only queries, but no updates/deletes. Whereas other
> frequently updated tables which receive very few queries on the
> standby.

There is currently no way to tell which tables a query will touch during
the course of its execution. Nor is there likely to be because of
user-defined volatile functions.

I attempted to find ways to explicitly limit the set of tables over
which a query might venture, but that cam to nothing also.

We've also discussed storing lastCleanedLSN for each buffer, so queries
can cancel themselves if they need to read a buffer that has had data
removed from it that they would have needed to see. I'll write that up
also.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 14:59:50
Message-ID:	200812240959.51222.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wednesday 24 December 2008 08:48:04 Simon Riggs wrote:
> On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote:
> > Again, I haven't seen how frequently queries may get canceled. Or if
> > the delay is set to a large value, how far behind standby may get
> > during replication, so I can't really comment. Have you done any tests
> > on a reasonable hardware and checked if moderately long read queries
> > can be run on standby without standby lagging behind a lot.
>
> Queries get cancelled if data they need to see if removed and the
> max_standby_delay expires. So lag will be max_standby_delay, by
> definition.
>
> Not sure what further tests would show. Queries that run for longer than
> max_standby delay plus mean time between cleanup records will currently
> end up being cancelled.
>

I think the uncertainty comes from peoples experience with typical replication
use cases vs a lack of experience with this current implementation.

One such example is that it is pretty common to use read-only slaves to do
horizontal scaling of read queries across a bunch of slaves. This is not the
scenario of running reporting queries on a second machine to lower load; you
would be running a large number of read-only, relativly short, oltp-ish
queries (think pg_benchs select only test i suppose), but you also have a
fairly regular stream of inserts/updates going on with these same tables, its
just you have 95/5 split of read/write (or similar).

This is standard practice in things like mysql or using slony or what have
you. I suspect it's one of the first things people are going to want to do
with hot standby. But it's unclear how well this will work because we don't
have any experience with it yet, coupled with the two downsides being
mentioned as canceled queries and replay lag, which happen to be probably the
two worst downsides you would have in the above scenario. :-)

Hmm.... I'm not sure why I didn't think of running this test before, but
read/write pg_bench on a master with pg_bench select test on slave isn't that
bad of a scenario to match the above; it might be a little too much activity
on the master, but has anyone else run such a test?

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 15:45:52
Message-ID:	1230133552.4793.1155.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2008-12-24 at 09:59 -0500, Robert Treat wrote:

> I think the uncertainty comes from peoples experience with typical replication
> use cases vs a lack of experience with this current implementation.

Quite possibly.

Publishing user feedback on this will be very important in making this a
usable feature.

I'd be very happy if you were to direct the search for optimal
usability.

> One such example is that it is pretty common to use read-only slaves to do
> horizontal scaling of read queries across a bunch of slaves. This is not the
> scenario of running reporting queries on a second machine to lower load; you
> would be running a large number of read-only, relativly short, oltp-ish
> queries (think pg_benchs select only test i suppose), but you also have a
> fairly regular stream of inserts/updates going on with these same tables, its
> just you have 95/5 split of read/write (or similar).

One thing to consider also is latency of information. Sending queries to
master or slave may return different answers if querying very recent
data.

> This is standard practice in things like mysql or using slony or what have
> you. I suspect it's one of the first things people are going to want to do
> with hot standby. But it's unclear how well this will work because we don't
> have any experience with it yet, coupled with the two downsides being
> mentioned as canceled queries and replay lag, which happen to be probably the
> two worst downsides you would have in the above scenario. :-)
>
> Hmm.... I'm not sure why I didn't think of running this test before, but
> read/write pg_bench on a master with pg_bench select test on slave isn't that
> bad of a scenario to match the above; it might be a little too much activity
> on the master, but has anyone else run such a test?
>
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-24 18:41:17
Message-ID:	2e78013d0812241041h79d25dd9g43966f285e812d28@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 24, 2008 at 7:18 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>
>
> With respect, I was hoping you might look in the patch and see if you
> agree with the way it is handled. No need to remember. The whole
> latestRemovedXid concept is designed to do help.
>

Well, that's common for all cleanup record including vacuum. But
reading your comment, it seemed as there is something special to
handle HOT prune case which I did not see. Anyways, the trouble with
HOT prune is that uples may be cleaned up very frequently and that can
lead to query cancellation at the standby. That's what I wanted to
emphasize.

>
> Queries get cancelled if data they need to see if removed and the
> max_standby_delay expires. So lag will be max_standby_delay, by
> definition.

That's per cleanup record, isn't it ?

> We've also discussed storing lastCleanedLSN for each buffer, so queries
> can cancel themselves if they need to read a buffer that has had data
> removed from it that they would have needed to see. I'll write that up
> also.
>

What if we do that at table level ? So if a query touches a table
which had cleanup activity since the query was started, it cancels
itself automatically,

Happy X'mas to all of you!

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

From:	"marcin mank" <marcin(dot)mank(at)gmail(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-26 00:42:29
Message-ID:	b1b9fac60812251642g2b2c9940h9997e0e53ad3fa16@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> Perhaps we should listen to the people that have said they don't want
> queries cancelled, even if the alternative is inconsistent answers.

I think an alternative to that would be "if the wal backlog is too
big, let current queries finish and let incoming queries wait till the
backlog gets smaller".

fell free to ignore me, as a non-hacker I`m not even supposed to be
reading this list :-]

Greetings
Marcin

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	marcin mank <marcin(dot)mank(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-29 10:45:45
Message-ID:	4958AA59.6050506@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

marcin mank wrote:
>> Perhaps we should listen to the people that have said they don't want
>> queries cancelled, even if the alternative is inconsistent answers.

I don't like that much. PostgreSQL has traditionally avoided that very
hard. It's hard to tell what kind of inconsistencies you'd get, as it'd
depend on what plan is created, when a vacuum happens to run on master etc.

> I think an alternative to that would be "if the wal backlog is too
> big, let current queries finish and let incoming queries wait till the
> backlog gets smaller".

Yeah, that makes sense too.

Many approaches have been proposed, and they all have different
tradeoffs and therefore fit different use cases. I'm not sure which ones
are/will be included in the patch. We don't need all in 8.4, one or two
simplest ones will do just fine, and we can extend later.

Let me summarize. Whenever a WAL record conflicts with a
query-in-progress, we can:

1. kill the query, or
2. wait for the query to finish
3. let the query proceed, producing invalid results.

There's some combinations of those as well. You're proposal is a
variation of 2, to avoid the problem of WAL application falling behind
indefinitely. There's also the max_standby_delay option in the patch, to
wait a while, and then kill the query.

There's some additional optimizations that can be made to make those
options less painful. Instead of killing all queries that might be
affected by a vacuum record, only kill them when they actually hit a
block that was vacuumed (Simon's idea of latestRemovedLSN field in page
header).

Another line of attack is to avoid getting into the situation in the
first place, by affecting behavior on the master. If the standby has an
online connection to the master (per the synch rep patch), it can tell
master what the slave's OldestXmin is, and master can take that into
account and not remove tuples still needed by the slave. That's not good
from high availability point of view, you don't want a hung query in the
slave to cause a long-running-transaction situation in the master, but
for other use cases it would be fine. Or we can just add a constant # of
transactions to OldestXmin in master, to get some breathing room in the
server.

The bottom line is that we have enough options to make everyone happy.
Some understanding of the issue is required to tune it properly,
however, so documentation is important.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-30 15:17:38
Message-ID:	1230650258.4793.1320.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-12-19 at 09:22 -0500, Greg Stark wrote:

> I'm confused shouldn't read-only transactions on the slave just be
> hacked to not set any hint bits including lp_delete?

It seems there are multiple issues involved and I saw only the first of
these initially. I want to explicitly separate these issues so we can
discuss them more easily.

1. When we replay an XLOG_BTREE_DELETE record, we may have to
wait-then-cancel-etc other sessions.

Possibly a pain, but these records are not very common now that we have
HOT, except on certain kinds of queue table.

2. Should we ignore the LP_DEAD flag on btree rows when we are using the
index during recovery? As Heikki points out, this hint bit is not WAL
logged, but can appear in the standby as a result of full page writes.
The LP_DEAD flags will have been set using a different xmin to the one
on the standby and would cause index rows to be ignored that should have
been included in a correct MVCC answer.

So we need to either

(a) always ignore LP_DEAD flags we see when reading index during
recovery.

(b) include an additional step to clean the full page writes to remove
LP_DEAD hints from the incoming pages.

(b) is feasible, but would need to be repeated each time a new full page
arrived, so a page may need to be re-cleaned many times. Sounds like a
bad plan, so we should choose (a).

3. Should we set LP_DELETE flag on btree rows when we are using the
index during recovery? Not much point if we are ignoring them.

There is no space for an additional flag, to distinguish between primary
and standby hint bits.

Issues (2) and (3) would go away entirely if both standby and primary
always had the same xmin value as a system-wide setting. i.e. the
standby and primary are locked together at their xmins. Perhaps that was
Heikki's intention in recent suggestions?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-30 15:20:36
Message-ID:	495A3C44.90302@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> Issues (2) and (3) would go away entirely if both standby and primary
> always had the same xmin value as a system-wide setting. i.e. the
> standby and primary are locked together at their xmins. Perhaps that was
> Heikki's intention in recent suggestions?

No, I only suggested that as an optional optimization. We can't rely on
it, because the queries on standby should still work correctly if the
connection to primary is lost for some reason, or if the primary decides
not to honor standby's xmin, perhaps to avoid the usual issues with
long-running-transactions.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-30 16:31:01
Message-ID:	495A4CC5.5050504@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> (a) always ignore LP_DEAD flags we see when reading index during
> recovery.

This sounds simplest, and it's nice to not clear the flags for the
benefit of transactions running after the recovery is done.

You have to be careful to ignore the flags in read-only transactions
that started in hot standby mode, even if recovery has since ended and
we're in normal operation now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2008-12-31 10:12:40
Message-ID:	1230718360.4032.39.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2008-12-30 at 18:31 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > (a) always ignore LP_DEAD flags we see when reading index during
> > recovery.
>
> This sounds simplest, and it's nice to not clear the flags for the
> benefit of transactions running after the recovery is done.

Agreed.

(Also: Transaction hint bits are always set correctly, because we would
only ever see a full page write with hints set after the commit/abort
record was processed. So I continue to honour transaction hint bit
reading and setting during recovery).

> You have to be careful to ignore the flags in read-only transactions
> that started in hot standby mode, even if recovery has since ended and
> we're in normal operation now.

Got that.

I'm setting ignore_killed_tuples = false at the start of any index scan
during recovery. And kill_prior_tuples is never set true when in
recovery. Both measures are AM-agnostic.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot standby and b-tree killed items
Date:	2009-01-13 19:07:05
Message-ID:	1231873625.28919.51.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2008-12-30 at 18:31 +0200, Heikki Linnakangas wrote:

> You have to be careful to ignore the flags in read-only transactions
> that started in hot standby mode, even if recovery has since ended and
> we're in normal operation now.

My initial implementation in v6 worked, but had a corner case if a
transaction spanned the change from recovery into normal processing.

I'm now done on a more complete fix, will be in v8.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support