B-tree parent pointer and checkpoints

Lists: pgsql-hackers
From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: B-tree parent pointer and checkpoints
Date: 2010-11-02 10:56:33
Message-ID: 4CCFEE61.2090702@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

We have the rm_safe_restartpoint mechanism to ensure that we don't use a
checkpoint that splits a multi-level B-tree insertion as a restart
point. But to my surprise, we don't have anything to protect against the
analogous case during normal operation. This is possible:

1. Split child page. Write WAL records for the child pages.
2. Begin and finish a checkpoint
3. Crash, before writing the WAL record of inserting the child pointer
in the parent B-tree page.
4. Recovery begins at the new checkpoint, never sees the incomplete
split, so it stays incomplete.

In practice that's pretty hard to hit, because a checkpoint takes some
time, while locking the parent page and writing the child pointer is
usually very quick. But it's possible.

It surprises me that we thought of this when we introduced
restartpoints, but this more obvious case during normal operation seems
to have been there forever. Nothing very bad happens if you lose the
parent update, but this would be nice to fix nevertheless.

I bumped into this while thinking about archive recovery - the above can
happen at archive recovery too if the checkpoint is caused by
pg_start_backup().

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written. That will close the
issue with restartpoints, archive recovery etc. as well, so we no longer
need to worry about this anywhere else than while performing an online
checkpoint.

I'm thinking of using the isCommit flag for this, to delay writing the
checkpoint record until all incomplete splits are finished. isCommit
protects against a similar race condition between writing commit record
and flushing the clog page, this race condition is similar. Will
obviously need to rename it, and double-check that it's safe: b-tree
splits take longer, and there's no critical section there like there is
in the commit codepath.

Comments?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: ghatpande(at)vsnl(dot)net
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Intelligent RDBMS
Date: 2010-11-02 11:25:38
Message-ID: e382c777fb92.4cd03b82@vsnl.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello All,
I am Vijay Ghatpande from Pune, India. I am in IT field for last 30 years and mainly in ERP design, development and implementation. I have worked on JD Edwards, Oracle Apps and SAP. I was involved in design and development of ERP package called ISP – Integrated Software for Production. Now I am independent and started my own consultancy AMS ERP Solutions and work for SME industries in and around Pune. I am technical and worked from COBO, DGL to .NET, C++. I have also worked as ORACLE, DB2 DBA for many years.

I have a dream project in my mind. It is on Relational Database Management. Pl read below paragraph from the angle of users of application developed on RDBMS.
Any application contains many tables and they are related to each other. RDBMS keeps relations. Developers, users have to know these relations to extract proper data and convert it into information. If multiple tables are involved then query can be more complex. Main problem with RDBMS is writing SQL. End users avoid SQL and thinks that this is a technical work. This is limitation of RDBMS. If SQL becomes free of table and relations then l am sure everyone will like it.

For example MS Dynamics has many tables to store the data. We can keep this data in such a way that for developer MS Dynamics will be one table say MSD and user can extract any data from that table. He need not know internal table relations. If this happens life will be easy for many. Application development will be very easy and cost effective. This can be achieved by creating views. I am thinking of intelligent database where relations are inbuilt. If this is available in the product itself then domain knowledge will be in RDBMS. If SQL becomes free of table and relations then l am sure everyone will like it. My intention is not to hide relations from users. But this will be one of the advantages of intelligent RDBMS. This will be next generation RDBMS. This is going to change the RDBMS, ERP world.

I am requesting you to give me feedback.

With Warm Regards,


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-02 14:30:14
Message-ID: 26099.1288708214@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> I think we can fix this by requiring that any multi-WAL-record actions
> that are in-progress when a checkpoint starts (at the REDO-pointer) must
> finish before the checkpoint record is written.

What happens if someone wants to start a new split while the checkpoint
is hanging fire?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-02 14:40:59
Message-ID: 4CD022FB.7000607@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02.11.2010 16:30, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> I think we can fix this by requiring that any multi-WAL-record actions
>> that are in-progress when a checkpoint starts (at the REDO-pointer) must
>> finish before the checkpoint record is written.
>
> What happens if someone wants to start a new split while the checkpoint
> is hanging fire?

You mean after CreateCheckPoint has determined the redo pointer, but
before it has written the checkpoint record? The new split can go ahead,
and the checkpoint doesn't need care about it. Recovery will start at
the redo pointer, so it will see the split record, and will know to
finish the incomplete split if necessary.

The logic is the same as with inCommit. Checkpoint will fetch the list
of in-progress splits some time after determining the redo-pointer. It
will then wait until all of those splits have finished. Any new splits
that begin after fetching the list don't affect the checkpoint.

inCommit can't be used as is, because it's tied to the Xid, but
something similar should work.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-08 13:40:10
Message-ID: 4CD7FDBA.1020506@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02.11.2010 16:40, Heikki Linnakangas wrote:
> On 02.11.2010 16:30, Tom Lane wrote:
>> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> I think we can fix this by requiring that any multi-WAL-record actions
>>> that are in-progress when a checkpoint starts (at the REDO-pointer) must
>>> finish before the checkpoint record is written.
>>
>> What happens if someone wants to start a new split while the checkpoint
>> is hanging fire?
>
> You mean after CreateCheckPoint has determined the redo pointer, but
> before it has written the checkpoint record? The new split can go ahead,
> and the checkpoint doesn't need care about it. Recovery will start at
> the redo pointer, so it will see the split record, and will know to
> finish the incomplete split if necessary.
>
> The logic is the same as with inCommit. Checkpoint will fetch the list
> of in-progress splits some time after determining the redo-pointer. It
> will then wait until all of those splits have finished. Any new splits
> that begin after fetching the list don't affect the checkpoint.
>
> inCommit can't be used as is, because it's tied to the Xid, but
> something similar should work.

Here's a first draft of this, using the inCommit flag as is. It works,
but suffers from starvation if you have a lot of concurrent
multi-WAL-record actions. I tested that by running INSERTs to a table
with tsvector field with a GiST index on it from five concurrent
sessions, and saw checkpoints regularly busy-waiting for over a minute.

To avoid that, we need something a little bit more complicated than a
boolean flag. I'm thinking of adding a counter beside the inCommit flag
that's incremented every time a new multi-WAL-record action begins, so
that the checkpoint process can distinguish between a new action that
was started after deciding the REDO pointer and an old one that's still
running.

(inCommit is a misnomer now, of course. Will need to find a better name..)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
split-delay-checkpoint-1.patch text/x-diff 14.3 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-10 18:58:07
Message-ID: 4CDAEB3F.1010502@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08.11.2010 15:40, Heikki Linnakangas wrote:
> Here's a first draft of this, using the inCommit flag as is. It works,
> but suffers from starvation if you have a lot of concurrent
> multi-WAL-record actions. I tested that by running INSERTs to a table
> with tsvector field with a GiST index on it from five concurrent
> sessions, and saw checkpoints regularly busy-waiting for over a minute.
>
> To avoid that, we need something a little bit more complicated than a
> boolean flag. I'm thinking of adding a counter beside the inCommit flag
> that's incremented every time a new multi-WAL-record action begins, so
> that the checkpoint process can distinguish between a new action that
> was started after deciding the REDO pointer and an old one that's still
> running.
>
> (inCommit is a misnomer now, of course. Will need to find a better name..)

Here's a 2nd version, with an additional counter in PGPROC to avoid
starving checkpoint in the face of a constant stream e.g GiST inserts.

The new rule is that before you start a multi-WAL-record operation that
needs to be completed at end of recovery if you crash in the middle, you
call HoldCheckpoint(), and once you're finished, ResumeCheckpoint().
rm_safe_restartpoint() is gone.

This is a pre-existing bug, but given the lack of field reports and the
fact that it's pretty darn hard to run into this in real life, I'm
inclined to not backpatch this.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
split-delay-checkpoint-2.patch text/x-diff 32.6 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-10 22:26:17
Message-ID: 12071.1289427977@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> The new rule is that before you start a multi-WAL-record operation that
> needs to be completed at end of recovery if you crash in the middle, you
> call HoldCheckpoint(), and once you're finished, ResumeCheckpoint().

What happens if you error out in between? Or is it assumed that the
*entire* sequence is a critical section? If it has to be that way,
one might wonder what's the point of trying to split it into multiple
WAL records.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-10 22:49:50
Message-ID: 12375.1289429390@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> What happens if you error out in between? Or is it assumed that the
> *entire* sequence is a critical section? If it has to be that way,
> one might wonder what's the point of trying to split it into multiple
> WAL records.

Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
a bad idea that we should just rip out.

The question that ought to be asked here, I think, is whether it
shouldn't be required that every inter-WAL-record state is a valid
consistent state that doesn't require post-crash fixups. If that
isn't the case, then a simple ERROR or FATAL exit out of the backend
that was creating the sequence originally will leave the system in
an unacceptable state. We could prevent such an exit by wrapping the
whole sequence in a critical section, but if you have to do that then
it's not apparent why you shouldn't fold it into one WAL record.

IOW, forget this patch. Take out the logic that tries to complete
pending splits during replay, instead. I believe this is perfectly safe
for btree: loss of a parent record isn't fatal, as proven by the fact
that searches don't have to be locked out while a split proceeds.
(We might want to make btree_page_del not think that a missing parent
record is an error, but it shouldn't think that anyway, because of the
possibility of a non-crashing failure during the original split.)
This approach might not be safe for GIST or GIN; but if it isn't, they
need fixes anyway.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-11 11:22:55
Message-ID: 4CDBD20F.8090807@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11.11.2010 00:49, Tom Lane wrote:
> I wrote:
>> What happens if you error out in between? Or is it assumed that the
>> *entire* sequence is a critical section? If it has to be that way,
>> one might wonder what's the point of trying to split it into multiple
>> WAL records.
>
> Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
> a bad idea that we should just rip out.
>
> The question that ought to be asked here, I think, is whether it
> shouldn't be required that every inter-WAL-record state is a valid
> consistent state that doesn't require post-crash fixups. If that
> isn't the case, then a simple ERROR or FATAL exit out of the backend
> that was creating the sequence originally will leave the system in
> an unacceptable state. We could prevent such an exit by wrapping the
> whole sequence in a critical section, but if you have to do that then
> it's not apparent why you shouldn't fold it into one WAL record.
>
> IOW, forget this patch. Take out the logic that tries to complete
> pending splits during replay, instead. I believe this is perfectly safe
> for btree: loss of a parent record isn't fatal, as proven by the fact
> that searches don't have to be locked out while a split proceeds.
> (We might want to make btree_page_del not think that a missing parent
> record is an error, but it shouldn't think that anyway, because of the
> possibility of a non-crashing failure during the original split.)
> This approach might not be safe for GIST or GIN; but if it isn't, they
> need fixes anyway.

GIN is similar to b-tree, the incomplete split logic there is for
inserting the parent pointers in the b-tree within the GIN index, just
like nbtree.

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):

> The problem with incopmleted inserts is: when new entry is installed into leaf page, all chain (in a worst case) of keys from root page to leaf should be updated. Of course, case of updating all chain is rarely and usially it's updated only part. Each key on inner pages contains union of keys (bounding box in a case of rtree, for example) on page which it points. This union can formed only with a help of user-defined function of opclass, because of GiST doesn't know something about nature of keys. Returning to WAL, GiST core write xlog entry with all nessary information for restoration before write page, but in this moment it doesn't know it should update keys on parent page or key is unchanged. So GiST's WAL restoration code should remember this page's update as incompleted insert. When insert complited, GiST's core write to log that insert is completed and restore code can clean up stored incompleted insert. If it was crash, then sign of completed insert can be absent in
log, and GiST's restore code should continue it. While continue, it's know which page was changed and should walk up to root. On each step of walk it should form union for page and insert it to parent.

Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

Do some of the other GiST algorithms get confused if there's a key on a
page that's not represented by the parent pointer? It's possible that
you insert a key to the leaf, update the leaf's immediate parent, but
crash before updating the parent's parent. As far as search is
concerned, that's OK as well, but it creates a hazard for subsequent
inserts. If you later insert a tuple with the same key to the same leaf
page, the insertion will see that the parent pointer already includes
the key, and will fail to update the parent's parent. That's a problem.

Would it be hard to change the algorithm to update the parent keys
top-down rather than bottom-up?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-11 11:24:03
Message-ID: 4CDBD253.6060304@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11.11.2010 00:49, Tom Lane wrote:
> I wrote:
>> What happens if you error out in between? Or is it assumed that the
>> *entire* sequence is a critical section? If it has to be that way,
>> one might wonder what's the point of trying to split it into multiple
>> WAL records.
>
> Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
> a bad idea that we should just rip out.
>
> The question that ought to be asked here, I think, is whether it
> shouldn't be required that every inter-WAL-record state is a valid
> consistent state that doesn't require post-crash fixups. If that
> isn't the case, then a simple ERROR or FATAL exit out of the backend
> that was creating the sequence originally will leave the system in
> an unacceptable state. We could prevent such an exit by wrapping the
> whole sequence in a critical section, but if you have to do that then
> it's not apparent why you shouldn't fold it into one WAL record.
>
> IOW, forget this patch. Take out the logic that tries to complete
> pending splits during replay, instead. I believe this is perfectly safe
> for btree: loss of a parent record isn't fatal, as proven by the fact
> that searches don't have to be locked out while a split proceeds.
> (We might want to make btree_page_del not think that a missing parent
> record is an error, but it shouldn't think that anyway, because of the
> possibility of a non-crashing failure during the original split.)
> This approach might not be safe for GIST or GIN; but if it isn't, they
> need fixes anyway.

GIN is similar to b-tree, the incomplete split logic there is for
inserting the parent pointers in the b-tree within the GIN index, just
like nbtree.

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):

> The problem with incopmleted inserts is: when new entry is installed into leaf page, all chain (in a worst case) of keys from root page to leaf should be updated. Of course, case of updating all chain is rarely and usially it's updated only part. Each key on inner pages contains union of keys (bounding box in a case of rtree, for example) on page which it points. This union can formed only with a help of user-defined function of opclass, because of GiST doesn't know something about nature of keys. Returning to WAL, GiST core write xlog entry with all nessary information for restoration before write page, but in this moment it doesn't know it should update keys on parent page or key is unchanged. So GiST's WAL restoration code should remember this page's update as incompleted insert. When insert complited, GiST's core write to log that insert is completed and restore code can clean up stored incompleted insert. If it was crash, then sign of completed insert can be absent in
log, and GiST's restore code should continue it. While continue, it's know which page was changed and should walk up to root. On each step of walk it should form union for page and insert it to parent.

Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

Do some of the other GiST algorithms get confused if there's a key on a
page that's not represented by the parent pointer? It's possible that
you insert a key to the leaf, update the leaf's immediate parent, but
crash before updating the parent's parent. As far as search is
concerned, that's OK as well, but it creates a hazard for subsequent
inserts. If you later insert a tuple with the same key to the same leaf
page, the insertion will see that the parent pointer already includes
the key, and will fail to update the parent's parent. That's a problem.

Would it be hard to change the algorithm to update the parent keys
top-down rather than bottom-up?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-11 15:16:11
Message-ID: 14543.1289488571@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> GiST is different. When you insert a key to a leaf page, you (sometimes)
> need to adjust the parent pointer to reflect the new key as well. B-tree
> tolerates incomplete splits with the 'next page' pointer, but that is
> not applicable to gist. Teodor described the issue back in 2005 when
> WAL-logging was added to GiST
> (http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):
> Reading that I wonder: what harm would an incomplete insert cause if we
> just left it in the tree? Imagine that you insert a key to a leaf page,
> but crash before updating the parent. If you search for the key starting
> from the root, you'll fail to find it, because the parent pointer claims
> that there are no entries with such a key on the child page. But that's
> OK, the inserting transaction aborted with the crash!

I think it'd be okay as far as that one entry is concerned, since as you
say it doesn't matter whether a search finds it. (We'd have to be sure
that VACUUM would still find it to remove it, of course, but that
doesn't use a normal search.) You're right that it poses a hazard of
subsequent inserts deciding that they don't need to do work on upper
levels because the lower ones look OK already. But depending on the
details of the search algorithm, this might be a non-problem: if you
remember that the upper level entries didn't cover your key when you
descended, you'd still know you need to recompute them.

Something else I just noticed is that WAL replay isn't capable of
completely fixing the index anyway:

* To complete insert we can't use basic insertion algorithm because
* during insertion we can't call user-defined support functions of opclass.
* So, we insert 'invalid' tuples without real key and do it by separate algorithm.
* 'invalid' tuple should be updated by vacuum full.

Given that there's no more vacuum full, and nobody has been expected to
run it routinely for a long time anyway, this fixup approach seems
pretty completely broken anyhow.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-11 18:12:21
Message-ID: 4CDC3205.6030001@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11.11.2010 17:16, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> GiST is different. When you insert a key to a leaf page, you (sometimes)
>> need to adjust the parent pointer to reflect the new key as well. B-tree
>> tolerates incomplete splits with the 'next page' pointer, but that is
>> not applicable to gist. Teodor described the issue back in 2005 when
>> WAL-logging was added to GiST
>> (http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):
>> Reading that I wonder: what harm would an incomplete insert cause if we
>> just left it in the tree? Imagine that you insert a key to a leaf page,
>> but crash before updating the parent. If you search for the key starting
>> from the root, you'll fail to find it, because the parent pointer claims
>> that there are no entries with such a key on the child page. But that's
>> OK, the inserting transaction aborted with the crash!
>
> I think it'd be okay as far as that one entry is concerned, since as you
> say it doesn't matter whether a search finds it. (We'd have to be sure
> that VACUUM would still find it to remove it, of course, but that
> doesn't use a normal search.) You're right that it poses a hazard of
> subsequent inserts deciding that they don't need to do work on upper
> levels because the lower ones look OK already. But depending on the
> details of the search algorithm, this might be a non-problem: if you
> remember that the upper level entries didn't cover your key when you
> descended, you'd still know you need to recompute them.

Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.

> Something else I just noticed is that WAL replay isn't capable of
> completely fixing the index anyway:
>
> * To complete insert we can't use basic insertion algorithm because
> * during insertion we can't call user-defined support functions of opclass.
> * So, we insert 'invalid' tuples without real key and do it by separate algorithm.
> * 'invalid' tuple should be updated by vacuum full.
>
> Given that there's no more vacuum full, and nobody has been expected to
> run it routinely for a long time anyway, this fixup approach seems
> pretty completely broken anyhow.

The 'invalid' tuples don't affect correctness, but are a drag on
performance, so they are similar to incomplete b-tree splits. I suspect
the overhead of an invalid gist pointer is much bigger than the overhead
of an incomplete b-tree split, though. I agree we should get rid of
that, it's not comforting to get a stream of messages in the logs saying
you should run VACUUM FULL.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-11 18:34:46
Message-ID: 18661.1289500486@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Hmm, we don't currently keep track of that when we descend the tree to
> choose the target page, but perhaps an extra Consistent call to check
> that would be acceptable. We already call Penalty for every tuple on
> each internal node on the way, so compared to that one more call should
> not be too expensive. If we do that, I think it would simplify the
> algorithm quite a bit to just update all the parents on the way down,
> instead of traversing up from the bottom after inserting the tuple to
> the leaf.

Oh, that's a really good idea, I think. But what about page splits?
I guess in the case of a split, you'd be replacing the parent entry
anyway, so having previously updated it to something larger doesn't
really cause a problem other than wasting a few cycles --- which are
probably still less than you save by not having to traverse back up.

If we supported UNIQUE GIST indexes then you could criticize this plan
on the grounds that parent entries would get uselessly enlarged before
detecting a uniqueness failure; but we don't and I know of no plans to.
So on the whole I think it sounds good. Teodor, what do you think?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-12 19:20:55
Message-ID: 4CDD9397.8040600@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11.11.2010 20:34, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Hmm, we don't currently keep track of that when we descend the tree to
>> choose the target page, but perhaps an extra Consistent call to check
>> that would be acceptable. We already call Penalty for every tuple on
>> each internal node on the way, so compared to that one more call should
>> not be too expensive. If we do that, I think it would simplify the
>> algorithm quite a bit to just update all the parents on the way down,
>> instead of traversing up from the bottom after inserting the tuple to
>> the leaf.
>
> Oh, that's a really good idea, I think. But what about page splits?
> I guess in the case of a split, you'd be replacing the parent entry
> anyway, so having previously updated it to something larger doesn't
> really cause a problem other than wasting a few cycles --- which are
> probably still less than you save by not having to traverse back up.

I started looking at this, and run into a problem with page splits. The
right-links in GiST work differently from b-tree, a right-link is only
followed if we detect a concurrent page split. A concurrent split is
detected by comparing the "NSN" field on the child page against the LSN
that we saw on the parent when walking down. It means that if you just
leave the incompletely split page in the tree, where one half is missing
the parent pointer, scans will not find any tuples on that page. They
would at first, but as soon as the the parent page is updated due to
some unrelated insertion, the LSN of the parent is bumped above the NSN
stored on the child, and the page becomes invisible to scanners.

We avoid that problem during normal operation by keeping the parent page
locked while the child is split, until the downlink is inserted into the
parent. That blocks any other modifications to the parent page that
would bump the LSN, until our downlink has been inserted. That doesn't
work after crash recovery, as all the locks are released.

I think we can work around that with a small modification to the page
split algorithm. In a nutshell, when the child page is split, put a flag
on the left half indicating that the rightlink must always be followed,
regardless of the NSN. When the downlink is inserted to the parent,
clear the flag. Setting and clearing of these flags need to be performed
during WAL replay as well.

So to split a page:

(0. Lock the page to be split)
1. Split the page. Mark the rightlink in the left half with a flag
indicating that it always needs to be followed.
2. Lock the parent.
3. Insert downlink. (The parent may need to be split too)
4. Clear the flag in the child, and update NSN to the LSN of the
downlink insertion record.
5. Release child.

6. If the parent was split in step 3, goto 2.

If we crash between steps 1 and 3, the rightlink will have the flag, so
scans will know to always follow it. If we crash after step 3, recovery
will replay steps 3 and 4, so scans will see the downlinks as usual.

After a crash, the tree can be fixed up later by vacuum or subsequent
inserts, by performing steps 2-4.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-12 22:34:51
Message-ID: AANLkTi=e356WpMC4w=btYdAC5RBWQFnBxnrodmURjwYK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I think we can work around that with a small modification to the page split
> algorithm. In a nutshell, when the child page is split, put a flag on the
> left half indicating that the rightlink must always be followed, regardless
> of the NSN. When the downlink is inserted to the parent, clear the flag.
> Setting and clearing of these flags need to be performed during WAL replay
> as well.
>

Does this not cause duplicate results? Or does GIST already have to be
prepared to deal with duplicate results?

--
greg


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-16 16:06:21
Message-ID: 4CE2ABFD.5090505@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13.11.2010 00:34, Greg Stark wrote:
> On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I think we can work around that with a small modification to the page split
>> algorithm. In a nutshell, when the child page is split, put a flag on the
>> left half indicating that the rightlink must always be followed, regardless
>> of the NSN. When the downlink is inserted to the parent, clear the flag.
>> Setting and clearing of these flags need to be performed during WAL replay
>> as well.
>
> Does this not cause duplicate results? Or does GIST already have to be
> prepared to deal with duplicate results?

The GiST search algorithm avoids duplicate results by remembering the
LSN on the parent page when it follows a downlink. The split currently
happens like this:

0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and
new buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original
downlink is updated to reflect only the keys that stayed on the left
page). While keeping the child pages locked, the NSN field on the
children are updated with the new LSN of the parent page.

To avoid duplicates, when a scan looks at the child page, it needs to
know if it saw the parent page before or after the downlink was
inserted. If it saw it before, the scan needs to follow the rightlink to
the right half, otherwise it will follow the downlink as usual (if it
matched). The scan checks that by comparing the LSN it saw on the parent
page with the NSN on the child page. If parent LSN < NSN, we saw the
parent before the downlink was inserted.

Now, the problem with crash recovery is that the above algorithm depends
on the split to keep the parent and child locked until the downlink is
inserted in the parent. If you crash between steps 2 and 3, the locks
are gone. If a later insert then updates the parent page, because of a
split on some unrelated child page, that will bump the LSN of the parent
above the NSN on the child. Scans will see that the parent LSN > child
NSN, and will no longer follow the rightlink.

And the fix for that is to set a flag on the child page indicating that
rightlink has to be always followed regardless of the LSN/NSN, because
the downlink hasn't been inserted yet. When the downlink is inserted,
the flag is cleared and we rely on the existing LSN/NSN mechanism to
avoid duplicate results.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-03-10 20:47:21
Message-ID: 201103102047.p2AKlLa02940@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Has this been addressed?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> On 13.11.2010 00:34, Greg Stark wrote:
> > On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
> > <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> >> I think we can work around that with a small modification to the page split
> >> algorithm. In a nutshell, when the child page is split, put a flag on the
> >> left half indicating that the rightlink must always be followed, regardless
> >> of the NSN. When the downlink is inserted to the parent, clear the flag.
> >> Setting and clearing of these flags need to be performed during WAL replay
> >> as well.
> >
> > Does this not cause duplicate results? Or does GIST already have to be
> > prepared to deal with duplicate results?
>
> The GiST search algorithm avoids duplicate results by remembering the
> LSN on the parent page when it follows a downlink. The split currently
> happens like this:
>
> 0. (the child page is locked)
> 1. The parent page is locked.
> 2. The child page is split. The original page becomes the left half, and
> new buffers are allocated for the right halves.
> 3. The downlink is inserted on the parent page (and the original
> downlink is updated to reflect only the keys that stayed on the left
> page). While keeping the child pages locked, the NSN field on the
> children are updated with the new LSN of the parent page.
>
> To avoid duplicates, when a scan looks at the child page, it needs to
> know if it saw the parent page before or after the downlink was
> inserted. If it saw it before, the scan needs to follow the rightlink to
> the right half, otherwise it will follow the downlink as usual (if it
> matched). The scan checks that by comparing the LSN it saw on the parent
> page with the NSN on the child page. If parent LSN < NSN, we saw the
> parent before the downlink was inserted.
>
> Now, the problem with crash recovery is that the above algorithm depends
> on the split to keep the parent and child locked until the downlink is
> inserted in the parent. If you crash between steps 2 and 3, the locks
> are gone. If a later insert then updates the parent page, because of a
> split on some unrelated child page, that will bump the LSN of the parent
> above the NSN on the child. Scans will see that the parent LSN > child
> NSN, and will no longer follow the rightlink.
>
> And the fix for that is to set a flag on the child page indicating that
> rightlink has to be always followed regardless of the LSN/NSN, because
> the downlink hasn't been inserted yet. When the downlink is inserted,
> the flag is cleared and we rely on the existing LSN/NSN mechanism to
> avoid duplicate results.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-03-10 20:50:29
Message-ID: 201103102050.p2AKoTr03570@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
>
> Has this been addressed?

I see we have with this commit:

9de3aa65f01fb51cbc725e8508ea233e4e92c46c

---------------------------------------------------------------------------

>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
> > On 13.11.2010 00:34, Greg Stark wrote:
> > > On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
> > > <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> > >> I think we can work around that with a small modification to the page split
> > >> algorithm. In a nutshell, when the child page is split, put a flag on the
> > >> left half indicating that the rightlink must always be followed, regardless
> > >> of the NSN. When the downlink is inserted to the parent, clear the flag.
> > >> Setting and clearing of these flags need to be performed during WAL replay
> > >> as well.
> > >
> > > Does this not cause duplicate results? Or does GIST already have to be
> > > prepared to deal with duplicate results?
> >
> > The GiST search algorithm avoids duplicate results by remembering the
> > LSN on the parent page when it follows a downlink. The split currently
> > happens like this:
> >
> > 0. (the child page is locked)
> > 1. The parent page is locked.
> > 2. The child page is split. The original page becomes the left half, and
> > new buffers are allocated for the right halves.
> > 3. The downlink is inserted on the parent page (and the original
> > downlink is updated to reflect only the keys that stayed on the left
> > page). While keeping the child pages locked, the NSN field on the
> > children are updated with the new LSN of the parent page.
> >
> > To avoid duplicates, when a scan looks at the child page, it needs to
> > know if it saw the parent page before or after the downlink was
> > inserted. If it saw it before, the scan needs to follow the rightlink to
> > the right half, otherwise it will follow the downlink as usual (if it
> > matched). The scan checks that by comparing the LSN it saw on the parent
> > page with the NSN on the child page. If parent LSN < NSN, we saw the
> > parent before the downlink was inserted.
> >
> > Now, the problem with crash recovery is that the above algorithm depends
> > on the split to keep the parent and child locked until the downlink is
> > inserted in the parent. If you crash between steps 2 and 3, the locks
> > are gone. If a later insert then updates the parent page, because of a
> > split on some unrelated child page, that will bump the LSN of the parent
> > above the NSN on the child. Scans will see that the parent LSN > child
> > NSN, and will no longer follow the rightlink.
> >
> > And the fix for that is to set a flag on the child page indicating that
> > rightlink has to be always followed regardless of the LSN/NSN, because
> > the downlink hasn't been inserted yet. When the downlink is inserted,
> > the flag is cleared and we rely on the existing LSN/NSN mechanism to
> > avoid duplicate results.
> >
> > --
> > Heikki Linnakangas
> > EnterpriseDB http://www.enterprisedb.com
> >
> > --
> > Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-hackers
>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + It's impossible for everything to be true. +
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-03-11 11:26:41
Message-ID: 4D7A06F1.8050108@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10.03.2011 22:50, Bruce Momjian wrote:
> Bruce Momjian wrote:
>>
>> Has this been addressed?
>
> I see we have with this commit:
>
> 9de3aa65f01fb51cbc725e8508ea233e4e92c46c

We fixed GiST. B-tree still has the issue that if you have a checkpoint
in the middle of an insert, and crash, you might be left with a leaf
node without a downlink in the parent.

That's relatively harmless, index searches and insertions work without
the downlink. However, there's code in page deletion that ERRORs if the
parent can't be found. That should be fixed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-03-11 15:59:34
Message-ID: 19761.1299859174@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 10.03.2011 22:50, Bruce Momjian wrote:
>> Bruce Momjian wrote:
>
> Has this been addressed?
>>
>> I see we have with this commit:
>>
>> 9de3aa65f01fb51cbc725e8508ea233e4e92c46c

> We fixed GiST. B-tree still has the issue that if you have a checkpoint
> in the middle of an insert, and crash, you might be left with a leaf
> node without a downlink in the parent.

But that will be fixed during WAL replay.

> That's relatively harmless, index searches and insertions work without
> the downlink. However, there's code in page deletion that ERRORs if the
> parent can't be found. That should be fixed.

I don't like the idea of removing that consistency check, and I don't
think it's necessary to do so.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-03-11 16:44:33
Message-ID: 4D7A5171.8000603@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11.03.2011 17:59, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 10.03.2011 22:50, Bruce Momjian wrote:
>>> Bruce Momjian wrote:
>>
>> Has this been addressed?
>>>
>>> I see we have with this commit:
>>>
>>> 9de3aa65f01fb51cbc725e8508ea233e4e92c46c
>
>> We fixed GiST. B-tree still has the issue that if you have a checkpoint
>> in the middle of an insert, and crash, you might be left with a leaf
>> node without a downlink in the parent.
>
> But that will be fixed during WAL replay.

Not under the circumstances that started the original thread:

1. Backend splits a page
2. Checkpoint starts
3. Checkpoint runs to completion
4. Crash
(5. Backend never got to insert the parent pointer)

WAL replay starts at the checkpoint redo pointer, which is after the
page split record, so WAL replay won't insert the parent pointer. That's
an incredibly tight window to hit in practice, but it's possible in theory.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-03-11 17:41:09
Message-ID: 22501.1299865269@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 11.03.2011 17:59, Tom Lane wrote:
>> But that will be fixed during WAL replay.

> Not under the circumstances that started the original thread:

> 1. Backend splits a page
> 2. Checkpoint starts
> 3. Checkpoint runs to completion
> 4. Crash
> (5. Backend never got to insert the parent pointer)

> WAL replay starts at the checkpoint redo pointer, which is after the
> page split record, so WAL replay won't insert the parent pointer. That's
> an incredibly tight window to hit in practice, but it's possible in theory.

Hmm. It's not so improbable that checkpoint would start inside that
window, but that the parent insertion is still pending by the time the
checkpoint finishes is pretty improbable.

How about just reducing the deletion-time ERROR for missing downlink to a LOG?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-03-11 17:49:19
Message-ID: 4D7A609F.5070008@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11.03.2011 19:41, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 11.03.2011 17:59, Tom Lane wrote:
>>> But that will be fixed during WAL replay.
>
>> Not under the circumstances that started the original thread:
>
>> 1. Backend splits a page
>> 2. Checkpoint starts
>> 3. Checkpoint runs to completion
>> 4. Crash
>> (5. Backend never got to insert the parent pointer)
>
>> WAL replay starts at the checkpoint redo pointer, which is after the
>> page split record, so WAL replay won't insert the parent pointer. That's
>> an incredibly tight window to hit in practice, but it's possible in theory.
>
> Hmm. It's not so improbable that checkpoint would start inside that
> window, but that the parent insertion is still pending by the time the
> checkpoint finishes is pretty improbable.
>
> How about just reducing the deletion-time ERROR for missing downlink to a LOG?

Well, the code that follows expects to have a valid parent page locked,
so you can't literally do just that. But yeah, LOG and aborting the page
deletion seems fine to me.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-09-05 18:55:13
Message-ID: 201109051855.p85ItDs06048@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> On 11.03.2011 19:41, Tom Lane wrote:
> > Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> >> On 11.03.2011 17:59, Tom Lane wrote:
> >>> But that will be fixed during WAL replay.
> >
> >> Not under the circumstances that started the original thread:
> >
> >> 1. Backend splits a page
> >> 2. Checkpoint starts
> >> 3. Checkpoint runs to completion
> >> 4. Crash
> >> (5. Backend never got to insert the parent pointer)
> >
> >> WAL replay starts at the checkpoint redo pointer, which is after the
> >> page split record, so WAL replay won't insert the parent pointer. That's
> >> an incredibly tight window to hit in practice, but it's possible in theory.
> >
> > Hmm. It's not so improbable that checkpoint would start inside that
> > window, but that the parent insertion is still pending by the time the
> > checkpoint finishes is pretty improbable.
> >
> > How about just reducing the deletion-time ERROR for missing downlink to a LOG?
>
> Well, the code that follows expects to have a valid parent page locked,
> so you can't literally do just that. But yeah, LOG and aborting the page
> deletion seems fine to me.

Did this get fixed?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-09-06 10:21:28
Message-ID: 4E65F428.6030306@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 05.09.2011 21:55, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> On 11.03.2011 19:41, Tom Lane wrote:
>>> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>>> On 11.03.2011 17:59, Tom Lane wrote:
>>>>> But that will be fixed during WAL replay.
>>>
>>>> Not under the circumstances that started the original thread:
>>>
>>>> 1. Backend splits a page
>>>> 2. Checkpoint starts
>>>> 3. Checkpoint runs to completion
>>>> 4. Crash
>>>> (5. Backend never got to insert the parent pointer)
>>>
>>>> WAL replay starts at the checkpoint redo pointer, which is after the
>>>> page split record, so WAL replay won't insert the parent pointer. That's
>>>> an incredibly tight window to hit in practice, but it's possible in theory.
>>>
>>> Hmm. It's not so improbable that checkpoint would start inside that
>>> window, but that the parent insertion is still pending by the time the
>>> checkpoint finishes is pretty improbable.
>>>
>>> How about just reducing the deletion-time ERROR for missing downlink to a LOG?
>>
>> Well, the code that follows expects to have a valid parent page locked,
>> so you can't literally do just that. But yeah, LOG and aborting the page
>> deletion seems fine to me.
>
> Did this get fixed?

Nope.

On a closer look, this isn't only a problem for page deletion. Page
splitting also barfs if it can't find the parent of a page. As the code
stands, a missing downlink is not harmless, but causes all sorts of trouble.

The window for this to happen with a checkpoint is extremely tight, but
there's another situation where you can end up with a missing downlink:
if you run out of disk space while splitting a parent page, to insert a
downlink to it.

I think we should do a similar fix to b-tree that I did to GiST, and put
a flag on pages with missing downlinks. Then we can fix the missing
downlinks in vacuum and insertion, and get rid of the code to fix
incomplete splits after WAL replay.

The way it would work is that on page split the right page is flagged
with MISSING_DOWNLINK flag. When the downlink is inserted into the
parent, the flag is cleared in the same critical section as the WAL
record for the insertion of the parent is written. Normally, a backend
would never see the flag set, because the locks on the split pages are
not released until the parent record is written and the flag cleared
again. But if inserting the downlink fails for any reason, the next
inserter or vacuum that steps on the page can finish the split by
inserting the downlink.

Unfortunately that means holding the locks on the split pages longer
than we do at the moment. Currently they are released as soon as the
parent page is locked; with this change they would need to be held until
the WAL record of the downlink insertion is done. B-tree is so heavily
used that I'm a bit hesitant to sacrifice any concurrency there, but I
don't think it would be noticeable in practice.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-09-06 13:40:08
Message-ID: CA+TgmoaH0Qy10CdRMDQH7-ELkD9FawnkYOOzE06oMYB4n3Ac9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 6, 2011 at 6:21 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Nope.
>
> On a closer look, this isn't only a problem for page deletion. Page
> splitting also barfs if it can't find the parent of a page. As the code
> stands, a missing downlink is not harmless, but causes all sorts of trouble.
>
> The window for this to happen with a checkpoint is extremely tight, but
> there's another situation where you can end up with a missing downlink: if
> you run out of disk space while splitting a parent page, to insert a
> downlink to it.
>
> I think we should do a similar fix to b-tree that I did to GiST, and put a
> flag on pages with missing downlinks. Then we can fix the missing downlinks
> in vacuum and insertion, and get rid of the code to fix incomplete splits
> after WAL replay.
>
> The way it would work is that on page split the right page is flagged with
> MISSING_DOWNLINK flag. When the downlink is inserted into the parent, the
> flag is cleared in the same critical section as the WAL record for the
> insertion of the parent is written. Normally, a backend would never see the
> flag set, because the locks on the split pages are not released until the
> parent record is written and the flag cleared again. But if inserting the
> downlink fails for any reason, the next inserter or vacuum that steps on the
> page can finish the split by inserting the downlink.
>
> Unfortunately that means holding the locks on the split pages longer than we
> do at the moment. Currently they are released as soon as the parent page is
> locked; with this change they would need to be held until the WAL record of
> the downlink insertion is done. B-tree is so heavily used that I'm a bit
> hesitant to sacrifice any concurrency there, but I don't think it would be
> noticeable in practice.

Do you really need to hold the page locks for all that time, or could
you cheat? Like... release the locks on the split pages but then go
back and reacquire them to clear the flag...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-09-06 13:45:35
Message-ID: 4E6623FF.1070500@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.09.2011 16:40, Robert Haas wrote:
> On Tue, Sep 6, 2011 at 6:21 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> The way it would work is that on page split the right page is flagged with
>> MISSING_DOWNLINK flag. When the downlink is inserted into the parent, the
>> flag is cleared in the same critical section as the WAL record for the
>> insertion of the parent is written. Normally, a backend would never see the
>> flag set, because the locks on the split pages are not released until the
>> parent record is written and the flag cleared again. But if inserting the
>> downlink fails for any reason, the next inserter or vacuum that steps on the
>> page can finish the split by inserting the downlink.
>>
>> Unfortunately that means holding the locks on the split pages longer than we
>> do at the moment. Currently they are released as soon as the parent page is
>> locked; with this change they would need to be held until the WAL record of
>> the downlink insertion is done. B-tree is so heavily used that I'm a bit
>> hesitant to sacrifice any concurrency there, but I don't think it would be
>> noticeable in practice.
>
> Do you really need to hold the page locks for all that time, or could
> you cheat? Like... release the locks on the split pages but then go
> back and reacquire them to clear the flag...

Hmm, there's two issues with that:

1. While you're not holding the locks on the child pages, someone can
step onto the page and see that the MISSING_DOWNLINK flag is set, and
try to finish the split for you.

2. If you don't hold the page locked while you clear the flag, someone
can start and finish a checkpoint after you've inserted the downlink,
and before you've cleared the flag. You end up in a scenario where the
flag is set, but the page in fact *does* have a downlink in the parent.

So, nope, we can't cheat.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-09-06 13:49:39
Message-ID: CA+TgmoZdV6K2hdJT7ikBN=XLhe=t0Sfuugq7U=7=6OcD4XZp-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 6, 2011 at 9:45 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Do you really need to hold the page locks for all that time, or could
>> you cheat?  Like... release the locks on the split pages but then go
>> back and reacquire them to clear the flag...
>
> Hmm, there's two issues with that:
>
> 1. While you're not holding the locks on the child pages, someone can step
> onto the page and see that the MISSING_DOWNLINK flag is set, and try to
> finish the split for you.
>
> 2. If you don't hold the page locked while you clear the flag, someone can
> start and finish a checkpoint after you've inserted the downlink, and before
> you've cleared the flag. You end up in a scenario where the flag is set, but
> the page in fact *does* have a downlink in the parent.

It seems like both of these could be handled by making the code that
repairs the damage insert the downlink into the parent only if it's
not already present.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-09-06 14:03:51
Message-ID: 201109061403.p86E3pF16250@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Tue, Sep 6, 2011 at 9:45 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> >> Do you really need to hold the page locks for all that time, or could
> >> you cheat? ?Like... release the locks on the split pages but then go
> >> back and reacquire them to clear the flag...
> >
> > Hmm, there's two issues with that:
> >
> > 1. While you're not holding the locks on the child pages, someone can step
> > onto the page and see that the MISSING_DOWNLINK flag is set, and try to
> > finish the split for you.
> >
> > 2. If you don't hold the page locked while you clear the flag, someone can
> > start and finish a checkpoint after you've inserted the downlink, and before
> > you've cleared the flag. You end up in a scenario where the flag is set, but
> > the page in fact *does* have a downlink in the parent.
>
> It seems like both of these could be handled by making the code that
> repairs the damage insert the downlink into the parent only if it's
> not already present.

I am sorry to be dumping all these new open issues so late in the 9.1
cycle --- I only now got time to go back over my emails. Many are from
March and later.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-09-06 14:42:03
Message-ID: CA+TgmobShW1q__jUHFbXuKy+kx7b73GGpJWaaU5X-MXzzqoUxg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 6, 2011 at 10:03 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Robert Haas wrote:
>> On Tue, Sep 6, 2011 at 9:45 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> >> Do you really need to hold the page locks for all that time, or could
>> >> you cheat? ?Like... release the locks on the split pages but then go
>> >> back and reacquire them to clear the flag...
>> >
>> > Hmm, there's two issues with that:
>> >
>> > 1. While you're not holding the locks on the child pages, someone can step
>> > onto the page and see that the MISSING_DOWNLINK flag is set, and try to
>> > finish the split for you.
>> >
>> > 2. If you don't hold the page locked while you clear the flag, someone can
>> > start and finish a checkpoint after you've inserted the downlink, and before
>> > you've cleared the flag. You end up in a scenario where the flag is set, but
>> > the page in fact *does* have a downlink in the parent.
>>
>> It seems like both of these could be handled by making the code that
>> repairs the damage insert the downlink into the parent only if it's
>> not already present.
>
> I am sorry to be dumping all these new open issues so late in the 9.1
> cycle --- I only now got time to go back over my emails.  Many are from
> March and later.

Well, I don't think we're likely to do anything about this for 9.1.
For 9.2, possibly, we can improve it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-10-11 19:57:56
Message-ID: 201110111957.p9BJvv016941@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> On 11.03.2011 19:41, Tom Lane wrote:
> > Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> >> On 11.03.2011 17:59, Tom Lane wrote:
> >>> But that will be fixed during WAL replay.
> >
> >> Not under the circumstances that started the original thread:
> >
> >> 1. Backend splits a page
> >> 2. Checkpoint starts
> >> 3. Checkpoint runs to completion
> >> 4. Crash
> >> (5. Backend never got to insert the parent pointer)
> >
> >> WAL replay starts at the checkpoint redo pointer, which is after the
> >> page split record, so WAL replay won't insert the parent pointer. That's
> >> an incredibly tight window to hit in practice, but it's possible in theory.
> >
> > Hmm. It's not so improbable that checkpoint would start inside that
> > window, but that the parent insertion is still pending by the time the
> > checkpoint finishes is pretty improbable.
> >
> > How about just reducing the deletion-time ERROR for missing downlink to a LOG?
>
> Well, the code that follows expects to have a valid parent page locked,
> so you can't literally do just that. But yeah, LOG and aborting the page
> deletion seems fine to me.

Added to TODO:

Fix problem with btree page splits during checkpoints

http://archives.postgresql.org/pgsql-hackers/2010-11/msg00052.php
http://archives.postgresql.org/pgsql-hackers/2011-09/msg00184.php

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2012-08-15 22:23:15
Message-ID: 20120815222315.GR25473@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Has this been addressed? A TODO?

---------------------------------------------------------------------------

On Tue, Sep 6, 2011 at 09:49:39AM -0400, Robert Haas wrote:
> On Tue, Sep 6, 2011 at 9:45 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> >> Do you really need to hold the page locks for all that time, or could
> >> you cheat?  Like... release the locks on the split pages but then go
> >> back and reacquire them to clear the flag...
> >
> > Hmm, there's two issues with that:
> >
> > 1. While you're not holding the locks on the child pages, someone can step
> > onto the page and see that the MISSING_DOWNLINK flag is set, and try to
> > finish the split for you.
> >
> > 2. If you don't hold the page locked while you clear the flag, someone can
> > start and finish a checkpoint after you've inserted the downlink, and before
> > you've cleared the flag. You end up in a scenario where the flag is set, but
> > the page in fact *does* have a downlink in the parent.
>
> It seems like both of these could be handled by making the code that
> repairs the damage insert the downlink into the parent only if it's
> not already present.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2012-08-21 15:55:20
Message-ID: CA+TgmobTmx0WbYTpgsmM8WD-nJpocwwkAU1r+WfKR0=oeUTeSg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 15, 2012 at 6:23 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Has this been addressed? A TODO?

I don't think anything's been done about it. According to your email
of October 11, 2011, you already did add a TODO for this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2012-08-23 14:14:26
Message-ID: 20120823141426.GC19565@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Aug 21, 2012 at 11:55:20AM -0400, Robert Haas wrote:
> On Wed, Aug 15, 2012 at 6:23 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > Has this been addressed? A TODO?
>
> I don't think anything's been done about it. According to your email
> of October 11, 2011, you already did add a TODO for this.

Ah, I see that now. Thanks.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +