Re: Bug in new buffering GiST build code

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in new buffering GiST build code
Date: 2012-05-21 22:09:57
Message-ID: CAPpHfdtVeLkeT0Je1ihbOphSjf0b5mg8A8Um2zUVvaomSqM_Lw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

On Tue, May 22, 2012 at 12:56 AM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> After staring at graphs built from gist trees for the whole day, I think I
> finally understand what's wrong:
>
> There's a thinko in the way we maintain the parent paths during
> insertions. It boils down to the fact that in a GiST index, the
> left-to-right ordering as determined by the right-links on the upper level
> does not necessarily match the left-to-right ordering at a lower level. I'm
> afraid we've inadvertently made that assumption in the code.
>
> This can happen:
>
> 1. Let's imagine that we have a tree that looks like this:
>
> root
> |
> ...
> |
> A (internal node at upper level)
> |
> |
> B
> |
> |
> C (internal node at a lower level)
> |
> ...
>
> 2. While we descend down the tree to insert a tuple, we memorize the path
> A..B..C. This is stored in the node buffer associated with node C.
>
> 3. More tuples are inserted to another subtree below B (not shown), until
> node B needs to be split. This produces tree:
>
> A
> |\
> | \
> B->B2
> |
> |
> C
>
> We still have the path A..B..C memorized in C's node buffer. The downlink
> for C is now actually in B2, but that's ok, because we have the code to
> follow the right links if we can't find the downlink for a node in the
> memorized parent.
>
> 4. More tuples are added to another subtree of A, until A has to be split.
> Picksplit decides to keep the downlink for B2 on the original page, and
> moves the downlink for B on the new page, A2:
>
> A->A2
> \ /
> X
> / \
> B->B2
> |
> |
> C
>
> Remember that we still have the path A..B..C memorized in C's node buffer.
>
> 5. More tuples are buffered, and we traverse down the tree along the path
> A2->B->... When we look up the node buffer for page B, we update the path
> stored there. It's now A2..B. This fragment of the path is shared by the
> path in C's node buffer.
>
> 6. At this point, the path memorized in C's node buffer is A2..B..C. This
> is where things go wrong. While it's true that A2 is the parent of B, and
> it's true that the parent of C can be found by following the rightlink from
> B, A2 is *not* a grandparent of C.
>
> 7. More tuples are added below C, and C has to be split. To insert the
> downlink for the new sibling, we re-find the parent for C. The memorized
> path is A2..B..C. We begin by searching for the downlink for C in page B.
> It's not there, so we move right, and find it in B2. The path we're working
> with is now A2..B2..C. When we insert the new downlink into B2, it also
> fills up and has to be split, so recurse up and have to refind the parent
> of B2. We begin looking in the memorized parent, A2. The downlink is not
> there, so we move right. But the downlink for B2 is to the left from A2, so
> we never find it. We walk right until we fall off the cliff, and you get
> the "failed to re-find parent" error.
>
>
> I think the minimal fix is that after moving right, look up the node
> buffer of the page we moved onto, and use the path memorized for that if we
> have to recurse further up the tree. So in the above example, at step 7
> after we've moved right to node B2, we should look up the memorized path
> for B2 in B2's node buffer. That would give us the correct path, A..B2..C.
>
> The management of the path stacks is a bit complicated, anyway. I'll think
> about this some more tomorrow, maybe we can make it simpler, knowing that
> we have to do those extra lookups.

WOW! You did enormous work on exploring that!
I just arrived from PGCon and start looking at it when find you've already
done comprehensive research of this problem.
On the step 5 if we've NSN in GISTBufferingInsertStack structure, we could
detect situation of changing parent of splitted page. Using this we could
save copy of GISTBufferingInsertStack for B2 with original parent A,
because we know split of B to occur after creating GISTBufferingInsertStack
but before split of A. The question is how to find this copy from C, hash?

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2012-05-21 22:37:54 Re: Patch: add conversion from pg_wchar to multibyte
Previous Message Florian Pflug 2012-05-21 22:08:51 Re: read() returns ERANGE in Mac OS X