Re: Race condition within _bt_findinsertloc()? (new page split code)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 19:19:26
Message-ID: 5384E53E.5030205@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 05/27/2014 09:47 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> Also note that _bt_moveright() also finishes any incomplete splits it
>> encounters (when called for an insertion). So _bt_findinsertloc() never gets
>> called on a page with the incomplete-split flag set. It might encounter one
>> when it moves right, but never the first page.
>
> Fair enough, but I don't think that affects correctness either way (I
> don't think that you meant to imply that this was a necessary
> precaution that you'd taken - right?).

Right.

>> If I understood correctly, the tree looks like this before the insertion:
>>
>> Parent page:
>> +-------------+
>> | |
>> | 9 -> A |
>> +-------------+
>>
>> Leaf A
>> +-------------+
>> | HI-key: 9 |
>> | |
>> | 7 8 9 |
>> +-------------+
>>
>> And after insertion and incomplete split:
>>
>> Parent page
>> +-------------+
>> | |
>> | 9 -> A |
>> +-------------+
>>
>> Leaf A Leaf B
>> +--------------+ +-------------+
>> | HI-key: 8 | | HI-key: 9 |
>> | (INCOMPLETE_ | | |
>> | SPLIT) | <-> | |
>> | | | |
>> | 7 7* 8 | | 9 |
>> +--------------+ +-------------+
>
>> After the split is finished, the tree looks like this:
>>
>> Parent page
>> +-------------+
>> | 8 -> A |
>> | 9 -> B |
>> +-------------+
>>
>> Leaf A Leaf B
>> +-------------+ +-------------+
>> | HI-key: 8 | | HI-key: 9 |
>> | | <-> | |
>> | 7 7* 8 | | 9 |
>> +-------------+ +-------------+
>
> How did the parent page change between before and after the final
> atomic operation (page split completion)? What happened to "9 -> A"?

Ah, sorry, I got that wrong. The downlinks store the *low* key of the
child page, not the high key as I depicted. Let me try again:

On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
> Anyway, the concern is that there may be problems when we call
> _bt_finish_split() with that left sibling locked thoughout (i.e.
> finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
> itself has a right sibling from the incomplete split (which is usually
> the value lock page's right-right sibling)). I'm not concerned about
> performance, since as the comment says it ought to be an infrequent
> occurrence. I also believe that there are no deadlock hazards. But
> consider this scenario:
>
> * We insert the value 7 into an int4 unique index. We need to split
> the leaf page. We run out of memory or something, and ours is an
> incomplete page split. Our transaction is aborted. For the sake of
> argument, suppose that there are also already a bunch of dead tuples
> within the index with values of 7, 8 and 9.

So, initially the tree looks like this:

Parent page:
+-------------+
| |
| 7 -> A |
+-------------+

Leaf A
+-------------+
| HI-key: 9 |
| |
| 7 8 9 |
+-------------+

And after insertion and incomplete split:

Parent page
+-------------+
| |
| 7 -> A |
+-------------+

Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| (INCOMPLETE_ | | |
| SPLIT) | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+

where 7* is the newly inserted key, with value 7.

(you didn't mention at what point the split happens, but in the next
paragraph you said the new downlink has value 8, which implies the above
split)

> * Another inserter of the value 7 comes along. It follows exactly the
> same downlink as the first, now aborted inserter (suppose the
> downlink's value is 9). It also locks the same leaf page to establish
> a "value lock" in precisely the same manner. Finding no room on the
> first page, it looks further right, maintaining its original "value
> lock" throughout. It finishes the first inserter's split of the
> non-value-lock page - a new downlink is inserted into the parent page,
> with the value 8. It then releases all buffer locks except the first
> one/original "value lock". A physical insertion has yet to occur.

The downlink of the original page cannot contain 9. Because, as I now
remember ;-), the downlinks contain low-keys, not high keys.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-05-27 19:27:13 Why is pg_lsn marking itself a preferred type?
Previous Message Evan Jones 2014-05-27 19:12:22 PG Manual: Clarifying the repeatable read isolation example