Re: SP-GiST bug.

Lists: pgsql-hackers
From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: SP-GiST bug.
Date: 2014-05-20 23:13:45
Message-ID: 537BE1A9.1050006@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

create table xxx ( t text);

insert into xxx select 'x' || v::text from generate_series(1, 291) as v;
insert into xxx values ('');

create index xxxidx on xxx using spgist (t);

And postgres will eat memory forever. It seems to me that checkAllTheSame
wrongly decides that all tuples are the same. All tuples except tuple to be
inserted have the same character 'x' and only new tuple has '\0'. But
checkAllTheSame() removes new tuple because set of tuples is too big to fit page
and founds that all tuples are the same. Patch attached, suppose, all branches
with spgist should be fixed.

Original data is too big to send in e-mail list or even send link, and the bug
caused not in root page as in example above.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

Attachment Content-Type Size
spgist_allthesame.patch.gz application/x-tar 763 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-05-28 22:32:18
Message-ID: 14826.1401316338@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
> create table xxx ( t text);

> insert into xxx select 'x' || v::text from generate_series(1, 291) as v;
> insert into xxx values ('');

> create index xxxidx on xxx using spgist (t);

Fun!

> And postgres will eat memory forever. It seems to me that checkAllTheSame
> wrongly decides that all tuples are the same. All tuples except tuple to be
> inserted have the same character 'x' and only new tuple has '\0'.

I don't think that checkAllTheSame is broken, but it probably would be
after this patch --- ISTM you're breaking the corner case described in the
header comment for checkAllTheSame, where we need to force splitting of a
set of existing tuples that the opclass can't distinguish.

What actually is broken, I think, is the spgist text opclass, because it's
using a zero node label for two different situations. The scenario we're
looking at here is:

1. We call picksplit with all 292 of these entries. Not surprisingly,
it puts the first 291 into a single node with label 'x' and the last one
into another node with label '\0'.

2. Because this is too many to fit on one index page, checkAllTheSame
decides it had better create an allTheSame tuple for the first 291 and
then try again to insert the empty string into that tuple. While you
could argue whether this is *necessary*, it is not *incorrect*, so ISTM
the opclass had better cope with it.

3. We call spg_text_choose with the allTheSame tuple and the empty-string
value to be inserted. It finds the empty-string value doesn't match
(since all the nodes have label 'x') so it requests a split:

out->resultType = spgSplitTuple;
out->result.splitTuple.prefixHasPrefix = in->hasPrefix;
out->result.splitTuple.prefixPrefixDatum = in->prefixDatum;
out->result.splitTuple.nodeLabel = UInt8GetDatum('\0');
out->result.splitTuple.postfixHasPrefix = false;

After splitting, we have a new upper tuple with a single node with label
'\0', pointing to a new allTheSame tuple containing the original 291
entries.

4. We call spg_text_choose with the new upper tuple and the empty-string
value to be inserted. It looks for a node labeled '\0', and finds it, so
it reports that we should descend into that node --- ie, down to the new
allTheSame tuple.

5. We call spg_text_choose with the new allTheSame tuple and the
empty-string value to be inserted. This is exactly like the situation
at step 3, so we're in an infinite loop.

It doesn't look to me like the core code has done anything wrong here;
it just did what the opclass requested. Rather, the problem is we're
using '\0' node label both to represent a "no op" label descent and
to represent "we're past the end of the string".

I'm afraid there may not be any way to fix this without breaking on-disk
compatibility for spgist text indexes :-(. It seems like we have to
distinguish the case of zero-length string from that of a dummy label
for a pushed-down allTheSame tuple.

regards, tom lane


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-05-29 14:52:09
Message-ID: 53874999.3010303@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I don't think that checkAllTheSame is broken, but it probably would be
> after this patch --- ISTM you're breaking the corner case described in the
> header comment for checkAllTheSame, where we need to force splitting of a
> set of existing tuples that the opclass can't distinguish.
INHO, this patch will not break - because it forces (in corner case) to create a
node for new tuple but exclude it from list to insert. On next iteration we will
find a needed node with empty list.

Comment:
* If we know that the leaf tuples wouldn't all fit on one page, then we
* exclude the last tuple (which is the incoming new tuple that forced a split)
* from the check to see if more than one node is used. The reason for this
* is that if the existing tuples are put into only one chain, then even if
* we move them all to an empty page, there would still not be room for the
* new tuple, so we'd get into an infinite loop of picksplit attempts.

If we reserve a node for new tuple then on next iteration we will not fall into
picksplit again - we will have an empty list. So new tuple will fall into
another page.

BTW, look for following example:
create table xxx (p point);

insert into xxx select '(1,1)'::point from generate_series(1, 226) as v;
insert into xxx values ('(3,3)'::point);

create index xxxidx on xxx using spgist (p quad_point_ops);

easy to see that the single node is allTheSame although underlying leafs are
different. Due to allTheSame search logic scans are correct but inefficient. And
patch fixes it too.

> What actually is broken, I think, is the spgist text opclass, because it's
> using a zero node label for two different situations. The scenario we're
> looking at here is:

Agree for different situations, but disagree that's broken :)

> It seems like we have to
> distinguish the case of zero-length string from that of a dummy label
> for a pushed-down allTheSame tuple.

Actually, we do this: if allTheSame is true then nodes have a dummy labels.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-05-29 19:51:38
Message-ID: 15869.1401393098@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> I don't think that checkAllTheSame is broken, but it probably would be
>> after this patch --- ISTM you're breaking the corner case described in the
>> header comment for checkAllTheSame, where we need to force splitting of a
>> set of existing tuples that the opclass can't distinguish.

> INHO, this patch will not break - because it forces (in corner case) to create a
> node for new tuple but exclude it from list to insert. On next iteration we will
> find a needed node with empty list.

> Comment:
> * If we know that the leaf tuples wouldn't all fit on one page, then we
> * exclude the last tuple (which is the incoming new tuple that forced a split)
> * from the check to see if more than one node is used. The reason for this
> * is that if the existing tuples are put into only one chain, then even if
> * we move them all to an empty page, there would still not be room for the
> * new tuple, so we'd get into an infinite loop of picksplit attempts.

> If we reserve a node for new tuple then on next iteration we will not fall into
> picksplit again - we will have an empty list. So new tuple will fall into
> another page.

The point I'm making is that the scenario your test case exposes is not
an infinite loop of picksplits, but an infinite loop of choose calls.
There are certainly ways to get into that loop that don't involve this
specific checkAllTheSame logic. Here's an example:

regression=# create table xxx ( t text);
CREATE TABLE
regression=# create index xxxidx on xxx using spgist (t);
CREATE INDEX
regression=# insert into xxx select repeat('x',4000);
INSERT 0 1
regression=# insert into xxx select repeat('x',4000);
INSERT 0 1
regression=# insert into xxx select repeat('x',4000);
INSERT 0 1
regression=# insert into xxx select repeat('x',3996);

In this example, we get a picksplit at the third insert, and here we
*must* take the allTheSame path because the values are, in fact, all the
same (the SPGIST_MAX_PREFIX_LENGTH constraint means we can only put 3996
characters into the prefix, and then the 3997'th character of each value
is 'x'). Then the fourth insert triggers this same infinite loop, because
spg_text_choose is asked how to insert the slightly-shorter value into the
allTheSame tuple, and it does the wrong thing.

It's certainly possible that we could/should change what checkAllTheSame
is doing --- on re-reading the comment, I'm not really sure that the
scenario it's concerned about can happen. However, that will not fix
the bug in spgtextproc.c; it will only block one avenue for reaching
the bug.

regards, tom lane


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-05-30 14:55:18
Message-ID: 53889BD6.4020401@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The point I'm making is that the scenario your test case exposes is not
> an infinite loop of picksplits, but an infinite loop of choose calls.

Thank you, now I see what I missed before. After some brainstorming, it's
possible to use '\0' only as end of string marker. The idea is: when we split
allthesame tuple to upper/lower we copy node label to upper tuple instead of set
it to '\0'. Node labels of lower tuple will be unchanged but should be ignored
for reconstructionValue - and they are ignored because of allTheSame flag. See
attached patch.

Unfortunately, it will break on-disk compatibility. Although it will not cause a
panic/error/coredump allTheSame approach will not find tuples. Users should
recreate spgist indexes over text column.

> It's certainly possible that we could/should change what checkAllTheSame
> is doing --- on re-reading the comment, I'm not really sure that the
> scenario it's concerned about can happen. However, that will not fix

I rewrited a patch to fix missed way - allTheSame result of picksplit and tooBig
is set. I believe this patch is still needed because it could make a tree more
efficient as it was demonstrated for quad tree.

this patch doesn't break on-disk compatibility, although index recreation is
recommended.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

Attachment Content-Type Size
spgist_textproc.patch.gz application/x-tar 769 bytes
spgist_allthesame.patch.2.gz application/x-tar 895 bytes

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-05-30 15:11:51
Message-ID: 20140530151151.GQ28490@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 30, 2014 at 06:55:18PM +0400, Teodor Sigaev wrote:
> >The point I'm making is that the scenario your test case exposes is not
> >an infinite loop of picksplits, but an infinite loop of choose calls.
>
> Thank you, now I see what I missed before. After some brainstorming,
> it's possible to use '\0' only as end of string marker. The idea
> is: when we split allthesame tuple to upper/lower we copy node label
> to upper tuple instead of set it to '\0'. Node labels of lower tuple
> will be unchanged but should be ignored for reconstructionValue -
> and they are ignored because of allTheSame flag. See attached
> patch.
>
> Unfortunately, it will break on-disk compatibility. Although it will
> not cause a panic/error/coredump allTheSame approach will not find
> tuples. Users should recreate spgist indexes over text column.

If we bump the system catalog version, pg_upgrade can mark those indexes
as invalid and output a script to reindex them.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-05-30 16:19:07
Message-ID: 5388AF7B.2050205@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> If we bump the system catalog version, pg_upgrade can mark those indexes
> as invalid and output a script to reindex them.
>

Between 9.3 - 9.4 catalog version is changed anyway.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-05-30 19:22:34
Message-ID: 20140530192234.GR28490@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 30, 2014 at 08:19:07PM +0400, Teodor Sigaev wrote:
> >If we bump the system catalog version, pg_upgrade can mark those indexes
> >as invalid and output a script to reindex them.
> >
>
> Between 9.3 - 9.4 catalog version is changed anyway.

Right. I was thinking of beta users between beta1 and beta2 as well.
We would need to trigger the index invalidation on the catalog version
used for the commit that changes the index format, not the major version
like 9.4.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-06-03 17:37:40
Message-ID: 18153.1401817060@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> The point I'm making is that the scenario your test case exposes is not
>> an infinite loop of picksplits, but an infinite loop of choose calls.

> Thank you, now I see what I missed before. After some brainstorming, it's
> possible to use '\0' only as end of string marker. The idea is: when we split
> allthesame tuple to upper/lower we copy node label to upper tuple instead of set
> it to '\0'. Node labels of lower tuple will be unchanged but should be ignored
> for reconstructionValue - and they are ignored because of allTheSame flag. See
> attached patch.

I don't believe this patch works at all. In particular, for an allTheSame
node occurring at the root level, omitting the nodeChar from the
reconstructed value is certainly wrong since there was no parent that
could have had a label containing the same character. More generally, the
patch is assuming that any allTheSame tuple is one that is underneath a
split node; but that's surely wrong (where did such a tuple come from
before the split happened?).

If the root case were the only possible one then we could fix the problem
by adding a test on whether level == 0, but AFAICS, allTheSame tuples can
also get created at lower tree levels. All you need is a bunch of strings
that are duplicates for more than the maximum prefix length.

I think what we have to do is use a different dummy value for the node
label of a pushed-down allTheSame tuple than we do for end-of-string.
This requires widening the node labels from uint8 to (at least) int16.
However, I think that's essentially free because pass-by-value node
labels are stored as Datums anyway. In fact, it might even be upward
compatible, at least for cases that aren't broken today.

> I rewrited a patch to fix missed way - allTheSame result of picksplit and tooBig
> is set. I believe this patch is still needed because it could make a tree more
> efficient as it was demonstrated for quad tree.

The question here continues to be how sure we are that the case described
in checkAllTheSame's header comment can't actually occur. We certainly
thought it could when we originally wrote this stuff. I think possibly
we were wrong, but I'd like to see a clear proof of that before we discuss
dropping the logic that's meant to cope with the situation.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST bug.
Date: 2014-06-05 22:39:13
Message-ID: 26213.1402007953@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> I think what we have to do is use a different dummy value for the node
> label of a pushed-down allTheSame tuple than we do for end-of-string.
> This requires widening the node labels from uint8 to (at least) int16.
> However, I think that's essentially free because pass-by-value node
> labels are stored as Datums anyway. In fact, it might even be upward
> compatible, at least for cases that aren't broken today.

Here's a draft patch along these lines. AFAICT it is fully upward
compatible with existing indexes, although in cases where the current code
creates a zero node label for strings ending at that level, this code will
choose to push new strings into a new child node with label -1 instead, so
that there's some very small added inefficiency. I don't think we need to
tell people to reindex if we use this fix, though. I went with new
dummy labels of -1 and -2 partially so that an existing zero label would
not be a hazard for future insertions, and partly with the thought that if
we ever allow embedded \0 in strings, this coding would be able to handle
it with minimal adjustment.

Note that this fix slightly reduces the maximum safe length of a prefix
string (SPGIST_MAX_PREFIX_LENGTH). In an existing index, there might be
prefixes longer than that. The code will still work, but it's
theoretically possible that addition of nodes to such a tuple could result
in "SPGiST inner tuple size exceeds maximum" errors, if you managed to
populate all possible next-byte values at that tuple. I think the odds of
this being a problem in the field are negligible (and if it did happen,
reindexing would fix the problem).

regards, tom lane

Attachment Content-Type Size
spgist-zero-label-fix.patch text/x-diff 15.7 KB