GIN versus zero-key queries

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: GIN versus zero-key queries
Date: 2009-03-25 17:25:01
Message-ID: 4076.1238001901@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Our fine manual sayeth (in section 52.5)

When extractQuery returns zero keys, GIN will emit an error. Depending
on the operator, a void query might match all, some, or none of the
indexed values (for example, every array contains the empty array, but
does not overlap the empty array), and GIN cannot determine the correct
answer, nor produce a full-index-scan result if it could determine that
that was correct.

However, the behavior actually implemented by newScanKey() doesn't seem
to agree with this. If there are multiple scankeys (ie, multiple
indexable clauses) then what actually happens is you get an error
report only if *all* the clauses are zero-key queries. If some clauses
are zero-key and some are normal, it effectively ignores the zero-key
ones and sails ahead with the normal ones. This amounts to assuming
that the zero-key queries match all possible indexed values. But as
noted by the manual, that's not a correct conclusion for some operator
semantics.

I am not sure whether the statement in 52.5 is still accurate, though.
We have an API definition by which extractQuery can distinguish "all
match" from "no match". If we just legislate that "some match" isn't
a valid behavior for zero-key queries, then the code is correct and the
documentation is wrong. However, if the above quote is correct, then
I think newScanKey() is buggy.

Comments?

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: GIN versus zero-key queries
Date: 2009-03-26 20:45:37
Message-ID: 1238100337.11547.79.camel@dell.linuxdev.us.dell.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2009-03-25 at 13:25 -0400, Tom Lane wrote:
> I am not sure whether the statement in 52.5 is still accurate, though.
> We have an API definition by which extractQuery can distinguish "all
> match" from "no match". If we just legislate that "some match" isn't
> a valid behavior for zero-key queries, then the code is correct and the
> documentation is wrong. However, if the above quote is correct, then
> I think newScanKey() is buggy.

Legislating that "some match" is invalid for zero-key queries seems
reasonable to me.

If the operator class author wants it to throw an error for zero keys
(as the documentation says should happen), they can do that easily
enough without being forced. However, if the opclass author finds "all
match" to be useful behavior (which seems reasonable), I don't see a
reason to stop them.

Regards,
Jeff Davis


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: GIN versus zero-key queries
Date: 2009-04-09 15:10:20
Message-ID: 49DE0FDC.2030700@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> We have an API definition by which extractQuery can distinguish "all
> match" from "no match". If we just legislate that "some match" isn't
> a valid behavior for zero-key queries, then the code is correct and the

Right now I don't see an example with zero keys and "some match", consistent
method will not have any information about indexed tuple and hence it could not
produce any reasonable result. It seems to me, that paragraph should be removed
at all.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: GIN versus zero-key queries
Date: 2009-04-09 19:20:52
Message-ID: 3868.1239304852@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> We have an API definition by which extractQuery can distinguish "all
>> match" from "no match". If we just legislate that "some match" isn't
>> a valid behavior for zero-key queries, then the code is correct and the

> Right now I don't see an example with zero keys and "some match", consistent
> method will not have any information about indexed tuple and hence it could not
> produce any reasonable result. It seems to me, that paragraph should be removed
> at all.

Well, we still have to document the fact that GIN fails when presented
with a query that would require a full-index scan. I've updated section
52.5 as attached. (I removed the claim that multiple matches were a
problem, since this is obviously not true for a bitmap scan, and I
suppose that the plain indexscan code must have had a way to deal with
it too.)

More generally, though, I find myself quite unhappy with the fact that
GIN doesn't provide reasonable behavior for the no-keys corner cases.
If we are going to provide operator classes that claim to implement
index acceleration of standard operators, it is not okay for them
to not match the exact semantics of those operators. Right now it's
a mess for empty arrays, and it's even more of a mess for arrays
containing nulls. array_contain_compare() considers nulls as never
matching, which means that

regression=# select array[1,null] <@ array[1,null];
?column?
----------
f
(1 row)

which is entirely bizarre when you compare that result to

regression=# select array[1,null] = array[1,null];
?column?
----------
t
(1 row)

It's obviously too late to do anything about this for 8.4, but I would
like to have a TODO item to figure out how to do this right. We need to
adjust the behavior of the operators to be consistent, and then we need
to make it possible for GIN to implement them exactly. I wonder whether
we should not change GIN to automatically do something reasonable for
empty indexed values, ie stick them into the index in some special way
denoting "no indexable keys for this item". The dummy-value solution
is not something that operator classes should have to come up with,
and not all data types present a reasonable way to have dummy values
anyway.

regards, tom lane

<para>
<acronym>GIN</acronym> doesn't support full index scans. The reason for
this is that <function>extractValue</> is allowed to return zero keys,
as for example might happen with an empty string or empty array. In such
a case the indexed value will be unrepresented in the index. It is
therefore impossible for <acronym>GIN</acronym> to guarantee that a
scan of the index can find every row in the table.
</para>

<para>
Because of this limitation, when <function>extractQuery</function> returns
<literal>nkeys = 0</> to indicate that all values match the query,
<acronym>GIN</acronym> will emit an error. (If there are multiple ANDed
indexable operators in the query, this happens only if they all return zero
for <literal>nkeys</>.)
</para>

<para>
It is possible for an operator class to circumvent the restriction against
full index scan. To do that, <function>extractValue</> must return at least
one (possibly dummy) key for every indexed value, and
<function>extractQuery</function> must convert an unrestricted search into
a partial-match query that will scan the whole index. This is inefficient
but might be necessary to avoid corner-case failures with operators such
as <literal>LIKE</> or subset inclusion.
</para>

<para>
<acronym>GIN</acronym> assumes that indexable operators are strict.
This means that <function>extractValue</> will not be called at all on
a NULL value (so the value will go unindexed), and
<function>extractQuery</function> will not be called on a NULL comparison
value either (instead, the query is presumed to be unmatchable).
</para>

<para>
A possibly more serious limitation is that <acronym>GIN</acronym> cannot
handle NULL keys &mdash; for example, an array containing a NULL cannot
be handled except by ignoring the NULL.
</para>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: GIN versus zero-key queries
Date: 2009-04-15 23:44:56
Message-ID: 200904152344.n3FNiuH05971@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Where are we on this?

---------------------------------------------------------------------------

Tom Lane wrote:
> Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
> >> We have an API definition by which extractQuery can distinguish "all
> >> match" from "no match". If we just legislate that "some match" isn't
> >> a valid behavior for zero-key queries, then the code is correct and the
>
> > Right now I don't see an example with zero keys and "some match", consistent
> > method will not have any information about indexed tuple and hence it could not
> > produce any reasonable result. It seems to me, that paragraph should be removed
> > at all.
>
> Well, we still have to document the fact that GIN fails when presented
> with a query that would require a full-index scan. I've updated section
> 52.5 as attached. (I removed the claim that multiple matches were a
> problem, since this is obviously not true for a bitmap scan, and I
> suppose that the plain indexscan code must have had a way to deal with
> it too.)
>
> More generally, though, I find myself quite unhappy with the fact that
> GIN doesn't provide reasonable behavior for the no-keys corner cases.
> If we are going to provide operator classes that claim to implement
> index acceleration of standard operators, it is not okay for them
> to not match the exact semantics of those operators. Right now it's
> a mess for empty arrays, and it's even more of a mess for arrays
> containing nulls. array_contain_compare() considers nulls as never
> matching, which means that
>
> regression=# select array[1,null] <@ array[1,null];
> ?column?
> ----------
> f
> (1 row)
>
> which is entirely bizarre when you compare that result to
>
> regression=# select array[1,null] = array[1,null];
> ?column?
> ----------
> t
> (1 row)
>
> It's obviously too late to do anything about this for 8.4, but I would
> like to have a TODO item to figure out how to do this right. We need to
> adjust the behavior of the operators to be consistent, and then we need
> to make it possible for GIN to implement them exactly. I wonder whether
> we should not change GIN to automatically do something reasonable for
> empty indexed values, ie stick them into the index in some special way
> denoting "no indexable keys for this item". The dummy-value solution
> is not something that operator classes should have to come up with,
> and not all data types present a reasonable way to have dummy values
> anyway.
>
> regards, tom lane
>
>
> <para>
> <acronym>GIN</acronym> doesn't support full index scans. The reason for
> this is that <function>extractValue</> is allowed to return zero keys,
> as for example might happen with an empty string or empty array. In such
> a case the indexed value will be unrepresented in the index. It is
> therefore impossible for <acronym>GIN</acronym> to guarantee that a
> scan of the index can find every row in the table.
> </para>
>
> <para>
> Because of this limitation, when <function>extractQuery</function> returns
> <literal>nkeys = 0</> to indicate that all values match the query,
> <acronym>GIN</acronym> will emit an error. (If there are multiple ANDed
> indexable operators in the query, this happens only if they all return zero
> for <literal>nkeys</>.)
> </para>
>
> <para>
> It is possible for an operator class to circumvent the restriction against
> full index scan. To do that, <function>extractValue</> must return at least
> one (possibly dummy) key for every indexed value, and
> <function>extractQuery</function> must convert an unrestricted search into
> a partial-match query that will scan the whole index. This is inefficient
> but might be necessary to avoid corner-case failures with operators such
> as <literal>LIKE</> or subset inclusion.
> </para>
>
> <para>
> <acronym>GIN</acronym> assumes that indexable operators are strict.
> This means that <function>extractValue</> will not be called at all on
> a NULL value (so the value will go unindexed), and
> <function>extractQuery</function> will not be called on a NULL comparison
> value either (instead, the query is presumed to be unmatchable).
> </para>
>
> <para>
> A possibly more serious limitation is that <acronym>GIN</acronym> cannot
> handle NULL keys &mdash; for example, an array containing a NULL cannot
> be handled except by ignoring the NULL.
> </para>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +