Re: Minimum selectivity estimate for LIKE 'prefix%'

Lists: pgsql-hackerspgsql-patches
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-patches(at)postgreSQL(dot)org
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject: Minimum selectivity estimate for LIKE 'prefix%'
Date: 2008-03-06 21:16:54
Message-ID: 23548.1204838214@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Peter complained awhile back about unrealistically small selectivity
estimates for LIKE with a fixed-prefix pattern:
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00939.php
I had suspected that this was related to the locale-specific problems
we fixed a few months ago, but having finally had a chance to look
at the data off-list, there doesn't seem to be any such component
to the problem. What it boils down to is that when we generate a
range constraint based on the prefix, such as
col >= 'prefix' AND col < 'prefiy'
if the prefix is more than a few characters long then the two endpoint
values are indistinguishable as far as comparison to the histogram
is concerned, and so we come out with a selectivity estimate that
is zero to within roundoff error. This is unreasonably optimistic
and can lead to bad plan choices.

What I propose doing about this is a small variant on Peter's original
suggestion: compute the estimated selectivity for
col = 'prefix'
and clamp the result of prefix_selectivity to be at least that.
This is plausible on intuitive grounds since the range constraint
must surely include at least these values. Furthermore, it eliminates
what had been an entirely ad-hoc choice of a lower bound (the code
was clamping to at least 1e-10, which is surely unreasonably
optimistic).

The end result of this, for the case Peter is interested in where there
are no especially common values, is that the minimum selectivity
estimate for LIKE 'prefix%' will be essentially 1/ndistinct where
ndistinct is the estimated number of distinct values in the column,
because that's what eqsel() does in such cases.

I attach a proposed patch against HEAD for this. It's a bit long
but most of the bulk is refactoring eqsel() to make it easy to use from
prefix_selectivity(). The intellectual content is just in the last
diff hunk.

To help out Peter's client, who's running 8.1.x, we'd have to backpatch
at least that far. We backpatched the last round of LIKE selectivity
fixes to 8.1, so I don't have too much hesitation about doing the same
here.

Comments?

regards, tom lane

Attachment Content-Type Size
like_minimum_estimate.patch.gz application/octet-stream 3.6 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-patches(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Minimum selectivity estimate for LIKE 'prefix%'
Date: 2008-03-07 18:29:14
Message-ID: 200803071929.15837.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Am Donnerstag, 6. März 2008 schrieb Tom Lane:
> To help out Peter's client, who's running 8.1.x, we'd have to backpatch
> at least that far.  We backpatched the last round of LIKE selectivity
> fixes to 8.1, so I don't have too much hesitation about doing the same
> here.

Thanks for the offer. Personally, I'd be OK with not backpatching and dealing
with the patch myself. We've patched that area before, but I'm not sure what
the original cause of that was, but this here sounds more like a feature
improvement.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-patches(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Minimum selectivity estimate for LIKE 'prefix%'
Date: 2008-03-17 16:10:41
Message-ID: 200803171710.41400.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Am Donnerstag, 6. März 2008 schrieb Tom Lane:
> I attach a proposed patch against HEAD for this.  It's a bit long
> but most of the bulk is refactoring eqsel() to make it easy to use from
> prefix_selectivity().  The intellectual content is just in the last
> diff hunk.

Looking at the patch as committed here
<http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/selfuncs.c.diff?r1=1.243;r2=1.244>,
I am confused by one hunk:

*************** prefix_selectivity(VariableStatData *var
*** 4452,4468 ****
* "x < greaterstr".
*-------
*/
- cmpopr = get_opfamily_member(opfamily, vartype, vartype,
- BTLessStrategyNumber);
- if (cmpopr == InvalidOid)
- elog(ERROR, "no < operator for opfamily %u", opfamily);
- fmgr_info(get_opcode(cmpopr), &opproc);
-
greaterstrcon = make_greater_string(prefixcon, &opproc);
if (greaterstrcon)
{
Selectivity topsel;

topsel = ineq_histogram_selectivity(vardata, &opproc, false,
greaterstrcon->constvalue,
greaterstrcon->consttype);
--- 4503,4519 ----
* "x < greaterstr".
*-------
*/
greaterstrcon = make_greater_string(prefixcon, &opproc);
if (greaterstrcon)
{
Selectivity topsel;

+ cmpopr = get_opfamily_member(opfamily, vartype, vartype,
+ BTLessStrategyNumber);
+ if (cmpopr == InvalidOid)
+ elog(ERROR, "no < operator for opfamily %u", opfamily);
+ fmgr_info(get_opcode(cmpopr), &opproc);
+
topsel = ineq_histogram_selectivity(vardata, &opproc, false,
greaterstrcon->constvalue,
greaterstrcon->consttype);

make_greater_string() is documented as

* The caller must provide the appropriate "less than" comparison function
* for testing the strings.

but using the new code flow it appears that opproc would still refer to
the >= operator looked up earlier.

The 8.1 code calls this variable ltproc because it didn't need the >= operator.
Perhaps using two variables would also be clearer here.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Minimum selectivity estimate for LIKE 'prefix%'
Date: 2008-03-17 16:56:58
Message-ID: 14389.1205773018@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Am Donnerstag, 6. Mrz 2008 schrieb Tom Lane:
>> I attach a proposed patch against HEAD for this. It's a bit long
>> but most of the bulk is refactoring eqsel() to make it easy to use from
>> prefix_selectivity(). The intellectual content is just in the last
>> diff hunk.

> Looking at the patch as committed here
> <http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/selfuncs.c.diff?r1=1.243;r2=1.244>,
> I am confused by one hunk:

Ack, that's a bug --- I was looking at the use of opproc by
ineq_histogram_selectivity and failed to notice make_greater_string
using it too. Will fix.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [PATCHES] Minimum selectivity estimate for LIKE 'prefix%'
Date: 2008-03-31 10:02:47
Message-ID: 200803311202.47718.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Am Donnerstag, 6. März 2008 schrieb Tom Lane:
> What I propose doing about this is a small variant on Peter's original
> suggestion: compute the estimated selectivity for
>         col = 'prefix'
> and clamp the result of prefix_selectivity to be at least that.

OK, first results with this patch are in: The selectivity estimations are
adjusted nicely, but the cost calculation doesn't change at all. Before:

Index Scan using foo_idx_3 on foo foo (cost=0.01..6.03 rows=1 width=8)

After:

Index Scan using foo_idx_3 on foo foo (cost=0.01..6.03 rows=627 width=8)

How is that possible?

Btw., the corresponding query plan for the LIKE 'constant' case is:

Index Scan using foo_idx_3 on foo foo (cost=0.00..2527.84 rows=627 width=8)

This is what we had hoped to get in the "after" case.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Minimum selectivity estimate for LIKE 'prefix%'
Date: 2008-03-31 14:03:49
Message-ID: 27057.1206972229@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> OK, first results with this patch are in: The selectivity estimations are
> adjusted nicely, but the cost calculation doesn't change at all. Before:

I've forgotten the context ... what's the whole query and plan again?
And which PG version exactly?

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Minimum selectivity estimate for LIKE 'prefix%'
Date: 2008-03-31 16:33:26
Message-ID: 200803311833.27131.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Am Montag, 31. März 2008 schrieb Tom Lane:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> > OK, first results with this patch are in: The selectivity estimations are
> > adjusted nicely, but the cost calculation doesn't change at all. Before:
>
> I've forgotten the context ... what's the whole query and plan again?
> And which PG version exactly?

Please see http://archives.postgresql.org/pgsql-hackers/2008-01/msg00048.php


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Minimum selectivity estimate for LIKE 'prefix%'
Date: 2008-03-31 17:16:01
Message-ID: 14836.1206983761@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Am Montag, 31. Mrz 2008 schrieb Tom Lane:
>> I've forgotten the context ... what's the whole query and plan again?
>> And which PG version exactly?

> Please see http://archives.postgresql.org/pgsql-hackers/2008-01/msg00048.php

Hm. Now that I think about it, the index scan cost estimate is made
using a separate estimate of rows fetched (since this will depend on the
specific index qual clauses in use, whereas the overall row estimate for
the relation doesn't vary with index). For the case at hand, the index
quals that it's looking at are the >= and < clauses with close-together
comparison values, and so it comes out with a rock-bottom rowcount
estimate. The clamping occuring over in prefix_selectivity isn't
relevant here.

Your original complaint was that the bad overall rowcount estimate was
leading to a bad join plan, and that should be fixed even though the
cost estimate for the indexscan itself is unrealistically small.

Changing the indexscan cost estimate would require patching the main
range-constraint-estimation code in clausesel.c. I don't see any very
good fix for that, since it has to deal with much more general cases
than this. In particular it doesn't really know whether it's dealing
with >= or >.

regards, tom lane