Plan for relatively simple query seems to be very inefficient

From: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
To: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Plan for relatively simple query seems to be very inefficient
Date: 2005-04-06 16:52:35
Message-ID: 425413D3.5030304@vulcanus.its.tudelft.nl
Lists: pgsql-hackers pgsql-performance

Hi list,

I noticed on a forum a query taking a surprisingly large amount of time
in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much
better. To my surprise PostgreSQL was ten times worse on the same
machine! And I don't understand why.

I don't really need this query to be fast since I don't use it, but the
range-thing is not really an uncommon query I suppose. So I'm wondering
why it is so slow and this may point to a wrong plan being chosen or
generated.

Here are table definitions:

Table "public.postcodes"
Column | Type | Modifiers
-------------+---------------+-----------
postcode_id | smallint | not null
range_from | smallint |
range_till | smallint |
Indexes:
"postcodes_pkey" PRIMARY KEY, btree (postcode_id)
"range" UNIQUE, btree (range_from, range_till)

Table "public.data_main"
Column | Type | Modifiers
--------+----------+-----------
userid | integer | not null
range | smallint |
Indexes:
"data_main_pkey" PRIMARY KEY, btree (userid)

And here's the query I ran:

SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range BETWEEN p.range_from AND p.range_till

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=332586.85..332586.85 rows=1 width=0) (actual time=22712.038..22712.039 rows=1 loops=1)
   ->  Nested Loop  (cost=3.76..328945.96 rows=1456356 width=0) (actual time=0.054..22600.826 rows=82688 loops=1)
         Join Filter: (("outer".range >= "inner".range_from) AND ("outer".range <= "inner".range_till))
         ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.020..136.930 rows=81920 loops=1)
         ->  Materialize  (cost=3.76..5.36 rows=160 width=4) (actual time=0.001..0.099 rows=160 loops=81920)
               ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=4) (actual time=0.010..0.396 rows=160 loops=1)
 Total runtime: 22712.211 ms

When I do something completely bogus, which couples each record from
data_main to one record from postcodes, it is still not very fast, but
acceptable:

SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range / 10 = p.postcode_id

                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10076.98..10076.98 rows=1 width=0) (actual time=1456.016..1456.017 rows=1 loops=1)
   ->  Merge Join  (cost=8636.81..9913.13 rows=65537 width=0) (actual time=1058.105..1358.571 rows=81920 loops=1)
         Merge Cond: ("outer".postcode_id = "inner"."?column2?")
         ->  Index Scan using postcodes_pkey on postcodes p  (cost=0.00..5.76 rows=160 width=2) (actual time=0.034..0.507 rows=160 loops=1)
         ->  Sort  (cost=8636.81..8841.61 rows=81920 width=2) (actual time=1057.698..1169.879 rows=81920 loops=1)
               Sort Key: (dm.range / 10)
               ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.020..238.886 rows=81920 loops=1)
 Total runtime: 1461.156 ms

Doing something similarly bogus, but with fewer results, is much faster,
even though it should have basically the same plan:

SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range = p.postcode_id

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2138.63..2138.63 rows=1 width=0) (actual time=180.667..180.668 rows=1 loops=1)
   ->  Hash Join  (cost=4.00..2087.02 rows=20642 width=0) (actual time=180.645..180.645 rows=0 loops=1)
         Hash Cond: ("outer".range = "inner".postcode_id)
         ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.005..105.548 rows=81920 loops=1)
         ->  Hash  (cost=3.60..3.60 rows=160 width=2) (actual time=0.592..0.592 rows=0 loops=1)
               ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=2) (actual time=0.025..0.349 rows=160 loops=1)
 Total runtime: 180.807 ms
(7 rows)

If you'd like to toy around with the datasets on your heavily optimized
PostgreSQL installs, let me know. The data was just generated for
testing purposes and I'd happily send a copy to anyone interested.

Best regards,

Arjen van der Meijden


From: Steve Atkins <steve(at)blighty(dot)com>
To: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Plan for relatively simple query seems to be very inefficient
Date: 2005-04-06 17:04:12
Message-ID: 20050406170412.GB22693@gp.word-to-the-wise.com
Lists: pgsql-hackers pgsql-performance

On Wed, Apr 06, 2005 at 06:52:35PM +0200, Arjen van der Meijden wrote:
> Hi list,
>
> I noticed on a forum a query taking a surprisingly large amount of time
> in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much
> better. To my surprise PostgreSQL was ten times worse on the same
> machine! And I don't understand why.
>
> I don't really need this query to be fast since I don't use it, but the
> range-thing is not really an uncommon query I suppose. So I'm wondering
> why it is so slow and this may point to a wrong plan being chosen or
> generated.

That's the wrong index type for fast range queries. You really need
something like GiST or rtree for that. I do something similar in
production and queries are down at the millisecond level with the
right index.
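
For illustration, here's a sketch of that kind of index using the range
types that modern PostgreSQL provides (they didn't exist in the 8.0.1
under discussion; there you'd reach for contrib/seg or an rtree over
boxes, and the details differ):

CREATE INDEX postcodes_range_gist ON postcodes
    USING gist (int4range(range_from, range_till, '[]'));

SELECT COUNT(*)
FROM data_main AS dm
JOIN postcodes AS p
    ON int4range(p.range_from, p.range_till, '[]') @> dm.range::integer;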

Cheers,
Steve

> Here are table definitions:
>
> Table "public.postcodes"
> Column | Type | Modifiers
> -------------+---------------+-----------
> postcode_id | smallint | not null
> range_from | smallint |
> range_till | smallint |
> Indexes:
> "postcodes_pkey" PRIMARY KEY, btree (postcode_id)
> "range" UNIQUE, btree (range_from, range_till)
>
> Table "public.data_main"
> Column | Type | Modifiers
> --------+----------+-----------
> userid | integer | not null
> range | smallint |
> Indexes:
> "data_main_pkey" PRIMARY KEY, btree (userid)
>
> And here's the query I ran:
>
> SELECT COUNT(*) FROM
> data_main AS dm,
> postcodes AS p
> WHERE dm.range BETWEEN p.range_from AND p.range_till


From: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
To: Steve Atkins <steve(at)blighty(dot)com>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Plan for relatively simple query seems to be very inefficient
Date: 2005-04-06 17:40:29
Message-ID: 42541F0D.9010009@vulcanus.its.tudelft.nl
Lists: pgsql-hackers pgsql-performance

On 6-4-2005 19:04, Steve Atkins wrote:
> On Wed, Apr 06, 2005 at 06:52:35PM +0200, Arjen van der Meijden wrote:
>
>>Hi list,
>>
>>I noticed on a forum a query taking a surprisingly large amount of time
>>in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much
>>better. To my surprise PostgreSQL was ten times worse on the same
>>machine! And I don't understand why.
>>
>>I don't really need this query to be fast since I don't use it, but the
>>range-thing is not really an uncommon query I suppose. So I'm wondering
>>why it is so slow and this may point to a wrong plan being chosen or
>>generated.
>
>
> That's the wrong index type for fast range queries. You really need
> something like GiST or rtree for that. I do something similar in
> production and queries are down at the millisecond level with the
> right index.

That may be, but since that table is only two pages, the index would
probably not be used even if it were rtree or GiST?
Btw, given that "access method "rtree" does not support multicolumn
indexes", I'd need another way of storing it as well? Plus rtree doesn't
support < and >, so the query would have to be changed to match the way
ranges are checked.

I'm not sure if the dataset is really suitable for other range checks.
It is a linear set of postal codes grouped by their number (range_from
to range_till) into regions, and the query basically joins the region to
each record of a user table. Of course one could use lines on the
x-axis and define the postal code of a specific user as a point on one
of those lines...

But nonetheless, /this/ query should be "not that slow" either, right?

Arjen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Plan for relatively simple query seems to be very inefficient
Date: 2005-04-06 17:42:14
Message-ID: 28552.1112809334@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl> writes:
> I noticed on a forum a query taking a surprisingly large amount of time
> in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much
> better. To my surprise PostgreSQL was ten times worse on the same
> machine! And I don't understand why.

Wrong index ... what you probably could use here is an index on
data_main.range, so that the query could run with postcodes as the
outer side. I get such a plan by default with empty tables:

 Aggregate  (cost=99177.80..99177.80 rows=1 width=0)
   ->  Nested Loop  (cost=0.00..98021.80 rows=462400 width=0)
         ->  Seq Scan on postcodes p  (cost=0.00..30.40 rows=2040 width=4)
         ->  Index Scan using rangei on data_main dm  (cost=0.00..44.63 rows=227 width=2)
               Index Cond: ((dm.range >= "outer".range_from) AND (dm.range <= "outer".range_till))

but I'm not sure if the planner would prefer it with the tables loaded
up. (It might not be the right thing anyway ... but seems worth
trying.)
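
(For reference, "rangei" in that plan is nothing special, just a plain
btree on the column, created as below; the name is only what I happened
to call it.)

CREATE INDEX rangei ON data_main (range);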

Given the relatively small size of the postcodes table, and the fact
that each data_main row seems to join to about one postcodes row,
it's possible that what the planner did for you was actually the
optimal thing anyhow. I'm not sure that any range-capable index would
be faster than just scanning through 160 entries in memory ...

regards, tom lane


From: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Plan for relatively simple query seems to be very inefficient
Date: 2005-04-06 18:00:09
Message-ID: 425423A9.9000504@vulcanus.its.tudelft.nl
Lists: pgsql-hackers pgsql-performance

On 6-4-2005 19:42, Tom Lane wrote:
> Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl> writes:
>
>>I noticed on a forum a query taking a surprisingly large amount of time
>>in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much
>>better. To my surprise PostgreSQL was ten times worse on the same
>>machine! And I don't understand why.
>
>
> Wrong index ... what you probably could use here is an index on
> data_main.range, so that the query could run with postcodes as the
> outer side. I get such a plan by default with empty tables:
>
> Aggregate (cost=99177.80..99177.80 rows=1 width=0)
> -> Nested Loop (cost=0.00..98021.80 rows=462400 width=0)
> -> Seq Scan on postcodes p (cost=0.00..30.40 rows=2040 width=4)
> -> Index Scan using rangei on data_main dm (cost=0.00..44.63 rows=227 width=2)
> Index Cond: ((dm.range >= "outer".range_from) AND (dm.range <= "outer".range_till))
>
> but I'm not sure if the planner would prefer it with the tables loaded
> up. (It might not be the right thing anyway ... but seems worth
> trying.)

No, it didn't prefer it.

> Given the relatively small size of the postcodes table, and the fact
> that each data_main row seems to join to about one postcodes row,
> it's possible that what the planner did for you was actually the
> optimal thing anyhow. I'm not sure that any range-capable index would
> be faster than just scanning through 160 entries in memory ...
>
> regards, tom lane

Yep, there is only one or, in corner cases, two postcode ranges per
postcode. Actually it should be only one, but my generated data is not
perfect.
But the sequential scan per record is not really what surprises me;
since the postcode table is only two pages of data, I didn't really
expect otherwise.
It is the fact that it takes 22 seconds that surprises me, especially
since the two other examples on the same data, which touch about the
same number of records per table, take only 1.4 and 0.18 seconds.

Best regards,

Arjen


From: Mischa <mischa(dot)Sandberg(at)telus(dot)net>
To: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Plan for relatively simple query seems to be very inefficient
Date: 2005-04-06 18:35:53
Message-ID: 1112812553.42542c091d8ce@webmail.telus.net
Lists: pgsql-hackers pgsql-performance

Quoting Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>:

> Hi list,
>
> I noticed on a forum a query taking a surprisingly large amount of time
> in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much
> better. To my surprise PostgreSQL was ten times worse on the same
> machine! And I don't understand why.
>
> I don't really need this query to be fast since I don't use it, but the
> range-thing is not really an uncommon query I suppose. So I'm wondering
> why it is so slow and this may point to a wrong plan being chosen or
> generated.
>
> Here are table definitions:
>
> Table "public.postcodes"
> Column | Type | Modifiers
> -------------+---------------+-----------
> postcode_id | smallint | not null
> range_from | smallint |
> range_till | smallint |
> Indexes:
> "postcodes_pkey" PRIMARY KEY, btree (postcode_id)
> "range" UNIQUE, btree (range_from, range_till)
>
> Table "public.data_main"
> Column | Type | Modifiers
> --------+----------+-----------
> userid | integer | not null
> range | smallint |
> Indexes:
> "data_main_pkey" PRIMARY KEY, btree (userid)
>
> And here's the query I ran:
>
> SELECT COUNT(*) FROM
> data_main AS dm,
> postcodes AS p
> WHERE dm.range BETWEEN p.range_from AND p.range_till

I just posted an answer to this (via webcafe webmail; can't recall which
pg-list), that might interest you.

BTree indexes as they stand (multi-column, ...) answer what most people
need for queries. Unfortunately, out of the box, they have no good way of
handling range queries. To compensate, you can use a small amount of kinky
SQL. This is along the same lines as the tricks used to implement
hierarchical queries in relational SQL.

[1] Create a table "widths"(wid int) of powers of 2, up to what will just cover
max(range_till-range_from). Since your "range" column is a smallint, this table
can have no more than 15 rows. You can get as fussy as you want about keeping
this table to a minimum.
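
A minimal version of that table (generate_series exists as of 8.0; cap
the series wherever suits your data; 2^0 through 2^14 covers any
smallint width):

CREATE TABLE widths (wid int);
INSERT INTO widths SELECT (2 ^ n)::int FROM generate_series(0, 14) AS g(n);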

[2] Change postcodes. (ALTER TABLE ... ADD takes no USING clause, so add
the columns and fill them with UPDATEs; note that LOG takes its base as
the first argument, that the width is range_till - range_from, and that
this assumes range_till > range_from.)

ALTER TABLE postcodes ADD wid INT;
ALTER TABLE postcodes ADD start INT;
UPDATE postcodes
SET wid = (2 ^ CEIL(LOG(2, range_till - range_from)))::int;
UPDATE postcodes
SET start = range_from - (range_from % wid);
CREATE INDEX postcodes_wid_start_index ON postcodes (wid, start);
ANALYZE postcodes;

[3] Write your query as (wid and start live on postcodes, hence the p.
qualifications):
SELECT COUNT(*)
FROM data_main AS dm
CROSS JOIN widths -- yes, CROSS JOIN. For once, it HELPS performance.
JOIN postcodes AS p
ON p.wid = widths.wid AND p.start = dm.range - dm.range % widths.wid
WHERE dm.range BETWEEN p.range_from AND p.range_till

This uses BTREE exact-match to make a tight restriction on which rows to check.
YMMV, but this has worked even for multi-M table joins.

--
"Dreams come true, not free."


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Plan for relatively simple query seems to be very inefficient
Date: 2005-04-06 20:51:30
Message-ID: 5999.1112820690@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl> writes:
> On 6-4-2005 19:42, Tom Lane wrote:
>> Wrong index ... what you probably could use here is an index on
>> data_main.range, so that the query could run with postcodes as the
>> outer side. I get such a plan by default with empty tables:
>> but I'm not sure if the planner would prefer it with the tables loaded
>> up. (It might not be the right thing anyway ... but seems worth
>> trying.)

> No it didn't prefer it.

Planner error ... because it doesn't have any good way to estimate the
number of matching rows, it thinks that way is a bit more expensive than
data_main as the outside, but in reality it seems a good deal cheaper:

arjen=# set enable_seqscan TO 1;
SET
arjen=# explain analyze
arjen-# SELECT COUNT(*) FROM data_main AS dm, postcodes AS p WHERE dm.range BETWEEN p.range_from AND p.range_till;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=332586.85..332586.85 rows=1 width=0) (actual time=143999.678..143999.683 rows=1 loops=1)
-> Nested Loop (cost=3.76..328945.96 rows=1456356 width=0) (actual time=0.211..143549.461 rows=82688 loops=1)
Join Filter: (("outer".range >= "inner".range_from) AND ("outer".range <= "inner".range_till))
-> Seq Scan on data_main dm (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.059..663.065 rows=81920 loops=1)
-> Materialize (cost=3.76..5.36 rows=160 width=4) (actual time=0.004..0.695 rows=160 loops=81920)
-> Seq Scan on postcodes p (cost=0.00..3.60 rows=160 width=4) (actual time=0.028..1.589 rows=160 loops=1)
Total runtime: 144000.415 ms
(7 rows)
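
(Notice where the time goes in that plan: the Materialize node is
rescanned once per outer row, 81920 loops over 160 postcodes entries,
which means the join filter is evaluated about 13 million times.)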

arjen=# set enable_seqscan TO 0;
SET
arjen=# explain analyze
arjen-# SELECT COUNT(*) FROM data_main AS dm, postcodes AS p WHERE dm.range BETWEEN p.range_from AND p.range_till;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=100336307.18..100336307.18 rows=1 width=0) (actual time=2367.097..2367.102 rows=1 loops=1)
-> Nested Loop (cost=100000000.00..100332666.28 rows=1456356 width=0) (actual time=0.279..1918.890 rows=82688 loops=1)
-> Seq Scan on postcodes p (cost=100000000.00..100000003.60 rows=160 width=4) (actual time=0.060..1.381 rows=160 loops=1)
-> Index Scan using dm_range on data_main dm (cost=0.00..1942.60 rows=9103 width=2) (actual time=0.034..7.511 rows=517 loops=160)
Index Cond: ((dm.range >= "outer".range_from) AND (dm.range <= "outer".range_till))
Total runtime: 2368.056 ms
(6 rows)

(this machine is slower than yours, plus I have profiling enabled still...)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
Cc: pgsql-hackers(at)postgreSQL(dot)org, pgsql-performance(at)postgreSQL(dot)org
Subject: Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-06 22:09:37
Message-ID: 6326.1112825377@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

I wrote:
> Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl> writes:
>> SELECT COUNT(*) FROM
>> data_main AS dm,
>> postcodes AS p
>> WHERE dm.range BETWEEN p.range_from AND p.range_till

> Planner error ... because it doesn't have any good way to estimate the
> number of matching rows, it thinks that way is a bit more expensive than
> data_main as the outside, but in reality it seems a good deal cheaper:

BTW, it would get the right answer if it had recognized the WHERE clause
as a range restriction --- it still doesn't know exactly what fraction
of rows will match, but its default estimate is a great deal tighter for
"WHERE x > something AND x < somethingelse" than it is for two unrelated
inequality constraints. Enough tighter that it would have gone for the
correct plan.
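
(You can see the defaults at work in the plans upthread: at the default
selectivity of 1/3 per inequality, the join is estimated at
81920 * 160 / 9 = 1456356 rows, which is exactly the nested-loop row
estimate in the EXPLAIN output. Recognizing the pair as a range
restriction would shrink that estimate dramatically.)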

The problem is that it doesn't recognize the WHERE as a range constraint
on dm.range. I thought for a moment that this might be a
recently-introduced bug, but actually the code is operating as designed:
clauselist_selectivity says

* See if it looks like a restriction clause with a pseudoconstant
* on one side. (Anything more complicated than that might not
* behave in the simple way we are expecting.)

"Pseudoconstant" in this context means "a constant, parameter symbol, or
non-volatile functions of these" ... so comparisons against values from
another table don't qualify. It seems like we're missing a bet though.

Can anyone suggest a more general rule? Do we need for example to
consider whether the relation membership is the same in two clauses
that might be opposite sides of a range restriction? It seems like

a.x > b.y AND a.x < b.z

probably can be treated as a range restriction on a.x for this purpose,
but I'm much less sure that the same is true of

a.x > b.y AND a.x < c.z

Thoughts?

regards, tom lane


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>, pgsql-hackers(at)postgreSQL(dot)org, pgsql-performance(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-06 22:25:36
Message-ID: 20050406222536.GL93835@decibel.org
Lists: pgsql-hackers pgsql-performance

On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
> Can anyone suggest a more general rule? Do we need for example to
> consider whether the relation membership is the same in two clauses
> that might be opposite sides of a range restriction? It seems like
>
> a.x > b.y AND a.x < b.z

In a case like this, you could actually look at the data in b and see
what the average range size is. If you wanted to get really fancy, the
optimizer could decide how best to access a based on each row of b.

> probably can be treated as a range restriction on a.x for this purpose,
> but I'm much less sure that the same is true of
>
> a.x > b.y AND a.x < c.z

Well, this could end up being much trickier, since who knows how b and c
are related. Though thinking about it, while I threw out the
row-by-row analysis idea to be glib, it would actually work in this
case: you could take a look at what b and c look like each time 'through
the loop'.
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>, pgsql-hackers(at)postgreSQL(dot)org, pgsql-performance(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-06 22:35:10
Message-ID: 6594.1112826910@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

"Jim C. Nasby" <decibel(at)decibel(dot)org> writes:
> On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
>> Can anyone suggest a more general rule? Do we need for example to
>> consider whether the relation membership is the same in two clauses
>> that might be opposite sides of a range restriction? It seems like
>>
>> a.x > b.y AND a.x < b.z

> In a case like this, you could actually look at the data in b and see
> what the average range size is.

Not with the current statistics --- you'd need some kind of cross-column
statistics involving both y and z. (That is, I doubt it would be
helpful to estimate the average range width by taking the difference of
independently-calculated mean values of y and z ...) But yeah, in
principle it would be possible to make a non-default estimate.

regards, tom lane


From: John A Meinel <john(at)arbash-meinel(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] Recognizing range constraints (was Re: Plan
Date: 2005-04-06 22:54:07
Message-ID: 4254688F.6070009@arbash-meinel.com
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:
> "Jim C. Nasby" <decibel(at)decibel(dot)org> writes:
>
>>On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
>>
>>>Can anyone suggest a more general rule? Do we need for example to
>>>consider whether the relation membership is the same in two clauses
>>>that might be opposite sides of a range restriction? It seems like
>>>
>>>a.x > b.y AND a.x < b.z
>
>
>>In a case like this, you could actually look at the data in b and see
>>what the average range size is.
>
>
> Not with the current statistics --- you'd need some kind of cross-column
> statistics involving both y and z. (That is, I doubt it would be
> helpful to estimate the average range width by taking the difference of
> independently-calculated mean values of y and z ...) But yeah, in
> principle it would be possible to make a non-default estimate.
>
> regards, tom lane

Actually, I think he was saying do a nested loop, and for each item in
the nested loop, re-evaluate if an index or a sequential scan is more
efficient.

I don't think postgres re-plans once it has started, though you could
test this in a plpgsql function.

John
=:->


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: John A Meinel <john(at)arbash-meinel(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-06 23:06:34
Message-ID: 6827.1112828794@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

John A Meinel <john(at)arbash-meinel(dot)com> writes:
> Actually, I think he was saying do a nested loop, and for each item in
> the nested loop, re-evaluate if an index or a sequential scan is more
> efficient.

> I don't think postgres re-plans once it has started, though you could
> test this in a plpgsql function.

It doesn't, and in any case that's a microscopic view of the issue.
The entire shape of the plan might change depending on what we think
the selectivity is --- much more than could be handled by switching
scan types at the bottom level.

Also, I anticipate that bitmap-driven index scans will change things
considerably here. The range of usefulness of pure seqscans will
drop drastically...

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>, pgsql-hackers(at)postgreSQL(dot)org, pgsql-performance(at)postgreSQL(dot)org
Subject: Re: Recognizing range constraints (was Re: Plan for
Date: 2005-04-06 23:24:46
Message-ID: 1112829886.16721.1104.camel@localhost.localdomain
Lists: pgsql-hackers pgsql-performance

On Wed, 2005-04-06 at 18:09 -0400, Tom Lane wrote:
> I wrote:
> > Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl> writes:
> >> SELECT COUNT(*) FROM
> >> data_main AS dm,
> >> postcodes AS p
> >> WHERE dm.range BETWEEN p.range_from AND p.range_till
>
> > Planner error ... because it doesn't have any good way to estimate the
> > number of matching rows, it thinks that way is a bit more expensive than
> > data_main as the outside, but in reality it seems a good deal cheaper:
>
> BTW, it would get the right answer if it had recognized the WHERE clause
> as a range restriction --- it still doesn't know exactly what fraction
> of rows will match, but its default estimate is a great deal tighter for
> "WHERE x > something AND x < somethingelse" than it is for two unrelated
> inequality constraints. Enough tighter that it would have gone for the
> correct plan.
>
> The problem is that it doesn't recognize the WHERE as a range constraint
> on dm.range.

> Can anyone suggest a more general rule? Do we need for example to
> consider whether the relation membership is the same in two clauses
> that might be opposite sides of a range restriction? It seems like
>
> a.x > b.y AND a.x < b.z

Not sure we need a more general rule. There are only three ways to view
this pair of clauses:
i) it's a range constraint, i.e. BETWEEN
ii) it's the complement of that, i.e. NOT BETWEEN
iii) it's a mistake, but we're not allowed to take that path

Arjen's query, and your generalisation of it above, is a common type of
query: a lookup against a reference data table with begin/end effective
dates. It would be very useful if this were supported.

> probably can be treated as a range restriction on a.x for this purpose,
> but I'm much less sure that the same is true of
>
> a.x > b.y AND a.x < c.z

I can't think of a query that would use such a construct, and might even
conclude that it points to a very poorly normalised model. I would
suggest that this is much less common in practical use.

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>, pgsql-hackers(at)postgreSQL(dot)org, pgsql-performance(at)postgreSQL(dot)org
Subject: Re: Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-07 14:20:24
Message-ID: 16907.1112883624@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Can anyone suggest a more general rule?

> I think it makes sense to guess that a smaller fraction of the rows will
> be returned when a column value is bounded above and below than if it
> is only bounded on one side, even if the bounds aren't fixed. You can
> certainly be wrong.

Yeah, the whole thing is only a heuristic anyway. I've been coming
around to the view that relation membership shouldn't matter, because
of cases like

WHERE a.x > b.y AND a.x < 42

which surely should be taken as a range constraint.

regards, tom lane


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>, pgsql-hackers(at)postgreSQL(dot)org, pgsql-performance(at)postgreSQL(dot)org
Subject: Re: Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-07 14:31:20
Message-ID: 20050407143120.GA29575@wolff.to
Lists: pgsql-hackers pgsql-performance

On Wed, Apr 06, 2005 at 18:09:37 -0400,
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Can anyone suggest a more general rule? Do we need for example to
> consider whether the relation membership is the same in two clauses
> that might be opposite sides of a range restriction? It seems like
>
> a.x > b.y AND a.x < b.z
>
> probably can be treated as a range restriction on a.x for this purpose,
> but I'm much less sure that the same is true of
>
> a.x > b.y AND a.x < c.z
>
> Thoughts?

I think it makes sense to guess that a smaller fraction of the rows will
be returned when a column value is bounded above and below than if it
is only bounded on one side, even if the bounds aren't fixed. You can
certainly be wrong. The difference between this and the normal case is that
column statistics aren't normally going to be that useful.

If date/time ranges are the common use for this construct, it might be better
to create date and/or time range types that use rtree or gist indexes.
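
For instance, something along these lines (a hypothetical sketch,
written with the range-type-plus-GiST syntax that PostgreSQL eventually
grew many releases after this thread):

CREATE TABLE effective (id int, valid tstzrange);
CREATE INDEX effective_valid_gist ON effective USING gist (valid);

SELECT id FROM effective WHERE valid @> now();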


From: Mischa <mischa(dot)Sandberg(at)telus(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-07 21:26:38
Message-ID: 1112909198.4255a58eae0d8@webmail.telus.net
Lists: pgsql-hackers pgsql-performance

Quoting Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> Yeah, the whole thing is only a heuristic anyway. I've been coming
> around to the view that relation membership shouldn't matter, because
> of cases like
>
> WHERE a.x > b.y AND a.x < 42
>
> which surely should be taken as a range constraint.

Out of curiosity, will the planner induce "b.y < 42" out of this?

--
"Dreams come true, not free."


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>, pgsql-hackers(at)postgreSQL(dot)org, pgsql-performance(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-07 21:40:15
Message-ID: 20050407214015.GT93835@decibel.org
Lists: pgsql-hackers pgsql-performance

On Wed, Apr 06, 2005 at 06:35:10PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <decibel(at)decibel(dot)org> writes:
> > On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
> >> Can anyone suggest a more general rule? Do we need for example to
> >> consider whether the relation membership is the same in two clauses
> >> that might be opposite sides of a range restriction? It seems like
> >>
> >> a.x > b.y AND a.x < b.z
>
> > In a case like this, you could actually look at the data in b and see
> > what the average range size is.
>
> Not with the current statistics --- you'd need some kind of cross-column
> statistics involving both y and z. (That is, I doubt it would be
> helpful to estimate the average range width by taking the difference of
> independently-calculated mean values of y and z ...) But yeah, in
> principle it would be possible to make a non-default estimate.

Actually, it might be possible to take a SWAG at it using the histogram
and correlation stats.

You know... since getting universally useful cross-column stats seems
to be pretty pie-in-the-sky, would it be possible to generate more
complex stats on the fly from a sampling of a table?
at a fairly sizeable table ISTM it would be worth sampling the rows on
10 or 20 random pages to see what you get. In this case, you'd want to
know the average difference between two fields. Other queries might want
something different.
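
(For the example in this thread, the statistic wanted would be roughly

SELECT avg(range_till - range_from) FROM postcodes;

computed from a sample of pages rather than the whole table.)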
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mischa <mischa(dot)Sandberg(at)telus(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-07 23:58:56
Message-ID: 23152.1112918336@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Mischa <mischa(dot)Sandberg(at)telus(dot)net> writes:
> Quoting Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
>> WHERE a.x > b.y AND a.x < 42

> Out of curiosity, will the planner induce "b.y < 42" out of this?

No. There's some smarts about transitive equality, but none about
transitive inequalities. Offhand I'm not sure if it'd be useful to add
such. The transitive-equality code pulls its weight because you so
often have situations like

create view v as select a.x, ... from a join b on (a.x = b.y);

select * from v where x = 42;

but I'm less able to think of common use-cases for transitive
inequality ...

regards, tom lane


From: a3a18850(at)telus(dot)net
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mischa <mischa(dot)Sandberg(at)telus(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Date: 2005-04-08 00:30:11
Message-ID: 1112920211.4255d0934e23d@webmail.telus.net
Lists: pgsql-hackers pgsql-performance

Quoting Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> Mischa <mischa(dot)Sandberg(at)telus(dot)net> writes:
> > Quoting Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> >> WHERE a.x > b.y AND a.x < 42
>
> > Out of curiosity, will the planner induce "b.y < 42" out of this?
>
> No. There's some smarts about transitive equality, but none about
> transitive inequalities. Offhand I'm not sure if it'd be useful to add
> such. The transitive-equality code pulls its weight [...]
> but I'm less able to think of common use-cases for transitive
> inequality ...

Thanks. My apologies for not just going and looking at the code first.

Equality transitivity: yes, worth its weight in gold.
Inequality transitivity: I see it in OLAP queries (usually ranges), or in
queries against big UNION ALL views, where const-false inequalities are
the norm.
"a.x > b.y and a.x < c.z" comes up in OLAP, too, usually inside an EXISTS(...),
where you are doing something analogous to finding a path.