near identical queries have vastly different plans

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: near identical queries have vastly different plans
Date: 2011-06-30 08:53:14
Message-ID: BANLkTin9E_2ZGOPoSvsewv2xHe6nv2Pnjg@mail.gmail.com
Lists: pgsql-performance

Here's the setup:

I'm cross joining two dimensions before left outer joining to a fact table
so that I can throw a default value into the result set wherever a value is
missing from the fact table. I have a time dimension and a sensor dimension.
I want the cross join to cross only a subset of rows from each dimension,
so I have some filters in the ON clause of the inner join.

I've got 2 nearly identical queries that perform incredibly differently.
The only difference between them is that the fast query filters the inner
join on two text columns from the sensor dimension, while the slow query
uses bigint id values, where the ids correlate to the text strings in the
fast query. The string columns are device_name and dpoint_name, where
device_name is unique but many devices have dpoints with the same name. The
bigint columns are device_id and dpoint_id, where the (device_id, dpoint_id)
pair identifies a single row. There are indexes on all of them, so the only
real difference is that an index on dpoint_name will have more rows for a
given value than the index on dpoint_id, because dpoints with the same name
still have different ids if attached to different devices. In both queries,
exactly 35 rows from the sensor dimension will be returned. Note also that
I'm aggregating fact table rows via avg(), because I have duplicate rows in
the fact table but want to extract only a single row for each combination of
time and sensor. The left outer join allows me to populate any missing rows
with a default value, and the aggregation via avg() collapses duplicate rows
so that each appears only once.
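
In skeleton form, both queries follow this pattern (abridged; the full
versions are further down):

SELECT t.local_key,
       s.device_name||'_'||s.dpoint_name,
       CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999   -- default for missing fact rows
            ELSE avg(f.dpoint_value)::numeric(10,2)         -- avg() collapses duplicate fact rows
       END as dpoint_value
FROM dimensions.sensor s
INNER JOIN dimensions.time_ny t
    ON <filters on s and t only>   -- no join condition between s and t, so this is a cross join
LEFT OUTER JOIN facts.bldg_1_thermal_fact f
    ON f.time_fk=t.local_key
    AND f.sensor_fk=s.sensor_pk
GROUP BY 1,2
ORDER BY 1,2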

I can easily just use the fast query, but my concern is that without
understanding why the queries are executing differently, I might suddenly
discover my code using the slow query plan instead of the fast one at some
point in the future, even when using the varchar columns instead of the
bigint ids for filtering. They differ in execution speed by about 5x (which
translates to many minutes), so that would be a big problem. If I could
figure out either a query structure or an index structure which will force
the fast query plan, I'd be much happier. So that is what I am looking for
- an explanation of how I might convince the planner to always use the fast
plan.

It's a CentOS host - quad-core AMD Opteron 1.6GHz, 2GB of RAM, single 75GB
disk with everything on it. I'm not looking for outright performance, just
relative performance between the 2 queries. DB config was taken wholesale
from pg_tune with no changes, IIRC. It isn't a production box. I have yet
to spec out production hardware for this application, so I don't know what
it will eventually be.

# select version();
                                              version
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit

name                         | current_setting
-----------------------------+-------------------------------------------------
checkpoint_completion_target | 0.9
checkpoint_segments          | 64
default_statistics_target    | 100
effective_cache_size         | 1408MB
lc_collate                   | en_US.UTF-8
lc_ctype                     | en_US.UTF-8
log_directory                | pg_log
log_filename                 | postgresql-%a.log
log_rotation_age             | 1d
log_rotation_size            | 0
log_truncate_on_rotation     | on
logging_collector            | on
maintenance_work_mem         | 240MB
max_connections              | 20
max_stack_depth              | 2MB
port                         | 5432
server_encoding              | UTF8
shared_buffers               | 480MB
TimeZone                     | UTC
wal_buffers                  | 32MB
work_mem                     | 48MB

The time dimension looks like this:

Column    | Type                        | Modifiers
----------+-----------------------------+-----------
time_zone | character varying(64)       |
tstamp    | timestamp without time zone |
tstamptz  | timestamp with time zone    |
local_key | bigint                      |
utc_key   | bigint                      |
Indexes:
    "idx_time_ny_local_key" btree (local_key)
    "idx_time_ny_tstamp" btree (tstamp) CLUSTER
    "idx_time_ny_tstamptz" btree (tstamptz)
    "idx_time_ny_utc_key" btree (utc_key)

plus lots of other columns (approx 25 columns, mostly integer) that aren't
relevant to this query. It has 420,480 rows, where each row is 300 seconds
after the previous row. local_key and utc_key are bigint columns in the form
yyyyMMddHHmm (utc_key in UTC and local_key in the local time zone represented
by the table). tstamp is a timestamp without time zone storing the same date
and time as local_key, in the table's local time zone, and tstamptz is just a
convenient column for when I need to do time zone conversions.
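
To make the key format concrete, local_key could be derived from tstamp like
this (an illustration of the format only - not necessarily how the table is
actually populated):

SELECT tstamp,
       to_char(tstamp, 'YYYYMMDDHH24MI')::bigint AS local_key
FROM dimensions.time_ny
LIMIT 5;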

The other dimension looks like this:

Column      | Type                   | Modifiers
------------+------------------------+------------------------------------------------------------
sensor_pk   | bigint                 | not null default nextval('sensor_sensor_pk_seq'::regclass)
building_id | bigint                 |
device_id   | bigint                 |
device_type | character varying(16)  |
device_name | character varying(64)  |
dpoint_id   | bigint                 |
dpoint_name | character varying(64)  |
Indexes:
    "sensor_pkey" PRIMARY KEY, btree (sensor_pk)
    "idx_sensor_device_id" btree (device_id)
    "idx_sensor_device_name" btree (device_name)
    "idx_sensor_device_type" btree (device_type)
    "idx_sensor_dpoint_id" btree (dpoint_id)
    "idx_sensor_dpoint_name" btree (dpoint_name)

There are other columns, but again, they aren't relevant - about 10 more
columns, half bigint and half varchar. Row count is less than 400 rows at
the moment, but it will potentially get much larger unless I partition on
building_id, which I may well do - in which case, row count will never
exceed 100,000 and rarely exceed 10,000.
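
If I do partition it, 8.4 only offers inheritance-based partitioning - the
same pattern the fact table below uses. A minimal sketch, with a hypothetical
child-table name:

CREATE TABLE dimensions.sensor_bldg_1 (
    CHECK (building_id = 1)  -- lets constraint exclusion skip children for other buildings
) INHERITS (dimensions.sensor);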

The fact table is as follows:

Table "facts.bldg_1_thermal_fact"
Column | Type | Modifiers
--------------+-----------------------------+---------------
time_fk | bigint | not null
sensor_fk | bigint | not null
tstamp | timestamp without time zone |
dpoint_value | real |
device_mode | bigint |

With actual data in child tables:

Table "facts.bldg_1_thermal_fact_20110501"
Column | Type | Modifiers
--------------+-----------------------------+---------------
time_fk | bigint | not null
sensor_fk | bigint | not null
tstamp | timestamp without time zone |
dpoint_value | real |
device_mode | bigint |
Indexes:
    "idx_bldg_1_thermal_fact_20110501_sensor_fk" btree (sensor_fk)
    "idx_bldg_1_thermal_fact_20110501_time_fk" btree (time_fk) CLUSTER
Check constraints:
    "bldg_1_thermal_fact_20110501_time_fk_check" CHECK (time_fk >= 201105010000::bigint AND time_fk < 201106010000::bigint)
    "bldg_1_thermal_fact_20110501_tstamp_check" CHECK (tstamp >= '2011-05-01 00:00:00'::timestamp without time zone AND tstamp < '2011-06-01 00:00:00'::timestamp without time zone)
Inherits: bldg_1_thermal_fact

Of the 2 partitions that exist at the moment, one contains 2 million rows,
none of which are relevant to the query in question. The other partition has
4 million rows and contains 100% of the rows returned by the query.

The fast query is as follows:

SELECT t.local_key,
       s.device_name||'_'||s.dpoint_name,
       CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
            ELSE avg(f.dpoint_value)::numeric(10,2)
       END as dpoint_value
FROM dimensions.sensor s
INNER JOIN dimensions.time_ny t
    ON s.building_id=1
    AND ((s.device_name='AHU_1M02'   AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_01'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_01'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_02A' AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_02A' AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_02'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_02'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_03'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_03'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_04'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_04'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_05A' AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_05A' AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_05'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_05'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_06'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_06'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_07'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_07'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_08'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_08'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_09'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_09'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_10'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_10'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_11'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_11'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_12'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_12'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_13'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_13'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_14'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_14'  AND s.dpoint_name='SpaceTemp')
      OR (s.device_name='VAV_1M_15'  AND s.dpoint_name='EffectSetPt')
      OR (s.device_name='VAV_1M_15'  AND s.dpoint_name='SpaceTemp'))
    AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00'
LEFT OUTER JOIN facts.bldg_1_thermal_fact f
    ON f.time_fk=t.local_key
    AND f.sensor_fk=s.sensor_pk
GROUP BY 1,2
ORDER BY 1,2

Note: as complicated as that big filter on the inner join is, it will never
result in more than a few tens of rows being selected.

Fast explain is here:
http://explain.depesz.com/s/FAaR

The slow query is as follows:

SELECT t.local_key,
       s.device_name||'_'||s.dpoint_name,
       CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
            ELSE avg(f.dpoint_value)::numeric(10,2)
       END as dpoint_value
FROM dimensions.sensor s
INNER JOIN dimensions.time_ny t
    ON s.building_id=1
    AND ((s.device_id=33 AND s.dpoint_id=183) OR (s.device_id=33 AND s.dpoint_id=184)
      OR (s.device_id=34 AND s.dpoint_id=187) OR (s.device_id=34 AND s.dpoint_id=188)
      OR (s.device_id=35 AND s.dpoint_id=191) OR (s.device_id=35 AND s.dpoint_id=192)
      OR (s.device_id=36 AND s.dpoint_id=195) OR (s.device_id=36 AND s.dpoint_id=196)
      OR (s.device_id=77 AND s.dpoint_id=364)
      OR (s.device_id=20 AND s.dpoint_id=131) OR (s.device_id=20 AND s.dpoint_id=132)
      OR (s.device_id=21 AND s.dpoint_id=135) OR (s.device_id=21 AND s.dpoint_id=136)
      OR (s.device_id=22 AND s.dpoint_id=139) OR (s.device_id=22 AND s.dpoint_id=140)
      OR (s.device_id=30 AND s.dpoint_id=171) OR (s.device_id=30 AND s.dpoint_id=172)
      OR (s.device_id=23 AND s.dpoint_id=143) OR (s.device_id=23 AND s.dpoint_id=144)
      OR (s.device_id=24 AND s.dpoint_id=147) OR (s.device_id=24 AND s.dpoint_id=148)
      OR (s.device_id=25 AND s.dpoint_id=151) OR (s.device_id=25 AND s.dpoint_id=152)
      OR (s.device_id=26 AND s.dpoint_id=155) OR (s.device_id=26 AND s.dpoint_id=156)
      OR (s.device_id=27 AND s.dpoint_id=159) OR (s.device_id=27 AND s.dpoint_id=160)
      OR (s.device_id=28 AND s.dpoint_id=163) OR (s.device_id=28 AND s.dpoint_id=164)
      OR (s.device_id=29 AND s.dpoint_id=167) OR (s.device_id=29 AND s.dpoint_id=168)
      OR (s.device_id=31 AND s.dpoint_id=175) OR (s.device_id=31 AND s.dpoint_id=176)
      OR (s.device_id=32 AND s.dpoint_id=179) OR (s.device_id=32 AND s.dpoint_id=180))
    AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00'
LEFT OUTER JOIN facts.bldg_1_thermal_fact f
    ON f.time_fk=t.local_key
    AND f.sensor_fk=s.sensor_pk
GROUP BY 1,2
ORDER BY 1,2

Slow explain is here:
http://explain.depesz.com/s/qao

I tried rewriting the slow query so that the cross join is expressed as a
subquery left outer joined to the fact table, but it had no effect. The
query plan was identical.
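
Roughly, the rewritten form looked like this (a sketch - the alias names are
assumed and the filter list is abridged):

SELECT sub.local_key,
       sub.dpoint_label,
       CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
            ELSE avg(f.dpoint_value)::numeric(10,2)
       END as dpoint_value
FROM (SELECT t.local_key, s.sensor_pk,
             s.device_name||'_'||s.dpoint_name AS dpoint_label
      FROM dimensions.sensor s
      INNER JOIN dimensions.time_ny t
          ON s.building_id=1
          AND ((s.device_id=33 AND s.dpoint_id=183)
            OR (s.device_id=33 AND s.dpoint_id=184))
               -- ... remaining pairs as in the slow query above ...
          AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00') sub
LEFT OUTER JOIN facts.bldg_1_thermal_fact f
    ON f.time_fk=sub.local_key
    AND f.sensor_fk=sub.sensor_pk
GROUP BY 1,2
ORDER BY 1,2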

The query plan also doesn't change if I replace the 2nd column with
s.sensor_pk (which maps one-to-one to the string I am constructing, but
which does have an index). This query is actually being used as the first
query in a call to crosstab(text, text), and that second column is just the
pivot column. It doesn't matter to me whether it is a string or the id,
since those labels wind up being rewritten as column names of the function
result (select * from crosstab(...) as q(local_key bigint, column1 real,
column2 real, ...)).
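
For reference, the surrounding call is shaped roughly like this
(crosstab(text, text) comes from the tablefunc extension; the category query
and the column list here are illustrative):

SELECT *
FROM crosstab(
    $$ SELECT ... $$,  -- the query above, producing (local_key, label, dpoint_value)
    $$ SELECT DISTINCT device_name||'_'||dpoint_name
       FROM dimensions.sensor ORDER BY 1 $$
) AS q(local_key bigint, column1 real, column2 real /* , ... one per label */);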


From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: near identical queries have vastly different plans
Date: 2011-06-30 09:40:10
Message-ID: BANLkTimz9GDfyn7-c6UmRjZ9RHaDranjrw@mail.gmail.com
Lists: pgsql-performance

On Thu, Jun 30, 2011 at 1:53 AM, Samuel Gendler
<sgendler(at)ideasculptor(dot)com> wrote:

> If I could figure out either a query structure or an index structure which
> will force the fast query plan, I'd be much happier. So that is what I am
> looking for - an explanation of how I might convince the planner to always
> use the fast plan.
>
>
For the record, "set enable_nestloop=false" does force a more effective plan
when using the 'slow' query. It is not quite identical in structure - it
materializes the other side of the query, resulting in about 10% less
performance - but it is close enough that I'm tempted to disable nestloop
whenever I run the query in the hope that it will prevent the planner from
switching to the really awful plan. I know that's kind of a drastic
measure, so hopefully someone out there will suggest a config fix which
accomplishes the same thing without requiring special handling for this
query, but at least it works (for now).
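
If I do end up disabling it for this query, scoping the change to a single
transaction should at least keep it from leaking into anything else:

BEGIN;
SET LOCAL enable_nestloop = off;  -- reverts automatically at COMMIT or ROLLBACK
-- run the report query here
COMMIT;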

Incidentally, upgrading to 9.0.x is not out of the question if it is
believed that doing so might help here. I'm only running 8.4 because I've
got another project in production on 8.4 and I don't want to have to deal
with running both versions on my development laptop. But that's a pretty
weak reason for not upgrading, I know.

--sam


From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: near identical queries have vastly different plans
Date: 2011-06-30 20:34:52
Message-ID: BANLkTimWvZ8n7Tm9Yom8k-t2jKyGGQMqTQ@mail.gmail.com
Lists: pgsql-performance

On Thu, Jun 30, 2011 at 1:53 AM, Samuel Gendler
<sgendler(at)ideasculptor(dot)com> wrote:

> If I could figure out either a query structure or an index structure which
> will force the fast query plan, I'd be much happier. So that is what I am
> looking for - an explanation of how I might convince the planner to always
> use the fast plan.
>
>
I eventually noticed that constraint_exclusion didn't seem to be working and
remembered that it only works when the filter is on the partitioned table
itself, not when the table is being filtered via a join. Adding a where
clause which limits f.time_fk to the appropriate range not only fixed
constraint_exclusion behaviour, but also caused the query planner to produce
the same plan for both versions of the query - a plan that was an order of
magnitude faster than the previous fastest plan. It went from 20 seconds
to just less than 2 seconds.
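
In terms of the queries above, the addition looks roughly like this (the
literals mirror the existing tstamp range; the OR f.time_fk IS NULL arm is a
guess at preserving the rows the outer join pads with defaults, since a plain
range filter on f would discard them):

...
LEFT OUTER JOIN facts.bldg_1_thermal_fact f
    ON f.time_fk=t.local_key
    AND f.sensor_fk=s.sensor_pk
WHERE f.time_fk BETWEEN 201106150000 AND 201106290000
   OR f.time_fk IS NULL
GROUP BY 1,2
ORDER BY 1,2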


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: near identical queries have vastly different plans
Date: 2011-07-01 22:46:49
Message-ID: 6960.1309560409@sss.pgh.pa.us
Lists: pgsql-performance

Samuel Gendler <sgendler(at)ideasculptor(dot)com> writes:
> I've got 2 nearly identical queries that perform incredibly differently.

The reason the slow query sucks is that the planner is estimating at
most one "s" row will match that complicated AND/OR condition, so it
goes for a nestloop. In the "fast" query there is another complicated
AND/OR filter condition, but it's not so far off on the number of
matching rows, so you get a better plan choice. Can't tell from the
given information whether the better guess is pure luck, or there's some
difference in the column statistics that makes it able to get a better
estimate for that.

In general, though, you're skating on thin ice anytime you ask the
planner to derive statistical estimates about combinations of correlated
columns --- and these evidently are correlated. Think about refactoring
the table definitions so that you're only testing a single column, which
ANALYZE will be able to provide stats about. Or maybe you can express
it as a test on a computed expression, which you could then keep an
index on, prompting ANALYZE to gather stats about that.

regards, tom lane
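
For the sensor table in this thread, that last suggestion could look like the
following sketch (index name hypothetical). Once the index exists, ANALYZE
gathers statistics on the indexed expression, and a filter written against
the identical expression can use them:

CREATE INDEX idx_sensor_device_dpoint
    ON dimensions.sensor ((device_name||'_'||dpoint_name));
ANALYZE dimensions.sensor;

-- the query filter would then test the same expression:
--   WHERE s.device_name||'_'||s.dpoint_name
--         IN ('AHU_1M02_SpaceTemp', 'VAV_1M_01_SpaceTemp', ...)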


From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: near identical queries have vastly different plans
Date: 2011-07-02 00:51:32
Message-ID: BANLkTi=rpGc84TfLzDGKB3c1krqqC+Umzg@mail.gmail.com
Lists: pgsql-performance

On Fri, Jul 1, 2011 at 3:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Samuel Gendler <sgendler(at)ideasculptor(dot)com> writes:
> > I've got 2 nearly identical queries that perform incredibly differently.
>
> The reason the slow query sucks is that the planner is estimating at
> most one "s" row will match that complicated AND/OR condition, so it
> goes for a nestloop. In the "fast" query there is another complicated
> AND/OR filter condition, but it's not so far off on the number of
> matching rows, so you get a better plan choice. Can't tell from the
> given information whether the better guess is pure luck, or there's some
> difference in the column statistics that makes it able to get a better
> estimate for that.
>
> In general, though, you're skating on thin ice anytime you ask the
> planner to derive statistical estimates about combinations of correlated
> columns --- and these evidently are correlated. Think about refactoring
> the table definitions so that you're only testing a single column, which
> ANALYZE will be able to provide stats about. Or maybe you can express
> it as a test on a computed expression, which you could then keep an
> index on, prompting ANALYZE to gather stats about that.
>

Thanks. There is actually already a column in s (sensor_pk) which serves as
a primary key for the 2 columns currently being tested. I didn't write the
application code which generates the query, so I can't say for sure why it
is generated as it is, but I'll ask the engineer in question to try the
primary key column instead and see what happens.
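
That would collapse the big AND/OR filter down to a single-column test,
something like (ids here are placeholders):

AND s.sensor_pk IN (101, 102, 103)  -- one id per desired sensor row, 35 in all

which also gives ANALYZE a single column to estimate against.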