From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: near identical queries have vastly different plans
Date: 2011-06-30 08:53:14
Message-ID: BANLkTin9E_2ZGOPoSvsewv2xHe6nv2Pnjg@mail.gmail.com
Lists: pgsql-performance

Here's the setup:

I'm cross joining two dimensions before left outer joining to a fact table
so that I can throw a default value into the result set wherever a value is
missing from the fact table. I have a time dimension and a sensor dimension.
I want the cross join to cross only a subset of rows from each dimension, so
I have some filters in the ON clause of the inner join.
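
Reduced to a single sensor, the pattern looks like this (the real queries
below just have a much bigger sensor filter, and use a CASE expression
equivalent to the coalesce() here):

select t.local_key,
       s.device_name||'_'||s.dpoint_name,
       coalesce(avg(f.dpoint_value)::numeric(10,2), -999999) as dpoint_value
from dimensions.sensor s
join dimensions.time_ny t
  -- no real join condition between s and t: effectively a filtered cross join
  on s.building_id=1
 and s.device_name='AHU_1M02' and s.dpoint_name='SpaceTemp'
 and t.tstamp between '15-Jun-2011 00:00' and '29-Jun-2011 00:00'
left outer join facts.bldg_1_thermal_fact f
  on f.time_fk=t.local_key
 and f.sensor_fk=s.sensor_pk
group by 1, 2
order by 1, 2;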

I've got 2 nearly identical queries that perform very differently. The only
difference between them is that the fast query filters the inner join on two
text columns from the sensor dimension, while the slow query uses bigint id
values, where the ids correlate to the text strings in the fast query. The
string columns are device_name and dpoint_name: device_name is unique, but
many devices have dpoints with the same name. The bigint columns are
device_id and dpoint_id, where each (device_id, dpoint_id) pair maps to a
single row. There are indexes on all of them, so the only real difference is
that an index on dpoint_name will have more rows for a given value than the
index on dpoint_id, because dpoints with the same name will still have
different ids if attached to different devices. In both queries, exactly 35
rows from the sensor dimension will be returned. Note also that I'm
aggregating fact table rows via avg() because I have duplicate rows in the
fact table, but I want to extract only a single row for each (time, sensor)
pair. The left outer join allows me to populate any missing rows with a
default value, and the aggregation via avg() collapses duplicate rows so
that they appear only once.

I can easily just use the fast query, but my concern is that without
understanding why the queries are executing differently, I might suddenly
discover my code using the slow query plan instead of the fast one at some
point in the future, even when using the varchar columns instead of the
bigint ids for filtering. They differ in execution speed by about 5x (which
translates to many minutes), so that would be a big problem. If I could
figure out either a query structure or an index structure which will force
the fast query plan, I'd be much happier. So that is what I am looking for
- an explanation of how I might convince the planner to always use the fast
plan.
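
For what it's worth, one possible rewrite would be to collapse the OR chain
into a join against a VALUES list, so the planner sees a single join qual
instead of 35 OR'd filters; this is just an untested sketch (alias names are
arbitrary), not something reflected in the plans below:

select s.*
from dimensions.sensor s
join (values ('AHU_1M02','SpaceTemp'),
             ('VAV_1M_01','EffectSetPt'),
             ('VAV_1M_01','SpaceTemp')
             -- ...and the remaining 32 (device_name, dpoint_name) pairs
     ) as want(device_name, dpoint_name)
  on s.device_name=want.device_name
 and s.dpoint_name=want.dpoint_name
where s.building_id=1;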

It's a CentOS host - quad-core AMD Opteron 1.6GHz, 2GB of RAM, and a single
75GB disk with everything on it. I'm not looking for outright performance,
just relative performance between the 2 queries. The DB config was taken
wholesale from pg_tune with no changes, IIRC. It isn't a production box. I
have yet to spec out production hardware for this application, so I don't
know what it will eventually be.

# select version();
                                                      version
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit

             name             |                 current_setting
------------------------------+-------------------------------------------------
 checkpoint_completion_target | 0.9
 checkpoint_segments          | 64
 default_statistics_target    | 100
 effective_cache_size         | 1408MB
 lc_collate                   | en_US.UTF-8
 lc_ctype                     | en_US.UTF-8
 log_directory                | pg_log
 log_filename                 | postgresql-%a.log
 log_rotation_age             | 1d
 log_rotation_size            | 0
 log_truncate_on_rotation     | on
 logging_collector            | on
 maintenance_work_mem         | 240MB
 max_connections              | 20
 max_stack_depth              | 2MB
 port                         | 5432
 server_encoding              | UTF8
 shared_buffers               | 480MB
 TimeZone                     | UTC
 wal_buffers                  | 32MB
 work_mem                     | 48MB

The time dimension looks like this:

          Column          |            Type             | Modifiers
--------------------------+-----------------------------+-----------
 time_zone                | character varying(64)       |
 tstamp                   | timestamp without time zone |
 tstamptz                 | timestamp with time zone    |
 local_key                | bigint                      |
 utc_key                  | bigint                      |
Indexes:
    "idx_time_ny_local_key" btree (local_key)
    "idx_time_ny_tstamp" btree (tstamp) CLUSTER
    "idx_time_ny_tstamptz" btree (tstamptz)
    "idx_time_ny_utc_key" btree (utc_key)

plus lots of other columns (approx 25, mostly integer) that aren't relevant
to this query. It has 420,480 rows, where each row is 300 seconds after the
previous row. local_key and utc_key are bigint columns in the form
yyyyMMddHHmm (utc_key in UTC and local_key in local time for the time zone
represented by the table). tstamp is a timestamp without time zone storing
the same date and time as local_key, in the table's local time zone, and
tstamptz is just a convenience column for when I need to do time zone
conversions.
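
To make the yyyyMMddHHmm format concrete, the correspondence between tstamp
and local_key can be expressed like this (illustrative only; the keys are
stored in the table, not computed at query time):

select tstamp,
       to_char(tstamp, 'YYYYMMDDHH24MI')::bigint as local_key
from dimensions.time_ny
limit 3;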

The sensor dimension looks like this:

    Column     |            Type             |                         Modifiers
---------------+-----------------------------+------------------------------------------------------------
 sensor_pk     | bigint                      | not null default nextval('sensor_sensor_pk_seq'::regclass)
 building_id   | bigint                      |
 device_id     | bigint                      |
 device_type   | character varying(16)       |
 device_name   | character varying(64)       |
 dpoint_id     | bigint                      |
 dpoint_name   | character varying(64)       |
Indexes:
    "sensor_pkey" PRIMARY KEY, btree (sensor_pk)
    "idx_sensor_device_id" btree (device_id)
    "idx_sensor_device_name" btree (device_name)
    "idx_sensor_device_type" btree (device_type)
    "idx_sensor_dpoint_id" btree (dpoint_id)
    "idx_sensor_dpoint_name" btree (dpoint_name)

There are other columns, but again, they aren't relevant - about 10 more
columns, half bigint and half varchar. Row count is less than 400 rows at
the moment, but it will potentially get much larger unless I partition on
building_id, which I may well do - in which case, row count will never
exceed 100,000 and rarely exceed 10,000.

The fact table is as follows:

Table "facts.bldg_1_thermal_fact"
Column | Type | Modifiers
--------------+-----------------------------+---------------
time_fk | bigint | not null
sensor_fk | bigint | not null
tstamp | timestamp without time zone |
dpoint_value | real |
device_mode | bigint |

With actual data in child tables:

Table "facts.bldg_1_thermal_fact_20110501"
Column | Type | Modifiers
--------------+-----------------------------+---------------
time_fk | bigint | not null
sensor_fk | bigint | not null
tstamp | timestamp without time zone |
dpoint_value | real |
device_mode | bigint |
Indexes:
"idx_bldg_1_thermal_fact_20110501_sensor_fk" btree (sensor_fk)
"idx_bldg_1_thermal_fact_20110501_time_fk" btree (time_fk) CLUSTER
Check constraints:
"bldg_1_thermal_fact_20110501_time_fk_check" CHECK (time_fk >=
201105010000::bigint AND time_fk < 201106010000::bigint)
"bldg_1_thermal_fact_20110501_tstamp_check" CHECK (tstamp >= '2011-05-01
00:00:00'::timestamp without time zone AND tstamp < '2011-06-01
00:00:00'::timestamp without time zone)
Inherits: bldg_1_thermal_fact

One of the 2 partitions that exist at the moment contains 2 million rows,
none of which are relevant to the query in question. The other partition has
4 million rows and contains 100% of the rows returned by the query.
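
As an aside, my understanding is that constraint exclusion only kicks in
when the partition key is compared to constants, so the join condition
f.time_fk=t.local_key in the queries below can't prune the irrelevant
partition; a constant filter like this hypothetical one could:

select count(*)
from facts.bldg_1_thermal_fact
where time_fk >= 201106150000::bigint
  and time_fk < 201106290000::bigint;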

The fast query is as follows:

SELECT t.local_key,
s.device_name||'_'||s.dpoint_name,
CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
ELSE avg(f.dpoint_value)::numeric(10,2)
END as dpoint_value
FROM dimensions.sensor s
INNER JOIN dimensions.time_ny t
ON s.building_id=1
AND ((s.device_name='AHU_1M02' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_01' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_01' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_02A' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_02A' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_02' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_02' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_03' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_03' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_04' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_04' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_05A' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_05A' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_05' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_05' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_06' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_06' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_07' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_07' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_08' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_08' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_09' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_09' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_10' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_10' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_11' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_11' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_12' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_12' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_13' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_13' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_14' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_14' AND s.dpoint_name='SpaceTemp')
  OR (s.device_name='VAV_1M_15' AND s.dpoint_name='EffectSetPt')
  OR (s.device_name='VAV_1M_15' AND s.dpoint_name='SpaceTemp'))
AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00'
LEFT OUTER JOIN facts.bldg_1_thermal_fact f
ON f.time_fk=t.local_key
AND f.sensor_fk=s.sensor_pk
GROUP BY 1,2
ORDER BY 1,2

Note: as complicated as that big filter on the inner join is, it will never
result in more than a few tens of rows being selected.

Fast explain is here:
http://explain.depesz.com/s/FAaR

The slow query is as follows:

SELECT t.local_key,
s.device_name||'_'||s.dpoint_name,
CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
ELSE avg(f.dpoint_value)::numeric(10,2)
END as dpoint_value
FROM dimensions.sensor s
INNER JOIN dimensions.time_ny t
ON s.building_id=1
AND ((s.device_id=33 AND s.dpoint_id=183) OR (s.device_id=33 AND s.dpoint_id=184)
  OR (s.device_id=34 AND s.dpoint_id=187) OR (s.device_id=34 AND s.dpoint_id=188)
  OR (s.device_id=35 AND s.dpoint_id=191) OR (s.device_id=35 AND s.dpoint_id=192)
  OR (s.device_id=36 AND s.dpoint_id=195) OR (s.device_id=36 AND s.dpoint_id=196)
  OR (s.device_id=77 AND s.dpoint_id=364)
  OR (s.device_id=20 AND s.dpoint_id=131) OR (s.device_id=20 AND s.dpoint_id=132)
  OR (s.device_id=21 AND s.dpoint_id=135) OR (s.device_id=21 AND s.dpoint_id=136)
  OR (s.device_id=22 AND s.dpoint_id=139) OR (s.device_id=22 AND s.dpoint_id=140)
  OR (s.device_id=30 AND s.dpoint_id=171) OR (s.device_id=30 AND s.dpoint_id=172)
  OR (s.device_id=23 AND s.dpoint_id=143) OR (s.device_id=23 AND s.dpoint_id=144)
  OR (s.device_id=24 AND s.dpoint_id=147) OR (s.device_id=24 AND s.dpoint_id=148)
  OR (s.device_id=25 AND s.dpoint_id=151) OR (s.device_id=25 AND s.dpoint_id=152)
  OR (s.device_id=26 AND s.dpoint_id=155) OR (s.device_id=26 AND s.dpoint_id=156)
  OR (s.device_id=27 AND s.dpoint_id=159) OR (s.device_id=27 AND s.dpoint_id=160)
  OR (s.device_id=28 AND s.dpoint_id=163) OR (s.device_id=28 AND s.dpoint_id=164)
  OR (s.device_id=29 AND s.dpoint_id=167) OR (s.device_id=29 AND s.dpoint_id=168)
  OR (s.device_id=31 AND s.dpoint_id=175) OR (s.device_id=31 AND s.dpoint_id=176)
  OR (s.device_id=32 AND s.dpoint_id=179) OR (s.device_id=32 AND s.dpoint_id=180))
AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00'
LEFT OUTER JOIN facts.bldg_1_thermal_fact f
ON f.time_fk=t.local_key
AND f.sensor_fk=s.sensor_pk
GROUP BY 1,2
ORDER BY 1,2

Slow explain is here:
http://explain.depesz.com/s/qao

I tried rewriting the slow query so that the cross join is expressed as a
subquery left outer joined to the fact table, but it had no effect: the
query plan was identical.
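
The shape of that rewrite was roughly this (filters elided for brevity):

select sub.local_key,
       sub.label,
       coalesce(avg(f.dpoint_value)::numeric(10,2), -999999) as dpoint_value
from (select t.local_key,
             s.sensor_pk,
             s.device_name||'_'||s.dpoint_name as label
      from dimensions.sensor s
      join dimensions.time_ny t
        on s.building_id=1
        -- ...same sensor and tstamp filters as in the queries above...
     ) sub
left outer join facts.bldg_1_thermal_fact f
  on f.time_fk=sub.local_key
 and f.sensor_fk=sub.sensor_pk
group by 1, 2
order by 1, 2;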

The query plan also doesn't change if I replace the 2nd column with
s.sensor_pk (which maps one-to-one to the string I am constructing, but
which does have an index). This query is actually being used as the first
query in a call to crosstab(text, text), and that second column is just the
pivot column. It doesn't matter to me whether it is a string or the id,
since those labels wind up being rewritten as column names of the function
result: select * from crosstab() as q(local_key bigint, column1 real,
column2 real, ...).
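
For reference, the outer call is shaped roughly like this (the pivot column
names and count are abbreviated; the real call lists all 35 columns):

select *
from crosstab(
       'select ... the fast query above ...',
       'select ... the 35 pivot labels in a fixed order ...'
     ) as q(local_key bigint, column1 real, column2 real /* , ... */);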
