Why is plan (and performance) different on partitioned table?

Lists: pgsql-performance
From: "Mark Liberman" <mliberman(at)mixedsignals(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Why is plan (and performance) different on partitioned table?
Date: 2006-05-02 01:37:03
Message-ID: 9D938282F8C6EE43B748B910386DE93E0138B43C@srvgpimail1.GPI.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi,

I have recently implemented table partitioning in our postgres 8.1 db. Upon analyzing query performance, I have realized that, even when only a single one of the "partitions" has to be scanned, the plan is drastically different, and performs much worse, when I query against the master table (uses merge join), vs. a direct query against the partition directly (uses a hash join). The majority of our queries only access a single partition.

Any insight into why this happens and what can be done to improve performance would be greatly appreciated.

br_1min is my partitioned table:

explain analyze
SELECT *
FROM br_1min br1 JOIN br_mods mod on br1.modules_id = mod.id
WHERE ((end_time >= '2006-05-01 17:12:18-07' AND end_time < '2006-05-01 17:13:18-07'))
AND mod.downloads_id IN (153226,153714,153730,153728,153727,153724,153713,153725,153739,153722) ;

----------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=73.99..223.43 rows=1 width=109) (actual time=2925.629..3082.188 rows=45 loops=1)
Merge Cond: ("outer".id = "inner".modules_id)
-> Index Scan using br_mods_id_pkey on br_mods mod (cost=0.00..40861.18 rows=282 width=77) (actual time=2922.223..3078.335 rows=45 loops=1)
Filter: ((downloads_id = 153226) OR (downloads_id = 153714) OR (downloads_id = 153730) OR (downloads_id = 153728) OR (downloads_id = 153727) OR (downloads_id = 153724) OR (downloads_id = 153713) OR (downloads_id = 153725) OR (downloads_id = 153739) OR (downloads_id = 153722))
-> Sort (cost=73.99..76.26 rows=906 width=32) (actual time=3.334..3.508 rows=348 loops=1)
Sort Key: br1.modules_id
-> Append (cost=0.00..29.49 rows=906 width=32) (actual time=0.133..2.169 rows=910 loops=1)
-> Index Scan using br_1min_end_idx on br_1min br1 (cost=0.00..2.02 rows=1 width=32) (actual time=0.029..0.029 rows=0 loops=1)
Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
-> Index Scan using br_1min_20557_end_idx on br_1min_20557 br1 (cost=0.00..27.48 rows=905 width=32) (actual time=0.101..1.384 rows=910 loops=1)
Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
Total runtime: 3082.450 ms
(12 rows)

Now, If I query directly against br_1min_20557, my partition, I get:

explain analyze
SELECT *
FROM br_1min_20557 br1 JOIN br_mods mod on br1.modules_id = mod.id
WHERE ((end_time >= '2006-05-01 17:12:18-07' AND end_time < '2006-05-01 17:13:18-07'))
AND mod.downloads_id IN (153226,153714,153730,153728,153727,153724,153713,153725,153739,153722) ;

----------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=764.74..796.94 rows=1 width=109) (actual time=2.488..2.865 rows=45 loops=1)
Hash Cond: ("outer".modules_id = "inner".id)
-> Index Scan using br_1min_20557_end_idx on br_1min_20557 br1 (cost=0.00..27.62 rows=914 width=32) (actual time=0.084..1.886 rows=910 loops=1)
Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
-> Hash (cost=764.03..764.03 rows=282 width=77) (actual time=0.284..0.284 rows=45 loops=1)
-> Bitmap Heap Scan on br_mods mod (cost=20.99..764.03 rows=282 width=77) (actual time=0.154..0.245 rows=45 loops=1)
Recheck Cond: ((downloads_id = 153226) OR (downloads_id = 153714) OR (downloads_id = 153730) OR (downloads_id = 153728) OR (downloads_id = 153727) OR (downloads_id = 153724) OR (downloads_id = 153713) OR (downloads_id = 153725) OR (downloads_id = 153739) OR (downloads_id = 153722))
-> BitmapOr (cost=20.99..20.99 rows=282 width=0) (actual time=0.144..0.144 rows=0 loops=1)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.031..0.031 rows=14 loops=1)
Index Cond: (downloads_id = 153226)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.011..0.011 rows=2 loops=1)
Index Cond: (downloads_id = 153714)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.007..0.007 rows=2 loops=1)
Index Cond: (downloads_id = 153730)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.007..0.007 rows=2 loops=1)
Index Cond: (downloads_id = 153728)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
Index Cond: (downloads_id = 153727)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
Index Cond: (downloads_id = 153724)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
Index Cond: (downloads_id = 153713)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
Index Cond: (downloads_id = 153725)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.041..0.041 rows=16 loops=1)
Index Cond: (downloads_id = 153739)
-> Bitmap Index Scan on br_mods_downloads_id_idx (cost=0.00..2.10 rows=28 width=0) (actual time=0.010..0.010 rows=1 loops=1)
Index Cond: (downloads_id = 153722)
Total runtime: 3.017 ms
(29 rows)

The difference is night-and-day. Any suggestions?

Thanks alot,

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mark Liberman" <mliberman(at)mixedsignals(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Why is plan (and performance) different on partitioned table?
Date: 2006-05-02 02:59:16
Message-ID: 22530.1146538756@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Mark Liberman" <mliberman(at)mixedsignals(dot)com> writes:
> I have recently implemented table partitioning in our postgres 8.1 db. =
> Upon analyzing query performance, I have realized that, even when only a =
> single one of the "partitions" has to be scanned, the plan is =
> drastically different, and performs much worse, when I query against the =
> master table (uses merge join), vs. a direct query against the partition =
> directly (uses a hash join). The majority of our queries only access a =
> single partition.

Joins against partitioned tables suck in 8.1 :-(. There is code in CVS
HEAD to improve this, but it didn't get done in time for 8.1.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mark Liberman" <mliberman(at)mixedsignals(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Why is plan (and performance) different on partitioned table?
Date: 2006-05-02 04:44:16
Message-ID: 24805.1146545056@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

I wrote:
> Joins against partitioned tables suck in 8.1 :-(.

Actually ... while the above is a true statement, it's too flippant a
response for your problem. The reason the planner is going for a
mergejoin in your example is that it thinks the mergejoin will terminate
early. (Notice that the cost estimate for the mergejoin is actually
quite a bit less than the estimate for its first input.) This estimate
can only be made if the planner has statistics that say that one of the
join columns has a max value much less than the other's. Well, that's
fine, but where the heck did it get the stats for the partitioned table?
We don't compute union statistics for partitions. The answer is that
it's confused and is using the stats for just the parent table as if
they were representative for the whole inheritance tree.

I think this behavior was intentional back when it was coded, but when
inheritance is being used for partitioning, it's clearly brain-dead.
We should either not assume anything about the statistics for an
inheritance tree, or make a real effort to compute them.

For the moment, I've applied a quick patch that makes sure we don't
assume anything.

If you don't have anything in the parent table br_1min, then deleting
the (presumably obsolete) pg_statistic rows for it should fix your
immediate problem. Otherwise, consider applying the attached.

regards, tom lane

Index: src/backend/optimizer/path/allpaths.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/allpaths.c,v
retrieving revision 1.137.2.2
diff -c -r1.137.2.2 allpaths.c
*** src/backend/optimizer/path/allpaths.c 13 Feb 2006 16:22:29 -0000 1.137.2.2
--- src/backend/optimizer/path/allpaths.c 2 May 2006 04:31:27 -0000
***************
*** 264,269 ****
--- 264,276 ----
errmsg("SELECT FOR UPDATE/SHARE is not supported for inheritance queries")));

/*
+ * We might have looked up indexes for the parent rel, but they're
+ * really not relevant to the appendrel. Reset the pointer to avoid
+ * any confusion.
+ */
+ rel->indexlist = NIL;
+
+ /*
* Initialize to compute size estimates for whole inheritance tree
*/
rel->rows = 0;
Index: src/backend/utils/adt/selfuncs.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/selfuncs.c,v
retrieving revision 1.191.2.1
diff -c -r1.191.2.1 selfuncs.c
*** src/backend/utils/adt/selfuncs.c 22 Nov 2005 18:23:22 -0000 1.191.2.1
--- src/backend/utils/adt/selfuncs.c 2 May 2006 04:31:27 -0000
***************
*** 2970,2988 ****
(varRelid == 0 || varRelid == ((Var *) basenode)->varno))
{
Var *var = (Var *) basenode;
! Oid relid;

vardata->var = basenode; /* return Var without relabeling */
vardata->rel = find_base_rel(root, var->varno);
vardata->atttype = var->vartype;
vardata->atttypmod = var->vartypmod;

! relid = getrelid(var->varno, root->parse->rtable);

! if (OidIsValid(relid))
{
vardata->statsTuple = SearchSysCache(STATRELATT,
! ObjectIdGetDatum(relid),
Int16GetDatum(var->varattno),
0, 0);
}
--- 2970,2996 ----
(varRelid == 0 || varRelid == ((Var *) basenode)->varno))
{
Var *var = (Var *) basenode;
! RangeTblEntry *rte;

vardata->var = basenode; /* return Var without relabeling */
vardata->rel = find_base_rel(root, var->varno);
vardata->atttype = var->vartype;
vardata->atttypmod = var->vartypmod;

! rte = rt_fetch(var->varno, root->parse->rtable);

! if (rte->inh)
! {
! /*
! * XXX This means the Var represents a column of an append relation.
! * Later add code to look at the member relations and try to derive
! * some kind of combined statistics?
! */
! }
! else if (rte->rtekind == RTE_RELATION)
{
vardata->statsTuple = SearchSysCache(STATRELATT,
! ObjectIdGetDatum(rte->relid),
Int16GetDatum(var->varattno),
0, 0);
}