MIN/MAX optimization for partitioned table

From: Alan Li <ali(at)truviso(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: MIN/MAX optimization for partitioned table
Date: 2009-07-17 20:35:33
Message-ID: 782056770907171335h69c48070o99c497143073dd49@mail.gmail.com
Lists: pgsql-hackers

Consider the following schema:

create table foo_archive (a int, b timestamp);
create index foo_archive_idx on foo_archive(b);
CREATE TABLE foo_archive_2007_01_01 (CONSTRAINT
foo_archive_2007_01_01_b_check CHECK (((b >= '2007-01-01'::date) AND (b <
'2007-01-02'::date)))) INHERITS (foo_archive);
CREATE INDEX foo_archive_2007_01_01_idx ON foo_archive_2007_01_01 USING
btree (b);
CREATE TABLE foo_archive_2007_01_02 (CONSTRAINT
foo_archive_2007_01_02_b_check CHECK (((b >= '2007-01-02'::date) AND (b <
'2007-01-03'::date)))) INHERITS (foo_archive);
CREATE INDEX foo_archive_2007_01_02_idx ON foo_archive_2007_01_02 USING
btree (b);
...
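
(For reference, the remaining daily partitions elided above follow the same pattern; a sketch like the following could generate them on a recent PostgreSQL. Note that DO blocks and format() postdate the 8.5-era servers discussed in this thread, so this is purely illustrative.)

DO $$
DECLARE
    d date := date '2007-01-03';
BEGIN
    -- one child per day through the end of January, matching the naming and
    -- CHECK-constraint pattern of the two partitions shown above
    WHILE d <= date '2007-01-31' LOOP
        EXECUTE format(
            'CREATE TABLE foo_archive_%1$s (CONSTRAINT foo_archive_%1$s_b_check '
            || 'CHECK (b >= %2$L::date AND b < %3$L::date)) INHERITS (foo_archive)',
            to_char(d, 'YYYY_MM_DD'), d, d + 1);
        EXECUTE format(
            'CREATE INDEX foo_archive_%1$s_idx ON foo_archive_%1$s USING btree (b)',
            to_char(d, 'YYYY_MM_DD'));
        d := d + 1;
    END LOOP;
END
$$;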

Currently the optimizer yields the following plan:

postgres=# explain select max(b) from foo_archive;
                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------
Aggregate (cost=18602.00..18602.01 rows=1 width=8)
-> Append (cost=0.00..16005.00 rows=1038800 width=8)
-> Seq Scan on foo_archive (cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_01 foo_archive
(cost=0.00..1331.99 rows=86399 width=8)
-> Seq Scan on foo_archive_2007_01_02 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_03 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_04 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_05 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_06 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_07 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_08 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_09 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_10 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_11 foo_archive
(cost=0.00..1332.00 rows=86400 width=8)
-> Seq Scan on foo_archive_2007_01_12 foo_archive
(cost=0.00..765.01 rows=49601 width=8)
-> Seq Scan on foo_archive_2007_01_13 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_14 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_15 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_16 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_17 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_18 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_19 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_20 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_21 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_22 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_23 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_24 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_25 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_26 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_27 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_28 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_29 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_30 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
-> Seq Scan on foo_archive_2007_01_31 foo_archive
(cost=0.00..29.40 rows=1940 width=8)
(34 rows)

As we can see, the optimizer does not take advantage of the indexes on
column b in the child relations.
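
(For comparison, the same per-child strategy can be spelled out by hand today, one ordered subselect per child; the partition names beyond the two shown above are assumed to follow the same daily pattern.)

select max(m)
from (
    select (select b from only foo_archive
             where b is not null order by b desc limit 1) as m   -- parent alone
    union all
    select (select b from foo_archive_2007_01_01
             where b is not null order by b desc limit 1)
    union all
    select (select b from foo_archive_2007_01_02
             where b is not null order by b desc limit 1)
    -- ... one arm per remaining child ...
) as per_child;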

Attached is a patch that takes advantage of the indexes (when they're
available and the index path is cheaper) and yields the following plan.

postgres=# explain select max(b) from foo_archive;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1.54..1.55 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_idx on foo_archive
(cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 2 (returns $1)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_01_idx on
foo_archive_2007_01_01 foo_archive (cost=0.00..2719.24 rows=86399 width=8)
Filter: (b IS NOT NULL)
InitPlan 3 (returns $2)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_02_idx on
foo_archive_2007_01_02 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 4 (returns $3)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_03_idx on
foo_archive_2007_01_03 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 5 (returns $4)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_04_idx on
foo_archive_2007_01_04 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 6 (returns $5)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_05_idx on
foo_archive_2007_01_05 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 7 (returns $6)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_06_idx on
foo_archive_2007_01_06 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 8 (returns $7)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_07_idx on
foo_archive_2007_01_07 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 9 (returns $8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_08_idx on
foo_archive_2007_01_08 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 10 (returns $9)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_09_idx on
foo_archive_2007_01_09 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 11 (returns $10)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_10_idx on
foo_archive_2007_01_10 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 12 (returns $11)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_11_idx on
foo_archive_2007_01_11 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
Filter: (b IS NOT NULL)
InitPlan 13 (returns $12)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_12_idx on
foo_archive_2007_01_12 foo_archive (cost=0.00..1568.27 rows=49601 width=8)
Filter: (b IS NOT NULL)
InitPlan 14 (returns $13)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_13_idx on
foo_archive_2007_01_13 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 15 (returns $14)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_14_idx on
foo_archive_2007_01_14 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 16 (returns $15)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_15_idx on
foo_archive_2007_01_15 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 17 (returns $16)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_16_idx on
foo_archive_2007_01_16 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 18 (returns $17)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_17_idx on
foo_archive_2007_01_17 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 19 (returns $18)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_18_idx on
foo_archive_2007_01_18 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 20 (returns $19)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_19_idx on
foo_archive_2007_01_19 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 21 (returns $20)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_20_idx on
foo_archive_2007_01_20 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 22 (returns $21)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_21_idx on
foo_archive_2007_01_21 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 23 (returns $22)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_22_idx on
foo_archive_2007_01_22 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 24 (returns $23)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_23_idx on
foo_archive_2007_01_23 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 25 (returns $24)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_24_idx on
foo_archive_2007_01_24 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 26 (returns $25)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_25_idx on
foo_archive_2007_01_25 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 27 (returns $26)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_26_idx on
foo_archive_2007_01_26 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 28 (returns $27)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_27_idx on
foo_archive_2007_01_27 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 29 (returns $28)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_28_idx on
foo_archive_2007_01_28 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 30 (returns $29)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_29_idx on
foo_archive_2007_01_29 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 31 (returns $30)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_30_idx on
foo_archive_2007_01_30 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
InitPlan 32 (returns $31)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_31_idx on
foo_archive_2007_01_31 foo_archive (cost=0.00..73.35 rows=1940 width=8)
Filter: (b IS NOT NULL)
-> Append (cost=0.00..0.32 rows=32 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
(162 rows)

Does this patch make sense for 8.5, to optimize this particular class of
queries? Specifically, queries of the form:

SELECT MAX(col) FROM tab WHERE ...;
SELECT MIN(col) FROM tab WHERE ...;

Thanks, Alan

Attachment Content-Type Size
optimize_append_minmax.patch text/x-patch 6.1 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alan Li <ali(at)truviso(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MIN/MAX optimization for partitioned table
Date: 2009-07-17 21:45:59
Message-ID: 407d949e0907171445m5c192ffeqe43f7aef5496acd0@mail.gmail.com
Lists: pgsql-hackers

Neat! I haven't read the patch yet but I have some questions.

Does this handle the case where some partitions have an index and
others don't? I.e., does it just apply the regular optimization to each
partition and then slap on the aggregate node? I think that's actually
a realistic case, because people often don't have indexes on empty
partitions like the parent partition, or on a new partition that has just
been added and doesn't have its indexes yet.

Is there any overlap with the ordered-append patch, which is also in
the pipeline? AFAICT it covers similar cases but doesn't actually
overlap, since the min/max optimization avoids having to do a sort
anywhere.
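
(For contrast, an illustrative pair of queries against the foo_archive schema from the original message: the first needs ordered output across all children, which is what the ordered-append work targets, while the second needs only one value per child and no sort, which is what this patch targets.)

-- ordered-append territory: ordered output across every child
select b from foo_archive order by b desc limit 10;

-- this patch: one value per child, no sort needed
select max(b) from foo_archive;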

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Alan Li <ali(at)truviso(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MIN/MAX optimization for partitioned table
Date: 2009-07-17 22:05:52
Message-ID: 782056770907171505h428ee9el33dec178b5c792ac@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 17, 2009 at 2:45 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> Neat! I haven't read the patch yet but I have some questions.
>
> Does this handle the case where some partitions have an index and
> others don't? Ie. Does it just apply the regular optimization to each
> partition and then slap on the aggregate node? I think that's actually
> a realistic case because people often don't have indexes on empty
> partitions like the parent partition or a new partition which has just
> been added and doesn't have indexes yet.
>
> Is there any overlap with the ordered-append patch which is also in
> the pipeline? afaict it covers similar cases but doesn't actually
> overlap since the min/max optimization avoids having to do a sort
> anywhere.
>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf

Hi Greg,

My colleague, Jeff Davis, just pointed me to the work that you're doing with
MergeAppend. I didn't know about it.

Yes, it does handle the case where no index exists on a child partition.
It defaults to the seqscan plan for that particular partition, because it
still depends on the aggregate node on top of the append node.
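
(A minimal way to exercise that fallback, reusing the foo_archive schema from the first message; the partition name here is just an example.)

CREATE TABLE foo_archive_2007_02_01 (CONSTRAINT foo_archive_2007_02_01_b_check
    CHECK (((b >= '2007-02-01'::date) AND (b < '2007-02-02'::date))))
    INHERITS (foo_archive);
-- deliberately no index on b yet, as with a freshly added partition

EXPLAIN SELECT max(b) FROM foo_archive;
-- the new child is expected to show up as a seq scan under the Append node,
-- while the indexed children keep their Limit + backward index scans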

I haven't looked at your MergeAppend patch, so I don't know how much overlap
there is. Based on my limited understanding, they may just be two different
approaches to the same problem: yours is a more general solution that covers
a wider set of optimizations for partitioned tables, while I'm trying to
solve a very specific problem. You are also correct that my patch does not
have to sort on partitions without the appropriate index, so the plan it
generates should be cheaper.

Any more thoughts about my patch or ways of making the two patches work
together would be greatly appreciated.

Thanks, Alan


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alan Li <ali(at)truviso(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MIN/MAX optimization for partitioned table
Date: 2009-07-18 10:13:19
Message-ID: 407d949e0907180313g349d1f2dref273631c138559d@mail.gmail.com
Lists: pgsql-hackers

This part:

!     /* only try to optimize children rel's */
!     foreach (lc, root->append_rel_list)
!     {
!         AppendRelInfo *a = (AppendRelInfo *) lfirst(lc);
!
!         if (a->child_relid == childrel->relid &&
!             a->parent_relid == parentrel->relid)
!         {
!             appinfo = a;
!             break;
!         }
!     }

Looks like O(n^2). I guess that's a bigger problem than with just this
patch. Perhaps append_rel_list should be a dynahash in general. I
never really understood why it was simpler to have a single global
append_rel_list anyways.

The other thing is it would be nice if we could avoid making separate
subplans for each child and instead make one for the whole structure
including the aggregate. It would at the very least make the explain
output prettier, but I think it would avoid repeated execution of the
aggregate in some cases as well.

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Alan Li <ali(at)truviso(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MIN/MAX optimization for partitioned table
Date: 2009-07-18 17:35:52
Message-ID: 782056770907181035t34e71864j95fd214e4b3cf88d@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jul 18, 2009 at 3:13 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> This part:
>
> ! /* only try to optimize children rel's */
> ! foreach (lc, root->append_rel_list)
> ! {
> ! AppendRelInfo *a = (AppendRelInfo *) lfirst(lc);
> !
> ! if (a->child_relid == childrel->relid &&
> ! a->parent_relid == parentrel->relid)
> ! {
> ! appinfo = a;
> ! break;
> ! }
> ! }
>
> Looks like O(n^2). I guess that's a bigger problem than with just this
> patch. Perhaps append_rel_list should be a dynahash in general. I
> never really understood why it was simpler to have a single global
> append_rel_list anyways.
>

Yeah, are you running into the same issue as well? I tried to figure out a
way around the O(n^2) behavior, but there was no existing direct
relationship between the child subpath and its associated AppendRelInfo. I
think an AppendRelInfo dynahash would work, hashed by child_relid.

>
> The other thing is it would be nice if we could avoid making separate
> subplans for each child and instead make one for the whole structure
> including the aggregate. It would at the very least make the explain
> output prettier, but I think it would avoid repeated execution of the
> aggregate in some cases as well.

How would this plan look? I think the repeated execution of the aggregate
comes from having to process the output of each child. So it's O(n)
executions where n is the number of children, assuming each child has the
appropriate index for this optimization.

The other optimization I thought of was that the optimizer might use the
check constraints (from the range partitioning) on the column I want to
do a MIN/MAX on. Then it would only have to execute the subplan on one child
partition plus the parent partition. But it was more of a wild idea on my
part; I'm not sure if that's possible.
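
(A hand-written version of that idea, for illustration: with constraint exclusion, a bound on b that matches the newest partition's check constraint prunes the rest. Of course this only returns the right answer if the max really falls in that partition, which is why the planner would need to know the partitioning structure to do it safely.)

SET constraint_exclusion = on;
EXPLAIN SELECT max(b) FROM foo_archive WHERE b >= '2007-01-31'::date;
-- only the parent and foo_archive_2007_01_31 survive exclusion; every other
-- child's CHECK constraint contradicts the WHERE clause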

>
>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf
>
>
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alan Li <ali(at)truviso(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MIN/MAX optimization for partitioned table
Date: 2009-07-18 19:04:42
Message-ID: 407d949e0907181204r549e98c5m558873098b50275@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jul 18, 2009 at 6:35 PM, Alan Li<ali(at)truviso(dot)com> wrote:
> Yeah, are you running into the same issue as well?  I tried to figure out a
> way around the O(n^2) behavior, but there were no existing direct
> relationship between the child subpath and its associated AppendRelInfo.  I
> think an AppenRelInfo dynahash would work and just have it hash by the
> child_relid.

I don't see any place in my patch where I had to do anything like
this. I do have to loop through all the appendrel elements and skip
over the ones that don't belong to this appendrel, which seems weird
to me, but it isn't O(n^2).

When generating a normal Append node, you generate a brand-new path for
each child and attach them to the append path. It looks like you're
letting the append rel generate paths and then digging around
looking at those paths. I wonder if that's the right approach.

>> The other thing is it would be nice if we could avoid making separate
>> subplans for each child and instead make one for the whole structure
>> including the aggregate. It would at the very least make the explain
>> output prettier, but I think it would avoid repeated execution of the
>> aggregate in some cases as well.
>
> How would this plan look?  I think the repeated execution of the aggregate
> comes from having to process the output of each child.  So it's O(n)
> executions where n is the number of children, assuming each child has the
> appropriate index for this optimization.

No, I just mean that instead of generating

InitPlan 1
  Limit
    Index Scan
InitPlan 2
  Limit
    Index Scan
Aggregate
  Append
    Result
    Result

I think it would be better to generate this:

InitPlan 1
  Aggregate
    Append
      Limit 1
        IndexScan
      Limit 1
        IndexScan
Result

The reason I think this is better is that if the subquery is
executed multiple times, it will just return the one precalculated
result. In your plan, all the child minimums are precalculated,
but it will still have to re-execute the aggregate and append nodes.
That's not terribly expensive, but we might as well generate the
simpler, more efficient plan.

> The other optimization that I thought was that the optimizer might use the
> check constraints (in the range partitioning) on the column that I want to
> do a MIN/MAX on.  Then it'll only have to execute the subplan in one child
> partition and the parent partition.  But it was more of a wild idea on my
> part, not sure if that's possible.

Yes, I think this is the long-term goal. That's the whole "better
representation of partitions" plotline. To do this efficiently the
planner should know what the partition key is and what the
partitioning structure is.

In an ideal world we would be able to collapse the whole structure and
eliminate the append relation entirely. To do that we need some way to
indicate that the parent relation is provably empty. I had in mind to
mark it as read-only and the statistics as authoritative since that
seems more useful than just being able to mark it empty. I think there
are a lot of other interesting things we could do with statistics if
we knew they were authoritative for a read-only table. (The read-only
property we would need here is very strong. There would have to be a
vacuum freeze and moreover we would have to know that all tuples were
successfully frozen.)

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Alan Li <ali(at)truviso(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MIN/MAX optimization for partitioned table
Date: 2009-07-20 18:40:53
Message-ID: 782056770907201140h6c449398n6fb2bf19c3305fa8@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jul 18, 2009 at 12:04 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> On Sat, Jul 18, 2009 at 6:35 PM, Alan Li<ali(at)truviso(dot)com> wrote:
> > Yeah, are you running into the same issue as well? I tried to figure out
> a
> > way around the O(n^2) behavior, but there were no existing direct
> > relationship between the child subpath and its associated AppendRelInfo.
> I
> > think an AppenRelInfo dynahash would work and just have it hash by the
> > child_relid.
>
> I don't see any place in my patch where I had to do anything like
> this. I do have to loop through all the appendrel elements and skip
> over the ones which don't belong to this appendrel which seems weird
> to me but it isn't n^2.

> Generating a normal Append node you're generating a brand new path for
> each child and attaching them to the append path. It looks like you're
> allowing the append rel to generate paths and then digging around
> looking at those paths. I wonder if that's the right approach.
>
>
> >> The other thing is it would be nice if we could avoid making separate
> >> subplans for each child and instead make one for the whole structure
> >> including the aggregate. It would at the very least make the explain
> >> output prettier, but I think it would avoid repeated execution of the
> >> aggregate in some cases as well.
> >
> > How would this plan look? I think the repeated execution of the
> aggregate
> > comes from having to process the output of each child. So it's O(n)
> > executions where n is the number of children, assuming each child has the
> > appropriate index for this optimization.
>
> No I just mean instead of generating
>
> InitPlan 1
> Limit
> Index Scan
> InitPlan 2
> Limit
> Index Scan
> Aggregate
> Append
> Result
> Result
>
> I think it would be better to generate this:
>
> InitPlan 1
> Aggregate
> Append
> Limit 1
> IndexScan
> Limit 1
> IndexScan
> Result
>
> The reason I think this is better is because if the subquery is
> executed multiple times it will just return the one precalculated
> result. In your plan it will have all the child minimums precalculated
> but will still have to re-execute the aggregate and append node.
> That's not terribly expensive but we might as well generate the
> simpler more efficient plan.
>
>
> > The other optimization that I thought was that the optimizer might use
> the
> > check constraints (in the range partitioning) on the column that I want
> to
> > do a MIN/MAX on. Then it'll only have to execute the subplan in one
> child
> > partition and the parent partition. But it was more of a wild idea on my
> > part, not sure if that's possible.
>
> Yes, I think this is the long-term goal. That's the whole "better
> representation of partitions" plotline. To do this efficiently the
> planner should know what the partition key is and what the
> partitioning structure is.
>
> In an ideal world we would be able to collapse the whole structure and
> eliminate the append relation entirely. To do that we need some way to
> indicate that the parent relation is provably empty. I had in mind to
> mark it as read-only and the statistics as authoritative since that
> seems more useful than just being able to mark it empty. I think there
> are a lot of other interesting things we could do with statistics if
> we knew they were authoritative for a read-only table. (The read-only
> property we would need here is very strong. There would have to be a
> vacuum freeze and moreover we would have to know that all tuples were
> successfully frozen.)
>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf
>
>
>
Attached is an updated patch that removes the O(n^2) behavior and the
awkwardness of optimizing the seqscan path just as the plan was about to be
created. Now the optimization is considered while the appendrel is generating
its paths.

I also changed the plan as you had suggested. It now looks like this:

postgres=# explain select max(b) from foo_archive;
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1.22..1.23 rows=1 width=8)
-> Append (cost=0.00..1.13 rows=32 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_idx on foo_archive
(cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_01_idx on
foo_archive_2007_01_01 foo_archive (cost=0.00..2723.24 rows=86399 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_02_idx on
foo_archive_2007_01_02 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_03_idx on
foo_archive_2007_01_03 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_04_idx on
foo_archive_2007_01_04 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_05_idx on
foo_archive_2007_01_05 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_06_idx on
foo_archive_2007_01_06 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_07_idx on
foo_archive_2007_01_07 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_08_idx on
foo_archive_2007_01_08 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_09_idx on
foo_archive_2007_01_09 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_10_idx on
foo_archive_2007_01_10 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_11_idx on
foo_archive_2007_01_11 foo_archive (cost=0.00..2719.26 rows=86400 width=8)
-> Limit (cost=0.00..0.03 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_12_idx on
foo_archive_2007_01_12 foo_archive (cost=0.00..1568.27 rows=49601 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_13_idx on
foo_archive_2007_01_13 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_14_idx on
foo_archive_2007_01_14 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_15_idx on
foo_archive_2007_01_15 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_16_idx on
foo_archive_2007_01_16 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_17_idx on
foo_archive_2007_01_17 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_18_idx on
foo_archive_2007_01_18 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_19_idx on
foo_archive_2007_01_19 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_20_idx on
foo_archive_2007_01_20 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_21_idx on
foo_archive_2007_01_21 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_22_idx on
foo_archive_2007_01_22 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_23_idx on
foo_archive_2007_01_23 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_24_idx on
foo_archive_2007_01_24 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_25_idx on
foo_archive_2007_01_25 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_26_idx on
foo_archive_2007_01_26 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_27_idx on
foo_archive_2007_01_27 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_28_idx on
foo_archive_2007_01_28 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_29_idx on
foo_archive_2007_01_29 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_30_idx on
foo_archive_2007_01_30 foo_archive (cost=0.00..73.35 rows=1940 width=8)
-> Limit (cost=0.00..0.04 rows=1 width=8)
-> Index Scan Backward using foo_archive_2007_01_31_idx on
foo_archive_2007_01_31 foo_archive (cost=0.00..73.35 rows=1940 width=8)
(66 rows)

Attachment Content-Type Size
optimize_append_minmax_1.patch text/x-patch 7.7 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alan Li <ali(at)truviso(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MIN/MAX optimization for partitioned table
Date: 2009-07-20 19:29:09
Message-ID: 407d949e0907201229s22e375edh219cbe9e2ddd0c67@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 20, 2009 at 7:40 PM, Alan Li<ali(at)truviso(dot)com> wrote:
> Attached is an updated patch that removes the O(n^2) behavior and the
> awkwardness of optimizing the seqscan path as the plan was about to be
> created.  Now, the optimization is considered when appendrel is generating
> the paths.
>
> I also changed the plan as you had suggested.  It now looks like this:

Hm, that's not quite the plan I described either. I had in mind to
mirror the existing min/max optimization, which puts the whole thing in
an InitPlan and puts a Result node in place of the actual plan. Your
original patch did that for each subplan, but I thought it would be
better to do it for the whole aggregate.

However, the more I think about it, the more I don't understand why Tom
arranged to do that for the min/max optimization in the first place. For
subqueries where that makes sense, it would surely happen anyway,
as in the first example below. And for joins where it's necessary,
the planner knows to insert a Materialize node, which sounds just as good.
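
(The test tables in the plans below aren't defined anywhere in the thread; a plausible setup consistent with them, assumed here rather than taken from the patch, would be:)

create table x (i int);
create index xi on x (i);
create table x1 () inherits (x);    -- no index, so it stays a seq scan
create table x2 () inherits (x);
create index xi2 on x2 (i);
create table x3 () inherits (x);
create index xi3 on x3 (i);
create table y (j int);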

Here's what happens with your current patch in the case I was
concerned about -- note that the planner automatically detects the
case and turns the whole subplan into an initplan anyways:

postgres=# explain select * from y where j = (select min(i) from x) ;
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on y (cost=40.12..80.12 rows=12 width=4)
Filter: (j = $0)
InitPlan 1 (returns $0)
-> Aggregate (cost=40.11..40.12 rows=1 width=4)
-> Append (cost=0.00..34.10 rows=2403 width=4)
-> Limit (cost=0.00..0.03 rows=1 width=4)
-> Index Scan using xi on x (cost=0.00..80.25
rows=2400 width=4)
-> Limit (cost=0.00..0.03 rows=1 width=4)
-> Index Scan using xi2 on x2 x
(cost=0.00..80.25 rows=2400 width=4)
-> Limit (cost=0.00..0.03 rows=1 width=4)
-> Index Scan using xi3 on x3 x
(cost=0.00..80.25 rows=2400 width=4)
-> Seq Scan on x1 x (cost=0.00..34.00 rows=2400 width=4)
(12 rows)

And here's another case where you wouldn't want multiple execution --
but the planner here figures out to materialize the result:

postgres=# explain select * from y left outer join (select min(i) as i
from x) as x on (i=j);
QUERY PLAN
--------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=40.13..128.13 rows=2400 width=8)
Join Filter: ((min(public.x.i)) = y.j)
-> Seq Scan on y (cost=0.00..34.00 rows=2400 width=4)
-> Materialize (cost=40.13..40.14 rows=1 width=4)
-> Aggregate (cost=40.11..40.12 rows=1 width=4)
-> Append (cost=0.00..34.10 rows=2403 width=4)
-> Limit (cost=0.00..0.03 rows=1 width=4)
-> Index Scan using xi on x
(cost=0.00..80.25 rows=2400 width=4)
-> Limit (cost=0.00..0.03 rows=1 width=4)
-> Index Scan using xi2 on x2 x
(cost=0.00..80.25 rows=2400 width=4)
-> Limit (cost=0.00..0.03 rows=1 width=4)
-> Index Scan using xi3 on x3 x
(cost=0.00..80.25 rows=2400 width=4)
-> Seq Scan on x1 x (cost=0.00..34.00 rows=2400 width=4)
(13 rows)

So I'm a bit puzzled why Tom's min/max optimization bothers with the
whole Initplan/Result business itself anyways.

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Alan Li <ali(at)truviso(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MIN/MAX optimization for partitioned table
Date: 2009-07-20 20:09:47
Message-ID: 782056770907201309l46d4a1d9ue427f36896d69a67@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 20, 2009 at 12:29 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> On Mon, Jul 20, 2009 at 7:40 PM, Alan Li<ali(at)truviso(dot)com> wrote:
> > Attached is an updated patch that removes the O(n^2) behavior and the
> > awkwardness of optimizing the seqscan path as the plan was about to be
> > created. Now, the optimization is considered when appendrel is
> generating
> > the paths.
> >
> > I also changed the plan as you had suggested. It now looks like this:
>
> Hm, that's not quite the plan I described either. I had in mind to
> mirror the existing min/max optimization which put the whole thing in
> an InitPlan and put a Result node in place of the actual plan. Your
> original patch did that for each subplan but I thought it would be
> better to do it for the whole aggregate.
>
> However the more I think about it the more I don't understand why Tom
> arranged to do that for the min/max optimization anyways. For
> subqueries where that makes sense that would surely happen anyways
> such as in the first example below. And for joins where it's necessary
> the planner knows to put a Materialize node which sounds just as good.
>

> Here's what happens with your current patch in the case I was
> concerned about -- note that the planner automatically detects the
> case and turns the whole subplan into an initplan anyways:
>
> postgres=# explain select * from y where j = (select min(i) from x) ;
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------
> Seq Scan on y (cost=40.12..80.12 rows=12 width=4)
> Filter: (j = $0)
> InitPlan 1 (returns $0)
> -> Aggregate (cost=40.11..40.12 rows=1 width=4)
> -> Append (cost=0.00..34.10 rows=2403 width=4)
> -> Limit (cost=0.00..0.03 rows=1 width=4)
> -> Index Scan using xi on x (cost=0.00..80.25
> rows=2400 width=4)
> -> Limit (cost=0.00..0.03 rows=1 width=4)
> -> Index Scan using xi2 on x2 x
> (cost=0.00..80.25 rows=2400 width=4)
> -> Limit (cost=0.00..0.03 rows=1 width=4)
> -> Index Scan using xi3 on x3 x
> (cost=0.00..80.25 rows=2400 width=4)
> -> Seq Scan on x1 x (cost=0.00..34.00 rows=2400 width=4)
> (12 rows)
>
>
> And here's another case where you wouldn't want multiple execution --
> but the planner here figures out to materialize the result:
>
> postgres=# explain select * from y left outer join (select min(i) as i
> from x) as x on (i=j);
> QUERY PLAN
>
> --------------------------------------------------------------------------------------------------
> Nested Loop Left Join (cost=40.13..128.13 rows=2400 width=8)
> Join Filter: ((min(public.x.i)) = y.j)
> -> Seq Scan on y (cost=0.00..34.00 rows=2400 width=4)
> -> Materialize (cost=40.13..40.14 rows=1 width=4)
> -> Aggregate (cost=40.11..40.12 rows=1 width=4)
> -> Append (cost=0.00..34.10 rows=2403 width=4)
> -> Limit (cost=0.00..0.03 rows=1 width=4)
> -> Index Scan using xi on x
> (cost=0.00..80.25 rows=2400 width=4)
> -> Limit (cost=0.00..0.03 rows=1 width=4)
> -> Index Scan using xi2 on x2 x
> (cost=0.00..80.25 rows=2400 width=4)
> -> Limit (cost=0.00..0.03 rows=1 width=4)
> -> Index Scan using xi3 on x3 x
> (cost=0.00..80.25 rows=2400 width=4)
> -> Seq Scan on x1 x (cost=0.00..34.00 rows=2400
> width=4)
> (13 rows)
>
>
> So I'm a bit puzzled why Tom's min/max optimization bothers with the
> whole Initplan/Result business itself anyways.
>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf
>
>
>
Yes, this is what I saw when I was testing this, and I think it would
be best to leave the decision of creating an InitPlan/Materialize to the
optimization of the outer query. That's why I didn't bother to put an
InitPlan on top of the Aggregate in the new patch.

Alan