Re: plan time of MASSIVE partitioning ...

From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: plan time of MASSIVE partitioning ...
Date: 2010-09-03 09:40:37
Message-ID: 6C19B4AB-D7FE-4673-9885-4708F118FD61@cybertec.at
Lists: pgsql-hackers

hello everybody,

we came across an issue which turned out to be more serious than previously expected.
imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.

i tried this one with 5000 unindexed tables (just one col):

test=# \timing
Timing is on.
test=# prepare x(int4) AS select * from t_data order by id desc;
PREPARE
Time: 361.552 ms

you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
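For anyone who wants to reproduce a comparable setup, here is a minimal sketch (column and table names are assumptions; the original script is not part of the thread, and the DO block needs 9.0+):

    -- Hypothetical reconstruction of the test: one parent with a single
    -- column and 5000 unindexed children inheriting from it.
    CREATE TABLE t_data (id int4);

    DO $$
    BEGIN
        FOR i IN 1..5000 LOOP
            EXECUTE 'CREATE TABLE t_data_' || i || ' () INHERITS (t_data)';
        END LOOP;
    END;
    $$;

    -- Then, in psql:
    --   \timing
    --   PREPARE x(int4) AS SELECT * FROM t_data ORDER BY id DESC;
    -- The PREPARE time is essentially pure planning time for the
    -- 5000-way inheritance tree.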

does anybody see a potential way to do a shortcut through the planner?
a prepared query is no solution here as constraint exclusion would be dead in this case (making the entire thing an even bigger drama).

did anybody think of a solution to this problem?
or more precisely: can there be a solution to this problem?

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-03 12:04:42
Message-ID: 20100903120442.GP26232@tamriel.snowman.net
Lists: pgsql-hackers

* PostgreSQL - Hans-Jürgen Schönig (postgres(at)cybertec(dot)at) wrote:
> did anybody think of a solution to this problem.
> or more precisely: can there be a solution to this problem?

Please post to the correct list (-performance) and provide information
like PG version, postgresql.conf, the actual table definition, the
resulting query plan, etc, etc...

Thanks,

Stephen


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-03 12:16:21
Message-ID: F2404401-2E8D-4BCE-A932-29D520489B0A@cybertec.at
Lists: pgsql-hackers


On Sep 3, 2010, at 2:04 PM, Stephen Frost wrote:

> * PostgreSQL - Hans-Jürgen Schönig (postgres(at)cybertec(dot)at) wrote:
>> did anybody think of a solution to this problem.
>> or more precisely: can there be a solution to this problem?
>
> Please post to the correct list (-performance) and provide information
> like PG version, postgresql.conf, the actual table definition, the
> resulting query plan, etc, etc...
>
> Thanks,
>
> Stephen

hello stephen,

this seems like more of a developer question to me than a pure performance one.
it is not related to the table structure at all - it is basically an issue with incredibly large inheritance lists.
it applies to postgres 9 and most likely to everything before.
postgresql.conf is not relevant at all at this point.

the plan is pretty fine.
the question is rather: does anybody see a chance to handle such lists more efficiently inside postgres?
also, whether my data structure is sane or not is not the point. it is really more generic - namely a shortcut for this case inside the planning process.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-03 12:27:36
Message-ID: 20100903122736.GQ26232@tamriel.snowman.net
Lists: pgsql-hackers

* PostgreSQL - Hans-Jürgen Schönig (postgres(at)cybertec(dot)at) wrote:
> this seems like more a developer question to me than a pre performance one.
> it is not related to the table structure at all - it is basically an issue with incredibly large inheritance lists.
> it applies to postgres 9 and most likely to everything before.
> postgresql.conf is not relevant at all at this point.

Really? What's constraint_exclusion set to? Is GEQO getting used?
What are the GEQO parameters set to? Do you have any CHECK constraints
on the tables?
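(As a small sketch, not part of the original mail, the settings in question can be listed with one query:)

    -- Report the planner-related settings Stephen asks about.
    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('constraint_exclusion',
                   'geqo', 'geqo_threshold', 'geqo_effort',
                   'work_mem');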

If you want someone else to build a test case and start looking into it,
it's best to not make them have to guess at what you've done.

> the plan is pretty fine.
> the question is rather: does anybody see a chance to handle such lists more efficiently inside postgres?
> also, it is not the point if my data structure is sane or not. it is really more generic - namely a shortcut for this case inside the planing process.

Coming up with cases where PG doesn't perform well in a completely
contrived unrealistic environment isn't likely to impress anyone to
do anything.

A small (but complete..) test case which mimics a real world environment
and real world problem would go a lot farther. I can certainly believe
that people out there have partitions set up with lots of tables and
that it will likely grow- but they probably also have CHECK constraints,
have tweaked what constraint_exclusion is set to, have adjusted their
work_mem and related settings, maybe tweaked some planner GUCs, etc,
etc.
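As an illustration of the kind of setup described here (table and constraint names invented for the example), a typical partition carries a CHECK constraint that constraint_exclusion can use for pruning:

    -- Hypothetical real-world style partition with a CHECK constraint
    -- and an index, the case constraint_exclusion = 'partition' targets.
    CREATE TABLE measurement (
        logdate date NOT NULL,
        payload text
    );

    CREATE TABLE measurement_2010_09 (
        CHECK (logdate >= DATE '2010-09-01' AND logdate < DATE '2010-10-01')
    ) INHERITS (measurement);

    CREATE INDEX measurement_2010_09_logdate_idx
        ON measurement_2010_09 (logdate);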

This is an area I'm actually interested in and curious about; I'd rather
work together on it than be told that the questions I'm asking aren't
relevant.

Thanks,

Stephen


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-03 13:27:15
Message-ID: AANLkTik9=wjvnG7VPbPHuC9FXacvFge3U=UppmqSOocF@mail.gmail.com
Lists: pgsql-hackers

2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
> i tried this one with 5000 unindexed tables (just one col):
>
> test=# \timing
> Timing is on.
> test=# prepare x(int4) AS select * from t_data order by id desc;
> PREPARE
> Time: 361.552 ms
>
> you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.

I'd like to see (1) a script to reproduce your test environment (as
Stephen also requested) and (2) gprof or oprofile results.
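A rough sketch of how such a gprof profile can be collected from a backend (the install path and the exact location of gmon.out are assumptions; they depend on the build and platform):

    # Build a profiling-enabled server and reinstall it.
    ./configure --enable-profiling
    make && make install

    # Run the slow statement in a fresh session, then disconnect so the
    # backend flushes its profile (gmon.out, typically under the data
    # directory).
    psql -c "PREPARE x(int4) AS SELECT * FROM t_data ORDER BY id DESC"

    # Turn the profile into a flat listing like the ones quoted later
    # in this thread.
    gprof /usr/local/pgsql/bin/postgres path/to/gmon.out > profile.txt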

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-03 14:40:54
Message-ID: 22483.1283524854@sss.pgh.pa.us
Lists: pgsql-hackers

PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at> writes:
> imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.

As the fine manual points out, the current scheme for managing
partitioned tables isn't intended to scale past a few dozen partitions.

I think we'll be able to do better when we have an explicit
representation of partitioning, since then the planner won't
have to expend large amounts of effort reverse-engineering knowledge
about how an inheritance tree is partitioned. Before that happens,
it's not really worth the trouble to worry about such cases.

regards, tom lane


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-03 14:46:26
Message-ID: BA9ED250-AB20-4AFC-9905-80F43E210C80@cybertec.at
Lists: pgsql-hackers


On Sep 3, 2010, at 4:40 PM, Tom Lane wrote:

> PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at> writes:
>> imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.
>
> As the fine manual points out, the current scheme for managing
> partitioned tables isn't intended to scale past a few dozen partitions.
>
> I think we'll be able to do better when we have an explicit
> representation of partitioning, since then the planner won't
> have to expend large amounts of effort reverse-engineering knowledge
> about how an inheritance tree is partitioned. Before that happens,
> it's not really worth the trouble to worry about such cases.
>
> regards, tom lane
>

thank you ... - the manual is clear here but we wanted to see if there is some reasonably low hanging fruit to get around this.
it is no solution but at least a clear statement ...

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 13:54:11
Message-ID: AANLkTi=wMcigktiaT6O=Y56Qgdwxpq-XfrF-Jyi71RB8@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan <zb(at)cybertec(dot)at> wrote:
> Hi,
>
> Robert Haas írta:
>> 2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
>>
>>> i tried this one with 5000 unindexed tables (just one col):
>>>
>>> test=# \timing
>>> Timing is on.
>>> test=# prepare x(int4) AS select * from t_data order by id desc;
>>> PREPARE
>>> Time: 361.552 ms
>>>
>>> you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
>>>
>>
>> I'd like to see (1) a script to reproduce your test environment (as
>> Stephen also requested) and (2) gprof or oprofile results.
>>
>
> attached are the test scripts, create_tables.sql and childtables.sql.
> The following query takes 4.7 seconds according to psql with \timing on:
> EXPLAIN SELECT * FROM qdrs
> WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
> ORDER BY streamhash;

Neat. Have you checked what effect this has on memory consumption?

Also, don't forget to add it to
https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 14:14:29
Message-ID: D4FCA214-6CA1-41A7-8FB3-C712FB069DB1@cybertec.at
Lists: pgsql-hackers

hello ...

no, we have not checked memory consumption.
there is still some stuff left to optimize away - it seems we are getting close to O(n^2) somewhere.
"equal" is called really often in our sample case as well:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 18.87      0.80     0.80      4812     0.00     0.00  make_canonical_pathkey
 15.33      1.45     0.65      4811     0.00     0.00  get_eclass_for_sort_expr
 14.15      2.05     0.60   8342410     0.00     0.00  equal
  6.13      2.31     0.26    229172     0.00     0.00  SearchCatCache
  3.66      2.47     0.16   5788835     0.00     0.00  _equalList
  3.07      2.60     0.13   1450043     0.00     0.00  hash_search_with_hash_value
  2.36      2.70     0.10   2272545     0.00     0.00  AllocSetAlloc
  2.12      2.79     0.09    811460     0.00     0.00  hash_any
  1.89      2.87     0.08   3014983     0.00     0.00  list_head
  1.89      2.95     0.08    574526     0.00     0.00  _bt_compare
  1.77      3.02     0.08  11577670     0.00     0.00  list_head
  1.42      3.08     0.06      1136     0.00     0.00  tzload
  0.94      3.12     0.04   2992373     0.00     0.00  AllocSetFreeIndex

look at the number of calls ...
"equal" is scary ...

make_canonical_pathkey is fixed it seems.
get_eclass_for_sort_expr seems a little more complex to fix.

great you like it ...

regards,

hans

On Sep 8, 2010, at 3:54 PM, Robert Haas wrote:

> On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan <zb(at)cybertec(dot)at> wrote:
>> Hi,
>>
>> Robert Haas írta:
>>> 2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
>>>
>>>> i tried this one with 5000 unindexed tables (just one col):
>>>>
>>>> test=# \timing
>>>> Timing is on.
>>>> test=# prepare x(int4) AS select * from t_data order by id desc;
>>>> PREPARE
>>>> Time: 361.552 ms
>>>>
>>>> you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
>>>>
>>>
>>> I'd like to see (1) a script to reproduce your test environment (as
>>> Stephen also requested) and (2) gprof or oprofile results.
>>>
>>
>> attached are the test scripts, create_tables.sql and childtables.sql.
>> The following query takes 4.7 seconds according to psql with \timing on:
>> EXPLAIN SELECT * FROM qdrs
>> WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
>> ORDER BY streamhash;
>
> Neat. Have you checked what effect this has on memory consumption?
>
> Also, don't forget to add it to
> https://commitfest.postgresql.org/action/commitfest_view/open
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 14:18:02
Message-ID: 20100908141802.GZ26232@tamriel.snowman.net
Lists: pgsql-hackers

* Hans-Jürgen Schönig (postgres(at)cybertec(dot)at) wrote:
> no, we have not checked memory consumption.
> there is still some stuff left to optimize away - it seems we are going close to O(n^2) somewhere.
> "equal" is called really often in our sample case as well:

Did the mail with the scripts, etc, get hung up due to size or
something..? I didn't see it on the mailing list nor in the archives..
If so, could you post them somewhere so others could look..?

Thanks,

Stephen


From: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 14:26:43
Message-ID: 72BE725D-2239-437A-965C-0DEBD8837F79@cybertec.at
Lists: pgsql-hackers

here is the patch again.
we accidentally attached a wrong test file to the original posting, so it grew too big. we had to revoke it from the moderator queue (this happens if you code from 8am to 10pm).
here is just the patch - it is nice and small.

you can easily test it by making yourself a nice parent table, many subtables (hundreds or thousands) and a decent number of indexes per partition.
then run PREPARE with \timing to see what happens.
you should get scary planning times. the more potential indexes and tables there are, the scarier it gets.

using this wonderful RB tree the time for this function call goes down to basically zero.
i hope this is something which is useful to some folks out there.

many thanks,

hans

Attachment Content-Type Size
canon-pathkeys-as-rbtree-3-ctxdiff.patch application/octet-stream 15.3 KB
unknown_filename text/plain 669 bytes

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 14:57:12
Message-ID: 20100908145712.GB26232@tamriel.snowman.net
Lists: pgsql-hackers

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> Neat. Have you checked what effect this has on memory consumption?
>
> Also, don't forget to add it to
> https://commitfest.postgresql.org/action/commitfest_view/open

Would be good to have the patch updated to be against HEAD before
posting to the commitfest.

Thanks,

Stephen


From: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 15:09:11
Message-ID: FE95C7DE-6759-4FAA-A083-ABB8D0F86421@cybertec.at
Lists: pgsql-hackers

On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:

> * Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
>> Neat. Have you checked what effect this has on memory consumption?
>>
>> Also, don't forget to add it to
>> https://commitfest.postgresql.org/action/commitfest_view/open
>
> Would be good to have the patch updated to be against HEAD before
> posting to the commitfest.
>
> Thanks,
>
> Stephen

we will definitely provide something which is for HEAD.
but, it seems the problem we are looking at is not sufficiently fixed yet.
in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 15:26:55
Message-ID: 20100908152655.GC26232@tamriel.snowman.net
Lists: pgsql-hackers

* Hans-Jürgen Schönig (postgres(at)cybertec(dot)at) wrote:
> but, it seems the problem we are looking is not sufficiently fixed yet.
> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

An 18% increase is certainly nice, provided it doesn't slow down or
break other things.. I'm looking through the patch now actually and
I'm not really happy with the naming, comments, or some of the code
flow, but I think the concept looks reasonable.

Thanks,

Stephen


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 15:33:26
Message-ID: AANLkTikE8GAGmSGvKdd=BL2dRX9frnT8uzoXscu9zpUD@mail.gmail.com
Lists: pgsql-hackers

2010/9/8 Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
> On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:
>
>> * Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
>>> Neat.  Have you checked what effect this has on memory consumption?
>>>
>>> Also, don't forget to add it to
>>> https://commitfest.postgresql.org/action/commitfest_view/open
>>
>> Would be good to have the patch updated to be against HEAD before
>> posting to the commitfest.
>
>
> we will definitely provide something which is for HEAD.
> but, it seems the problem we are looking is not sufficiently fixed yet.
> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

Just remember that four small patches (say) are apt to get committed
faster than one big one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 15:37:30
Message-ID: 20100908153730.GD26232@tamriel.snowman.net
Lists: pgsql-hackers

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> 2010/9/8 Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
> > but, it seems the problem we are looking is not sufficiently fixed yet.
> > in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>
> Just remember that four small patches (say) are apt to get committed
> faster than one big one.

Indeed, but code like this makes me wonder if this is really working the
way it's supposed to:

+ val1 = (long)pk_left->pk_eclass;
+ val2 = (long)pk_right->pk_eclass;
+
+ if (val1 < val2)
+ return -1;
+ else if (val1 > val2)
+ return 1;

Have you compared how big the tree gets to the size of the list it's
supposed to be replacing..?

Thanks,

Stephen


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 15:39:35
Message-ID: 1283960300-sup-4372@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Stephen Frost's message of mié sep 08 11:26:55 -0400 2010:
> * Hans-Jürgen Schönig (postgres(at)cybertec(dot)at) wrote:
> > but, it seems the problem we are looking is not sufficiently fixed yet.
> > in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>
> An 18% increase is certainly nice, provided it doesn't slow down or
> break other things.. I'm looking through the patch now actually and
> I'm not really happy with the naming, comments, or some of the code
> flow, but I think the concept looks reasonable.

I don't understand the layering between pg_tree and rbtree. Why does it
exist at all? At first I thought this was another implementation of
rbtrees, but then I noticed it sits on top of it. Is this really
necessary?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 15:42:14
Message-ID: 4096.1283960534@sss.pgh.pa.us
Lists: pgsql-hackers

Hans-Jürgen Schönig <postgres(at)cybertec(dot)at> writes:
> here is the patch again.

This patch seems remarkably documentation-free. What is it trying to
accomplish and what is it doing to the planner data structures?
(Which do have documentation BTW.) Also, what will it do to runtime
in normal cases where the pathkeys list isn't that long?

regards, tom lane


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 15:57:47
Message-ID: 4C87B27B.3010002@cybertec.at
Lists: pgsql-hackers

Stephen Frost írta:
> * Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
>
>> 2010/9/8 Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
>>
>>> but, it seems the problem we are looking is not sufficiently fixed yet.
>>> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>>>
>> Just remember that four small patches (say) are apt to get committed
>> faster than one big one.
>>
>
> Indeed, but code like this makes me wonder if this is really working the
> way it's supposed to:
>
> + val1 = (long)pk_left->pk_eclass;
> + val2 = (long)pk_right->pk_eclass;
> +
> + if (val1 < val2)
> + return -1;
> + else if (val1 > val2)
> + return 1;
>

The original code checked for pointers being equal among
other conditions. This was an (almost) straight conversion
to a comparison function for rbtree. Do you mean casting
the pointer to long? Yes, e.g. on 64-bit Windows it wouldn't
work. Back to plain pointer comparison.

> Have you compared how big the tree gets to the size of the list it's
> supposed to be replacing..?
>

No, but I think it's obvious. Now we have one TreeCell
where we had one ListCell.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 15:58:09
Message-ID: 4C87B291.5030106@cybertec.at
Lists: pgsql-hackers

Alvaro Herrera írta:
> Excerpts from Stephen Frost's message of mié sep 08 11:26:55 -0400 2010:
>
>> * Hans-Jürgen Schönig (postgres(at)cybertec(dot)at) wrote:
>>
>>> but, it seems the problem we are looking is not sufficiently fixed yet.
>>> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>>>
>> An 18% increase is certainly nice, provided it doesn't slow down or
>> break other things.. I'm looking through the patch now actually and
>> I'm not really happy with the naming, comments, or some of the code
>> flow, but I think the concept looks reasonable.
>>
>
> I don't understand the layering between pg_tree and rbtree. Why does it
> exist at all? At first I thought this was another implementation of
> rbtrees, but then I noticed it sits on top of it. Is this really
> necessary?
>

No, if it's acceptable to omit PlannerInfo from outfuncs.c.
Or at least its canon_pathkeys member. Otherwise yes, it's
necessary. We need to store (Node *) in a fast searchable way.

This applies to anything else that may need to be converted
from list to tree to decrease planning time. Like ec_members
in EquivalenceClass.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 16:03:48
Message-ID: 4593.1283961828@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> Indeed, but code like this makes me wonder if this is really working the
> way it's supposed to:

> + val1 = (long)pk_left->pk_eclass;
> + val2 = (long)pk_right->pk_eclass;

Ugh. Casting a pointer to long? We do have portable ways to do what
this is trying to do, but that is not one. (For example, this is
guaranteed to misbehave on 64-bit Windows.)

Offhand I think PointerGetDatum might be the best way.
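For illustration only, a sketch of how the quoted comparison might look with this suggestion (this is not the actual patch; the function name is invented):

    #include "postgres.h"
    #include "nodes/relation.h"

    /*
     * Hypothetical comparator ordering two pathkeys by the pointer identity
     * of their equivalence class, using PointerGetDatum rather than a cast
     * to long (which loses bits on 64-bit Windows, where long is 32 bits).
     */
    static int
    canon_pathkey_cmp(const PathKey *pk_left, const PathKey *pk_right)
    {
        Datum   d1 = PointerGetDatum(pk_left->pk_eclass);
        Datum   d2 = PointerGetDatum(pk_right->pk_eclass);

        if (d1 < d2)
            return -1;
        if (d1 > d2)
            return 1;
        return 0;
    }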

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 16:08:29
Message-ID: 4691.1283962109@sss.pgh.pa.us
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> This applies to anything else that may need to be converted
> from list to tree to decrease planning time. Like ec_members
> in EquivalenceClass.

AFAIR, canonical pathkeys are the *only* thing in the planner where pure
pointer equality is interesting. So I doubt this hack is of any use for
EquivalenceClass, even if the lists were likely to be long which they
aren't.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 16:26:52
Message-ID: 5075.1283963212@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> I'm not really happy with the naming, comments, or some of the code
> flow, but I think the concept looks reasonable.

There seems to be a lot of unnecessary palloc/pfree traffic in this
implementation. Getting rid of that might help the speed.

regards, tom lane


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 17:05:59
Message-ID: 4C87C277.9000804@cybertec.at
Lists: pgsql-hackers

Tom Lane írta:
> Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
>
>> This applies to anything else that may need to be converted
>> from list to tree to decrease planning time. Like ec_members
>> in EquivalenceClass.
>>
>
> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
> pointer equality is interesting. So I doubt this hack is of any use for
> EquivalenceClass, even if the lists were likely to be long which they
> aren't.
>
> regards, tom lane
>

No, for EquivalenceClass->ec_members, I need to do something
funnier, like implement compare(Node *, Node *) and use that
instead of equal(Node *, Node *)... Something like nodeToString()
on both Node * and strcmp() the resulting strings.

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 17:08:24
Message-ID: 4C87C308.4070403@cybertec.at
Lists: pgsql-hackers

Tom Lane írta:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
>
>> I'm not really happy with the naming, comments, or some of the code
>> flow, but I think the concept looks reasonable.
>>
>
> There seems to be a lot of unnecessary palloc/pfree traffic in this
> implementation. Getting rid of that might help the speed.
>
> regards, tom lane
>

This was a first WIP implementation, I will look into it, thanks.

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 17:19:38
Message-ID: 6006.1283966378@sss.pgh.pa.us
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> Tom Lane írta:
>> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
>> pointer equality is interesting. So I doubt this hack is of any use for
>> EquivalenceClass, even if the lists were likely to be long which they
>> aren't.

> No, for EquivalanceClass->ec_member, I need to do something
> funnier, like implement compare(Node *, Node *) and use that
> instead of equal(Node *, Node *)... Something like nodeToString()
> on both Node * and strcmp() the resulting strings.

Well, (a) that doesn't work (hint: there are fields in nodes that are
intentionally ignored by equal()), and (b) I still don't believe that
there's an actual bottleneck there. ECs generally aren't very big.

regards, tom lane


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 18:49:03
Message-ID: 4C87DA9F.8090003@cybertec.at
Lists: pgsql-hackers

Tom Lane írta:
> Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
>
>> Tom Lane írta:
>>
>>> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
>>> pointer equality is interesting. So I doubt this hack is of any use for
>>> EquivalenceClass, even if the lists were likely to be long which they
>>> aren't.
>>>
>
>
>> No, for EquivalanceClass->ec_member, I need to do something
>> funnier, like implement compare(Node *, Node *) and use that
>> instead of equal(Node *, Node *)... Something like nodeToString()
>> on both Node * and strcmp() the resulting strings.
>>
>
> Well, (a) that doesn't work (hint: there are fields in nodes that are
> intentionally ignored by equal()),

Then this compare() needs to work like equal() and ignore the
same fields that equal() ignores.
nodeToString() would need more space anyway, and comparing
non-equal Nodes can return the nonzero result faster.

> and (b) I still don't believe that
> there's an actual bottleneck there. ECs generally aren't very big.
>

Actually, PlannerInfo->eq_classes needs to be a Tree somehow;
the comparator function is not yet final in my head.

equal() is called over 8 million times with or without our patch:

without

% cumulative self self total
time seconds seconds calls s/call s/call name
18.87 0.80 0.80 4812 0.00 0.00 make_canonical_pathkey
15.33 1.45 0.65 4811 0.00 0.00
get_eclass_for_sort_expr
14.15 2.05 0.60 8342410 0.00 0.00 equal
6.13 2.31 0.26 229172 0.00 0.00 SearchCatCache
3.66 2.47 0.16 5788835 0.00 0.00 _equalList
3.07 2.60 0.13 1450043 0.00 0.00
hash_search_with_hash_value
2.36 2.70 0.10 2272545 0.00 0.00 AllocSetAlloc
2.12 2.79 0.09 811460 0.00 0.00 hash_any
1.89 2.87 0.08 3014983 0.00 0.00 list_head
1.89 2.95 0.08 574526 0.00 0.00 _bt_compare
1.77 3.02 0.08 11577670 0.00 0.00 list_head
1.42 3.08 0.06 1136 0.00 0.00 tzload
0.94 3.12 0.04 2992373 0.00 0.00 AllocSetFreeIndex
0.94 3.16 0.04 91427 0.00 0.00 _bt_checkkeys
...

with

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 24.51      0.88     0.88      4811     0.00     0.00  get_eclass_for_sort_expr
 14.21      1.39     0.51   8342410     0.00     0.00  equal
  8.22      1.69     0.30   5788835     0.00     0.00  _equalList
  5.29      1.88     0.19    229172     0.00     0.00  SearchCatCache
  2.51      1.97     0.09      1136     0.00     0.00  tzload
  2.23      2.05     0.08   3014983     0.00     0.00  list_head
  2.23      2.13     0.08   2283130     0.00     0.00  AllocSetAlloc
  2.09      2.20     0.08    811547     0.00     0.00  hash_any
  2.09      2.28     0.08  11577670     0.00     0.00  list_head
  1.95      2.35     0.07   1450180     0.00     0.00  hash_search_with_hash_value
  1.39      2.40     0.05    640690     0.00     0.00  _bt_compare
  1.39      2.45     0.05    157944     0.00     0.00  LockAcquireExtended
  1.39      2.50     0.05     11164     0.00     0.00  AllocSetCheck
  1.11      2.54     0.04   3010547     0.00     0.00  AllocSetFreeIndex
  1.11      2.58     0.04    874975     0.00     0.00  AllocSetFree
  1.11      2.62     0.04     66211     0.00     0.00  heap_form_tuple
  0.84      2.65     0.03    888128     0.00     0.00  LWLockRelease
...

The number of calls is the same for equal() and _equalList()
in both cases, as you can see.

And if you compare the number of AllocSetAlloc calls,
it's actually smaller with our patch, so it's not that the
conversion to Tree caused this.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-08 19:27:29
Message-ID: 8408.1283974049@sss.pgh.pa.us
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> equal() is called over 8 million times with or without our patch:

From where, though? You've provided not a shred of evidence that
searching large ec_member lists is the problem.

Also, did the test case you're using ever make it to the list?

regards, tom lane


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-09-10 09:14:41
Message-ID: 4C89F701.7030408@cybertec.at
Lists: pgsql-hackers

Tom Lane írta:
> Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
>
>> equal() is called over 8 million times with or without our patch:
>>
>
> From where, though? You've provided not a shred of evidence that
> searching large ec_member lists is the problem.
>

Indeed. I have put elog(NOTICE) calls in there to see how long
each list is. It turned out that the length of ec_members is either 0
or 1, mostly 1, but the length of eq_classes is constantly growing.
That is what I need to attack, then.
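A sketch of the kind of instrumentation described here (the helper name and call site are assumptions):

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "nodes/relation.h"

    /*
     * Hypothetical diagnostic helper: report how long eq_classes is and how
     * long each ec_members list is at some point during planning.
     */
    static void
    report_eclass_list_lengths(PlannerInfo *root)
    {
        ListCell   *lc;

        elog(NOTICE, "eq_classes length: %d", list_length(root->eq_classes));

        foreach(lc, root->eq_classes)
        {
            EquivalenceClass *ec = (EquivalenceClass *) lfirst(lc);

            elog(NOTICE, "  ec_members length: %d",
                 list_length(ec->ec_members));
        }
    }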

> Also, did the test case you're using ever make it to the list?
>

No, because it was too large and because the test case
accidentally contained confidential information, we asked
Bruce to delete it from the moderator queue.

Attached is a shortened test case that does the same and
also shows the same problem. The create_table.sql
creates the parent table, the table_generator.sql produces
the list of constraint excluded child tables and their indexes.

The gprof output of this is only slightly changed; the
number of equal() calls is still over 8 million. It is also
attached.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

Attachment Content-Type Size
create_table.sql text/plain 211 bytes
table_generator.sql text/plain 2.5 KB
gmon.log1.gz application/x-tar 125.2 KB

From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-19 13:32:12
Message-ID: 4CBD9DDC.4040304@cybertec.at
Lists: pgsql-hackers

Hi,

attached is a WIP patch against current 9.1 GIT that converts
the eq_classes and canon_pathkeys lists in PlannerInfo to tree structures.

Also attached is the test case again; the slow query is:

explain select * from inh_parent
where timestamp1 between '2010-04-06' and '2010-06-25'
order by timestamp2;

There is intentionally no data; it is the planning time that is slow.
The current GIT version plans this query in 2.4 seconds,
the patched version does it in 0.59 seconds according to
gprof. The gprof outputs are also attached.
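(The attachments themselves are not reproduced in the archive; as a rough idea of the shape of the test case, inferred from the query above, here is a sketch. The actual create_table.sql/childtables.sql may differ.)

    -- Hypothetical shape of the test schema: a parent with two timestamp
    -- columns and many children constraint-excluded on timestamp1.
    CREATE TABLE inh_parent (
        timestamp1 timestamptz,
        timestamp2 timestamptz
    );

    CREATE TABLE inh_child_2010_04 (
        CHECK (timestamp1 >= '2010-04-01' AND timestamp1 < '2010-05-01')
    ) INHERITS (inh_parent);
    -- ... repeated for many more ranges, each child with its own indexes
    -- on timestamp1 and timestamp2.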

There is one problem with the patch: it doesn't survive
"make check". One of the regression tests trips the
Assert(!cur_em->em_is_child);
line in process_equivalence() in equivclass.c, but I couldn't
yet find what causes it. The "why" is vaguely clear:
something modifies the ec_members list in the eq_classes'
tree nodes while the node is in the tree. Because I haven't found
the offender yet, I couldn't fix it, so I am sending this patch as is.
I'll try to fix it unless someone beats me to it. :)

The query produces the same EXPLAIN output for both the
stock and the patched versions; they were checked with diff.
I didn't attach it to this mail because of the size constraints.
Almost all files are compressed for the same reason.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

Attachment Content-Type Size
9.1-planner-speedup.patch.gz application/x-tar 20.9 KB
create_table.sql text/plain 210 bytes
childtables.sql.gz application/x-tar 18.1 KB
stock-gmon.log.gz application/x-tar 113.6 KB
patched-gmon.log.gz application/x-tar 112.2 KB

From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-19 19:50:55
Message-ID: 4CBDF69F.9030205@cybertec.at
Lists: pgsql-hackers

Hi,

Boszormenyi Zoltan írta:
> There is one problem with the patch, it doesn't survive
> "make check". One of the regression tests fails the
> Assert(!cur_em->em_is_child);
> line in process_equivalence() in equivclass.c, but I couldn't
> yet find it what causes it. The "why" is vaguely clear:
> something modifies the ec_members list in the eq_classes'
> tree nodes while the node is in the tree. Because I didn't find
> the offender yet, I couldn't fix it, so I send this patch as is.
> I'll try to fix it if someone doesn't beat me in fixing it. :)
>

I am a little closer to this bug; maybe I have even found its cause.
I found that process_equivalence() is called from:

path/equivclass.c:
reconsider_outer_join_clause()
reconsider_full_join_clause()
plan/initsplan.c:
distribute_qual_to_rels()

The problem is with the two functions in path/equivclass.c,
as process_equivalence() and those functions all walk
the tree, and the current RBTree code can only deal with
one walk at a time. We need to push/pop the iterator state
to be able to serve more than one walker.

Also, we need to split out the tree modifying part from
process_equivalence() somehow, as the tree walking
also cannot deal with node additions and deletions.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-19 20:14:43
Message-ID: 8193.1287519283@sss.pgh.pa.us
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> The problem is with the two functions in path/equivclass.c,
> as process_equivalance() and those functions are all walk
> the tree, and the current RBTree code can only deal with
> one walk at a time. We need to push/pop the iterator state
> to be able to serve more than one walkers.

Good luck with that --- the iteration state is embedded in the rbtree.

> Also, we need to split out the tree modifying part from
> process_equivalence() somehow, as the tree walking
> also cannot deal with node additions and deletions.

That's not happening either. Maybe you need to think of some other data
structure to use. Hashtable maybe? dynahash.c at least has reasonably
well-defined limitations in this area.
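(A minimal sketch of the dynahash pattern being pointed at, keyed on a pointer value much like the join_rel_hash lookup; the entry layout and names are illustrative, not from the eventual patch.)

    #include "postgres.h"
    #include "utils/hsearch.h"

    /* Entry keyed by the pointer itself; the key must come first. */
    typedef struct EClassHashEntry
    {
        void   *key;            /* e.g. the EquivalenceClass pointer */
        int     payload;        /* whatever needs to be found quickly */
    } EClassHashEntry;

    static HTAB *
    create_eclass_hash(void)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(void *);
        ctl.entrysize = sizeof(EClassHashEntry);
        ctl.hash = tag_hash;    /* hash the raw pointer bytes */

        return hash_create("eclass lookup hash", 256, &ctl,
                           HASH_ELEM | HASH_FUNCTION);
    }

    /* Look up, or insert, the entry for a given pointer. */
    static EClassHashEntry *
    eclass_hash_enter(HTAB *hash, void *eclass_ptr)
    {
        bool        found;

        return (EClassHashEntry *) hash_search(hash, &eclass_ptr,
                                               HASH_ENTER, &found);
    }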

regards, tom lane


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-26 15:34:56
Message-ID: 4CC6F520.1030408@cybertec.at
Lists: pgsql-hackers

Hi,

Tom Lane írta:
> Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
>
>> The problem is with the two functions in path/equivclass.c,
>> as process_equivalance() and those functions are all walk
>> the tree, and the current RBTree code can only deal with
>> one walk at a time. We need to push/pop the iterator state
>> to be able to serve more than one walkers.
>>
>
> Good luck with that --- the iteration state is embedded in the rbtree.
>
>
>> Also, we need to split out the tree modifying part from
>> process_equivalence() somehow, as the tree walking
>> also cannot deal with node additions and deletions.
>>
>
> That's not happening either. Maybe you need to think of some other data
> structure to use. Hashtable maybe? dynahash.c at least has reasonably
> well-defined limitations in this area.
>
> regards, tom lane
>

thank you very much for pointing me to dynahash, here is the
next version that finally seems to work.

Two patches are attached. The first is the absolute minimum for
making it work: it still has the Tree type for canon_pathkeys,
and eq_classes gets the same treatment as join_rel_list/join_rel_hash
has in the current sources: if the list grows larger than 32 entries, a hash table
is created. It seems to be enough to do this for
get_eclass_for_sort_expr() only;
the other users of eq_classes aren't bothered by this change.

The total speedup from these two changes is in the 70+ percent
range: a slightly later GIT version than the tree I
tested with before shows 1.74 vs. 0.41 seconds of runtime for the
example query. These are with asserts and profiling enabled, of
course. Without asserts and profiling enabled, the "time psql"
figures are:

$ time psql -p 54321 -c "explain select * from inh_parent where
timestamp1 between '2010-04-06' and '2010-06-25' order by timestamp2"
>/dev/null

real 0m1.932s
user 0m0.035s
sys 0m0.002s

vs.

real 0m0.630s
user 0m0.033s
sys 0m0.002s

The second patch contains extra infrastructure for the Tree type;
it's currently unused and was created for experimenting with eq_classes
being a tree. It may be useful for someone, though.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

Attachment Content-Type Size
9.1-planner-speedup-v5.patch text/x-patch 28.7 KB
9.1-planner-speedup-v5-extra-ifs.patch text/x-patch 75.5 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-27 15:41:00
Message-ID: 4CC8480C.409@enterprisedb.com
Lists: pgsql-hackers

On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
> thank you very much for pointing me to dynahash, here is the
> next version that finally seems to work.
>
> Two patches are attached, the first is the absolute minimum for
> making it work, this still has the Tree type for canon_pathkeys
> and eq_classes got the same treatment as join_rel_list/join_rel_hash
> has in the current sources: if the list grows larger than 32, a hash table
> is created. It seems to be be enough for doing in for
> get_eclass_for_sort_expr()
> only, the other users of eq_classes aren't bothered by this change.

That's better, but can't you use dynahash for canon_pathkeys as well?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 09:33:22
Message-ID: 4CC94362.1000402@cybertec.at
Lists: pgsql-hackers

Heikki Linnakangas írta:
> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>> thank you very much for pointing me to dynahash, here is the
>> next version that finally seems to work.
>>
>> Two patches are attached, the first is the absolute minimum for
>> making it work, this still has the Tree type for canon_pathkeys
>> and eq_classes got the same treatment as join_rel_list/join_rel_hash
>> has in the current sources: if the list grows larger than 32, a hash
>> table
>> is created. It seems to be be enough for doing in for
>> get_eclass_for_sort_expr()
>> only, the other users of eq_classes aren't bothered by this change.
>
> That's better, but can't you use dynahash for canon_pathkeys as well?

Here's a purely dynahash solution. It's somewhat slower than
the tree version, 0.45 vs 0.41 seconds in the cached case for the
previously posted test case.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

Attachment Content-Type Size
9.1-planner-speedup-v6.patch text/x-patch 15.5 KB

From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 09:40:28
Message-ID: 4CC9450C.8070507@cybertec.at
Lists: pgsql-hackers

Boszormenyi Zoltan írta:
> Heikki Linnakangas írta:
>
>> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>>
>>> thank you very much for pointing me to dynahash, here is the
>>> next version that finally seems to work.
>>>
>>> Two patches are attached, the first is the absolute minimum for
>>> making it work, this still has the Tree type for canon_pathkeys
>>> and eq_classes got the same treatment as join_rel_list/join_rel_hash
>>> has in the current sources: if the list grows larger than 32, a hash
>>> table
>>> is created. It seems to be be enough for doing in for
>>> get_eclass_for_sort_expr()
>>> only, the other users of eq_classes aren't bothered by this change.
>>>
>> That's better, but can't you use dynahash for canon_pathkeys as well?
>>
>
> Here's a purely dynahash solution. It's somewhat slower than
> the tree version, 0.45 vs 0.41 seconds in the cached case for the
> previously posted test case.
>

And now in context diff, sorry for my affection towards unified diffs. :-)

> Best regards,
> Zoltán Böszörményi
>
>

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

Attachment Content-Type Size
9.1-planner-speedup-v6-ctxdiff.patch text/x-patch 16.9 KB

From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 10:54:59
Message-ID: 4CC95683.4020101@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Boszormenyi Zoltan írta:
> Boszormenyi Zoltan írta:
>
>> Heikki Linnakangas írta:
>>
>>
>>> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>>>
>>>
>>>> thank you very much for pointing me to dynahash, here is the
>>>> next version that finally seems to work.
>>>>
>>>> Two patches are attached. The first is the absolute minimum for
>>>> making it work: it still has the Tree type for canon_pathkeys,
>>>> and eq_classes gets the same treatment as join_rel_list/join_rel_hash
>>>> has in the current sources: if the list grows larger than 32, a hash
>>>> table is created. It seems to be enough to do this for
>>>> get_eclass_for_sort_expr() only; the other users of eq_classes
>>>> aren't bothered by this change.
>>>>
>>>>
>>> That's better, but can't you use dynahash for canon_pathkeys as well?
>>>
>>>
>> Here's a purely dynahash solution. It's somewhat slower than
>> the tree version, 0.45 vs 0.41 seconds in the cached case for the
>> previously posted test case.
>>
>>
>
> And now in context diff, sorry for my affection towards unified diffs. :-)
>

A little better version: no need for the heavy hash_any, hash_uint32
on the lower 32 bits of pk_eclass is enough. The profiling runtime
is now 0.42 seconds vs the previous 0.41 seconds for the tree version.
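
(Illustration only, not the patch: with a dynahash key like the one sketched
earlier in the thread, hashing just the low 32 bits of the pk_eclass pointer
could be done with a custom hash function along these lines. The function name
is hypothetical; it would be installed via ctl.hash = canon_pathkey_hash when
creating the table, and dynahash still compares the full key on collisions.)

static uint32
canon_pathkey_hash(const void *key, Size keysize)
{
    const CanonPathKeyKey *k = (const CanonPathKeyKey *) key;

    /* Equal keys share the same pk_eclass pointer, so its low-order
     * 32 bits are enough to spread entries across buckets. */
    return DatumGetUInt32(hash_uint32((uint32) (uintptr_t) k->pk_eclass));
}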

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

Attachment Content-Type Size
9.1-planner-speedup-v7-ctxdiff.patch text/x-patch 16.9 KB

From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 11:29:30
Message-ID: 4CC95E9A.5000605@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Boszormenyi Zoltan írta:
> Boszormenyi Zoltan írta:
>
>> Boszormenyi Zoltan írta:
>>
>>
>>> Heikki Linnakangas írta:
>>>
>>>
>>>
>>>> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>>>>
>>>>
>>>>
>>>>> thank you very much for pointing me to dynahash, here is the
>>>>> next version that finally seems to work.
>>>>>
>>>>> Two patches are attached. The first is the absolute minimum for
>>>>> making it work: it still has the Tree type for canon_pathkeys,
>>>>> and eq_classes gets the same treatment as join_rel_list/join_rel_hash
>>>>> has in the current sources: if the list grows larger than 32, a hash
>>>>> table is created. It seems to be enough to do this for
>>>>> get_eclass_for_sort_expr() only; the other users of eq_classes
>>>>> aren't bothered by this change.
>>>>>
>>>>>
>>>>>
>>>> That's better, but can't you use dynahash for canon_pathkeys as well?
>>>>
>>>>
>>>>
>>> Here's a purely dynahash solution. It's somewhat slower than
>>> the tree version, 0.45 vs 0.41 seconds in the cached case for the
>>> previously posted test case.
>>>
>>>
>>>
>> And now in context diff, sorry for my affection towards unified diffs. :-)
>>
>>
>
> A little better version: no need for the heavy hash_any, hash_uint32
> on the lower 32 bits of pk_eclass is enough. The profiling runtime
> is now 0.42 seconds vs the previous 0.41 seconds for the tree version.
>
> Best regards,
> Zoltán Böszörményi
>

Btw, the top entries in the current gprof output are:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
19.05      0.08     0.08        482     0.17     0.29  add_child_rel_equivalences
11.90      0.13     0.05    1133447     0.00     0.00  bms_is_subset
 9.52      0.17     0.04     331162     0.00     0.00  hash_search_with_hash_value
 7.14      0.20     0.03     548971     0.00     0.00  AllocSetAlloc
 4.76      0.22     0.02       2858     0.01     0.01  get_tabstat_entry
 4.76      0.24     0.02       1136     0.02     0.02  tzload

This means add_child_rel_equivalences() still takes
too much time. The previously posted test case calls this
function 482 times; it's called for almost every 10th entry
added to eq_classes. The elog() I put into this function says
that at the last call list_length(eq_classes) == 4754.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 11:35:23
Message-ID: 4CC95FFB.4010502@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 28.10.2010 13:54, Boszormenyi Zoltan wrote:
> A little better version: no need for the heavy hash_any, hash_uint32
> on the lower 32 bits of pk_eclass is enough. The profiling runtime
> is now 0.42 seconds vs the previous 0.41 seconds for the tree version.

Actually, I wonder if we could just have a separate canon_pathkeys list
for each EquivalenceClass, instead of one big list in PlannerInfo. I'm
not too familiar with equivalence classes and all that, but the attached
patch at least passes the regressions. I haven't done any performance
testing, but I would expect this to be even faster than the hashtable or
tree implementations, and a lot simpler.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
per-eclass-canon-pathkeys.patch text/x-diff 3.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 13:36:57
Message-ID: 11493.1288273017@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> This means add_child_rel_equivalences() still takes
> too much time. The previously posted test case calls this
> function 482 times; it's called for almost every 10th entry
> added to eq_classes. The elog() I put into this function says
> that at the last call list_length(eq_classes) == 4754.

That seems like a ridiculously large number of ECs. What is the
test query again?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 13:49:06
Message-ID: 11634.1288273746@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Actually, I wonder if we could just have a separate canon_pathkeys list
> for each EquivalenceClass, instead of one big list in PlannerInfo. I'm
> not too familiar with equivalence classes and all that,

Hm. I don't like getting rid of the main canon_pathkeys list like that.
The whole point of a canonical pathkey is that there is only one, so
it seems like we need a central list. But it might be sane for each
EC to have an additional, side list of PKs made from it.
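
(To make that concrete, a sketch of the side-list variant, not code from any of
the patches in this thread: ec_pathkeys is a hypothetical extra List field on
EquivalenceClass, and memory-context handling is omitted. The central
canon_pathkeys list is kept; the side list just shortens the search.)

PathKey *
make_canonical_pathkey(PlannerInfo *root, EquivalenceClass *eclass,
                       Oid opfamily, int strategy, bool nulls_first)
{
    PathKey    *pk;
    ListCell   *lc;

    /* Only pathkeys made from this EC can possibly match. */
    foreach(lc, eclass->ec_pathkeys)
    {
        pk = (PathKey *) lfirst(lc);
        if (pk->pk_opfamily == opfamily &&
            pk->pk_strategy == strategy &&
            pk->pk_nulls_first == nulls_first)
            return pk;
    }

    /* No match: build a new canonical pathkey and remember it in both lists. */
    pk = makePathKey(eclass, opfamily, strategy, nulls_first);
    eclass->ec_pathkeys = lappend(eclass->ec_pathkeys, pk);
    root->canon_pathkeys = lappend(root->canon_pathkeys, pk);
    return pk;
}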

regards, tom lane


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 17:11:13
Message-ID: 4CC9AEB1.7070304@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane írta:
> Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
>
>> This means add_child_rel_equivalences() still takes
>> too much time. The previously posted test case calls this
>> function 482 times; it's called for almost every 10th entry
>> added to eq_classes. The elog() I put into this function says
>> that at the last call list_length(eq_classes) == 4754.
>>
>
> That seems like a ridiculously large number of ECs. What is the
> test query again?
>
> regards, tom lane
>

The test case is here:
http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

create_table.sql for the main table plus childtables.sql.gz, the EXPLAIN
query is in the message body.

Basically, it's a model for a lot of data over three months, partitioned into
4-hour intervals for every day. Imagine the call list handled by a
phone company in a large country.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-28 22:56:49
Message-ID: 27684.1288306609@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> Tom Lane írta:
>> That seems like a ridiculously large number of ECs. What is the
>> test query again?

> The test case is here:
> http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

After poking through that a bit, I think that the real issue is in this
division of labor:

    index_pathkeys = build_index_pathkeys(root, index,
                                          ForwardScanDirection);
    useful_pathkeys = truncate_useless_pathkeys(root, rel,
                                                index_pathkeys);

If you trace what is happening here, the index pathkeys that actually
survive the "usefulness" test all refer to exactly ONE equivalence
class, namely the one arising from the query's "order by timestamp2"
clause. All the other pathkeys that get created are immediately
discarded as being irrelevant to the query. The reason that we end up
with so many equivalence classes is that there is nothing causing the
variables of the different child tables to be recognized as all
sort-equivalent. Maybe that's a bug in itself, but I would argue that
the right way to make this faster is to refactor things so that we
don't generate useless equivalence classes in the first place, or
at least don't keep them around in the planner's lists once we realize
they're useless.

I like Heikki's hack to cut down on searching in make_canonical_pathkey,
but I think that complicating the data structure searching beyond that
is just a band-aid. Reasonably-sized queries shouldn't contain very
many equivalence classes: they should only come from equality clauses
or sort conditions that appeared in the query text. Therefore, there
also shouldn't be all that many distinct pathkeys.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 06:08:34
Message-ID: 24391.1288332514@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> the right way to make this faster is to refactor things so that we
> don't generate useless equivalence classes in the first place, or
> at least don't keep them around in the planner's lists once we realize
> they're useless.

After a bit of hacking, I propose the attached patch.

> I like Heikki's hack to cut down on searching in make_canonical_pathkey,
> but I think that complicating the data structure searching beyond that
> is just a band-aid.

With the given test case and this patch, we end up with exactly two
canonical pathkeys referencing a single EquivalenceClass. So as far
as I can tell there's not a lot of point in refining the pathkey
searching. Now, the EquivalenceClass has got 483 members, which
means that there's still some O(N^2) behavior in
get_eclass_for_sort_expr. There might be some use in refining the
search for a matching eclass member. It's not sticking out in
profiling like it did before though.

regards, tom lane

Attachment Content-Type Size
avoid-useless-eclasses.patch text/x-patch 27.6 KB

From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 08:00:29
Message-ID: 4CCA7F1D.8010509@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane írta:
> I wrote:
>
>> the right way to make this faster is to refactor things so that we
>> don't generate useless equivalence classes in the first place, or
>> at least don't keep them around in the planner's lists once we realize
>> they're useless.
>>
>
> After a bit of hacking, I propose the attached patch.
>
>
>> I like Heikki's hack to cut down on searching in make_canonical_pathkey,
>> but I think that complicating the data structure searching beyond that
>> is just a band-aid.
>>
>
> With the given test case and this patch, we end up with exactly two
> canonical pathkeys referencing a single EquivalenceClass. So as far
> as I can tell there's not a lot of point in refining the pathkey
> searching. Now, the EquivalenceClass has got 483 members, which
> means that there's still some O(N^2) behavior in
> get_eclass_for_sort_expr. There might be some use in refining the
> search for a matching eclass member. It's not sticking out in
> profiling like it did before though.
>
> regards, tom lane
>

Thanks, this patch made get_eclass_for_sort_expr almost disappear,
and make_canonical_pathkey and add_child_rel_equivalences
completely disappear, from the gprof timing.

+1 for including this into 9.1.

On the other hand, if I use a similar test case to my original one
(i.e. the tables are much wider) then the query planning takes
1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
output now looks like this:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
21.13      0.30     0.30     235091     0.00     0.00  SearchCatCache
 7.04      0.40     0.10    1507206     0.00     0.00  hash_search_with_hash_value
 3.52      0.45     0.05    2308219     0.00     0.00  AllocSetAlloc
 3.52      0.50     0.05     845776     0.00     0.00  hash_any
 3.52      0.55     0.05     341637     0.00     0.00  HeapTupleSatisfiesNow
 3.52      0.60     0.05       1136     0.00     0.00  tzload
 2.82      0.64     0.04        547     0.00     0.00  get_rel_data_width
 2.11      0.67     0.03     669414     0.00     0.00  hash_search
 2.11      0.70     0.03     235091     0.00     0.00  SearchSysCache
 2.11      0.73     0.03     192590     0.00     0.00  copyObject
 2.11      0.76     0.03     164457     0.00     0.00  pgstat_initstats
 2.11      0.79     0.03     152999     0.00     0.00  index_getnext
...

Use the attached synthetic create_table_wide.sql together with the
previous childtables.sql. The full compressed gprof output is attached.
Your patch creates a 70% speedup in planning time, which is excellent.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

Attachment Content-Type Size
create_table_wide.sql text/plain 2.7 KB
gmon.log.gz application/x-tar 114.6 KB

From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 08:48:59
Message-ID: 403806.56889.qm@web29020.mail.ird.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On the other hand, if I use a similar test case to my original one
> (i.e. the tables are much wider) then the query planning takes
> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like this:

Hi,

I'm really interested in this patch.

I tried a simple test case:

create table t (a integer, b text);

DO $$
DECLARE i int;
BEGIN
    FOR i IN 0..9000 LOOP
        EXECUTE 'create table t' || i ||
                ' ( CHECK (a > ' || i*10 || ' and a <= ' || (i+1)*10 || ' ) )' ||
                ' INHERITS (t)';
        EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
    END LOOP;
END$$;

explain select * from t where a > 1060 and a < 1090;

but I don't get any gain from the patch... explain time is still around 250 ms.

Tried with 9000 partitions, time is still 2 secs.

Maybe I've completely missed the purpose of the patch?

(I tried the test case at

http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

and that, in fact, gets a boost with this patch).

Leonardo


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 08:57:20
Message-ID: 570592.14631.qm@web29006.mail.ird.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> but I don't get any gain from the patch... explain time is still around 250 ms.
> Tried with 9000 partitions, time is still 2 secs.

Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...)
and got 250 ms with both versions; with 9000 partitions it was 2 secs (again
no gain from the patch).


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 15:23:38
Message-ID: 2008.1288365818@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Leonardo Francalanci <m_lists(at)yahoo(dot)it> writes:
> I tried a simple test case:

> create table t (a integer, b text);

> DO $$DECLARE i int;
> BEGIN
> FOR i IN 0..9000 LOOP
> EXECUTE 'create table t' || i || ' ( CHECK (a >' || i*10 || '
> and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
> EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
> END LOOP;
> END$$;

> explain select * from t where a > 1060 and a < 1090;

This is going to be dominated by constraint exclusion checking. There's
basically no fix for that except a more explicit representation of the
partitioning rules. If the planner has to make 8999 theorem proofs to
remove the 8999 unwanted partitions from the plan, it's gonna take
awhile.

regards, tom lane


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 15:41:18
Message-ID: 919158.72344.qm@web29006.mail.ird.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> This is going to be dominated by constraint exclusion checking. There's
> basically no fix for that except a more explicit representation of the
> partitioning rules.

Damn, I knew that was going to be more complicated :)

So in which cases does this patch help? I guess in a multi-index
scenario? childtables.sql is kind of hard to read (I think a FOR loop
would have been more self-explanatory).


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 16:53:30
Message-ID: 3644.1288371210@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> On the other hand, if I use a similar test case to my original one
> (i.e. the tables are much wider) then the query planning takes
> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like this:

> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 21.13 0.30 0.30 235091 0.00 0.00 SearchCatCache
> 7.04 0.40 0.10 1507206 0.00 0.00 hash_search_with_hash_value
> 3.52 0.45 0.05 2308219 0.00 0.00 AllocSetAlloc

Yeah, for me it looks even worse: oprofile shows about 77% of time in
SearchCatCache. I poked around a little and it seems that probably most
of the time is going into searches of the STATRELATTINH syscache, which
looks like this:

$13 = {id = 41, cc_next = 0x2b43a60,
cc_relname = 0x7f6bc6ed2218 "pg_statistic", cc_reloid = 2619,
cc_indexoid = 2696, cc_relisshared = 0 '\000', cc_tupdesc = 0x7f6bc6ed11d8,
cc_ntup = 68922, cc_nbuckets = 1024, cc_nkeys = 3, cc_key = {1, 2, 3, 0},
...

Most of those entries are "negative" cache entries, since we don't have
any actual stats in this toy example.

I think that we probably should be very circumspect about believing that
this example is still a good guide to what to optimize next; in
particular, in a real-world example with real stats, I'm not sure that
the hot spots will still be in the same places. I'd advise loading up
some real data and doing more profiling.

However, if the hot spot does stay in SearchCatCache, I can't help
noticing that those bucket chains are looking a bit overloaded ---
sixty-plus entries per bucket ain't good. Maybe it's time to teach
catcache.c how to reorganize its hashtables once the load factor
exceeds a certain level. Or more drastically, maybe it should lose
its private hashtable logic and use dynahash.c; I'm not sure at the
moment if the private implementation has any important characteristics
dynahash hasn't got.
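
(Generic illustration of the "reorganize once the load factor exceeds a certain
level" idea, with made-up structures rather than catcache.c's: once the average
chain length passes a threshold, double the bucket array and re-link the chains
using the hash values already stored in the entries.)

#include <stdlib.h>

typedef struct Entry
{
    struct Entry *next;         /* bucket chain link */
    unsigned int hashvalue;     /* saved hash of this entry's key */
    /* ... key and payload would follow ... */
} Entry;

typedef struct Table
{
    Entry     **buckets;
    int         nbuckets;       /* always a power of two */
    long        nentries;
} Table;

/* Call after each insertion. */
static void
maybe_enlarge(Table *tab)
{
    Entry     **newbuckets;
    int         newnbuckets;
    int         i;

    if (tab->nentries / tab->nbuckets < 2)
        return;                 /* load factor still acceptable */

    newnbuckets = tab->nbuckets * 2;
    newbuckets = calloc(newnbuckets, sizeof(Entry *));
    if (newbuckets == NULL)
        return;                 /* out of memory: just keep the old table */

    /* Re-link every entry onto its new chain; no key rehashing is needed
     * because the hash value was saved in the entry. */
    for (i = 0; i < tab->nbuckets; i++)
    {
        Entry      *e = tab->buckets[i];

        while (e != NULL)
        {
            Entry      *next = e->next;
            int         b = e->hashvalue & (newnbuckets - 1);

            e->next = newbuckets[b];
            newbuckets[b] = e;
            e = next;
        }
    }
    free(tab->buckets);
    tab->buckets = newbuckets;
    tab->nbuckets = newnbuckets;
}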

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 17:15:55
Message-ID: 3943.1288372555@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> This is going to be dominated by constraint exclusion checking.

Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
like this:

samples % symbol name
447433 47.1553 get_tabstat_entry
185458 19.5456 find_all_inheritors
53064 5.5925 SearchCatCache
33864 3.5690 pg_strtok
26301 2.7719 hash_search_with_hash_value
22577 2.3794 AllocSetAlloc
6696 0.7057 MemoryContextAllocZeroAligned
6250 0.6587 expression_tree_walker
5141 0.5418 LockReleaseAll
4779 0.5037 get_relation_info
4506 0.4749 MemoryContextAlloc
4467 0.4708 expression_tree_mutator
4136 0.4359 pgstat_initstats
3914 0.4125 relation_excluded_by_constraints

get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
the number of tables they have to deal with. However, the constant
factors are small enough that you need a heck of a lot of tables
before they become significant consumers of runtime. I'm not convinced
that we should be optimizing for 9000-child-table cases.

It'd be worth fixing these if we can do it without either introducing a
lot of complexity, or slowing things down for typical cases that have
only a few tables. Offhand not sure about how to do either.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 17:16:44
Message-ID: AANLkTikSAUSgP4SRaOjJy9i55KFF9k1E9UuhWr_MooA-@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 29, 2010 at 12:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
>> On the other hand, if I use a similar test case to my original one
>> (i.e. the tables are much wider) then the query planning takes
>> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
>> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
>> output now looks like this:
>
>>   %   cumulative   self              self     total
>>  time   seconds   seconds    calls   s/call   s/call  name
>>  21.13      0.30     0.30   235091     0.00     0.00  SearchCatCache
>>   7.04      0.40     0.10  1507206     0.00     0.00  hash_search_with_hash_value
>>   3.52      0.45     0.05  2308219     0.00     0.00  AllocSetAlloc
>
> Yeah, for me it looks even worse: oprofile shows about 77% of time in
> SearchCatCache.  I poked around a little and it seems that probably most
> of the time is going into searches of the STATRELATTINH syscache, which
> looks like this:
>
> $13 = {id = 41, cc_next = 0x2b43a60,
>  cc_relname = 0x7f6bc6ed2218 "pg_statistic", cc_reloid = 2619,
>  cc_indexoid = 2696, cc_relisshared = 0 '\000', cc_tupdesc = 0x7f6bc6ed11d8,
>  cc_ntup = 68922, cc_nbuckets = 1024, cc_nkeys = 3, cc_key = {1, 2, 3, 0},
>  ...
>
> Most of those entries are "negative" cache entries, since we don't have
> any actual stats in this toy example.
>
> I think that we probably should be very circumspect about believing that
> this example is still a good guide to what to optimize next; in
> particular, in a real-world example with real stats, I'm not sure that
> the hot spots will still be in the same places.  I'd advise loading up
> some real data and doing more profiling.
>
> However, if the hot spot does stay in SearchCatCache, I can't help
> noticing that those bucket chains are looking a bit overloaded ---
> sixty-plus entries per bucket ain't good.  Maybe it's time to teach
> catcache.c how to reorganize its hashtables once the load factor
> exceeds a certain level.  Or more drastically, maybe it should lose
> its private hashtable logic and use dynahash.c; I'm not sure at the
> moment if the private implementation has any important characteristics
> dynahash hasn't got.

I'm not sure what's happening in this particular case, but I seem to
remember poking at a case a while back where we were doing a lot of
repeated statistics lookups for the same columns. If that's also the
case here, and if there is some way to avoid it (hang a pointer to
the stats off the node tree somewhere?) we might be able to cut down
on the number of hash probes, as an alternative to or in addition to
making them faster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 17:22:12
Message-ID: 131872.8446.qm@web29013.mail.ird.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
> like this:

Well, the 9000-table-test-case was meant to check the difference in
performance with/without the patch... I don't see the reason for trying
to optimize such an unrealistic case.

BTW, can someone explain to me in which cases the
patch actually helps?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 17:31:32
Message-ID: 4207.1288373492@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Leonardo Francalanci <m_lists(at)yahoo(dot)it> writes:
> BTW, can someone explain to me in which cases the
> patch actually helps?

Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
per child table, only one of which was relevant to the query. In your
test case there are no irrelevant indexes, which is why the runtime
didn't change.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 17:44:14
Message-ID: 4411.1288374254@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Oct 29, 2010 at 12:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> However, if the hot spot does stay in SearchCatCache, I can't help
>> noticing that those bucket chains are looking a bit overloaded ---
>> sixty-plus entries per bucket ain't good. Maybe it's time to teach
>> catcache.c how to reorganize its hashtables once the load factor
>> exceeds a certain level. Or more drastically, maybe it should lose
>> its private hashtable logic and use dynahash.c; I'm not sure at the
>> moment if the private implementation has any important characteristics
>> dynahash hasn't got.

> I'm not sure what's happening in this particular case, but I seem to
> remember poking at a case a while back where we were doing a lot of
> repeated statistics lookups for the same columns. If that's also the
> case here, and if there is some way to avoid it (hang a pointer to
> the stats off the node tree somewhere?) we might be able to cut down
> on the number of hash probes, as an alternative to or in addition to
> making them faster.

I think there are already layers of caching in the planner to avoid
fetching the same stats entries more than once per query. The problem
here is that there are so many child tables that even fetching stats
once per table per query starts to add up. (Also, as I said, I'm
worried that we're being misled by the fact that there are no stats to
fetch --- so we're not seeing the costs of actually doing something with
the stats if they existed.)

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 19:11:50
Message-ID: 1288379037-sup-1073@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Tom Lane's message of vie oct 29 14:15:55 -0300 2010:
> I wrote:
> > This is going to be dominated by constraint exclusion checking.
>
> Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
> like this:
>
> samples % symbol name
> 447433 47.1553 get_tabstat_entry

Is there a reason for keeping the pgstat info in plain lists? We could
use dynahash there too, I think. It would have more palloc overhead
that way, though (hmm, but perhaps that can be fixed by having a smart
"alloc" function for it, preallocating a bunch of entries instead of one
by one).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 19:37:39
Message-ID: 6681.1288381059@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of vie oct 29 14:15:55 -0300 2010:
>> samples % symbol name
>> 447433 47.1553 get_tabstat_entry

> Is there a reason for keeping the pgstat info in plain lists?

Yeah: anything else loses for small numbers of tables per query, which
is the normal case. I'd guess you'd need ~100 tables touched in
a single transaction before a hashtable is even worth thinking about.

We could possibly adopt a solution similar to the planner's approach for
joinrels: start with a simple list, and switch over to hashing if the
list gets too long. But I'm really doubtful that it's worth the code
space. Even with Zoltan's 500-or-so-table case, this wasn't on the
radar screen.

regards, tom lane


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 20:08:57
Message-ID: 950077.88955.qm@web29019.mail.ird.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
> per child table, only one of which was relevant to the query. In your
> test case there are no irrelevant indexes, which is why the runtime
> didn't change.

Mmh... I must be doing something wrong. It looks to me it's not just
the irrelevant indexes: it's the "order by" that counts. If I remove that,
times are the same with and without the patch:

using the test case:

explain select * from inh_parent
where timestamp1 between '2010-04-06' and '2010-06-25'

this one runs in the same time with and without the patch; but adding:

order by timestamp2

made the non-patched version run 3 times slower.

My test case:

create table t (a integer, b integer, c integer, d integer, e text);

DO $$
DECLARE i int;
BEGIN
    FOR i IN 0..2000 LOOP
        EXECUTE 'create table t' || i ||
                ' ( CHECK (a > ' || i*10 || ' and a <= ' || (i+1)*10 || ' ) )' ||
                ' INHERITS (t)';
        EXECUTE 'create index taidx' || i || ' ON t' || i || ' (a)';
        EXECUTE 'create index tbidx' || i || ' ON t' || i || ' (b)';
        EXECUTE 'create index tcidx' || i || ' ON t' || i || ' (c)';
        EXECUTE 'create index tdidx' || i || ' ON t' || i || ' (d)';
    END LOOP;
END$$;

explain select * from t where a > 1060 and a < 109000

this runs in 1.5 secs with and without the patch. But if I add

order by b

the non-patched version runs in 10 seconds.

Am I getting it wrong?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-10-29 20:23:36
Message-ID: 8406.1288383816@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Leonardo Francalanci <m_lists(at)yahoo(dot)it> writes:
>> Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
>> per child table, only one of which was relevant to the query. In your
>> test case there are no irrelevant indexes, which is why the runtime
>> didn't change.

> Mmh... I must be doing something wrong. It looks to me it's not just
> the irrelevant indexes: it's the "order by" that counts.

Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or
any mergejoinable join clauses, then the possibly_useful_pathkeys test
in find_usable_indexes figures out that we aren't interested in the sort
ordering of *any* indexes, so the whole thing gets short-circuited.
You need at least the possibility of interest in sorted output from an
indexscan before any of this code runs.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-11-01 14:18:27
Message-ID: 1609.1288621107@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> samples % symbol name
> 447433 47.1553 get_tabstat_entry
> 185458 19.5456 find_all_inheritors
> 53064 5.5925 SearchCatCache
> 33864 3.5690 pg_strtok

> get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
> the number of tables they have to deal with. However, the constant
> factors are small enough that you need a heck of a lot of tables
> before they become significant consumers of runtime. I'm not convinced
> that we should be optimizing for 9000-child-table cases.

> It'd be worth fixing these if we can do it without either introducing a
> lot of complexity, or slowing things down for typical cases that have
> only a few tables. Offhand not sure about how to do either.

I had a thought about how to make get_tabstat_entry() faster without
adding overhead: what if we just plain remove the search, and always
assume that a new entry has to be added to the tabstat array?

The existing code seems to be designed to make no assumptions about
how it's being used, but that's a bit silly. We know that the links are
coming from the relcache, which will have only one entry per relation,
and that the relcache is designed to hang onto the links for (at least)
the life of a transaction. So rather than optimizing for the case where
the relcache fails to remember the tabstat link, maybe we should
optimize for the case where it does remember.

The worst-case consequence AFAICS would be multiple tabstat entries for
the same relation, which seems pretty noncritical anyway.

Thoughts?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-11-05 02:59:10
Message-ID: 10754.1288925950@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

[ for the archives' sake ]

I wrote:
> I had a thought about how to make get_tabstat_entry() faster without
> adding overhead: what if we just plain remove the search, and always
> assume that a new entry has to be added to the tabstat array?

I spent some time looking into this idea. It doesn't really work,
because there are places that will break if a transaction has more than
one tabstat entry for the same relation. The one that seems most
difficult to fix is that pgstat_recv_tabstat() clamps its n_live_tuples
and n_dead_tuples values to be nonnegative after adding in each delta
received from a backend. That is a good idea because it prevents insane
results if some messages get lost --- but if a transaction's updates get
randomly spread into several tabstat items, the intermediate counts
might get clamped, resulting in a wrong final answer even though nothing
was lost.

I also added some instrumentation printouts and found that in our
regression tests:
* about 10% of get_tabstat_entry() calls find an existing entry
for the relation OID. This seems to happen only when a
relcache entry gets flushed mid-transaction, but that does
happen, and not so infrequently either.
* about half of the transactions use as many as 20 tabstats,
and 10% use 50 or more; but it drops off fast after that.
Almost no transactions use as many as 100 tabstats.
It's not clear that these numbers are representative of typical
database applications, but they're something to start with anyway.

I also did some testing to compare the cost of get_tabstat_entry's
linear search against a dynahash.c table with OID key. As I suspected,
a hash table would make this code a *lot* slower for small numbers of
tabstat entries: about a factor of 10 slower. You need upwards of 100
tabstats touched in a transaction before the hash table begins to pay
for itself. This is largely because dynahash doesn't have any cheap way
to reset a hashtable to empty, so you have to initialize and destroy the
table for each transaction. By the time you've eaten that overhead,
you've already expended as many cycles as the linear search takes to
handle several dozen entries.

I conclude that if we wanted to do something about this, the most
practical solution would be the one of executing linear searches until
we get to 100+ tabstat entries in a transaction, and then building a
hashtable for subsequent searches. However, it's exceedingly unclear
that it will ever be worth the effort or code space to do that.
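
(Also for the archives, a self-contained sketch of that hybrid, with invented
names rather than the backend's pgstat code: scan the array linearly while it
is small, and build a hash index over it once it passes a threshold. A real
version would also have to keep entry pointers stable as the array grows,
which this toy ignores, and error handling is minimal.)

#include <stdlib.h>
#include <string.h>

#define HASH_THRESHOLD 100
#define NBUCKETS       256          /* power of two */

typedef struct TabStat
{
    unsigned int relid;
    long        changes;            /* ... more counters would go here ... */
    int         hashnext;           /* next entry index in this bucket, or -1 */
} TabStat;

typedef struct TabStatArray
{
    TabStat    *entries;
    int         nentries;
    int         maxentries;
    int        *buckets;            /* NULL until the hash index is built */
} TabStatArray;

static void
index_entry(TabStatArray *a, int i)
{
    int         b = a->entries[i].relid & (NBUCKETS - 1);

    a->entries[i].hashnext = a->buckets[b];
    a->buckets[b] = i;
}

static TabStat *
lookup_or_add_tabstat(TabStatArray *a, unsigned int relid)
{
    int         i;

    if (a->buckets == NULL)
    {
        /* Small transaction: a linear scan beats any hash setup cost. */
        for (i = 0; i < a->nentries; i++)
            if (a->entries[i].relid == relid)
                return &a->entries[i];

        if (a->nentries >= HASH_THRESHOLD)
        {
            /* The array got big: build the hash index over what we have. */
            a->buckets = malloc(NBUCKETS * sizeof(int));
            if (a->buckets != NULL)
            {
                for (i = 0; i < NBUCKETS; i++)
                    a->buckets[i] = -1;
                for (i = 0; i < a->nentries; i++)
                    index_entry(a, i);
            }
        }
    }
    else
    {
        /* Large transaction: probe the hash index instead. */
        for (i = a->buckets[relid & (NBUCKETS - 1)]; i >= 0;
             i = a->entries[i].hashnext)
            if (a->entries[i].relid == relid)
                return &a->entries[i];
    }

    /* Not found: append a new, zeroed entry. */
    if (a->nentries >= a->maxentries)
    {
        a->maxentries = (a->maxentries > 0) ? a->maxentries * 2 : 16;
        a->entries = realloc(a->entries, a->maxentries * sizeof(TabStat));
    }
    i = a->nentries++;
    memset(&a->entries[i], 0, sizeof(TabStat));
    a->entries[i].relid = relid;
    if (a->buckets != NULL)
        index_entry(a, i);
    return &a->entries[i];
}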

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-11-13 03:47:02
Message-ID: 201011130347.oAD3l2X16785@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Leonardo Francalanci <m_lists(at)yahoo(dot)it> writes:
> >> Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
> >> per child table, only one of which was relevant to the query. In your
> >> test case there are no irrelevant indexes, which is why the runtime
> >> didn't change.
>
> > Mmh... I must be doing something wrong. It looks to me it's not just
> > the irrelevant indexes: it's the "order by" that counts.
>
> Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or
> any mergejoinable join clauses, then the possibly_useful_pathkeys test
> in find_usable_indexes figures out that we aren't interested in the sort
> ordering of *any* indexes, so the whole thing gets short-circuited.
> You need at least the possibility of interest in sorted output from an
> indexscan before any of this code runs.

FYI, I always wondered if the rare use of mergejoins justified the extra
planning time of carrying around all those joinpaths.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-11-13 03:55:54
Message-ID: 9022.1289620554@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> FYI, I always wondered if the rare use of mergejoins justified the extra
> planning time of carrying around all those joinpaths.

They're hardly rare.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan time of MASSIVE partitioning ...
Date: 2010-11-13 22:26:34
Message-ID: AANLkTimu6qcEyZQE6YHF6-1Hgg_t_wbUf6AxjpuZe0qF@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 12, 2010 at 10:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> FYI, I always wondered if the rare use of mergejoins justified the extra
>> planning time of carrying around all those joinpaths.
>
> They're hardly rare.

They're fairly rare in the sorts of queries I normally issue, but I'd
quibble with the statement on other grounds: IME, we generate far more
nestloop paths than anything else. The comment in
match_unsorted_outer() says it all:

* We always generate a nestloop path for each available outer path.
* In fact we may generate as many as five: one on the cheapest-total-cost
* inner path, one on the same with materialization, one on the
* cheapest-startup-cost inner path (if different), one on the
* cheapest-total inner-indexscan path (if any), and one on the
* cheapest-startup inner-indexscan path (if different).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company