Re: Path question

Lists: pgsql-hackers
From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Hans-Juergen Schoenig <hs(at)cybertec(dot)at>
Subject: Path question
Date: 2010-09-01 13:57:12
Message-ID: 4C7E5BB8.40502@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

we are experimenting with modifying table partitioning
so the ORDER BY clause can be pushed down to
child nodes on the grounds that:
1. smaller datasets are faster to sort, e.g. two datasets that almost
spill out to disk are faster to sort in memory and later merge them
than the union set that spills out to disk, or two larger sets
that spill out to disk are faster to sort individually than the
union dataset (because of the longer seeks, etc)
2. individual child nodes can have indexes that produce
the sorted output already

Currently I am abusing the AppendPath node but later I will add a new
executor node that will merge the sorted input coming from the children.

I added the pathkey to the AppendPath to reflect that it produces
sorted output and I am adding the Sort plan to the children.

My problem is:

zozo=# explain select * from t1 where d = '2008-01-01' order by d;
QUERY
PLAN
---------------------------------------------------------------------------------------------------
Result (cost=8.28..33.13 rows=4 width=40)
-> Append (cost=8.28..33.13 rows=4 width=40)
-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_d_key on t1 (cost=0.00..8.27
rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)
-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_2008_d_key on t1_2008 t1
(cost=0.00..8.27 rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)
-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_2009_d_key on t1_2009 t1
(cost=0.00..8.27 rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)
-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_2010_d_key on t1_2010 t1
(cost=0.00..8.27 rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)
(18 rows)

In one leaf, e.g.:

-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_2010_d_key on t1_2010 t1
(cost=0.00..8.27 rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)

The plan is scanning the t_2010 child table, but the Sort is trying to
sort by the fully qualified parent table. I think this is a problem
but I don't know how to solve it. I have tried transforming the
parent query with

adjust_appendrel_attrs((Node *) parse, appinfo)

where parse is

Query *parse = root->parse;

in set_append_rel_pathlist() and the transformed query trees are
used for the children with

make_sort_from_sortclauses(root, query->sortClause, subplan)

in create_append_plan(). adjust_appendrel_attrs() seems to be prepared
to be called with a Query * , so I don't know why the above leaf plan
doesn't show "Sort Key: public.t1_2010.d" and so on.

Can someone help me?

Best regards,
Zoltán Böszötményi


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Hans-Juergen Schoenig <hs(at)cybertec(dot)at>
Subject: Re: Path question
Date: 2010-09-01 14:10:21
Message-ID: 23862.1283350221@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> we are experimenting with modifying table partitioning
> so the ORDER BY clause can be pushed down to
> child nodes on the grounds that:

This is really premature, and anything you do along those lines now will
probably never get committed. The problem is that the transformation
you propose is wrong unless the planner can prove that the different
child tables contain nonoverlapping ranges of the sort key. Now you
might be intending to add logic to try to prove that from inspection of
constraints, but I don't believe that reverse-engineering such knowledge
on the fly is a sane approach: it will be hugely expensive and will add
that cost even in many situations where the optimization fails to apply.

The project direction is that we are going to add some explicit
representation of partitioned tables. After that, the planner can just
know immediately that a range-partitioned sort key is amenable to this
treatment, and at that point it'll make sense to work on it.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Hans-Juergen Schoenig <hs(at)cybertec(dot)at>
Subject: Re: Path question
Date: 2010-09-01 14:21:25
Message-ID: AANLkTi=t9rC9Pj4ywBEkpgU2rJZ_-cKfA0zvaELpDi9r@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/9/1 Boszormenyi Zoltan <zb(at)cybertec(dot)at>:
> we are experimenting with modifying table partitioning
> so the ORDER BY clause can be pushed down to
> child nodes on the grounds that:
> 1. smaller datasets are faster to sort, e.g. two datasets that almost
>    spill out to disk are faster to sort in memory and later merge them
>    than the union set that spills out to disk, or two larger sets
>    that spill out to disk are faster to sort individually than the
>    union dataset (because of the longer seeks, etc)

For what it's worth this logic doesn't necessarily hold. Sorting two
data sets in memory is only faster if you only need one result set at
a time. If you know the first set contains only records which come
before the second then you can do this but if not then you'll need all
the result sets to be available at the same time which will require
spilling them to disk. If you have to do that then you might as well
do a tapesort anyways.

> 2. individual child nodes can have indexes that produce
>    the sorted output already

This is the big win. Currently Append nodes mean throwing out any
ordering from the child nodes and that's important information to be
losing.

> Currently I am abusing the AppendPath node but later I will add a new
> executor node that will merge the sorted input coming from the children.

You realize I already did this a few years ago. Search for "merge
append" on the lists. The hard part -- as usual -- ends up being
figuring out in the planner when to do this optimization and how to
avoid exponential growth in the number of plans considered and the
amount of data retained about each plan.

For what it's worth I disagree with Tom. I think this is a situation
where we need *both* types of solution. Ideally we will be able to use
a plain Append node for cases where we know the relative ordering of
the data in different partitions, but there will always be cases where
the structured partition data doesn't actually match up with the
ordering requested and we'll need to fall back to a merge-append node.

--
greg


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-01 14:28:06
Message-ID: EEB9A3DD-86F4-4299-B94C-BDAEEAEE564A@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sep 1, 2010, at 4:10 PM, Tom Lane wrote:

> Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
>> we are experimenting with modifying table partitioning
>> so the ORDER BY clause can be pushed down to
>> child nodes on the grounds that:
>
> This is really premature, and anything you do along those lines now will
> probably never get committed. The problem is that the transformation
> you propose is wrong unless the planner can prove that the different
> child tables contain nonoverlapping ranges of the sort key. Now you
> might be intending to add logic to try to prove that from inspection of
> constraints, but I don't believe that reverse-engineering such knowledge
> on the fly is a sane approach: it will be hugely expensive and will add
> that cost even in many situations where the optimization fails to apply.
>

well, why non-overlapping? the idea is to make append smart enough to take the sorted lists from below and merge them which will give sorted output as well.
my original idea was what you described but given Martijn van Oosterhout's posting we were pretty confident that we can get along without non-overlapping partitions.

> The project direction is that we are going to add some explicit
> representation of partitioned tables. After that, the planner can just
> know immediately that a range-partitioned sort key is amenable to this
> treatment, and at that point it'll make sense to work on it.
>

can you outline some ideas here and maybe point to some useful discussion here?

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-01 15:00:12
Message-ID: 24742.1283353212@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

=?iso-8859-1?Q?PostgreSQL_-_Hans-J=FCrgen_Sch=F6nig?= <postgres(at)cybertec(dot)at> writes:
> On Sep 1, 2010, at 4:10 PM, Tom Lane wrote:
>> This is really premature, and anything you do along those lines now will
>> probably never get committed.

> well, why non-overlapping? the idea is to make append smart enough to
> take the sorted lists from below and merge them which will give sorted
> output as well.

Well, an extra merge step is going to change the cost comparisons quite
a bit; see Greg Starks' comments. But in any case, my point wasn't that
this is something we should never do; it was that it makes more sense to
wait till something has happened with explicit partitioning.

>> The project direction is that we are going to add some explicit
>> representation of partitioned tables.

> can you outline some ideas here and maybe point to some useful discussion here?

There's been boatloads of discussion of partitioning, and at least one
submitted patch, over the past year or so ...

regards, tom lane


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-01 15:26:15
Message-ID: 620D5B0A-EEFD-412E-B5C2-111568548405@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

hello tom,

yeah, we have followed quite a lot of discussion as well ...
and yes, no patches.

as far as this problem is concerned: we are working on a patch and did some prototyping inside the planner already (attached).
the code we have is pretty limited atm (such as checking for a sort clause explicitly and so on - it has no support for windowing related optimizations and so on so far).

the cost model is not our problem - it is a lot easier to read than the code we are fighting with (it is a different level of complexity). i think costs can be handled.

yes, this merging adds some costs for sure. we might see a hell amount of operators being called - but i think given a reasonable number of partitions it is still a lot cheaper than actually resorting ... and, it is a lot more space efficient.
in my practical case i cannot resort because we would simply run out of space. an index scan is expensive but needs no additional sort space ...
and, merge is O(n) which sort is clearly not.

advise is highly appreciated.

many thanks,

hans

Attachment Content-Type Size
push-down-sort-into-inh-2.patch application/octet-stream 7.2 KB
unknown_filename text/plain 1.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Hans-Juergen Schoenig <hs(at)cybertec(dot)at>
Subject: Re: Path question
Date: 2010-09-01 23:20:12
Message-ID: 7D02C7EC-93D2-4B09-9A0D-1655AF87179A@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sep 1, 2010, at 10:21 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> For what it's worth I disagree with Tom. I think this is a situation
> where we need *both* types of solution. Ideally we will be able to use
> a plain Append node for cases where we know the relative ordering of
> the data in different partitions, but there will always be cases where
> the structured partition data doesn't actually match up with the
> ordering requested and we'll need to fall back to a merge-append node.

I agree. Explicit partitioning may open up some additional optimization possibilities in certain cases, but Merge Append is more general and extremely valuable in its own right.

...Robert


From: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-03 13:16:58
Message-ID: BC159D3B-EAB2-4396-BFA6-A47058B9C10D@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:

> On Sep 1, 2010, at 10:21 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> For what it's worth I disagree with Tom. I think this is a situation
>> where we need *both* types of solution. Ideally we will be able to use
>> a plain Append node for cases where we know the relative ordering of
>> the data in different partitions, but there will always be cases where
>> the structured partition data doesn't actually match up with the
>> ordering requested and we'll need to fall back to a merge-append node.
>
> I agree. Explicit partitioning may open up some additional optimization possibilities in certain cases, but Merge Append is more general and extremely valuable in its own right.
>
> ...Robert
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

we have revised greg's wonderful work and ported the entire thing to head.
it solves the problem of merge_append. i did some testing earlier on today and it seems most important cases are working nicely.

here are some test cases:

test=# \d t_data
Table "public.t_data"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
tstamp | date |

test=# \d t_data_1
Table "public.t_data_1"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
tstamp | date |
Indexes:
"idx_1" btree (id)
Check constraints:
"t_data_1_id_check" CHECK (id >= 1 AND id <= 10000)
Inherits: t_data

test=# \d t_data_2
Table "public.t_data_2"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
tstamp | date |
Indexes:
"idx_2" btree (id)
Check constraints:
"t_data_2_id_check" CHECK (id >= 10001 AND id <= 20000)
Inherits: t_data

test=# \d t_data_3
Table "public.t_data_3"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
tstamp | date |
Indexes:
"idx_3" btree (id)
Check constraints:
"t_data_3_id_check" CHECK (id >= 20001 AND id <= 30000)
Inherits: t_data

simple windowing ...

test=# explain select *, max(id) OVER ( ORDER BY id) from t_data ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
WindowAgg (cost=149.99..2154.43 rows=32140 width=8)
-> Result (cost=149.99..1672.33 rows=32140 width=8)
-> Append (cost=149.99..1672.33 rows=32140 width=8)
-> Sort (cost=149.78..155.13 rows=2140 width=8)
Sort Key: public.t_data.id
-> Seq Scan on t_data (cost=0.00..31.40 rows=2140 width=8)
-> Index Scan using idx_1 on t_data_1 t_data (cost=0.00..318.25 rows=10000 width=8)
-> Index Scan using idx_2 on t_data_2 t_data (cost=0.00..318.25 rows=10000 width=8)
-> Index Scan using idx_3 on t_data_3 t_data (cost=0.00..318.25 rows=10000 width=8)
(9 rows)

it does a nice index scan; merges the stuff and puts it up into the high level doing the windowing.

test=# select *, max(id) OVER ( ORDER BY id) from t_data LIMIT 10;
id | tstamp | max
----+------------+-----
1 | 2010-01-01 | 1
2 | 2010-01-01 | 2
3 | 2010-01-01 | 3
4 | 2010-01-01 | 4
5 | 2010-01-01 | 5
6 | 2010-01-01 | 6
7 | 2010-01-01 | 7
8 | 2010-01-01 | 8
9 | 2010-01-01 | 9
10 | 2010-01-01 | 10
(10 rows)

the cost model does what it should as well:

test=# explain select *, max(id) OVER ( ORDER BY id) from t_data ;
QUERY PLAN
---------------------------------------------------------------------------------------------
WindowAgg (cost=2872.41..3434.86 rows=32140 width=8)
-> Sort (cost=2872.41..2952.76 rows=32140 width=8)
Sort Key: public.t_data.id
-> Result (cost=0.00..466.40 rows=32140 width=8)
-> Append (cost=0.00..466.40 rows=32140 width=8)
-> Seq Scan on t_data (cost=0.00..31.40 rows=2140 width=8)
-> Seq Scan on t_data_1 t_data (cost=0.00..145.00 rows=10000 width=8)
-> Seq Scan on t_data_2 t_data (cost=0.00..145.00 rows=10000 width=8)
-> Seq Scan on t_data_3 t_data (cost=0.00..145.00 rows=10000 width=8)
(9 rows)

it has proven to be really valuable in my first tests.
maybe this is helpful for some people out there.

many thanks,

hans

Attachment Content-Type Size
merge-append-91-v1.diff application/octet-stream 28.6 KB
unknown_filename text/plain 126 bytes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-21 02:57:00
Message-ID: AANLkTikmJ=BmiaA7PRriTUhbz1G-=vY6gkOyKNv0rRSK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/9/3 Hans-Jürgen Schönig <hs(at)cybertec(dot)at>:
> On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:
>> I agree. Explicit partitioning may open up some additional optimization possibilities in certain cases, but Merge Append is more general and extremely valuable in its own right.
>
> we have revised greg's wonderful work and ported the entire thing to head.
> it solves the problem of merge_append. i did some testing earlier on today and it seems most important cases are working nicely.

First, thanks for merging this up to HEAD. I took a look through this
patch tonight, and the previous reviews thereof that I was able to
find, most notably Tom's detailed review on 2009-07-26. I'm not sure
whether or not it's accidental that this didn't get added to the CF,
but here's an attempt to enumerate the things that seem like they need
to be fixed. The quotes labeled "TGL" are from the aforementioned
review by Tom.

1. The code in set_append_rel_pathlist() that accumulates the pathkeys
of all sub-paths is, as it says, and as previous discussed, O(n^2).
In a previous email on this topic, Tom suggested on possible approach
for this problem: choose the largest child relation and call it the
leader, and consider only the pathkeys for that relation rather than
all of them. I think it would be nice if we can find a way to be a
bit smarter, though, because that won't necessarily always find the
best path. One idea I had is to choose some arbitrary limit on how
long the all_pathkeys list is allowed to become and iterate over the
children from largest to smallest, stopping early if you hit that
limit. But thinking about it a little more, can't we just adjust the
way we do this so that it's not O(n^2)? It seems we're only concerned
with equality here, so what about using a hash table? We could hash
the PathKey pointers in each list, but not the lists or listcells
obviously.

2. TGL: "you need an explicit flag to say 'we'll do a merge', not just
rely on whether the path has pathkeys." This makes sense and doesn't
seem difficult.

3. TGL: "Speaking of sorting, it's not entirely clear to me how the
patch ensures that all the child plans produce the necessary sort keys
as output columns, and especially not how it ensures that they all get
produced in the *same* output columns. This might accidentally manage
to work because of the "throwaway" call to make_sort_from_pathkeys(),
but at the very least that's a misleading comment." I'm not sure what
needs to be done about this; I'm going to look at this further.

4. TGL: "In any case, I'm amazed that it's not failing regression
tests all over the place with those critical tests in
make_sort_from_pathkeys lobotomized by random #ifdef FIXMEs. Perhaps
we need some more regression tests...". Obviously, we need to remove
that lobotomy and insert the correct fix for whatever problem it was
trying to solve. Adding some regression tests seems wise, too.

5. TGL: "In the same vein, the hack to 'short circuit' the append
stuff for a single child node is simply wrong, because it doesn't
allow for column number variances. Please remove it." This seems
like straightforward cleanup, and maybe another candidate for a
regression test. (Actually, I notice that the patch has NO regression
tests at all, which surely can't be right for something of this type,
though admittedly since we didn't have EXPLAIN (COSTS OFF) when this
was first written it might have been difficult to write anything
meaningful at the time.)

6. The dummy call to cost_sort() seems like a crock; what that
function does doesn't seem particularly relevant to the cost of the
merge operation. Just off the top of my head, it looks like the cost
of the merge step will be roughly O(lg n) * the cost of comparing two
tuples * the total number of tuples from all child paths. In practice
it might be less, because once some of the paths run out of tuples the
number of comparisons will drop, I think. But the magnitude of that
effect seems difficult to predict, and may be rather small, so perhaps
we should just ignore it.

7. I think there's some basic code cleanup needed here, also: comment
formatting, variable naming, etc.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: David Fetter <david(at)fetter(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-21 04:29:17
Message-ID: 20100921042917.GA5363@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 20, 2010 at 10:57:00PM -0400, Robert Haas wrote:
> 2010/9/3 Hans-Jürgen Schönig <hs(at)cybertec(dot)at>:
> > On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:
> >> I agree. Explicit partitioning may open up some additional
> >> optimization possibilities in certain cases, but Merge Append is
> >> more general and extremely valuable in its own right.
> >
> > we have revised greg's wonderful work and ported the entire thing
> > to head. it solves the problem of merge_append. i did some
> > testing earlier on today and it seems most important cases are
> > working nicely.
>
> First, thanks for merging this up to HEAD. I took a look through
> this patch tonight, and the previous reviews thereof that I was able
> to find, most notably Tom's detailed review on 2009-07-26. I'm not
> sure whether or not it's accidental that this didn't get added to
> the CF,

It's because I missed putting it in, and oversight I've corrected. If
we need to bounce it on to the next one, them's the breaks.

> [points elided]
>
> 7. I think there's some basic code cleanup needed here, also: comment
> formatting, variable naming, etc.

Hans-Jürgen,

Will you be able to get to this in the next couple of days?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-23 13:29:45
Message-ID: AANLkTin5ejGgZMae208sYDqdkRABWqRdzk0389OjfkLR@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 21, 2010 at 12:29 AM, David Fetter <david(at)fetter(dot)org> wrote:
> On Mon, Sep 20, 2010 at 10:57:00PM -0400, Robert Haas wrote:
>> 2010/9/3 Hans-Jürgen Schönig <hs(at)cybertec(dot)at>:
>> > On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:
>> >> I agree. Explicit partitioning may open up some additional
>> >> optimization possibilities in certain cases, but Merge Append is
>> >> more general and extremely valuable in its own right.
>> >
>> > we have revised greg's wonderful work and ported the entire thing
>> > to head.  it solves the problem of merge_append. i did some
>> > testing earlier on today and it seems most important cases are
>> > working nicely.
>>
>> First, thanks for merging this up to HEAD.  I took a look through
>> this patch tonight, and the previous reviews thereof that I was able
>> to find, most notably Tom's detailed review on 2009-07-26.  I'm not
>> sure whether or not it's accidental that this didn't get added to
>> the CF,
>
> It's because I missed putting it in, and oversight I've corrected.  If
> we need to bounce it on to the next one, them's the breaks.
>
>> [points elided]
>>
>> 7. I think there's some basic code cleanup needed here, also: comment
>> formatting, variable naming, etc.
>
> Hans-Jürgen,
>
> Will you be able to get to this in the next couple of days?

I don't see a response to this which I assume means "no" - I'm going
to take a crack at fixing some of these issues.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: David Fetter <david(at)fetter(dot)org>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-23 16:59:35
Message-ID: 3C4FE552-854D-424F-BFE5-2BFCA86D9279@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sep 23, 2010, at 3:29 PM, Robert Haas wrote:

> On Tue, Sep 21, 2010 at 12:29 AM, David Fetter <david(at)fetter(dot)org> wrote:
>> On Mon, Sep 20, 2010 at 10:57:00PM -0400, Robert Haas wrote:
>>> 2010/9/3 Hans-Jürgen Schönig <hs(at)cybertec(dot)at>:
>>>> On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:
>>>>> I agree. Explicit partitioning may open up some additional
>>>>> optimization possibilities in certain cases, but Merge Append is
>>>>> more general and extremely valuable in its own right.
>>>>
>>>> we have revised greg's wonderful work and ported the entire thing
>>>> to head. it solves the problem of merge_append. i did some
>>>> testing earlier on today and it seems most important cases are
>>>> working nicely.
>>>
>>> First, thanks for merging this up to HEAD. I took a look through
>>> this patch tonight, and the previous reviews thereof that I was able
>>> to find, most notably Tom's detailed review on 2009-07-26. I'm not
>>> sure whether or not it's accidental that this didn't get added to
>>> the CF,
>>
>> It's because I missed putting it in, and oversight I've corrected. If
>> we need to bounce it on to the next one, them's the breaks.
>>
>>> [points elided]
>>>
>>> 7. I think there's some basic code cleanup needed here, also: comment
>>> formatting, variable naming, etc.
>>
>> Hans-Jürgen,
>>
>> Will you be able to get to this in the next couple of days?
>
> I don't see a response to this which I assume means "no" - I'm going
> to take a crack at fixing some of these issues.

hello ...

sorry for not getting back to you sooner. i am currently on the road for some days.
we got the top 3 things fixed already. however, some code seems to be relying on a sorted list somewhere(???).
we are in the process of sorting out most of the stuff.
i guess we will have something done next week.

sorry for the delay.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>
Cc: David Fetter <david(at)fetter(dot)org>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-23 17:19:58
Message-ID: AANLkTina+09EkSP-_N=eUvtM2PG1P10OQFqN73Gr6qPM@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/9/23 Hans-Jürgen Schönig <hs(at)cybertec(dot)at>:
> sorry for not getting back to you sooner. i am currently on the road for some days.
> we got the top 3 things fixed already. however, some code seems to be relying on a sorted list somewhere(???).
> we are in the process of sorting out most of the stuff.
> i guess we will have something done next week.

Oh, cool. Is it possible for you to post your WIP any sooner?

I've been looking at #4 today. Further details to follow after I
finish beating my head against a wall.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-23 21:34:02
Message-ID: AANLkTikQZsEiWmrFTzEkv7tnn2h5GTp6A0ybU8Gf2Xfk@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/9/20 Robert Haas <robertmhaas(at)gmail(dot)com>:
> 4. TGL: "In any case, I'm amazed that it's not failing regression
> tests all over the place with those critical tests in
> make_sort_from_pathkeys lobotomized by random #ifdef FIXMEs.  Perhaps
> we need some more regression tests...".  Obviously, we need to remove
> that lobotomy and insert the correct fix for whatever problem it was
> trying to solve.  Adding some regression tests seems wise, too.

So, I spent a lot of time today trying to understand why these random
FIXMEs were inserted and just exactly what they break or, uh, fix. I
didn't get very far. The regression tests all pass with the latest
version of Merge Append, with and without these FIXMEs, and in fact
when I extracted just those changes and applied them to HEAD, the
regression tests still pass. So I said to myself: "Let's come up with
a regression test that illustrates why these changes are Evil and
Dangerous, and then we can add that to the regression test suite, per
Tom's suggestion." This proved to be easier said than done:
apparently our code is so good that it works even with random sections
commented out. Go us.

As a debugging aide, instead of actually diking these tests out, I
made them elog(LOG). This has the same effect as removing them
entirely but it makes it possible to see whether you've triggered the
relevant code path. Then I tried to trigger those code paths, which I
shall hereinafter refer to as FIXME #1, FIXME #2, FIXME #3, and FIXME
#4, as per the attached patch. It proved to be pretty easy to trigger
FIXME #3; indeed, just about every query I tried did the job.

CREATE TABLE parent (a integer NOT NULL, b integer NOT NULL);
CREATE TABLE child (b integer, a integer, c text NOT NULL) INHERITS (parent);
CREATE TABLE joinme (j integer NOT NULL);

I populated joinme with a single row with the value 1, and child with
a million rows with a running from 1 to a million, b running from 10
million down to 9 million and 1, and c getting random()::text ||
random()::text || random()::text || random()::text. Then you can
trigger FIXME #3 with either of these two (the second also triggers
FIXME #4):

select * from joinme, parent where a = j and j = 1;
select * from parent order by a;

I believe that the first case fires because of the "ec_has_const"
condition and the second from the "EC only has at most one member
condition". However, it's difficult to see what problem this actually
causes. From what I gather from reading the code, em_is_child
EquivalenceMembers are only supposed to affect the construction of
inner index-scan paths; and certainly if there's at most one
EquivalenceMember it's unclear to me how you'd be forming a join.
Maybe it's possible to break something in the ec_has_const case, but I
haven't been to puzzle out what that thing is.

FIXME #1 and FIXME #2 were much harder to trigger. In fact, barring
significant further lobotimization of the code, I couldn't. For FIXME
#1, the right sort of query seems to be something like this:

select 1 from joinme left join parent on a = j where j = 1 order by a;

...but the problem is that in every example I could construct, the
em_is_child EquivalenceMembers were *after* the members for the parent
rel, so you never get far enough in the loop to see them. For related
reasons, I couldn't get FIXME #2 to fire, either: in order to have a
chance of firing FIXME #2, you have to get all the way through the
loop where FIXME #1 is located without finding a match. I can't
believe all of this code is here just for fun, but in every example I
could come up with you quickly find a match in the first loop, and
never even finish visiting all the members of that list, let alone
reach the second loop. Somehow, you need to construct a case where
the values to be sorted aren't directly emitted by the node
immediately under the sort, but I couldn't figure out how to do that -
no matter how I rearranged things, the planner just computed the
necessary outputs one level down. Anyone have an idea?

I did find one apparent weirdness in the way explain outputs sort keys, namely:

rhaas=# explain (costs off) select 2 as x union all select 1 as x order by x;
QUERY PLAN
--------------------
Sort
Sort Key: (2)
-> Append
-> Result
-> Result
(5 rows)

This seems to have nothing to do with the problem at hand, but it
looks pretty weird. The planner is in fact sorting on the column, not
on the literal value 2, so this is (I think) just a display error.
"x" would be a more reasonable representation of the sort key.

All of this leaves me wondering why Greg ended up ifdefing out this
code in the first place. There's probably something I'm missing
here... but for now I can't think of a better idea than just removing
the #ifdefs and hoping that whatever problem they were causing was
limited to an earlier version of the code that no longer exists.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
planner_lobotomy.patch application/octet-stream 2.0 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-24 21:05:01
Message-ID: 27256.1285362301@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> FIXME #1 and FIXME #2 were much harder to trigger. In fact, barring
> significant further lobotimization of the code, I couldn't.

It's not that hard if you just tweak equivclass.c to make the order of
equivalence-class lists different, viz

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index a20ed5f..9528d0b 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -353,7 +353,7 @@ add_eq_member(EquivalenceClass *ec, Expr *expr, Relids relids,
{
ec->ec_relids = bms_add_members(ec->ec_relids, relids);
}
- ec->ec_members = lappend(ec->ec_members, em);
+ ec->ec_members = lcons(em, ec->ec_members);

return em;
}

Then for instance:

regression=# create table t1 (f1 int);
CREATE TABLE
regression=# create table t2 () inherits (t1);
CREATE TABLE
regression=# explain select * from t1 a join t1 b using (f1);
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1

Since the order of equivalence-class member lists isn't supposed to be
semantically significant, I claim that the code in createplan has to
be able to deal with this.

Note that what this is triggering is the em_is_child condition. I think
it may indeed be impossible to get a hit on the em_is_const case as the
system currently stands; the reason being that an EC containing a const
won't normally show up as a pathkey. It can only do so if it's
below_outer_join, as the comment notes. Now the calls to
make_sort_from_pathkeys in createplan.c are only used for constructing
subsidiary sorts for a mergejoin, and we won't consider building a
mergejoin with an EC that contains a const (see
eclass_useful_for_merging). There are some calls in planner.c that are
associated with doing a final sort or distinct, but I suspect they'd
never be applied with a below_outer_join EC. So given the current usage
of make_sort_from_pathkeys it might be pretty hard to get it applied to
an EC containing a constant. That's not a reason for it not to handle
the case, though.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-25 02:23:07
Message-ID: AANLkTimL+WYk_5mOsqJCoU+4vO7B7ffM7anGXcVQf8pN@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 24, 2010 at 5:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> It's not that hard if you just tweak equivclass.c to make the order of
> equivalence-class lists different, viz
[...]
> Since the order of equivalence-class member lists isn't supposed to be
> semantically significant, I claim that the code in createplan has to
> be able to deal with this.
[...]
> That's not a reason for it not to handle
> the case, though.

I'm not disputing that those tests are correct. All I'm saying is
that I can't figure out a way to write regression tests that fail if
they are removed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-29 01:27:43
Message-ID: AANLkTi=K1dNsiext1HHUAaQG9T43fiww1Th4RuRYZ+5p@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/9/23 Robert Haas <robertmhaas(at)gmail(dot)com>:
> All of this leaves me wondering why Greg ended up ifdefing out this
> code in the first place.  There's probably something I'm missing
> here...  but for now I can't think of a better idea than just removing
> the #ifdefs and hoping that whatever problem they were causing was
> limited to an earlier version of the code that no longer exists.

...and FAIL. I missed the blindingly obvious here, which is that
without these tests commented out, while it passes regression tests,
the merge append stuff stops working. I think for some reason without
this stuff in there the appropriate index paths fail to get generated
for the child rels.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-29 02:14:17
Message-ID: AANLkTimTrB6hSCjVoQ_04aFDiO0VG=WPGntzd7ea5TL7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/9/28 Robert Haas <robertmhaas(at)gmail(dot)com>:
> 2010/9/23 Robert Haas <robertmhaas(at)gmail(dot)com>:
>> All of this leaves me wondering why Greg ended up ifdefing out this
>> code in the first place.  There's probably something I'm missing
>> here...  but for now I can't think of a better idea than just removing
>> the #ifdefs and hoping that whatever problem they were causing was
>> limited to an earlier version of the code that no longer exists.
>
> ...and FAIL.  I missed the blindingly obvious here, which is that
> without these tests commented out, while it passes regression tests,
> the merge append stuff stops working.  I think for some reason without
> this stuff in there the appropriate index paths fail to get generated
> for the child rels.

Yep, that's the problem, all right. find_usable_indexes() calls
build_index_pathkeys() to generate pathkeys for each available index
on the child relation, and then calls truncate_useless_pathkeys() to
get rid of any that aren't useful. In a query like this:

explain select * from ma_parent order by name limit 10;

...what makes the pathkeys potentially useful is that they match the
root pathkeys for the query. However, without Greg's hacks, the
transposed child equivalence classes don't show up in the equivalence
member lists, so get_eclass_for_sort_expr() generates new equivalence
classes for them. Subsequently, when truncate_useless_pathkeys() is
called, those new equivalence classes are found not to be equal to the
overall ordering desired for the query, so it truncates them away.

I'm not presently sure quite what the best way to fix this is... any ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-29 03:13:11
Message-ID: 4915.1285729991@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> ...what makes the pathkeys potentially useful is that they match the
> root pathkeys for the query. However, without Greg's hacks, the
> transposed child equivalence classes don't show up in the equivalence
> member lists, so get_eclass_for_sort_expr() generates new equivalence
> classes for them. Subsequently, when truncate_useless_pathkeys() is
> called, those new equivalence classes are found not to be equal to the
> overall ordering desired for the query, so it truncates them away.

> I'm not presently sure quite what the best way to fix this is... any ideas?

Probably what's needed is to extend the "child eclass member" stuff to
cover sort-key eclasses. Per relation.h:

* em_is_child signifies that this element was built by transposing a member
* for an inheritance parent relation to represent the corresponding expression
* on an inheritance child. The element should be ignored for all purposes
* except constructing inner-indexscan paths for the child relation.

Likely these need to be ignored a bit less. A couple of Greg's
undocumented hacks seem to be related to this point, but not all of
them. In any case you'd need some careful thought about when to
ignore child EMs and when not.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-29 15:01:35
Message-ID: AANLkTikGXLquVY8HKU27Vb7v8OxboA66+A+bf==3NrrO@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 28, 2010 at 11:13 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> ...what makes the pathkeys potentially useful is that they match the
>> root pathkeys for the query.  However, without Greg's hacks, the
>> transposed child equivalence classes don't show up in the equivalence
>> member lists, so get_eclass_for_sort_expr() generates new equivalence
>> classes for them.  Subsequently, when truncate_useless_pathkeys() is
>> called, those new equivalence classes are found not to be equal to the
>> overall ordering desired for the query, so it truncates them away.
>
>> I'm not presently sure quite what the best way to fix this is... any ideas?
>
> Probably what's needed is to extend the "child eclass member" stuff to
> cover sort-key eclasses.  Per relation.h:
>
>  * em_is_child signifies that this element was built by transposing a member
>  * for an inheritance parent relation to represent the corresponding expression
>  * on an inheritance child.  The element should be ignored for all purposes
>  * except constructing inner-indexscan paths for the child relation.
>
> Likely these need to be ignored a bit less.  A couple of Greg's
> undocumented hacks seem to be related to this point, but not all of
> them.  In any case you'd need some careful thought about when to
> ignore child EMs and when not.

Makes sense to me. It seems that there are actually two halves to
this problem: getting the child EMs to be generated in the first
place, and then getting them to be examined at the appropriate time.

The FIXME in set_append_rel_pathlist() is there to ensure that
add_child_rel_equivalences() always gets called, and the hack in
add_child_rel_equivalences() is there to ensure that child EMs are
generated unconditionally rather than only when they're useful for
merging. That doesn't seem entirely off-base. In
set_append_rel_pathlist(), I think that we shouldn't set
has_eclass_joins on the child relation unless it's set on the parent,
but calling add_child_rel_equivalences() unconditionally might be
reasonable. Similarly, in add_child_rel_equivalences(), I think that
the cur_ec->ec_has_const test should be retained, but the test on
list_length(cur_ec->ec_members) needs to be relaxed or eliminated. As
a matter of correctness, it seems like it should be OK to do this
always - the worst thing that can happen is you end up with some extra
EMs that don't (or shouldn't) do anything. However, it's possible
that, for performance, we might want to avoid generating EMs that
won't actually be needed. I'm not entirely clear on whether there is
an inexpensive way for us to know that. Perhaps we could replace the
list_length() test with a call to has_useful_pathkeys() and just
accept that there will be some false positives.

The two FIXMEs in make_sort_from_pathkeys() are there to ensure that
we find the child EMs when we go to generate a sort. There's no
reason that I can see to remove the em_has_const test, but it looks
like we need to remove the em_is_child test. We need to match each
pathkey to an element of the child's target list, which means we must
consider an EM that appears in the child's target list, which is means
a child EM or nothing. Testing reveals that if you keep Greg's hacks
to add the child EMs but remove the hacks from
make_sort_from_pathkeys(), it still works IF every child can use an
index scan, but if you need to do an index scan on some children and
sort others then you get:

ERROR: could not find pathkey item to sort

...which seems consistent with the above analysis. If we accept that
it's necessary, that still leaves the question of whether it's safe.
I'm a little less certain on this point, but it seems like it ought to
be OK. As far as I'm aware, we currently never sort append children.
Indeed, unless equivalence classes for those append children were
getting created by some mechanism other than
add_child_rel_equivalences(), it seems like it ought to fail with the
above error messages. And on the flip side it should be impossible
for us to pick up a child EM when we meant to get some other one
because they shouldn't be equal().

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-30 00:46:51
Message-ID: AANLkTinhiRcAmVL+rZQJb-JCOup8Sk2=eSkvs2M8ZP96@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/9/23 Robert Haas <robertmhaas(at)gmail(dot)com>:

> All of this leaves me wondering why Greg ended up ifdefing out this
> code in the first place.  There's probably something I'm missing
> here...  but for now I can't think of a better idea than just removing
> the #ifdefs and hoping that whatever problem they were causing was
> limited to an earlier version of the code that no longer exists.
>

Sorry, I haven't gone completely AWOL, I just hadn't noticed this
thread. pgsql-hackers turns out to be kind of a lot of traffic if
you're not reading it continuously.

As you subsequently discovered I added these FIXMEs because without
them the paths just didn't compare equal when it was comparing the
paths of the children with the paths of the parent.

I think the reason you had difficulty demonstrating problems with at
least some of the FIXMEs was that they really aren't functionally
necessary. They're there because when Tom implemented the equivalence
classes he had a clear idea of what they were supposed to represent
and what variables they needed to include to represent that. And
including variables of child relations or subqueries of a UNION in an
equivalence class representing the parent relation just didn't fit
within that. It doesn't necessarily cause problems but it changes the
data structure representation invariant from what he had in mind.

I don't have a good grasp of exactly what the implications of changing
this data structure rep invariant are though. Is having hundreds or
thousands of variables in a single equivalence class (actually for a
whole bunch if the pathkey list is long) going to cause performance
problems? Is including variables that are only present for one child
of the relation going to limit the usefulness of the equivalence class
data structure for solving other problems where those columns really
aren't equivalent? Also, since I don't really understand what's going
on I wondered if there wasn't an obvious way to accomplish the same
thing perhaps more efficiently.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-30 02:30:19
Message-ID: AANLkTi=M8figmCEDOFCtFus2tVRLqnLSpFMqbgFY9Ti9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 29, 2010 at 11:01 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Makes sense to me.  It seems that there are actually two halves to
> this problem: getting the child EMs to be generated in the first
> place, and then getting them to be examined at the appropriate time.

So I tried out the logic described in this email and, with a few
modifications, it seemed to work. Updated patch attached, any review
appreciated. There are still a bunch of other things that need to be
fixed here, but I think this is OK as far as this particular issue is
concerned. I fixed a few other things:

- create_append_plan must call create_plan_recurse rather than
create_plan. This appears to be an error introduced by rebasing; the
previous version of the patch seg faults when attempting to execute a
plan wherein an inner index scan has been pushed down through an
append node
- remove the hack to short-circuit the append node altogether if
there's only one child
- some miscellaneous code cleanup (more is needed)

> 3. TGL: "Speaking of sorting, it's not entirely clear to me how the
> patch ensures that all the child plans produce the necessary sort keys
> as output columns, and especially not how it ensures that they all get
> produced in the *same* output columns. This might accidentally manage
> to work because of the "throwaway" call to make_sort_from_pathkeys(),
> but at the very least that's a misleading comment." I'm not sure what
> needs to be done about this; I'm going to look at this further.

I spent some time looking at this complaint, and I'm still not sure
what needs to be done about it. Experimentation reveals that the
neither the throwaway call to make_sort_from_pathkeys() nor any other
call to that function is creating the relevant target list entries.
It appears that they're getting created when set_append_rel_pathlist()
calls adjust_appendrel_attrs(). I'm not sure whether that's
sufficient or not. make_sort_from_pathkeys() could add any missing
entries later, but I'm not too sure the order would match if that
happened.

Another awkwardness of this patch is that it makes
create_append_path() and consequently set_dummy_rel_pathlist() take an
additional "root" argument. While there's nothing terribly
unreasonable about this on its face, it's only necessary so that
create_append_path() can call cost_sort(), which takes "root" but
doesn't actually use it. I'm not sure whether it's better to leave
this as-is or to remove the root argument from cost_sort().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
merge_append_2010_09_29.patch application/octet-stream 31.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-10-13 17:46:09
Message-ID: 29628.1286991969@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Another awkwardness of this patch is that it makes
> create_append_path() and consequently set_dummy_rel_pathlist() take an
> additional "root" argument. While there's nothing terribly
> unreasonable about this on its face, it's only necessary so that
> create_append_path() can call cost_sort(), which takes "root" but
> doesn't actually use it. I'm not sure whether it's better to leave
> this as-is or to remove the root argument from cost_sort().

Right offhand the cleanest answer to that seems to be to leave
create_append_path alone, and make a separate function named something
like create_ordered_append_path that handles the case where cost_sort
might be needed. I rather wonder if we don't want two separate
execution-time node types anyway, since what Append does seems
significantly different from Merge (and MergeAppend would be just a
misnomer).

I have to run off for a doctors appointment, will continue looking at
this patch when I get back.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-10-14 15:34:22
Message-ID: 19224.1287070462@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> I rather wonder if we don't want two separate
> execution-time node types anyway, since what Append does seems
> significantly different from Merge (and MergeAppend would be just a
> misnomer).

I've been working on this patch, and have gotten the executor side split
out as a new node type. That adds something like 600 lines of
boilerplate code in various places, but I think it's well worthwhile to
get rid of the spaghetti-code effect of retail checks to see which kind
of Append we're dealing with. (Greg didn't catch all the places that
needed to distinguish, anyway.)

I did run into a problem with my plan to call the new node type "Merge":
the planner is already using "MergePath" as the name for the Path struct
that is precursor to a MergeJoin. For the moment I'm calling the new
node type MergeAppend, but as mentioned I feel that that's a bit of a
misnomer.

The only good alternative that I can think of is to rename MergePath to
MergeJoinPath (and then for consistency probably also HashPath to
HashJoinPath and NestPath to NestLoopPath). While that wouldn't touch
a huge number of files, it seems to carry some risk of breaking pending
patches, and anyway those are names that go back to Berkeley days so
people are used to 'em.

Anybody have a strong feeling about what to call these things?
At the moment I'm leaning to sticking with MergeAppend, but if we
decide to rename it it'd probably be better to do so before committing.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-10-14 15:36:47
Message-ID: AANLkTimk8ThR+4BF9k0NygfJSPD+2_3e=RhiFhHhmiSb@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 14, 2010 at 11:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> I rather wonder if we don't want two separate
>> execution-time node types anyway, since what Append does seems
>> significantly different from Merge (and MergeAppend would be just a
>> misnomer).
>
> I've been working on this patch, and have gotten the executor side split
> out as a new node type.  That adds something like 600 lines of
> boilerplate code in various places, but I think it's well worthwhile to
> get rid of the spaghetti-code effect of retail checks to see which kind
> of Append we're dealing with.  (Greg didn't catch all the places that
> needed to distinguish, anyway.)
>
> I did run into a problem with my plan to call the new node type "Merge":
> the planner is already using "MergePath" as the name for the Path struct
> that is precursor to a MergeJoin.  For the moment I'm calling the new
> node type MergeAppend, but as mentioned I feel that that's a bit of a
> misnomer.
>
> The only good alternative that I can think of is to rename MergePath to
> MergeJoinPath (and then for consistency probably also HashPath to
> HashJoinPath and NestPath to NestLoopPath).  While that wouldn't touch
> a huge number of files, it seems to carry some risk of breaking pending
> patches, and anyway those are names that go back to Berkeley days so
> people are used to 'em.
>
> Anybody have a strong feeling about what to call these things?
> At the moment I'm leaning to sticking with MergeAppend, but if we
> decide to rename it it'd probably be better to do so before committing.

I don't like the idea of renaming the join nodes. Both the code churn
and the possibility of confusing long-time users seem undesirable.

Let's just stick with MergeAppend.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-10-14 15:42:53
Message-ID: 19409.1287070973@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Oct 14, 2010 at 11:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Anybody have a strong feeling about what to call these things?
>> At the moment I'm leaning to sticking with MergeAppend, but if we
>> decide to rename it it'd probably be better to do so before committing.

> I don't like the idea of renaming the join nodes. Both the code churn
> and the possibility of confusing long-time users seem undesirable.

Yeah, especially if MergePath would still be there but now meaning
something different.

The other possible line of attack is to call the new node type something
else than either Merge or MergeAppend. Robert and I batted around a few
thoughts off-list last night, but none of them seemed any better than
MergeAppend.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-10-14 19:11:16
Message-ID: AANLkTikjxPMNUQivWmEdk0cnRNBVx_ivGiptoDS526Q1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 14, 2010 at 8:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I did run into a problem with my plan to call the new node type "Merge":
> the planner is already using "MergePath" as the name for the Path struct
> that is precursor to a MergeJoin.  For the moment I'm calling the new
> node type MergeAppend, but as mentioned I feel that that's a bit of a
> misnomer.

On the plus side the resulting node does have a lot in common with
Append nodes and a lot of places that do something special with Append
nodes will have to do the same things with the new node, so having a
similar name might help people remember that when they're adding their
special code for Append.

At the time I went back and forth on whether to have a separate node.
I tried to do it and had the impression that there were a lot more
places that would need to treat the two similarly than places that
needed to treat the two differently. I'm curious to see how you do it
cleanly.

The only other name I batted around at the time was OrderedAppend
which only alters the other half of the name, so no real help.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-10-14 21:06:46
Message-ID: 4705.1287090406@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> So I tried out the logic described in this email and, with a few
> modifications, it seemed to work. Updated patch attached, any review
> appreciated.

Applied with revisions.

>> 3. TGL: "Speaking of sorting, it's not entirely clear to me how the
>> patch ensures that all the child plans produce the necessary sort keys
>> as output columns, and especially not how it ensures that they all get
>> produced in the *same* output columns. This might accidentally manage
>> to work because of the "throwaway" call to make_sort_from_pathkeys(),
>> but at the very least that's a misleading comment." I'm not sure what
>> needs to be done about this; I'm going to look at this further.

> I spent some time looking at this complaint, and I'm still not sure
> what needs to be done about it. Experimentation reveals that the
> neither the throwaway call to make_sort_from_pathkeys() nor any other
> call to that function is creating the relevant target list entries.

Yeah, this was in fact pretty broken. make_sort_from_pathkeys *does*
create the relevant tlist entries, but they were getting lost again
because they were attached to a Result node that went into the bit
bucket along with the "throwaway" Sort node. Much worse, it was hit or
miss whether the tlist entries would be present in the child plan nodes
--- as coded, this would be the case only for children that were
explicitly sorted to meet the merge's pathkey requirements. If they
weren't there, any sorting on resjunk values would misbehave. I was
able to develop a test case that exposed this by using expression
indexes, so that went into the regression tests.

I fixed this by refactoring make_sort_from_pathkeys so that the tlist
fixup logic was separately accessible.

There were assorted other bugs too. I think it's all good now, but
we'll find out when beta starts ...

regards, tom lane