Re: Rethinking representation of sort/hash semantics in queries and plans

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Rethinking representation of sort/hash semantics in queries and plans
Date: 2010-11-28 03:02:33
Message-ID: 8075.1290913353@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

In recent discussions of the plan-tree representation for KNNGIST index
scans, there seemed to be agreement that it'd be a good idea to explicitly
represent the expected sort ordering of the output. While thinking about
how best to do that, it struck me that there's some pretty horrendous
impedance mismatches in the way we do things now. Different parts of the
system use two different representations of sort ordering:

* A sort operator (which can have < or > semantics) plus nulls-first flag

* A btree opfamily plus direction and nulls-first flags

Getting from one of these representations to the other is not exactly
cheap, as it requires one or more syscache lookups. But consider what
happens when we process a simple SELECT ... ORDER BY query to produce
a Sort plan:

1. The parser generates a SortGroupClause, which contains the
sort-operator representation. This involves looking up the default btree
opclass for the ORDER BY column's datatype, then finding the < or > member
of the opclass. (These lookups are buffered by the typcache, but we'll
still have to do them at least once per session.) If you use ORDER BY
USING then you might think it's cheaper ... but we still do a lookup to
verify that the operator is in some btree family.

2. The planner generates a PathKey, which is based on the opfamily
representation, so we have to do get_ordering_op_properties to go back the
other way.

3. If a sort plan is chosen, createplan.c uses get_opfamily_member to go
from the PathKey representation back to sort-operator representation,
because the Sort plan node type stores sort operators.

4. At runtime, tuplesort_begin_heap needs the comparison function for the
sort operator, so it calls get_ordering_op_properties *again* to re-derive
the opfamily, from which it can extract the right support function using
get_opfamily_proc.

Things are only marginally less comical if an indexscan plan is chosen,
and that's only because the IndexScan plan node doesn't contain any
explicit representation of the output sort order. If we were to solve the
KNNGIST issue by installing sort operator lists in IndexScan, it'd be
about as ugly as this. (For some extra amusement, trace through where
build_index_pathkeys' data comes from...)

I have not traced through the behavior for hash-based plans as carefully,
but I believe that there are a similar number of conversions between
operator OID and hash opfamily OID representations.

We got into this mess by revising the planner to use opfamily OIDs to
define sort/hash semantics without changing the structures that are its
input and output APIs. I think it's probably time to fix this. I don't
have any evidence at the moment about what fraction of SearchSysCache load
is coming from these repeated conversions, but it can't be trivial. And
quite aside from any performance benefits, it would be conceptually
cleaner to have only one representation of sort semantics not two.

If you look closely at what we're doing with sort operators
(get_ordering_op_properties pretty much encapsulates this), it turns out
that a sort operator is shorthand for three pieces of information:

1. btree opfamily OID
2. specific input datatype for the opfamily
3. ascending or descending direction

So to fix these problems we'd need to replace sort operator OIDs in
SortGroupClause and plan nodes with those three items. Obviously, this
would be slightly bulkier, but the extra cost added to copying parse and
plan trees should be tiny compared to the avoided syscache lookups.

A possible compromise is to avoid storing the specific input datatype.
In most cases it's the same as the column datatype or expression datatype
that we're feeding to the sort operation, which I believe we can always
get from somewhere else in the plan. However, it can be different in
binary-compatibility cases (such as feeding a specific array type to
anyarray_ops, or varchar to text_ops). If we don't store it then we'd
need to add extra complexity in get_opfamily_proc and similar places
to search for a binary-compatible member function or operator. This
extra cost would be paid only when actually using binary-compatible
cases, but it still seems like it'd be better to pay a little extra
storage to avoid it.

In a similar fashion, I think that the eqop member of SortGroupClause
should be replaced by a hash opfamily OID and specific input datatype
(no direction needed here, of course), and likewise for hash operator
OIDs in hash-based plan node types.

A possible objection to this idea is that instead of having stored rules
and views depending on specific operators, they'd now be shown as
depending on specific opfamilies. If you were to drop the relevant
operators then you'd get failures at runtime because no suitable member
operator or function could be found. However, with the current scheme
it's possible to drop the opfamily that makes use of a particular operator
valid in a given query context; so you can still get a failure, though it
would happen a bit upstream in the planner when it fails to locate the
opfamily. On balance I don't think this is any better or worse, just
different. We don't support dynamic modification of opclasses terribly
well anyhow, given all the caching that is done for them.

Thoughts, objections?

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Rethinking representation of sort/hash semantics in queries and plans
Date: 2010-11-28 14:45:54
Message-ID: 20101128144554.GA3796@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Nov 27, 2010 at 10:02:33PM -0500, Tom Lane wrote:
> In recent discussions of the plan-tree representation for KNNGIST index
> scans, there seemed to be agreement that it'd be a good idea to explicitly
> represent the expected sort ordering of the output. While thinking about
> how best to do that, it struck me that there's some pretty horrendous
> impedance mismatches in the way we do things now. Different parts of the
> system use two different representations of sort ordering:
>
> * A sort operator (which can have < or > semantics) plus nulls-first flag
>
> * A btree opfamily plus direction and nulls-first flags

Sounds like a good idea to me. Quite aside from the performance issues,
having one way to represent things will make it clearer what's going on
and easier to extend in the future.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Rethinking representation of sort/hash semantics in queries and plans
Date: 2010-11-28 20:02:56
Message-ID: m2tyj1fc7j.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> If you look closely at what we're doing with sort operators
> (get_ordering_op_properties pretty much encapsulates this), it turns out
> that a sort operator is shorthand for three pieces of information:
>
> 1. btree opfamily OID
> 2. specific input datatype for the opfamily
> 3. ascending or descending direction
>
> So to fix these problems we'd need to replace sort operator OIDs in
> SortGroupClause and plan nodes with those three items. Obviously, this
> would be slightly bulkier, but the extra cost added to copying parse and
> plan trees should be tiny compared to the avoided syscache lookups.
>
> A possible compromise is to avoid storing the specific input datatype.

My understanding is that opfamily+datatype gives an opclass. What about
storing the opclass OID there?

Other than that, cleaning up the current situation by having as good an
view of the bigger picture as what you have now sounds great.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Rethinking representation of sort/hash semantics in queries and plans
Date: 2010-11-28 21:20:19
Message-ID: 21713.1290979219@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> So to fix these problems we'd need to replace sort operator OIDs in
>> SortGroupClause and plan nodes with those three items. Obviously, this
>> would be slightly bulkier, but the extra cost added to copying parse and
>> plan trees should be tiny compared to the avoided syscache lookups.

> My understanding is that opfamily+datatype gives an opclass. What about
> storing the opclass OID there?

Then you'd just need to look up the opfamily again. Opclasses are too
small a division to be useful.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Rethinking representation of sort/hash semantics in queries and plans
Date: 2010-11-28 22:03:20
Message-ID: 23027.1290981800@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> (For some extra amusement, trace through where
> build_index_pathkeys' data comes from...)

While I don't propose to implement right away the whole SortGroupClause
and plan tree modification sketched above, I did look into fixing
build_index_pathkeys so that it doesn't uselessly convert from
opfamilies to sort operators and back again. The main reason this is
relevant at the moment is that we could get rid of relcache.c's caching
of operators related to indexes, which seems possibly useful in
connection with the current discussion of backend startup time.

What becomes apparent when you look closely at what that code is doing
is that it's catering to the possibility of an amcanorder index access
method that isn't btree. As things stand in HEAD, an add-on index
access method would be recognized as supporting ordering so long as it
registers the regular comparison operators (<, >, etc) with the same
index strategy numbers as btree. The reason that it would work is that
those operators would be found as the fwdsortop/revsortop entries for
the index, and then looking up those operator OIDs in btree opfamilies
would locate the corresponding btree opfamily OIDs, which is what you
have to have to match to a pathkey's opfamily.

In the attached draft patch that would no longer work, because the new
access method would have to have the exact same opfamily OIDs as btree
in order to match to btree-derived pathkeys --- and of course it can't
have that, since access method is a property of an opfamily.

Now, this loss of flexibility doesn't particularly bother me, because
I know of no existing or contemplated btree-substitute access methods.
If one did appear on the horizon, there are a couple of ways we could
fix the problem, the cleanest being to let a non-btree opfamily declare
that it sorts the same as btree opfamily so-and-so. Or we could fix
it locally in plancat.c by performing the lookup-the-operators-and-
then-the-btree-opfamilies dance on the fly when setting up IndexOptInfo
for a non-btree amcanorder index. But I'm disinclined to write such
code when there's no way to test it and no foreseeable possibility
that it'll ever be used. Maybe we should just make plancat.c throw
a not-implemented error if amcanorder is true but it's not btree.

Thoughts? Anyone particularly opposed to pursuing an optimization
of this kind at all?

regards, tom lane

PS: the attached patch doesn't yet include removal of relcache
rd_operator arrays, since that would just make the patch bigger
without exposing any new issues.

Attachment Content-Type Size
simplify-index-pathkeys.patch text/x-patch 22.8 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Rethinking representation of sort/hash semantics in queries and plans
Date: 2010-11-29 00:42:23
Message-ID: 28019.1290991343@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Now, this loss of flexibility doesn't particularly bother me, because
> I know of no existing or contemplated btree-substitute access methods.
> If one did appear on the horizon, there are a couple of ways we could
> fix the problem, the cleanest being to let a non-btree opfamily declare
> that it sorts the same as btree opfamily so-and-so. Or we could fix
> it locally in plancat.c by performing the lookup-the-operators-and-
> then-the-btree-opfamilies dance on the fly when setting up IndexOptInfo
> for a non-btree amcanorder index. But I'm disinclined to write such
> code when there's no way to test it and no foreseeable possibility
> that it'll ever be used. Maybe we should just make plancat.c throw
> a not-implemented error if amcanorder is true but it's not btree.

On further reflection the last seems like clearly the thing to do.

> PS: the attached patch doesn't yet include removal of relcache
> rd_operator arrays, since that would just make the patch bigger
> without exposing any new issues.

And here's the complete patch.

regards, tom lane

Attachment Content-Type Size
simplify-index-pathkeys-2.patch text/x-patch 40.6 KB