Re: On partitioning

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: On partitioning
Date: 2014-11-14 02:12:01
Message-ID: 20141114021201.GN28859@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert,

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> On Thu, Nov 13, 2014 at 1:39 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > but
> > with declarative partitioning, I expect us to eventually be able to
> > eliminate complete partitions from consideration on both sides of a
> > partition-table join and optimize cases where we have two partitioned
> > tables being joined with a compatible join key and only actually do
> > joins between the partitions which overlap each other. I don't see
> > those happening if we're allowing a node tree (only). If having a node
> > tree is just one option among other partitioning options, then we can
> > provide users with the ability to choose what suits their particular
> > needs.
>
> This seems completely muddled to me. What we're talking about is how
> to represent the partition definition in the system catalogs. I'm not
> proposing that the user would "partition by pg_node_tree"; what the
> heck would that even mean?

They'd provide an expression which would be able to identify the
partition to be used. In a way, this is exactly how many folks do
partitioning today with inheritence- consider the if/else trees in
triggers for handling new data coming into the parent table. That's
also why it wouldn't be easy to optimize for.

> I'm proposing one way of serializing the
> partition definitions that the user specifies into something that can
> be stored into a system catalog, which happens to reuse the existing
> infrastructure that we use for that same purpose in various other
> places.

Ok, I didn't immediately see how a node tree would be used for this- but
I admit that I've not gone back through the entirety of this iteration
of the partitioning discussion.

> I don't have a problem with somebody coming up with another
> way of representing the data in the catalogs; I'm just brainstorming.

Ditto.

> But saying that we'll be able to optimize joins better if we store the
> same data as anyarray rather than pg_node_tree or visca versa doesn't
> make any sense at all.

Ok, if the node tree is constrained in what can be stored in it then I
understand how we could still use optimize based on what we've stored in
it. I'm not entirely sure a node tree makes sense but at least I
understand better.

> > I'm not a fan of using pg_class- there are a number of columns in there
> > which I would *not* wish to be allowed to be different per partition
> > (starting with relowner and relacl...). Making those NULL would be just
> > as bad (probably worse, really, since we'd also need to add new columns
> > to pg_class to indicate the partitioning...) as having a sparsely
> > populated new catalog table.
>
> I think you are, again, confused as to what we're discussing. Nobody,
> including Alvaro, has proposed a design where the individual
> partitions don't have pg_class entries of some kind. What we're
> talking about is where to store the metadata for partition exclusion
> and tuple routing.

This discussion has gone a few rounds before and, yes, I was just
jumping into the middle of this particular round, but I'm pretty sure
I'm not the first to point out that storing the individual partition
information into pg_class isn't ideal since there are pieces that we
don't actually want to be different per partition, as I outlined
previously. Perhaps what that means is we should actually go the other
way and move *those* columns into a new catalog instead.

Consider this (totally off-the-cuff):

pg_relation (pg_tables? pg_heaps?)
oid
relname
relnamespace
reltype
reloftype
relowner
relam (?)
relhas*
relisshared
relpersistence
relkind (?)
relnatts
relchecks
relacl
reloptions
relhowpartitioned (?)

pg_class
pg_relation.oid
relfilenode
reltablespace
relpages
reltuples
reltoastrelid
reltoastidxid
relfrozenxid
relhasindexes (?)
relpartitioninfo (whatever this ends up being)

The general idea being to seperate the user-facing notion of a "table"
from the underlying implementation, with the implementation allowing
multiple sets of files to be used for each table. It's certainly not
for the faint of heart, but we saw what happened with our inheiritance
structure allowing different permissions on the child tables- we ended
up creating a pretty grotty hack to deal with it (going through the
parent bypasses the permissions). That's the best solution for that
situation, but it's far from ideal and it'd be nice to avoid that same
risk with partitioning. Additionally, if each partition is in pg_class,
how are we handling name conflicts? Why do individual partitions even
need to have a name? Do we allow queries against them directly? etc..

These are just my thoughts on it and I really don't intend to derail
progress on having a partitioning system and I hope that my comments
don't lead to that happening.

Thanks,

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2014-11-14 02:18:17 Re: On the warpath again about ill-considered inclusion nests
Previous Message Andres Freund 2014-11-14 02:05:29 Re: what does this mean: "running xacts with xcnt == 0"