Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Re: Partitioning: issues/ideas (Was: Re: On partitioning)
Date: 2015-01-19 03:37:55
Message-ID: CAFjFpRdf3T+n18J1EsyaOYzCZCV2MzB5ZXdTeOrOnFr=5EgB8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > * It has been repeatedly pointed out that we may want to decouple
> > partitioning from inheritance because implementing partitioning as an
> > extension of inheritance mechanism means that we have to keep all the
> > existing semantics which might limit what we want to do with the special
> > case of it which is partitioning; in other words, we would find
> > ourselves in difficult position where we have to inject a special case
> > code into a very generalized mechanism that is inheritance.
> > Specifically, do we regard a partitions as pg_inherits children of its
> > partitioning parent?
>
> I don't think this is totally an all-or-nothing decision. I think
> everyone is agreed that we need to not break things that work today --
> e.g. Merge Append. What that implies for pg_inherits isn't altogether
> clear.
>
> > * Syntax: do we want to make it similar to one of the many other
> > databases out there? Or we could invent our own?
>
> Well, what I think we don't want is something that is *almost* like
> some other database but not quite. I lean toward inventing our own
> since I'm not aware of something that we'd want to copy exactly.
>
> > I wonder if we could add a clause like DISTRIBUTED BY to complement
> > PARTITION ON that represents a hash distributed/partitioned table (that
> > could be a syntax to support sharded tables maybe; we would definitely
> > want to move ahead in that direction I guess)
>
> Maybe eventually, but let's not complicate things by worrying too much
> about that now.
>

Instead we might want to specify which server (foreign or local) each of
the partition go to, something like LOCATED ON clause for each of the
partitions with default as local server.

>
> > * Catalog: We would like to have a catalog structure suitable to
> > implement capabilities like multi-column partitioning, sub-partitioning
> > (with arbitrary number of levels in the hierarchy). I had suggested
> > that we create two new catalogs viz. pg_partitioned_rel,
> > pg_partition_def to store metadata about a partition key of a
> > partitioned relation and partition bound info of a partition,
> > respectively. Also, see the point about on-disk representation of
> > partition bounds
>
> I'm not convinced that there is any benefit in spreading this
> information across two tables neither of which exist today. If the
> representation of the partitioning scheme is going to be a node tree,
> then there's no point in taking what would otherwise have been a List
> and storing each element of it in a separate tuple. The overarching
> point here is that the system catalog structure should be whatever is
> most convenient for the system internals; I'm not sure we understand
> what that is yet.
>
> > * It is desirable to treat partitions as pg_class relations with perhaps
> > a new relkind(s). We may want to choose an implementation where only the
> > lowest level relations in a partitioning hierarchy have storage; those
> > at the upper layers are mere placeholder relations though of course with
> > associated constraints determined by partitioning criteria (with
> > appropriate metadata entered into the additional catalogs).
>
> I think the storage-less parents need a new relkind precisely to
> denote that they have no storage; I am not convinced that there's any
> reason to change the relkind for the leaf nodes. But that's been
> proposed, so evidently someone thinks there's a reason to do it.
>
> > I am not
> > quite sure if each kind of the relations involved in the partitioning
> > scheme have separate namespaces and, if they are, how we implement that
>
> I am in favor of having all of the nodes in the hierarchy have names
> just as relations do today -- pg_class.relname. Anything else seems
> to me to be complex to implement and of very marginal benefit. But
> again, it's been proposed.
>
> > * In the initial implementation, we could just live with partitioning on
> > a set of columns (and not arbitrary expressions of them)
>
> Seems quite fair.
>
> > * We perhaps do not need multi-column LIST partitions as they are not
> > very widely used and may complicate the implementation
>
> I agree that the use case is marginal; but I'm not sure it needs to
> complicate the implementation much. Depending on how the
> implementation shakes out, prohibiting it might come to seem like more
> of a wart than allowing it.
>
> > * There are a number of suggestions about how we represent partition
> > bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
> > associated with the relation itself), etc. Important point to consider
> > here may be that partition key may contain more than one column
>
> Yep.
>
> > * How we represent partition definition in memory (for a given
> > partitioned relation) - important point to remember is that such a
> > representation should be efficient to iterate through or
> > binary-searchable. Also see the points about tuple-routing and
> > partition-pruning
>
> Yep.
>
> > * Overflow/catchall partition: it seems we do not want/need them. It
> > might seem desirable for example in cases where a big transaction enters
> > a large number of tuples all but one of which find a defined partition;
> > we may not want to error out in such case but instead enter that erring
> > tuple into the overflow partition instead. If we choose to implement
> > that, we would like to also implement the capability to move the tuples
> > into the appropriate partition once it's defined. Related is the notion
> > of automatically creating partitions if one is not already defined for a
> > just entered tuple; but there may be locking troubles if many concurrent
> > sessions try to do that
>
> I think that dynamically creating new partitions is way beyond the
> scope of what this patch should be trying to do. If we ever do it at
> all, it should not be now. The value of a default partition (aka
> overflow partition) seems to me to be debatable. For range
> partitioning, it doesn't seem entirely necessary provided that you can
> define a range with only one endpoint (e.g. partition A has values 1
> to 10, B has 11 and up, and C has 0 and down). For list partitioning,
> though, you might well want something like that. But is it a
> must-have? Dunno.
>
> > * Tuple-routing: based on the internal representation of partition
> > bounds for the partitions of a given partitioned table, there should be
> > a way to map a just entered tuple to partition id it belongs to. Below
> > mentioned BRIN-like machinery could be made to work
> >
> > * Partition-pruning: again, based on the internal representation of
> > partition bounds for the partitions of a given partitioned table, there
> > should be a way to prune partitions deemed unnecessary per scan quals.
> > One notable suggestion is to consider BRIN (-like) machinery. For
> > example, it is able to tell from the scan quals whether a particular
> > block range of a given heap needs to be scanned or not based on summary
> > info index tuple for the block range. Though, the interface is currently
> > suitable to cover a single heap with blocks in range 0 to N-1 of that
> > heap. What we are looking for here is a hypothetical PartitionMemTuple
> > (or PartitionBound) that is a summary of a whole relation (in this case,
> > the partition) NOT a block range. But I guess the infrastructure is
> > generalized enough that we could make that work. Related then would be
> > an equivalent of ScanKey for the partitioning case. Just as ScanKeyData
> > has correspondence with the index being used, the hypothetical
> > PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
> > would represent the application of comparison operator between table
> > column (partitioning key column) and a constant (as per quals).
>
> I'm not going to say this couldn't be done, but how is any of it
> better than having a list of the partition bounds and binary-searching
> it?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2015-01-19 03:41:20 Re: install libpq.dll in bin directory on Windows / Cygwin
Previous Message Stephen Frost 2015-01-19 03:28:37 Re: pgaudit - an auditing extension for PostgreSQL