Re: Table Partitioning Feature

Lists: pgsql-hackers
From: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: manu(at)frogthinker(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Table Partitioning Feature
Date: 2009-01-22 12:38:45
Message-ID: 8d79a95c0901220438s197e9a76t94b9b4b072789bea@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

We are implementing table partitioning feature to support
- the attached commands. The syntax conforms to most of the suggestion
mentioned in
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring
the following:
-- Specification of partition names is optional. System will be able to
generate partition names in such cases.
-- sub partitioning
We are using pgsql triggers to push/move data to appropriate partitions,
but we will definitely consider moving to C language triggers as suggested
by manu.
- Global non-partitioned indexes (that will extend all the partitions).
- Foreign key support for tables referring to partitioned tables.

Please feel free to post your comments and suggestions.

Thanks,
Amit
Persistent Systems

Attachment Content-Type Size
Part_table_stmt.txt text/plain 5.0 KB

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, manu(at)frogthinker(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-01-22 19:09:35
Message-ID: 4978C46F.4050805@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Amit,

Wow, thanks!

As you probably know, we're already in freeze for 8.4. So this patch
will need to go on the first commitfest for 8.5, in May or June.

--Josh


From: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
To: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-01-22 23:16:44
Message-ID: 4978FE5C.90005@frogthinker.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Amit,

You might want to put this on the
http://wiki.postgresql.org/wiki/Table_partitioning wiki page.
How does your timeline look like for this implementation?
I would be happy to contribute C triggers to your implementation. From
what I understood in
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php, you
already have an implementation that parses the grammar and generates
rules as if someone had written them. Is this code available?

Regarding the use of triggers to push/move data to partitions, what if
someone declares triggers on partitions? Especially if you have
subpartitions, let's consider the case where there is a trigger on the
parent, child and grandchild. If I do an insert in the parent, the user
trigger on the parent will be executed, then the partition trigger that
decides to move to the grandchild. Are we going to bypass the child trigger?
If we also want fast COPY operations on partitioned table, we could have
an optimized implementation that could bypass triggers and move the
tuple directly to the appropriate child table.

Thanks for this big contribution,
Emmanuel

> Hi,
>
> We are implementing table partitioning feature to support
> - the attached commands. The syntax conforms to most of the suggestion
> mentioned in
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
> barring the following:
> -- Specification of partition names is optional. System will be able
> to generate partition names in such cases.
> -- sub partitioning
> We are using pgsql triggers to push/move data to appropriate
> partitions, but we will definitely consider moving to C language
> triggers as suggested by manu.
> - Global non-partitioned indexes (that will extend all the partitions).
> - Foreign key support for tables referring to partitioned tables.
>
> Please feel free to post your comments and suggestions.
>
> Thanks,
> Amit
> Persistent Systems
>
>
>
> ------------------------------------------------------------------------
>
>

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu(at)frogthinker(dot)org
Skype: emmanuel_cecchet


From: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)frogthinker(dot)org>, pgsql-hackers(at)postgresql(dot)org
Cc: swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-01-23 15:19:23
Message-ID: 8d79a95c0901230719i503ae925kc3ec9a528d708af3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Emmanuel,

Please find my comments in-lined:

On 1/23/09, Emmanuel Cecchet <manu(at)frogthinker(dot)org> wrote:
>
> Amit,
>
> You might want to put this on the
> http://wiki.postgresql.org/wiki/Table_partitioning wiki page.

Sure.

How does your timeline look like for this implementation?

The implementation is planned as follows:
- Partition table commands
++ An intermediate patch in Feb end
++ Final patch in mid March
- Global Index: Mid March
- Optimizer changes for partitioned table: May

I would be happy to contribute C triggers to your implementation. From what
> I understood in
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php, you
> already have an implementation that parses the grammar and generates rules
> as if someone had written them. Is this code available?

We have just started with the implementation, i will post the grammar rules
next week.

Regarding the use of triggers to push/move data to partitions, what if
> someone declares triggers on partitions? Especially if you have
> subpartitions, let's consider the case where there is a trigger on the
> parent, child and grandchild. If I do an insert in the parent, the user
> trigger on the parent will be executed, then the partition trigger that
> decides to move to the grandchild. Are we going to bypass the child trigger?

We are not supporting sub-partitioning - There is just one level of
partitioning.

If we also want fast COPY operations on partitioned table, we could have an
> optimized implementation that could bypass triggers and move the tuple
> directly to the appropriate child table.

We will definitely consider to implement fast COPY after we are done with
the planned tasks.

Thanks,
Amit

Thanks for this big contribution,
> Emmanuel
>
> Hi,
>>
>> We are implementing table partitioning feature to support
>> - the attached commands. The syntax conforms to most of the suggestion
>> mentioned in
>> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
>> barring the following:
>> -- Specification of partition names is optional. System will be able to
>> generate partition names in such cases.
>> -- sub partitioning
>> We are using pgsql triggers to push/move data to appropriate partitions,
>> but we will definitely consider moving to C language triggers as suggested
>> by manu.
>> - Global non-partitioned indexes (that will extend all the partitions).
>> - Foreign key support for tables referring to partitioned tables.
>>
>> Please feel free to post your comments and suggestions.
>>
>> Thanks,
>> Amit
>> Persistent Systems
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>
> --
> Emmanuel Cecchet
> FTO @ Frog Thinker Open Source Development & Consulting
> --
> Web: http://www.frogthinker.org
> email: manu(at)frogthinker(dot)org
> Skype: emmanuel_cecchet
>
>


From: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
To: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-01-25 20:11:15
Message-ID: 497CC763.9050104@frogthinker.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Amit,

I overlooked the fact that you dropped composite partitions and
subpartitions template from the proposal presented in
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php.
Is it because this is too hard to support? or you don't see any
immediate need for it?

Thanks,
Emmanuel

> Hi Emmanuel,
>
> Please find my comments in-lined:
>
> On 1/23/09, *Emmanuel Cecchet* <manu(at)frogthinker(dot)org
> <mailto:manu(at)frogthinker(dot)org>> wrote:
>
> Amit,
>
> You might want to put this on the
> http://wiki.postgresql.org/wiki/Table_partitioning wiki page.
>
>
> Sure.
>
> How does your timeline look like for this implementation?
>
>
> The implementation is planned as follows:
> - Partition table commands
> ++ An intermediate patch in Feb end
> ++ Final patch in mid March
> - Global Index: Mid March
> - Optimizer changes for partitioned table: May
>
> I would be happy to contribute C triggers to your implementation.
> From what I understood in
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php,
> you already have an implementation that parses the grammar and
> generates rules as if someone had written them. Is this code
> available?
>
>
> We have just started with the implementation, i will post the grammar
> rules next week.
>
>
> Regarding the use of triggers to push/move data to partitions,
> what if someone declares triggers on partitions? Especially if you
> have subpartitions, let's consider the case where there is a
> trigger on the parent, child and grandchild. If I do an insert in
> the parent, the user trigger on the parent will be executed, then
> the partition trigger that decides to move to the grandchild. Are
> we going to bypass the child trigger?
>
>
> We are not supporting sub-partitioning - There is just one level of
> partitioning.
>
> If we also want fast COPY operations on partitioned table, we
> could have an optimized implementation that could bypass triggers
> and move the tuple directly to the appropriate child table.
>
>
> We will definitely consider to implement fast COPY after we are done
> with the planned tasks.
>
> Thanks,
> Amit
>
>
> Thanks for this big contribution,
> Emmanuel
>
> Hi,
>
> We are implementing table partitioning feature to support
> - the attached commands. The syntax conforms to most of the
> suggestion mentioned in
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
> barring the following:
> -- Specification of partition names is optional. System will
> be able to generate partition names in such cases.
> -- sub partitioning
> We are using pgsql triggers to push/move data to appropriate
> partitions, but we will definitely consider moving to C
> language triggers as suggested by manu.
> - Global non-partitioned indexes (that will extend all the
> partitions).
> - Foreign key support for tables referring to partitioned tables.
>
> Please feel free to post your comments and suggestions.
>
> Thanks,
> Amit
> Persistent Systems
>
>
>
> ------------------------------------------------------------------------
>
>
>
>
> --
> Emmanuel Cecchet
> FTO @ Frog Thinker Open Source Development & Consulting
> --
> Web: http://www.frogthinker.org
> email: manu(at)frogthinker(dot)org <mailto:manu(at)frogthinker(dot)org>
> Skype: emmanuel_cecchet
>
>

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu(at)frogthinker(dot)org
Skype: emmanuel_cecchet


From: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-01-27 08:59:08
Message-ID: 8d79a95c0901270059v58a59ba9m9eb0ef9b95171129@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Emmanuel,

On 1/26/09, Emmanuel Cecchet <manu(at)frogthinker(dot)org> wrote:
>
> Hi Amit,
>
> I overlooked the fact that you dropped composite partitions and
> subpartitions template from the proposal presented in
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php.
> Is it because this is too hard to support? or you don't see any immediate
> need for it?

We do intend to implement composite partitioning, but the delivery dates are
not yet decided.
I feel that simple forms of sub-partitioning can be realized using composite
partitioning, hence the implementation of sub-partitioning is not planned.

Thanks,
Amit

Thanks,
> Emmanuel
>
>
> Hi Emmanuel,
>>
>> Please find my comments in-lined:
>>
>> On 1/23/09, *Emmanuel Cecchet* <manu(at)frogthinker(dot)org <mailto:
>> manu(at)frogthinker(dot)org>> wrote:
>>
>> Amit,
>>
>> You might want to put this on the
>> http://wiki.postgresql.org/wiki/Table_partitioning wiki page.
>>
>>
>> Sure.
>>
>> How does your timeline look like for this implementation?
>>
>>
>> The implementation is planned as follows:
>> - Partition table commands
>> ++ An intermediate patch in Feb end
>> ++ Final patch in mid March
>> - Global Index: Mid March
>> - Optimizer changes for partitioned table: May
>>
>> I would be happy to contribute C triggers to your implementation.
>> From what I understood in
>> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php,
>> you already have an implementation that parses the grammar and
>> generates rules as if someone had written them. Is this code
>> available?
>>
>>
>> We have just started with the implementation, i will post the grammar
>> rules next week.
>>
>>
>> Regarding the use of triggers to push/move data to partitions,
>> what if someone declares triggers on partitions? Especially if you
>> have subpartitions, let's consider the case where there is a
>> trigger on the parent, child and grandchild. If I do an insert in
>> the parent, the user trigger on the parent will be executed, then
>> the partition trigger that decides to move to the grandchild. Are
>> we going to bypass the child trigger?
>>
>>
>> We are not supporting sub-partitioning - There is just one level of
>> partitioning.
>>
>> If we also want fast COPY operations on partitioned table, we
>> could have an optimized implementation that could bypass triggers
>> and move the tuple directly to the appropriate child table.
>>
>>
>> We will definitely consider to implement fast COPY after we are done with
>> the planned tasks.
>>
>> Thanks,
>> Amit
>>
>>
>> Thanks for this big contribution,
>> Emmanuel
>>
>> Hi,
>>
>> We are implementing table partitioning feature to support
>> - the attached commands. The syntax conforms to most of the
>> suggestion mentioned in
>> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
>> barring the following:
>> -- Specification of partition names is optional. System will
>> be able to generate partition names in such cases.
>> -- sub partitioning
>> We are using pgsql triggers to push/move data to appropriate
>> partitions, but we will definitely consider moving to C
>> language triggers as suggested by manu.
>> - Global non-partitioned indexes (that will extend all the
>> partitions).
>> - Foreign key support for tables referring to partitioned tables.
>>
>> Please feel free to post your comments and suggestions.
>>
>> Thanks,
>> Amit
>> Persistent Systems
>>
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>>
>> -- Emmanuel Cecchet
>> FTO @ Frog Thinker Open Source Development & Consulting
>> --
>> Web: http://www.frogthinker.org
>> email: manu(at)frogthinker(dot)org <mailto:manu(at)frogthinker(dot)org>
>> Skype: emmanuel_cecchet
>>
>>
>>
>
> --
> Emmanuel Cecchet
> FTO @ Frog Thinker Open Source Development & Consulting
> --
> Web: http://www.frogthinker.org
> email: manu(at)frogthinker(dot)org
> Skype: emmanuel_cecchet
>
>


From: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-02-09 14:16:54
Message-ID: 8d79a95c0902090616g33c65c3cs52ff4df1215b9776@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Emmanuel,

We are considering to following approach:
1. metadata table pg_partitions is defined as follows:
CATALOG(pg_partitions,2336) BKI_WITHOUT_OIDS
{
Oid partrelid; // partition table Oid
Oid parentrelid; // Parent table Oid
int4 parttype; // Type of partition, list, hash, range
Oid partkey; // partition key Oid
Oid keytype; /// type of partition key.
int4 keyorder /// order of the key in multi-key partitions.
text min;
text max; // min and max for range parti
text[] list;
int hash; // hash value
} FormData_pg_partitions;

2. C triggers will fire a query on this table to get the relevant
partition of the inserted/updated data using SPI interface. The query
will look something like (for range partitioning)

select min(partrelid)
from pg_partitions
where parentrelid = 2934 // we know this value
and (
( $1 between to_int(min ) and to_int(max) and
keyorder = 1) OR
($2 between to_date (min) and to_date (max) and
keyorder =2 )
....
)
group by
parentrelid
having
count(*) = <number of partition keys>

$1, $2, ... are the placeholders of the actual partition key values of
trigger tuple.

Since we know the type of partition keys, and the parentrelid, this
kind of query string can be saved in another table say, pg_part_map.
And its plan can be parsed once and saved in cache to be reused.
Do you see any issue with using SPI interface within triggers?

The advantage of this kind of approah is that trigger code can be made
genric for any kind of partition table.

Thanks,
Amit
Persistent Systems,
www.persistentsys.com

On 1/23/09, Emmanuel Cecchet <manu(at)frogthinker(dot)org> wrote:
> Amit,
>
> You might want to put this on the
> http://wiki.postgresql.org/wiki/Table_partitioning wiki
> page.
> How does your timeline look like for this implementation?
> I would be happy to contribute C triggers to your implementation. From what
> I understood in
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php,
> you already have an implementation that parses the grammar and generates
> rules as if someone had written them. Is this code available?
>
> Regarding the use of triggers to push/move data to partitions, what if
> someone declares triggers on partitions? Especially if you have
> subpartitions, let's consider the case where there is a trigger on the
> parent, child and grandchild. If I do an insert in the parent, the user
> trigger on the parent will be executed, then the partition trigger that
> decides to move to the grandchild. Are we going to bypass the child trigger?
> If we also want fast COPY operations on partitioned table, we could have an
> optimized implementation that could bypass triggers and move the tuple
> directly to the appropriate child table.
>
> Thanks for this big contribution,
> Emmanuel
>
>
> >
> > Hi,
> >
> > We are implementing table partitioning feature to support
> > - the attached commands. The syntax conforms to most of the suggestion
> mentioned in
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
> barring the following:
> > -- Specification of partition names is optional. System will be able to
> generate partition names in such cases.
> > -- sub partitioning
> > We are using pgsql triggers to push/move data to appropriate partitions,
> but we will definitely consider moving to C language triggers as suggested
> by manu.
> > - Global non-partitioned indexes (that will extend all the partitions).
> > - Foreign key support for tables referring to partitioned tables.
> >
> > Please feel free to post your comments and suggestions.
> >
> > Thanks,
> > Amit
> > Persistent Systems
> >
> >
> >
> >
> ------------------------------------------------------------------------
> >
> >
> >
>
>
> --
> Emmanuel Cecchet
> FTO @ Frog Thinker Open Source Development & Consulting
> --
> Web: http://www.frogthinker.org
> email: manu(at)frogthinker(dot)org
> Skype: emmanuel_cecchet
>
>


From: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
To: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-02-10 02:10:38
Message-ID: 4990E21E.2080402@frogthinker.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Amit,

I will be traveling until next Tuesday and will have no access to email
so don't be surprised if I don't follow up this week.
The overall approach seems sound. The metadata table should help also
for DDL to find out overlapping ranges or duplicate list entries.
So far, I have not tried to use the SPI interface from a C trigger so I
don't see any disadvantage yet. We would have to assess the performance
to make sure it's not going to be a show stopper.
I think that the main issue of the trigger approach is that other
triggers might interfere. The 'partition trigger' must be the last of
the 'before insert' triggers and if the destination (child) table has a
trigger, we must ensure that this trigger is not going to require a new
routing.
Another issue is the result that will be returned by insert/copy
commands if all tuples are moved to other tables, the result will be 0.
We might want to have stats that would collect where tuples where moved
for a particular command (I don't know yet what would be the best place
to collect these stats but they could probably be updated by the trigger).
Also would the trigger be attached to all tables in the hierarchy or
only to the top parent?
What kind of query would you use with more than 1 level of inheritance
(e.g. parent=year, child=month, grand-child=day)? It looks like we have
to parse the leaves of the graph but intermediate nodes would help
accelerating the search.

An alternative approach (I haven't assessed the feasibility yet) would
be to try to call the query planner. If we ask to select the partition
value of the tuple, the query planner should return the table it is
going to scan (as in EXPLAIN SELECT * FROM t WHERE key=$1).

Let me know what you think,
Emmanuel

> We are considering to following approach:
> 1. metadata table pg_partitions is defined as follows:
> CATALOG(pg_partitions,2336) BKI_WITHOUT_OIDS
> {
> Oid partrelid; // partition table Oid
> Oid parentrelid; // Parent table Oid
> int4 parttype; // Type of partition, list, hash, range
> Oid partkey; // partition key Oid
> Oid keytype; /// type of partition key.
> int4 keyorder /// order of the key in multi-key partitions.
> text min;
> text max; // min and max for range parti
> text[] list;
> int hash; // hash value
> } FormData_pg_partitions;
>
>
> 2. C triggers will fire a query on this table to get the relevant
> partition of the inserted/updated data using SPI interface. The query
> will look something like (for range partitioning)
>
> select min(partrelid)
> from pg_partitions
> where parentrelid = 2934 // we know this value
> and (
> ( $1 between to_int(min ) and to_int(max) and
> keyorder = 1) OR
> ($2 between to_date (min) and to_date (max) and
> keyorder =2 )
> ....
> )
> group by
> parentrelid
> having
> count(*) = <number of partition keys>
>
> $1, $2, ... are the placeholders of the actual partition key values of
> trigger tuple.
>
> Since we know the type of partition keys, and the parentrelid, this
> kind of query string can be saved in another table say, pg_part_map.
> And its plan can be parsed once and saved in cache to be reused.
> Do you see any issue with using SPI interface within triggers?
>
> The advantage of this kind of approah is that trigger code can be made
> genric for any kind of partition table.
>
> Thanks,
> Amit
> Persistent Systems,
> www.persistentsys.com
>
>
>
>
>
> On 1/23/09, Emmanuel Cecchet <manu(at)frogthinker(dot)org> wrote:
>
>> Amit,
>>
>> You might want to put this on the
>> http://wiki.postgresql.org/wiki/Table_partitioning wiki
>> page.
>> How does your timeline look like for this implementation?
>> I would be happy to contribute C triggers to your implementation. From what
>> I understood in
>> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php,
>> you already have an implementation that parses the grammar and generates
>> rules as if someone had written them. Is this code available?
>>
>> Regarding the use of triggers to push/move data to partitions, what if
>> someone declares triggers on partitions? Especially if you have
>> subpartitions, let's consider the case where there is a trigger on the
>> parent, child and grandchild. If I do an insert in the parent, the user
>> trigger on the parent will be executed, then the partition trigger that
>> decides to move to the grandchild. Are we going to bypass the child trigger?
>> If we also want fast COPY operations on partitioned table, we could have an
>> optimized implementation that could bypass triggers and move the tuple
>> directly to the appropriate child table.
>>
>> Thanks for this big contribution,
>> Emmanuel
>>
>>
>>
>>> Hi,
>>>
>>> We are implementing table partitioning feature to support
>>> - the attached commands. The syntax conforms to most of the suggestion
>>>
>> mentioned in
>> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
>> barring the following:
>>
>>> -- Specification of partition names is optional. System will be able to
>>>
>> generate partition names in such cases.
>>
>>> -- sub partitioning
>>> We are using pgsql triggers to push/move data to appropriate partitions,
>>>
>> but we will definitely consider moving to C language triggers as suggested
>> by manu.
>>
>>> - Global non-partitioned indexes (that will extend all the partitions).
>>> - Foreign key support for tables referring to partitioned tables.
>>>
>>> Please feel free to post your comments and suggestions.
>>>
>>> Thanks,
>>> Amit
>>> Persistent Systems
>>>
>>>
>>>
>>>
>>>
>> ------------------------------------------------------------------------
>>
>>>
>
>

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu(at)frogthinker(dot)org
Skype: emmanuel_cecchet


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)frogthinker(dot)org>, pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-02-10 05:16:00
Message-ID: 603c8f070902092116j276836f2p2d8e243e98a4a20e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 9, 2009 at 9:16 AM, Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com> wrote:
> Hi Emmanuel,
>
> We are considering to following approach:
> 1. metadata table pg_partitions is defined as follows:
> CATALOG(pg_partitions,2336) BKI_WITHOUT_OIDS
> {
> Oid partrelid; // partition table Oid
> Oid parentrelid; // Parent table Oid
> int4 parttype; // Type of partition, list, hash, range
> Oid partkey; // partition key Oid
> Oid keytype; /// type of partition key.
> int4 keyorder /// order of the key in multi-key partitions.
> text min;
> text max; // min and max for range parti
> text[] list;
> int hash; // hash value
> } FormData_pg_partitions;
>
>
> 2. C triggers will fire a query on this table to get the relevant
> partition of the inserted/updated data using SPI interface. The query
> will look something like (for range partitioning)
>
> select min(partrelid)
> from pg_partitions
> where parentrelid = 2934 // we know this value
> and (
> ( $1 between to_int(min ) and to_int(max) and
> keyorder = 1) OR
> ($2 between to_date (min) and to_date (max) and
> keyorder =2 )
> ....
> )
> group by
> parentrelid
> having
> count(*) = <number of partition keys>
>
> $1, $2, ... are the placeholders of the actual partition key values of
> trigger tuple.
>
> Since we know the type of partition keys, and the parentrelid, this
> kind of query string can be saved in another table say, pg_part_map.
> And its plan can be parsed once and saved in cache to be reused.
> Do you see any issue with using SPI interface within triggers?
>
> The advantage of this kind of approah is that trigger code can be made
> genric for any kind of partition table.

I am a little fuzzy on what you're proposing here, but I think you're
saying that you're only going to support range partitioning on
integers or dates and that you plan to use the text type to store the
integer or date values. FWIW, those don't seem like very good
decisions to me. I think you should aim to support range partitioning
on any combination of a datatype and a less-than operator, similar to
what pg_statistic does for statistics. pg_statistic uses anyarray to
store the datums.

I am also somewhat skeptical about the idea of using triggers for
this. I haven't scrutinized the issue in detail, so I may be all
wet... but ISTM that the concerns raised elsewhere about the order in
which triggers can be expected to fire may bite you fairly hard. ISTM
the right semantics are something like this:

- fire all of the row-level BEFORE triggers on the parent table
(giving up if any return NULL)
- determine the correct child table based on the resulting tuple
- fire all of the row-level BEFORE triggers on the child table (giving
up if any return NULL)
- insert the tuple into the child table
- fire all of the row-level AFTER triggers on the child table... and
possibly also the parent table... not sure about the order

You will also need to fire statement-level triggers on the appropriate
tables, which is a little tricky. Presumably you want the tables on
which the AFTER triggers fire to be the same ones as those on which
the BEFORE triggers fire, but you don't know which child tables you're
actually going to hit until you actually perform the action. Maybe
the right thing to do is fire both sets of triggers on the parent
table and those child tables not excluded by constraint exclusion...?
But I'm not sure about that.

Anyway, getting these types of behavior via triggers may be tricky.
But then again maybe not: I haven't read the code.

...Robert


From: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-02-10 13:11:31
Message-ID: 8d79a95c0902100511m4c5fab4fg76342b8cd5c4bb9b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for your feedback, Emmanuel.
Here are my comments:

On 2/10/09, Emmanuel Cecchet <manu(at)frogthinker(dot)org> wrote:
> Hi Amit,
>
> I will be traveling until next Tuesday and will have no access to email so
> don't be surprised if I don't follow up this week.
> The overall approach seems sound. The metadata table should help also for
> DDL to find out overlapping ranges or duplicate list entries.

We are checking for overlaps in the partition keys before creating partitions.

> So far, I have not tried to use the SPI interface from a C trigger so I
> don't see any disadvantage yet. We would have to assess the performance to
> make sure it's not going to be a show stopper.
> I think that the main issue of the trigger approach is that other triggers
> might interfere. The 'partition trigger' must be the last of the 'before
> insert' triggers and if the destination (child) table has a trigger, we must
> ensure that this trigger is not going to require a new routing.
> Another issue is the result that will be returned by insert/copy commands
> if all tuples are moved to other tables, the result will be 0. We might want
> to have stats that would collect where tuples where moved for a particular
> command (I don't know yet what would be the best place to collect these
> stats but they could probably be updated by the trigger).

Row movements will be done by firing deletes and inserts. We will
investigte on how these stats can be maintained and displayed.

> Also would the trigger be attached to all tables in the hierarchy or only
> to the top parent?
> What kind of query would you use with more than 1 level of inheritance
> (e.g. parent=year, child=month, grand-child=day)? It looks like we have to
> parse the leaves of the graph but intermediate nodes would help accelerating
> the search.
>

We haven't yet planned for supporting multi-level partitioning.
However, the pg_partition table can be extented to store "partlevel"
column (to represent depth of partition from the root), and we should
just select the leaf level partitions in the SQL that finds target
partition.
(This is with the assumption that only leaf level partitions will have
the data.)

> An alternative approach (I haven't assessed the feasibility yet) would be
> to try to call the query planner. If we ask to select the partition value of
> the tuple, the query planner should return the table it is going to scan (as
> in EXPLAIN SELECT * FROM t WHERE key=$1).
>

That's a good idea. We will have to anyway write this code for planner
module to find relevant partitions for 'SELECT' queries.

Another question i have is - should we create a separate C file and
shared library for the partition trigger functions, or can we bundle
it with one of the existing libraries?

Thanks,
Amit

> Let me know what you think,
>
> Emmanuel
>
>
> > We are considering to following approach:
> > 1. metadata table pg_partitions is defined as follows:
> > CATALOG(pg_partitions,2336) BKI_WITHOUT_OIDS
> > {
> > Oid partrelid; // partition table Oid
> > Oid parentrelid; // Parent table Oid
> > int4 parttype; // Type of partition, list, hash, range
> > Oid partkey; // partition key Oid
> > Oid keytype; /// type of partition key.
> > int4 keyorder /// order of the key in multi-key partitions.
> > text min;
> > text max; // min and max for range parti
> > text[] list;
> > int hash; // hash value
> > } FormData_pg_partitions;
> >
> >
> > 2. C triggers will fire a query on this table to get the relevant
> > partition of the inserted/updated data using SPI interface. The query
> > will look something like (for range partitioning)
> >
> > select min(partrelid)
> > from pg_partitions
> > where parentrelid = 2934 // we know this value
> > and (
> > ( $1 between to_int(min ) and to_int(max) and
> > keyorder = 1) OR
> > ($2 between to_date (min) and to_date (max) and
> > keyorder =2 )
> > ....
> > )
> > group by
> > parentrelid
> > having
> > count(*) = <number of partition keys>
> >
> > $1, $2, ... are the placeholders of the actual partition key values of
> > trigger tuple.
> >
> > Since we know the type of partition keys, and the parentrelid, this
> > kind of query string can be saved in another table say, pg_part_map.
> > And its plan can be parsed once and saved in cache to be reused.
> > Do you see any issue with using SPI interface within triggers?
> >
> > The advantage of this kind of approah is that trigger code can be made
> > genric for any kind of partition table.
> >
> > Thanks,
> > Amit
> > Persistent Systems,
> > www.persistentsys.com
> >
> >
> >
> >
> >
> > On 1/23/09, Emmanuel Cecchet <manu(at)frogthinker(dot)org> wrote:
> >
> >
> > > Amit,
> > >
> > > You might want to put this on the
> > > http://wiki.postgresql.org/wiki/Table_partitioning wiki
> > > page.
> > > How does your timeline look like for this implementation?
> > > I would be happy to contribute C triggers to your implementation. From
> what
> > > I understood in
> > >
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php,
> > > you already have an implementation that parses the grammar and generates
> > > rules as if someone had written them. Is this code available?
> > >
> > > Regarding the use of triggers to push/move data to partitions, what if
> > > someone declares triggers on partitions? Especially if you have
> > > subpartitions, let's consider the case where there is a trigger on the
> > > parent, child and grandchild. If I do an insert in the parent, the user
> > > trigger on the parent will be executed, then the partition trigger that
> > > decides to move to the grandchild. Are we going to bypass the child
> trigger?
> > > If we also want fast COPY operations on partitioned table, we could
> have an
> > > optimized implementation that could bypass triggers and move the tuple
> > > directly to the appropriate child table.
> > >
> > > Thanks for this big contribution,
> > > Emmanuel
> > >
> > >
> > >
> > >
> > > > Hi,
> > > >
> > > > We are implementing table partitioning feature to support
> > > > - the attached commands. The syntax conforms to most of the suggestion
> > > >
> > > >
> > > mentioned in
> > >
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
> > > barring the following:
> > >
> > >
> > > > -- Specification of partition names is optional. System will be able
> to
> > > >
> > > >
> > > generate partition names in such cases.
> > >
> > >
> > > > -- sub partitioning
> > > > We are using pgsql triggers to push/move data to appropriate
> partitions,
> > > >
> > > >
> > > but we will definitely consider moving to C language triggers as
> suggested
> > > by manu.
> > >
> > >
> > > > - Global non-partitioned indexes (that will extend all the
> partitions).
> > > > - Foreign key support for tables referring to partitioned tables.
> > > >
> > > > Please feel free to post your comments and suggestions.
> > > >
> > > > Thanks,
> > > > Amit
> > > > Persistent Systems
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> ------------------------------------------------------------------------
> > >
> > >
> > > >
> > > >
> > >
> >
> >
> >
>
>
> --
> Emmanuel Cecchet
> FTO @ Frog Thinker Open Source Development & Consulting
> --
> Web: http://www.frogthinker.org
> email: manu(at)frogthinker(dot)org
> Skype: emmanuel_cecchet
>
>


From: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)frogthinker(dot)org>, pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-02-10 15:02:07
Message-ID: 8d79a95c0902100702vce4e0c0y8be24bf1c5245daf@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Robert,

> I am a little fuzzy on what you're proposing here, but I think you're
> saying that you're only going to support range partitioning on
> integers or dates and that you plan to use the text type to store the
> integer or date values. FWIW, those don't seem like very good
> decisions to me. I think you should aim to support range partitioning
> on any combination of a datatype and a less-than operator, similar to
> what pg_statistic does for statistics. pg_statistic uses anyarray to
> store the datums.
>

We don't have any strong reason for not using anyarray datatypes. We
will consider your suggestion.

> I am also somewhat skeptical about the idea of using triggers for
> this. I haven't scrutinized the issue in detail, so I may be all
> wet... but ISTM that the concerns raised elsewhere about the order in
> which triggers can be expected to fire may bite you fairly hard. ISTM
> the right semantics are something like this:
>
> - fire all of the row-level BEFORE triggers on the parent table
> (giving up if any return NULL)
> - determine the correct child table based on the resulting tuple
> - fire all of the row-level BEFORE triggers on the child table (giving
> up if any return NULL)
> - insert the tuple into the child table
> - fire all of the row-level AFTER triggers on the child table... and
> possibly also the parent table... not sure about the order
>

The child tables will just have update triggers to take care of row
movements. Invalid Inserts on child tables will be taken care by the
constraints.
Parent table will have all the triggers (insert/update/del) to
redirect the rows to appropriate child tables.

The order of execution of triggers can create problems. However,
triggers are called in order of there names. So we can use a prefix
starting with "large" string value for partition names to make sure
that they are called last.

> You will also need to fire statement-level triggers on the appropriate
> tables, which is a little tricky. Presumably you want the tables on
> which the AFTER triggers fire to be the same ones as those on which
> the BEFORE triggers fire, but you don't know which child tables you're
> actually going to hit until you actually perform the action. Maybe
> the right thing to do is fire both sets of triggers on the parent
> table and those child tables not excluded by constraint exclusion...?
> But I'm not sure about that.

I am not sure i understood the problem. But our triggers will know
which partitions (child tables) will be subjected to
insert/update/del.

Thanks,
Amit
Persitent Systems

>
> Anyway, getting these types of behavior via triggers may be tricky.
> But then again maybe not: I haven't read the code.
>
>
> ...Robert
>