Re: help with plug-in function for additional (partition/shard) visibility checks

Lists: pgsql-hackers
From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: help with plug-in function for additional (partition/shard) visibility checks
Date: 2011-09-01 08:13:32
Message-ID: 1314864812.16158.17.camel@hvost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hallow hackers

I have the following problem to solve and would like to get advice on
the best way to do it.

The problem:

When growing a pl/proxy based database cluster, one of the main
operations is splitting a partition. The standard flow is as follows:

1) make a copy of the partitions table(s) to another database
2) reconfigure pl/proxy to use 2 partitions instead of one

The easy part is making a copy of all or half of the table to another
database. The hard part is fast deletion (i mean milliseconds,
comparable to TRUNCATE) the data that should not be in a partition (so
that RUN ON ALL functions will continue to return right results).

It would be relatively easy, if we still had the RULES for select
available for plain tables, but even then the eventual cleanup would
usually mean at least 3 passes of disk writes (set xmax, write deleted
flag, vacuum and remove)

What I would like to have is possibility for additional visibility
checks, which would run some simple C function over tuple data (usually
hash(fieldval) + and + or ) and return visibility (is in this partition)
as a result. It would be best if this is run at so low level that also
vacuum would use it and can clean up the foreign partition data in one
pass, without doing the delete dance first.

So finally the QUESTION :

where in code would be the best place to check for this so that

1) both regular queries and VACUUM see it
2) the tuple data (and not only system fields or just xmin/xmax) would
be available for the function to use

--
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: help with plug-in function for additional (partition/shard) visibility checks
Date: 2011-09-02 12:01:51
Message-ID: 689CAF92-6D70-4761-8F64-E47DE3527EF2@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

hello …

i have been thinking about this issue for quite a while ...
given your idea i am not sure how this can work at all.

consider:
begin;
insert 1
insert 2
commit

assume this ends up in the same node,
now you split it into two …
1 and 2 will have exactly the same visibility to and transaction.
i am not sure how you can get this right without looking at the data.

alternative idea: what if the proxy would add / generate a filter by looking at the data?
a quick idea would be that once you split you add a simple directive such as "FILTER GENERATOR $1" or so to the PL/proxy code.
it would then behind the scene arrange the filter passed on.
what do you think?

regards,

hans

On Sep 1, 2011, at 10:13 AM, Hannu Krosing wrote:

> Hallow hackers
>
> I have the following problem to solve and would like to get advice on
> the best way to do it.
>
> The problem:
>
> When growing a pl/proxy based database cluster, one of the main
> operations is splitting a partition. The standard flow is as follows:
>
> 1) make a copy of the partitions table(s) to another database
> 2) reconfigure pl/proxy to use 2 partitions instead of one
>
> The easy part is making a copy of all or half of the table to another
> database. The hard part is fast deletion (i mean milliseconds,
> comparable to TRUNCATE) the data that should not be in a partition (so
> that RUN ON ALL functions will continue to return right results).
>
> It would be relatively easy, if we still had the RULES for select
> available for plain tables, but even then the eventual cleanup would
> usually mean at least 3 passes of disk writes (set xmax, write deleted
> flag, vacuum and remove)
>
> What I would like to have is possibility for additional visibility
> checks, which would run some simple C function over tuple data (usually
> hash(fieldval) + and + or ) and return visibility (is in this partition)
> as a result. It would be best if this is run at so low level that also
> vacuum would use it and can clean up the foreign partition data in one
> pass, without doing the delete dance first.
>
> So finally the QUESTION :
>
> where in code would be the best place to check for this so that
>
> 1) both regular queries and VACUUM see it
> 2) the tuple data (and not only system fields or just xmin/xmax) would
> be available for the function to use
>
>
> --
> -------
> Hannu Krosing
> PostgreSQL Unlimited Scalability and Performance Consultant
> 2ndQuadrant Nordic
> PG Admin Book: http://www.2ndQuadrant.com/books/
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: help with plug-in function for additional (partition/shard) visibility checks
Date: 2011-09-02 12:36:00
Message-ID: 1314966960.3187.17.camel@hvost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2011-09-02 at 14:01 +0200, PostgreSQL - Hans-Jürgen Schönig
wrote:
> hello …
>
> i have been thinking about this issue for quite a while ...
> given your idea i am not sure how this can work at all.
>
> consider:
> begin;
> insert 1
> insert 2
> commit
>
> assume this ends up in the same node,
> now you split it into two …
> 1 and 2 will have exactly the same visibility to and transaction.
> i am not sure how you can get this right without looking at the data.

It has to consider the data when determining visibility, that's the
whole point of the plug-in .

The idea is, that each row "belongs" to a certain partition, as
determined by some function over it's fields. Most often this function
is hash of primary key OR-ed by a bitmap representing cluster size and
AND-ed by bitmap for partition(s) stored in this database.

when you split the parition, then some row's don't belong in the old
partition database anymore (and if you did a full copy, then the other
half dont belong to the new one), so they should be handled as
invisible / deleted. As this can be only done by looking at the tuple
data, this needs an additional visibility function. And as this is only
needed for partitioned databases, it makes sense to implement it as a
plogin, so it would not wast cycles on non-partitioned databases

> alternative idea: what if the proxy would add / generate a filter by
> looking at the data?
> a quick idea would be that once you split you add a simple directive
> such as "FILTER GENERATOR $1" or so to the PL/proxy code.
> it would then behind the scene arrange the filter passed on.
> what do you think?

Hmm. I'm not sure I understand what you are trying to say. Can you
elaborate a little ?

>
> regards,
>
> hans
>
>
>
> On Sep 1, 2011, at 10:13 AM, Hannu Krosing wrote:
>
> > Hallow hackers
> >
> > I have the following problem to solve and would like to get advice on
> > the best way to do it.
> >
> > The problem:
> >
> > When growing a pl/proxy based database cluster, one of the main
> > operations is splitting a partition. The standard flow is as follows:
> >
> > 1) make a copy of the partitions table(s) to another database
> > 2) reconfigure pl/proxy to use 2 partitions instead of one
> >
> > The easy part is making a copy of all or half of the table to another
> > database. The hard part is fast deletion (i mean milliseconds,
> > comparable to TRUNCATE) the data that should not be in a partition (so
> > that RUN ON ALL functions will continue to return right results).
> >
> > It would be relatively easy, if we still had the RULES for select
> > available for plain tables, but even then the eventual cleanup would
> > usually mean at least 3 passes of disk writes (set xmax, write deleted
> > flag, vacuum and remove)
> >
> > What I would like to have is possibility for additional visibility
> > checks, which would run some simple C function over tuple data (usually
> > hash(fieldval) + and + or ) and return visibility (is in this partition)
> > as a result. It would be best if this is run at so low level that also
> > vacuum would use it and can clean up the foreign partition data in one
> > pass, without doing the delete dance first.
> >
> > So finally the QUESTION :
> >
> > where in code would be the best place to check for this so that
> >
> > 1) both regular queries and VACUUM see it
> > 2) the tuple data (and not only system fields or just xmin/xmax) would
> > be available for the function to use
> >
> >
> > --
> > -------
> > Hannu Krosing
> > PostgreSQL Unlimited Scalability and Performance Consultant
> > 2ndQuadrant Nordic
> > PG Admin Book: http://www.2ndQuadrant.com/books/
> >
> >
> > --
> > Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-hackers
> >
>
> --
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt, Austria
> Web: http://www.postgresql-support.de
>
>


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: help with plug-in function for additional (partition/shard) visibility checks
Date: 2011-09-02 12:51:35
Message-ID: 692F6DBD-0DA0-432B-B291-55FB9BC5091C@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

hello …

the goal of the entire proxy thing is to make the right query go to the right node / nodes.
we determine this by using a partitioning function and so on …
currently PL/proxy has only a handful of commands - one is RUN ON … which tells us where to put things.
assume you issue a select … some select will "fall out" on the target node.
to restrict the data coming from the node you could add an additional constraint on the way …

say:
SELECT * FROM proxy_table WHERE a = 20;

what you want to reach the node after a split is …

SELECT * FROM proxy_table WHERE a = 20 AND col = "filter the wrong half away"

my idea is to add an additional command to the PL/proxy command set.
it should call a function generating this additional filter.
maybe somehow like that …

RUN ON hashtext($1) -- this one already knows about the increased cluster
GENERATE FILTER my_create_the_bloody_filter_func($1) -- this one would "massage" the query going to the node.

it would actually open the door for a lot of additional trickery.
the function would tell the proxy what to append - and: this "what" would be under your full control.

what do you think?
i got to think about it futher but i can envision that this could be feasible ...

hans

On Sep 2, 2011, at 2:36 PM, Hannu Krosing wrote:

> On Fri, 2011-09-02 at 14:01 +0200, PostgreSQL - Hans-Jürgen Schönig
> wrote:
>> hello …
>>
>> i have been thinking about this issue for quite a while ...
>> given your idea i am not sure how this can work at all.
>>
>> consider:
>> begin;
>> insert 1
>> insert 2
>> commit
>>
>> assume this ends up in the same node,
>> now you split it into two …
>> 1 and 2 will have exactly the same visibility to and transaction.
>> i am not sure how you can get this right without looking at the data.
>
> It has to consider the data when determining visibility, that's the
> whole point of the plug-in .
>
> The idea is, that each row "belongs" to a certain partition, as
> determined by some function over it's fields. Most often this function
> is hash of primary key OR-ed by a bitmap representing cluster size and
> AND-ed by bitmap for partition(s) stored in this database.
>
> when you split the parition, then some row's don't belong in the old
> partition database anymore (and if you did a full copy, then the other
> half dont belong to the new one), so they should be handled as
> invisible / deleted. As this can be only done by looking at the tuple
> data, this needs an additional visibility function. And as this is only
> needed for partitioned databases, it makes sense to implement it as a
> plogin, so it would not wast cycles on non-partitioned databases
>
>> alternative idea: what if the proxy would add / generate a filter by
>> looking at the data?
>> a quick idea would be that once you split you add a simple directive
>> such as "FILTER GENERATOR $1" or so to the PL/proxy code.
>> it would then behind the scene arrange the filter passed on.
>> what do you think?
>
> Hmm. I'm not sure I understand what you are trying to say. Can you
> elaborate a little ?
>
>>
>> regards,
>>
>> hans
>>
>>
>>
>> On Sep 1, 2011, at 10:13 AM, Hannu Krosing wrote:
>>
>>> Hallow hackers
>>>
>>> I have the following problem to solve and would like to get advice on
>>> the best way to do it.
>>>
>>> The problem:
>>>
>>> When growing a pl/proxy based database cluster, one of the main
>>> operations is splitting a partition. The standard flow is as follows:
>>>
>>> 1) make a copy of the partitions table(s) to another database
>>> 2) reconfigure pl/proxy to use 2 partitions instead of one
>>>
>>> The easy part is making a copy of all or half of the table to another
>>> database. The hard part is fast deletion (i mean milliseconds,
>>> comparable to TRUNCATE) the data that should not be in a partition (so
>>> that RUN ON ALL functions will continue to return right results).
>>>
>>> It would be relatively easy, if we still had the RULES for select
>>> available for plain tables, but even then the eventual cleanup would
>>> usually mean at least 3 passes of disk writes (set xmax, write deleted
>>> flag, vacuum and remove)
>>>
>>> What I would like to have is possibility for additional visibility
>>> checks, which would run some simple C function over tuple data (usually
>>> hash(fieldval) + and + or ) and return visibility (is in this partition)
>>> as a result. It would be best if this is run at so low level that also
>>> vacuum would use it and can clean up the foreign partition data in one
>>> pass, without doing the delete dance first.
>>>
>>> So finally the QUESTION :
>>>
>>> where in code would be the best place to check for this so that
>>>
>>> 1) both regular queries and VACUUM see it
>>> 2) the tuple data (and not only system fields or just xmin/xmax) would
>>> be available for the function to use
>>>
>>>
>>> --
>>> -------
>>> Hannu Krosing
>>> PostgreSQL Unlimited Scalability and Performance Consultant
>>> 2ndQuadrant Nordic
>>> PG Admin Book: http://www.2ndQuadrant.com/books/
>>>
>>>
>>> --
>>> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-hackers
>>>
>>
>> --
>> Cybertec Schönig & Schönig GmbH
>> Gröhrmühlgasse 26
>> A-2700 Wiener Neustadt, Austria
>> Web: http://www.postgresql-support.de
>>
>>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: help with plug-in function for additional (partition/shard) visibility checks
Date: 2011-09-02 12:59:40
Message-ID: 1314968380.3187.20.camel@hvost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2011-09-02 at 14:51 +0200, PostgreSQL - Hans-Jürgen Schönig
wrote:
> hello …
>
> the goal of the entire proxy thing is to make the right query go to the right node / nodes.
> we determine this by using a partitioning function and so on …
> currently PL/proxy has only a handful of commands - one is RUN ON … which tells us where to put things.
> assume you issue a select … some select will "fall out" on the target node.
> to restrict the data coming from the node you could add an additional constraint on the way …
>
> say:
> SELECT * FROM proxy_table WHERE a = 20;
>
> what you want to reach the node after a split is …
>
> SELECT * FROM proxy_table WHERE a = 20 AND col = "filter the wrong half away"
>
> my idea is to add an additional command to the PL/proxy command set.
> it should call a function generating this additional filter.
> maybe somehow like that …
>
> RUN ON hashtext($1) -- this one already knows about the increased cluster
> GENERATE FILTER my_create_the_bloody_filter_func($1) -- this one would "massage" the query going to the node.
>
> it would actually open the door for a lot of additional trickery.
> the function would tell the proxy what to append - and: this "what" would be under your full control.
>
> what do you think?

Hmm, could work for simplest cases, but this has 2 main problems:

1) you need a full SQL parser to make this generally useful for plain
SQL

and

2) it still won't work for pl/proxy's main usecase - calling the same
_function_ on partition.

> i got to think about it futher but i can envision that this could be feasible ...
>
> hans
>
>
> On Sep 2, 2011, at 2:36 PM, Hannu Krosing wrote:
>
> > On Fri, 2011-09-02 at 14:01 +0200, PostgreSQL - Hans-Jürgen Schönig
> > wrote:
> >> hello …
> >>
> >> i have been thinking about this issue for quite a while ...
> >> given your idea i am not sure how this can work at all.
> >>
> >> consider:
> >> begin;
> >> insert 1
> >> insert 2
> >> commit
> >>
> >> assume this ends up in the same node,
> >> now you split it into two …
> >> 1 and 2 will have exactly the same visibility to and transaction.
> >> i am not sure how you can get this right without looking at the data.
> >
> > It has to consider the data when determining visibility, that's the
> > whole point of the plug-in .
> >
> > The idea is, that each row "belongs" to a certain partition, as
> > determined by some function over it's fields. Most often this function
> > is hash of primary key OR-ed by a bitmap representing cluster size and
> > AND-ed by bitmap for partition(s) stored in this database.
> >
> > when you split the parition, then some row's don't belong in the old
> > partition database anymore (and if you did a full copy, then the other
> > half dont belong to the new one), so they should be handled as
> > invisible / deleted. As this can be only done by looking at the tuple
> > data, this needs an additional visibility function. And as this is only
> > needed for partitioned databases, it makes sense to implement it as a
> > plogin, so it would not wast cycles on non-partitioned databases
> >
> >> alternative idea: what if the proxy would add / generate a filter by
> >> looking at the data?
> >> a quick idea would be that once you split you add a simple directive
> >> such as "FILTER GENERATOR $1" or so to the PL/proxy code.
> >> it would then behind the scene arrange the filter passed on.
> >> what do you think?
> >
> > Hmm. I'm not sure I understand what you are trying to say. Can you
> > elaborate a little ?
> >
> >>
> >> regards,
> >>
> >> hans
> >>
> >>
> >>
> >> On Sep 1, 2011, at 10:13 AM, Hannu Krosing wrote:
> >>
> >>> Hallow hackers
> >>>
> >>> I have the following problem to solve and would like to get advice on
> >>> the best way to do it.
> >>>
> >>> The problem:
> >>>
> >>> When growing a pl/proxy based database cluster, one of the main
> >>> operations is splitting a partition. The standard flow is as follows:
> >>>
> >>> 1) make a copy of the partitions table(s) to another database
> >>> 2) reconfigure pl/proxy to use 2 partitions instead of one
> >>>
> >>> The easy part is making a copy of all or half of the table to another
> >>> database. The hard part is fast deletion (i mean milliseconds,
> >>> comparable to TRUNCATE) the data that should not be in a partition (so
> >>> that RUN ON ALL functions will continue to return right results).
> >>>
> >>> It would be relatively easy, if we still had the RULES for select
> >>> available for plain tables, but even then the eventual cleanup would
> >>> usually mean at least 3 passes of disk writes (set xmax, write deleted
> >>> flag, vacuum and remove)
> >>>
> >>> What I would like to have is possibility for additional visibility
> >>> checks, which would run some simple C function over tuple data (usually
> >>> hash(fieldval) + and + or ) and return visibility (is in this partition)
> >>> as a result. It would be best if this is run at so low level that also
> >>> vacuum would use it and can clean up the foreign partition data in one
> >>> pass, without doing the delete dance first.
> >>>
> >>> So finally the QUESTION :
> >>>
> >>> where in code would be the best place to check for this so that
> >>>
> >>> 1) both regular queries and VACUUM see it
> >>> 2) the tuple data (and not only system fields or just xmin/xmax) would
> >>> be available for the function to use
> >>>
> >>>
> >>> --
> >>> -------
> >>> Hannu Krosing
> >>> PostgreSQL Unlimited Scalability and Performance Consultant
> >>> 2ndQuadrant Nordic
> >>> PG Admin Book: http://www.2ndQuadrant.com/books/
> >>>
> >>>
> >>> --
> >>> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> >>> To make changes to your subscription:
> >>> http://www.postgresql.org/mailpref/pgsql-hackers
> >>>
> >>
> >> --
> >> Cybertec Schönig & Schönig GmbH
> >> Gröhrmühlgasse 26
> >> A-2700 Wiener Neustadt, Austria
> >> Web: http://www.postgresql-support.de
> >>
> >>
> >
> >
> >
> > --
> > Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-hackers
> >
>
> --
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt, Austria
> Web: http://www.postgresql-support.de
>
>


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: help with plug-in function for additional (partition/shard) visibility checks
Date: 2011-09-02 13:17:10
Message-ID: 65DE60BF-71CE-49B1-9068-E0FF837BA257@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Sep 2, 2011, at 2:59 PM, Hannu Krosing wrote:

> On Fri, 2011-09-02 at 14:51 +0200, PostgreSQL - Hans-Jürgen Schönig
> wrote:
>> hello …
>>
>> the goal of the entire proxy thing is to make the right query go to the right node / nodes.
>> we determine this by using a partitioning function and so on …
>> currently PL/proxy has only a handful of commands - one is RUN ON … which tells us where to put things.
>> assume you issue a select … some select will "fall out" on the target node.
>> to restrict the data coming from the node you could add an additional constraint on the way …
>>
>> say:
>> SELECT * FROM proxy_table WHERE a = 20;
>>
>> what you want to reach the node after a split is …
>>
>> SELECT * FROM proxy_table WHERE a = 20 AND col = "filter the wrong half away"
>>
>> my idea is to add an additional command to the PL/proxy command set.
>> it should call a function generating this additional filter.
>> maybe somehow like that …
>>
>> RUN ON hashtext($1) -- this one already knows about the increased cluster
>> GENERATE FILTER my_create_the_bloody_filter_func($1) -- this one would "massage" the query going to the node.
>>
>> it would actually open the door for a lot of additional trickery.
>> the function would tell the proxy what to append - and: this "what" would be under your full control.
>>
>> what do you think?
>
> Hmm, could work for simplest cases, but this has 2 main problems:
>
> 1) you need a full SQL parser to make this generally useful for plain
> SQL

i think that everything beyond a simple case is pretty hard to achieve anyway.
to me it looks pretty impossible to solve this in a generic way without same insane amount of labor input - at listen given the ideas coming to me in the past.
and yes, functions are an issue. unless you have some sort of "virtually private database" thing it is close to impossible (unless you want to try some nightmare based on views / constraint exclusion on the partitions or so).

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de