share nothing but reads cluster

From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: <pgsql-general(at)postgresql(dot)org>
Subject: share nothing but reads cluster
Date: 2011-09-15 13:40:45
Message-ID: C4DAC901169B624F933534A26ED7DF310861B0B9@JENMAIL01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

in order to face our data growth we want to start evaluate cluster
possibilities.

our application is mainly an ETL and reporting Tool for logs content.
all large tables are timely partitioned, whereas the partition logic is
driven by the application to avoid trigger overhead.

each customer gets its own schema, the largest being approx. 1TB in
size.

One of the main component is a query framework that analyze the client
request and compute the related SELECT statement.
simple SELECTs have such a general form (but there also are some much
more tricky cases) :

SELECT *mainAggregate*
FROM
(
(select *subAggregate* from table1 where ...)
UNION ALL
(select *subAggregate* from table2 where ...)
UNION ALL
(select *subAggregate* from table3 where ...)
...
)
WHERE ...



One idea is to split and process the logs on different servers and just
add the corresponding address
within the partition list on the "master".

The query framework should then be able to distribute the sub queries on
the remote server accordingly.

A quick test with db_link shows a severe drop down in performance while
putting the data together:

select count(*) from
(
select * from dblink('a','select * from test_cluster')as t1(a int)
union all
select * from dblink('a','select * from test_cluster')as t1(a int)
union all
select * from dblink('a','select * from test_cluster')as t1(a int)
)foo

is about 5 times slower than an equivalent query run locally.

working with asynchron. queries (dblink_send_query) does not bring much
benefit
so that much time seems to be spent for transfer and merge.

I've also taken a quick look at the SQL/MED and PL/Proxy documentation,
but as far as I've seen,
these features expect predefined column definition and do not allow to
push arbitrary SQLs statement,
especially the aggregate part that could take advantage of the foreign
resources.
Running the foreign sub queries concurrently would also be a fine thing.

Are there other way we should evaluate ?
Should we better wait foir POstgres 9.2+ ?

best regards,

Marc Mamin

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ondrej Ivanič 2011-09-15 13:55:29 Re: share nothing but reads cluster
Previous Message Rich Shepard 2011-09-15 13:24:42 Re: Cryptic Error Message Importing Table Dump