Re: Dealing with big tables

From: Sami Dalouche <skoobi(at)free(dot)fr>
To: Mindaugas <ml(at)kilimas(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Dealing with big tables
Date: 2007-12-02 11:05:35
Message-ID: 1196593535.7946.8.camel@whiteair
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

my answer may be out of topic since you might be looking for a
postgres-only solution.. But just in case....

What are you trying to achieve exactly ? Is there any way you could
re-work your algorithms to avoid selects and use a sequential scan
(consider your postgres data as one big file) to retrieve each of the
rows, analyze / compute them (possibly in a distributed manner), and
join the results at the end ?

A few pointers :
http://lucene.apache.org/hadoop/
http://www.gridgain.com/

Regards,
Sami Dalouche

On Sun, 2007-12-02 at 12:26 +0200, Mindaugas wrote:
> Hello,
>
> Started to work with big tables (like 300GB) and performance problems started to appear. :(
>
> To simplify things - table has an index on From an index on To columns. And it also have several other not indexed columns. There are 100000+ of different values for From and the same for To.
>
> I execute simple query "select * from bigtable where From='something'". Query returns like 1000 rows and takes 5++ seconds to complete. As far as I understand the query is slow because:
> - first it has to retrieve pointers to rows with data from index. That goes fast.
> - retrieve all the rows one by one. There we have 100% random read because rows with the same From is distributed evenly through all the 300GB and most probably nothing is cached. So there we are limited by _one_ disk performance independently of how many disks we have in storage? And in case storage consists of 15k rpm Cheetahs with 3.5ms average read seek time we should expect not more than ~285 rows per second?
>
> I feel that I'm overlooking something here. But I'm new into data warehousing. :)
>
> Also this query should greatly benefit from parallel execution or async IO. Storage (seeks/second) scales almost linearly when it has a lot of disks. And query is completely IO bound so it should scale well on single server.
>
> And I cannot use some index organized table or table partitioned by From :) because there are at least 2 similar indexes by which queries can be executed - From and To.
>
> Ideas for improvement? Greenplum or EnterpriseDB? Or I forgot something from PostgreSQL features.
>
> Thanks,
>
> Mindaugas
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Mindaugas 2007-12-02 11:37:56 Re: Dealing with big tables
Previous Message Mindaugas 2007-12-02 10:26:17 Dealing with big tables