Re: intermittent performance problem

From: Mike Charnoky <noky(at)nextbus(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: intermittent performance problem
Date: 2009-03-10 02:10:53
Message-ID: 49B5CC2D.4070905@nextbus.com
Lists: pgsql-general

Yeah, I wish I didn't have to resort to using ORDER BY RANDOM(). I
really wanted to use something like TABLESAMPLE, but that is not
implemented in PostgreSQL. Unfortunately, I cannot use the random
sampling technique you mentioned, since I need to select samples across
various strata of the data (in this case, where "item_name=something"),
not just between timestamp ranges. Guess I'll just have to try kicking
up the work_mem for that query.
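
For reference, bumping work_mem for just that one query would look
something like this (SET LOCAL reverts when the transaction ends; 256MB is
just a point in the 128-512MB range you suggested, and the column lists are
elided as above):

BEGIN;
-- Only affects this transaction; the global work_mem setting is untouched.
SET LOCAL work_mem = '256MB';
INSERT INTO sampled_data
(item_name, timestmp, ... )
SELECT item_name, timestmp, ...
FROM raw_data
WHERE timestmp >= ? and timestmp < ?
AND item_name=?
AND some_data_field NOTNULL
ORDER BY random()
LIMIT ?;
COMMIT;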

Thanks so much for your input.

Mike

Scott Marlowe wrote:
> On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky(at)nextbus(dot)com> wrote:
>> Hello,
>>
>> I'm looking for some insight on an intermittent PostgreSQL performance
>> problem that has been very troublesome. Using PG 8.3.5 on a server
>> running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
>> SCSI 600GB array).
>>
>> The problem in a nutshell is this: on some days, a nightly sampling
>> process (which reads data from one very large table and writes to
>> another) runs about 2 orders of magnitude slower and disk IO goes
>> through the roof. The sampling process (which starts at 1am and usually
>> takes ~30 minutes) takes many hours to complete and eventually ends up
>> interfering with other processes that need to access the database.
>> Other processes which need to write to the db get backed up and
>> eventually data gets dropped (i.e., in-memory queues waiting for db writes
>> get filled up).
>>
>> The problem only happens on maybe one or two random days during the
>> week. There is generally no other db activity during this time
>> (pg_stat_activity shows mostly idle connections). It seems as if db
>> cache is not being used properly and heavy disk usage results. Not sure
>> how to test this assumption.
>>
>> Details are as follows:
>>
>> 1) The db contains a "raw_data" table which stores on the order of 10-15
>> million rows per day. A total of two weeks of data is stored; total
>> table size is about 40GB (with indices).
>> 2) Every day, a process runs which samples data from the "raw_data"
>> table and stores it to the "sampled_data" table for long term storage.
>> The sampling is done across two strata: time of day (time range) and
>> name of item (text field). That is, the day is divided into about 5
>> chunks to ensure that we sample enough data for each chunk of time, for
>> each item.
>> 3) The sampling process happens in two passes. The first pass looks at
>> the data in order to determine the sample size required for each (time
>> of day, item name). This consists of running some aggregate queries
>> over the entire dataset to be sampled (std dev, count, etc). Sample
>> sizes are calculated for all items at once for a given chunk of time.
>> The second pass actually performs random sampling of the data and stores
>> the samples in the "sampled_data" table. It is this second pass of the
>> sampling process that is running about 2 orders of magnitude slower!
>> 4) After the sampling process finishes, the oldest day's worth of data
>> is deleted from the "raw_data" table.
>>
>> The sampling query which runs really slow on some days looks something
>> like this:
>>
>> INSERT INTO sampled_data
>> (item_name, timestmp, ... )
>> SELECT item_name, timestmp, ...
>> FROM raw_data
>> WHERE timestmp >= ? and timestmp < ?
>> AND item_name=?
>> AND some_data_field NOTNULL
>> ORDER BY random()
>> LIMIT ?;
>
> Have you got any other method for doing the sampling than order by
> random()? order by random() is the most inefficient way you could
> possibly do this. If you know the range of, say, ids:
>
> select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?
>
> to get it. Then use a random number generator to generate a list of
> ids between those two ids, and select x rows from the database.
>
> select * from rawtable where id in (1000 ids be here);
>
> Will be WAY faster than order by random().
>
>> Changes made to postgresql.conf include:
>> max_connections = 250
>> shared_buffers = 1024MB
>> work_mem = 32MB
>
> If you are married to order by random() then you might wanna crank up
> work_mem while running that query. I'd try something in the 128 to
> 512M range to start with.
>
>> Any pointers on how to troubleshoot this?
>
> Try methods that don't involve order by random().
>
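
For reference, the id-based sampling described above might look something
like this, assuming raw_data has an integer primary key named "id" (not
confirmed in this thread) and drawing the random ids in SQL rather than in
the application:

-- Find the id range covering the time window, draw ~1000 random ids in
-- that range, and fetch those rows.  Gaps in the id sequence and duplicate
-- draws mean this can return slightly fewer rows than requested.
SELECT r.*
FROM raw_data r
WHERE r.id IN (
    SELECT (b.min_id + floor(random() * (b.max_id - b.min_id + 1)))::bigint
    FROM (SELECT min(id) AS min_id, max(id) AS max_id
          FROM raw_data
          WHERE timestmp >= ? and timestmp < ?) AS b,
         generate_series(1, 1000)
);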
