From: Mike Charnoky <noky(at)nextbus(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: intermittant performance problem
Date: 2009-03-09 19:55:46
Message-ID: 49B57442.2010909@nextbus.com
Lists: pgsql-general

Hello,

I'm looking for some insight on an intermittent PostgreSQL performance
problem that has been very troublesome. Using PG 8.3.5 on a server
running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
SCSI 600GB array).

The problem in a nutshell is this: on some days, a nightly sampling
process (which reads data from one very large table and writes to
another) runs about 2 orders of magnitude slower and disk IO goes
through the roof. The sampling process (which starts at 1am and usually
takes ~30 minutes) takes many hours to complete and eventually ends up
interfering with other processes that need to access the database.
Other processes which need to write to the db get backed up and
eventually data gets dropped (i.e., in-memory queues waiting for db
writes fill up).

The problem only happens on maybe one or two random days during the
week. There is generally no other db activity during this time
(pg_stat_activity shows mostly idle connections). It seems as if db
cache is not being used properly and heavy disk usage results. Not sure
how to test this assumption.
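One way I could test this (just a sketch; the counters in
pg_statio_user_tables are cumulative since the last stats reset, so I'd
snapshot them before and after the sampling run and compare):

SELECT relname,
       heap_blks_read, heap_blks_hit,
       idx_blks_read, idx_blks_hit,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS heap_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'raw_data';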

Details are as follows:

1) The db contains a "raw_data" table which stores on the order of 10-15
million rows per day. A total of two weeks of data are stored, total
table size is about 40GB (with indices).
2) Every day, a process runs which samples data from the "raw_data"
table and stores it to the "sampled_data" table for long term storage.
The sampling is done across two strata: time of day (time range) and
name of item (text field). That is, the day is divided into about 5
chunks to ensure that we sample enough data for each chunk of time, for
each item.
3) The sampling process happens in two passes. The first pass looks at
the data in order to determine the sample size required for each (time
of day, item name). This consists of running some aggregate queries
over the entire dataset to be sampled (std dev, count, etc). Sample
sizes are calculated for all items at once for a given chunk of time.
The second pass actually performs random sampling of the data and stores
the samples in the "sampled_data" table. It is this second pass of the
sampling process that is running about 2 orders of magnitude slower!
4) After the sampling process finishes, the oldest day's worth of data
is deleted from the "raw_data" table.

The sampling query which runs really slow on some days looks something
like this:

INSERT INTO sampled_data
(item_name, timestmp, ... )
SELECT item_name, timestmp, ...
FROM raw_data
WHERE timestmp >= ? and timestmp < ?
AND item_name=?
AND some_data_field NOTNULL
ORDER BY random()
LIMIT ?;

We have done a great deal of PG tuning, including tuning autovacuum for
the "raw_data" table. Autovacuum kicks in like clockwork every day on
that table after the sampling process finishes (after one day's worth of
data is deleted from the "raw_data" table, a roughly 7% change in size).

Changes made to postgresql.conf include:
max_connections = 250
shared_buffers = 1024MB
work_mem = 32MB
maintenance_work_mem = 256MB
max_fsm_pages = 10000000
max_fsm_relations = 30000
checkpoint_segments = 64
checkpoint_timeout = 10min
checkpoint_warning = 1min

Any pointers on how to troubleshoot this?

Mike


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Mike Charnoky <noky(at)nextbus(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: intermittant performance problem
Date: 2009-03-09 23:08:50
Message-ID: dcc563d10903091608l207b97d5k1593a740212cd869@mail.gmail.com
Lists: pgsql-general

On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky(at)nextbus(dot)com> wrote:
> Hello,
>
> I'm looking for some insight on an intermittent PostgreSQL performance
> problem that has been very troublesome.  Using PG 8.3.5 on a server
> running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
> SCSI 600GB array).
>
> The problem in a nutshell is this: on some days, a nightly sampling
> process (which reads data from one very large table and writes to
> another) runs about 2 orders of magnitude slower and disk IO goes
> through the roof.  The sampling process (which starts at 1am and usually
> takes ~30 minutes) takes many hours to complete and eventually ends up
> interfering with other processes that need to access the database.
> Other processes which need to write to the db get backed up and
> eventually data gets dropped (ie: in memory queues waiting for db writes
> get filled up).
>
> The problem only happens on maybe one or two random days during the
> week.  There is generally no other db activity during this time
> (pg_stat_activity shows mostly idle connections).  It seems as if db
> cache is not being used properly and heavy disk usage results.  Not sure
> how to test this assumption.
>
> Details are as follows:
>
> 1) The db contains a "raw_data" table which stores on the order of 10-15
> million rows per day.  A total of two weeks of data are stored, total
> table size is about 40GB (with indices).
> 2) Every day, a process runs which samples data from the "raw_data"
> table and stores it to the "sampled_data" table for long term storage.
> The sampling is done across two strata: time of day (time range) and
> name of item (text field).  That is, the day is divided into about 5
> chunks to ensure that we sample enough data for each chunk of time, for
> each item.
> 3) The sampling process happens in two passes.  The first pass looks at
> the data in order to determine the sample size required for each (time
> of day, item name).  This consists of running some aggregate queries
> over the entire dataset to be sampled (std dev, count, etc).  Sample
> sizes are calculated for all items at once for a given chunk of time.
> The second pass actually performs random sampling of the data and stores
> the samples in the "sampled_data" table.  It is this second pass of the
> sampling process that is running about 2 orders of magnitude slower!
> 4) After the sampling process finishes, the oldest day's worth of data
> is deleted from the "raw_data" table.
>
> The sampling query which runs really slow on some days looks something
> like this:
>
> INSERT INTO sampled_data
>  (item_name, timestmp, ... )
>  SELECT item_name, timestmp, ... )
>  FROM raw_data
>  WHERE timestmp >= ? and timestmp < ?
>  AND item_name=?
>  AND some_data_field NOTNULL
>  ORDER BY random()
>  LIMIT ?;

Have you got any other method for doing the sampling than order by
random()? order by random() is the most inefficient way you could
possibly do this. If you know the range of, say, ids, run:

select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?

to get it. Then use a random number generator to generate a list of
ids between those two, and select x rows from the database:

select * from rawtable where id in (/* 1000 ids here */);

That will be WAY faster than order by random().
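Translated to your schema, a rough sketch might look like this (assuming
raw_data has a serial primary key called id; adjust names to whatever you
actually have):

-- 1) find the id range covering the time window
select min(id), max(id)
  from raw_data
 where timestmp >= ? and timestmp < ?;

-- 2) have the client generate random ids in that range (a few more than
--    you need, since some ids may no longer exist), then fetch the rows
--    directly by primary key
select item_name, timestmp /* , ... */
  from raw_data
 where id in ( /* the randomly generated ids */ )
   and some_data_field notnull;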

> Changes made to postgresql.conf include:
> max_connections = 250
> shared_buffers = 1024MB
> work_mem = 32MB

If you are married to order by random() then you might wanna crank up
work_mem while running that query. I'd try something in the 128 to
512M range to start with.
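You can also do that just for the sampling session rather than globally;
for example (only a sketch, pick whatever value your 4GB of RAM can spare):

set work_mem = '256MB';   -- affects only this session
-- run the ORDER BY random() sampling query here
reset work_mem;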

> Any pointers on how to troubleshoot this?

Try methods that don't involve order by random().


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mike Charnoky <noky(at)nextbus(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: intermittant performance problem
Date: 2009-03-10 00:00:20
Message-ID: 18232.1236643220@sss.pgh.pa.us
Lists: pgsql-general

Mike Charnoky <noky(at)nextbus(dot)com> writes:
> The sampling query which runs really slow on some days looks something
> like this:

> INSERT INTO sampled_data
> (item_name, timestmp, ... )
> SELECT item_name, timestmp, ... )
> FROM raw_data
> WHERE timestmp >= ? and timestmp < ?
> AND item_name=?
> AND some_data_field NOTNULL
> ORDER BY random()
> LIMIT ?;

Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
There's no good way to optimize "ORDER BY random()". However, it seems
like the first thing you should do is modify the program so that it
issues an EXPLAIN for that right before actually doing the query, and
then you could see if the plan is different on the slow days.
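For example (a sketch only, with made-up literal values standing in for
the ? parameters and the elided column list), the program could log the
output of:

EXPLAIN
INSERT INTO sampled_data (item_name, timestmp /* , ... */)
SELECT item_name, timestmp /* , ... */
FROM raw_data
WHERE timestmp >= '2009-03-09 01:00' AND timestmp < '2009-03-09 06:00'
  AND item_name = 'some item'
  AND some_data_field NOTNULL
ORDER BY random()
LIMIT 2000;

right before running the real INSERT, and compare the plans from fast and
slow nights.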

> We have done a great deal of PG tuning, including the autovacuum for the
> "raw_data" table. Autovacuum kicks like clockwork every day on that
> table after the sampling process finishes (after one day's worth of data
> is deleted from "raw_data" table, a roughly 7% change in size).

Also, are you sure you have ruled out the possibility that the problem
comes from autovac kicking in *while* the update is running?
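A quick way to check during a slow run (a sketch; in 8.3 autovacuum
workers show up in pg_stat_activity with their query text) would be:

SELECT procpid, usename, current_query, query_start
FROM pg_stat_activity
WHERE current_query LIKE 'autovacuum:%';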

regards, tom lane


From: Mike Charnoky <noky(at)nextbus(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: intermittant performance problem
Date: 2009-03-10 02:10:53
Message-ID: 49B5CC2D.4070905@nextbus.com
Lists: pgsql-general

Yeah, I wish I didn't have to resort to using ORDER BY RANDOM(). I
really wanted to use something like TABLESAMPLE, but that is not
implemented in PostgreSQL. Unfortunately, I cannot use the random
sampling technique you mentioned, since I need to select samples across
various strata of the data (in this case, where "item_name=something"),
not just between timestamp ranges. Guess I'll just have to try kicking
up the work_mem for that query.

Thanks so much for your input.

Mike

Scott Marlowe wrote:
> On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky(at)nextbus(dot)com> wrote:
>> Hello,
>>
>> I'm looking for some insight on an intermittent PostgreSQL performance
>> problem that has been very troublesome. Using PG 8.3.5 on a server
>> running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
>> SCSI 600GB array).
>>
>> The problem in a nutshell is this: on some days, a nightly sampling
>> process (which reads data from one very large table and writes to
>> another) runs about 2 orders of magnitude slower and disk IO goes
>> through the roof. The sampling process (which starts at 1am and usually
>> takes ~30 minutes) takes many hours to complete and eventually ends up
>> interfering with other processes that need to access the database.
>> Other processes which need to write to the db get backed up and
>> eventually data gets dropped (ie: in memory queues waiting for db writes
>> get filled up).
>>
>> The problem only happens on maybe one or two random days during the
>> week. There is generally no other db activity during this time
>> (pg_stat_activity shows mostly idle connections). It seems as if db
>> cache is not being used properly and heavy disk usage results. Not sure
>> how to test this assumption.
>>
>> Details are as follows:
>>
>> 1) The db contains a "raw_data" table which stores on the order of 10-15
>> million rows per day. A total of two weeks of data are stored, total
>> table size is about 40GB (with indices).
>> 2) Every day, a process runs which samples data from the "raw_data"
>> table and stores it to the "sampled_data" table for long term storage.
>> The sampling is done across two strata: time of day (time range) and
>> name of item (text field). That is, the day is divided into about 5
>> chunks to ensure that we sample enough data for each chunk of time, for
>> each item.
>> 3) The sampling process happens in two passes. The first pass looks at
>> the data in order to determine the sample size required for each (time
>> of day, item name). This consists of running some aggregate queries
>> over the entire dataset to be sampled (std dev, count, etc). Sample
>> sizes are calculated for all items at once for a given chunk of time.
>> The second pass actually performs random sampling of the data and stores
>> the samples in the "sampled_data" table. It is this second pass of the
>> sampling process that is running about 2 orders of magnitude slower!
>> 4) After the sampling process finishes, the oldest day's worth of data
>> is deleted from the "raw_data" table.
>>
>> The sampling query which runs really slow on some days looks something
>> like this:
>>
>> INSERT INTO sampled_data
>> (item_name, timestmp, ... )
>> SELECT item_name, timestmp, ... )
>> FROM raw_data
>> WHERE timestmp >= ? and timestmp < ?
>> AND item_name=?
>> AND some_data_field NOTNULL
>> ORDER BY random()
>> LIMIT ?;
>
> Have you got any other method for doing the sampling that order by
> random()? order by random() is the most inefficient way you could
> possibly do this. If you know the range of say, ids:
>
> select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?
>
> to get it. Then use a random number generator to generate a list of
> ids between those two ids, and select x rows from the database.
>
> select * from rawtable where id in (1000 ids be here);
>
> Will be WAY faster than order by random().
>
>> Changes made to postgresql.conf include:
>> max_connections = 250
>> shared_buffers = 1024MB
>> work_mem = 32MB
>
> If you are married to order by random() then you might wanna crank up
> work_mem while running that query. I'd try something in the 128 to
> 512M range to start with.
>
>> Any pointers on how to troubleshoot this?
>
> Try methods that don't involve order by random().
>


From: Mike Charnoky <noky(at)nextbus(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: intermittant performance problem
Date: 2009-03-10 02:21:25
Message-ID: 49B5CEA5.9010306@nextbus.com
Lists: pgsql-general

The random sampling query is normally pretty snappy. It usually takes
on the order of 1 second to sample a few thousand rows of data out of a
few million. The sampling is consistently quick, too. However, on some
days, the sampling starts off quick, then when the process starts
sampling from a different subset of data (different range of times for
the same day), the sampling query takes a couple minutes.

Regarding the concurrent vacuuming, this is definitely not happening. I
always check pg_stat_activity whenever the sampling process starts to
lag behind. I have never seen a vacuum running during this time.

Interesting idea to issue the EXPLAIN first... I will see if I can
instrument the sampling program to do this.

Thanks for your help Tom.

Mike

Tom Lane wrote:
> Mike Charnoky <noky(at)nextbus(dot)com> writes:
>> The sampling query which runs really slow on some days looks something
>> like this:
>
>> INSERT INTO sampled_data
>> (item_name, timestmp, ... )
>> SELECT item_name, timestmp, ... )
>> FROM raw_data
>> WHERE timestmp >= ? and timestmp < ?
>> AND item_name=?
>> AND some_data_field NOTNULL
>> ORDER BY random()
>> LIMIT ?;
>
> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
> There's no good way to optimize "ORDER BY random()". However, it seems
> like the first thing you should do is modify the program so that it
> issues an EXPLAIN for that right before actually doing the query, and
> then you could see if the plan is different on the slow days.
>
>> We have done a great deal of PG tuning, including the autovacuum for the
>> "raw_data" table. Autovacuum kicks like clockwork every day on that
>> table after the sampling process finishes (after one day's worth of data
>> is deleted from "raw_data" table, a roughly 7% change in size).
>
> Also, are you sure you have ruled out the possibility that the problem
> comes from autovac kicking in *while* the update is running?
>
> regards, tom lane
>


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Mike Charnoky <noky(at)nextbus(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: intermittant performance problem
Date: 2009-03-10 03:35:56
Message-ID: dcc563d10903092035gf777ef4y893c2ba20fe8e9c8@mail.gmail.com
Lists: pgsql-general

On Mon, Mar 9, 2009 at 8:21 PM, Mike Charnoky <noky(at)nextbus(dot)com> wrote:
> The random sampling query is normally pretty snappy.  It usually takes on
> the order of 1 second to sample a few thousand rows of data out of a few
> million.  The sampling is consistently quick, too.  However, on some days,
> the sampling starts off quick, then when the process starts sampling from a
> different subset of data (different range of times for the same day), the
> sampling query takes a couple minutes.

Then definitely look at saving explain plans before execution to
compare fast and slow runs. This sounds like occasional bad query
plans to me so far.

> Regarding the concurrent vacuuming, this is definitely not happening.  I
> always check pg_stat_activity whenever the sampling process starts to lag
> behind.  I have never seen a vacuum running during this time.

And if autovac is getting in the way, try adjusting the various
autovac options, specifically autovacuum_vacuum_cost_delay set to 10 or
20 ms.
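In postgresql.conf that would look something like this (only a starting
point; reload the config after changing it):

autovacuum_vacuum_cost_delay = 20ms   # or 10ms; makes autovacuum yield more often during its IO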

>
> Interesting idea to issue the EXPLAIN first... I will see if I can
> instrument the sampling program to do this.
>
> Thanks for your help Tom.
>
>
> Mike
>
> Tom Lane wrote:
>>
>> Mike Charnoky <noky(at)nextbus(dot)com> writes:
>>>
>>> The sampling query which runs really slow on some days looks something
>>> like this:
>>
>>> INSERT INTO sampled_data
>>>   (item_name, timestmp, ... )
>>>   SELECT item_name, timestmp, ... )
>>>   FROM raw_data
>>>   WHERE timestmp >= ? and timestmp < ?
>>>   AND item_name=?
>>>   AND some_data_field NOTNULL
>>>   ORDER BY random()
>>>   LIMIT ?;
>>
>> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
>> There's no good way to optimize "ORDER BY random()".  However, it seems
>> like the first thing you should do is modify the program so that it
>> issues an EXPLAIN for that right before actually doing the query, and
>> then you could see if the plan is different on the slow days.
>>
>>> We have done a great deal of PG tuning, including the autovacuum for the
>>> "raw_data" table.  Autovacuum kicks like clockwork every day on that
>>> table after the sampling process finishes (after one day's worth of data
>>> is deleted from "raw_data" table, a roughly 7% change in size).
>>
>> Also, are you sure you have ruled out the possibility that the problem
>> comes from autovac kicking in *while* the update is running?
>>
>>                        regards, tom lane
>>
>

--
When fascism comes to America, it will be the intolerant selling it as
diversity.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mike Charnoky <noky(at)nextbus(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: intermittant performance problem
Date: 2009-03-10 12:22:56
Message-ID: 87ljrd7ejj.fsf@oxford.xeocode.com
Lists: pgsql-general

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Mike Charnoky <noky(at)nextbus(dot)com> writes:
>> The sampling query which runs really slow on some days looks something
>> like this:
>
>> INSERT INTO sampled_data
>> (item_name, timestmp, ... )
>> SELECT item_name, timestmp, ... )
>> FROM raw_data
>> WHERE timestmp >= ? and timestmp < ?
>> AND item_name=?
>> AND some_data_field NOTNULL
>> ORDER BY random()
>> LIMIT ?;
>
> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
> There's no good way to optimize "ORDER BY random()".

This seems kind of unlikely but does the parameter to the LIMIT vary a lot? If
it's small enough to fit all the chosen records in work_mem then you'll avoid
a disk-sort and do a top-k scan. If it overflows work_mem then it'll fail over
to do a full disk sort of all the records picked from raw_data.

It does seem much more likely that whatever index it is using on
timestmp or item_name or some_data_field is sometimes being chosen and
sometimes not. Perhaps it's switching from an index on one of those
columns to an index on some other column and that's what's throwing it
off.
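One way to tell those cases apart (a sketch, with made-up literals in
place of the ? parameters; if I recall correctly 8.3's EXPLAIN ANALYZE
reports the sort method) is to run the SELECT half of the query by hand
on a slow day:

EXPLAIN ANALYZE
SELECT item_name, timestmp /* , ... */
FROM raw_data
WHERE timestmp >= '2009-03-09 01:00' AND timestmp < '2009-03-09 06:00'
  AND item_name = 'some item'
  AND some_data_field NOTNULL
ORDER BY random()
LIMIT 2000;

-- look for:  Sort Method:  top-N heapsort  Memory: ...   (fit in work_mem)
--      vs.:  Sort Method:  external merge  Disk: ...     (spilled to disk)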

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: Mike Charnoky <noky(at)nextbus(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: intermittant performance problem
Date: 2009-03-10 13:04:54
Message-ID: 49B66576.9030402@nextbus.com
Lists: pgsql-general

Gregory Stark wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> Mike Charnoky <noky(at)nextbus(dot)com> writes:
>>> The sampling query which runs really slow on some days looks something
>>> like this:
>>> INSERT INTO sampled_data
>>> (item_name, timestmp, ... )
>>> SELECT item_name, timestmp, ... )
>>> FROM raw_data
>>> WHERE timestmp >= ? and timestmp < ?
>>> AND item_name=?
>>> AND some_data_field NOTNULL
>>> ORDER BY random()
>>> LIMIT ?;
>> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
>> There's no good way to optimize "ORDER BY random()".
>
> This seems kind of unlikely but does the parameter to the LIMIT vary a lot? If
> it's small enough to fit all the chosen records in work_mem then you'll avoid
> a disk-sort and do a top-k scan. If it overflows work_mem then it'll fail over
> to do a full disk sort of all the records picked from raw_data.
>
> It does seem much more likely that whatever index you have it using on
> timestmp or item_name or some_data_field is sometimes being used and sometimes
> not. Perhaps it's switching from an index on one of those columns to an index
> on some other column and that's what's throwing it off.

The parameter used for the LIMIT does not vary too much. It is
typically a couple thousand records that are selected.

Judging by the disk IO monitoring we have in place, it does seem like a
full disk-sort is being done when the query runs slow. Would you expect
this action to totally hose overall database performance?
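One thing we could turn on to confirm that (a sketch; if I remember
right, 8.3 added the log_temp_files setting) is temp-file logging in
postgresql.conf:

log_temp_files = 0        # log every temporary file created, with its size in kB
#log_temp_files = 10240   # or only files larger than 10MB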

I'm instrumenting the EXPLAIN now, I'll see what this turns up over the
course of the week and will check back if I'm still stumped. Thanks.

Mike