Re: 8.3 beta testing suggestions welcome

Lists: pgsql-hackers
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: 8.3 beta testing suggestions welcome
Date: 2007-08-20 21:06:46
Message-ID: 46C9BC16.EE98.0025.0@wicourts.gov

I've been lobbying management here for us to allocate some resources to testing 8.3 once it hits beta. If it is approved, it might happen on a time frame too short to get much feedback before the tests, so I'm throwing the question out here now: what would people like us to bang on?

The box most likely to be used for the testing is a bit old, but still, it is SMP and we would be throwing real-world traffic at it, so it should be of some value. It has four 2 GHz Xeon MP CPUs, 6 GB RAM, and a RAID controller with 256 MB battery-backed RAM cache. The 230 GB database would be sitting on a 407 GB RAID 5 array. In addition to the PostgreSQL instance there would be two Java middle tiers running on the box.

One middle tier is for modifying data based on transactions received from 72 source databases; this load is about 1 million database transactions on a typical work day, with an average of maybe 20 INSERT, UPDATE, and DELETE statements per transaction. (We don't typically have many deletes.) The other middle tier uses a login which only has SELECT rights to support our web site. We have about 2 million web hits per day generating about 10 million database transactions. We can play the actual HTTP requests from our log through a bank of renderers to get a real mix of queries from production.

We're particularly interested in seeing what configuration changes we may have to make to achieve optimal performance with the checkpoints and background writer in the new release. When we first went to PostgreSQL our biggest problem was that dirty buffers would accumulate in shared memory until a checkpoint, and then overrun the controller's cache. This would cause disk reads to queue up behind the writes, and queries which normally ran in a millisecond or two were timing out at our renderers' 20 second limit. The problem went away completely when we used a very aggressive background writer configuration, to put the dirty pages in front of the OS file system right away, so that its algorithms and the controller cache could deal with things before they got out of hand.

We could run some tests with just the read-only web load, if that is useful, or push the update load alone. We could pace input. My guess is that the most useful tests would involve letting both run as fast as the machine can handle it with various configurations and see what throughput and timeout counts we get.

Any thoughts or suggestions welcome, particularly about what configurations to try.

-Kevin


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 8.3 beta testing suggestions welcome
Date: 2007-08-20 21:18:22
Message-ID: 46CA051E.1070107@enterprisedb.com
Lists: pgsql-hackers

Kevin Grittner wrote:
> I've been lobbying management here for us to allocate some resources to testing 8.3 once it hits beta. If it is approved, it might happen on a time frame too short to get much feedback before the tests, so I'm throwing the question out here now: what would people like us to bang on?

That would be great!

> We're particularly interested in seeing what configuration changes we may have to make to achieve optimal performance with the checkpoints and background writer in the new release. When we first went to PostgreSQL our biggest problem was that dirty buffers would accumulate in shared memory until a checkpoint, and then overrun the controller's cache. This would cause disk reads to queue up behind the writes, and queries which normally ran in a millisecond or two were timing out at our renderers' 20 second limit. The problem went away completely when we used a very aggressive background writer configuration, to put the dirty pages in front of the OS file system right away, so that its algorithms and the controller cache could deal with things before they got out of hand.

Yes, the load distributed checkpoints definitely should help with that.
I'd like to see how well it works for you with the default bgwriter
settings.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.3 beta testing suggestions welcome
Date: 2007-08-20 21:47:27
Message-ID: 87lkc56fps.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:

> When we first went to PostgreSQL our biggest problem was that dirty buffers
> would accumulate in shared memory until a checkpoint, and then overrun the
> controller's cache. This would cause disk reads to queue up behind the
> writes, and queries which normally ran in a millisecond or two were timing
> out at our renderers' 20 second limit. The problem went away completely when
> we used a very aggressive background writer configuration, to put the dirty
> pages in front of the OS file system right away, so that its algorithms and
> the controller cache could deal with things before they got out of hand.

Sounds like a tailor-made use case for precisely what Heikki was complaining
about. He couldn't find a case in 8.3 where tuning the bgwriter to be more
aggressive helped at all.

With the load distributed checkpoints I think the symptoms would be different
but the disease may still be there. Since checkpoints will try not to swamp
your i/o bandwidth any longer you shouldn't get these terrible spikes.

However the theory with bgwriter is that setting it to be very aggressive will
reduce the response time even outside the checkpoints by avoiding the need for
individual backends to evict dirty pages. So it would be interesting to know
with 8.3 whether the average response time even outside of checkpoints is
reduced by having a more aggressive bgwriter policy.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 8.3 beta testing suggestions welcome
Date: 2007-08-21 07:36:57
Message-ID: Pine.GSO.4.64.0708210326370.2093@westnet.com
Lists: pgsql-hackers

On Mon, 20 Aug 2007, Kevin Grittner wrote:

> The problem went away completely when we used a very aggressive
> background writer configuration, to put the dirty pages in front of the
> OS file system right away, so that its algorithms and the controller
> cache could deal with things before they got out of hand.

Can you share what the aggressive BGW settings you settled on were?
Knowing that information would allow giving better suggestions on how to
translate what you're currently doing into the very different settings 8.3
uses in this area. It would also be helpful to know about how often you have a
checkpoint with your current configuration, and how big your
shared_buffers is to get a general context for the size/frequency of
potential checkpoint problems.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.3 beta testing suggestions welcome
Date: 2007-08-21 14:34:17
Message-ID: 46CAB199.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Tue, Aug 21, 2007 at 2:36 AM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0708210326370(dot)2093(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
> On Mon, 20 Aug 2007, Kevin Grittner wrote:
>
>> The problem went away completely when we used a very aggressive
>> background writer configuration, to put the dirty pages in front of the
>> OS file system right away, so that its algorithms and the controller
>> cache could deal with things before they got out of hand.
>
> Can you share what the aggressive BGW settings you settled on were?

Under 8.2.4, the postgresql.conf file, minus blank lines and comments:

listen_addresses = '*'
max_connections = 200
shared_buffers = 160MB
temp_buffers = 50MB
work_mem = 10MB
maintenance_work_mem = 160MB
max_fsm_pages = 800000
bgwriter_lru_percent = 20.0
bgwriter_lru_maxpages = 200
bgwriter_all_percent = 10.0
bgwriter_all_maxpages = 600
wal_buffers = 160kB
checkpoint_segments = 10
random_page_cost = 2.0
effective_cache_size = 5GB
redirect_stderr = on
log_line_prefix = '[%m] %p %q<%u %d %r> '
stats_block_level = on
stats_row_level = on
autovacuum = on
autovacuum_naptime = 10s
autovacuum_vacuum_threshold = 1
autovacuum_analyze_threshold = 1
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
escape_string_warning = off
standard_conforming_strings = on
sql_inheritance = off
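
As a rough aside (not part of the original mail): the bgwriter settings above put a hard ceiling on background write throughput. Assuming the default bgwriter_delay of 200 ms (five rounds per second) and 8 kB pages, the arithmetic can be sketched as:

```python
# Ceiling on 8.2 background-writer throughput implied by the settings above.
# Assumes the default bgwriter_delay of 200 ms (5 rounds/second) and 8 kB pages.
ROUNDS_PER_SEC = 1000 / 200  # bgwriter_delay = 200 ms
PAGE_KB = 8

def max_bgwriter_mb_per_sec(lru_maxpages: int, all_maxpages: int) -> float:
    """Most MB/s the bgwriter can push, given the per-round page limits."""
    pages_per_sec = ROUNDS_PER_SEC * (lru_maxpages + all_maxpages)
    return pages_per_sec * PAGE_KB / 1024

# With bgwriter_lru_maxpages = 200 and bgwriter_all_maxpages = 600:
rate = max_bgwriter_mb_per_sec(200, 600)  # 31.25 MB/s
```

At roughly 31 MB/s, the bgwriter alone can refresh the whole 160 MB shared_buffers pool every few seconds, consistent with the "very aggressive" description.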

> Also be helpful to know about how often you have a
> checkpoint with your current configuration

Since our problems associated with checkpoints have been eliminated, I
haven't been tracking them very closely. A scan of our log files on this
machine shows that there were two episodes in the last two months where we
had 20 to 40 checkpoints in a day which were less than 30 seconds after the
prior one. These were associated with times when this central copy had
fallen significantly behind the source databases and replication was running
full out, catching up with the sources.

Outside of those events we seem to have one to four episodes on scattered
days. A quick check shows that all but a few coincide with a weekly run
which passes massive amounts of data to rebuild a table of cases which
should not show on the public web site according to the Supreme Court's
records retention rules.

Is there anything you would like me to do to capture more information
about the checkpoints?

-Kevin


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 8.3 beta testing suggestions welcome
Date: 2007-08-21 18:12:26
Message-ID: Pine.GSO.4.64.0708211316210.16511@westnet.com
Lists: pgsql-hackers

On Tue, 21 Aug 2007, Kevin Grittner wrote:

> shared_buffers = 160MB
> effective_cache_size = 5GB
> bgwriter_lru_percent = 20.0
> bgwriter_lru_maxpages = 200
> bgwriter_all_percent = 10.0
> bgwriter_all_maxpages = 600
> checkpoint_segments = 10

The other thing I realized you haven't mentioned yet is what operating
system you're using.

What you've done here is make your shared_buffers smaller than even the
write cache on your disk controller (256MB), so that anytime a checkpoint
hits it should be able to cache writing even the entire buffer cache out
if necessary. You're letting the OS handle caching everything else, which
is memory not involved in the checkpoint process and therefore doesn't
impact your problem situation.

With checkpoint_segments at 10, you can't do much activity (relative to
how fast this powerful of a server can dirty pages) before a checkpoint
happens, which also limits how much any one checkpoint has to clean up.
And your background writer settings are aggressive enough that the entire
pool is getting scanned constantly, which is wasting lots of CPU resources
but you have them to spare.

This is a reasonable solution for working around checkpoint issues in 8.2.4,
but the fact that you're still having checkpoints spaced 30 seconds apart
sometimes is one weak spot you already have seen. PostgreSQL operating
with a tiny shared_buffers cache is another, and you could expect overall
performance to improve if it were possible to increase that without
getting killed by checkpoints.

In 8.3, it should be possible for you to increase both shared_buffers and
checkpoint_segments significantly and still have checkpoints happen in an
orderly fashion. There is no all-scan anymore; instead there's a
parameter called checkpoint_completion_target; see
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

My suggestion for a starting 8.3 configuration for you would be adjusting
these settings as follows:

shared_buffers=1GB
checkpoint_segments = 50

And then try replaying your data with checkpoint_completion_target at 0.5
(default), 0.7, and 0.9 and see how each of those works out for you (0.7
is the least useful of those if you only did two). Hopefully the data
replay you have can be set up to invoke the same tightly spaced checkpoint
behavior you commented about. Based on the results of those tests, it may
be possible to further increase shared_buffers, and
checkpoint_segments/checkpoint_timeout may need some adjustment one way or
another.
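
As a hedged aside (not from the original mail): the 8.3 WAL configuration docs estimate the number of WAL segment files kept around as roughly (2 + checkpoint_completion_target) * checkpoint_segments + 1, so the suggested settings can be sized on disk ahead of time:

```python
# Rough pg_xlog disk footprint for the suggested 8.3 settings, using the
# estimate from the 8.3 WAL configuration docs:
#   segment files ~= (2 + checkpoint_completion_target) * checkpoint_segments + 1
SEGMENT_MB = 16  # default WAL segment size

def wal_footprint_mb(checkpoint_segments: int, completion_target: float) -> float:
    segments = (2 + completion_target) * checkpoint_segments + 1
    return segments * SEGMENT_MB

# checkpoint_segments = 50 at each completion target to be tested:
for target in (0.5, 0.7, 0.9):
    print(target, wal_footprint_mb(50, target))  # roughly 2.0-2.3 GB of pg_xlog
```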

Note that some/all of the bgwriter_lru parameters may be going away before
8.3 is done as well, that's a loose end I'm working on right now.

If your project gets approved, that's what I think would be a useful test
to run. That should get some good results for the community at large, as
well as show how upgrading to the new version might positively impact
your application. You're actually in one of the situations I'm a little
concerned about. All the tests that have been done by people here have
suggested using checkpoint_completion_target and removing the all scan are
always net positive compared to the 8.2.4 behavior, but your situation
(where you're heavily using bgwriter_all_percent = 10.0 and
bgwriter_all_maxpages = 600) is one where it's possible 8.3 may be a step
backwards. Not likely, just possible, and it would be great to get
another data point on this during the beta.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.3 beta testing suggestions welcome
Date: 2007-08-21 18:27:07
Message-ID: 46CAE82B.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Tue, Aug 21, 2007 at 1:12 PM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0708211316210(dot)16511(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
>
> The other thing I realized you haven't mentioned yet is what operating
> system you're using.

Linux version 2.6.5-7.286-bigsmp (geeko(at)buildhost) (gcc version 3.3.3 (SuSE Linux)) #1 SMP Thu May 31 10:12:58 UTC 2007

SUSE LINUX Enterprise Server 9 (i586)
VERSION = 9
PATCHLEVEL = 3

I know this is old, but I was happy just to get them off of Windows. :-/

> My suggestion for a starting 8.3 configuration for you would be adjusting
> these settings as follows:
>
> shared_buffers=1GB
> checkpoint_segments = 50
>
> And then try replaying your data with checkpoint_completion_target at 0.5
> (default), 0.7, and 0.9 and see how each of those works out for you (0.7
> is the least useful of those if you only did two). Hopefully the data
> replay you have can be set up to invoke the same tightly spaced checkpoint
> behavior you commented about. Based on the results of those tests, it may
> be possible to further increase shared_buffers, and
> checkpoint_segments/checkpoint_timeout may need some adjustment one way or
> another.
>
> Note that some/all of the bgwriter_lru parameters may be going away before
> 8.3 is done as well, that's a loose end I'm working on right now.

Thanks much. If I get the approval, I'll be sure to run these three tests.
It probably makes sense to get a baseline with the current config first.
I assume you want me to do these with both the update and query streams
running full out?

Anyone else have a configuration to suggest?

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.3 beta testing suggestions welcome
Date: 2007-11-30 23:17:01
Message-ID: 4750458D.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Tue, Aug 21, 2007 at 1:12 PM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0708211316210(dot)16511(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
> My suggestion for a starting 8.3 configuration for you would be adjusting
> these settings as follows:
>
> shared_buffers=1GB
> checkpoint_segments = 50

I'm not into the most meaningful tests yet, but I'm trying to
garner what useful information I can from the load of the 8.2.5
database into a server running the cvs tip copy taken a few hours
ago. After letting the cache settle in, while in the middle of a
COPY of a large table, I see the attached through a couple
checkpoint cycles. The pattern is pretty consistent over the other
checkpoint cycles I've examined.

Note that the system can sustain over 20000 writes per second
without the disk being a bottleneck. The current tip behavior is
smoother than 8.2.x without background writer, for sure. It still
isn't as smooth as 8.2.x with an aggressive background writer, at
least for loading a dump. Note how it lounges around at as low as
5500 writes per second for prolonged periods with occasional
spikes into the 55000 range, causing I/O wait time, although these
only last a few seconds. Without more rigorous tests for
comparison, I'd say that this is probably a net loss for this
operation -- I seem to recall seeing a pretty steady 20000 bo under
8.2.x.
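
(Editorial sketch, not part of the original mail: those spikes can be pulled out of a vmstat sample mechanically, given just the "bo" (blocks out) column. The threshold and sample values below are invented to match the behavior described above.)

```python
# Sketch: flag write spikes in a vmstat "bo" (blocks out) series.
# The threshold is arbitrary; the samples are made up to mimic long
# stretches near 5500 blocks out with occasional spikes toward 55000.
def find_spikes(bo_samples, threshold=40000):
    """Return (sample_index, bo) pairs where blocks-out exceeds threshold."""
    return [(i, bo) for i, bo in enumerate(bo_samples) if bo > threshold]

samples = [5500, 5600, 5400, 55000, 52000, 5700, 5500, 48000, 5600]
spikes = find_spikes(samples)  # [(3, 55000), (4, 52000), (7, 48000)]
```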

More info to follow as testing progresses.

-Kevin

Attachment Content-Type Size
vmstat.sample1.out application/octet-stream 27.5 KB

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 8.3 beta testing suggestions welcome
Date: 2007-12-01 00:04:28
Message-ID: Pine.GSO.4.64.0711301853210.15926@westnet.com
Lists: pgsql-hackers

On Fri, 30 Nov 2007, Kevin Grittner wrote:

>> checkpoint_segments = 50

Here's how I'm reading your data:

16:43:11 : Checkpoint start
16:44:23 : Checkpoint ends [1:12 long]
1:01 passes
16:45:24 : Checkpoint start
16:46:36 : Checkpoint ends [1:12 long]

If you're getting a checkpoint every minute or so, the number of
checkpoint segments you're using is still far too low to really take
advantage of the new checkpoint spreading features. You want to make that
high enough that it's several minutes between them. It's not unusual to
set a much higher checkpoint_segments during bulk loading than you'd
necessarily want for actual production use.
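
(Editorial sketch, not from the mail: the durations and idle gap above can be computed from (timestamp, event) pairs read out of the log, which is handy when scanning a longer run.)

```python
from datetime import datetime

# Sketch: turn ordered (HH:MM:SS, "start"/"end") checkpoint events into
# checkpoint durations and the idle gaps between one checkpoint's end
# and the next one's start, both in seconds.
def checkpoint_stats(events):
    durations, gaps = [], []
    start = last_end = None
    for stamp, kind in events:
        t = datetime.strptime(stamp, "%H:%M:%S")
        if kind == "start":
            if last_end is not None:
                gaps.append((t - last_end).seconds)
            start = t
        else:
            durations.append((t - start).seconds)
            last_end = t
    return durations, gaps

# The timings read out of the vmstat sample:
events = [("16:43:11", "start"), ("16:44:23", "end"),
          ("16:45:24", "start"), ("16:46:36", "end")]
durations, gaps = checkpoint_stats(events)  # ([72, 72], [61])
```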

One thing that would make your reports more useful: in the spots in your
log where you're labeling [checkpoint starting] etc., if you could include
the new message spit out by turning on log_checkpoints there it would help.
That's the one that says how much data was written.

Did you ever write something to save snapshots of pg_stat_bgwriter?
Those would be interesting to see on the same time scale as well.
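
(Another editorial sketch, hypothetical and not Kevin's script: a snapshot saver would poll `SELECT * FROM pg_stat_bgwriter` on a timer and diff successive rows. The delta arithmetic looks like this; the keys are 8.3's actual pg_stat_bgwriter columns, but the counter values are invented.)

```python
# Sketch: diff two pg_stat_bgwriter snapshots to see what each component
# wrote during the interval. Keys match 8.3's pg_stat_bgwriter columns;
# the counter values are made up for illustration.
def bgwriter_delta(prev, curr):
    return {col: curr[col] - prev[col] for col in prev}

snap1 = {"checkpoints_timed": 4, "checkpoints_req": 1,
         "buffers_checkpoint": 12000, "buffers_clean": 3000,
         "maxwritten_clean": 2, "buffers_backend": 800, "buffers_alloc": 40000}
snap2 = {"checkpoints_timed": 5, "checkpoints_req": 1,
         "buffers_checkpoint": 15500, "buffers_clean": 3400,
         "maxwritten_clean": 2, "buffers_backend": 950, "buffers_alloc": 43000}
delta = bgwriter_delta(snap1, snap2)
# e.g. delta["buffers_checkpoint"] tells how many buffers checkpoints wrote
```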

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: testing result overview (was: 8.3 beta testing suggestions welcome)
Date: 2008-01-18 19:08:46
Message-ID: 4790A4DE.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Tue, Aug 21, 2007 at 1:12 PM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0708211316210(dot)16511(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:

> My suggestion for a starting 8.3 configuration for you would be adjusting
> these settings as follows:
>
> shared_buffers=1GB
> checkpoint_segments = 50
>
> And then try replaying your data with checkpoint_completion_target at 0.5
> (default), 0.7, and 0.9 and see how each of those works out for you (0.7
> is the least useful of those if you only did two). Hopefully the data
> replay you have can be set up to invoke the same tightly spaced checkpoint
> behavior you commented about.

Under 8.2.4, 8.2.5, 8.3beta4 and a recent compile of the CVS tip,
the test machine (four 2 GHz Xeons with 6 GB RAM) was unable to create
the clusters of timeouts which we saw on the production box (eight 3 GHz
Xeons with 12 GB RAM), in spite of concerted efforts to recreate the
conditions. It's possible that updates to PostgreSQL or the OS have
made the difference, although I think it's more likely that the
larger OS cache and/or the higher CPU power of the production box
are needed to create the problem on this type of disk array. We're
working on getting a couple new boxes on line, which will allow us
to test using the current production box, but that's weeks away.

Performance under 8.3 is better in general, and there is a very
noticeable smoothing of the output spikes at the OS level. I didn't
notice much difference in vmstat output with the different
checkpoint_completion_target settings.

Our testing methodology involved playing transaction streams from
both our replication engine and our web site, at the same time,
against the test server, through the normal middle tier software we
use in production. We have timestamps on all of it, which allowed
us to play the streams at various percentages of the production
load. Based on the timeouts from the web renderers (which cancel
a request if the middle tier hasn't responded within 20 seconds),
the middle tier metrics of actual run time once the request is
pulled from the request queue, and the vmstat output, 8.3 shows
significant performance improvement as long as the fix for the
OUTER JOIN optimizer regression is used (that is, with a recent
checkout from the CVS tip, or beta/RC with Tom's patch applied).

-Kevin