From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Controlling Load Distributed Checkpoints
Date: 2007-06-06 13:19:12
Message-ID: 4666B450.8070506@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

I'm again looking at the way the GUC variables work in the load distributed
checkpoints patch. We've discussed them a lot already, but I don't think
they're quite right yet.

Write-phase
-----------
I like the way the write-phase is controlled in general. Writes are
throttled so that we spend the specified percentage of checkpoint
interval doing the writes. But we always write at a specified minimum
rate to avoid spreading out the writes unnecessarily when there's little
work to do.

The original patch uses bgwriter_all_max_pages to set the minimum rate.
I think we should have a separate variable, checkpoint_write_min_rate,
in KB/s, instead.
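
To make the KB/s formulation concrete, here's a minimal sketch of how the
write-phase throttle could work. This is illustration only, not the actual
patch code; write_next_dirty_buffer() is a made-up helper, and the snippet
assumes the usual backend environment for BLCKSZ, Min and pg_usleep:

#include "postgres.h"       /* for BLCKSZ, Min and pg_usleep */

/* made-up helper standing in for "write the next dirty buffer" */
extern void write_next_dirty_buffer(void);

static void
checkpoint_write_phase(int buffers_to_write, double seconds_available,
                       double min_rate_kb_per_s)
{
    /* delay (in ms) that spreads the writes over the time available */
    double      spread_delay_ms = seconds_available * 1000.0 / buffers_to_write;
    /* delay (in ms) implied by writing one BLCKSZ page at the minimum rate */
    double      min_rate_delay_ms = (BLCKSZ / 1024.0) / min_rate_kb_per_s * 1000.0;
    /* never sleep longer than the minimum rate allows */
    double      delay_ms = Min(spread_delay_ms, min_rate_delay_ms);
    int         i;

    for (i = 0; i < buffers_to_write; i++)
    {
        write_next_dirty_buffer();
        pg_usleep((long) (delay_ms * 1000.0));  /* argument is microseconds */
    }
}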

Nap phase
---------
This is trickier. The purpose of the sleep between writes and fsyncs is
to give the OS a chance to flush the pages to disk at its own pace,
hopefully limiting the effect on concurrent activity. The sleep
shouldn't last too long, because any concurrent activity can be dirtying
and writing more pages, and we might end up fsyncing more than necessary,
which is bad for performance. The optimal delay depends on many factors,
but I believe it's somewhere between 0 and 30 seconds in any reasonable system.

In the current patch, the duration of the sleep between the write and
sync phases is controlled as a percentage of checkpoint interval. Given
that the optimal delay is in the range of seconds, and
checkpoint_timeout can be up to 60 minutes, the useful values of that
percentage would be very small, like 0.5% or even less. Furthermore, the
optimal value doesn't depend that much on the checkpoint interval; it's
more dependent on your OS and memory configuration.

We should therefore give the delay as a number of seconds instead of as
a percentage of checkpoint interval.

Sync phase
----------
This is also tricky. As with the nap phase, we don't want to spend too
much time fsyncing, because concurrent activity will write more dirty
pages and we might just end up doing more work.

And we don't know how much work an fsync performs. The patch uses the
file size as a measure of that, but as we discussed that doesn't
necessarily have anything to do with reality. fsyncing a 1GB file with
one dirty block isn't any more expensive than fsyncing a file with a
single block.

Another problem is the granularity of an fsync. If we fsync a 1GB file
that's full of dirty pages, we can't limit the effect on other activity.
The best we can do is to sleep between fsyncs, but sleeping more than a
few seconds is hardly going to be useful, no matter how bad an I/O storm
each fsync causes.

Because of the above, I'm thinking we should ditch the
checkpoint_sync_percentage variable, in favor of:
checkpoint_fsync_period # duration of the fsync phase, in seconds
checkpoint_fsync_delay # max. sleep between fsyncs, in milliseconds
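
As a rough sketch of how those two settings could drive the sync phase
(again just an illustration; fsync_next_pending_file() is a made-up helper):

#include "postgres.h"       /* for pg_usleep */

/* made-up helper: fsync the next file touched by this checkpoint */
extern void fsync_next_pending_file(void);

static void
checkpoint_sync_phase(int nfiles, int fsync_period_secs, int fsync_delay_ms)
{
    /* space the fsyncs evenly over the period, but cap the sleep */
    double      delay_ms = (double) fsync_period_secs * 1000.0 / nfiles;
    int         i;

    if (delay_ms > fsync_delay_ms)
        delay_ms = fsync_delay_ms;

    for (i = 0; i < nfiles; i++)
    {
        fsync_next_pending_file();
        if (i < nfiles - 1)
            pg_usleep((long) (delay_ms * 1000.0));
    }
}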

In all phases, the normal bgwriter activities are performed:
lru-cleaning and switching xlog segments if archive_timeout expires. If
a new checkpoint request arrives while the previous one is still in
progress, we skip all the delays and finish the previous checkpoint as
soon as possible.

GUC summary and suggested default values
----------------------------------------
checkpoint_write_percent = 50    # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2      # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30     # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500     # max. delay between fsyncs, in milliseconds

I don't like adding that many GUC variables, but I don't really see a
way to tune them automatically. Maybe we could just hard-code the last
one; it doesn't seem that critical, but that still leaves us with 4 variables.

Thoughts?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "Greg Smith" <gsmith(at)gregsmith(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-06 14:14:14
Message-ID: 87lkexf9vt.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> GUC summary and suggested default values
> ----------------------------------------
> checkpoint_write_percent = 50 # % of checkpoint interval to spread out writes
> checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty
> buffers at checkpoint (KB/s)

I don't understand why this is a min_rate rather than a max_rate.

> checkpoint_nap_duration = 2 # delay between write and sync phase, in seconds

Not a comment on the choice of guc parameters, but don't we expect useful
values of this to be much closer to 30 than 0? I understand it might not be
exactly 30.

Actually, it's not so much whether there's any write traffic to the data files
during the nap that matters, it's whether there's more traffic during the nap
than during the 30s or so prior to the nap. As long as it's a steady-state
condition it shouldn't matter how long we wait, should it?

> checkpoint_fsync_period = 30 # duration of the sync phase, in seconds
> checkpoint_fsync_delay = 500 # max. delay between fsyncs

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-06 15:03:25
Message-ID: 21062.1181142205@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> GUC summary and suggested default values
> ----------------------------------------
> checkpoint_write_percent = 50 # % of checkpoint interval to spread out
> writes
> checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty
> buffers at checkpoint (KB/s)
> checkpoint_nap_duration = 2 # delay between write and sync phase, in
> seconds
> checkpoint_fsync_period = 30 # duration of the sync phase, in seconds
> checkpoint_fsync_delay = 500 # max. delay between fsyncs

> I don't like adding that many GUC variables, but I don't really see a
> way to tune them automatically.

If we don't know how to tune them, how will the users know? Having to
add that many variables to control one feature says to me that we don't
understand the feature.

Perhaps what we need is to think about how it can auto-tune itself.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-06 18:05:35
Message-ID: Pine.GSO.4.64.0706061328450.27416@westnet.com
Lists: pgsql-hackers pgsql-patches

On Wed, 6 Jun 2007, Tom Lane wrote:

> If we don't know how to tune them, how will the users know?

I can tell you a good starting set for them on a Linux system, but you
first have to let me know how much memory is in the OS buffer cache, the
typical I/O rate the disks can support, how many buffers are expected to
be written out by BGW/other backends at heaviest load, and the current
setting for /proc/sys/vm/dirty_background_ratio. It's not a coincidence
that there are patches applied to 8.3 or in the queue to measure all of
the Postgres internals involved in that computation; I've been picking
away at the edges of this problem.

Getting this sort of tuning right takes that level of information about
the underlying system. If there's a way to internally auto-tune the
values this patch operates on (which I haven't found despite months of
trying), it would be in the form of some sort of measurement/feedback loop
based on how fast data is being written out. There really are way too
many things involved to try and tune it based on anything else; the
underlying OS/hardware mechanisms that determine how this will go are
complicated enough that it might as well be a black box for most people.

One of the things I've been fiddling with the design of is a testing
program that simulates database activity at checkpoint time under load.
I think running some tests like that is the most straightforward way to
generate useful values for these tunables; it's much harder to try and
determine them from within the backends because there's so much going on
to keep track of.

I view the LDC mechanism as being in the same state right now as the
background writer: there are a lot of complicated knobs to tweak, they
all do *something* useful for someone, and eliminating them will require a
data-collection process across a much wider sample of data than can be
collected quickly. If I had to make a guess how this will end up, I'd
expect there to be more knobs in LDC than everyone would like for the 8.3
release, along with fairly verbose logging of what is happening at
checkpoint time (that's why I've been nudging development in that area,
along with making logs easier to aggregate). Collect up enough of that
information, then you're in a position to talk about useful automatic
tuning--right around the 8.4 timeframe I suspect.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-06 18:26:11
Message-ID: Pine.GSO.4.64.0706061409050.27416@westnet.com
Lists: pgsql-hackers pgsql-patches

On Wed, 6 Jun 2007, Heikki Linnakangas wrote:

> The original patch uses bgwriter_all_max_pages to set the minimum rate. I
> think we should have a separate variable, checkpoint_write_min_rate, in KB/s,
> instead.

Completely agreed. There shouldn't be any coupling with the background
writer parameters, which may be set for a completely different set of
priorities than the checkpoint has. I have to look at this code again to
see why it's a min_rate instead of a max; that seems a little weird.

> Nap phase: We should therefore give the delay as a number of seconds
> instead of as a percentage of checkpoint interval.

Again, the setting here should be completely decoupled from another GUC
like the interval. My main complaint with the original form of this patch
was how much it tried to synchronize the process with the interval; since I
don't even have a system where that value is set to something, because
it's all segment based instead, that whole idea was incompatible.

The original patch tried to spread the load out as evenly as possible over
the time available. I much prefer thinking in terms of getting it done as
quickly as possible while trying to bound the I/O storm.

> And we don't know how much work an fsync performs. The patch uses the file
> size as a measure of that, but as we discussed that doesn't necessarily have
> anything to do with reality. fsyncing a 1GB file with one dirty block isn't
> any more expensive than fsyncing a file with a single block.

On top of that, if you have a system with a write cache, the time an fsync
takes can greatly depend on how full it is at the time, which there is no
way to measure or even model easily.

Is there any way to track how many dirty blocks went into each file during
the checkpoint write? That's your best bet for guessing how long the
fsync will take.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 08:36:53
Message-ID: 4667C3A5.9090008@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Greg Smith wrote:
> On Wed, 6 Jun 2007, Heikki Linnakangas wrote:
>
>> The original patch uses bgwriter_all_max_pages to set the minimum
>> rate. I think we should have a separate variable,
>> checkpoint_write_min_rate, in KB/s, instead.
>
> Completely agreed. There shouldn't be any coupling with the background
> writer parameters, which may be set for a completely different set of
> priorities than the checkpoint has. I have to look at this code again
> to see why it's a min_rate instead of a max, that seems a little weird.

It's a min rate because it never writes slower than that, and it can
write faster if the next checkpoint is due so soon that we wouldn't
otherwise finish before it's time to start the next one. (Or to be
precise, before the next checkpoint is closer than
100-(checkpoint_write_percent)% of the checkpoint interval.)
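
In other words, roughly like this (a sketch of the rate selection, not the
patch's actual code):

double
effective_write_rate(double remaining_kb, double interval_secs,
                     double elapsed_secs, double write_percent,
                     double min_rate_kb_per_s)
{
    /* the writes should finish while 100-write_percent % of the interval remains */
    double      deadline_secs = interval_secs * write_percent / 100.0;
    double      time_left_secs = deadline_secs - elapsed_secs;
    double      needed_rate;

    if (time_left_secs <= 0.0)
        needed_rate = remaining_kb;     /* overdue: finish as fast as we can */
    else
        needed_rate = remaining_kb / time_left_secs;

    /* never drop below the configured minimum rate */
    return (needed_rate > min_rate_kb_per_s) ? needed_rate : min_rate_kb_per_s;
}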

>> Nap phase: We should therefore give the delay as a number of seconds
>> instead of as a percentage of checkpoint interval.
>
> Again, the setting here should be completely decoupled from another GUC
> like the interval. My main complaint with the original form of this
> patch was how much it tried to syncronize the process with the interval;
> since I don't even have a system where that value is set to something,
> because it's all segment based instead, that whole idea was incompatible.

checkpoint_segments is taken into account as well as checkpoint_timeout.
I used the term "checkpoint interval" to mean the real interval at which
the checkpoints occur, whether it's because of segments or timeout.

> The original patch tried to spread the load out as evenly as possible
> over the time available. I much prefer thinking in terms of getting it
> done as quickly as possible while trying to bound the I/O storm.

Yeah, the checkpoint_min_rate allows you to do that.

So there are two extreme ways you can use LDC:
1. Finish the checkpoint as soon as possible, without disturbing other
activity too much. Set checkpoint_write_percent to a high number, and
set checkpoint_min_rate to define "too much".
2. Disturb other activity as little as possible, as long as the
checkpoint finishes in a reasonable time. Set checkpoint_min_rate to a
low number, and checkpoint_write_percent to define "reasonable time".

Are both interesting use cases, or is it enough to cater for just one of
them? I think 2 is easier to tune. Defining the min_rate properly can be
difficult and depends a lot on your hardware and application, but a
default value of say 50% for checkpoint_write_percent to tune for use
case 2 should work pretty well for most people.

In any case, the checkpoint better finish before it's time to start
another one. Or would you rather delay the next checkpoint, and let the
checkpoint take as long as it takes to finish at the min_rate?

>> And we don't know how much work an fsync performs. The patch uses the
>> file size as a measure of that, but as we discussed that doesn't
>> necessarily have anything to do with reality. fsyncing a 1GB file with
>> one dirty block isn't any more expensive than fsyncing a file with a
>> single block.
>
> On top of that, if you have a system with a write cache, the time an
> fsync takes can greatly depend on how full it is at the time, which
> there is no way to measure or even model easily.
>
> Is there any way to track how many dirty blocks went into each file
> during the checkpoint write? That's your best bet for guessing how long
> the fsync will take.

I suppose it's possible, but the OS has hopefully started flushing them
to disk almost as soon as we started the writes, so even that isn't a
very good measure.

On a Linux system, one way to model it is that the OS flushes dirty
buffers to disk at the same rate as we write them, but delayed by
dirty_expire_centisecs. That should hold if the writes are spread out
enough. Then the amount of dirty buffers in the OS cache at the end of the write
phase is roughly constant, as long as the write phase lasts longer than
dirty_expire_centisecs. If we take a nap of dirty_expire_centisecs after
the write phase, the fsyncs should be effectively no-ops, except that
they will flush any other writes the bgwriter lru-sweep and other
backends performed during the nap.
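
A back-of-the-envelope illustration of that model, with made-up numbers
rather than measurements:

#include <stdio.h>

int
main(void)
{
    double      write_rate_kb_s = 4000.0;   /* assumed checkpoint write rate */
    double      dirty_expire_s = 30.0;      /* dirty_expire_centisecs = 3000 */

    /* roughly constant dirty backlog in the OS cache during the write phase */
    double      backlog_mb = write_rate_kb_s * dirty_expire_s / 1024.0;

    printf("~%.0f MB still dirty in the OS cache at the end of the write phase;\n",
           backlog_mb);
    printf("a nap of >= %.0f s should let it drain before the fsyncs\n",
           dirty_expire_s);
    return 0;
}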

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 09:28:03
Message-ID: 1181208483.6903.9.camel@hannu-laptop
Lists: pgsql-hackers pgsql-patches

On one fine day, Wed, 2007-06-06 at 11:03, Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> > GUC summary and suggested default values
> > ----------------------------------------
> > checkpoint_write_percent = 50 # % of checkpoint interval to spread out
> > writes
> > checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty
> > buffers at checkpoint (KB/s)
> > checkpoint_nap_duration = 2 # delay between write and sync phase, in
> > seconds
> > checkpoint_fsync_period = 30 # duration of the sync phase, in seconds
> > checkpoint_fsync_delay = 500 # max. delay between fsyncs
>
> > I don't like adding that many GUC variables, but I don't really see a
> > way to tune them automatically.
>
> If we don't know how to tune them, how will the users know?

He talked about doing it _automatically_.

If the knobs are available, it will be possible to determine "good"
values even by brute-force performance testing, given that enough time
and manpower are available.

> Having to
> add that many variables to control one feature says to me that we don't
> understand the feature.

The feature has lots of complex dependencies on things outside Postgres,
so learning to understand it takes time. Having the knobs available
helps, as more people are willing to do turn-the-knobs-and-test vs.
recompile-and-test.

> Perhaps what we need is to think about how it can auto-tune itself.

Sure.

-------------------
Hannu Krosing


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, tgl(at)sss(dot)pgh(dot)pa(dot)us, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 12:23:06
Message-ID: 4667F8AA.4040300@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Thinking about this whole idea a bit more, it occurred to me that the
current approach to write all, then fsync all is really a historical
artifact of the fact that we used to use the system-wide sync call
instead of fsyncs to flush the pages to disk. That might not be the best
way to do things in the new load-distributed-checkpoint world.

How about interleaving the writes with the fsyncs?

1.
Scan all shared buffers, and build a list of all files with dirty pages,
and buffers belonging to them

2.
foreach(file in list)
{
    foreach(buffer belonging to file)
    {
        write();
        sleep();    /* to throttle the I/O rate */
    }
    sleep();        /* to give the OS a chance to flush the writes at its own pace */
    fsync();
}

This would spread out the fsyncs in a natural way, making the knob to
control the duration of the sync phase unnecessary.

At some point we'll also need to fsync all files that have been modified
since the last checkpoint, but don't have any dirty buffers in the
buffer cache. I think it's a reasonable assumption that fsyncing those
files doesn't generate a lot of I/O. Since the writes have been made
some time ago, the OS has likely already flushed them to disk.

Doing the 1st phase of just scanning the buffers to see which ones are
dirty also effectively implements the optimization of not writing
buffers that were dirtied after the checkpoint start. And grouping the
writes per file gives the OS a better chance to group the physical writes.

One problem is that currently the segmentation of relations into 1GB files
is handled at a low level inside md.c, and we don't really have any
visibility into that in the buffer manager. ISTM that some changes to
the smgr interfaces would be needed for this to work well, though just
doing it on a relation per relation basis would also be better than the
current approach.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 14:16:25
Message-ID: 719.1181225785@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Thinking about this whole idea a bit more, it occured to me that the
> current approach to write all, then fsync all is really a historical
> artifact of the fact that we used to use the system-wide sync call
> instead of fsyncs to flush the pages to disk. That might not be the best
> way to do things in the new load-distributed-checkpoint world.

> How about interleaving the writes with the fsyncs?

I don't think it's a historical artifact at all: it's a valid reflection
of the fact that we don't know enough about disk layout to do low-level
I/O scheduling. Issuing more fsyncs than necessary will do little
except guarantee a less-than-optimal scheduling of the writes.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 17:23:41
Message-ID: 46683F1D.4060609@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Thinking about this whole idea a bit more, it occured to me that the
>> current approach to write all, then fsync all is really a historical
>> artifact of the fact that we used to use the system-wide sync call
>> instead of fsyncs to flush the pages to disk. That might not be the best
>> way to do things in the new load-distributed-checkpoint world.
>
>> How about interleaving the writes with the fsyncs?
>
> I don't think it's a historical artifact at all: it's a valid reflection
> of the fact that we don't know enough about disk layout to do low-level
> I/O scheduling. Issuing more fsyncs than necessary will do little
> except guarantee a less-than-optimal scheduling of the writes.

I'm not proposing to issue any more fsyncs. I'm proposing to change the
ordering so that instead of first writing all dirty buffers and then
fsyncing all files, we'd write all buffers belonging to a file, fsync
that file only, then write all buffers belonging to next file, fsync,
and so forth.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 17:43:49
Message-ID: 14034.1181238229@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> I don't think it's a historical artifact at all: it's a valid reflection
>> of the fact that we don't know enough about disk layout to do low-level
>> I/O scheduling. Issuing more fsyncs than necessary will do little
>> except guarantee a less-than-optimal scheduling of the writes.

> I'm not proposing to issue any more fsyncs. I'm proposing to change the
> ordering so that instead of first writing all dirty buffers and then
> fsyncing all files, we'd write all buffers belonging to a file, fsync
> that file only, then write all buffers belonging to next file, fsync,
> and so forth.

But that means that the I/O to different files cannot be overlapped by
the kernel, even if it would be more efficient to do so.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 17:59:28
Message-ID: 46684780.6010903@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>
>> I'm not proposing to issue any more fsyncs. I'm proposing to change the
>> ordering so that instead of first writing all dirty buffers and then
>> fsyncing all files, we'd write all buffers belonging to a file, fsync
>> that file only, then write all buffers belonging to next file, fsync,
>> and so forth.
>
> But that means that the I/O to different files cannot be overlapped by
> the kernel, even if it would be more efficient to do so.

True. On the other hand, if we issue writes in essentially random order,
we might fill the kernel buffers with random blocks and the kernel needs
to flush them to disk as almost random I/O. If we did the writes in
groups, the kernel has a better chance of coalescing them.

I tend to agree that if the goal is to finish the checkpoint as quickly
as possible, the current approach is better. In the context of load
distributed checkpoints, however, it's unlikely the kernel can do any
significant overlapping since we're trickling the writes anyway.

Do we need both strategies?

I'm starting to feel we should give up on smoothing the fsyncs and
distribute the writes only, for 8.3. As we get more experience with that
and its shortcomings, we can enhance our checkpoints further in 8.4.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 18:58:50
Message-ID: Pine.GSO.4.64.0706071403560.2676@westnet.com
Lists: pgsql-hackers pgsql-patches

On Thu, 7 Jun 2007, Heikki Linnakangas wrote:

> So there's two extreme ways you can use LDC:
> 1. Finish the checkpoint as soon as possible, without disturbing other
> activity too much
> 2. Disturb other activity as little as possible, as long as the
> checkpoint finishes in a reasonable time.
> Are both interesting use cases, or is it enough to cater for just one of
> them? I think 2 is easier to tune.

The motivation for the (1) case is that you've got a system that's
dirtying the buffer cache very fast in normal use, where even the
background writer is hard pressed to keep the buffer pool clean. The
checkpoint is the most powerful and efficient way to clean up many dirty
buffers out of such a buffer cache in a short period of time so that
you're back to having room to work in again. In that situation, since
there are many buffers to write out, you'll also be suffering greatly from
fsync pauses. Being able to synchronize writes a little better with the
underlying OS to smooth those out is a huge help.

I'm completely biased because of the workloads I've been dealing with
recently, but I consider (2) so much easier to tune for that it's barely
worth worrying about. If your system is so underloaded that you can let
the checkpoints take their own sweet time, I'd ask if you have enough
going on that you're suffering very much from checkpoint performance
issues anyway. I'm used to being in a situation where if you don't push
out checkpoint data as fast as physically possible, you end up fighting
with the client backends for write bandwidth once the LRU point moves past
where the checkpoint has written out to already. I'm not sure how much
always running the LRU background writer will improve that situation.

> On a Linux system, one way to model it is that the OS flushes dirty buffers
> to disk at the same rate as we write them, but delayed by
> dirty_expire_centisecs. That should hold if the writes are spread out enough.

If they're really spread out, sure. There is congestion avoidance code
inside the Linux kernel that makes dirty_expire_centisecs not quite work
the way it is described under load. All you can say in the general case
is that when dirty_expire_centisecs has passed, the kernel badly wants to
write the buffers out as quickly as possible; that could still be many
seconds after the expiration time on a busy system, or on one with slow
I/O.

On every system I've ever played with Postgres write performance on, I
discovered that the memory-based parameters like dirty_background_ratio
were really driving write behavior, and I almost ignore the expire timeout
now. Plotting the "Dirty:" value in /proc/meminfo as you're running tests
is extremely informative for figuring out what Linux is really doing
underneath the database writes.
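
For anyone who wants to watch that while running tests, here's a trivial
poller for the value (plain C, illustration only; a shell loop over grep
does the same job):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char        line[256];

    for (;;)                            /* run until interrupted */
    {
        FILE       *f = fopen("/proc/meminfo", "r");

        if (f == NULL)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "Dirty:", 6) == 0)
                fputs(line, stdout);    /* e.g. "Dirty:   123456 kB" */
        fclose(f);
        fflush(stdout);
        sleep(1);
    }
}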

The influence of the congestion code is why I made the comment about
watching how long writes are taking to gauge how fast you can dump data
onto the disks. When you're suffering from one of the congestion
mechanisms, the initial writes start blocking, even before the fsync.
That behavior is almost undocumented outside of the relevant kernel source
code.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 19:28:27
Message-ID: 87y7ivef8k.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches

"Greg Smith" <gsmith(at)gregsmith(dot)com> writes:

> I'm completely biased because of the workloads I've been dealing with recently,
> but I consider (2) so much easier to tune for that it's barely worth worrying
> about. If your system is so underloaded that you can let the checkpoints take
> their own sweet time, I'd ask if you have enough going on that you're suffering
> very much from checkpoint performance issues anyway. I'm used to being in a
> situation where if you don't push out checkpoint data as fast as physically
> possible, you end up fighting with the client backends for write bandwidth once
> the LRU point moves past where the checkpoint has written out to already. I'm
> not sure how much always running the LRU background writer will improve that
> situation.

I think you're working from a faulty premise.

There's no relationship between the volume of writes and how important the
speed of checkpoint is. In either scenario you should assume a system that is
close to the max i/o bandwidth. The only question is which task the admin
would prefer take the hit for maxing out the bandwidth, the transactions or
the checkpoint.

You seem to have imagined that letting the checkpoint take longer will slow
down transactions. In fact that's precisely the effect we're trying to avoid.
Right now we're seeing tests where Postgres stops handling *any* transactions
for up to a minute. In virtually any real world scenario that would simply be
unacceptable.

That one-minute outage is a direct consequence of trying to finish the
checkpoint as quickly as possible. If we spread it out then it might increase
the average i/o load if you sum it up over time, but then you just need a
faster i/o controller.

The only scenario where you would prefer the absolute lowest i/o rate summed
over time would be if you were close to maxing out your i/o bandwidth,
couldn't buy a faster controller, and response time was not a factor, only
sheer volume of transactions processed mattered. That's a much less common
scenario than caring about the response time.

The flip side is that when you do have to worry about response time, buying a
faster controller doesn't even help. It would shorten the duration of the
checkpoint but not eliminate it. A 30-second outage every half hour is just as
unacceptable as a 1-minute outage every half hour.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 20:49:17
Message-ID: Pine.GSO.4.64.0706071602360.4005@westnet.com
Lists: pgsql-hackers pgsql-patches

On Thu, 7 Jun 2007, Gregory Stark wrote:

> You seem to have imagined that letting the checkpoint take longer will slow
> down transactions.

And you seem to have imagined that I have so much spare time that I'm just
making stuff up to entertain myself and sow confusion.

I observed some situations where delaying checkpoints too long ends up
slowing down both transaction rate and response time, using earlier
variants of the LDC patch and code with similar principles I wrote. I'm
trying to keep the approach used here out of the worst of the corner cases
I ran into, or at least to make it possible for people in those situations to
have some ability to tune out of the bad spots. I am unfortunately not
free to disclose all those test results, and since that project is over I
can't see how the current LDC compares to what I tested at the time.

I plainly stated I had a bias here, one that's not even close to the
average case. My concern here was that Heikki would end up optimizing in
a direction where a really wide spread across the active checkpoint
interval was strongly preferred. I wanted to offer some suggestions on
the type of situation where that might not be true, but where a different
tuning of LDC would still be an improvement over the current behavior.
There are some tuning knobs there that I don't want to see go away until
there's been a wider range of tests to prove they aren't effective.

> Right now we're seeing tests where Postgres stops handling *any* transactions
> for up to a minute. In virtually any real world scenario that would simply be
> unacceptable.

No doubt; I've seen things get close to that bad myself, both on the high
and low end. I collided with the issue in a situation of "maxing out your
i/o bandwidth, couldn't buy a faster controller" at one point, which is
what kicked off my working in this area. It turned out there were still
some software tunables left that pulled the worst case down to the 2-5
second range instead. With more checkpoint_segments to decrease the
frequency, that was just enough to make the problem annoying rather than
crippling. But after that, I could easily imagine a different application
scenario where the behavior you describe is the best case.

This is really a serious issue with the current design of the database,
one that merely changes instead of going away completely if you throw more
hardware at it. I'm perversely glad to hear this is torturing more people
than just me as it improves the odds the situation will improve.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 20:56:14
Message-ID: 466870EE.3090501@commandprompt.com
Lists: pgsql-hackers pgsql-patches


> This is really a serious issue with the current design of the database,
> one that merely changes instead of going away completely if you throw
> more hardware at it. I'm perversely glad to hear this is torturing more
> people than just me as it improves the odds the situation will improve.

It tortures pretty much any high-velocity PostgreSQL database, of which
there are more and more every day.

Joshua D. Drake

>
> --
> * Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .conf File Organization WAS: Controlling Load Distributed Checkpoints
Date: 2007-06-08 00:33:38
Message-ID: 4668A3E2.7090707@agliodbs.com
Lists: pgsql-hackers pgsql-patches

All,

This brings up another point. With the increased number of .conf
options, the file is getting hard to read again. I'd like to do another
reorganization, but I don't really want to break people's diff scripts.
Should I worry about that?

--Josh


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .conf File Organization WAS: Controlling Load Distributed Checkpoints
Date: 2007-06-08 00:43:18
Message-ID: 4668A626.5020603@commandprompt.com
Lists: pgsql-hackers pgsql-patches

Josh Berkus wrote:
> All,
>
> This brings up another point. With the increased number of .conf
> options, the file is getting hard to read again. I'd like to do another
> reorganization, but I don't really want to break people's diff scripts.
> Should I worry about that?

As a point of feedback, autovacuum and vacuum should be together.

Joshua D. Drake

>
> --Josh

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .conf File Organization WAS: Controlling Load Distributed Checkpoints
Date: 2007-06-08 01:19:50
Message-ID: 11503.1181265590@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> This brings up another point. With the increased number of .conf
> options, the file is getting hard to read again. I'd like to do another
> reorganization, but I don't really want to break people's diff scripts.

Do you have a better organizing principle than what's there now?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-08 08:50:49
Message-ID: 46691869.9070300@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Greg Smith wrote:
> On Thu, 7 Jun 2007, Heikki Linnakangas wrote:
>
>> So there's two extreme ways you can use LDC:
>> 1. Finish the checkpoint as soon as possible, without disturbing other
>> activity too much
>> 2. Disturb other activity as little as possible, as long as the
>> checkpoint finishes in a reasonable time.
>> Are both interesting use cases, or is it enough to cater for just one
>> of them? I think 2 is easier to tune.
>
> The motivation for the (1) case is that you've got a system that's
> dirtying the buffer cache very fast in normal use, where even the
> background writer is hard pressed to keep the buffer pool clean. The
> checkpoint is the most powerful and efficient way to clean up many dirty
> buffers out of such a buffer cache in a short period of time so that
> you're back to having room to work in again. In that situation, since
> there are many buffers to write out, you'll also be suffering greatly
> from fsync pauses. Being able to synchronize writes a little better
> with the underlying OS to smooth those out is a huge help.

ISTM the bgwriter just isn't working hard enough in that scenario.
Assuming we get the lru autotuning patch in 8.3, do you think there's
still merit in using the checkpoints that way?

> I'm completely biased because of the workloads I've been dealing with
> recently, but I consider (2) so much easier to tune for that it's barely
> worth worrying about. If your system is so underloaded that you can let
> the checkpoints take their own sweet time, I'd ask if you have enough
> going on that you're suffering very much from checkpoint performance
> issues anyway. I'm used to being in a situation where if you don't push
> out checkpoint data as fast as physically possible, you end up fighting
> with the client backends for write bandwidth once the LRU point moves
> past where the checkpoint has written out to already. I'm not sure how
> much always running the LRU background writer will improve that situation.

I'd think it eliminates the problem. Assuming we keep the LRU cleaning
running as usual, I don't see how writing faster during checkpoints
could ever be beneficial for concurrent activity. The more you write,
the less bandwidth there is available for others.

Doing the checkpoint as quickly as possible might be slightly better for
average throughput, but that's a different matter.

> On every system I've ever played with Postgres write performance on, I
> discovered that the memory-based parameters like dirty_background_ratio
> were really driving write behavior, and I almost ignore the expire
> timeout now. Plotting the "Dirty:" value in /proc/meminfo as you're
> running tests is extremely informative for figuring out what Linux is
> really doing underneath the database writes.

Interesting. I haven't touched any of the kernel parameters yet in my
tests. It seems we need to try different parameters and see how the
dynamics change. But we must also keep in mind that the average DBA doesn't
change any settings, and might not even be able or allowed to. That
means the defaults should work reasonably well without tweaking the OS
settings.

> The influence of the congestion code is why I made the comment about
> watching how long writes are taking to gauge how fast you can dump data
> onto the disks. When you're suffering from one of the congestion
> mechanisms, the initial writes start blocking, even before the fsync.
> That behavior is almost undocumented outside of the relevant kernel
> source code.

Yeah, that's controlled by dirty_ratio, if I've understood the
parameters correctly. If we spread out the writes enough, we shouldn't
hit that limit or congestion. That's the point of the patch.

Do you have time / resources to do testing? You've clearly spent a lot
of time on this, and I'd be very interested to see some actual numbers
from your tests with various settings.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-08 14:10:43
Message-ID: 20070608141043.GW17144@phlogiston.dyndns.org
Lists: pgsql-hackers pgsql-patches

On Fri, Jun 08, 2007 at 09:50:49AM +0100, Heikki Linnakangas wrote:

> dynamics change. But we must also keep in mind that average DBA doesn't
> change any settings, and might not even be able or allowed to. That
> means the defaults should work reasonably well without tweaking the OS
> settings.

Do you mean "change the OS settings" or something else? (I'm not
sure it's true in any case, because shared memory kernel settings
have to be fiddled with in many instances, but I thought I'd ask for
clarification.)

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Users never remark, "Wow, this software may be buggy and hard
to use, but at least there is a lot of code underneath."
--Damien Katz


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-08 14:21:10
Message-ID: 466965D6.3050100@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Andrew Sullivan wrote:
> On Fri, Jun 08, 2007 at 09:50:49AM +0100, Heikki Linnakangas wrote:
>
>> dynamics change. But we must also keep in mind that average DBA doesn't
>> change any settings, and might not even be able or allowed to. That
>> means the defaults should work reasonably well without tweaking the OS
>> settings.
>
> Do you mean "change the OS settings" or something else? (I'm not
> sure it's true in any case, because shared memory kernel settings
> have to be fiddled with in many instances, but I thought I'd ask for
> clarification.)

Yes, that's what I meant. An average DBA is not likely to change OS
settings.

You're right on the shmmax setting, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-08 14:33:50
Message-ID: Pine.GSO.4.64.0706081029510.5361@westnet.com
Lists: pgsql-hackers pgsql-patches

On Fri, 8 Jun 2007, Andrew Sullivan wrote:

> Do you mean "change the OS settings" or something else? (I'm not
> sure it's true in any case, because shared memory kernel settings
> have to be fiddled with in many instances, but I thought I'd ask for
> clarification.)

In a situation where a hosting provider of some sort is providing
PostgreSQL, they should know that parameters like SHMMAX need to be
increased before customers can create a larger installation. You'd expect
they'd take care of that as part of routine server setup. What wouldn't
be reasonable is to expect them to tune obscure parts of the kernel just
for your application.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-08 15:06:09
Message-ID: 20070608150609.GF17144@phlogiston.dyndns.org
Lists: pgsql-hackers pgsql-patches

On Fri, Jun 08, 2007 at 10:33:50AM -0400, Greg Smith wrote:
> they'd take care of that as part of routine server setup. What wouldn't
> be reasonable is to expect them to tune obscure parts of the kernel just
> for your application.

Well, I suppose it'd depend on what kind of hosting environment
you're in (if I'm paying for dedicated hosting, you better believe
I'm going to insist they tune the kernel the way I want), but you're
right that in shared hosting for $25/mo, it's not going to happen.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
"The year's penultimate month" is not in truth a good way of saying
November.
--H.W. Fowler


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-08 18:36:54
Message-ID: 200706081836.l58IasZ18010@momjian.us
Lists: pgsql-hackers pgsql-patches

Andrew Sullivan wrote:
> On Fri, Jun 08, 2007 at 10:33:50AM -0400, Greg Smith wrote:
> > they'd take care of that as part of routine server setup. What wouldn't
> > be reasonable is to expect them to tune obscure parts of the kernel just
> > for your application.
>
> Well, I suppose it'd depend on what kind of hosting environment
> you're in (if I'm paying for dedicated hosting, you better believe
> I'm going to insist they tune the kernel the way I want), but you're
> right that in shared hosting for $25/mo, it's not going to happen.

And consider other operating systems that don't have the same knobs. We
should tune as best we can first without kernel knobs.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-09 07:39:19
Message-ID: 20070609073919.GZ92628@nasby.net
Lists: pgsql-hackers pgsql-patches

On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> > Thinking about this whole idea a bit more, it occured to me that the
> > current approach to write all, then fsync all is really a historical
> > artifact of the fact that we used to use the system-wide sync call
> > instead of fsyncs to flush the pages to disk. That might not be the best
> > way to do things in the new load-distributed-checkpoint world.
>
> > How about interleaving the writes with the fsyncs?
>
> I don't think it's a historical artifact at all: it's a valid reflection
> of the fact that we don't know enough about disk layout to do low-level
> I/O scheduling. Issuing more fsyncs than necessary will do little
> except guarantee a less-than-optimal scheduling of the writes.

If we extended relations by more than 8k at a time, we would know a lot
more about disk layout, at least on filesystems with a decent amount of
free space.
--
Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-10 19:49:24
Message-ID: 466C55C4.80109@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Jim C. Nasby wrote:
> On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
>> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>>> Thinking about this whole idea a bit more, it occured to me that the
>>> current approach to write all, then fsync all is really a historical
>>> artifact of the fact that we used to use the system-wide sync call
>>> instead of fsyncs to flush the pages to disk. That might not be the best
>>> way to do things in the new load-distributed-checkpoint world.
>>> How about interleaving the writes with the fsyncs?
>> I don't think it's a historical artifact at all: it's a valid reflection
>> of the fact that we don't know enough about disk layout to do low-level
>> I/O scheduling. Issuing more fsyncs than necessary will do little
>> except guarantee a less-than-optimal scheduling of the writes.
>
> If we extended relations by more than 8k at a time, we would know a lot
> more about disk layout, at least on filesystems with a decent amount of
> free space.

I doubt it makes that much difference. If there was a significant amount
of fragmentation, we'd hear more complaints about seq scan performance.

The issue here is that we don't know which relations are on which drives
and controllers, how they're striped, mirrored etc.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-11 06:27:48
Message-ID: 20070611141111.8B5D.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:

> True. On the other hand, if we issue writes in essentially random order,
> we might fill the kernel buffers with random blocks and the kernel needs
> to flush them to disk as almost random I/O. If we did the writes in
> groups, the kernel has better chance at coalescing them.

If the kernel can treat sequential writes better than random writes,
is it worth sorting dirty buffers in block order per file at the start
of checkpoints? Here is the pseudo code:

buffers_to_be_written =
    SELECT buf_id, tag FROM BufferDescriptors
    WHERE (flags & BM_DIRTY) != 0 ORDER BY tag.rnode, tag.blockNum;

for { buf_id, tag } in buffers_to_be_written:
    if BufferDescriptors[buf_id].tag == tag:
        FlushBuffer(&BufferDescriptors[buf_id])

With this method we can also avoid writing buffers that were newly dirtied
after the checkpoint started.
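
For comparison, the same idea as a C sketch; the struct and helpers here are
simplified stand-ins, not the real bufmgr data structures:

#include <stdlib.h>

/* simplified stand-ins for the buffer tag and bufmgr helpers */
typedef struct
{
    int         buf_id;
    unsigned    rel;            /* stands in for the tag's rnode */
    unsigned    blockNum;       /* stands in for the tag's block number */
} DirtyEntry;

extern int  buffer_still_matches(int buf_id, unsigned rel, unsigned blockNum);
extern void flush_buffer(int buf_id);

static int
dirty_entry_cmp(const void *a, const void *b)
{
    const DirtyEntry *x = a;
    const DirtyEntry *y = b;

    if (x->rel != y->rel)
        return (x->rel < y->rel) ? -1 : 1;
    if (x->blockNum != y->blockNum)
        return (x->blockNum < y->blockNum) ? -1 : 1;
    return 0;
}

/* sort the snapshot of dirty buffers, then write them out in file/block order */
static void
write_sorted(DirtyEntry *entries, int n)
{
    int         i;

    qsort(entries, n, sizeof(DirtyEntry), dirty_entry_cmp);
    for (i = 0; i < n; i++)
    {
        /* skip buffers whose tag changed (buffer reused) since the initial scan */
        if (buffer_still_matches(entries[i].buf_id, entries[i].rel,
                                 entries[i].blockNum))
            flush_buffer(entries[i].buf_id);
    }
}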

> I tend to agree that if the goal is to finish the checkpoint as quickly
> as possible, the current approach is better. In the context of load
> distributed checkpoints, however, it's unlikely the kernel can do any
> significant overlapping since we're trickling the writes anyway.

Some kernels or storage subsystems treat all I/Os too fairly, so that user
transactions waiting for reads are blocked by checkpoint writes. That behavior
is unavoidable, but we can split the writes into small batches.

> I'm starting to feel we should give up on smoothing the fsyncs and
> distribute the writes only, for 8.3. As we get more experience with that
> and it's shortcomings, we can enhance our checkpoints further in 8.4.

I agree with distributing only the writes for 8.3. The new parameters
introduced by it (checkpoint_write_percent and checkpoint_write_min_rate)
should survive without major changes in the future, but the other
parameters seem more volatile.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-11 07:51:51
Message-ID: Pine.GSO.4.64.0706110316020.9600@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:

> If the kernel can treat sequential writes better than random writes, is
> it worth sorting dirty buffers in block order per file at the start of
> checkpoints?

I think it has the potential to improve things. There are three obvious
and one subtle argument against it I can think of:

1) Extra complexity for something that may not help. This would need some
good, robust benchmarking improvements to justify its use.

2) Block number ordering may not reflect actual order on disk. While
true, it's got to be better correlated with it than writing at random.

3) The OS disk elevator should be dealing with this issue, particularly
because it may really know the actual disk ordering.

Here's the subtle thing: by writing in the same order the LRU scan occurs
in, you are writing dirty buffers in the optimal fashion to eliminate
client backend writes during BuferAlloc. This makes the checkpoint a
really effective LRU clearing mechanism. Writing in block order will
change that.

I spent some time trying to optimize the elevator part of this operation,
since I knew that on the system I was using block order was actual order.
I found that under Linux, the behavior of the pdflush daemon that manages
dirty memory had a more serious impact on writing behavior at checkpoint
time than playing with the elevator scheduling method did. The way
pdflush works actually has several interesting implications for how to
optimize this patch. For example, how writes get blocked when the dirty
memory reaches certain thresholds means that you may not get the full
benefit of the disk elevator at checkpoint time the way most would expect.

Since much of that was basically undocumented, I had to write my own
analysis of the actual workings, which is now available at
http://www.westnet.com/~gsmith/content/linux-pdflush.htm I hope anyone who
wants more information about how Linux kernel parameters like
dirty_background_ratio actually work, and how they impact the writing
strategy, will find that article uniquely helpful.

> Some kernels or storage subsystems treat all I/Os too fairly so that
> user transactions waiting for reads are blocked by checkpoints writes.

In addition to that (which I've seen happen quite a bit), in the Linux
case another fairness issue is that the code that handles writes allows a
single process writing a lot of data to block writes for everyone else.
That means that in addition to being blocked on actual reads, if a client
backend starts a write in order to complete a buffer allocation to hold
new information, that can grind to a halt because of the checkpoint
process as well.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-11 09:27:30
Message-ID: 466D1582.8080503@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
>> True. On the other hand, if we issue writes in essentially random order,
>> we might fill the kernel buffers with random blocks and the kernel needs
>> to flush them to disk as almost random I/O. If we did the writes in
>> groups, the kernel has better chance at coalescing them.
>
> If the kernel can treat sequential writes better than random writes,
> is it worth sorting dirty buffers in block order per file at the start
> of checkpoints? Here is the pseudo code:
>
> buffers_to_be_written =
> SELECT buf_id, tag FROM BufferDescriptors
> WHERE (flags & BM_DIRTY) != 0 ORDER BY tag.rnode, tag.blockNum;
> for { buf_id, tag } in buffers_to_be_written:
> if BufferDescriptors[buf_id].tag == tag:
> FlushBuffer(&BufferDescriptors[buf_id])
>
> We can also avoid writing buffers newly dirtied after the checkpoint was
> started with this method.

That's worth testing, IMO. Probably won't happen for 8.3, though.

>> I tend to agree that if the goal is to finish the checkpoint as quickly
>> as possible, the current approach is better. In the context of load
>> distributed checkpoints, however, it's unlikely the kernel can do any
>> significant overlapping since we're trickling the writes anyway.
>
> Some kernels or storage subsystems treat all I/Os too fairly so that user
> transactions waiting for reads are blocked by checkpoints writes. It is
> unavoidable behavior though, but we can split writes in small batches.

That's really the heart of our problems. If the kernel had support for
prioritizing the normal backend activity and LRU cleaning over the
checkpoint I/O, we wouldn't need to throttle the I/O ourselves. The
kernel has the best knowledge of what it can and can't do, and how busy
the I/O subsystems are. Recent Linux kernels have some support for read
I/O priorities, but not for writes.

I believe the best long term solution is to add that support to the
kernel, but it's going to take a long time until that's universally
available, and we have a lot of platforms to support.

>> I'm starting to feel we should give up on smoothing the fsyncs and
>> distribute the writes only, for 8.3. As we get more experience with that
>> and it's shortcomings, we can enhance our checkpoints further in 8.4.
>
> I agree with the only writes distribution for 8.3. The new parameters
> introduced by it (checkpoint_write_percent and checkpoint_write_min_rate)
> will continue to be alive without major changes in the future, but other
> parameters seem to be volatile.

I'm going to start testing with just distributing the writes. Let's see
how far that gets us.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .conf File Organization
Date: 2007-06-12 19:49:06
Message-ID: 200706121249.06313.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom,

> Do you have a better organizing principle than what's there now?

It's mostly detail stuff: putting VACUUM and Autovac together, and breaking up
some subsections that now have too many options in them into smaller groups.

Client Connection Defaults has somehow become a catchall section for *any*
USERSET variable, regardless of purpose. I'd like to trim it back down and
assign some of those variables to appropriate sections.

On a more hypothetical basis, I was thinking of adding a section at the top
with the 7-9 most common options that people *need* to set; this would make
postgresql.conf much more accessible but would result in duplicate options,
which might cause some issues.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .conf File Organization
Date: 2007-06-12 19:51:46
Message-ID: 26035.1181677906@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> On the more hypothetical basis I was thinking of adding a section at the top
> with the 7-9 most common options that people *need* to set; this would make
> PostgreSQL.conf much more accessable but would result in duplicate options
> which might cause some issues.

Doesn't sound like a good idea, but maybe there's a case for a comment
there saying "these are the most important ones to look at"?

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .conf File Organization
Date: 2007-06-12 20:08:01
Message-ID: 200706121308.01469.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom,

> Doesn't sound like a good idea, but maybe there's a case for a comment
> there saying "these are the most important ones to look at"?

Yeah, probably need to do that. Seems user-unfriendly, but loading a foot gun
by having some options appear twice in the file seems much worse. I'll also
add some notes on how to set these values.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-13 18:05:23
Message-ID: 20070613180523.GJ92628@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, Jun 10, 2007 at 08:49:24PM +0100, Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
> >On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
> >>Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> >>>Thinking about this whole idea a bit more, it occured to me that the
> >>>current approach to write all, then fsync all is really a historical
> >>>artifact of the fact that we used to use the system-wide sync call
> >>>instead of fsyncs to flush the pages to disk. That might not be the best
> >>>way to do things in the new load-distributed-checkpoint world.
> >>>How about interleaving the writes with the fsyncs?
> >>I don't think it's a historical artifact at all: it's a valid reflection
> >>of the fact that we don't know enough about disk layout to do low-level
> >>I/O scheduling. Issuing more fsyncs than necessary will do little
> >>except guarantee a less-than-optimal scheduling of the writes.
> >
> >If we extended relations by more than 8k at a time, we would know a lot
> >more about disk layout, at least on filesystems with a decent amount of
> >free space.
>
> I doubt it makes that much difference. If there was a significant amount
> of fragmentation, we'd hear more complaints about seq scan performance.
>
> The issue here is that we don't know which relations are on which drives
> and controllers, how they're striped, mirrored etc.

Actually, isn't pre-allocation one of the tricks that Greenplum uses to
get its seqscan performance?
--
Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-13 22:04:57
Message-ID: 46706A09.4020504@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
>> On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
>>> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>>>> Thinking about this whole idea a bit more, it occured to me that the
>>>> current approach to write all, then fsync all is really a historical
>>>> artifact of the fact that we used to use the system-wide sync call
>>>> instead of fsyncs to flush the pages to disk. That might not be the
>>>> best way to do things in the new load-distributed-checkpoint world.
>>>> How about interleaving the writes with the fsyncs?
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>>
>> If we extended relations by more than 8k at a time, we would know a lot
>> more about disk layout, at least on filesystems with a decent amount of
>> free space.
>
> I doubt it makes that much difference. If there was a significant amount
> of fragmentation, we'd hear more complaints about seq scan performance.

OTOH, extending a relation that uses N pages by something like
min(ceil(N/1024), 1024) pages might help some filesystems to
avoid fragmentation, and would hardly introduce any waste (about 0.1%
in the worst case). So if it's not too hard to do, it might
be worthwhile, even if it turns out that most filesystems deal
well with the current allocation pattern.
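
For illustration, that sizing rule as a small C helper (a sketch only;
extension_size is a hypothetical name, not an existing function):

#include "postgres.h"
#include "storage/block.h"

/* how many blocks to add when extending a relation that has nblocks pages */
static BlockNumber
extension_size(BlockNumber nblocks)
{
    /* grow by ceil(N/1024) blocks, i.e. roughly 0.1% of the current size... */
    BlockNumber grow = (nblocks + 1023) / 1024;

    /* ...but always by at least 1 block and by at most 1024 blocks */
    return Min(Max(grow, 1), 1024);
}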

greetings, Florian Pflug


From: PFC <lists(at)peufeu(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-13 22:09:02
Message-ID: op.ttvrtciocigqcu@apollo13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

>> >If we extended relations by more than 8k at a time, we would know a lot
>> >more about disk layout, at least on filesystems with a decent amount of
>> >free space.
>>
>> I doubt it makes that much difference. If there was a significant amount
>> of fragmentation, we'd hear more complaints about seq scan performance.
>>
>> The issue here is that we don't know which relations are on which drives
>> and controllers, how they're striped, mirrored etc.
>
> Actually, isn't pre-allocation one of the tricks that Greenplum uses to
> get it's seqscan performance?

My tests here show that, at least on reiserfs, after a few hours of
benchmark torture (this represents several million write queries), table
files become significantly fragmented. I believe the table and index files
get extended more or less simultaneously and end up somehow a bit mixed up
on disk. Seq scan perf suffers. reiserfs doesn't have an excellent
fragmentation behaviour... NTFS is worse than hell in this respect. So,
pre-alloc could be a good idea. Brutal Defrag (cp /var/lib/postgresql to
somewhere and back) gets seq scan perf back to disk throughput.

Also, by the way, InnoDB uses a BTree organized table. The advantage is
that data is always clustered on the primary key (which means you have to
use something as your primary key that isn't necessarily "natural", you have
to choose it to get good clustering, and you can't always do it right, so
it somehow, in the end, sucks rather badly). Anyway, seq-scan on InnoDB is
very slow because, as the btree grows (just like postgres indexes) pages
are split and scanning the pages in btree order becomes a mess of seeks.
So, seq scan in InnoDB is very very slow unless periodic OPTIMIZE TABLE is
applied. (caveat to the postgres TODO item "implement automatic table
clustering"...)


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Subject: Sorted writes in checkpoint
Date: 2007-06-14 07:39:37
Message-ID: 20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > If the kernel can treat sequential writes better than random writes, is
> > it worth sorting dirty buffers in block order per file at the start of
> > checkpoints?

I wrote and tested the attached sorted-writes patch based on Heikki's
ldc-justwrites-1.patch. There was an obvious performance win on an OLTP workload.

tests | pgbench | DBT-2 response time (avg/90%/max)
---------------------------+---------+-----------------------------------
LDC only | 181 tps | 1.12 / 4.38 / 12.13 s
+ BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s
+ Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s

(*) Don't write buffers that were dirtied after starting the checkpoint.

machine : 2GB-ram, SCSI*4 RAID-5
pgbench : -s400 -t40000 -c10 (about 5GB of database)
DBT-2 : 60WH (about 6GB of database)

> I think it has the potential to improve things. There are three obvious
> and one subtle argument against it I can think of:
>
> 1) Extra complexity for something that may not help. This would need some
> good, robust benchmarking improvements to justify its use.

Exactly. I think we need a discussion board for I/O performance issues.
Can I use the Developers Wiki for this purpose? Since performance graphs and
result tables are important for the discussion, it might be better
than mailing lists, which are text-based.

> 2) Block number ordering may not reflect actual order on disk. While
> true, it's got to be better correlated with it than writing at random.
> 3) The OS disk elevator should be dealing with this issue, particularly
> because it may really know the actual disk ordering.

Yes, both are true. However, I think there is a pretty high correlation
between those orderings. In addition, we should use the filesystem to ensure
those orderings correspond to each other. For example, pre-allocation
of files might help us, as has often been discussed.

> Here's the subtle thing: by writing in the same order the LRU scan occurs
> in, you are writing dirty buffers in the optimal fashion to eliminate
> client backend writes during BuferAlloc. This makes the checkpoint a
> really effective LRU clearing mechanism. Writing in block order will
> change that.

The issue will probably go away after we have LDC, because it writes LRU
buffers during checkpoints.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment Content-Type Size
sorted-ckpt.patch application/octet-stream 4.2 KB

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "PFC" <lists(at)peufeu(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-14 11:40:51
Message-ID: 87d4zywypo.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"PFC" <lists(at)peufeu(dot)com> writes:

> Anyway, seq-scan on InnoDB is very slow because, as the btree grows (just
> like postgres indexes) pages are split and scanning the pages in btree order
> becomes a mess of seeks. So, seq scan in InnoDB is very very slow unless
> periodic OPTIMIZE TABLE is applied. (caveat to the postgres TODO item
> "implement automatic table clustering"...)

Heikki already posted a patch which goes a long way towards implementing what
I think this patch refers to: trying to maintain the cluster ordering on
updates and inserts.

It does it without changing the basic table structure at all. On updates and
inserts it consults the indexam of the clustered index to ask it for a
suggested block. If the index's suggested block has enough free space then the
tuple is put there.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-14 11:45:21
Message-ID: 878xamwyi6.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


"ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:

> Exactly. I think we need a discussion board for I/O performance issues.
> Can I use Developers Wiki for this purpose? Since performance graphs and
> result tables are important for the discussion, so it might be better
> than mailing lists, that are text-based.

I would suggest keeping the discussion on mail and including links to refer to
charts and tables in the wiki.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-14 13:22:06
Message-ID: 467140FE.1080404@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro wrote:
> Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
>> On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
>>> If the kernel can treat sequential writes better than random writes, is
>>> it worth sorting dirty buffers in block order per file at the start of
>>> checkpoints?
>
> I wrote and tested the attached sorted-writes patch base on Heikki's
> ldc-justwrites-1.patch. There was obvious performance win on OLTP workload.
>
> tests | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
> LDC only | 181 tps | 1.12 / 4.38 / 12.13 s
> + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s
> + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2 : 60WH (about 6GB of database)

Wow, I didn't expect that much gain from the sorted writes. How was LDC
configured?

>> 3) The OS disk elevator should be dealing with this issue, particularly
>> because it may really know the actual disk ordering.

Yeah, but we don't give the OS that much chance to coalesce writes when
we spread them out.

>> Here's the subtle thing: by writing in the same order the LRU scan occurs
>> in, you are writing dirty buffers in the optimal fashion to eliminate
>> client backend writes during BuferAlloc. This makes the checkpoint a
>> really effective LRU clearing mechanism. Writing in block order will
>> change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.

I think so too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-14 15:58:33
Message-ID: Pine.GSO.4.64.0706141140200.14861@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, 14 Jun 2007, ITAGAKI Takahiro wrote:

> I think we need a discussion board for I/O performance issues. Can I use
> Developers Wiki for this purpose? Since performance graphs and result
> tables are important for the discussion, so it might be better than
> mailing lists, that are text-based.

I started pushing some of my stuff over there recently to make it
easier to edit and so other people can expand on it with their expertise.
http://developer.postgresql.org/index.php/Buffer_Cache%2C_Checkpoints%2C_and_the_BGW
is what I've done so far on this particular topic.

What I would like to see on the Wiki first are pages devoted to how to run
the common benchmarks people use for useful performance testing. A recent
thread on one of the lists reminded me how easy it is to get worthless
results out of DBT2 if you don't have any guidance on that. I've already
got a stack of documentation about how to wrestle with pgbench and am
generating more.

The problem with using the Wiki as the main focus is that when you get to
the point that you want to upload detailed test results, that interface
really isn't appropriate for it. For example, in the last day I've
collected up data from about 400 short tests runs that generated 800
graphs. It's all organized as HTML so you can drill down into the
specific tests that executed oddly. Heikki's DBT2 results are similar; not
as many files, because he's running longer tests, but the navigation is
even more complicated.

There is no way to easily put that type and level of information into the
Wiki page. You really just need a web server to copy the results onto.
Then the main problem you have to be concerned about is a repeat of the
OSDL situation, where all the results just disappear if their hosting
sponsor goes away.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-14 17:50:17
Message-ID: 1181843417.5776.118.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch base on Heikki's
> ldc-justwrites-1.patch. There was obvious performance win on OLTP workload.
>
> tests | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
> LDC only | 181 tps | 1.12 / 4.38 / 12.13 s
> + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s
> + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2 : 60WH (about 6GB of database)

I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
of writes has been saved by doing that? We would expect a small
percentage of blocks only and so that shouldn't make a significant
difference. I thought we discussed this before, about a year ago. It
would be easy to get that wrong and to avoid writing a block that had
been re-dirtied after the start of checkpoint, but was already dirty
beforehand. How long was the write phase of the checkpoint, how long
between checkpoints?

I can see the sorted writes having an effect because the OS may not
receive blocks within a sufficient time window to fully optimise them.
That effect would grow with increasing sizes of shared_buffers and
decrease with size of controller cache. How big was the shared buffers
setting? What OS scheduler are you using? The effect would be greatest
when using Deadline.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-15 02:37:14
Message-ID: e692861c0706141937p47212c5y4ac6177ecd086430@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 6/14/07, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> > Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> >
> > > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > > If the kernel can treat sequential writes better than random writes, is
> > > > it worth sorting dirty buffers in block order per file at the start of
> > > > checkpoints?
> >
> > I wrote and tested the attached sorted-writes patch base on Heikki's
> > ldc-justwrites-1.patch. There was obvious performance win on OLTP workload.
> >
> > tests | pgbench | DBT-2 response time (avg/90%/max)
> > ---------------------------+---------+-----------------------------------
> > LDC only | 181 tps | 1.12 / 4.38 / 12.13 s
> > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s
> > + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s
> >
> > (*) Don't write buffers that were dirtied after starting the checkpoint.
> >
> > machine : 2GB-ram, SCSI*4 RAID-5
> > pgbench : -s400 -t40000 -c10 (about 5GB of database)
> > DBT-2 : 60WH (about 6GB of database)
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
> of writes has been saved by doing that? We would expect a small
> percentage of blocks only and so that shouldn't make a significant
> difference. I thought we discussed this before, about a year ago. It
> would be easy to get that wrong and to avoid writing a block that had
> been re-dirtied after the start of checkpoint, but was already dirty
> beforehand. How long was the write phase of the checkpoint, how long
> between checkpoints?
>
> I can see the sorted writes having an effect because the OS may not
> receive blocks within a sufficient time window to fully optimise them.
> That effect would grow with increasing sizes of shared_buffers and
> decrease with size of controller cache. How big was the shared buffers
> setting? What OS scheduler are you using? The effect would be greatest
> when using Deadline.

Linux has some instrumentation that might be useful for this testing:

echo 1 > /proc/sys/vm/block_dump

will have the kernel log all physical IO (disable syslog writing to
disk before turning it on if you don't want the system to blow up).

Certainly the OS elevator should be working well enough to not see
that much of an improvement. Perhaps frequent fsync behavior is having an
unintended interaction with the elevator? ... It might be worthwhile
to contact some Linux kernel developers and see if there is some
misunderstanding.


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-15 04:53:41
Message-ID: Pine.GSO.4.64.0706150040280.2986@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, 14 Jun 2007, Gregory Maxwell wrote:

> Linux has some instrumentation that might be useful for this testing,
> echo 1 > /proc/sys/vm/block_dump

That bit was developed for tracking down who was spinning the hard drive
up out of power saving mode, and I was under the impression that very
rough feature isn't useful at all here. I just tried to track down again
where I got that impression from, and I think it was this thread:

http://linux.slashdot.org/comments.pl?sid=231817&cid=18832379

This mentions general issues figuring out who was responsible for a write
and specifically mentions how you'll have to reconcile two different paths
if fsync is mixed in. Not saying it won't work, it's just obvious using
the block_dump output isn't a simple job.

(For anyone who would like an intro to this feature, try
http://www.linuxjournal.com/node/7539/print and
http://toadstool.se/journal/2006/05/27/monitoring-filesystem-activity-under-linux-with-block_dump
)

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-15 09:14:20
Message-ID: E1539E0ED7043848906A8FF995BDA579022414E0@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


> > tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > --------------------------+---------+-----------------------------------
> > LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s
> > + Sorted writes           | 224 tps | 0.36 / 0.80 / 8.11 s
> >
> > (*) Don't write buffers that were dirtied after starting the checkpoint.
> >
> > machine : 2GB-ram, SCSI*4 RAID-5
> > pgbench : -s400 -t40000 -c10 (about 5GB of database)
> > DBT-2 : 60WH (about 6GB of database)
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What
> percentage of writes has been saved by doing that? We would
> expect a small percentage of blocks only and so that
> shouldn't make a significant difference. I thought we

Wouldn't pages that are dirtied during the checkpoint also usually be
rather hot?
Thus, if we lock one of those for writing, the chances are high that a
client will need to wait for the lock.
An OS write call should usually be very fast, but when the I/O gets
bottlenecked it can easily become slower.

Probably the recent result, that it saves ~53% of the writes, is
sufficient explanation though.

Very nice results :-) Looks like we want all of it including the sort.

Andreas


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-15 09:33:47
Message-ID: 20070615175302.7602.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


"Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:

> > tests | pgbench | DBT-2 response time (avg/90%/max)
> > ---------------------------+---------+-----------------------------------
> > LDC only | 181 tps | 1.12 / 4.38 / 12.13 s
> > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s
> > + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
> of writes has been saved by doing that?
> How long was the write phase of the checkpoint, how long
> between checkpoints?
>
> I can see the sorted writes having an effect because the OS may not
> receive blocks within a sufficient time window to fully optimise them.
> That effect would grow with increasing sizes of shared_buffers and
> decrease with size of controller cache. How big was the shared buffers
> setting? What OS scheduler are you using? The effect would be greatest
> when using Deadline.

I didn't tune OS parameters; I used the default values.
In terms of cache amounts, postgres buffers were larger than the kernel
write pool and the controller cache. That's why the OS could not optimise
writes enough during checkpoints, I think.

- 200MB <- RAM * dirty_background_ratio
- 128MB <- Controller cache
- 2GB <- postgres shared_buffers

I forgot to gather detailed I/O information in the tests.
I'll retry it and report later.

RAM 2GB
Controller cache 128MB
shared_buffers 1GB
checkpoint_timeout = 15min
checkpoint_write_percent = 50.0

RHEL4 (Linux 2.6.9-42.0.2.EL)
vm.dirty_background_ratio = 10
vm.dirty_ratio = 40
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
Using cfq io scheduler

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Subject: Re: Sorted writes in checkpoint
Date: 2007-06-15 10:55:02
Message-ID: 1181904902.17734.7.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Fri, 2007-06-15 at 18:33 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:
>
> > > tests | pgbench | DBT-2 response time (avg/90%/max)
> > > ---------------------------+---------+-----------------------------------
> > > LDC only | 181 tps | 1.12 / 4.38 / 12.13 s
> > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s
> > > + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s
> >
> > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
> > of writes has been saved by doing that?
> > How long was the write phase of the checkpoint, how long
> > between checkpoints?
> >
> > I can see the sorted writes having an effect because the OS may not
> > receive blocks within a sufficient time window to fully optimise them.
> > That effect would grow with increasing sizes of shared_buffers and
> > decrease with size of controller cache. How big was the shared buffers
> > setting? What OS scheduler are you using? The effect would be greatest
> > when using Deadline.
>
> I didn't tune OS parameters, used default values.
> In terms of cache amounts, postgres buffers were larger than kernel
> write pool and controller cache. that's why the OS could not optimise
> writes enough in checkpoint, I think.
>
> - 200MB <- RAM * dirty_background_ratio
> - 128MB <- Controller cache
> - 2GB <- postgres shared_buffers
>
> I forget to gather detail I/O information in the tests.
> I'll retry it and report later.
>
> RAM 2GB
> Controller cache 128MB
> shared_buffers 1GB
> checkpoint_timeout = 15min
> checkpoint_write_percent = 50.0
>
> RHEL4 (Linux 2.6.9-42.0.2.EL)
> vm.dirty_background_ratio = 10
> vm.dirty_ratio = 40
> vm.dirty_expire_centisecs = 3000
> vm.dirty_writeback_centisecs = 500
> Using cfq io scheduler

Sounds like sorting the buffers before checkpoint is going to be a win
once we go above about ~128MB. We can do a simple test on NBuffers,
rather than have a sort_blocks_at_checkpoint (!) GUC.
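
Something like this minimal sketch, say (the 128MB cutoff and the helper
name are only illustrative):

#include "postgres.h"
#include "storage/bufmgr.h"

/* sort checkpoint writes only when shared_buffers is large enough to matter */
static bool
checkpoint_should_sort(void)
{
    return (Size) NBuffers * BLCKSZ > (Size) 128 * 1024 * 1024;
}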

But it does seem there is a win for larger settings of shared_buffers.

Does performance go up in the non-sorted case if we make shared_buffers
smaller? Sounds like it might. We should check that first.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Subject: Re: Sorted writes in checkpoint
Date: 2008-03-11 20:05:01
Message-ID: 200803112005.m2BK51325629@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Added to TODO:

* Consider sorting writes during checkpoint

http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch base on Heikki's
> ldc-justwrites-1.patch. There was obvious performance win on OLTP workload.
>
> tests | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
> LDC only | 181 tps | 1.12 / 4.38 / 12.13 s
> + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s
> + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2 : 60WH (about 6GB of database)
>
>
> > I think it has the potential to improve things. There are three obvious
> > and one subtle argument against it I can think of:
> >
> > 1) Extra complexity for something that may not help. This would need some
> > good, robust benchmarking improvements to justify its use.
>
> Exactly. I think we need a discussion board for I/O performance issues.
> Can I use Developers Wiki for this purpose? Since performance graphs and
> result tables are important for the discussion, so it might be better
> than mailing lists, that are text-based.
>
>
> > 2) Block number ordering may not reflect actual order on disk. While
> > true, it's got to be better correlated with it than writing at random.
> > 3) The OS disk elevator should be dealing with this issue, particularly
> > because it may really know the actual disk ordering.
>
> Yes, both are true. However, I think there is pretty high correlation
> in those orderings. In addition, we should use filesystem to assure
> those orderings correspond to each other. For example, pre-allocation
> of files might help us, as has often been discussed.
>
>
> > Here's the subtle thing: by writing in the same order the LRU scan occurs
> > in, you are writing dirty buffers in the optimal fashion to eliminate
> > client backend writes during BuferAlloc. This makes the checkpoint a
> > really effective LRU clearing mechanism. Writing in block order will
> > change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-patches(at)postgresql(dot)org
Subject: Sorting writes during checkpoint
Date: 2008-04-15 09:19:43
Message-ID: 20080415181742.6C97.52131E4D@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Here is a patch for TODO item, "Consider sorting writes during checkpoint".
It writes dirty buffers in the order of block number during checkpoint
so that buffers are written sequentially.

I proposed the patch before, but it was rejected because 8.3 feature
has been frozen already at that time.
http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

I rewrote it to apply cleanly against current HEAD, but the concept
is not changed at all -- memorize pairs of (buf_id, BufferTag) for each
dirty buffer into a palloc-ed array at the start of the checkpoint, sort
the array in BufferTag order, and write the buffers in that order.

There is a 10% performance win in pgbench on my machine with RAID-0
disks. There could be more benefit on RAID-5 disks, because random writes
are slower than sequential writes there.

[HEAD]
tps = 1134.233955 (excluding connections establishing)
[HEAD with patch]
tps = 1267.446249 (excluding connections establishing)

[pgbench]
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: simple
number of clients: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000

[hardware]
2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)

[postgresql.conf]
shared_buffers = 2GB
wal_buffers = 4MB
checkpoint_segments = 64
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment Content-Type Size
sorted-ckpt-84.patch application/octet-stream 2.9 KB

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-04-15 13:16:40
Message-ID: Pine.GSO.4.64.0804150851090.9688@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote:

> 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)

What is the disk controller in this system? I'm specifically curious
about what write cache was involved, so I can get a better feel for the
hardware your results came from.

I'm busy rebuilding my performance testing systems right now, once that's
done I can review this on a few platforms. One thing that jumped out at
me just reading the code is this happening inside BufferSync:

buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));

If shared_buffers(=NBuffers) is set to something big, this could give some
memory churn. And I think it's a bad idea to allocate something this
large at checkpoint time, because what happens if that fails? Really not
the time you want to discover there's no RAM left.

Since you're always going to need this much memory for the system to
operate, and the current model has the system running a checkpoint >50% of
the time, the only thing that makes sense to me is to allocate it at
server start time once and be done with it. That should improve
performance over the original patch as well.
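
A minimal sketch of what I mean, assuming the array is only ever used by the
bgwriter (the names are illustrative, and BufAndTag is the struct from the
patch):

#include "postgres.h"

static BufAndTag *ckpt_sort_workspace = NULL;

/* called once when the background writer starts up */
static void
AllocCheckpointSortWorkspace(void)
{
    /* NBuffers is fixed for the life of the postmaster, so one allocation suffices */
    ckpt_sort_workspace = (BufAndTag *) malloc(NBuffers * sizeof(BufAndTag));
    if (ckpt_sort_workspace == NULL)
        elog(FATAL, "out of memory allocating checkpoint sort workspace");
}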

BufAndTag is a relatively small structure (5 ints). Let's call it 40
bytes; even that's only a 0.5% overhead relative to the shared buffer
allocation. If we can speed checkpoints significantly with that much
overhead it sounds like a good tradeoff to me.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-04-16 04:22:13
Message-ID: 20080416125802.78C9.52131E4D@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote:
>
> > 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)
>
> What is the disk controller in this system? I'm specifically curious
> about what write cache was involved, so I can get a better feel for the
> hardware your results came from.

I used HP ProLiant DL380 G5 with Smart Array P400 with 256MB cache
(http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/15351-15351-3328412-241644-241475-1121516.html)
and ext3fs on LVM of CentOS 5.1 (Linux version 2.6.18-53.el5).
Dirty region of database was probably larger than disk controller's cache.

> buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));
>
> If shared_buffers(=NBuffers) is set to something big, this could give some
> memory churn. And I think it's a bad idea to allocate something this
> large at checkpoint time, because what happens if that fails? Really not
> the time you want to discover there's no RAM left.

Hmm, but I think we need to copy buffer tags into bgwriter's local memory
in order to avoid locking tags many times during the sorting. Is it better to
allocate the sorting buffer the first time and keep and reuse it from then on?

> BufAndTag is a relatively small structure (5 ints). Let's call it 40
> bytes; even that's only a 0.5% overhead relative to the shared buffer
> allocation. If we can speed checkpoints significantly with that much
> overhead it sounds like a good tradeoff to me.

I think sizeof(BufAndTag) is 20 bytes because sizeof(int) is 4 on typical
platforms (and if not, I should rewrite the patch so that it always is).
It is 0.25% of shared buffers; when shared_buffers is set to 10GB,
it takes 25MB of process local memory. If we want to consume less memory
for it, the RelFileNode in BufferTag could be hashed and packed into an integer;
the blockNum order is important for this purpose, but the RelFileNode order is not.
That would reduce the overhead to 12 bytes per page (0.15%). Is it worth doing?
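
The packed entry would look something like this (a sketch only;
PackedBufAndTag is a hypothetical name):

#include "postgres.h"
#include "storage/block.h"

typedef struct
{
    int         buf_id;      /* 4 bytes */
    uint32      rnode_hash;  /* 4 bytes: hash of RelFileNode, only groups blocks of one file */
    BlockNumber blockNum;    /* 4 bytes: the part whose order actually matters */
} PackedBufAndTag;           /* 12 bytes per dirty buffer instead of 20 */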

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-04-16 22:02:38
Message-ID: Pine.GSO.4.64.0804161753290.13762@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, 16 Apr 2008, ITAGAKI Takahiro wrote:

> Dirty region of database was probably larger than disk controller's cache.

Might be worthwhile to run with log_checkpoints on and collect some
statistics there next time you're running these tests. It's a good habit
to get other testers into regardless; it's nice to be able to say
something like "during the 15 checkpoints encountered during this test,
the largest dirty area was 516MB while the median was 175MB".

> Hmm, but I think we need to copy buffer tags into bgwriter's local memory
> in order to avoid locking taga many times in the sorting. Is it better to
> allocate sorting buffers at the first time and keep and reuse it from then on?

That's what I was thinking: allocate the memory when the background writer
starts and just always have it there; the allocation you're doing is
always the same size. If it's in use 50% of the time anyway (which it is
if you have checkpoint_completion_target at its default), why introduce
the risk that an allocation will fail at checkpoint time? Just allocate
it once and keep it around.

> It is 0.25% of shared buffers; when shared_buffers is set to 10GB,
> it takes 25MB of process local memory.

Your numbers are probably closer to correct. I was being pessimistic
about the size of all the integers just to demonstrate that it's not
really a significant amount of memory even if they're large.

> If we want to consume less memory for it, RelFileNode in BufferTag could
> be hashed and packed into an integer

I personally don't feel it's worth making the code any more complicated
than it needs to be just to save a fraction of a percent of the total
memory used by the buffer pool.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-05-04 04:40:19
Message-ID: 4421.1209876019@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
>> If shared_buffers(=NBuffers) is set to something big, this could give some
>> memory churn. And I think it's a bad idea to allocate something this
>> large at checkpoint time, because what happens if that fails? Really not
>> the time you want to discover there's no RAM left.

> Hmm, but I think we need to copy buffer tags into bgwriter's local memory
> in order to avoid locking taga many times in the sorting.

I updated this patch to permanently allocate the working array as Greg
suggests, and to fix a bunch of commenting issues (attached).

However, I am completely unable to measure any performance improvement
from it. Given the possible risk of out-of-memory failures, I think the
patch should not be applied without some direct proof of performance
benefits, and I don't see any.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 7.1 KB

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-05-04 23:12:32
Message-ID: Pine.GSO.4.64.0805041905030.14259@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 4 May 2008, Tom Lane wrote:

> However, I am completely unable to measure any performance improvement
> from it. Given the possible risk of out-of-memory failures, I think the
> patch should not be applied without some direct proof of performance
> benefits, and I don't see any.

Fair enough. There were some pgbench results attached to the original
patch submission that gave me a good idea how to replicate the situation
where there's some improvement. I expect I can take a shot at quantifying
that independently near the end of this month if nobody else gets to it
before then (I'm stuck sorting out a number of OS-level issues right now
before my testing system is online again). I was planning to take a longer
look at Greg Stark's prefetching work at that point as well.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-05-04 23:35:46
Message-ID: 12673.1209944146@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Sun, 4 May 2008, Tom Lane wrote:
>> However, I am completely unable to measure any performance improvement
>> from it. Given the possible risk of out-of-memory failures, I think the
>> patch should not be applied without some direct proof of performance
>> benefits, and I don't see any.

> Fair enough. There were some pgbench results attached to the original
> patch submission that gave me a good idea how to replicate the situation
> where there's some improvement.

Well, I tried a pgbench test similar to that one --- on smaller hardware
than was reported, so it was a bit smaller test case, but it should have
given similar results. I didn't see any improvement; if anything it was
a bit worse. So that's what got me concerned.

Of course it's notoriously hard to get consistent numbers out of pgbench
anyway, so I'd rather see some other test case ...

> I expect I can take a shot at quantifying
> that independantly near the end of this month if nobody else gets to it
> before then (I'm stuck sorting out a number of OS level issue right now
> before my testing system is online again). Was planning to take a longer
> look at Greg Stark's prefetching work at that point as well.

Fair enough. Unless someone can volunteer to test sooner, I think we
should drop this item from the current commitfest queue.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-05-05 02:43:13
Message-ID: Pine.GSO.4.64.0805042207410.6226@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 4 May 2008, Tom Lane wrote:

> Well, I tried a pgbench test similar to that one --- on smaller hardware
> than was reported, so it was a bit smaller test case, but it should have
> given similar results.

My pet theory on cases where sorting will help suggests you may need a
write-caching controller for this patch to be useful. I expect we'll see
the biggest improvement in situations where the total amount of dirty
buffers is larger than the write cache and the cache becomes blocked. If
you're not offloading to another device like that, the OS-level elevator
sorting will handle sorting for you close enough to optimally that I doubt
this will help much (and in fact may just get in the way).

> Of course it's notoriously hard to get consistent numbers out of pgbench
> anyway, so I'd rather see some other test case ...

I have some tools that run pgbench many times and look for patterns, which
works fairly well for the consistency part. pgbench will dirty a very
high percentage of the buffer cache by checkpoint time relative to how
much work it does, which makes it close to a best case for confirming
there is a potential improvement here.

I think a reasonable approach is to continue trying to quantify some
improvement using pgbench with an eye toward also doing DBT2 tests, which
provoke similar behavior at checkpoint time. I suspect someone who
already has a known good DBT2 lab setup with caching controller hardware
(EDB?) might be able to do a useful test of this patch without too much
trouble on their part.

> Unless someone can volunteer to test sooner, I think we should drop this
> item from the current commitfest queue.

This patch took a good step forward toward being committed this round with
your review, which is the important part from my perspective (as someone
who would like this to be committed if it truly works). I expect that
performance related patches will often take more than one commitfest to
pass through.

From the perspective of keeping the committer's plates clean, a reasonable
system for this situation might be for you to bounce this into the
rejected pile as "Returned for testing" immediately, to clearly remove it
from the main queue. A reasonable expectation there is that you might
consider it again during May if someone gets back with said testing
results before the 'fest ends.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-05-05 04:23:55
Message-ID: 15746.1209961435@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Sun, 4 May 2008, Tom Lane wrote:
>> Well, I tried a pgbench test similar to that one --- on smaller hardware
>> than was reported, so it was a bit smaller test case, but it should have
>> given similar results.

> ... If
> you're not offloading to another device like that, the OS-level elevator
> sorting will handle sorting for you close enough to optimally that I doubt
> this will help much (and in fact may just get in the way).

Yeah. It bothers me a bit that the patch forces writes to be done "all
of file A in order, then all of file B in order, etc". We don't know
enough about the disk layout of the files to be sure that that's good.
(This might also mean that whether there is a win is going to be
platform and filesystem dependent ...)

>> Unless someone can volunteer to test sooner, I think we should drop this
>> item from the current commitfest queue.

> From the perspective of keeping the committer's plates clean, a reasonable
> system for this situation might be for you to bounce this into the
> rejected pile as "Returned for testing" immediately, to clearly remove it
> from the main queue. A reasonable expectation there is that you might
> consider it again during May if someone gets back with said testing
> results before the 'fest ends.

Right, that's in the ground rules for commitfests: if the submitter can
respond to complaints before the fest is over, we'll reconsider the
patch.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-05-05 05:37:28
Message-ID: Pine.GSO.4.64.0805050118001.24473@westnet.com
Lists: pgsql-hackers pgsql-patches

On Mon, 5 May 2008, Tom Lane wrote:

> It bothers me a bit that the patch forces writes to be done "all of file
> A in order, then all of file B in order, etc". We don't know enough
> about the disk layout of the files to be sure that that's good. (This
> might also mean that whether there is a win is going to be platform and
> filesystem dependent ...)

I think most platform and filesystem implementations have disk location
correlated enough with block order that this particular issue isn't a
large one. If the writes are mainly going to one logical area (a single
partition or disk array), it should be a win as long as the sorting step
itself isn't introducing a delay. I am concerned that in a more
complicated case than pgbench, where the writes are spread across multiple
arrays, say, forcing writes in order may slow things down.

Example: let's say there are two tablespaces mapped to two arrays, A and B,
that the data is being written to at checkpoint time. In the current
case, that I/O might be AABAABABBBAB, which is going to keep both arrays
busy writing. The sorted case would instead make that AAAAAABBBBBB so
only one array will be active at a time. It may very well be the case
that the improvement from lowering seeks on the writes to A and B is less
than the loss coming from not keeping both continuously busy.

I think I can simulate this by using a modified pgbench script that works
against accounts1 and accounts2 tables with equal frequency, where the two
tables are actually on different tablespaces on two disks.

> Right, that's in the ground rules for commitfests: if the submitter can
> respond to complaints before the fest is over, we'll reconsider the
> patch.

The small optimization I was trying to suggest was that you just bounce
this type of patch automatically to the "rejected for <x>" section of the
commitfest wiki page in cases like these. The standard practice on this
sort of queue is to automatically reclassify when someone has made a pass
over the patch, leaving it to the original submitter to re-open it with more
information. That keeps the unprocessed part of the queue always
shrinking, and as long as people know that they can get it reconsidered by
submitting new results it's not unfair to them.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-07-04 08:37:10
Message-ID: 1215160630.4051.19.camel@ebony.site
Lists: pgsql-hackers pgsql-patches


On Mon, 2008-05-05 at 00:23 -0400, Tom Lane wrote:
> Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> > On Sun, 4 May 2008, Tom Lane wrote:
> >> Well, I tried a pgbench test similar to that one --- on smaller hardware
> >> than was reported, so it was a bit smaller test case, but it should have
> >> given similar results.
>
> > ... If
> > you're not offloading to another device like that, the OS-level elevator
> > sorting will handle sorting for you close enough to optimally that I doubt
> > this will help much (and in fact may just get in the way).
>
> Yeah. It bothers me a bit that the patch forces writes to be done "all
> of file A in order, then all of file B in order, etc". We don't know
> enough about the disk layout of the files to be sure that that's good.
> (This might also mean that whether there is a win is going to be
> platform and filesystem dependent ...)

No action on this seen since last commitfest, but I think we should do
something with it, rather than just ignore it.

Agree with all comments myself, so my proposed solution is to implement
this as an I/O elevator hook. The standard elevator issues the writes in
the order they come; an additional elevator in contrib sorts them by file
and block. That will make testing easier and will also give Itagaki his
benefit, while allowing ongoing research. If this solution's good enough for
Linux it ought to be good enough for us.

Note that if we do this for checkpoint we should also do this for
FlushRelationBuffers(), used during heap_sync(), for exactly the same
reasons.

Would suggest calling it bulk_io_hook() or similar.
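
A minimal sketch of what that shape could look like; bulk_io_hook,
BufAndTag and file_block_elevator are made-up names for illustration, not
existing code:

#include <stdlib.h>

/*
 * Nothing below exists in the tree; it only sketches the shape of the
 * proposal.
 */
typedef struct BufAndTag
{
    unsigned int spcNode;       /* tablespace */
    unsigned int dbNode;        /* database */
    unsigned int relNode;       /* relation */
    unsigned int blockNum;      /* block within the relation */
    int          buf_id;        /* shared buffer holding the dirty page */
} BufAndTag;

/*
 * The hook: given the list of buffers a checkpoint (or
 * FlushRelationBuffers) is about to write, reorder it however the loaded
 * elevator prefers.  NULL means "issue the writes in arrival order".
 */
typedef void (*bulk_io_hook_type) (BufAndTag *writes, int nwrites);

bulk_io_hook_type bulk_io_hook = NULL;

/* A contrib elevator doing what Itagaki's patch does: sort by file, then block. */
static int
file_block_cmp(const void *a, const void *b)
{
    const BufAndTag *x = (const BufAndTag *) a;
    const BufAndTag *y = (const BufAndTag *) b;

    if (x->spcNode != y->spcNode)
        return (x->spcNode < y->spcNode) ? -1 : 1;
    if (x->dbNode != y->dbNode)
        return (x->dbNode < y->dbNode) ? -1 : 1;
    if (x->relNode != y->relNode)
        return (x->relNode < y->relNode) ? -1 : 1;
    if (x->blockNum != y->blockNum)
        return (x->blockNum < y->blockNum) ? -1 : 1;
    return 0;
}

void
file_block_elevator(BufAndTag *writes, int nwrites)
{
    qsort(writes, nwrites, sizeof(BufAndTag), file_block_cmp);
}

A contrib module's _PG_init() would then just set bulk_io_hook =
file_block_elevator, and other orderings (per-tablespace round robin, say)
could be tried the same way without touching the backend again.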

A further observation is that if there is an effect, it would
be at the block-device level, i.e. per tablespace. Sorting the writes so
that we issued one tablespace at a time might at least help the I/O
elevators/disk caches to work with the whole problem at once. We might
get benefit on one tablespace but not on another.

Sorting by file might have inadvertently shown benefit at the tablespace
level on a larger server with spread-out data, whereas on Tom's test
system I would guess just a single tablespace was used.

Anyway, I note that we don't have an easy way of sorting by tablespace,
but I'm sure it would be possible to look up the tablespace for a file
within a plugin.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-07-04 16:05:54
Message-ID: 23541.1215187554@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Anyway, I note that we don't have an easy way of sorting by tablespace,

Say what? tablespace is the first component of relfilenode.
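
For reference, a minimal excerpt of that structure as I recall the 8.3-era
src/include/storage/relfilenode.h (comments abridged):

typedef unsigned int Oid;        /* as in postgres_ext.h */

typedef struct RelFileNode
{
    Oid         spcNode;         /* tablespace */
    Oid         dbNode;          /* database */
    Oid         relNode;         /* relation */
} RelFileNode;

/*
 * Because the tablespace OID is the first field, a comparator that compares
 * the fields in declaration order already groups buffers by tablespace
 * before anything else.
 */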

> but I'm sure it would be possible to look up the tablespace for a file
> within a plugin.

If the information weren't readily available from relfilenode, it would
*not* be possible for a bufmgr plugin to look it up. bufmgr is much too
low-level to be dependent on performing catalog lookups.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-07-04 16:22:23
Message-ID: 1215188543.4051.228.camel@ebony.site
Lists: pgsql-hackers pgsql-patches


On Fri, 2008-07-04 at 12:05 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Anyway, I note that we don't have an easy way of sorting by tablespace,
>
> Say what? tablespace is the first component of relfilenode.

OK, that's a mistake... what about the rest of the idea?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Sorting writes during checkpoint
Date: 2008-07-07 01:29:15
Message-ID: 20080707095158.73DF.52131E4D@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

(Go back to -hackers)

Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> No action on this seen since last commitfest, but I think we should do
> something with it, rather than just ignore it.

I plan to test it on RAID-5 disks, where sequential writes are much
better than random writes. I'll send the results as evidence.

Also, I have an idea related to sorting writes. The smoothed checkpoint in
8.3 spreads out the write() calls, but issues all the fsync() calls at once.
With sorted writes, we can call fsync() segment by segment, after the dirty
pages contained in each segment have been written. It could improve
worst-case response time during checkpoints.
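
A toy sketch of that idea, assuming the writes are already sorted; the
DirtyPage structure and the write_one()/fsync_segment() stubs are
placeholders, not the real bufmgr/md interfaces:

#include <stdio.h>

#define RELSEG_SIZE 131072      /* blocks per 1 GB segment with 8 kB pages */

typedef struct DirtyPage
{
    unsigned int relNode;       /* relation file (illustrative key only) */
    unsigned int blockNum;      /* block number within the relation */
} DirtyPage;

/* Stand-ins for the real write and fsync paths. */
static void
write_one(const DirtyPage *p)
{
    printf("write rel %u blk %u\n", p->relNode, p->blockNum);
}

static void
fsync_segment(unsigned int rel, unsigned int seg)
{
    printf("fsync rel %u seg %u\n", rel, seg);
}

/*
 * pages[] must already be sorted by (relNode, blockNum), which is exactly
 * what the sorted-writes patch provides.  Each 1 GB segment is fsync'd as
 * soon as its last dirty page has been written, instead of fsyncing every
 * file at the very end of the checkpoint.
 */
static void
write_then_sync_per_segment(DirtyPage *pages, int npages)
{
    int         i;

    for (i = 0; i < npages; i++)
    {
        unsigned int seg = pages[i].blockNum / RELSEG_SIZE;
        int          last_of_segment;

        write_one(&pages[i]);

        last_of_segment = (i + 1 == npages) ||
            pages[i + 1].relNode != pages[i].relNode ||
            pages[i + 1].blockNum / RELSEG_SIZE != seg;

        if (last_of_segment)
            fsync_segment(pages[i].relNode, seg);
    }
}

int
main(void)
{
    DirtyPage   pages[] = {
        {16384, 10}, {16384, 200000},   /* same relation, two segments */
        {16400, 5}                      /* a second relation */
    };

    write_then_sync_per_segment(pages, 3);
    return 0;
}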

> Note that if we do this for checkpoint we should also do this for
> FlushRelationBuffers(), used during heap_sync(), for exactly the same
> reasons.

Ah, I overlooked FlushRelationBuffers(). Its writes are worth sorting too.

> Would suggest calling it bulk_io_hook() or similar.

I think we need to reconsider the "bufmgr - smgr - md" layering, not only
an I/O elevator hook. If we are to have spread-out fsync(), bufmgr needs to
know where the file segment boundaries are. Unfortunately, that breaks the
separation between bufmgr and md in the current architecture.

In addition, the current smgr layer is completely useless because
it cannot be extended dynamically and cannot handle multiple md-layer
modules. I would rather merge current smgr and part of bufmgr into
a new smgr and add smgr_hook() than bulk_io_hook().
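
A rough sketch of one possible shape for such a hook; every name below is
hypothetical (the existing smgr dispatches through a fixed smgrsw[] table
in smgr.c, which is what makes it hard to extend):

/* Purely illustrative: a dynamically pluggable storage manager. */
typedef struct SmgrOps
{
    void        (*smgr_write) (unsigned int rel, unsigned int blocknum,
                               const char *buffer);
    void        (*smgr_sync_segment) (unsigned int rel, unsigned int segno);
    /* open/close/extend/truncate/nblocks would go here too */
} SmgrOps;

/*
 * A hook a loadable module could set from _PG_init() to wrap or replace the
 * built-in md.c implementation; the built-in table is passed in so the
 * module can chain to it.
 */
typedef const SmgrOps *(*smgr_hook_type) (const SmgrOps *builtin);

smgr_hook_type smgr_hook = NULL;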

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-07-10 01:39:29
Message-ID: Pine.GSO.4.64.0807092023510.8953@westnet.com
Lists: pgsql-hackers pgsql-patches

On Fri, 4 Jul 2008, Simon Riggs wrote:

> No action on this seen since last commitfest, but I think we should do
> something with it, rather than just ignore it.

Just no action worth reporting yet. Over the weekend I finally reached
the point where I've got a system that should be capable of independently
replicating the setup where the improvement was reported, and I've started
performance testing of the patch. Getting useful checkpoint test results
from pgbench is really a pain.

> Sorting by file might have inadvertently shown benefit at the tablespace
> level on a larger server with spread out data whereas on Tom's test
> system I would guess just a single tablespace was used.

I doubt this has anything to do with it, only because the pgbench schema
doesn't split into tablespaces usefully. Almost all of the real action is
on a single table, accounts.

My suspicion is that sorting only benefits in situations where you have a
disk controller with a significant amount of RAM on it, but the server RAM
is much larger. In that case the sorting horizon of the controller itself
is smaller than what the server can do, and the sorting makes it less
likely you'll end up with the controller filled with unsorted stuff that
takes a long time to clear.

In Tom's test, there's probably only 8 or 16MB worth of cache on the disk
itself, so you can't get a large backlog of unsorted writes clogging the
write pipeline. But most server systems have 256MB or more of RAM there,
and if you get that filled with seek-heavy writes (which might only clear
at a couple of MB a second) the delay for that cache to empty can be
considerable.

That said, I've got a 256MB controller here and have a very similar disk
setup to the one positive results were reported on, but so far I don't see
any significant difference after applying the sorted writes patch.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-07-10 07:06:12
Message-ID: 1215673572.4051.1198.camel@ebony.2ndQuadrant
Lists: pgsql-hackers pgsql-patches


On Wed, 2008-07-09 at 21:39 -0400, Greg Smith wrote:
> On Fri, 4 Jul 2008, Simon Riggs wrote:
>
> > No action on this seen since last commitfest, but I think we should
> > do something with it, rather than just ignore it.
>
> Just no action worth reporting yet. Over the weekend I finally
> reached the point where I've got a system that should be capable of
> independently replicating the setup where the improvement was reported, and I've
> started performance testing of the patch. Getting useful checkpoint
> test results from pgbench is really a pain.

I agree completely. That's why I've suggested a plugin approach. That
way Itagaki can have his performance, the rest of us don't need to fret,
and yet we hold the door open indefinitely for additional ways of doing
it. And we can test it on production systems with realistic workloads.
If one clear way emerges as best, we adopt that plugin permanently.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sorting writes during checkpoint
Date: 2008-07-16 05:19:22
Message-ID: Pine.GSO.4.64.0807092139390.8953@westnet.com
Lists: pgsql-hackers pgsql-patches

On Mon, 7 Jul 2008, ITAGAKI Takahiro wrote:

> I plan to test it on RAID-5 disks, where sequential writes are much
> better than random writes. I'll send the results as evidence.

If you're running more tests here, please turn on log_checkpoints and
collect the logs while the test is running. I'm really curious if there's
any significant difference in what that reports here in the sorted case
vs. the regular one.

> The smoothed checkpoint in 8.3 spreads out the write() calls, but issues
> all the fsync() calls at once. With sorted writes, we can call fsync()
> segment by segment, after the dirty pages contained in each segment have
> been written. It could improve worst-case response time during checkpoints.

Further decreasing the amount of data that is fsync'd at any point in time
might be a bigger improvement than just the sorting itself is doing (so
far I haven't seen anything really significant just from the sort but am
still testing).

One thing I didn't see any comments from you on is how/if the sorted
writes patch lowers worst-case latency. That's the area I'd hope an
improved fsync protocol would help most with, rather than TPS, which might
even go backwards because writes won't be as bunched and therefore will
have more seeking. It's easy enough to analyze the data coming from
"pgbench -l" to figure that out; example shell snipped that shows just the
worst ones:

pgbench -l -N <db> &
p=$!
wait $p
mv pgbench_log.${p} pgbench.log
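# field 3 of each per-transaction log line is the latency; show the ten worst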
cat pgbench.log | cut -f 3 -d " " | sort -n | tail

Actually graphing the latencies can be even more instructive; I have some
examples of that on my web page that you may have seen before.

> In addition, the current smgr layer is completely useless because
> it cannot be extended dynamically and cannot handle multiple md-layer
> modules. I would rather merge current smgr and part of bufmgr into
> a new smgr and add smgr_hook() than bulk_io_hook().

I don't know the code here well enough to have a firm opinion on this
specific suggestion, but I will say that I've found the amount of layering
in this area makes it difficult to understand just what's going on
sometimes (especially when new to it). A lot of that abstraction felt a
bit pass-through to me, and anything that would collapse it a bit would
be helpful for streamlining the code instrumentation going on with things
like dtrace.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD