From: Dimitri Fontaine <dim(at)dalibo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 10:25:41
Message-ID: 200702191125.45772.dim@dalibo.com
Lists: pgsql-hackers

Hi list,

Here's a proposal for an idea that stole a good part of my night.
I'll present the idea first, then two use cases with some rationale and a few
details. Please note I won't be able to participate in any development
effort associated with this idea, should such a thing happen!

The bare idea is to provide a way to 'attach' multiple storage facilities (say
volumes) to a given tablespace. Each volume may be attached in READ ONLY,
READ WRITE or WRITE ONLY mode.
You can mix RW and WO volumes in the same tablespace, but can't mix RO with
any W form, or so I think.

It would be pretty handy to be able to add and remove volumes on a live
cluster, and this could be a way to implement moving/extending tablespaces.

Use Case A: better read performance while keeping data write reliability

The first application of this multiple volumes per tablespace idea is to keep
a tablespace both in RAM (tmpfs or ramfs) and on disk (both RW).

PG should then be able to read from either volume when handling read
queries, and would have to fwrite()/fsync() both volumes for each write.
Of course, write speed will be constrained by the slowest volume, but the
quicker one would be able to absorb some of the read queries in the
meantime.

It would be neat if PG were able to account for each volume's relative write
speed in order to assign weights to the volumes of a tablespace, and have the
planner or executor spread read queries among volumes depending on that.
For example, if a single query has a plan containing several full scans (of
indexes and/or tables) in the same tablespace, those could be done on
different volumes.

Use Case B: Synchronous Master Slave(s) Replication

By using a Distributed File System capable of being mounted from several nodes
at the same time, we could have a configuration where a master node has
('exports') a WO tablespace volume, and one or more slaves (depending on FS
capability) configure a RO tablespace volume.

PG then has to be able to cope with a RO volume: the data are not written by
PG itself (from the local node's point of view), so some limitations would
certainly apply.
Will it be possible, for example, to add indexes to the data on slaves?
I'd use the solution even without this, though...

When the master/slave link is broken, the master can no longer write to the
tablespace, as if it were a local disk failure of some sort, so this should
prevent nasty desync problems: data is written on all W volumes or data is
not written at all.

I realize this proposal is the first draft of work to be done, and that I
won't be able to do much more than draft this idea. This mail is sent
to the hackers list in the hope someone there will find it worth
considering and polishing...

Regards, and thanks for the good work ;)
--
Dimitri Fontaine


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Dimitri Fontaine <dim(at)dalibo(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 11:58:47
Message-ID: 20070219115847.GE30737@svana.org
Lists: pgsql-hackers

On Mon, Feb 19, 2007 at 11:25:41AM +0100, Dimitri Fontaine wrote:
> Hi list,
>
> Here's a proposal for an idea that stole a good part of my night.
> I'll present the idea first, then two use cases with some rationale and a few
> details. Please note I won't be able to participate in any development
> effort associated with this idea, should such a thing happen!
>
> The bare idea is to provide a way to 'attach' multiple storage facilities (say
> volumes) to a given tablespace. Each volume may be attached in READ ONLY,
> READ WRITE or WRITE ONLY mode.
> You can mix RW and WO volumes into the same tablespace, but can't have RO with
> any W form, or so I think.

Somehow this seems like implementing RAID within postgres, which seems
a bit outside of the scope of a DB.

> Use Case A: better read performances while keeping data write reliability
>
> The first application of this multiple volumes per tablespace idea is to keep
> a tablespace both into RAM (tmpfs or ramfs) and on disk (both RW).

For example, I don't believe there is a restriction against having one
member of a RAID array be a RAM disk.

> Use Case B: Synchronous Master Slave(s) Replication
>
> By using a Distributed File System capable of being mounted from several nodes
> at the same time, we could have a configuration where a master node has
> ('exports') a WO tablespace volume, and one or more slaves (depending on FS
> capability) configures a RO tablespace volume.

Here you have the problem of row visibility. The data in the table isn't
very useful without the clog, and that's not stored in a tablespace...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Dimitri Fontaine <dim(at)dalibo(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 15:33:24
Message-ID: 9811.1171899204@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> Somehow this seems like implementing RAID within postgres,

RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.

regards, tom lane


From: Dimitri Fontaine <dim(at)dalibo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 16:10:36
Message-ID: 200702191710.36553.dim@dalibo.com
Lists: pgsql-hackers

Le lundi 19 février 2007 16:33, Tom Lane a écrit :
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > Somehow this seems like implementing RAID within postgres,
>
> RAID and LVM too. I can't get excited about re-inventing those wheels
> when perfectly good implementations already exist for us to sit on top of.

I thought moving some knowledge about data availability into PostgreSQL code
could provide a valuable performance benefit, allowing it to organize reads
(for example, parallel table/index scans against different volumes) and to
obtain data from the volume known to be quicker (or least loaded).

You're both saying RAID/LVM implementations provide good enough performance
that PG doesn't have to go this way, if I understand correctly.
And distributed file systems are enough to handle the replication side, without
PG having to deal explicitly with the work involved.

Maybe I should have slept after all ;)

Thanks for your time and comments, regards,
--
Dimitri Fontaine


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dim(at)dalibo(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 16:20:11
Message-ID: 13427.1171902011@sss.pgh.pa.us
Lists: pgsql-hackers

Dimitri Fontaine <dim(at)dalibo(dot)com> writes:
> You're both saying RAID/LVM implementations provide good enough performance
> that PG doesn't have to go this way, if I understand correctly.

There's certainly no evidence to suggest that reimplementing them
ourselves would be a productive use of our time.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Dimitri Fontaine <dim(at)dalibo(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 16:27:51
Message-ID: 20070219162750.GF30737@svana.org
Lists: pgsql-hackers

On Mon, Feb 19, 2007 at 05:10:36PM +0100, Dimitri Fontaine wrote:
> > RAID and LVM too. I can't get excited about re-inventing those wheels
> > when perfectly good implementations already exist for us to sit on top of.
>
> I thought moving some knowledge about data availability into PostgreSQL code
> could provide a valuable performance benefit, allowing it to organize reads
> (for example, parallel table/index scans against different volumes) and to
> obtain data from the volume known to be quicker (or least loaded).

Well, organising requests to be handled quickly is a function of
LVM/RAID, so we don't go there. However, speeding up scans by having
multiple requests is an interesting approach, as would perhaps a
different random_page_cost for different tablespaces.

My point is, don't try to implement the mechanics of LVM/RAID in
postgres; instead, work on providing ways for users to take advantage
of these mechanisms if they have them. Look at it this way: given that
you have an LVM/RAID setup for your ideas, how do you get postgres to
take advantage of it?

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Dimitri Fontaine <dim(at)dalibo(dot)com>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 18:34:55
Message-ID: 200702191934.56521.peter_e@gmx.net
Lists: pgsql-hackers

Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > Somehow this seems like implementing RAID within postgres,
>
> RAID and LVM too. I can't get excited about re-inventing those
> wheels when perfectly good implementations already exist for us to
> sit on top of.

I expect that someone will point out that Windows doesn't support RAID
or LVM, and we'll have to reimplement it anyway.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Dimitri Fontaine <dim(at)dalibo(dot)com>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 18:39:21
Message-ID: 45D9EED9.7020907@kaltenbrunner.cc
Lists: pgsql-hackers

Peter Eisentraut wrote:
> Tom Lane wrote:
>> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>>> Somehow this seems like implementing RAID within postgres,
>> RAID and LVM too. I can't get excited about re-inventing those
>> wheels when perfectly good implementations already exist for us to
>> sit on top of.
>
> I expect that someone will point out that Windows doesn't support RAID
> or LVM, and we'll have to reimplement it anyway.

Windows has supported software RAID just fine since Windows 2000 or so ...

Stefan


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Dimitri Fontaine <dim(at)dalibo(dot)com>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 18:39:35
Message-ID: 45D9EEE7.5050106@hagander.net
Lists: pgsql-hackers

Peter Eisentraut wrote:
> Tom Lane wrote:
>> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>>> Somehow this seems like implementing RAID within postgres,
>> RAID and LVM too. I can't get excited about re-inventing those
>> wheels when perfectly good implementations already exist for us to
>> sit on top of.
>
> I expect that someone will point out that Windows doesn't support RAID
> or LVM, and we'll have to reimplement it anyway.

Windows supports both RAID and LVM.

//Magnus


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Dimitri Fontaine <dim(at)dalibo(dot)com>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 19:07:32
Message-ID: 200702192007.33786.peter_e@gmx.net
Lists: pgsql-hackers

Magnus Hagander wrote:
> Windows supports both RAID and LVM.

Oh good, so we've got that on record. :)

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Dimitri Fontaine <dim(at)dalibo(dot)com>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 19:33:19
Message-ID: 45D9FB7F.5000001@commandprompt.com
Lists: pgsql-hackers

Stefan Kaltenbrunner wrote:
> Peter Eisentraut wrote:
>> Tom Lane wrote:
>>> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>>>> Somehow this seems like implementing RAID within postgres,
>>> RAID and LVM too. I can't get excited about re-inventing those
>>> wheels when perfectly good implementations already exist for us to
>>> sit on top of.
>> I expect that someone will point out that Windows doesn't support RAID
>> or LVM, and we'll have to reimplement it anyway.
>
> windows supports software raid just fine since Windows 2000 or so ...

Longer than that... it supported mirroring and raid 5 in NT4 and
possibly even NT3.51 IIRC.

Joshua D. Drake

>
>
> Stefan
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 19:50:34
Message-ID: 20070219195034.GB9448@phlogiston.dyndns.org
Lists: pgsql-hackers

On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > Somehow this seems like implementing RAID within postgres,
>
> RAID and LVM too. I can't get excited about re-inventing those wheels
> when perfectly good implementations already exist for us to sit on top of.

Ok, warning, this is a "you know what would be sweet" moment.

What would be nice is to be able to detach one of the volumes, and
know the span of the data in there without being able to access the
data.

The problem that a lot of warehouse operators have is something like
this: "We know we have all this data, but we don't know what we will
want to do with it later. So keep it all. I'll get back to you when
I want to know something."

It'd be nice to be able to load up all that data once, and then shunt
it off into (say) read-only media. If one could then run a query
that would tell one which spans of data are candidates for the
search, you could bring back online (onto reasonably fast storage,
for instance) just the volumes you need to read.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Users never remark, "Wow, this software may be buggy and hard
to use, but at least there is a lot of code underneath."
--Damien Katz


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 19:54:19
Message-ID: 45DA006B.8040802@commandprompt.com
Lists: pgsql-hackers

Andrew Sullivan wrote:
> On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
>> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>>> Somehow this seems like implementing RAID within postgres,
>> RAID and LVM too. I can't get excited about re-inventing those wheels
>> when perfectly good implementations already exist for us to sit on top of.
>
> Ok, warning, this is a "you know what would be sweet" moment.

The dreaded words from a developers mouth to every manager in the world.

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 19:57:02
Message-ID: Pine.GSO.4.64.0702191446000.11463@westnet.com
Lists: pgsql-hackers

On Mon, 19 Feb 2007, Joshua D. Drake wrote:

> Longer than that... it supported mirroring and raid 5 in NT4 and
> possibly even NT3.51 IIRC.

Mirroring and RAID 5 go back to Windows NT 3.1 Advanced Server in 1993:

http://support.microsoft.com/kb/114779
http://www.byte.com/art/9404/sec8/art7.htm

The main source of confusion about current support for this feature is
that the desktop/workstation versions of Windows don't have it. For
Windows XP, you need the XP Professional version to get "dynamic disk"
support; it's not in the home edition.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 20:08:22
Message-ID: 200702192008.l1JK8Md24270@momjian.us
Lists: pgsql-hackers

Joshua D. Drake wrote:
> Andrew Sullivan wrote:
> > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> >> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> >>> Somehow this seems like implementing RAID within postgres,
> >> RAID and LVM too. I can't get excited about re-inventing those wheels
> >> when perfectly good implementations already exist for us to sit on top of.
> >
> > Ok, warning, this is a "you know what would be sweet" moment.
>
> The dreaded words from a developers mouth to every manager in the world.

Yea, I just instinctively hit "delete" when I saw that phrase.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 20:35:37
Message-ID: 20070219203537.GA28395@alvh.no-ip.org
Lists: pgsql-hackers

Andrew Sullivan wrote:
> On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> > Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > > Somehow this seems like implementing RAID within postgres,
> >
> > RAID and LVM too. I can't get excited about re-inventing those wheels
> > when perfectly good implementations already exist for us to sit on top of.
>
> Ok, warning, this is a "you know what would be sweet" moment.
>
> What would be nice is to be able to detach one of the volumes, and
> know the span of the data in there without being able to access the
> data.
>
> The problem that a lot of warehouse operators have is something like
> this: "We know we have all this data, but we don't know what we will
> want to do with it later. So keep it all. I'll get back to you when
> I want to know something."

You should be able to do that with tablespaces and VACUUM FREEZE, the
point of the latter being that you can take the disk containing the
"read only" data offline, and still have the data readable after
plugging it back in, no matter how far along the transaction ID counter
is.
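
A minimal sketch of that approach, with an invented table name and an invented
tablespace name ("archive_ts") standing in for the disk you intend to take
offline:

  -- move the old data onto the soon-to-be-offline storage, then freeze it
  ALTER TABLE sales_2005 SET TABLESPACE archive_ts;
  VACUUM FREEZE sales_2005;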

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Andrew Sullivan" <ajs(at)crankycanuck(dot)ca>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-19 23:10:47
Message-ID: 1171926647.25938.472.camel@silverbirch.site
Lists: pgsql-hackers

On Mon, 2007-02-19 at 17:35 -0300, Alvaro Herrera wrote:
> Andrew Sullivan wrote:
> > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> > > Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > > > Somehow this seems like implementing RAID within postgres,
> > >
> > > RAID and LVM too. I can't get excited about re-inventing those wheels
> > > when perfectly good implementations already exist for us to sit on top of.
> >
> > Ok, warning, this is a "you know what would be sweet" moment.
> >
> > What would be nice is to be able to detach one of the volumes, and
> > know the span of the data in there without being able to access the
> > data.
> >
> > The problem that a lot of warehouse operators have is something like
> > this: "We know we have all this data, but we don't know what we will
> > want to do with it later. So keep it all. I'll get back to you when
> > I want to know something."
>
> You should be able to do that with tablespaces and VACUUM FREEZE, the
> point of the latter being that you can take the disk containing the
> "read only" data offline, and still have the data readable after
> plugging it back in, no matter how far along the transaction ID counter
> is.

Doesn't work anymore because VACUUM FREEZE doesn't (and can't) take a
full table lock, so somebody can always update data after a data block
has been frozen. That can lead to putting a table onto read-only media
when it still needs vacuuming, which is a great way to break the DB. It
also doesn't freeze still visible data, so there's no easy way of doing
this. Waiting until the VF is the oldest Xid is prone to deadlock as
well.

Ideally, we'd have a copy to read-only media whilst freezing, as an
atomic operation, with some guarantees that it will actually have frozen
*everything*, or fail:

ALTER TABLE SET TABLESPACE foo READONLY;

Can we agree on that as a TODO item?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: David Fetter <david(at)fetter(dot)org>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-20 03:10:52
Message-ID: 20070220031052.GD8081@fetter.org
Lists: pgsql-hackers

On Mon, Feb 19, 2007 at 02:50:34PM -0500, Andrew Sullivan wrote:
> On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> > Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > > Somehow this seems like implementing RAID within postgres,
> >
> > RAID and LVM too. I can't get excited about re-inventing those
> > wheels when perfectly good implementations already exist for us to
> > sit on top of.
>
> Ok, warning, this is a "you know what would be sweet" moment.
>
> What would be nice is to be able to detach one of the volumes, and
> know the span of the data in there without being able to access the
> data.
>
> The problem that a lot of warehouse operators have is something like
> this: "We know we have all this data, but we don't know what we will
> want to do with it later. So keep it all. I'll get back to you
> when I want to know something."
>
> It'd be nice to be able to load up all that data once, and then
> shunt it off into (say) read-only media. If one could then run a
> query that would tell one which spans of data are candidates for the
> search, you could bring back online (onto reasonably fast storage,
> for instance) just the volumes you need to read.

Isn't this one of the big use cases for table partitioning?

Cheers,
D
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Log levels for checkpoint/bgwriter monitoring
Date: 2007-02-20 03:59:38
Message-ID: Pine.GSO.4.64.0702192155530.20153@westnet.com
Lists: pgsql-hackers

I have a WIP patch that adds the main detail I have found I need to
properly tune checkpoint and background writer activity. I think it's
almost ready to submit (you can see the current patch against 8.2 at
http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
after making it a bit more human-readable. But I've realized that along
with that, I need some guidance in regards to what log level is
appropriate for this information.

An example works better than explaining what the patch does:

2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote
checkpoint_segments)
2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting
2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool
2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting
2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete
2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034
write=279956 us sync=17739974 us

Remember that "Load distributed checkpoint" discussion back in December?
You can see exactly how bad the problem is on your system with this log
style (this is from a pgbench run where it's positively awful--it really
does take over 17 seconds for the fsync to execute, and there are clients
that are hung the whole time waiting for it).

I also instrumented the background writer. You get messages like this:

2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5
Unscanned = 23/54

This shows that we wrote (5) the maximum pages we were allowed to write
(5) while failing to scan almost half (23) of the buffers we meant to look
at (54). By taking a look at this output while the system is under load,
I found I was able to do bgwriter optimization that used to take me days
of frustrating testing in hours. I've been waiting for a good guide to
bgwriter tuning since 8.1 came out. Once you have this, combined with
knowing how many buffers were dirty at checkpoint time because the
bgwriter didn't get to them in time (the number you want to minimize), the
tuning guide practically writes itself.

So my question is...what log level should all this go at? Right now, I
have the background writer stuff adjusting its level dynamically based on
what happened; it logs at DEBUG2 if it hits the write limit (which should
be a rare event once you're tuned properly), DEBUG3 for writes that
scanned everything they were supposed to, and DEBUG4 if it scanned but
didn't find anything to write. The source of checkpoint information logs
at DEBUG1, the fsync/write info at DEBUG2.

I'd like to move some of these up. On my system, I even have many of the
checkpoint messages logged at INFO (the source of the checkpoint and the
total write time line). It's a bit chatty, but when you get some weird
system pause issue it makes it easy to figure out if checkpoints were to
blame. Given how useful I feel some of these messages are to system
tuning, and to explaining what currently appears as inexplicable pauses, I
don't want them to be buried at DEBUG levels where people are unlikely to
ever see them (I think some people may be concerned about turning on
things labeled DEBUG at all). I am aware that I am too deep into this to
have an unbiased opinion at this point though, which is why I ask for
feedback on how to proceed here.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-20 06:36:54
Message-ID: 200702200136.54854.xzilla@users.sourceforge.net
Lists: pgsql-hackers

On Monday 19 February 2007 15:08, Bruce Momjian wrote:
> Joshua D. Drake wrote:
> > Andrew Sullivan wrote:
> > > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> > >> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > >>> Somehow this seems like implementing RAID within postgres,
> > >>
> > >> RAID and LVM too. I can't get excited about re-inventing those wheels
> > >> when perfectly good implementations already exist for us to sit on top
> > >> of.
> > >
> > > Ok, warning, this is a "you know what would be sweet" moment.
> >
> > The dreaded words from a developers mouth to every manager in the world.
>
> Yea, I just instinctively hit "delete" when I saw that phrase.

Too bad... I know oracle can do what he wants... possibly other db systems as
well.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Dimitri Fontaine <dim(at)dalibo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-20 06:44:04
Message-ID: 200702200144.04489.xzilla@users.sourceforge.net
Lists: pgsql-hackers

On Monday 19 February 2007 11:27, Martijn van Oosterhout wrote:
> On Mon, Feb 19, 2007 at 05:10:36PM +0100, Dimitri Fontaine wrote:
> > > RAID and LVM too. I can't get excited about re-inventing those wheels
> > > when perfectly good implementations already exist for us to sit on top
> > > of.
> >
> > I thought moving some knowledge about data availability into PostgreSQL
> > code could provide a valuable performance benefit, allowing it to
> > organize reads (for example, parallel table/index scans against
> > different volumes) and to obtain data from the volume known to be
> > quicker (or least loaded).
>
> Well, organising requests to be handled quickly is a function of
> LVM/RAID, so we don't go there. However, speeding up scans by having
> multiple requests is an interesting approach, as would perhaps a
> different random_page_cost for different tablespaces.
>

On one of my systems I have 1 tablespace for read data (99-1), 1 for
read-mostly data (90-10), and 1 for write-mostly data (40-60). The breakdown
is based on a combination of the underlying hardware and usage patterns of the
tables involved. I suspect that isn't that uncommon really. I've often thought
that being able to set GUC variables on a specific tablespace (like you can do
for users) would allow for a lot of flexibility in tuning queries that go
across different tablespaces.
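
Purely as a sketch: the per-role form below already exists, while the
per-tablespace form is hypothetical syntax (it doesn't exist today), and the
role and tablespace names are made up.

  -- what we can already do: per-role planner settings
  ALTER ROLE reporting_user SET random_page_cost = 2.0;
  -- what I'm wishing for: an analogous per-tablespace knob (hypothetical)
  -- ALTER TABLESPACE slow_sata SET random_page_cost = 8.0;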

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multiple Storage per Tablespace, or Volumes
Date: 2007-02-20 16:04:33
Message-ID: 20070220160433.GA12596@phlogiston.dyndns.org
Lists: pgsql-hackers

On Mon, Feb 19, 2007 at 07:10:52PM -0800, David Fetter wrote:
>
> Isn't this one of the big use cases for table partitioning?

Sure, but you can't detach that data in the meantime, AFAIK. Maybe
I've missed something.

If I have 10 years of finance data, and I have to keep it all online
all the time, my electricity costs alone are outrageous. If I can
know _where_ the data is without being able to get it until I've
actually made it available, then my query tools could be smart enough
to detect "table partitioned; this data is offline"; I could roll
back to my savepoint, return a partial result set to the user, and
tell it "call back in 24 hours for your full report".

Yes, I know, hands waving in the air. But I already said I was
having a "you know what would be sweet" moment.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
--Dennis Ritchie


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-02-22 04:22:34
Message-ID: 200702212322.34447.xzilla@users.sourceforge.net
Lists: pgsql-hackers

On Monday 19 February 2007 22:59, Greg Smith wrote:
> I have a WIP patch that adds the main detail I have found I need to
> properly tune checkpoint and background writer activity. I think it's
> almost ready to submit (you can see the current patch against 8.2 at
> http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
> after making it a bit more human-readable. But I've realized that along
> with that, I need some guidance in regards to what log level is
> appropriate for this information.
>
> An example works better than explaining what the patch does:
>
> 2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote
> checkpoint_segments)
> 2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting
> 2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool
> 2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting
> 2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete
> 2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034
> write=279956 us sync=17739974 us
>
> Remember that "Load distributed checkpoint" discussion back in December?
> You can see exactly how bad the problem is on your system with this log
> style (this is from a pgbench run where it's positively awful--it really
> does take over 17 seconds for the fsync to execute, and there are clients
> that are hung the whole time waiting for it).
>
> I also instrumented the background writer. You get messages like this:
>
> 2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5
> Unscanned = 23/54
>
> This shows that we wrote (5) the maximum pages we were allowed to write
> (5) while failing to scan almost half (23) of the buffers we meant to look
> at (54). By taking a look at this output while the system is under load,
> I found I was able to do bgwriter optimization that used to take me days
> of frustrating testing in hours. I've been waiting for a good guide to
> bgwriter tuning since 8.1 came out. Once you have this, combined with
> knowing how many buffers were dirty at checkpoint time because the
> bgwriter didn't get to them in time (the number you want to minimize), the
> tuning guide practically writes itself.
>
> So my question is...what log level should all this go at? Right now, I
> have the background writer stuff adjusting its level dynamically based on
> what happened; it logs at DEBUG2 if it hits the write limit (which should
> be a rare event once you're tuned properly), DEBUG3 for writes that
> scanned everything they were supposed to, and DEBUG4 if it scanned but
> didn't find anything to write. The source of checkpoint information logs
> at DEBUG1, the fsync/write info at DEBUG2.
>
> I'd like to move some of these up. On my system, I even have many of the
> checkpoint messages logged at INFO (the source of the checkpoint and the
> total write time line). It's a bit chatty, but when you get some weird
> system pause issue it makes it easy to figure out if checkpoints were to
> blame. Given how useful I feel some of these messages are to system
> tuning, and to explaining what currently appears as inexplicable pauses, I
> don't want them to be buried at DEBUG levels where people are unlikely to
> ever see them (I think some people may be concerned about turning on
> things labeled DEBUG at all). I am aware that I am too deep into this to
> have an unbiased opinion at this point though, which is why I ask for
> feedback on how to proceed here.
>

My impression of this is that DBAs would typically want to run this for a
short period of time to get their systems tuned and then it pretty much
becomes chatter. Can you come up with an idea of what information DBAs need
to know? Would it be better to hide all of this logging behind a GUC that
can be turned on or off (log_bgwriter_activity)? Maybe you could just
reinstrument it as DTrace probes so a separate stand-alone process could pull
the information as needed? :-)

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-02-22 19:36:18
Message-ID: 20070222193618.GO19527@nasby.net
Lists: pgsql-hackers

On Mon, Feb 19, 2007 at 10:59:38PM -0500, Greg Smith wrote:
> I have a WIP patch that adds the main detail I have found I need to
> properly tune checkpoint and background writer activity. I think it's
> almost ready to submit (you can see the current patch against 8.2 at
> http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
> after making it a bit more human-readable. But I've realized that along
> with that, I need some guidance in regards to what log level is
> appropriate for this information.
It would also be extremely useful to make checkpoint stats visible
somewhere in the database (presumably via the existing stats mechanism).
The log output is great for doing initial tuning, but you'd want to also
be able to monitor things, which would be impractical via logging. I'm
thinking just tracking how many pages had to be flushed during a
checkpoint would be a good start.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-06 03:34:35
Message-ID: Pine.GSO.4.64.0703052224480.16130@westnet.com
Lists: pgsql-hackers

On Thu, 22 Feb 2007, Jim C. Nasby wrote:

> It would also be extremely useful to make checkpoint stats visible
> somewhere in the database (presumably via the existing stats
> mechanism)... I'm thinking just tracking how many pages had to be
> flushed during a checkpoint would be a good start.

I'm in the middle of testing an updated version of the patch; once I nail
down exactly what needs to be logged I'd planned to open a discussion on
which of those things would be best served by pg_stats instead of a log.

I decided specifically to aim for the logs rather than pg_stats for the
checkpoint data because if you're in a situation where you are inserting fast
enough that the checkpoints are spaced closely together, you'd end up having
to poll pg_stats all the time to make sure you catch them all, which becomes
even less efficient than just logging the data.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-06 03:38:59
Message-ID: Pine.GSO.4.64.0703052234440.16130@westnet.com
Lists: pgsql-hackers

On Wed, 21 Feb 2007, Robert Treat wrote:

> My impression of this is that DBA's would typically want to run this for a
> short period of time to get thier systems tuned and then it pretty much
> becomes chatter. Can you come up with an idea of what information DBA's need
> to know?

I am structuring the log levels so that you can see all the events that are
likely sources of extreme latency by using log level DEBUG1. That level
isn't too chatty, and I've decided I can leave the server there forever if
need be. I've actually come up with a much better interface to the
background writer logging than the one I originally outlined, which I'll
have ready for feedback in a few more days. Now it dumps a nice summary
report every time it completes a scan, rather than logging lower-level
info.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-06 15:39:06
Message-ID: E1B40D86-CAD1-4602-AAF6-064F84A6DFD2@decibel.org
Lists: pgsql-hackers

On Mar 5, 2007, at 8:34 PM, Greg Smith wrote:
> On Thu, 22 Feb 2007, Jim C. Nasby wrote:
>
>> It would also be extremely useful to make checkpoint stats visible
>> somewhere in the database (presumably via the existing stats
>> mechanism)... I'm thinking just tracking how many pages had to be
>> flushed during a checkpoint would be a good start.
>
> I'm in the middle of testing an updated version of the patch; once
> I nail down exactly what needs to be logged I'd planned to open a
> discussion on which of those things would be best served by
> pg_stats instead of a log.
>
> I decided specifically to aim for the logs rather than pg_stats for
> the checkpoint data because if you're in a situation where you are
> inserting fast enough that the checkpoints are spaced closely
> together, you'd end up having to poll pg_stats all the time to make
> sure you catch them all, which becomes even less efficient than
> just logging the data.

It's always good to be able to log stuff for detailed
troubleshooting, so that's a good place to start. The flipside is
that it's much easier to machine-parse a table rather than trying to
scrape the logs. And I don't think we'll generally care about each
individual checkpoint; rather we'll want to look at things like
'checkpoints/hour' and 'checkpoint written pages/hour'.
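
As an illustration of the kind of monitoring I mean, here's how one might
derive hourly rates from cumulative counters we already expose;
pg_stat_database's commit counter stands in for the checkpoint counters I'm
asking for, and the snapshot table name is invented.

  CREATE TABLE stat_snaps AS
      SELECT now() AS sampled_at, datname, xact_commit FROM pg_stat_database;
  -- re-run this INSERT from cron every hour, then difference consecutive
  -- samples to get commits/hour (or, someday, checkpoints/hour) per database
  INSERT INTO stat_snaps
      SELECT now(), datname, xact_commit FROM pg_stat_database;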
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-07 04:14:37
Message-ID: Pine.GSO.4.64.0703062304230.24241@westnet.com
Lists: pgsql-hackers

On Tue, 6 Mar 2007, Jim Nasby wrote:

> The flipside is that it's much easier to machine-parse a table rather
> than trying to scrape the logs.

Now you might realize why I've been so vocal on the SQL log export
implementation details.

> And I don't think we'll generally care about each individual checkpoint;
> rather we'll want to look at things like 'checkpoints/hour' and
> 'checkpoint written pages/hour'.

After a few months of staring at this data, I've found averages like that
misleading. The real problem areas correlate with the peak pages written
at any one checkpoint. Lowering that value is really the end-game for
optimizing the background writer, and the peaks are what will nail you
with a nasty fsync pause at checkpoint time.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-07 05:11:38
Message-ID: 20070307131820.5E23.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> After a few months of staring at this data, I've found averages like that
> misleading. The real problem areas correlate with the peak pages written
> at any one checkpoint. Lowering that value is really the end-game for
> optimizing the background writer, and the peaks are what will nail you
> with a nasty fsync pause at checkpoint time.

If you already have the technical knowledge to derive the best settings from
activity logs, could you write it down in the code? I'm hoping for some kind
of automatic control feature in the bgwriter if its best configuration varies
with usage or activity.

BTW, I'm planning two changes in bgwriter.

[Load distributed checkpoint]
http://archives.postgresql.org/pgsql-patches/2007-02/msg00522.php
[Automatic adjustment of bgwriter_lru_maxpages]
http://archives.postgresql.org/pgsql-patches/2007-03/msg00092.php

I have some results showing that if we have plenty of time between
checkpoints, bgwriter_all_maxpages is not such an important parameter,
because it is effectively adjusted to "shared_buffers / duration of
checkpoint". Also, my recommended bgwriter_lru_maxpages is the "average
number of recycled buffers per cycle", which is hardly possible to tune
manually.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Jim Nasby <decibel(at)decibel(dot)org>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-07 22:24:02
Message-ID: 1368D272-5E19-429D-BF90-E7AF424532B3@decibel.org
Lists: pgsql-hackers

On Mar 6, 2007, at 10:11 PM, ITAGAKI Takahiro wrote:
> I have some results that if we have plenty of time for checkpoints,
> bgwriter_all_maxpages is not a so important parameter because it is
> adjusted to "shared_buffers / duration of checkpoint".
> Also, my recommended bgwriter_lru_maxpages is "average number of
> recycled buffers per cycle", that is hardly able to tune manually.

What do you mean by "number of recycled buffers per cycle"?
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-08 04:33:46
Message-ID: Pine.GSO.4.64.0703072313470.25364@westnet.com
Lists: pgsql-hackers

On Wed, 7 Mar 2007, ITAGAKI Takahiro wrote:

> Also, my recommended bgwriter_lru_maxpages is "average number of
> recycled buffers per cycle", that is hardly able to tune manually.

This is completely dependent on what percentage of your buffer cache is
pinned. If your load is something like the standard pgbench, the LRU
writer will rarely find anything useful to write, so this entire line of
thinking won't work. The proper behavior for heavily pinned data is to
turn off the LRU writer altogether so there's more time to run the all
scan.

The job I'm trying to take on here is not to presume I can solve these
problems myself yet. I've instead recognized that people need usefully
organized information in order to even move in that direction, and that
information is not even close to being available right now.

What my latest work in progress patches do is summarize each scan of the
buffer pool with information about how much was written by each of the two
writers, along with noting what percentage of the pool was pinned data.
I'm trying to get that one ready to submit this week. Those three values
suggest some powerful techniques for tuning, but it's not quite good
enough to allow auto-tuning. It also needs a feel for how much time is
left before the next checkpoint.

What really needs to go along with all this is a sort of progress bar that
estimates how long we are from a checkpoint based on both a) the timeout,
and b) how many segments have been written. The timeout one is easy to
work with that way (from what I read of your code, you've worked that
angle). The part I had trouble doing was getting the WAL writers to
communicate a progress report on how many segments they filled back to the
bgwriter.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Jim Nasby <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 01:42:13
Message-ID: 20070309101224.633B.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers

Jim Nasby <decibel(at)decibel(dot)org> wrote:

> > Also, my recommended bgwriter_lru_maxpages is "average number of
> > recycled buffers per cycle", that is hardly able to tune manually.
>
> What do you mean by 'number of recycled buffers per cycle"?

There is the following description in the documentation:

| * bgwriter_lru_percent (floating point)
| To reduce the probability that server processes will need to issue their
| own writes, the background writer tries to write buffers that are likely
| to be recycled soon.
| * bgwriter_lru_maxpages (integer)
| In each round, no more than this many buffers will be written as a
| result of scanning soon-to-be-recycled buffers.

The number of recycled buffers per round is the same as the number of
StrategyGetBuffer() calls per round. I think the number is suitable for
bgwriter_lru_maxpages if we want bgwriter to write almost all of the dirty pages.

I proposed to change the semantics of bgwriter_lru_maxpages. It represented
"maximum writes per round", but the new meaning is "reserved buffers for
recycle". Non-dirty unused buffers will be counted among it.

I'm measuring the difference in performance between manual and automatic
tuning, especially when adding some jobs before writes. I'll let you know
when I get the results.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 02:04:58
Message-ID: 20070309104251.6341.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers


Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> > Also, my recommended bgwriter_lru_maxpages is "average number of
> > recycled buffers per cycle", that is hardly able to tune manually.
>
> This is completely dependent on what percentage of your buffer cache is
> pinned.

Don't you mean usage_count? In my understanding, each backend pins two
or so buffers at once. So percentage of pinned buffers should be low.

> If your load is something like the standard pgbench, the LRU
> writer will rarely find anything useful to write, so this entire line of
> thinking won't work. The proper behavior for heavily pinned data is to
> turn off the LRU writer altogether so there's more time to run the all
> scan.

I think you are pointing out a problem with the buffer management algorithm
itself, not only the bgwriter. Even if you turn off the LRU writer, each
backend pays the same cost to find recyclable buffers every time it
allocates a buffer.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 05:51:11
Message-ID: Pine.GSO.4.64.0703082321510.14630@westnet.com
Lists: pgsql-hackers

On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

> In my understanding, each backend pins two or so buffers at once. So
> percentage of pinned buffers should be low.

With the pgbench workload, a substantial percentage of the buffer cache
ends up pinned. From staring at the buffer cache using
contrib/pg_buffercache, I believe most of that consists of the index
blocks for the records being updated in the accounts table.
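
If you want to look at the same thing yourself, a quick sketch, assuming the
contrib/pg_buffercache module is installed, is just to count dirty vs. clean
buffers across the cache:

  SELECT isdirty, count(*) AS buffers
    FROM pg_buffercache
   GROUP BY isdirty;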

I just posted a new version of the patch I asked for feedback on at the
beginning of this thread, the latest one is at
http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt
I've been adjusting it to monitor the same data I think you need to refine
your patch. I believe the approach you're taking includes some
assumptions that seem perfectly reasonable, but that my testing doesn't
agree with. There's nothing like measuring something to settle what's
really going on, though, so that's what I've been focusing on. I'd love
to get some feedback on whether other people can replicate the kind of
things I'm seeing.

The new code generates statistics about exactly what the background writer
scan found during each round. If there's been substantial write activity,
it prints a log line when it recycles back to the beginning of the all
scan, to help characterize what the buffer pool looked like during that
scan from the perspective of the bgwriter. Here's some sample log output
from my underpowered laptop while running pgbench:

bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
...
checkpoint required (wrote checkpoint_segments)
checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms

Here 69% of the buffer cache contained dirty data, and 49% of the cache
was both pinned and dirty. During that same time period, the LRU writer
also wrote out a fair amount of data, operating on the 20% of the cache
that was dirty but not pinned. On my production server, where the
background writer is turned way up to reduce checkpoint times, these
numbers are even more extreme; almost everything that's dirty is also
pinned during pgbench, and the LRU is lucky to find anything it can write
as a result.

That patch is against the 8.2 codebase; now that I'm almost done I'm
planning to move it to HEAD instead soon (where it will conflict
considerably with your patch). If you have an 8.2 configuration you can
test with my patch applied, set log_min_messages = debug2, try it out, and
see what you get when running pgbench for a while. I think you'll
discover some interesting and unexpected things.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 06:43:38
Message-ID: 22147.1173422618@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> With the pgbench workload, a substantial percentage of the buffer cache
> ends up pinned.

[ raised eyebrow... ] Prove that. AFAIK it's impossible for the
pgbench queries to pin more than about three or four buffers per backend
concurrently.

regards, tom lane


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 07:08:59
Message-ID: 20070309155356.636E.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> > In my understanding, each backend pins two or so buffers at once. So
> > percentage of pinned buffers should be low.
>
> With the pgbench workload, a substantial percentage of the buffer cache
> ends up pinned.

> http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt
> bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
> ...
> checkpoint required (wrote checkpoint_segments)
> checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms
>
> Here 69% of the buffer cache contained dirty data, and 49% of the cache
> was both pinned and dirty.

No. "Pinned" means bufHdr->refcount > 0 and you don't distinguish pinned or
recently-used (bufHdr->usage_count > 0) buffers in your patch.

! if (bufHdr->refcount != 0 || bufHdr->usage_count != 0)
  {
!     if (skip_pinned)
!     {
!         UnlockBufHdr(bufHdr);
!         return BUF_PINNED;
!     }
!     buffer_write_type = BUF_WRITTEN_AND_PINNED;
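
If you want to report both cases, a sketch of how the excerpt above could
tell them apart (BUF_RECENTLY_USED and BUF_WRITTEN_AND_RECENTLY_USED are
made-up names, just for illustration):

/*
 * Hypothetical refinement, mirroring the excerpt above: classify by which
 * condition actually held instead of lumping both under "pinned".
 */
if (bufHdr->refcount != 0 || bufHdr->usage_count != 0)
{
    bool    actually_pinned = (bufHdr->refcount != 0);

    if (skip_pinned)
    {
        UnlockBufHdr(bufHdr);
        return actually_pinned ? BUF_PINNED : BUF_RECENTLY_USED;
    }
    buffer_write_type = actually_pinned ? BUF_WRITTEN_AND_PINNED
                                        : BUF_WRITTEN_AND_RECENTLY_USED;
}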

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 13:57:48
Message-ID: Pine.GSO.4.64.0703090812410.15206@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

> "Pinned" means bufHdr->refcount > 0 and you don't distinguish pinned or
> recently-used (bufHdr->usage_count > 0) buffers in your patch.

Thank you, I will revise the terminology used accordingly. I was using
"pinned" as a shortcut for "will be ignored by skip_pinned" which was
sloppy of me. As I said, I was trying to show how the buffer cache looks
from the perspective of the background writer, and therefore lumping them
together because that's how SyncOneBuffer views them. A buffer cache full
of either type will be largely ignored by the LRU writer, and that's what
I've been finding when running insert/update heavy workloads like pgbench.

If I might suggest a terminology change to avoid this confusion in the
future, I'd like to rename the SyncOneBuffer "skip_pinned" parameter to
something like "skip_active", which is closer to the real behavior. I
know Oracle refers to these as "hot" and "cold" LRU entries.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 21:28:48
Message-ID: 7A38AFE6-D2FF-4951-BBF1-D12CFF59513E@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 8, 2007, at 11:51 PM, Greg Smith wrote:
> almost everything that's dirty is also pinned during pgbench, and
> the LRU is lucky to find anything it can write as a result

I'm wondering if pg_bench is a good test of this stuff. ISTM it's
unrealistically write-heavy, which is going to tend to not only put a
lot of dirty buffers into the pool, but also keep them pinned enough
that you can't write them.

Perhaps you should either modify pg_bench to do a lot more selects
out of the various tables or look towards a different benchmark.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 21:38:10
Message-ID: FA60F165-253F-4791-AF6B-7BEA34AF3D2B@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 9, 2007, at 7:57 AM, Greg Smith wrote:
> On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:
>
>> "Pinned" means bufHdr->refcount > 0 and you don't distinguish
>> pinned or recently-used (bufHdr->usage_count > 0) buffers in your
>> patch.
>
> Thank you, I will revise the terminology used accordingly. I was
> using "pinned" as a shortcut for "will be ignored by skip_pinned"
> which was sloppy of me. As I said, I was trying to show how the
> buffer cache looks from the perspective of the background writer,
> and therefore lumping them together because that's how
> SyncOneBuffer views them. A buffer cache full of either type will
> be largely ignored by the LRU writer, and that's what I've been
> finding when running insert/update heavy workloads like pgbench.
>
> If I might suggest a terminology change to avoid this confusion in
> the future, I'd like to rename the SyncOneBuffer "skip_pinned"
> parameter to something like "skip_active", which is closer to the
> real behavior. I know Oracle refers to these as "hot" and "cold"
> LRU entries.

Well, AIUI, whether the buffer is actually pinned or not is almost
inconsequential (other than if a buffer *is* pinned then its usage
count is about to become > 0, so there's no reason to consider
writing it).

What that parameter really does is control whether you're going to
follow the LRU semantics or not...
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <decibel(at)decibel(dot)org>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-09 23:40:44
Message-ID: 4608.1173483644@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim Nasby <decibel(at)decibel(dot)org> writes:
> On Mar 8, 2007, at 11:51 PM, Greg Smith wrote:
>> almost everything that's dirty is also pinned during pgbench, and
>> the LRU is lucky to find anything it can write as a result

> I'm wondering if pg_bench is a good test of this stuff.

On reflection I think that Greg's result is probably unsurprising, and
furthermore does not indicate that anything is wrong.

What it shows (now that we got past the terminology) is that only about
half of the buffer pool is subject to replacement during any given clock
sweep. For low-usage pages that's about what you'd expect: a page is
sucked in on demand (using a buffer returned by the clock sweep), and
when we're done with it it'll have usage_count = 1. If it's not touched
again then when the clock sweep returns to it it'll be decremented to
usage_count 0, and on the next visit it'll be recycled for use as
something else. Thus for low-usage pages you'd fully expect that about
half of the buffer population has usage_count 1 and the rest has
usage_count 0, which is strikingly close to Greg's measurement that 48.8%
of the population has usage_count 0.
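
Per clock-sweep visit the decision is roughly the following; this is only a
simplified sketch, not the actual StrategyGetBuffer() code, which also deals
with the freelist, locking, and the usage_count cap:

typedef struct SweepBuffer
{
    int refcount;       /* current pins */
    int usage_count;    /* decays by one per completed sweep */
} SweepBuffer;

/*
 * One clock-sweep visit, greatly simplified.  Returns 1 if this buffer can
 * be recycled for a new page, 0 if the sweep should move on.
 */
static int
sweep_visit(SweepBuffer *buf)
{
    if (buf->refcount > 0)
        return 0;               /* pinned: never a victim, keep sweeping */
    if (buf->usage_count > 0)
    {
        buf->usage_count--;     /* touched recently: one more trip around */
        return 0;
    }
    return 1;                   /* unpinned, usage_count 0: recycle it */
}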

What this seems to tell us is that pgbench's footprint of heavily used
pages (those able to achieve usage_counts above 1) is very small. Which
is probably right unless you've used a very large scale factor. I'd be
interested to know what scale factor and shared_buffers setting led to
the above measurement.

It strikes me that the patch would be more useful if it produced a
histogram of the observed usage_counts, rather than merely the count
for usage_count = 0.
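
Something along these lines during the scan would be enough (a sketch only;
the names are made up, and the real thing would read usage_count under the
buffer header lock):

#define USAGE_BUCKETS 6         /* buckets 0..4 plus "5 or more" */

static long dirty_usage_histogram[USAGE_BUCKETS];

/* Bump the bucket for one dirty buffer's observed usage_count. */
static void
tally_usage_count(int usage_count)
{
    if (usage_count >= USAGE_BUCKETS)
        usage_count = USAGE_BUCKETS - 1;
    dirty_usage_histogram[usage_count]++;
}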

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-10 00:33:38
Message-ID: Pine.GSO.4.64.0703091923300.9297@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 9 Mar 2007, Jim Nasby wrote:

> I'm wondering if pg_bench is a good test of this stuff. ISTM it's
> unrealistically write-heavy, which is going to tend to not only put a
> lot of dirty buffers into the pool, but also keep them pinned enough
> that you can't write them.

Whether it's "unrealistically" write-heavy kind of depends on what your
real app is. The standard pgbench is a bit weird because it does so many
updates to tiny tables, which adds a level of locking contention that
doesn't really reflect many real-world situations. But the no-branch mode
(update/select to accounts, insert into history) isn't too dissimilar from
some insert-heavy logging applications I've seen.

The main reason I brought this all up was because Itagaki seemed to be
using pgbench for some of his performance tests. I just wanted to point
out that the LRU background writer specifically tends to be very
underutilized when using pgbench.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-10 00:56:54
Message-ID: Pine.GSO.4.64.0703091936030.9297@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 9 Mar 2007, Tom Lane wrote:

> I'd be interested to know what scale factor and shared_buffers setting
> led to the above measurement.

That was just a trivial example with 1 client, scale=10 (~160MB database),
and shared_buffers=24MB. Where things really get interesting with pgbench
is on a system with enough horsepower+clients to dirty the whole buffer
cache well before a checkpoint. I regularly see >75% of the cache dirty
and blocked from LRU writes with pgbench's slightly pathological workload
in that situation.

You're correct that these results aren't particularly surprising or
indicative of a problem to be corrected. But they do shed some light on
what pgbench is and isn't appropriate for testing.

> It strikes me that the patch would be more useful if it produced a
> histogram of the observed usage_counts, rather than merely the count
> for usage_count = 0.

I'll start working in that direction. With the feedback everyone has
given me on how few of the buffers are truly "pinned" in the correct
sense of the term, I'm going to revisit the usage details and revise that
section. I'm happy with how I'm reporting the checkpoint details now,
still some work left to do on the bgwriter activity.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-13 02:42:04
Message-ID: Pine.GSO.4.64.0703121958540.27778@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 9 Mar 2007, Tom Lane wrote:

> It strikes me that the patch would be more useful if it produced a
> histogram of the observed usage_counts

Don't have something worth releasing yet, but I did code a first rev of
this today. The results are quite instructive and well worth looking at.

The main server I work on has shared_buffers=60000 for pgbench testing and
the background writers turned way up (in hopes of reducing checkpoint
times; that's a whole nother topic). I run pgbench with s=100 (~1.6GB
database). Here's what I get as statistics on the buffer pool after a
scan when the server is "happy", from a run with 20 clients:

writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
dirty buffer usage count histogram:
0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%

Basically, everything that's left dirty is also heavily used; as I noted
before, when I inspect with pg_buffercache these are mostly index blocks.
I note that I should probably generate a second histogram covering the
clean data for comparison. The all scan is writing pages out as soon as
they are dirtied, while the LRU background writer is lucky if it can find
anything to do. (Note that I don't rely on the LRU writer more because
that causes a significant lull in writes after a checkpoint. By the time
it gets going again it's harder to catch up, and delaying PostgreSQL
writes also aggravates issues with the way Linux caches writes.)

What I found really interesting was comparing a "sad" section where
performance breaks down. This is from a minute later:

writes=441.6MB (94.2%) pinned+used=356.2MB (76.0%)
dirty buffer usage count histogram:
0=18.7% 1=26.4% 2=31% 3=11% 4=9% 5+=4%

Note how the whole buffer distribution has shifted toward lower usage
counts. The breakdown seems to involve evicting the index blocks to make
room for the recently dirty data when it can't be written out fast enough.
As the index blocks go poof, things slow further, and into the vicious
circle you go. Eventually you can end up blocked on a combination of
buffer evictions and disk seeks for uncached data that are fighting with
the writes.

The bgwriter change this suggested to me is defining a triage behavior
where the all scan switches to acting like the LRU one:

-Each sample period, note what % of the usage_count=0 records are dirty
-If that number is above a tunable threshold, switch to only writing
usage_count=0 records until it isn't anymore (rough sketch below).
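
In code form that triage check might look something like this (a sketch
with made-up names, not the actual prototype):

/*
 * Sketch of the triage test; dirty_zero and total_zero come from the last
 * sample period and triage_threshold is the proposed tunable (all names
 * made up here).
 */
static int
all_scan_should_triage(long dirty_zero, long total_zero,
                       double triage_threshold)
{
    double  dirty_fraction;

    if (total_zero == 0)
        return 0;               /* nothing at usage_count 0 to judge by */
    dirty_fraction = (double) dirty_zero / total_zero;

    /* Above the threshold: behave like the LRU scan and write only
     * usage_count = 0 buffers until the fraction drops back down. */
    return dirty_fraction > triage_threshold;
}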

On my system a first guess for that tunable would be 2-5%, based on what
values I see on either side of the sad periods. No doubt some systems
would set that much higher, haven't tested my system at home yet to have a
guideline for a more typical PC.

As for why this behavior matters so much to me, I actually have a
prototype auto-tuning background writer design that was hung up on this
particular problem. It notes how often you write out max_pages, uses that
to model the average percentage of the buffer cache you're traversing each
pass, then runs a couple of weighted-average feedback loops to aim for a
target seconds/sweep.
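
To give a feel for the style of those loops, here's a stripped-down sketch
of one weighted-average adjustment toward a target sweep time (the names and
the smoothing factor are illustrative, not what the prototype actually uses):

/*
 * One adjustment step: blend the latest observed seconds-per-sweep into a
 * running weighted average, then scale the pages-per-round limit toward
 * the target.  The 0.2 smoothing factor is arbitrary.
 */
#define SWEEP_SMOOTHING 0.2

static double avg_sweep_seconds = 0.0;

static int
adjust_maxpages(double observed_seconds, double target_seconds, int maxpages)
{
    avg_sweep_seconds = SWEEP_SMOOTHING * observed_seconds +
                        (1.0 - SWEEP_SMOOTHING) * avg_sweep_seconds;

    /* Sweeps taking too long?  Write more per round; too fast, back off. */
    return (int) (maxpages * (avg_sweep_seconds / target_seconds)) + 1;
}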

The problem was that it went berserk when the whole buffer cache was dirty
(I hope someone appreciates that I've been calling this "Felix Unger on
crack mode" in my notes). I needed some way to prioritize which buffers
to concentrate on when that happens, and so far the above has been a good
first-cut way to help with that.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-13 03:01:56
Message-ID: 23366.1173754916@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> Here's what I get as statistics on the buffer pool after a
> scan when the server is "happy", from a run with 20 clients:

> writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
> dirty buffer usage count histogram:
> 0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%

Interesting, but not real meaningful by itself --- what's the situation
for non-dirty buffers? And how many are there of each?

It might also be interesting to know exactly how many buffers were
pinned at the time the scan passed over them. In theory it should be a
small fraction, but maybe it isn't ...

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-14 04:13:04
Message-ID: Pine.GSO.4.64.0703132350420.26349@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 12 Mar 2007, Tom Lane wrote:

> It might also be interesting to know exactly how many buffers were
> pinned at the time the scan passed over them. In theory it should be a
> small fraction, but maybe it isn't ...

It is; the theory holds for all the tests I tried today. The actual
pinned buffers were so few (typically a fraction of the client count) that
I reverted to just lumping them in with the recently used ones. To better
reflect the vast majority of what it's interacting with, in my patch I
renamed the SyncOneBuffer "skip_pinned" parameter to "skip_recently_used".
It seems natural that something currently pinned would also be considered
recently used, whereas I didn't find the current naming so obvious.

I'm also now collecting clean vs. dirty usage histogram counts as well
since you suggested it. Nothing exciting to report there so far, may note
something interesting after I collect more data.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-27 15:57:23
Message-ID: 200703271557.l2RFvNl28519@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Is this patch ready?

---------------------------------------------------------------------------

Greg Smith wrote:
> I have a WIP patch that adds the main detail I have found I need to
> properly tune checkpoint and background writer activity. I think it's
> almost ready to submit (you can see the current patch against 8.2 at
> http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
> after making it a bit more human-readable. But I've realized that along
> with that, I need some guidance in regards to what log level is
> appropriate for this information.
>
> An example works better than explaining what the patch does:
>
> 2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote
> checkpoint_segments)
> 2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting
> 2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool
> 2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting
> 2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete
> 2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034
> write=279956 us sync=17739974 us
>
> Remember that "Load distributed checkpoint" discussion back in December?
> You can see exactly how bad the problem is on your system with this log
> style (this is from a pgbench run where it's positively awful--it really
> does take over 17 seconds for the fsync to execute, and there are clients
> that are hung the whole time waiting for it).
>
> I also instrumented the background writer. You get messages like this:
>
> 2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5
> Unscanned = 23/54
>
> This shows that we wrote (5) the maximum pages we were allowed to write
> (5) while failing to scan almost half (23) of the buffers we meant to look
> at (54). By taking a look at this output while the system is under load,
> I found I was able to do bgwriter optimization that used to take me days
> of frustrating testing in hours. I've been waiting for a good guide to
> bgwriter tuning since 8.1 came out. Once you have this, combined with
> knowing how many buffers were dirty at checkpoint time because the
> bgwriter didn't get to them in time (the number you want to minimize), the
> tuning guide practically writes itself.
>
> So my question is...what log level should all this go at? Right now, I
> have the background writer stuff adjusting its level dynamically based on
> what happened; it logs at DEBUG2 if it hits the write limit (which should
> be a rare event once you're tuned properly), DEBUG3 for writes that
> scanned everything they were supposed to, and DEBUG4 if it scanned but
> didn't find anything to write. The source of checkpoint information logs
> at DEBUG1, the fsync/write info at DEBUG2.
>
> I'd like to move some of these up. On my system, I even have many of the
> checkpoint messages logged at INFO (the source of the checkpoint and the
> total write time line). It's a bit chatty, but when you get some weird
> system pause issue it makes it easy to figure out if checkpoints were to
> blame. Given how useful I feel some of these messages are to system
> tuning, and to explaining what currently appears as inexplicable pauses, I
> don't want them to be buried at DEBUG levels where people are unlikely to
> ever see them (I think some people may be concerned about turning on
> things labeled DEBUG at all). I am aware that I am too deep into this to
> have an unbiased opinion at this point though, which is why I ask for
> feedback on how to proceed here.
>
> --
> * Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-27 17:30:12
Message-ID: 460954A4.90803@hagander.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Would not at least some of these numbers be better presented through the
stats collector, so they can be easily monitored?

That goes along the lines of my way, way, way away-from-finished attempt
from earlier; perhaps a combination of these two patches?

//Magnus

Bruce Momjian wrote:
> Is this patch ready?
>
> ---------------------------------------------------------------------------
>
> Greg Smith wrote:
>> I have a WIP patch that adds the main detail I have found I need to
>> properly tune checkpoint and background writer activity. I think it's
>> almost ready to submit (you can see the current patch against 8.2 at
>> http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
>> after making it a bit more human-readable. But I've realized that along
>> with that, I need some guidance in regards to what log level is
>> appropriate for this information.
>>
>> An example works better than explaining what the patch does:
>>
>> 2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote
>> checkpoint_segments)
>> 2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting
>> 2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool
>> 2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting
>> 2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete
>> 2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034
>> write=279956 us sync=17739974 us
>>
>> Remember that "Load distributed checkpoint" discussion back in December?
>> You can see exactly how bad the problem is on your system with this log
>> style (this is from a pgbench run where it's positively awful--it really
>> does take over 17 seconds for the fsync to execute, and there are clients
>> that are hung the whole time waiting for it).
>>
>> I also instrumented the background writer. You get messages like this:
>>
>> 2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5
>> Unscanned = 23/54
>>
>> This shows that we wrote (5) the maximum pages we were allowed to write
>> (5) while failing to scan almost half (23) of the buffers we meant to look
>> at (54). By taking a look at this output while the system is under load,
>> I found I was able to do bgwriter optimization that used to take me days
>> of frustrating testing in hours. I've been waiting for a good guide to
>> bgwriter tuning since 8.1 came out. Once you have this, combined with
>> knowing how many buffers were dirty at checkpoint time because the
>> bgwriter didn't get to them in time (the number you want to minimize), the
>> tuning guide practically writes itself.
>>
>> So my question is...what log level should all this go at? Right now, I
>> have the background writer stuff adjusting its level dynamically based on
>> what happened; it logs at DEBUG2 if it hits the write limit (which should
>> be a rare event once you're tuned properly), DEBUG3 for writes that
>> scanned everything they were supposed to, and DEBUG4 if it scanned but
>> didn't find anything to write. The source of checkpoint information logs
>> at DEBUG1, the fsync/write info at DEBUG2.
>>
>> I'd like to move some of these up. On my system, I even have many of the
>> checkpoint messages logged at INFO (the source of the checkpoint and the
>> total write time line). It's a bit chatty, but when you get some weird
>> system pause issue it makes it easy to figure out if checkpoints were to
>> blame. Given how useful I feel some of these messages are to system
>> tuning, and to explaining what currently appears as inexplicable pauses, I
>> don't want them to be buried at DEBUG levels where people are unlikely to
>> ever see them (I think some people may be concerned about turning on
>> things labeled DEBUG at all). I am aware that I am too deep into this to
>> have an unbiased opinion at this point though, which is why I ask for
>> feedback on how to proceed here.
>>
>> --
>> * Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 7: You can help support the PostgreSQL project by donating at
>>
>> http://www.postgresql.org/about/donate
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-27 17:40:37
Message-ID: 200703271740.l2RHebN07393@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Magnus Hagander wrote:
> Would not at least some of these numbers be better presented through the
> stats collector, so they can be easily monitored?
>
> That goes along the lines of my way, way, way away-from-finished attempt
> from earlier; perhaps a combination of these two patches?

Yes.

---------------------------------------------------------------------------

>
> //Magnus
>
>
> Bruce Momjian wrote:
> > Is this patch ready?
> >
> > ---------------------------------------------------------------------------
> >
> > Greg Smith wrote:
> >> I have a WIP patch that adds the main detail I have found I need to
> >> properly tune checkpoint and background writer activity. I think it's
> >> almost ready to submit (you can see the current patch against 8.2 at
> >> http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
> >> after making it a bit more human-readable. But I've realized that along
> >> with that, I need some guidance in regards to what log level is
> >> appropriate for this information.
> >>
> >> An example works better than explaining what the patch does:
> >>
> >> 2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote
> >> checkpoint_segments)
> >> 2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting
> >> 2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool
> >> 2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting
> >> 2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete
> >> 2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034
> >> write=279956 us sync=17739974 us
> >>
> >> Remember that "Load distributed checkpoint" discussion back in December?
> >> You can see exactly how bad the problem is on your system with this log
> >> style (this is from a pgbench run where it's positively awful--it really
> >> does take over 17 seconds for the fsync to execute, and there are clients
> >> that are hung the whole time waiting for it).
> >>
> >> I also instrumented the background writer. You get messages like this:
> >>
> >> 2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5
> >> Unscanned = 23/54
> >>
> >> This shows that we wrote (5) the maximum pages we were allowed to write
> >> (5) while failing to scan almost half (23) of the buffers we meant to look
> >> at (54). By taking a look at this output while the system is under load,
> >> I found I was able to do bgwriter optimization that used to take me days
> >> of frustrating testing in hours. I've been waiting for a good guide to
> >> bgwriter tuning since 8.1 came out. Once you have this, combined with
> >> knowing how many buffers were dirty at checkpoint time because the
> >> bgwriter didn't get to them in time (the number you want to minimize), the
> >> tuning guide practically writes itself.
> >>
> >> So my question is...what log level should all this go at? Right now, I
> >> have the background writer stuff adjusting its level dynamically based on
> >> what happened; it logs at DEBUG2 if it hits the write limit (which should
> >> be a rare event once you're tuned properly), DEBUG3 for writes that
> >> scanned everything they were supposed to, and DEBUG4 if it scanned but
> >> didn't find anything to write. The source of checkpoint information logs
> >> at DEBUG1, the fsync/write info at DEBUG2.
> >>
> >> I'd like to move some of these up. On my system, I even have many of the
> >> checkpoint messages logged at INFO (the source of the checkpoint and the
> >> total write time line). It's a bit chatty, but when you get some weird
> >> system pause issue it makes it easy to figure out if checkpoints were to
> >> blame. Given how useful I feel some of these messages are to system
> >> tuning, and to explaining what currently appears as inexplicable pauses, I
> >> don't want them to be buried at DEBUG levels where people are unlikely to
> >> ever see them (I think some people may be concerned about turning on
> >> things labeled DEBUG at all). I am aware that I am too deep into this to
> >> have an unbiased opinion at this point though, which is why I ask for
> >> feedback on how to proceed here.
> >>
> >> --
> >> * Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
> >>
> >> ---------------------------(end of broadcast)---------------------------
> >> TIP 7: You can help support the PostgreSQL project by donating at
> >>
> >> http://www.postgresql.org/about/donate
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Date: 2007-03-28 13:33:50
Message-ID: Pine.GSO.4.64.0703280849320.19140@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 27 Mar 2007, Magnus Hagander wrote:

> Would not at least some of these numbers be better presented through the
> stats collector, so they can be easily monitored?
> That goes along the line of my way way way away from finished attempt
> earlier, perhaps a combination of these two patches?

When I saw your patch recently, I thought to myself "hmmm, the data
collected here sure looks familiar"--you even made some of the exact same
code changes I did. I've been bogged down recently chasing a performance
issue that, as it turns out, was mainly caused by the "high CPU usage for
stats collector" bug. That caused the background writer to slow to a
crawl under heavy load, which is why I was having all these checkpoint and
writer issues that got me monitoring that code in the first place.

With that seemingly resolved, I have a slightly new plan now. Next I want
to take the data I've been collecting in my patch, bundle the most
important parts of it into messages sent to the stats collector along the
lines suggested for rewriting your patch, then submit the result. I've got
the log output down and have a pretty good idea what data should be
collected, but as this would be my first time adding stats I'd certainly
love some help with that.
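
Roughly what I have in mind is a small, fixed set of counters bundled into
one message, something like this (the struct and field names here are
hypothetical, just to show the shape of it; the real thing would go through
the usual pgstat.c message header and send path):

/* Hypothetical stats-collector message carrying bgwriter/checkpoint
 * counters; every name below is illustrative only. */
typedef struct BgWriterStatsMsg
{
    long    checkpoints_timed;      /* checkpoints started by timeout */
    long    checkpoints_required;   /* started by checkpoint_segments */
    long    buffers_checkpoint;     /* buffers written at checkpoint time */
    long    buffers_clean;          /* written by the LRU scan */
    long    buffers_allscan;        /* written by the all scan */
    long    maxwritten_clean;       /* times the LRU scan hit its write limit */
} BgWriterStatsMsg;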

Once that monitoring infrastructure is in place, I then plan to merge
Itagaki's "Load distributed checkpoint" patch (it touches a lot of the
same code) and test that out under heavy load. I think it gives a much
better context for evaluating that patch if, rather than measuring just
its gross results, you can say something like "with the patch in place the
average fsync time on my system dropped from 3 seconds to 1.2 seconds when
writing out more than 100MB at checkpoint time". That's the direct cause
of the biggest problem in that area of code, so why not stare right at it
rather than measuring it indirectly.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD