Re: data on devel code perf dip

Lists: pgsql-hackerspgsql-patches
From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: data on devel code perf dip
Date: 2005-08-11 22:13:15
Message-ID: 1123798396.7224.85.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


I have an example of runs that illustrate a performance
problem that occurred between installing the 7/18 and 8/1
development release codes. I'm running on a PPC64 8-way system,
with 16GB of memory (on a 15GB virtual machine), with a DBT2
workload configured as a 16 warehouse DBT2 with 16 db connections,
no think time. Using Linux 2.6.11/gentoo.

Overview:

Before picture:
run 42 code updated from cvs July 18 12:18pm pacific
metric: 8374.71 NOTPM

After picture (same config params, db size, etc):
run 59 code downloaded from cvs (all of it) Aug1 9:58AM pacific
metric: 4787.61 NOTPM

Oprofile data in 59 (top part copied at the end of this message) but
not in 42 (although the binary is still around and I could in theory
rerun and get it).

I'd like to first get back to the throughput of run 42, and then
remove what's keeping me from getting higher than 45% user with
the original July 18 drop.

************************************************************
Here are more details:

(RUN 42) the previous performance 8374.71 new-order transactions per
minute (NOTPM) and %user around 45% on 8 processors. Even this level of
user utilization is not good, as there is still some bandwidth left in
the log device (not much, but some).

http://developer.osdl.org/maryedie/DBT2_PGSQL/42/

This development code we got on July 18 12:18pm pacific. Tuned it and got
to run 42 on July 26.

(RUN 59) This is an example of a run with code downloaded (whole) on
Aug1 9:58am Pacific. 4787.61 new-order transactions per minute (NOTPM)
and ~22% user over 8 cpus. This has the same config parameters as 42,
with autovacuum off. However, I turned on oprofile for this run. (runs
without oprofile are equivalent).

http://developer.osdl.org/maryedie/DBT2_PGSQL/59/

Top things for postgreSQL in the oprofile are copied below.

CPU: ppc64 POWER5, speed 1656.38 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00 (No unit mask) count 100000
samples % app name symbol name
164623113 70.5372 kernel-2.6.11.3 .shared_idle
6641569 2.8458 libc-2.3.4.so (no symbols)
4011874 1.7190 postgres .AllocSetAlloc
2798355 1.1990 postgres .CreateLWLocks
1731687 0.7420 postgres .SearchCatCache
1701936 0.7292 postgres .MemoryContextAllocZeroAligned
1509903 0.6470 postgres .FunctionCall2
1320467 0.5658 postgres .hash_search
1275278 0.5464 oprofiled (no symbols)
1271477 0.5448 postgres .yyparse
1099396 0.4711 postgres .OpernameGetCandidates
794802 0.3406 postgres .expression_tree_walker
731969 0.3136 postgres .MemoryContextAlloc
730266 0.3129 postgres ._bt_compare
709793 0.3041 postgres .DirectFunctionCall1
689405 0.2954 postgres .base_yylex
683776 0.2930 postgres .hash_any
635680 0.2724 postgres .DLMoveToFront
632098 0.2708 kernel-2.6.11.3 .___dst_free
562111 0.2409 kernel-2.6.11.3 .sys_setrlimit
508608 0.2179 postgres .AtEOXact_CatCache
499318 0.2139 postgres .heap_release_fetch
498136 0.2134 kernel-2.6.11.3 .tty_write
494020 0.2117 postgres .XLogInsert

--
Mary Edie Meredith
Initiative Manager
Open Source Development Labs
maryedie(at)hotmail(dot)com
503-906-1942


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: maryedie(at)osdl(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: data on devel code perf dip
Date: 2005-08-12 01:02:03
Message-ID: 6557.1123808523@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
> I have an example of runs that illustrate a performance
> problem that occurred between installing the 7/18 and 8/1
> development release codes.

I dug through the CVS logs to see what had changed, and I'm afraid there
is just one plausible-looking candidate:

2005-07-28 23:22 momjian

* src/backend/access/transam/xlog.c:
Use O_DIRECT if available when using O_SYNC for wal_sync_method.

Also, write multiple WAL buffers out in one write() operation.

ITAGAKI Takahiro

Most of the CVS activity in that time period had to with stuff like
roles and the interval datatype. It's conceivable that these things
had some marginal performance cost, but if so I'd have expected it to
show up as extra CPU effort (more time checking permissions, say).
This figure:

> samples % app name symbol name
> 164623113 70.5372 kernel-2.6.11.3 .shared_idle

says pretty clearly that your problem is all I/O wait, and there are
no other commits that might have increased our tendency to wait for I/O.

I am sure I will get some pushback if I propose reverting the O_DIRECT
patch, so could you try to get some more-specific evidence? Like pull
the CVS tree from just before and just after this patch and compare
performance?

BTW I did check that both runs are using wal_sync_method = fdatasync
and wal_buffers = 1000, so it's not a problem of those parameters having
been changed by the patch.

regards, tom lane


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: maryedie(at)osdl(dot)org, pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: data on devel code perf dip
Date: 2005-08-12 01:31:12
Message-ID: 20050812013112.GA9504@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, Aug 11, 2005 at 09:02:03PM -0400, Tom Lane wrote:

> I am sure I will get some pushback if I propose reverting the O_DIRECT
> patch, so could you try to get some more-specific evidence? Like pull
> the CVS tree from just before and just after this patch and compare
> performance?

Quoth the open(2) manpage:

O_DIRECT
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user
space buffers. The I/O is synchronous, i.e., at the comple-
tion of the read(2) or write(2) system call, data is guaran-
teed to have been transferred.

In light of this, may I ask whether it makes sense to compare the
performance of two runs with similar shared_buffer settings? With
O_DIRECT, I understand from this manpage that the OS is going to do
little or no page caching, so shared_buffers should be increased to
account for this fact.

Am I missing something?

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentre de él no son, por desgracia,
nada idílicas" (Ijon Tichy)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: maryedie(at)osdl(dot)org, pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: data on devel code perf dip
Date: 2005-08-12 01:38:22
Message-ID: 9736.1123810702@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> In light of this, may I ask whether it makes sense to compare the
> performance of two runs with similar shared_buffer settings? With
> O_DIRECT, I understand from this manpage that the OS is going to do
> little or no page caching, so shared_buffers should be increased to
> account for this fact.

> Am I missing something?

O_DIRECT is only being used for WAL page writes (or I sure hope so
anyway), so shared_buffers should be irrelevant.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: maryedie(at)osdl(dot)org, pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-12 01:40:39
Message-ID: 200508120140.j7C1edI25918@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Most of the CVS activity in that time period had to with stuff like
> roles and the interval datatype. It's conceivable that these things
> had some marginal performance cost, but if so I'd have expected it to
> show up as extra CPU effort (more time checking permissions, say).
> This figure:
>
> > samples % app name symbol name
> > 164623113 70.5372 kernel-2.6.11.3 .shared_idle
>
> says pretty clearly that your problem is all I/O wait, and there are
> no other commits that might have increased our tendency to wait for I/O.
>
> I am sure I will get some pushback if I propose reverting the O_DIRECT
> patch, so could you try to get some more-specific evidence? Like pull
> the CVS tree from just before and just after this patch and compare
> performance?
>
> BTW I did check that both runs are using wal_sync_method = fdatasync
> and wal_buffers = 1000, so it's not a problem of those parameters having
> been changed by the patch.

We can supply a patch with just the O_DIRECT for you to test. The
O_DIRECT patch also had grouped WAL writes, so that might be an issue
too. Also, O_DIRECT is only used for open_* wal sync methods, so I
don't see how it would affect this, but the grouped WAL writes might.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: data on devel code perf dip
Date: 2005-08-12 01:44:18
Message-ID: slrndfnvni.bj3.andrew+nonews@trinity.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 2005-08-12, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> In light of this, may I ask whether it makes sense to compare the
> performance of two runs with similar shared_buffer settings? With
> O_DIRECT, I understand from this manpage that the OS is going to do
> little or no page caching, so shared_buffers should be increased to
> account for this fact.
>
> Am I missing something?

O_DIRECT should only be being used for the WAL, not for buffer I/O.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, maryedie(at)osdl(dot)org, pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-12 01:44:53
Message-ID: 200508120144.j7C1irp26488@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> > In light of this, may I ask whether it makes sense to compare the
> > performance of two runs with similar shared_buffer settings? With
> > O_DIRECT, I understand from this manpage that the OS is going to do
> > little or no page caching, so shared_buffers should be increased to
> > account for this fact.
>
> > Am I missing something?
>
> O_DIRECT is only being used for WAL page writes (or I sure hope so
> anyway), so shared_buffers should be irrelevant.

Uh, O_DIRECT really just enables when open_sync is used, and I assume
that is not used for writing dirty buffers during a checkpoint.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: andrew(at)supernews(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: data on devel code perf dip
Date: 2005-08-12 02:01:21
Message-ID: 200508120201.j7C21LX00128@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew - Supernews wrote:
> On 2005-08-12, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> > In light of this, may I ask whether it makes sense to compare the
> > performance of two runs with similar shared_buffer settings? With
> > O_DIRECT, I understand from this manpage that the OS is going to do
> > little or no page caching, so shared_buffers should be increased to
> > account for this fact.
> >
> > Am I missing something?
>
> O_DIRECT should only be being used for the WAL, not for buffer I/O.

And only when open_sync or open_datasync are used. But the group write
patch is used in all cases, but again only WAL.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, maryedie(at)osdl(dot)org, pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-12 02:09:19
Message-ID: 11853.1123812559@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> O_DIRECT is only being used for WAL page writes (or I sure hope so
>> anyway), so shared_buffers should be irrelevant.

> Uh, O_DIRECT really just enables when open_sync is used, and I assume
> that is not used for writing dirty buffers during a checkpoint.

I double-checked that O_DIRECT is really just used for WAL, and only
when the sync mode is open_sync or open_datasync. So it seems
impossible that it affected a run with mode fdatasync. What seems the
best theory at the moment is that the grouped-WAL-write part of the
patch doesn't work so well as we thought.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, maryedie(at)osdl(dot)org, pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-12 02:11:42
Message-ID: 200508120211.j7C2Bgb01740@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> >> anyway), so shared_buffers should be irrelevant.
>
> > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > that is not used for writing dirty buffers during a checkpoint.
>
> I double-checked that O_DIRECT is really just used for WAL, and only
> when the sync mode is open_sync or open_datasync. So it seems
> impossible that it affected a run with mode fdatasync. What seems the
> best theory at the moment is that the grouped-WAL-write part of the
> patch doesn't work so well as we thought.

Yes, that's my only guess. Let us know if you want the patch to test,
rather than pulling CVS before and after the patch was applied.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Mark Wong <markw(at)osdl(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, maryedie(at)osdl(dot)org, pgsql-hackers(at)postgresql(dot)org, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-12 15:43:10
Message-ID: 200508121542.j7CFgejA017416@smtp.osdl.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> Tom Lane wrote:
> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > >> anyway), so shared_buffers should be irrelevant.
> >
> > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > that is not used for writing dirty buffers during a checkpoint.
> >
> > I double-checked that O_DIRECT is really just used for WAL, and only
> > when the sync mode is open_sync or open_datasync. So it seems
> > impossible that it affected a run with mode fdatasync. What seems the
> > best theory at the moment is that the grouped-WAL-write part of the
> > patch doesn't work so well as we thought.
>
> Yes, that's my only guess. Let us know if you want the patch to test,
> rather than pulling CVS before and after the patch was applied.

Yeah, a patch would be a little easier. :)

Thanks,
Mark


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Mark Wong <markw(at)osdl(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, maryedie(at)osdl(dot)org, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: [HACKERS] data on devel code perf dip
Date: 2005-08-12 16:12:17
Message-ID: 200508121612.j7CGCHT03271@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Mark Wong wrote:
> On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
>
> > Tom Lane wrote:
> > > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > > >> anyway), so shared_buffers should be irrelevant.
> > >
> > > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > > that is not used for writing dirty buffers during a checkpoint.
> > >
> > > I double-checked that O_DIRECT is really just used for WAL, and only
> > > when the sync mode is open_sync or open_datasync. So it seems
> > > impossible that it affected a run with mode fdatasync. What seems the
> > > best theory at the moment is that the grouped-WAL-write part of the
> > > patch doesn't work so well as we thought.
> >
> > Yes, that's my only guess. Let us know if you want the patch to test,
> > rather than pulling CVS before and after the patch was applied.
>
> Yeah, a patch would be a little easier. :)

OK, patch attached. The code has been cleaned up a little since then but
this is the basic change that should be tested. It is based on CVS of
2005/07/29 03:22:33 GMT.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Attachment Content-Type Size
unknown_filename text/plain 13.3 KB

From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Mark Wong <markw(at)osdl(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: [HACKERS] data on devel code perf dip
Date: 2005-08-12 18:26:46
Message-ID: 1123871206.9956.113.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Just to be certain I know what I have and how to use it,
please confirm the following is correct.

According to Markw, the tarball we have at OSDL that is
dated 7/29 already has the O_DIRECT patch + wal grouping
applied. Hence, I will use the one from the day before:
postgresql-20050728.tar.bz2 and apply your patch.

I expect to have, after applying the patch, the O_DIRECT patch
for the log (which I should not be using given the config
parameters I have), but I will _not have the wal grouping?

Is that correct?

On Fri, 2005-08-12 at 12:12 -0400, Bruce Momjian wrote:
> Mark Wong wrote:
> > On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> >
> > > Tom Lane wrote:
> > > > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > > > >> anyway), so shared_buffers should be irrelevant.
> > > >
> > > > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > > > that is not used for writing dirty buffers during a checkpoint.
> > > >
> > > > I double-checked that O_DIRECT is really just used for WAL, and only
> > > > when the sync mode is open_sync or open_datasync. So it seems
> > > > impossible that it affected a run with mode fdatasync. What seems the
> > > > best theory at the moment is that the grouped-WAL-write part of the
> > > > patch doesn't work so well as we thought.
> > >
> > > Yes, that's my only guess. Let us know if you want the patch to test,
> > > rather than pulling CVS before and after the patch was applied.
> >
> > Yeah, a patch would be a little easier. :)
>
> OK, patch attached. The code has been cleaned up a little since then but
> this is the basic change that should be tested. It is based on CVS of
> 2005/07/29 03:22:33 GMT.
>
> Plain text document attachment (/bjm/diff)
> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.210
> retrieving revision 1.211
> diff -c -r1.210 -r1.211
> *** src/backend/access/transam/xlog.c 23 Jul 2005 15:31:16 -0000 1.210
> --- src/backend/access/transam/xlog.c 29 Jul 2005 03:22:33 -0000 1.211
> ***************
> *** 48,77 ****
>
>
> /*
> * This chunk of hackery attempts to determine which file sync methods
> * are available on the current platform, and to choose an appropriate
> * default method. We assume that fsync() is always available, and that
> * configure determined whether fdatasync() is.
> */
> #if defined(O_SYNC)
> ! #define OPEN_SYNC_FLAG O_SYNC
> #else
> #if defined(O_FSYNC)
> ! #define OPEN_SYNC_FLAG O_FSYNC
> #endif
> #endif
>
> #if defined(O_DSYNC)
> #if defined(OPEN_SYNC_FLAG)
> ! #if O_DSYNC != OPEN_SYNC_FLAG
> ! #define OPEN_DATASYNC_FLAG O_DSYNC
> #endif
> #else /* !defined(OPEN_SYNC_FLAG) */
> /* Win32 only has O_DSYNC */
> ! #define OPEN_DATASYNC_FLAG O_DSYNC
> #endif
> #endif
>
> #if defined(OPEN_DATASYNC_FLAG)
> #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> --- 48,117 ----
>
>
> /*
> + * Becauase O_DIRECT bypasses the kernel buffers, and because we never
> + * read those buffers except during crash recovery, it is a win to use
> + * it in all cases where we sync on each write(). We could allow O_DIRECT
> + * with fsync(), but because skipping the kernel buffer forces writes out
> + * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> + * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> + */
> + #ifdef O_DIRECT
> + #define PG_O_DIRECT O_DIRECT
> + #else
> + #define PG_O_DIRECT 0
> + #endif
> +
> + /*
> * This chunk of hackery attempts to determine which file sync methods
> * are available on the current platform, and to choose an appropriate
> * default method. We assume that fsync() is always available, and that
> * configure determined whether fdatasync() is.
> */
> #if defined(O_SYNC)
> ! #define CMP_OPEN_SYNC_FLAG O_SYNC
> #else
> #if defined(O_FSYNC)
> ! #define CMP_OPEN_SYNC_FLAG O_FSYNC
> #endif
> #endif
> + #define OPEN_SYNC_FLAG (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)
>
> #if defined(O_DSYNC)
> #if defined(OPEN_SYNC_FLAG)
> ! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
> ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> #endif
> #else /* !defined(OPEN_SYNC_FLAG) */
> /* Win32 only has O_DSYNC */
> ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> #endif
> #endif
>
> + /*
> + * Limitation of buffer-alignment for direct io depend on OS and filesystem,
> + * but BLCKSZ is assumed to be enough for it.
> + */
> + #ifdef O_DIRECT
> + #define ALIGNOF_XLOG_BUFFER BLCKSZ
> + #else
> + #define ALIGNOF_XLOG_BUFFER MAXIMUM_ALIGNOF
> + #endif
> +
> + /*
> + * Switch the alignment routine because ShmemAlloc() returns a max-aligned
> + * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
> + */
> + #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
> + #define XLOG_BUFFER_ALIGN(LEN) MAXALIGN((LEN))
> + #else
> + #define XLOG_BUFFER_ALIGN(LEN) ((LEN) + (ALIGNOF_XLOG_BUFFER))
> + #endif
> + /* assume sizeof(ptrdiff_t) == sizeof(void*) */
> + #define POINTERALIGN(ALIGNVAL,PTR) \
> + ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
> + #define XLOG_BUFFER_POINTERALIGN(PTR) \
> + POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
> +
> #if defined(OPEN_DATASYNC_FLAG)
> #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> ***************
> *** 469,474 ****
> --- 509,525 ----
> static char *str_time(time_t tnow);
> static void issue_xlog_fsync(void);
>
> + /* XLog gather-write staffs */
> + typedef struct XLogPages
> + {
> + char *head; /* Head of first page */
> + int size; /* Total bytes of pages == count(pages) * BLCKSZ */
> + int offset; /* Offset in xlog segment file */
> + } XLogPages;
> + static void XLogPageReset(XLogPages *pages);
> + static void XLogPageWrite(XLogPages *pages, int index);
> + static void XLogPageFlush(XLogPages *pages, int index);
> +
> #ifdef WAL_DEBUG
> static void xlog_outrec(char *buf, XLogRecord *record);
> #endif
> ***************
> *** 1245,1253 ****
> XLogWrite(XLogwrtRqst WriteRqst)
> {
> XLogCtlWrite *Write = &XLogCtl->Write;
> - char *from;
> bool ispartialpage;
> bool use_existent;
>
> /* We should always be inside a critical section here */
> Assert(CritSectionCount > 0);
> --- 1296,1305 ----
> XLogWrite(XLogwrtRqst WriteRqst)
> {
> XLogCtlWrite *Write = &XLogCtl->Write;
> bool ispartialpage;
> bool use_existent;
> + int currentIndex = Write->curridx;
> + XLogPages pages;
>
> /* We should always be inside a critical section here */
> Assert(CritSectionCount > 0);
> ***************
> *** 1258,1263 ****
> --- 1310,1317 ----
> */
> LogwrtResult = Write->LogwrtResult;
>
> + XLogPageReset(&pages);
> +
> while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
> {
> /*
> ***************
> *** 1266,1279 ****
> * end of the last page that's been initialized by
> * AdvanceXLInsertBuffer.
> */
> ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
> elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> ! XLogCtl->xlblocks[Write->curridx].xlogid,
> ! XLogCtl->xlblocks[Write->curridx].xrecoff);
>
> /* Advance LogwrtResult.Write to end of current buffer page */
> ! LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
> ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
>
> if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> --- 1320,1333 ----
> * end of the last page that's been initialized by
> * AdvanceXLInsertBuffer.
> */
> ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
> elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> ! XLogCtl->xlblocks[currentIndex].xlogid,
> ! XLogCtl->xlblocks[currentIndex].xrecoff);
>
> /* Advance LogwrtResult.Write to end of current buffer page */
> ! LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
> ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
>
> if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> ***************
> *** 1281,1286 ****
> --- 1335,1341 ----
> /*
> * Switch to new logfile segment.
> */
> + XLogPageFlush(&pages, currentIndex);
> if (openLogFile >= 0)
> {
> if (close(openLogFile))
> ***************
> *** 1354,1384 ****
> openLogOff = 0;
> }
>
> ! /* Need to seek in the file? */
> ! if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
> ! {
> ! openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> ! if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> ! ereport(PANIC,
> ! (errcode_for_file_access(),
> ! errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> ! openLogId, openLogSeg, openLogOff)));
> ! }
> !
> ! /* OK to write the page */
> ! from = XLogCtl->pages + Write->curridx * BLCKSZ;
> ! errno = 0;
> ! if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
> ! {
> ! /* if write didn't set errno, assume problem is no disk space */
> ! if (errno == 0)
> ! errno = ENOSPC;
> ! ereport(PANIC,
> ! (errcode_for_file_access(),
> ! errmsg("could not write to log file %u, segment %u at offset %u: %m",
> ! openLogId, openLogSeg, openLogOff)));
> ! }
> ! openLogOff += BLCKSZ;
>
> /*
> * If we just wrote the whole last page of a logfile segment,
> --- 1409,1416 ----
> openLogOff = 0;
> }
>
> ! /* Add a page to buffer */
> ! XLogPageWrite(&pages, currentIndex);
>
> /*
> * If we just wrote the whole last page of a logfile segment,
> ***************
> *** 1390,1397 ****
> * This is also the right place to notify the Archiver that the
> * segment is ready to copy to archival storage.
> */
> ! if (openLogOff >= XLogSegSize && !ispartialpage)
> {
> issue_xlog_fsync();
> LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
>
> --- 1422,1430 ----
> * This is also the right place to notify the Archiver that the
> * segment is ready to copy to archival storage.
> */
> ! if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
> {
> + XLogPageFlush(&pages, currentIndex);
> issue_xlog_fsync();
> LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
>
> ***************
> *** 1405,1412 ****
> LogwrtResult.Write = WriteRqst.Write;
> break;
> }
> ! Write->curridx = NextBufIdx(Write->curridx);
> }
>
> /*
> * If asked to flush, do so
> --- 1438,1446 ----
> LogwrtResult.Write = WriteRqst.Write;
> break;
> }
> ! currentIndex = NextBufIdx(currentIndex);
> }
> + XLogPageFlush(&pages, currentIndex);
>
> /*
> * If asked to flush, do so
> ***************
> *** 3584,3590 ****
> if (XLOGbuffers < MinXLOGbuffers)
> XLOGbuffers = MinXLOGbuffers;
>
> ! return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> + BLCKSZ * XLOGbuffers +
> MAXALIGN(sizeof(ControlFileData));
> }
> --- 3618,3624 ----
> if (XLOGbuffers < MinXLOGbuffers)
> XLOGbuffers = MinXLOGbuffers;
>
> ! return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> + BLCKSZ * XLOGbuffers +
> MAXALIGN(sizeof(ControlFileData));
> }
> ***************
> *** 3601,3607 ****
>
> XLogCtl = (XLogCtlData *)
> ShmemInitStruct("XLOG Ctl",
> ! MAXALIGN(sizeof(XLogCtlData) +
> sizeof(XLogRecPtr) * XLOGbuffers)
> + BLCKSZ * XLOGbuffers,
> &foundXLog);
> --- 3635,3641 ----
>
> XLogCtl = (XLogCtlData *)
> ShmemInitStruct("XLOG Ctl",
> ! XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
> sizeof(XLogRecPtr) * XLOGbuffers)
> + BLCKSZ * XLOGbuffers,
> &foundXLog);
> ***************
> *** 3630,3638 ****
> * Here, on the other hand, we must MAXALIGN to ensure the page
> * buffers have worst-case alignment.
> */
> ! XLogCtl->pages =
> ! ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
> ! sizeof(XLogRecPtr) * XLOGbuffers);
> memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
>
> /*
> --- 3664,3672 ----
> * Here, on the other hand, we must MAXALIGN to ensure the page
> * buffers have worst-case alignment.
> */
> ! XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
> ! ((char *) XLogCtl)
> ! + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
> memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
>
> /*
> ***************
> *** 3690,3699 ****
> /* First timeline ID is always 1 */
> ThisTimeLineID = 1;
>
> ! /* Use malloc() to ensure buffer is MAXALIGNED */
> ! buffer = (char *) malloc(BLCKSZ);
> ! page = (XLogPageHeader) buffer;
> ! memset(buffer, 0, BLCKSZ);
>
> /* Set up information for the initial checkpoint record */
> checkPoint.redo.xlogid = 0;
> --- 3724,3732 ----
> /* First timeline ID is always 1 */
> ThisTimeLineID = 1;
>
> ! buffer = (char *) malloc(BLCKSZ + ALIGNOF_XLOG_BUFFER);
> ! page = (XLogPageHeader) XLOG_BUFFER_POINTERALIGN(buffer);
> ! memset(page, 0, BLCKSZ);
>
> /* Set up information for the initial checkpoint record */
> checkPoint.redo.xlogid = 0;
> ***************
> *** 3745,3751 ****
>
> /* Write the first page with the initial record */
> errno = 0;
> ! if (write(openLogFile, buffer, BLCKSZ) != BLCKSZ)
> {
> /* if write didn't set errno, assume problem is no disk space */
> if (errno == 0)
> --- 3778,3784 ----
>
> /* Write the first page with the initial record */
> errno = 0;
> ! if (write(openLogFile, page, BLCKSZ) != BLCKSZ)
> {
> /* if write didn't set errno, assume problem is no disk space */
> if (errno == 0)
> ***************
> *** 5837,5839 ****
> --- 5870,5940 ----
> errmsg("could not remove file \"%s\": %m",
> BACKUP_LABEL_FILE)));
> }
> +
> +
> + /* XLog gather-write staffs */
> +
> + static void
> + XLogPageReset(XLogPages *pages)
> + {
> + memset(pages, 0, sizeof(*pages));
> + }
> +
> + static void
> + XLogPageWrite(XLogPages *pages, int index)
> + {
> + char *page = XLogCtl->pages + index * BLCKSZ;
> + int size = BLCKSZ;
> + int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> +
> + if (pages->head + pages->size == page
> + && pages->offset + pages->size == offset)
> + { /* Pages are continuous. Append new page. */
> + pages->size += size;
> + }
> + else
> + { /* Pages are not continuous. Flush and clear. */
> + XLogPageFlush(pages, PrevBufIdx(index));
> + pages->head = page;
> + pages->size = size;
> + pages->offset = offset;
> + }
> + }
> +
> + static void
> + XLogPageFlush(XLogPages *pages, int index)
> + {
> + if (!pages->head)
> + { /* No needs to write pages. */
> + XLogCtl->Write.curridx = index;
> + return;
> + }
> +
> + /* Need to seek in the file? */
> + if (openLogOff != pages->offset)
> + {
> + openLogOff = pages->offset;
> + if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> + ereport(PANIC,
> + (errcode_for_file_access(),
> + errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> + openLogId, openLogSeg, openLogOff)));
> + }
> +
> + /* OK to write the page */
> + errno = 0;
> + if (write(openLogFile, pages->head, pages->size) != pages->size)
> + {
> + /* if write didn't set errno, assume problem is no disk space */
> + if (errno == 0)
> + errno = ENOSPC;
> + ereport(PANIC,
> + (errcode_for_file_access(),
> + errmsg("could not write to log file %u, segment %u at offset %u: %m",
> + openLogId, openLogSeg, openLogOff)));
> + }
> +
> + openLogOff += pages->size;
> + XLogCtl->Write.curridx = index;
> + XLogPageReset(pages);
> + }


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: maryedie(at)osdl(dot)org
Cc: Mark Wong <markw(at)osdl(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: [HACKERS] data on devel code perf dip
Date: 2005-08-12 18:53:32
Message-ID: 200508121853.j7CIrWU17249@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Mary Edie Meredith wrote:
> Just to be certain I know what I have and how to use it,
> please confirm the following is correct.
>
> According to Markw, the tarball we have at OSDL that is
> dated 7/29 already has the O_DIRECT patch + wal grouping
> applied. Hence, I will use the one from the day before:
> postgresql-20050728.tar.bz2 and apply your patch.

Right, or use patch -R to reverse out the changes and revert to a
version with O_DIRECT.
>
> I expect to have, after applying the patch, the O_DIRECT patch
> for the log (which I should not be using given the config
> parameters I have), but I will _not have the wal grouping?

The patch adds O_DIRECT (which is not being used given your configure
paramters), and grouped WAL writes.

---------------------------------------------------------------------------

>
> Is that correct?
>
> On Fri, 2005-08-12 at 12:12 -0400, Bruce Momjian wrote:
> > Mark Wong wrote:
> > > On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
> > > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> > >
> > > > Tom Lane wrote:
> > > > > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > > > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > > > > >> anyway), so shared_buffers should be irrelevant.
> > > > >
> > > > > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > > > > that is not used for writing dirty buffers during a checkpoint.
> > > > >
> > > > > I double-checked that O_DIRECT is really just used for WAL, and only
> > > > > when the sync mode is open_sync or open_datasync. So it seems
> > > > > impossible that it affected a run with mode fdatasync. What seems the
> > > > > best theory at the moment is that the grouped-WAL-write part of the
> > > > > patch doesn't work so well as we thought.
> > > >
> > > > Yes, that's my only guess. Let us know if you want the patch to test,
> > > > rather than pulling CVS before and after the patch was applied.
> > >
> > > Yeah, a patch would be a little easier. :)
> >
> > OK, patch attached. The code has been cleaned up a little since then but
> > this is the basic change that should be tested. It is based on CVS of
> > 2005/07/29 03:22:33 GMT.
> >
> > Plain text document attachment (/bjm/diff)
> > Index: src/backend/access/transam/xlog.c
> > ===================================================================
> > RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> > retrieving revision 1.210
> > retrieving revision 1.211
> > diff -c -r1.210 -r1.211
> > *** src/backend/access/transam/xlog.c 23 Jul 2005 15:31:16 -0000 1.210
> > --- src/backend/access/transam/xlog.c 29 Jul 2005 03:22:33 -0000 1.211
> > ***************
> > *** 48,77 ****
> >
> >
> > /*
> > * This chunk of hackery attempts to determine which file sync methods
> > * are available on the current platform, and to choose an appropriate
> > * default method. We assume that fsync() is always available, and that
> > * configure determined whether fdatasync() is.
> > */
> > #if defined(O_SYNC)
> > ! #define OPEN_SYNC_FLAG O_SYNC
> > #else
> > #if defined(O_FSYNC)
> > ! #define OPEN_SYNC_FLAG O_FSYNC
> > #endif
> > #endif
> >
> > #if defined(O_DSYNC)
> > #if defined(OPEN_SYNC_FLAG)
> > ! #if O_DSYNC != OPEN_SYNC_FLAG
> > ! #define OPEN_DATASYNC_FLAG O_DSYNC
> > #endif
> > #else /* !defined(OPEN_SYNC_FLAG) */
> > /* Win32 only has O_DSYNC */
> > ! #define OPEN_DATASYNC_FLAG O_DSYNC
> > #endif
> > #endif
> >
> > #if defined(OPEN_DATASYNC_FLAG)
> > #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> > #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> > --- 48,117 ----
> >
> >
> > /*
> > + * Becauase O_DIRECT bypasses the kernel buffers, and because we never
> > + * read those buffers except during crash recovery, it is a win to use
> > + * it in all cases where we sync on each write(). We could allow O_DIRECT
> > + * with fsync(), but because skipping the kernel buffer forces writes out
> > + * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> > + * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> > + */
> > + #ifdef O_DIRECT
> > + #define PG_O_DIRECT O_DIRECT
> > + #else
> > + #define PG_O_DIRECT 0
> > + #endif
> > +
> > + /*
> > * This chunk of hackery attempts to determine which file sync methods
> > * are available on the current platform, and to choose an appropriate
> > * default method. We assume that fsync() is always available, and that
> > * configure determined whether fdatasync() is.
> > */
> > #if defined(O_SYNC)
> > ! #define CMP_OPEN_SYNC_FLAG O_SYNC
> > #else
> > #if defined(O_FSYNC)
> > ! #define CMP_OPEN_SYNC_FLAG O_FSYNC
> > #endif
> > #endif
> > + #define OPEN_SYNC_FLAG (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)
> >
> > #if defined(O_DSYNC)
> > #if defined(OPEN_SYNC_FLAG)
> > ! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
> > ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> > #endif
> > #else /* !defined(OPEN_SYNC_FLAG) */
> > /* Win32 only has O_DSYNC */
> > ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> > #endif
> > #endif
> >
> > + /*
> > + * Limitation of buffer-alignment for direct io depend on OS and filesystem,
> > + * but BLCKSZ is assumed to be enough for it.
> > + */
> > + #ifdef O_DIRECT
> > + #define ALIGNOF_XLOG_BUFFER BLCKSZ
> > + #else
> > + #define ALIGNOF_XLOG_BUFFER MAXIMUM_ALIGNOF
> > + #endif
> > +
> > + /*
> > + * Switch the alignment routine because ShmemAlloc() returns a max-aligned
> > + * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
> > + */
> > + #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
> > + #define XLOG_BUFFER_ALIGN(LEN) MAXALIGN((LEN))
> > + #else
> > + #define XLOG_BUFFER_ALIGN(LEN) ((LEN) + (ALIGNOF_XLOG_BUFFER))
> > + #endif
> > + /* assume sizeof(ptrdiff_t) == sizeof(void*) */
> > + #define POINTERALIGN(ALIGNVAL,PTR) \
> > + ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
> > + #define XLOG_BUFFER_POINTERALIGN(PTR) \
> > + POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
> > +
> > #if defined(OPEN_DATASYNC_FLAG)
> > #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> > #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> > ***************
> > *** 469,474 ****
> > --- 509,525 ----
> > static char *str_time(time_t tnow);
> > static void issue_xlog_fsync(void);
> >
> > + /* XLog gather-write staffs */
> > + typedef struct XLogPages
> > + {
> > + char *head; /* Head of first page */
> > + int size; /* Total bytes of pages == count(pages) * BLCKSZ */
> > + int offset; /* Offset in xlog segment file */
> > + } XLogPages;
> > + static void XLogPageReset(XLogPages *pages);
> > + static void XLogPageWrite(XLogPages *pages, int index);
> > + static void XLogPageFlush(XLogPages *pages, int index);
> > +
> > #ifdef WAL_DEBUG
> > static void xlog_outrec(char *buf, XLogRecord *record);
> > #endif
> > ***************
> > *** 1245,1253 ****
> > XLogWrite(XLogwrtRqst WriteRqst)
> > {
> > XLogCtlWrite *Write = &XLogCtl->Write;
> > - char *from;
> > bool ispartialpage;
> > bool use_existent;
> >
> > /* We should always be inside a critical section here */
> > Assert(CritSectionCount > 0);
> > --- 1296,1305 ----
> > XLogWrite(XLogwrtRqst WriteRqst)
> > {
> > XLogCtlWrite *Write = &XLogCtl->Write;
> > bool ispartialpage;
> > bool use_existent;
> > + int currentIndex = Write->curridx;
> > + XLogPages pages;
> >
> > /* We should always be inside a critical section here */
> > Assert(CritSectionCount > 0);
> > ***************
> > *** 1258,1263 ****
> > --- 1310,1317 ----
> > */
> > LogwrtResult = Write->LogwrtResult;
> >
> > + XLogPageReset(&pages);
> > +
> > while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
> > {
> > /*
> > ***************
> > *** 1266,1279 ****
> > * end of the last page that's been initialized by
> > * AdvanceXLInsertBuffer.
> > */
> > ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
> > elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> > LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> > ! XLogCtl->xlblocks[Write->curridx].xlogid,
> > ! XLogCtl->xlblocks[Write->curridx].xrecoff);
> >
> > /* Advance LogwrtResult.Write to end of current buffer page */
> > ! LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
> > ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
> >
> > if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> > --- 1320,1333 ----
> > * end of the last page that's been initialized by
> > * AdvanceXLInsertBuffer.
> > */
> > ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
> > elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> > LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> > ! XLogCtl->xlblocks[currentIndex].xlogid,
> > ! XLogCtl->xlblocks[currentIndex].xrecoff);
> >
> > /* Advance LogwrtResult.Write to end of current buffer page */
> > ! LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
> > ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
> >
> > if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> > ***************
> > *** 1281,1286 ****
> > --- 1335,1341 ----
> > /*
> > * Switch to new logfile segment.
> > */
> > + XLogPageFlush(&pages, currentIndex);
> > if (openLogFile >= 0)
> > {
> > if (close(openLogFile))
> > ***************
> > *** 1354,1384 ****
> > openLogOff = 0;
> > }
> >
> > ! /* Need to seek in the file? */
> > ! if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
> > ! {
> > ! openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> > ! if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> > ! ereport(PANIC,
> > ! (errcode_for_file_access(),
> > ! errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> > ! openLogId, openLogSeg, openLogOff)));
> > ! }
> > !
> > ! /* OK to write the page */
> > ! from = XLogCtl->pages + Write->curridx * BLCKSZ;
> > ! errno = 0;
> > ! if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
> > ! {
> > ! /* if write didn't set errno, assume problem is no disk space */
> > ! if (errno == 0)
> > ! errno = ENOSPC;
> > ! ereport(PANIC,
> > ! (errcode_for_file_access(),
> > ! errmsg("could not write to log file %u, segment %u at offset %u: %m",
> > ! openLogId, openLogSeg, openLogOff)));
> > ! }
> > ! openLogOff += BLCKSZ;
> >
> > /*
> > * If we just wrote the whole last page of a logfile segment,
> > --- 1409,1416 ----
> > openLogOff = 0;
> > }
> >
> > ! /* Add a page to buffer */
> > ! XLogPageWrite(&pages, currentIndex);
> >
> > /*
> > * If we just wrote the whole last page of a logfile segment,
> > ***************
> > *** 1390,1397 ****
> > * This is also the right place to notify the Archiver that the
> > * segment is ready to copy to archival storage.
> > */
> > ! if (openLogOff >= XLogSegSize && !ispartialpage)
> > {
> > issue_xlog_fsync();
> > LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
> >
> > --- 1422,1430 ----
> > * This is also the right place to notify the Archiver that the
> > * segment is ready to copy to archival storage.
> > */
> > ! if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
> > {
> > + XLogPageFlush(&pages, currentIndex);
> > issue_xlog_fsync();
> > LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
> >
> > ***************
> > *** 1405,1412 ****
> > LogwrtResult.Write = WriteRqst.Write;
> > break;
> > }
> > ! Write->curridx = NextBufIdx(Write->curridx);
> > }
> >
> > /*
> > * If asked to flush, do so
> > --- 1438,1446 ----
> > LogwrtResult.Write = WriteRqst.Write;
> > break;
> > }
> > ! currentIndex = NextBufIdx(currentIndex);
> > }
> > + XLogPageFlush(&pages, currentIndex);
> >
> > /*
> > * If asked to flush, do so
> > ***************
> > *** 3584,3590 ****
> > if (XLOGbuffers < MinXLOGbuffers)
> > XLOGbuffers = MinXLOGbuffers;
> >
> > ! return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> > + BLCKSZ * XLOGbuffers +
> > MAXALIGN(sizeof(ControlFileData));
> > }
> > --- 3618,3624 ----
> > if (XLOGbuffers < MinXLOGbuffers)
> > XLOGbuffers = MinXLOGbuffers;
> >
> > ! return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> > + BLCKSZ * XLOGbuffers +
> > MAXALIGN(sizeof(ControlFileData));
> > }
> > ***************
> > *** 3601,3607 ****
> >
> > XLogCtl = (XLogCtlData *)
> > ShmemInitStruct("XLOG Ctl",
> > ! MAXALIGN(sizeof(XLogCtlData) +
> > sizeof(XLogRecPtr) * XLOGbuffers)
> > + BLCKSZ * XLOGbuffers,
> > &foundXLog);
> > --- 3635,3641 ----
> >
> > XLogCtl = (XLogCtlData *)
> > ShmemInitStruct("XLOG Ctl",
> > ! XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
> > sizeof(XLogRecPtr) * XLOGbuffers)
> > + BLCKSZ * XLOGbuffers,
> > &foundXLog);
> > ***************
> > *** 3630,3638 ****
> > * Here, on the other hand, we must MAXALIGN to ensure the page
> > * buffers have worst-case alignment.
> > */
> > ! XLogCtl->pages =
> > ! ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
> > ! sizeof(XLogRecPtr) * XLOGbuffers);
> > memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
> >
> > /*
> > --- 3664,3672 ----
> > * Here, on the other hand, we must MAXALIGN to ensure the page
> > * buffers have worst-case alignment.
> > */
> > ! XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
> > ! ((char *) XLogCtl)
> > ! + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
> > memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
> >
> > /*
> > ***************
> > *** 3690,3699 ****
> > /* First timeline ID is always 1 */
> > ThisTimeLineID = 1;
> >
> > ! /* Use malloc() to ensure buffer is MAXALIGNED */
> > ! buffer = (char *) malloc(BLCKSZ);
> > ! page = (XLogPageHeader) buffer;
> > ! memset(buffer, 0, BLCKSZ);
> >
> > /* Set up information for the initial checkpoint record */
> > checkPoint.redo.xlogid = 0;
> > --- 3724,3732 ----
> > /* First timeline ID is always 1 */
> > ThisTimeLineID = 1;
> >
> > ! buffer = (char *) malloc(BLCKSZ + ALIGNOF_XLOG_BUFFER);
> > ! page = (XLogPageHeader) XLOG_BUFFER_POINTERALIGN(buffer);
> > ! memset(page, 0, BLCKSZ);
> >
> > /* Set up information for the initial checkpoint record */
> > checkPoint.redo.xlogid = 0;
> > ***************
> > *** 3745,3751 ****
> >
> > /* Write the first page with the initial record */
> > errno = 0;
> > ! if (write(openLogFile, buffer, BLCKSZ) != BLCKSZ)
> > {
> > /* if write didn't set errno, assume problem is no disk space */
> > if (errno == 0)
> > --- 3778,3784 ----
> >
> > /* Write the first page with the initial record */
> > errno = 0;
> > ! if (write(openLogFile, page, BLCKSZ) != BLCKSZ)
> > {
> > /* if write didn't set errno, assume problem is no disk space */
> > if (errno == 0)
> > ***************
> > *** 5837,5839 ****
> > --- 5870,5940 ----
> > errmsg("could not remove file \"%s\": %m",
> > BACKUP_LABEL_FILE)));
> > }
> > +
> > +
> > + /* XLog gather-write staffs */
> > +
> > + static void
> > + XLogPageReset(XLogPages *pages)
> > + {
> > + memset(pages, 0, sizeof(*pages));
> > + }
> > +
> > + static void
> > + XLogPageWrite(XLogPages *pages, int index)
> > + {
> > + char *page = XLogCtl->pages + index * BLCKSZ;
> > + int size = BLCKSZ;
> > + int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> > +
> > + if (pages->head + pages->size == page
> > + && pages->offset + pages->size == offset)
> > + { /* Pages are continuous. Append new page. */
> > + pages->size += size;
> > + }
> > + else
> > + { /* Pages are not continuous. Flush and clear. */
> > + XLogPageFlush(pages, PrevBufIdx(index));
> > + pages->head = page;
> > + pages->size = size;
> > + pages->offset = offset;
> > + }
> > + }
> > +
> > + static void
> > + XLogPageFlush(XLogPages *pages, int index)
> > + {
> > + if (!pages->head)
> > + { /* No needs to write pages. */
> > + XLogCtl->Write.curridx = index;
> > + return;
> > + }
> > +
> > + /* Need to seek in the file? */
> > + if (openLogOff != pages->offset)
> > + {
> > + openLogOff = pages->offset;
> > + if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> > + ereport(PANIC,
> > + (errcode_for_file_access(),
> > + errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> > + openLogId, openLogSeg, openLogOff)));
> > + }
> > +
> > + /* OK to write the page */
> > + errno = 0;
> > + if (write(openLogFile, pages->head, pages->size) != pages->size)
> > + {
> > + /* if write didn't set errno, assume problem is no disk space */
> > + if (errno == 0)
> > + errno = ENOSPC;
> > + ereport(PANIC,
> > + (errcode_for_file_access(),
> > + errmsg("could not write to log file %u, segment %u at offset %u: %m",
> > + openLogId, openLogSeg, openLogOff)));
> > + }
> > +
> > + openLogOff += pages->size;
> > + XLogCtl->Write.curridx = index;
> > + XLogPageReset(pages);
> > + }
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Mark Wong <markw(at)osdl(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: [HACKERS] data on devel code perf dip
Date: 2005-08-12 20:22:08
Message-ID: 1123878128.9956.121.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Fri, 2005-08-12 at 14:53 -0400, Bruce Momjian wrote:
> Mary Edie Meredith wrote:
> > Just to be certain I know what I have and how to use it,
> > please confirm the following is correct.
> >
> > According to Markw, the tarball we have at OSDL that is
> > dated 7/29 already has the O_DIRECT patch + wal grouping
> > applied. Hence, I will use the one from the day before:
> > postgresql-20050728.tar.bz2 and apply your patch.
>
> Right, or use patch -R to reverse out the changes and revert to a
> version with O_DIRECT.
> >
> > I expect to have, after applying the patch, the O_DIRECT patch
> > for the log (which I should not be using given the config
> > parameters I have), but I will _not have the wal grouping?
>
> The patch adds O_DIRECT (which is not being used given your configure
> paramters), and grouped WAL writes.
So I'll use postgresql-20050728, I'll run with it to confirm it is
running similarly to my good run (42). If so, then I'll apply the patch
and see what happens....

This may take me a while just because I'm backed up with other
things ....

>
> ---------------------------------------------------------------------------
>
>
> >
> > Is that correct?
> >
> > On Fri, 2005-08-12 at 12:12 -0400, Bruce Momjian wrote:
> > > Mark Wong wrote:
> > > > On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
> > > > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> > > >
> > > > > Tom Lane wrote:
> > > > > > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > > > > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > > > > > >> anyway), so shared_buffers should be irrelevant.
> > > > > >
> > > > > > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > > > > > that is not used for writing dirty buffers during a checkpoint.
> > > > > >
> > > > > > I double-checked that O_DIRECT is really just used for WAL, and only
> > > > > > when the sync mode is open_sync or open_datasync. So it seems
> > > > > > impossible that it affected a run with mode fdatasync. What seems the
> > > > > > best theory at the moment is that the grouped-WAL-write part of the
> > > > > > patch doesn't work so well as we thought.
> > > > >
> > > > > Yes, that's my only guess. Let us know if you want the patch to test,
> > > > > rather than pulling CVS before and after the patch was applied.
> > > >
> > > > Yeah, a patch would be a little easier. :)
> > >
> > > OK, patch attached. The code has been cleaned up a little since then but
> > > this is the basic change that should be tested. It is based on CVS of
> > > 2005/07/29 03:22:33 GMT.
> > >
> > > Plain text document attachment (/bjm/diff)
> > > Index: src/backend/access/transam/xlog.c
> > > ===================================================================
> > > RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> > > retrieving revision 1.210
> > > retrieving revision 1.211
> > > diff -c -r1.210 -r1.211
> > > *** src/backend/access/transam/xlog.c 23 Jul 2005 15:31:16 -0000 1.210
> > > --- src/backend/access/transam/xlog.c 29 Jul 2005 03:22:33 -0000 1.211
> > > ***************
> > > *** 48,77 ****
> > >
> > >
> > > /*
> > > * This chunk of hackery attempts to determine which file sync methods
> > > * are available on the current platform, and to choose an appropriate
> > > * default method. We assume that fsync() is always available, and that
> > > * configure determined whether fdatasync() is.
> > > */
> > > #if defined(O_SYNC)
> > > ! #define OPEN_SYNC_FLAG O_SYNC
> > > #else
> > > #if defined(O_FSYNC)
> > > ! #define OPEN_SYNC_FLAG O_FSYNC
> > > #endif
> > > #endif
> > >
> > > #if defined(O_DSYNC)
> > > #if defined(OPEN_SYNC_FLAG)
> > > ! #if O_DSYNC != OPEN_SYNC_FLAG
> > > ! #define OPEN_DATASYNC_FLAG O_DSYNC
> > > #endif
> > > #else /* !defined(OPEN_SYNC_FLAG) */
> > > /* Win32 only has O_DSYNC */
> > > ! #define OPEN_DATASYNC_FLAG O_DSYNC
> > > #endif
> > > #endif
> > >
> > > #if defined(OPEN_DATASYNC_FLAG)
> > > #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> > > #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> > > --- 48,117 ----
> > >
> > >
> > > /*
> > > + * Becauase O_DIRECT bypasses the kernel buffers, and because we never
> > > + * read those buffers except during crash recovery, it is a win to use
> > > + * it in all cases where we sync on each write(). We could allow O_DIRECT
> > > + * with fsync(), but because skipping the kernel buffer forces writes out
> > > + * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> > > + * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> > > + */
> > > + #ifdef O_DIRECT
> > > + #define PG_O_DIRECT O_DIRECT
> > > + #else
> > > + #define PG_O_DIRECT 0
> > > + #endif
> > > +
> > > + /*
> > > * This chunk of hackery attempts to determine which file sync methods
> > > * are available on the current platform, and to choose an appropriate
> > > * default method. We assume that fsync() is always available, and that
> > > * configure determined whether fdatasync() is.
> > > */
> > > #if defined(O_SYNC)
> > > ! #define CMP_OPEN_SYNC_FLAG O_SYNC
> > > #else
> > > #if defined(O_FSYNC)
> > > ! #define CMP_OPEN_SYNC_FLAG O_FSYNC
> > > #endif
> > > #endif
> > > + #define OPEN_SYNC_FLAG (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)
> > >
> > > #if defined(O_DSYNC)
> > > #if defined(OPEN_SYNC_FLAG)
> > > ! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
> > > ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> > > #endif
> > > #else /* !defined(OPEN_SYNC_FLAG) */
> > > /* Win32 only has O_DSYNC */
> > > ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> > > #endif
> > > #endif
> > >
> > > + /*
> > > + * Limitation of buffer-alignment for direct io depend on OS and filesystem,
> > > + * but BLCKSZ is assumed to be enough for it.
> > > + */
> > > + #ifdef O_DIRECT
> > > + #define ALIGNOF_XLOG_BUFFER BLCKSZ
> > > + #else
> > > + #define ALIGNOF_XLOG_BUFFER MAXIMUM_ALIGNOF
> > > + #endif
> > > +
> > > + /*
> > > + * Switch the alignment routine because ShmemAlloc() returns a max-aligned
> > > + * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
> > > + */
> > > + #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
> > > + #define XLOG_BUFFER_ALIGN(LEN) MAXALIGN((LEN))
> > > + #else
> > > + #define XLOG_BUFFER_ALIGN(LEN) ((LEN) + (ALIGNOF_XLOG_BUFFER))
> > > + #endif
> > > + /* assume sizeof(ptrdiff_t) == sizeof(void*) */
> > > + #define POINTERALIGN(ALIGNVAL,PTR) \
> > > + ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
> > > + #define XLOG_BUFFER_POINTERALIGN(PTR) \
> > > + POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
> > > +
> > > #if defined(OPEN_DATASYNC_FLAG)
> > > #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> > > #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> > > ***************
> > > *** 469,474 ****
> > > --- 509,525 ----
> > > static char *str_time(time_t tnow);
> > > static void issue_xlog_fsync(void);
> > >
> > > + /* XLog gather-write staffs */
> > > + typedef struct XLogPages
> > > + {
> > > + char *head; /* Head of first page */
> > > + int size; /* Total bytes of pages == count(pages) * BLCKSZ */
> > > + int offset; /* Offset in xlog segment file */
> > > + } XLogPages;
> > > + static void XLogPageReset(XLogPages *pages);
> > > + static void XLogPageWrite(XLogPages *pages, int index);
> > > + static void XLogPageFlush(XLogPages *pages, int index);
> > > +
> > > #ifdef WAL_DEBUG
> > > static void xlog_outrec(char *buf, XLogRecord *record);
> > > #endif
> > > ***************
> > > *** 1245,1253 ****
> > > XLogWrite(XLogwrtRqst WriteRqst)
> > > {
> > > XLogCtlWrite *Write = &XLogCtl->Write;
> > > - char *from;
> > > bool ispartialpage;
> > > bool use_existent;
> > >
> > > /* We should always be inside a critical section here */
> > > Assert(CritSectionCount > 0);
> > > --- 1296,1305 ----
> > > XLogWrite(XLogwrtRqst WriteRqst)
> > > {
> > > XLogCtlWrite *Write = &XLogCtl->Write;
> > > bool ispartialpage;
> > > bool use_existent;
> > > + int currentIndex = Write->curridx;
> > > + XLogPages pages;
> > >
> > > /* We should always be inside a critical section here */
> > > Assert(CritSectionCount > 0);
> > > ***************
> > > *** 1258,1263 ****
> > > --- 1310,1317 ----
> > > */
> > > LogwrtResult = Write->LogwrtResult;
> > >
> > > + XLogPageReset(&pages);
> > > +
> > > while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
> > > {
> > > /*
> > > ***************
> > > *** 1266,1279 ****
> > > * end of the last page that's been initialized by
> > > * AdvanceXLInsertBuffer.
> > > */
> > > ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
> > > elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> > > LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> > > ! XLogCtl->xlblocks[Write->curridx].xlogid,
> > > ! XLogCtl->xlblocks[Write->curridx].xrecoff);
> > >
> > > /* Advance LogwrtResult.Write to end of current buffer page */
> > > ! LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
> > > ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
> > >
> > > if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> > > --- 1320,1333 ----
> > > * end of the last page that's been initialized by
> > > * AdvanceXLInsertBuffer.
> > > */
> > > ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
> > > elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> > > LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> > > ! XLogCtl->xlblocks[currentIndex].xlogid,
> > > ! XLogCtl->xlblocks[currentIndex].xrecoff);
> > >
> > > /* Advance LogwrtResult.Write to end of current buffer page */
> > > ! LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
> > > ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
> > >
> > > if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> > > ***************
> > > *** 1281,1286 ****
> > > --- 1335,1341 ----
> > > /*
> > > * Switch to new logfile segment.
> > > */
> > > + XLogPageFlush(&pages, currentIndex);
> > > if (openLogFile >= 0)
> > > {
> > > if (close(openLogFile))
> > > ***************
> > > *** 1354,1384 ****
> > > openLogOff = 0;
> > > }
> > >
> > > ! /* Need to seek in the file? */
> > > ! if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
> > > ! {
> > > ! openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> > > ! if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> > > ! ereport(PANIC,
> > > ! (errcode_for_file_access(),
> > > ! errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> > > ! openLogId, openLogSeg, openLogOff)));
> > > ! }
> > > !
> > > ! /* OK to write the page */
> > > ! from = XLogCtl->pages + Write->curridx * BLCKSZ;
> > > ! errno = 0;
> > > ! if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
> > > ! {
> > > ! /* if write didn't set errno, assume problem is no disk space */
> > > ! if (errno == 0)
> > > ! errno = ENOSPC;
> > > ! ereport(PANIC,
> > > ! (errcode_for_file_access(),
> > > ! errmsg("could not write to log file %u, segment %u at offset %u: %m",
> > > ! openLogId, openLogSeg, openLogOff)));
> > > ! }
> > > ! openLogOff += BLCKSZ;
> > >
> > > /*
> > > * If we just wrote the whole last page of a logfile segment,
> > > --- 1409,1416 ----
> > > openLogOff = 0;
> > > }
> > >
> > > ! /* Add a page to buffer */
> > > ! XLogPageWrite(&pages, currentIndex);
> > >
> > > /*
> > > * If we just wrote the whole last page of a logfile segment,
> > > ***************
> > > *** 1390,1397 ****
> > > * This is also the right place to notify the Archiver that the
> > > * segment is ready to copy to archival storage.
> > > */
> > > ! if (openLogOff >= XLogSegSize && !ispartialpage)
> > > {
> > > issue_xlog_fsync();
> > > LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
> > >
> > > --- 1422,1430 ----
> > > * This is also the right place to notify the Archiver that the
> > > * segment is ready to copy to archival storage.
> > > */
> > > ! if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
> > > {
> > > + XLogPageFlush(&pages, currentIndex);
> > > issue_xlog_fsync();
> > > LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
> > >
> > > ***************
> > > *** 1405,1412 ****
> > > LogwrtResult.Write = WriteRqst.Write;
> > > break;
> > > }
> > > ! Write->curridx = NextBufIdx(Write->curridx);
> > > }
> > >
> > > /*
> > > * If asked to flush, do so
> > > --- 1438,1446 ----
> > > LogwrtResult.Write = WriteRqst.Write;
> > > break;
> > > }
> > > ! currentIndex = NextBufIdx(currentIndex);
> > > }
> > > + XLogPageFlush(&pages, currentIndex);
> > >
> > > /*
> > > * If asked to flush, do so
> > > ***************
> > > *** 3584,3590 ****
> > > if (XLOGbuffers < MinXLOGbuffers)
> > > XLOGbuffers = MinXLOGbuffers;
> > >
> > > ! return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> > > + BLCKSZ * XLOGbuffers +
> > > MAXALIGN(sizeof(ControlFileData));
> > > }
> > > --- 3618,3624 ----
> > > if (XLOGbuffers < MinXLOGbuffers)
> > > XLOGbuffers = MinXLOGbuffers;
> > >
> > > ! return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> > > + BLCKSZ * XLOGbuffers +
> > > MAXALIGN(sizeof(ControlFileData));
> > > }
> > > ***************
> > > *** 3601,3607 ****
> > >
> > > XLogCtl = (XLogCtlData *)
> > > ShmemInitStruct("XLOG Ctl",
> > > ! MAXALIGN(sizeof(XLogCtlData) +
> > > sizeof(XLogRecPtr) * XLOGbuffers)
> > > + BLCKSZ * XLOGbuffers,
> > > &foundXLog);
> > > --- 3635,3641 ----
> > >
> > > XLogCtl = (XLogCtlData *)
> > > ShmemInitStruct("XLOG Ctl",
> > > ! XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
> > > sizeof(XLogRecPtr) * XLOGbuffers)
> > > + BLCKSZ * XLOGbuffers,
> > > &foundXLog);
> > > ***************
> > > *** 3630,3638 ****
> > > * Here, on the other hand, we must MAXALIGN to ensure the page
> > > * buffers have worst-case alignment.
> > > */
> > > ! XLogCtl->pages =
> > > ! ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
> > > ! sizeof(XLogRecPtr) * XLOGbuffers);
> > > memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
> > >
> > > /*
> > > --- 3664,3672 ----
> > > * Here, on the other hand, we must MAXALIGN to ensure the page
> > > * buffers have worst-case alignment.
> > > */
> > > ! XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
> > > ! ((char *) XLogCtl)
> > > ! + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
> > > memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
> > >
> > > /*
> > > ***************
> > > *** 3690,3699 ****
> > > /* First timeline ID is always 1 */
> > > ThisTimeLineID = 1;
> > >
> > > ! /* Use malloc() to ensure buffer is MAXALIGNED */
> > > ! buffer = (char *) malloc(BLCKSZ);
> > > ! page = (XLogPageHeader) buffer;
> > > ! memset(buffer, 0, BLCKSZ);
> > >
> > > /* Set up information for the initial checkpoint record */
> > > checkPoint.redo.xlogid = 0;
> > > --- 3724,3732 ----
> > > /* First timeline ID is always 1 */
> > > ThisTimeLineID = 1;
> > >
> > > ! buffer = (char *) malloc(BLCKSZ + ALIGNOF_XLOG_BUFFER);
> > > ! page = (XLogPageHeader) XLOG_BUFFER_POINTERALIGN(buffer);
> > > ! memset(page, 0, BLCKSZ);
> > >
> > > /* Set up information for the initial checkpoint record */
> > > checkPoint.redo.xlogid = 0;
> > > ***************
> > > *** 3745,3751 ****
> > >
> > > /* Write the first page with the initial record */
> > > errno = 0;
> > > ! if (write(openLogFile, buffer, BLCKSZ) != BLCKSZ)
> > > {
> > > /* if write didn't set errno, assume problem is no disk space */
> > > if (errno == 0)
> > > --- 3778,3784 ----
> > >
> > > /* Write the first page with the initial record */
> > > errno = 0;
> > > ! if (write(openLogFile, page, BLCKSZ) != BLCKSZ)
> > > {
> > > /* if write didn't set errno, assume problem is no disk space */
> > > if (errno == 0)
> > > ***************
> > > *** 5837,5839 ****
> > > --- 5870,5940 ----
> > > errmsg("could not remove file \"%s\": %m",
> > > BACKUP_LABEL_FILE)));
> > > }
> > > +
> > > +
> > > + /* XLog gather-write staffs */
> > > +
> > > + static void
> > > + XLogPageReset(XLogPages *pages)
> > > + {
> > > + memset(pages, 0, sizeof(*pages));
> > > + }
> > > +
> > > + static void
> > > + XLogPageWrite(XLogPages *pages, int index)
> > > + {
> > > + char *page = XLogCtl->pages + index * BLCKSZ;
> > > + int size = BLCKSZ;
> > > + int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> > > +
> > > + if (pages->head + pages->size == page
> > > + && pages->offset + pages->size == offset)
> > > + { /* Pages are continuous. Append new page. */
> > > + pages->size += size;
> > > + }
> > > + else
> > > + { /* Pages are not continuous. Flush and clear. */
> > > + XLogPageFlush(pages, PrevBufIdx(index));
> > > + pages->head = page;
> > > + pages->size = size;
> > > + pages->offset = offset;
> > > + }
> > > + }
> > > +
> > > + static void
> > > + XLogPageFlush(XLogPages *pages, int index)
> > > + {
> > > + if (!pages->head)
> > > + { /* No needs to write pages. */
> > > + XLogCtl->Write.curridx = index;
> > > + return;
> > > + }
> > > +
> > > + /* Need to seek in the file? */
> > > + if (openLogOff != pages->offset)
> > > + {
> > > + openLogOff = pages->offset;
> > > + if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> > > + ereport(PANIC,
> > > + (errcode_for_file_access(),
> > > + errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> > > + openLogId, openLogSeg, openLogOff)));
> > > + }
> > > +
> > > + /* OK to write the page */
> > > + errno = 0;
> > > + if (write(openLogFile, pages->head, pages->size) != pages->size)
> > > + {
> > > + /* if write didn't set errno, assume problem is no disk space */
> > > + if (errno == 0)
> > > + errno = ENOSPC;
> > > + ereport(PANIC,
> > > + (errcode_for_file_access(),
> > > + errmsg("could not write to log file %u, segment %u at offset %u: %m",
> > > + openLogId, openLogSeg, openLogOff)));
> > > + }
> > > +
> > > + openLogOff += pages->size;
> > > + XLogCtl->Write.curridx = index;
> > > + XLogPageReset(pages);
> > > + }
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 9: In versions below 8.0, the planner will ignore your desire to
> > choose an index scan if your joining column's datatypes do not
> > match
> >
>


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: maryedie(at)osdl(dot)org
Cc: Mark Wong <markw(at)osdl(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: [HACKERS] data on devel code perf dip
Date: 2005-08-12 20:41:45
Message-ID: 200508122041.j7CKfjc25914@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Mary Edie Meredith wrote:
> On Fri, 2005-08-12 at 14:53 -0400, Bruce Momjian wrote:
> > Mary Edie Meredith wrote:
> > > Just to be certain I know what I have and how to use it,
> > > please confirm the following is correct.
> > >
> > > According to Markw, the tarball we have at OSDL that is
> > > dated 7/29 already has the O_DIRECT patch + wal grouping
> > > applied. Hence, I will use the one from the day before:
> > > postgresql-20050728.tar.bz2 and apply your patch.
> >
> > Right, or use patch -R to reverse out the changes and revert to a
> > version with O_DIRECT.
> > >
> > > I expect to have, after applying the patch, the O_DIRECT patch
> > > for the log (which I should not be using given the config
> > > parameters I have), but I will _not have the wal grouping?
> >
> > The patch adds O_DIRECT (which is not being used given your configure
> > paramters), and grouped WAL writes.
> So I'll use postgresql-20050728, I'll run with it to confirm it is
> running similarly to my good run (42). If so, then I'll apply the patch
> and see what happens....
>
> This may take me a while just because I'm backed up with other
> things ....

Right.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-16 22:45:30
Message-ID: 1124232331.12159.211.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, 2005-08-11 at 22:11 -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > >> anyway), so shared_buffers should be irrelevant.
> >
> > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > that is not used for writing dirty buffers during a checkpoint.
> >
> > I double-checked that O_DIRECT is really just used for WAL, and only
> > when the sync mode is open_sync or open_datasync. So it seems
> > impossible that it affected a run with mode fdatasync. What seems the
> > best theory at the moment is that the grouped-WAL-write part of the
> > patch doesn't work so well as we thought.
>
> Yes, that's my only guess. Let us know if you want the patch to test,
> rather than pulling CVS before and after the patch was applied.

I tried the pull we have from 8/28/05, which should not have the Wal
O-DIRECT patch or the WAL grouping patch (I believe that markw check
this for me).

Unfortunately the performance is just as bad as run 59. So it looks
like those patches are not my immediate problem.

While you think about that, I'm going to pull the development code
from July 18, and run it again with oprofile.

I'm still very concerned about what I'm seeing in the oprofile:
namely: .CreateLWLocks is the second highest entry for postgres.
http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt

It is still the second highest entry in the run I just did.
Why are there so many "Creates"? Shouldn't there just be mostly
calls that gab a LW lock and release it? Mark and I looked
through the code at what the CreateLWLocks subroutine does
and found the note inserted below. Since the connections are
established at the very beginning, and this run went on for
30 more minutes, I dont understand why so many creates happened.
Can you explain this?

* This is called by the postmaster or by a standalone backend.
* It is also called by a backend forked from the postmaster in the
* EXEC_BACKEND case. In the latter case, the shared memory segment
* already exists and has been physically attached to, but we have to
* initialize pointers in local memory that reference the shared
structures,
* because we didn't inherit the correct pointer values from the
postmaster
* as we do in the fork() scenario. The easiest way to do that is to
run
* through the same code as before. (Note that the called routines
mostly
* check IsUnderPostmaster, rather than EXEC_BACKEND, to detect this
case.
* This is a bit code-wasteful and could be cleaned up.)
*
* If "makePrivate" is true then we only need private memory, not shared
* memory. This is true for a standalone backend, false for a
postmaster.

TIA.

>
--
Mary Edie Meredith
Initiative Manager
Open Source Development Labs
maryedie(at)hotmail(dot)com
503-906-1942


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: maryedie(at)osdl(dot)org
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-16 22:53:55
Message-ID: 7592.1124232835@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
> I'm still very concerned about what I'm seeing in the oprofile:
> namely: .CreateLWLocks is the second highest entry for postgres.
> http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt

This says there's something wrong with your oprofile setup.
CreateLWLocks is run *once* at postmaster startup, and it doesn't
take all that long. Check to see if maybe your executables are
out of sync with what oprofile is looking at?

regards, tom lane


From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-16 23:02:19
Message-ID: 1124233339.12159.234.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 2005-08-16 at 18:53 -0400, Tom Lane wrote:
> Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
> > I'm still very concerned about what I'm seeing in the oprofile:
> > namely: .CreateLWLocks is the second highest entry for postgres.
> > http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt
>
> This says there's something wrong with your oprofile setup.
> CreateLWLocks is run *once* at postmaster startup, and it doesn't
> take all that long. Check to see if maybe your executables are
> out of sync with what oprofile is looking at?

Ok, Tom, sounds correct. thanks.
>
> regards, tom lane


From: Mark Wong <markw(at)osdl(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: maryedie(at)osdl(dot)org, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-16 23:07:30
Message-ID: 200508162307.j7GN71jA007197@smtp.osdl.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 16 Aug 2005 18:53:55 -0400
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
> > I'm still very concerned about what I'm seeing in the oprofile:
> > namely: .CreateLWLocks is the second highest entry for postgres.
> > http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt
>
> This says there's something wrong with your oprofile setup.
> CreateLWLocks is run *once* at postmaster startup, and it doesn't
> take all that long. Check to see if maybe your executables are
> out of sync with what oprofile is looking at?

It is a POWER5 platform, if that has anything to do with it. It
certainly doesn't look sane, but for user apps oprofile is supposed to
be getting the symbols directly from the binaries its using. I'm
inclined to blame the platform support. ;)

Mark


From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Mark Wong <markw(at)osdl(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-16 23:41:11
Message-ID: 1124235671.12159.251.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 2005-08-16 at 16:07 -0700, Mark Wong wrote:
> On Tue, 16 Aug 2005 18:53:55 -0400
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
> > > I'm still very concerned about what I'm seeing in the oprofile:
> > > namely: .CreateLWLocks is the second highest entry for postgres.
> > > http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt
> >
> > This says there's something wrong with your oprofile setup.
> > CreateLWLocks is run *once* at postmaster startup, and it doesn't
> > take all that long. Check to see if maybe your executables are
> > out of sync with what oprofile is looking at?
>
> It is a POWER5 platform, if that has anything to do with it. It
> certainly doesn't look sane, but for user apps oprofile is supposed to
> be getting the symbols directly from the binaries its using. I'm
> inclined to blame the platform support. ;)

Could it have to do with the fact that I'm running in a Virtual Machine?
Maybe they forgot to change the symbols to reflect the VM's "virtual
physical address" if you get my drift. Who do you know on the IBM side,
Mark, who could help us sort this out?
>
> Mark


From: markw(at)osdl(dot)org
To: maryedie(at)osdl(dot)org
Cc: "Mark Wong" <markw(at)osdl(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>, "Josh Berkus" <josh(at)agliodbs(dot)com>
Subject: Re: data on devel code perf dip
Date: 2005-08-16 23:57:58
Message-ID: 51627.63.224.228.211.1124236678.squirrel@www.osdl.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

> On Tue, 2005-08-16 at 16:07 -0700, Mark Wong wrote:
>> On Tue, 16 Aug 2005 18:53:55 -0400
>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> > Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
>> > > I'm still very concerned about what I'm seeing in the oprofile:
>> > > namely: .CreateLWLocks is the second highest entry for postgres.
>> > > http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt
>> >
>> > This says there's something wrong with your oprofile setup.
>> > CreateLWLocks is run *once* at postmaster startup, and it doesn't
>> > take all that long. Check to see if maybe your executables are
>> > out of sync with what oprofile is looking at?
>>
>> It is a POWER5 platform, if that has anything to do with it. It
>> certainly doesn't look sane, but for user apps oprofile is supposed to
>> be getting the symbols directly from the binaries its using. I'm
>> inclined to blame the platform support. ;)
>
> Could it have to do with the fact that I'm running in a Virtual Machine?
> Maybe they forgot to change the symbols to reflect the VM's "virtual
> physical address" if you get my drift. Who do you know on the IBM side,
> Mark, who could help us sort this out?

Not likely, that would imply the system knew it was running in a virtual
machine. You're not supposed to know you're in a virtual machine. I
don't know who at IBM to ask about the hypervisor. As for oprofile, the
oprofile list would be a start.

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: maryedie(at)osdl(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: data on devel code perf dip
Date: 2005-08-22 00:37:58
Message-ID: 5100.1124671078@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I wrote:
> Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
>> I have an example of runs that illustrate a performance
>> problem that occurred between installing the 7/18 and 8/1
>> development release codes.

> I dug through the CVS logs to see what had changed, and I'm afraid there
> is just one plausible-looking candidate:

> * src/backend/access/transam/xlog.c:
> Use O_DIRECT if available when using O_SYNC for wal_sync_method.
> Also, write multiple WAL buffers out in one write() operation.

I've been sniffing around that patch and not really finding any smoking
gun about why it would make things slower when you're not using O_DIRECT.
I noticed that it forces 8K alignment of the WAL buffers on any machine
that has O_DIRECT defined, whether you use O_DIRECT or not --- but it's
pretty hard to see why that would make things slower. (Indeed, the
older code only guaranteed MAXALIGN alignment of the WAL buffers, and
we *know* that's wrong --- on Intel machines you want cache-line
alignment to make kernel-to-userspace transfers fast. BTW, does anyone
know if the current definition of BUFFERALIGN == 32 bytes needs to be
increased for newer Intel machines?)

I've got a bunch of minor gripes about the coding style of the
gather-write part of the patch, but I really can't see that it's
causing any performance issue. AFAICS it just replaces several
closely spaced write() syscalls with one larger one. It's not
possible that the kernel you're using is less efficient for larger
transfers than smaller ones, is it?

The whole thing's pretty bizarre.

regards, tom lane


From: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: data on devel code perf dip
Date: 2005-08-22 02:14:30
Message-ID: 1124676870.6263.6.camel@noodles
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 2005-08-21 at 20:37 -0400, Tom Lane wrote:
> The whole thing's pretty bizarre.

I hate to sound obvious, but does the missing performance return if you
back the patch out? It seemed to have been decided on Tue, 16 Aug 2005
15:45:30 -0700 that the performance was the same before and after.
However, there also seems to be version confusion, as maryedie(at)osdl(dot)org
also claimed to be testing a tree from the future.

Basically I think there's more confusion here than evidence. On an
admittedly smaller x86 configuration, source trees with this patch are
faster than without, not slower.

And finally, not to start a flamewar, but the initial report was on a
machine running gentoo, the distribution that should be renamed
Irreproduceable Benchmarks Linux.

-jwb


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: maryedie(at)osdl(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: data on devel code perf dip
Date: 2005-08-23 00:20:05
Message-ID: 26571.1124756405@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I wrote:
> I've been sniffing around that patch and not really finding any smoking
> gun about why it would make things slower when you're not using O_DIRECT.

While rewriting the patch to fit more naturally into xlog.c, I found a
bug that might possibly be related to your performance issue. The
if-test in this fragment is wrong:

/*
* If we just wrote the whole last page of a logfile segment,
* fsync the segment immediately. This avoids having to go back
* and re-open prior segments when an fsync request comes along
* later. Doing it here ensures that one and only one backend will
* perform this fsync.
*
* This is also the right place to notify the Archiver that the
* segment is ready to copy to archival storage.
*/
if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
{
XLogPageFlush(&pages, currentIndex);
issue_xlog_fsync();
...

Because of the patch's logic changes, openLogOff is not the right thing
to be examining here --- we have not yet done a seek to ensure that it
matches the write start point. (pages.offset + pages.size would have
worked instead.)

The implication of this is that the code might either fail to do a write
and fsync when it needs to, or do extra fsyncs that it doesn't need to
do. I am wondering if the latter could explain your issue. Right
offhand it doesn't seem like this would lead to enough extra fsyncs
to account for the amount of slowdown you report --- but there just
doesn't seem to be anything else in there that could account for it.

I've committed a patch that fixes this and makes some other minor
improvements (which you probably won't notice given that you're using
such a large wal_buffers setting). You might like to update to CVS
tip (make sure you get xlog.c 1.218 or later) and see if things are
any better or not.

regards, tom lane


From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mark Wong <markw(at)osdl(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: data on devel code perf dip
Date: 2005-08-29 21:36:37
Message-ID: 1125351397.30683.106.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Thanks, Tom for keeping on this. I was on vacation last week. Will try
the latest ASAP.

On Mon, 2005-08-22 at 20:20 -0400, Tom Lane wrote:
> I wrote:
> > I've been sniffing around that patch and not really finding any smoking
> > gun about why it would make things slower when you're not using O_DIRECT.
>
> While rewriting the patch to fit more naturally into xlog.c, I found a
> bug that might possibly be related to your performance issue. The
> if-test in this fragment is wrong:
>
> /*
> * If we just wrote the whole last page of a logfile segment,
> * fsync the segment immediately. This avoids having to go back
> * and re-open prior segments when an fsync request comes along
> * later. Doing it here ensures that one and only one backend will
> * perform this fsync.
> *
> * This is also the right place to notify the Archiver that the
> * segment is ready to copy to archival storage.
> */
> if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
> {
> XLogPageFlush(&pages, currentIndex);
> issue_xlog_fsync();
> ...
>
> Because of the patch's logic changes, openLogOff is not the right thing
> to be examining here --- we have not yet done a seek to ensure that it
> matches the write start point. (pages.offset + pages.size would have
> worked instead.)
>
> The implication of this is that the code might either fail to do a write
> and fsync when it needs to, or do extra fsyncs that it doesn't need to
> do. I am wondering if the latter could explain your issue. Right
> offhand it doesn't seem like this would lead to enough extra fsyncs
> to account for the amount of slowdown you report --- but there just
> doesn't seem to be anything else in there that could account for it.
>
> I've committed a patch that fixes this and makes some other minor
> improvements (which you probably won't notice given that you're using
> such a large wal_buffers setting). You might like to update to CVS
> tip (make sure you get xlog.c 1.218 or later) and see if things are
> any better or not.
>
> regards, tom lane