Quick Links

Re: COMMIT NOWAIT Performance Option

Lists:	pgsql-hackers

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	COMMIT NOWAIT Performance Option
Date:	2007-02-26 22:56:58
Message-ID:	1172530618.3760.441.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Proposal: Implement a new option for COMMIT, for enhancing performance,
providing a MySQL-like trade-off between performance and robustness for
*only* those that want it.

COMMIT NOWAIT

This form of COMMIT will *not* perform XLogFlush(), but will rely on a
special background process to perform regular WAL fsyncs (see later).

COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
threaten the consistency or robustness of other COMMIT modes. Read that
again and think about it, before we go further, please. Normal COMMIT
still guarantees to flush all of WAL up to the point of the commit,
whether or not the previous commits have requested that.

Mixing COMMIT NOWAIT with other modes does not effect the performance of
other backends - those that specify that mode are faster, those that do
not simply go at the same speed they did before. This is important,
because it allows us to have a fully robust server, yet with certain
critical applications going along much faster. No need for an
all-or-nothing approach at db cluster level.

Unlike fsync = off, WAL is always consistent and the server can be
recovered easily, though with some potential for data loss for
transactions that chose the COMMIT NOWAIT option. Sounds like a hole
there: normal COMMITs that rely on data written by COMMIT NOWAIT
transactions are still safe, because the normal COMMIT is still bound by
the guarantee to go to disk. The buffer manager/WAL interlock is not
effected by this change and remains in place, as it should.

This implements the TODO item:
--Allow buffered WAL writes and fsync
"Instead of guaranteeing recovery of all committed transactions, this
would provide improved performance by delaying WAL writes and fsync so
an abrupt operating system restart might lose a few seconds of committed
transactions but still be consistent. We could perhaps remove the
'fsync' parameter (which results in an an inconsistent database) in
favor of this capability."

Why do we want this?? Because some apps have *lots* of data and many
really don't care whether they lose a few records. Honestly, I've met
people that want this, even after 2 hours of discussion and
understanding. Plus probably lots of MySQLers also.

User Control
------------

New commit mode is available by explicit command, or as a default
setting that will be applied to all COMMITs, or both.

The full syntax would be COMMIT [WRITE] NOWAIT [IMMEDIATE], for Oracle
compatibility (why choose incompatibility?). Note that this is not a
transaction start setting like Isolation Level; this happens at end of
transaction. The syntax for END is unchanged, defaulting to normal
behaviour unless overridden.

New userset GUC, commit_wait_default = on (default) | off

We change the meaning of the commit_delay parameter:

- If commit_delay = 0 then commit_wait_default cannot be set off.

- WAL will be flushed every commit_delay milliseconds; if no flush is
required this will do nothing very quickly, so there is little overhead
of no COMMIT NOWAIT commits have been made.

Implementation
--------------

COMMIT NOWAIT in xact.c simply ignores XLogFlush and returns.

Who does the XLogFlush? Well, my recommendation is a totally new
process, WALWriter. But I can see that many of you will say bgwriter
should be the person to do this work. IMHO doing WAL flushes will take
time and thats time that bgwriter really needs to do other things, plus
it can't really guarantee to do flush regularly when its doing
checkpoints.

When commit_delay > 0 then the WALwriter will startup, or shutdown if
commit_delay = 0.

WALWriter will XLogFlush every commit_delay milliseconds.

A prototype patch is posted to -patches, which is WORK IN PROGRESS.
The following TODO items remain
1. discuss which process will issue regular XLogFlush(). If agreed,
implement WALWriter process to perform this task. (Yes, the patch isn't
fully implemented, yet).
2. remove fsync parameter
3. Prevent COMMIT NOWAIT when commit_delay = 0
4. Discuss whether commit_delay is OK to usurp; twas just an earlier
suggestion from someone else, can go either way.
5. docs

The remaining items can be completed very quickly if this proposal is
acceptable. (I wrote this over Christmas, so it turning up now isn't a
rushed proposal and I'm pretty certain it ain't broke).

Comments?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-26 23:06:19
Message-ID:	45E367EB.1050406@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> Why do we want this?? Because some apps have *lots* of data and many
> really don't care whether they lose a few records. Honestly, I've met
> people that want this, even after 2 hours of discussion and
> understanding. Plus probably lots of MySQLers also.

Most users will take speed over data loss any day. Whether we want to
admit it or not.

I think these feature is a good middle ground.

Sincerely,

Joshua D. Drake

>
> User Control
> ------------
>
> New commit mode is available by explicit command, or as a default
> setting that will be applied to all COMMITs, or both.
>
> The full syntax would be COMMIT [WRITE] NOWAIT [IMMEDIATE], for Oracle
> compatibility (why choose incompatibility?). Note that this is not a
> transaction start setting like Isolation Level; this happens at end of
> transaction. The syntax for END is unchanged, defaulting to normal
> behaviour unless overridden.
>
> New userset GUC, commit_wait_default = on (default) | off
>
> We change the meaning of the commit_delay parameter:
>
> - If commit_delay = 0 then commit_wait_default cannot be set off.
>
> - WAL will be flushed every commit_delay milliseconds; if no flush is
> required this will do nothing very quickly, so there is little overhead
> of no COMMIT NOWAIT commits have been made.
>
> Implementation
> --------------
>
> COMMIT NOWAIT in xact.c simply ignores XLogFlush and returns.
>
> Who does the XLogFlush? Well, my recommendation is a totally new
> process, WALWriter. But I can see that many of you will say bgwriter
> should be the person to do this work. IMHO doing WAL flushes will take
> time and thats time that bgwriter really needs to do other things, plus
> it can't really guarantee to do flush regularly when its doing
> checkpoints.
>
> When commit_delay > 0 then the WALwriter will startup, or shutdown if
> commit_delay = 0.
>
> WALWriter will XLogFlush every commit_delay milliseconds.
>
> A prototype patch is posted to -patches, which is WORK IN PROGRESS.
> The following TODO items remain
> 1. discuss which process will issue regular XLogFlush(). If agreed,
> implement WALWriter process to perform this task. (Yes, the patch isn't
> fully implemented, yet).
> 2. remove fsync parameter
> 3. Prevent COMMIT NOWAIT when commit_delay = 0
> 4. Discuss whether commit_delay is OK to usurp; twas just an earlier
> suggestion from someone else, can go either way.
> 5. docs
>
> The remaining items can be completed very quickly if this proposal is
> acceptable. (I wrote this over Christmas, so it turning up now isn't a
> rushed proposal and I'm pretty certain it ain't broke).
>
> Comments?
>

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-26 23:25:31
Message-ID:	45E36C6B.8080605@archonet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> Proposal: Implement a new option for COMMIT, for enhancing performance,
> providing a MySQL-like trade-off between performance and robustness for
> *only* those that want it.
>
> COMMIT NOWAIT
>
> This form of COMMIT will *not* perform XLogFlush(), but will rely on a
> special background process to perform regular WAL fsyncs (see later).
>
> COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> threaten the consistency or robustness of other COMMIT modes. Read that
> again and think about it, before we go further, please. Normal COMMIT
> still guarantees to flush all of WAL up to the point of the commit,
> whether or not the previous commits have requested that.
>
> Mixing COMMIT NOWAIT with other modes does not effect the performance of
> other backends - those that specify that mode are faster, those that do
> not simply go at the same speed they did before. This is important,
> because it allows us to have a fully robust server, yet with certain
> critical applications going along much faster. No need for an
> all-or-nothing approach at db cluster level.
>
> Unlike fsync = off, WAL is always consistent and the server can be
> recovered easily, though with some potential for data loss for
> transactions that chose the COMMIT NOWAIT option. Sounds like a hole
> there: normal COMMITs that rely on data written by COMMIT NOWAIT
> transactions are still safe, because the normal COMMIT is still bound by
> the guarantee to go to disk. The buffer manager/WAL interlock is not
> effected by this change and remains in place, as it should.

OK, so do I have this right?

A is issuing COMMIT NOWAIT once a second
B is issuing COMMIT every other second
Checkpoints happen every 10 seconds

Every 10 seconds we have a checkpoint and all WAL+data are on-disk.
Every 2 seconds a standard COMMIT occurs and A+B are synced to WAL
In-between COMMIT NOWAIT occurs and no data is written to disk - neither
WAL nor data-blocks.

So, if I have a load of connections issuing standard COMMITs regularly
then I'll not see much performance-gain. However, I can guarantee any
data visible to those transactions is committed to disk.

If I have only one connection to the database and that uses only COMMIT
NOWAIT, it should approach the speed of fsync=off.

So this is for:
1. data-loading where I don't want to use fsync=off (for whatever reason)
2. Data-logging type apps

Have I got the gist of that?

--
Richard Huxton
Archonet Ltd

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Richard Huxton" <dev(at)archonet(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-26 23:58:11
Message-ID:	1172534292.3760.459.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2007-02-26 at 23:25 +0000, Richard Huxton wrote:
> Simon Riggs wrote:
> > Proposal: Implement a new option for COMMIT, for enhancing performance,
> > providing a MySQL-like trade-off between performance and robustness for
> > *only* those that want it.
> >
> > COMMIT NOWAIT
> >
> > This form of COMMIT will *not* perform XLogFlush(), but will rely on a
> > special background process to perform regular WAL fsyncs (see later).
> >
> > COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> > threaten the consistency or robustness of other COMMIT modes. Read that
> > again and think about it, before we go further, please. Normal COMMIT
> > still guarantees to flush all of WAL up to the point of the commit,
> > whether or not the previous commits have requested that.
> >
> > Mixing COMMIT NOWAIT with other modes does not effect the performance of
> > other backends - those that specify that mode are faster, those that do
> > not simply go at the same speed they did before. This is important,
> > because it allows us to have a fully robust server, yet with certain
> > critical applications going along much faster. No need for an
> > all-or-nothing approach at db cluster level.
> >
> > Unlike fsync = off, WAL is always consistent and the server can be
> > recovered easily, though with some potential for data loss for
> > transactions that chose the COMMIT NOWAIT option. Sounds like a hole
> > there: normal COMMITs that rely on data written by COMMIT NOWAIT
> > transactions are still safe, because the normal COMMIT is still bound by
> > the guarantee to go to disk. The buffer manager/WAL interlock is not
> > effected by this change and remains in place, as it should.
>
> OK, so do I have this right?
>
> A is issuing COMMIT NOWAIT once a second
> B is issuing COMMIT every other second
> Checkpoints happen every 10 seconds
>
> Every 10 seconds we have a checkpoint and all WAL+data are on-disk.
> Every 2 seconds a standard COMMIT occurs and A+B are synced to WAL
> In-between COMMIT NOWAIT occurs and no data is written to disk - neither
> WAL nor data-blocks.

Right, but there'd also be a WAL flush every 100ms or so, according to
how you set the parameter. So the window of data loss is controllable
for both efficiency and user selected robustness for those transactions
that want it.

(Plus the point that a COMMIT doesn't send data to disk anyway, only
WAL. This proposal does nothing to change that, I should add).

> So, if I have a load of connections issuing standard COMMITs regularly
> then I'll not see much performance-gain. However, I can guarantee any
> data visible to those transactions is committed to disk.
>
> If I have only one connection to the database and that uses only COMMIT
> NOWAIT, it should approach the speed of fsync=off.
>
> So this is for:
> 1. data-loading where I don't want to use fsync=off (for whatever reason)
> 2. Data-logging type apps
>
> Have I got the gist of that?

Pretty much.

Only the people using COMMIT NOWAIT see a performance gain, and of
course only if they are doing write transactions of relatively short
duration.

The interesting point is you can have a huge data grinding app, yet with
other tables alongside that hold more important data. In that scenario,
90% of the data would be COMMIT NOWAIT, whilst the small important data
is safe.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Richard Huxton <dev(at)archonet(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 00:20:53
Message-ID:	20070227002053.GT19104@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:

> The interesting point is you can have a huge data grinding app, yet with
> other tables alongside that hold more important data. In that scenario,
> 90% of the data would be COMMIT NOWAIT, whilst the small important data
> is safe.

Does this means that the regular COMMIT is slower because it has to
force more data to disk? I imagine that this isn't the case, because
it's not the write itself that's slow; rather, it's the wait until the
fsync on WAL is reported complete. However, did you measure this?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	"A(dot)M(dot)" <agentm(at)themactionfaction(dot)com>
To:	PostgreSQL-development hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 00:27:09
Message-ID:	6155E1B9-FBA9-4B47-90B6-D37BADD08947@themactionfaction.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 26, 2007, at 18:58 , Simon Riggs wrote:

> On Mon, 2007-02-26 at 23:25 +0000, Richard Huxton wrote:
>> Simon Riggs wrote:
>>> Proposal: Implement a new option for COMMIT, for enhancing
>>> performance,
>>> providing a MySQL-like trade-off between performance and
>>> robustness for
>>> *only* those that want it.
>>>
>>> COMMIT NOWAIT

Isn't the domain of transient data relegated to certain tables
instead of specific transactions? Wouldn't it be easier to create un-
wal-logged global temp tables?

-M

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 02:44:34
Message-ID:	1172544275.10824.334.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2007-02-26 at 22:56 +0000, Simon Riggs wrote:
> Proposal: Implement a new option for COMMIT, for enhancing performance,
> providing a MySQL-like trade-off between performance and robustness for
> *only* those that want it.
>
> COMMIT NOWAIT
>
> This form of COMMIT will *not* perform XLogFlush(), but will rely on a
> special background process to perform regular WAL fsyncs (see later).
>

I would think that the type of people who would want this feature would
want it to be a property of the connection, not per-statement. For one
thing, many applications use solitary SQL statements without surrounding
them in BEGIN/COMMIT.

Regards,
Jeff Davis

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 03:06:47
Message-ID:	20070227030647.GB19104@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis wrote:
> On Mon, 2007-02-26 at 22:56 +0000, Simon Riggs wrote:
> > Proposal: Implement a new option for COMMIT, for enhancing performance,
> > providing a MySQL-like trade-off between performance and robustness for
> > *only* those that want it.
> >
> > COMMIT NOWAIT
> >
> > This form of COMMIT will *not* perform XLogFlush(), but will rely on a
> > special background process to perform regular WAL fsyncs (see later).
> >
>
> I would think that the type of people who would want this feature would
> want it to be a property of the connection, not per-statement. For one
> thing, many applications use solitary SQL statements without surrounding
> them in BEGIN/COMMIT.

You can set the GUC locally in your connection, of course. (You can
even do it per role or per database by using ALTER/SET commands)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 04:04:07
Message-ID:	10498.1172549047@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> threaten the consistency or robustness of other COMMIT modes. Read that
> again and think about it, before we go further, please.

I read that, and thought about it, and don't think I believe it. The
problem is that there are more interconnections between different
transactions than you're allowing for. In particular you need to
justify that the behavior is safe for non-transactional operations like
btree page splits and pg_clog buffer page writes. The idea that's
particularly bothering me at the moment is that after a system crash,
we might come back up in a state where a NOWAIT transaction appears
committed when its updates didn't all get to disk. "Database corrupt"
is a situation that threatens all your transactions...

> New commit mode is available by explicit command, or as a default
> setting that will be applied to all COMMITs, or both.

I dislike introducing new nonstandard syntax ("Oracle compatible" is not
standard). If we did this I'd vote for control via a GUC setting only;
I think that is more useful anyway, as an application can be made to run
with such a setting without invasive source code changes.

regards, tom lane

From:	"Jeroen T(dot) Vermeulen" <jtv(at)xs4all(dot)nl>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 04:05:45
Message-ID:	20053.125.24.237.231.1172549145.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, February 27, 2007 06:06, Joshua D. Drake wrote:
>
>> Why do we want this?? Because some apps have *lots* of data and many
>> really don't care whether they lose a few records. Honestly, I've met
>> people that want this, even after 2 hours of discussion and
>> understanding. Plus probably lots of MySQLers also.
>
> Most users will take speed over data loss any day. Whether we want to
> admit it or not.

In that case, wouldn't it make just as much sense to have an equivalent
for this special transaction mode on individual statements, without
transaction context? I'm guessing that who don't really know or want
transactions would never start one, running lots of loose statements
instead that otherwise get committed individually.

Jeroen

From:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 04:50:56
Message-ID:	20070227045056.GJ29041@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote:
> 2. remove fsync parameter

Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
want this for things like database restores.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To:	"Jeroen T(dot) Vermeulen" <jtv(at)xs4all(dot)nl>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 05:45:19
Message-ID:	20070227054519.GQ29041@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Feb 27, 2007 at 11:05:45AM +0700, Jeroen T. Vermeulen wrote:
> On Tue, February 27, 2007 06:06, Joshua D. Drake wrote:
> >
> >> Why do we want this?? Because some apps have *lots* of data and many
> >> really don't care whether they lose a few records. Honestly, I've met
> >> people that want this, even after 2 hours of discussion and
> >> understanding. Plus probably lots of MySQLers also.
> >
> > Most users will take speed over data loss any day. Whether we want to
> > admit it or not.
>
> In that case, wouldn't it make just as much sense to have an equivalent
> for this special transaction mode on individual statements, without
> transaction context? I'm guessing that who don't really know or want
> transactions would never start one, running lots of loose statements
> instead that otherwise get committed individually.

I don't think it makes sense to optimize for people who can't be
bothered to learn about a transaction. In any case, that option is
there; you just set the GUC in the session.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 06:29:33
Message-ID:	1172557774.3760.462.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2007-02-26 at 22:50 -0600, Jim C. Nasby wrote:
> On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote:
> > 2. remove fsync parameter
>
> Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
> want this for things like database restores.

Well, it seemed to be part of the documented TODO, so I included it as a
discussion point.

I'm happy to leave it, personally, FWIW.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc:	"Richard Huxton" <dev(at)archonet(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 07:07:22
Message-ID:	1172560043.3760.501.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2007-02-26 at 21:20 -0300, Alvaro Herrera wrote:
> Simon Riggs wrote:
>
> > The interesting point is you can have a huge data grinding app, yet with
> > other tables alongside that hold more important data. In that scenario,
> > 90% of the data would be COMMIT NOWAIT, whilst the small important data
> > is safe.
>
> Does this means that the regular COMMIT is slower because it has to
> force more data to disk? I imagine that this isn't the case, because
> it's not the write itself that's slow; rather, it's the wait until the
> fsync on WAL is reported complete. However, did you measure this?

No, I've not measured that aspect specifically. But the overall effect
depends upon your hardware. (Laptops work great :-)

A COMMIT will write all of WAL, no matter how much that is. This is no
different from now, where a normal COMMIT executing while another
backend is doing a work may have to write more than just its own WAL.

We actually consider this a good thing: piggyback writes go out of their
way to ensure we write as much WAL as possible in one go.

There is no doubt a point where the fsync delay has been set wrong *and*
there are so few normal COMMITs that they do become slower. But that
slower isn't the same as "starved", just the result of having to do the
work of others. The WALwriter will be doing the heavy lifting, if set
correctly.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 10:49:32
Message-ID:	1172573373.3760.645.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2007-02-26 at 23:04 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> > threaten the consistency or robustness of other COMMIT modes. Read that
> > again and think about it, before we go further, please.
>
> I read that, and thought about it, and don't think I believe it.

I don't mind saying its taken a while to believe it myself. And I very
much want to have the concept tested so we all believe it. Doubt=>risk.

> The
> problem is that there are more interconnections between different
> transactions than you're allowing for. In particular you need to
> justify that the behavior is safe for non-transactional operations like
> btree page splits and pg_clog buffer page writes. The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk. "Database corrupt"
> is a situation that threatens all your transactions...

OK, I can see its possible for the following WAL history to occur:

t1 btree split part1
t2 COMMIT;
<---------- risk of inconsistent WAL
t1 btree split part2

The COMMIT by t2 flushes WAL, so if a crash happens at the point shown
we have an incomplete multi-part WAL operation. At this stage, lets
assume that no data writes by either t1 or t2 have made it to disk,
which is normal and likely.

On recovery, we apply all the changes in WAL up to the end of WAL. At
the last minute t2 sneaks in and gets committed. t1 was never committed,
so we never found out whether it would be a COMMIT NOWAIT or a COMMIT.
*Whichever* it will be the btree split is still incomplete and recovery
understands this and acts accordingly.

So the situation that sounds nasty is actually a normal situation
currently, so presents no threat because we already handle this
correctly. Multi-part operations seem good to me.

> The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk. "Database corrupt"
> is a situation that threatens all your transactions...

Well, I've spent 2 hours thinking this through, gone pale, thought hard
and gone down a few blind alleys along the way.

The bottom line is that COMMIT NOWAIT isn't fully safe, and thats part
of the explicit non-full guarantee, written in big letters on the tin.
If a transaction commits and then we crash before we flush WAL, then the
transaction will be incomplete. As we define COMMIT now, this is broken
and I use that word accurately: If you use COMMIT NOWAIT, you risk data
loss *but* you have the choice of which transactions this applies to.
However, WAL *is* consistent and there is no danger of database
corruption. Simply put, this is not group commit.

So if you issue COMMIT NOWAIT and then crash, the transactions that were
marked committed need to be marked aborted in clog. So during recovery
we will need to keep track of which transactions are in progress, so we
can mark them explicitly aborted, rather than the current implicit
mechanism. This would need to be done carefully, since a clog page that
was created at transaction start may never have made it to disk by the
time of the crash. We must extend clog in that case, even if the
extension WAL never made it to WAL, either, so that we can record the
aborted state of the Xids that exist somewhere on disk.

Keeping track of transactions in progress won't take long. It isn't
required at all during archive recovery, but that might form the basis
for later use as a Snapshot creation mechanism for read-only access
during startup.

It isn't possible for a normal non-readonly transaction to change data
that has been changed by a commit-nowait transaction, and then have the
normal transaction commit, yet without also implicitly committing the
commit-nowait transaction.

Hopefully, I've got that right?

> > New commit mode is available by explicit command, or as a default
> > setting that will be applied to all COMMITs, or both.
>
> I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> standard). If we did this I'd vote for control via a GUC setting only;
> I think that is more useful anyway, as an application can be made to run
> with such a setting without invasive source code changes.

OK.

Having read through all of the above things again, ISTM that we should
make this functionality available by a new GUC commit_fsync_delay, which
must be set explicitly > 0 before this feature can be used at all. If I
confused Tom by using commit_delay, then I'll confuse others also and
group commit and deferred fsync are different techniques with different
robustness guarantees. When enabled it should have a clear message in
the log to show that some commits might be using commit_nowait.

I'd even welcome a more descriptive term that summed up the relaxed
transaction guarantee implied by the use of the deferred fsync
technique. Perhaps even a very explicit USERSET GUC:

transaction_guarantee = on (default) | off

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 17:32:09
Message-ID:	20070227173209.GY29041@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Feb 27, 2007 at 10:49:32AM +0000, Simon Riggs wrote:
> > I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> > standard). If we did this I'd vote for control via a GUC setting only;
> > I think that is more useful anyway, as an application can be made to run
> > with such a setting without invasive source code changes.
>
> OK.
>
> Having read through all of the above things again, ISTM that we should
> make this functionality available by a new GUC commit_fsync_delay, which
> must be set explicitly > 0 before this feature can be used at all. If I
> confused Tom by using commit_delay, then I'll confuse others also and
> group commit and deferred fsync are different techniques with different
> robustness guarantees. When enabled it should have a clear message in
> the log to show that some commits might be using commit_nowait.
>
> I'd even welcome a more descriptive term that summed up the relaxed
> transaction guarantee implied by the use of the deferred fsync
> technique. Perhaps even a very explicit USERSET GUC:
>
> transaction_guarantee = on (default) | off

So would you set commit_fsync_delay on a per-transaction basis? That
doesn't make much sense to me... I guess I'm not seeing how you would
explicitly mark transactions that you didn't want to fsync immediately.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 17:32:38
Message-ID:	200702270932.38610.josh@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon,

One of the things I love about doing informal online user support in the
PostgreSQL community, and formal user support for Sun's customers, is the
almost-ironclad guarentee that if a user has a corrupt database or data loss,
one of three things is true:
a) they didn't apply some recommended PG update;
b) they have a bad disk controller or disk config;
c) they have bad ram.

It seriously narrows down the problem space to know that PostgreSQL does *not*
allow data loss if it's physically possible to prevent it.

Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for
PostgreSQL, I'd like to see the answers to two questions:

a) Please give some examples of performance gain on applications using COMMIT
NOWAIT. The performance gain needs to be substantial (like, 50% to 100%) to
justify a compromise like this.

b) Why this and not global temporary tables or queuing?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 17:36:08
Message-ID:	21639.1172597768@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> So would you set commit_fsync_delay on a per-transaction basis? That
> doesn't make much sense to me... I guess I'm not seeing how you would
> explicitly mark transactions that you didn't want to fsync immediately.

My assumption was that most of the time you'd want this behavior
per-session and so the existing mechanisms for setting a GUC variable
would work perfectly well. If you really want it per-transaction then
changing the variable on the fly is possible (perhaps using SET LOCAL).
You'll be issuing nonstandard commands either way...

regards, tom lane

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 18:37:30
Message-ID:	1172601450.3760.795.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2007-02-27 at 11:32 -0600, Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 10:49:32AM +0000, Simon Riggs wrote:
> > > I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> > > standard). If we did this I'd vote for control via a GUC setting only;
> > > I think that is more useful anyway, as an application can be made to run
> > > with such a setting without invasive source code changes.
> >
> > OK.
> >
> > Having read through all of the above things again, ISTM that we should
> > make this functionality available by a new GUC commit_fsync_delay, which
> > must be set explicitly > 0 before this feature can be used at all. If I
> > confused Tom by using commit_delay, then I'll confuse others also and
> > group commit and deferred fsync are different techniques with different
> > robustness guarantees. When enabled it should have a clear message in
> > the log to show that some commits might be using commit_nowait.
> >
> > I'd even welcome a more descriptive term that summed up the relaxed
> > transaction guarantee implied by the use of the deferred fsync
> > technique. Perhaps even a very explicit USERSET GUC:
> >
> > transaction_guarantee = on (default) | off
>
> So would you set commit_fsync_delay on a per-transaction basis? That
> doesn't make much sense to me... I guess I'm not seeing how you would
> explicitly mark transactions that you didn't want to fsync immediately.

There are 2 GUCs that would control the behaviour here:

transaction_guarantee = on | off
Specifies whether following transaction commits will guarantee
WAL has been flushed prior to reporting commit. If no guarantee
is requested (=off), then data loss may result even after the
transaction has reported its COMMIT message.
USERSET, but listed in postgresql.conf where default = on
Set this at role, individual session or transaction level to
improve performance of non-critical user data. Use of this
setting does not interfere with the transaction_guarantee
that other transactions may choose. i.e. if somebody else
chooses to take risks with their data it will not affect
the transaction guarantees the server offers to you.
Can only be set off by a transaction if commit_fsync_delay
has been enabled. Use this parameter with care; if you find
yourself wanting to use this parameter all of the time you
should consult a psychiatrist or change open source databases.

commit_fsync_delay = 0...10000 microseconds (0 = off, default)
Controls how often the WALWriter issues an XLogFlush()
SIGHUP, so set once for each server, in postgresql.conf
This provides a maximum time window of potential data loss
in the event of a server crash for transactions that choose
transaction_guarantee = off. This parameter has no effect
on transactions that choose transaction_guarantee = on.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	"Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 19:16:59
Message-ID:	E1539E0ED7043848906A8FF995BDA57901CAFC6B@m0143.s-mxs.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> There are 2 GUCs that would control the behaviour here:
>
> transaction_guarantee = on | off

> has been enabled. Use this parameter with care; if you find
> yourself wanting to use this parameter all of the time you
> should consult a psychiatrist or change open source databases.

If you guarantee your customers, that you wont loose a transaction that
has already been committed you need to at least have synchronous
replication to a remote site. Since not many installations have that, I
do find it funny that people imply so much safety only by syncing the
wal.

Without sync replication a "buffered wal" as proposed only increases the
chances that you loose something. It certainly is no change from safe
heaven to abyssmal hell.

So I think the part after the semicolon can safely be dropped.
Many will be able to use it always, without changing to another db :-)

> commit_fsync_delay = 0...10000 microseconds (0 = off, default)
> Controls how often the WALWriter issues an XLogFlush()
> SIGHUP, so set once for each server, in postgresql.conf
> This provides a maximum time window of potential data loss
> in the event of a server crash for transactions that choose
> transaction_guarantee = off. This parameter has no effect
> on transactions that choose transaction_guarantee = on.

The wal sync method probably needs to be considered ?
If the wal is opened with open_datasync, how does that affect the
performance, or do you ommit the write and leave that to the WALWriter
also ? You probably also want more wal_buffers in such a setup. It may
be better to trigger the WALWriter with wal_buffer fill-level instead of
an extra parameter ?

It is imho great that you are working on this. I always thought it
impossible, because WAL (write ahead) implied to me, that you are not
allowed to do some data/index page changes before wal is on disk.

Andreas

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Josh Berkus" <josh(at)agliodbs(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 19:36:18
Message-ID:	36e682920702271136v72b4d29anbcadcfcef747b8b5@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/27/07, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> It seriously narrows down the problem space to know that PostgreSQL does *not*
> allow data loss if it's physically possible to prevent it.

Seems like we're trying to protect users from themselves again. This
is not a PostgreSQL database issue; it's a feature designed for
application developers and cases where there is the possibility of
acceptable data loss for certain types of operations or transactions.
We already have the capacity for loss when fsync is disabled; this is
just a more controlled feature.

Under Oracle, NOWAIT is an asynchronous commit... anyone that uses it
should understand that it's still not on-disk and that they can lose
it in the event of a failure. That's what Oracle's docs even say.
It's just a risk vs. reward trade off.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 3rd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 19:56:16
Message-ID:	200702271156.17375.josh@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jonah,

> Under Oracle, NOWAIT is an asynchronous commit... anyone that uses it
> should understand that it's still not on-disk and that they can lose
> it in the event of a failure. That's what Oracle's docs even say.
> It's just a risk vs. reward trade off.

You're missing my point, which is that nobody has demonstrated that there
is any real performance gain from this. I see no reason to implement it
if there is no performance gain.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 20:13:26
Message-ID:	45E490E6.1060409@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Josh Berkus wrote:
> Simon,
>
> One of the things I love about doing informal online user support in the
> PostgreSQL community, and formal user support for Sun's customers, is the
> almost-ironclad guarentee that if a user has a corrupt database or data loss,
> one of three things is true:
> a) they didn't apply some recommended PG update;
> b) they have a bad disk controller or disk config;
> c) they have bad ram.

That is pretty spot on.

>
> It seriously narrows down the problem space to know that PostgreSQL does *not*
> allow data loss if it's physically possible to prevent it.

But we do don't we? fsync = off, full_page_writes = off?

>
> Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for
> PostgreSQL, I'd like to see the answers to two questions:

I agree with this.

>
> a) Please give some examples of performance gain on applications using COMMIT
> NOWAIT. The performance gain needs to be substantial (like, 50% to 100%) to
> justify a compromise like this.

WOAH... that seems excessive. There are a couple of things going on here.

1. We have a potential increase in performance for certain workloads.
This is good, but must be proven. IS that proof 50%? Bah.. let's talk
15-25%.

2. We have to accept that not everyone wants IRON clad data integrity.
We have many, many options for dealing with that now, including PITR and
REPLICATION.

>
> b) Why this and not global temporary tables or queuing?

/me would love global temp tables.

Much of the PostgreSQL Users out there today, will happily loose a 15
minutes of data if it means their data is served 25% faster.

Sincerely,

Joshua D. Drake

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	Dave Page <dpage(at)postgresql(dot)org>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 20:32:57
Message-ID:	45E49579.90908@postgresql.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Joshua D. Drake wrote:

> 2. We have to accept that not everyone wants IRON clad data integrity.
> We have many, many options for dealing with that now, including PITR and
> REPLICATION.

100% agreed - our own stats collector is extremely similar (in that it
may drop data under high load) to a system I designed a few years back
which collected stats from snmp devices and windows boxes via WMI on a
regular basis. We didn't care if we lost a few values here and there for
that data and would happily have taken the riskier but higher
performance option had we needed to (and had we ever got round to ever
actually building the system!).

I do agree that some benchmarks are appropriate though.

Regards, Dave

From:	Jordan Henderson <jordan_henders(at)yahoo(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 20:34:21
Message-ID:	200702271534.22111.jordan_henders@yahoo.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I am not sure about some of this. The Oracle option does not change the
engine fsync behavior I believe. All that is changed is whether the client
side waits for the complete of the fsync or not. If this is true, the data
store, logs, etc, are all protected. The user may still experience a data
loss if a network, or system failure occurred just after the client issued
the commit. This would be something like I send the message and exit.
However prior to the engine receiving the message, a network component fails
and the message is never delivered. This will turn into an aborted
transaction as far as the engine is concerned. Of course, the exact details
are in the protocol between the client and the server.

The commit nowait is async with respect to the response to the user, not the
underlying engine I think. Therefore performance gains are purely a user
perspective, not an engine perspective. Perhaps some network traffic could
be pruned, not sending the response.

Jordan Henderson

On Tuesday 27 February 2007, Joshua D. Drake wrote:
> Josh Berkus wrote:
> > Simon,
> >
> > One of the things I love about doing informal online user support in the
> > PostgreSQL community, and formal user support for Sun's customers, is the
> > almost-ironclad guarentee that if a user has a corrupt database or data
> > loss, one of three things is true:
> > a) they didn't apply some recommended PG update;
> > b) they have a bad disk controller or disk config;
> > c) they have bad ram.
>
> That is pretty spot on.
>
> > It seriously narrows down the problem space to know that PostgreSQL does
> > *not* allow data loss if it's physically possible to prevent it.
>
> But we do don't we? fsync = off, full_page_writes = off?
>
> > Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for
> > PostgreSQL, I'd like to see the answers to two questions:
>
> I agree with this.
>
> > a) Please give some examples of performance gain on applications using
> > COMMIT NOWAIT. The performance gain needs to be substantial (like, 50%
> > to 100%) to justify a compromise like this.
>
> WOAH... that seems excessive. There are a couple of things going on here.
>
> 1. We have a potential increase in performance for certain workloads.
> This is good, but must be proven. IS that proof 50%? Bah.. let's talk
> 15-25%.
>
> 2. We have to accept that not everyone wants IRON clad data integrity.
> We have many, many options for dealing with that now, including PITR and
> REPLICATION.
>
> > b) Why this and not global temporary tables or queuing?
>
> /me would love global temp tables.
>
> Much of the PostgreSQL Users out there today, will happily loose a 15
> minutes of data if it means their data is served 25% faster.
>
> Sincerely,
>
> Joshua D. Drake

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 20:56:12
Message-ID:	19370.1172609772@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Josh Berkus wrote:
>> It seriously narrows down the problem space to know that PostgreSQL does *not*
>> allow data loss if it's physically possible to prevent it.

> But we do don't we? fsync = off, full_page_writes = off?

One of the things that's really attractive about the proposed mode is
that it does *not* create a risk of data corruption (assuming that
Simon's analyzed it correctly --- I think the clog code in particular
needs a look). What you risk is that when the database comes back up,
its state may reflect an instant up to X seconds before the time of the
crash, rather than exactly the crash time. It seems to me that that's
way better than fsync = off, which allows unlimited corruption.

I agree that we ought to look at some performance numbers before
accepting the patch, but I think Josh's argument that this opens us
up to major corruption problems is probably wrong. The question is
whether your application can tolerate loss of "very recent" transactions,
and I think there are plenty where it can.

regards, tom lane

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	josh(at)agliodbs(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 21:01:46
Message-ID:	36e682920702271301v54999aaej12346f7a9c4a2660@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/27/07, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> You're missing my point, which is that nobody has demonstrated that there
> is any real performance gain from this. I see no reason to implement it
> if there is no performance gain.

While I'll back your request for results, it seems nearly impossible
to expect no performance gain using this.

Results never hurt though.

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 21:37:46
Message-ID:	200702272137.l1RLbkD08272@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> > threaten the consistency or robustness of other COMMIT modes. Read that
> > again and think about it, before we go further, please.
>
> I read that, and thought about it, and don't think I believe it. The
> problem is that there are more interconnections between different
> transactions than you're allowing for. In particular you need to
> justify that the behavior is safe for non-transactional operations like
> btree page splits and pg_clog buffer page writes. The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk. "Database corrupt"
> is a situation that threatens all your transactions...
>
> > New commit mode is available by explicit command, or as a default
> > setting that will be applied to all COMMITs, or both.
>
> I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> standard). If we did this I'd vote for control via a GUC setting only;
> I think that is more useful anyway, as an application can be made to run
> with such a setting without invasive source code changes.

Agreed on the GUC-only. I don't see many people using the per-COMMIT
setting without wanting it to be for many transactions in the session.

(And, YES, I am VERY excited we are getting this feature.)

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 22:17:45
Message-ID:	200702272217.l1RMHjZ15874@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jonah H. Harris wrote:
> On 2/27/07, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> > You're missing my point, which is that nobody has demonstrated that there
> > is any real performance gain from this. I see no reason to implement it
> > if there is no performance gain.
>
> While I'll back your request for results, it seems nearly impossible
> to expect no performance gain using this.
>
> Results never hurt though.

The results are going to be very close to fsync off for sufficiently
high values delay, and we _know_ fsync off is a performance win.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 22:18:28
Message-ID:	200702272218.l1RMISr16087@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jim C. Nasby wrote:
> On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote:
> > 2. remove fsync parameter
>
> Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
> want this for things like database restores.

I think we will remove fsync in favor of the new delay, and allow -1 to
be the same behavior as fsync off.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 22:30:01
Message-ID:	45E4B0E9.10308@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Bruce Momjian wrote:
> Jonah H. Harris wrote:
>> On 2/27/07, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> You're missing my point, which is that nobody has demonstrated that there
>>> is any real performance gain from this. I see no reason to implement it
>>> if there is no performance gain.
>> While I'll back your request for results, it seems nearly impossible
>> to expect no performance gain using this.
>>
>> Results never hurt though.
>
> The results are going to be very close to fsync off for sufficiently
> high values delay, and we _know_ fsync off is a performance win.

This is an assumption. Yes we know that fsync off is a performance win.
We do not know that COMMIT NO WAIT is a performance win. Yes, we can all
sit here and stare at what we *think* will be the result and yes I
actually concur that it will be a performance win.

However, I strongly concur that we need at least some evidence. It could
easily be that a misstep in the code, causes a loop over the wrong set
and all the performance we thought we would get is invalid, not because
of theory or what should happen, but because of actual implementation.

Joshua D. Drake

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-27 23:59:41
Message-ID:	20070227235941.GQ51916@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Feb 27, 2007 at 05:18:28PM -0500, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote:
> > > 2. remove fsync parameter
> >
> > Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
> > want this for things like database restores.
>
> I think we will remove fsync in favor of the new delay, and allow -1 to
> be the same behavior as fsync off.

Well, presumably we'd still allow fsync for some number of versions...

Actually, I don't know that combining both settings is a wise move. The
delay should still provide crash protection, whereas with fsync=off
you've got absolutely no protection from anything. That's a huge
difference, and one that IMHO warrants a separate setting (and a big,
fat WARNING in the comment for that setting).
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 00:17:37
Message-ID:	200702280017.l1S0HbO25397@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 05:18:28PM -0500, Bruce Momjian wrote:
> > Jim C. Nasby wrote:
> > > On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote:
> > > > 2. remove fsync parameter
> > >
> > > Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
> > > want this for things like database restores.
> >
> > I think we will remove fsync in favor of the new delay, and allow -1 to
> > be the same behavior as fsync off.
>
> Well, presumably we'd still allow fsync for some number of versions...

No.

> Actually, I don't know that combining both settings is a wise move. The
> delay should still provide crash protection, whereas with fsync=off
> you've got absolutely no protection from anything. That's a huge
> difference, and one that IMHO warrants a separate setting (and a big,
> fat WARNING in the comment for that setting).

Yes, it needs a warning, or perhaps we just tell people to set it to
something high and that is all they can do.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 00:21:04
Message-ID:	20070228002104.GA62448@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Feb 27, 2007 at 07:17:37PM -0500, Bruce Momjian wrote:
> > Actually, I don't know that combining both settings is a wise move. The
> > delay should still provide crash protection, whereas with fsync=off
> > you've got absolutely no protection from anything. That's a huge
> > difference, and one that IMHO warrants a separate setting (and a big,
> > fat WARNING in the comment for that setting).
>
> Yes, it needs a warning, or perhaps we just tell people to set it to
> something high and that is all they can do.

Before doing that I'd want to see how the performance compares to
fsync=off. My guess is that fsync=off is a big gain during checkpoints.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 00:49:37
Message-ID:	87slcrp0m6.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:

>>> On 2/27/07, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>>>
>>>> I see no reason to implement it if there is no performance gain.

> However, I strongly concur that we need at least some evidence. It could
> easily be that a misstep in the code, causes a loop over the wrong set
> and all the performance we thought we would get is invalid, not because
> of theory or what should happen, but because of actual implementation.

It rather sounds like you're asking for a proof that Simon can write bug-free
code before you allow him to write any code.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 01:30:19
Message-ID:	200702271730.19428.josh@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom,

> One of the things that's really attractive about the proposed mode is
> that it does *not* create a risk of data corruption

Oh, ok. That wasn't how I understood Simon's case.

> I agree that we ought to look at some performance numbers before
> accepting the patch, but I think Josh's argument that this opens us
> up to major corruption problems is probably wrong.

OK. I've seen no performance numbers yet though. It just seems to me that
any performance patch proposal should start a discussion of what amount of
performance we expect to gain.

Unfortunately, this is *not* a patch I can test on TPCE or SpecJ, because both
of those have ACID requirements which I don't think this would satisfy. I'd
have to modify the benchmark, and I already have 4 performance patches queue
which don't require that.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 01:32:52
Message-ID:	45E4DBC4.2050805@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>
>>>> On 2/27/07, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>>>> I see no reason to implement it if there is no performance gain.
>
>> However, I strongly concur that we need at least some evidence. It could
>> easily be that a misstep in the code, causes a loop over the wrong set
>> and all the performance we thought we would get is invalid, not because
>> of theory or what should happen, but because of actual implementation.
>
> It rather sounds like you're asking for a proof that Simon can write bug-free
> code before you allow him to write any code.

Well wouldn't that be great! :) but no, not quite. I would just like to
see some metrics showing that it is a benefit. Besides the patch needs
to work for the metrics to be run.

Joshua D. Drake

>
>

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 01:36:00
Message-ID:	200702280136.l1S1a0814527@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Joshua D. Drake wrote:
> Gregory Stark wrote:
> > "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> >
> >>>> On 2/27/07, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >>>>> I see no reason to implement it if there is no performance gain.
> >
> >> However, I strongly concur that we need at least some evidence. It could
> >> easily be that a misstep in the code, causes a loop over the wrong set
> >> and all the performance we thought we would get is invalid, not because
> >> of theory or what should happen, but because of actual implementation.
> >
> > It rather sounds like you're asking for a proof that Simon can write bug-free
> > code before you allow him to write any code.
>
> Well wouldn't that be great! :) but no, not quite. I would just like to
> see some metrics showing that it is a benefit. Besides the patch needs
> to work for the metrics to be run.

I don't understand the great demand for metrics at this point. Once the
patch is ready, people can run the patch on their workloads to get
real-world metrics. Metrics are only needed before the patch is
applied, not before it is discussed.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 01:40:33
Message-ID:	200702271740.33592.josh@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

All,

> > But we do don't we? fsync = off, full_page_writes = off?

BTW, our testing seems to indicate that full_page_writes = off is safe on
Solaris 10 on good hardware. At least, we haven't been able to break it yet.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 01:46:48
Message-ID:	1172627208.10824.428.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2007-02-27 at 17:40 -0800, Josh Berkus wrote:
> All,
>
> > > But we do don't we? fsync = off, full_page_writes = off?
>
> BTW, our testing seems to indicate that full_page_writes = off is safe on
> Solaris 10 on good hardware. At least, we haven't been able to break it yet.
>

Is that an OS-dependent parameter? I always assumed it depended entirely
on hardware. I have no way to test it for myself though, so I just leave
full_page_writes=on to be safe.

Regards,
Jeff Davis

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 01:50:28
Message-ID:	200702271750.28370.josh@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff,

> Is that an OS-dependent parameter? I always assumed it depended entirely
> on hardware. I have no way to test it for myself though, so I just leave
> full_page_writes=on to be safe.

It's a question of whether your HW+OS can guarentee no torn page writes for
the xlog. Running on Sun hardware combined with Solaris 10 with the xlog
mounted forcedirectio, the Solaris folks are convinced we are torn-page-proof
and so far we haven't been able to prove them wrong. And, on Solaris it's a
substantial performance gain (like, 8-10% on OLTP benchmarks).

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Josh Berkus" <josh(at)agliodbs(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 01:51:04
Message-ID:	87odnfoxrr.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Josh Berkus" <josh(at)agliodbs(dot)com> writes:

> OK. I've seen no performance numbers yet though. It just seems to me that
> any performance patch proposal should start a discussion of what amount of
> performance we expect to gain.

There exist proposals that can be prototyped and measured to see what
potential they have. I don't see how this is one of them though. The only way
I could see to prototype this is would, as Bruce said, be to turn fsync off
and consider that the extreme end of the spectrum. The reality will lie
somewhere in between and exactly where will depend on the value of the delay.

> Unfortunately, this is *not* a patch I can test on TPCE or SpecJ, because both
> of those have ACID requirements which I don't think this would satisfy. I'd
> have to modify the benchmark, and I already have 4 performance patches queue
> which don't require that.

I haven't read TPCE but if it's like TPCC then the simple addition of a UPS
satisfies the ACID requirements even with fsync off entirely. Pretty lame but
it's explicitly sufficient.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 01:57:16
Message-ID:	45E4E17C.7070701@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Bruce Momjian wrote:
> Joshua D. Drake wrote:
>> Gregory Stark wrote:
>>> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>>>
>>>>>> On 2/27/07, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>>>>>> I see no reason to implement it if there is no performance gain.
>>>> However, I strongly concur that we need at least some evidence. It could
>>>> easily be that a misstep in the code, causes a loop over the wrong set
>>>> and all the performance we thought we would get is invalid, not because
>>>> of theory or what should happen, but because of actual implementation.
>>> It rather sounds like you're asking for a proof that Simon can write bug-free
>>> code before you allow him to write any code.
>> Well wouldn't that be great! :) but no, not quite. I would just like to
>> see some metrics showing that it is a benefit. Besides the patch needs
>> to work for the metrics to be run.
>
> I don't understand the great demand for metrics at this point. Once the
> patch is ready, people can run the patch on their workloads to get
> real-world metrics. Metrics are only needed before the patch is
> applied, not before it is discussed.

I don't disagree. Maybe I missed something here, but my whole argument
for metrics was purely for the feature to be accepted. I would certainly
expect full discussion.

Joshua D. Drake

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Josh Berkus" <josh(at)agliodbs(dot)com>
Cc:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 02:13:11
Message-ID:	87k5y3owqw.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Josh Berkus" <josh(at)agliodbs(dot)com> writes:

> It's a question of whether your HW+OS can guarentee no torn page writes for
> the xlog.

no, the data files.

torn pages in the xlog is also a problem but we protect ourselves with a CRC
and stop replay if it the CRC doesn't match. So the cost there is a bit of
cpu, not extra i/o.

> Running on Sun hardware combined with Solaris 10 with the xlog mounted
> forcedirectio, the Solaris folks are convinced we are torn-page-proof and so
> far we haven't been able to prove them wrong. And, on Solaris it's a
> substantial performance gain (like, 8-10% on OLTP benchmarks).

I would expect you to need a small non-volatile cache, either in the
controller or the drive itself to be torn-page-proof. Or failing that to have
drives that operate on 8kb sectors and guarantee that whole sectors get
written using residual power. I don't think any drives operate in 8k sectors
though.

The scary thing about torn pages with full_page_writes off is that we don't
offer any way to detect them. If both halves of the 8kb page look reasonable
you could conceivably end up continuing without ever knowing your data is
corrupt.

That could happen if, for example, the change that was being written isn't
very dramatic. Perhaps all that's missing is an update chain pointer for
example. So you could have two versions of the same record but be missing the
chain pointer in the old record. That would eventually lead to having two
visible versions of the same record but no crashes or other red flags.

I suggested a while back implementing torn page detection by writing a
sequential number ever 512 bytes in the blocks. (I was talking about WAL at
the time but the same principle applies.) Do it at the smgr layer using
readv/writev and the upper layers need never know their data wasn't contiguous
on disk. The only effect would be to shorten page sizes by 16 bytes which
would be annoying but much less so than full_page_writes.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 03:43:53
Message-ID:	36e682920702271943s49cb6986vd01adefeea71ab19@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/27/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> I suggested a while back implementing torn page detection by writing a
> sequential number ever 512 bytes in the blocks. (I was talking about WAL at
> the time but the same principle applies.) Do it at the smgr layer using
> readv/writev and the upper layers need never know their data wasn't contiguous
> on disk. The only effect would be to shorten page sizes by 16 bytes which
> would be annoying but much less so than full_page_writes.

First, rather than using 16-bytes per page and having to deal with
handling the non-contiguous space, why not just use a page-level
checksum like everyone else? Most of the systems I've seen seem to
employ a simple CRC16 or CRC32.

Second, unless I'm missing something, I don't see how your algorithm
is going to work as each 512 byte chunk of the block will *always*
have the same sequential byte value. That is, unless you have some
way of preventing wraparound at 255 without adding additional block
overhead.

Lastly, from a performance perspective, it's going to be faster to
compute the entire page's checksum than it would be to check the
sequence every 512 bytes and perform the space adjustment.

From:	"Jeroen T(dot) Vermeulen" <jtv(at)xs4all(dot)nl>
To:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 05:54:36
Message-ID:	4655.125.24.226.252.1172642076.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, February 28, 2007 06:59, Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 05:18:28PM -0500, Bruce Momjian wrote:

>> I think we will remove fsync in favor of the new delay, and allow -1 to
>> be the same behavior as fsync off.
>
> Well, presumably we'd still allow fsync for some number of versions...

I'd hate to lose the ability to disable fsync. I run tons of tests that
don't require any protection against server crashes or hardware failures,
but their speed does matter. I know it's not the most important
requirement in the world, but speeding up those tests means I can run more
of them, on more hardware, more often. Test time also affects my
development cycle.

My main worry is where the option is set, though. For my situation,
selecting a "fast and sloppy" mode when starting the server is clearly the
best choice. It'd be possible, though awkward, to change my code to use
COMMIT NOWAIT. But then am I really sure I'm still testing the same
thing? Plus it introduces a risk of binaries (distributed by others)
accidentally doing COMMIT NOWAIT, as for testing, in production code.

Jeroen

From:	"Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 10:40:29
Message-ID:	E1539E0ED7043848906A8FF995BDA57901CAFD3F@m0143.s-mxs.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> > > > But we do don't we? fsync = off, full_page_writes = off?
> >
> > BTW, our testing seems to indicate that full_page_writes = off is
safe
> > on Solaris 10 on good hardware. At least, we haven't been able to
break it yet.
> >
>
> Is that an OS-dependent parameter? I always assumed it depended
entirely
> on hardware. I have no way to test it for myself though, so I just
leave
> full_page_writes=on to be safe.

It also depends on the FS implementation. The OS/FS must guarantee, that
it does not chunk single data page write calls. Usually that is the
case, when OS/FS pagesize and pg pagesize are identical. And the HW
needs to guarantee atomicity for single calls. e.g. on AIX you need to
reduce the pg page size to 4k to be able to give those guarantees.

Andreas

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 11:10:04
Message-ID:	87bqjepmgj.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:

> First, rather than using 16-bytes per page and having to deal with
> handling the non-contiguous space, why not just use a page-level
> checksum like everyone else? Most of the systems I've seen seem to
> employ a simple CRC16 or CRC32.

I think a CRC would be a useful feature for people who want an extra degree of
protection from faulty hardware.

But we've already seen that CRC checks can be expensive. Not everyone will
want to take the cpu hit. Storing a byte counter in every block is cheap.

And the idea came from what someone said MSSQL does, so "like everyone else"
-- which isn't a very compelling argument to begin with -- doesn't argue
against it.

> Second, unless I'm missing something, I don't see how your algorithm
> is going to work as each 512 byte chunk of the block will *always*
> have the same sequential byte value. That is, unless you have some
> way of preventing wraparound at 255 without adding additional block
> overhead.

I think the way you would work is to have the smgr note the sequential value
it found when it read in a page and then when it writes it out increment that
value by one. Conveniently the pages would be 16 bytes shorter than an 8kb
page so you have 16 bytes available with every buffer to note information like
the last sequential tag the buffer used.

> Lastly, from a performance perspective, it's going to be faster to
> compute the entire page's checksum than it would be to check the
> sequence every 512 bytes and perform the space adjustment.

That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already
suffering a copy due to our use of read/write the difference between
read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be
non-zero but very small. Thousands of times quicker than the CRC.

If we went to direct-io then it would entail an additional memory-copy which
would be annoying. But that would still be much much cheaper than a CRC check.
The best we could do in that case would be to do a CRC check at the same time
as the memory move.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 15:27:48
Message-ID:	36e682920702280727n58a5c2cbg9371e9afed2895d@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> But we've already seen that CRC checks can be expensive. Not everyone will
> want to take the cpu hit. Storing a byte counter in every block is cheap.

CRC checking a page is most certainly the simplest. And, I disagree
that it would be worse than either a sequence counter or the full page
write. Block checksumming is done at read/write time... which is
something that needs to be improved anyway. With a properly tuned
bgwriter, the write itself should barely be noticeable. How fast is a
CRC of 8K? Last time I checked it was something on the scale of ~95
usec for CRC32 and ~33 usec for sb8.

> And the idea came from what someone said MSSQL does, so "like everyone else"
> -- which isn't a very compelling argument to begin with -- doesn't argue
> against it.

Rather than basing designs on poor second-hand information, maybe you
and the person who mentioned this idea should get up-to-date and read
the SQL Server storage engine architecture.

As of SQL Server 2005, blocks *are* checksummed with CRC32. And, just
for the record, previous versions of SQL server performed a bit
flipping technique for every 512 bytes in the page header; it did
*not* waste a byte for every 512 bytes written.

> I think the way you would work is to have the smgr note the sequential value
> it found when it read in a page and then when it writes it out increment that
> value by one. Conveniently the pages would be 16 bytes shorter than an 8kb
> page so you have 16 bytes available with every buffer to note information like
> the last sequential tag the buffer used.

This proposed design is overcomplicated and a waste of space. I mean,
we reduce storage overhead using phantom command id and variable
varlena, but let's just fill it up again with unnecessary junk bytes.

> That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already
> suffering a copy due to our use of read/write the difference between
> read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be
> non-zero but very small. Thousands of times quicker than the CRC.

Prove it.

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 15:47:04
Message-ID:	87ps7unv2f.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:

> This proposed design is overcomplicated and a waste of space. I mean,
> we reduce storage overhead using phantom command id and variable
> varlena, but let's just fill it up again with unnecessary junk bytes.

We reduced storage overhead using phantom command id by 8 bytes *per tuple*. I
hardly think 8 bytes per page is much of a concern. You're already losing an
average of 1/2 a tuple per page to rounding and that's a minimum of 16 bytes
for the narrowest of tuples.

>> That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already
>> suffering a copy due to our use of read/write the difference between
>> read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be
>> non-zero but very small. Thousands of times quicker than the CRC.
>
> Prove it.

We've already seen wal CRC checking show up at the top of profiles.

Do you really doubt that memcpy is faster than CRC32 checking? Especially when
you're already doing memcpy anyways and the only overhead is the few unaligned
bytes at the end and the 8 one-byte copies?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 15:57:03
Message-ID:	45E5A64F.8070601@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> "Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
>
>> This proposed design is overcomplicated and a waste of space. I mean,
>> we reduce storage overhead using phantom command id and variable
>> varlena, but let's just fill it up again with unnecessary junk bytes.
>
> We reduced storage overhead using phantom command id by 8 bytes *per tuple*. I
> hardly think 8 bytes per page is much of a concern. You're already losing an
> average of 1/2 a tuple per page to rounding and that's a minimum of 16 bytes
> for the narrowest of tuples.
>
>>> That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already
>>> suffering a copy due to our use of read/write the difference between
>>> read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be
>>> non-zero but very small. Thousands of times quicker than the CRC.
>> Prove it.
>
> We've already seen wal CRC checking show up at the top of profiles.

yeah - on fast boxes (diskio wise) wal-crc checking is nearly always on
the very top of wal-intensive workloads.

Stefan

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 16:23:49
Message-ID:	36e682920702280823v7ed96451jd872211eb04a2aeb@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> We reduced storage overhead using phantom command id by 8 bytes *per tuple*. I
> hardly think 8 bytes per page is much of a concern. You're already losing an
> average of 1/2 a tuple per page to rounding and that's a minimum of 16 bytes
> for the narrowest of tuples.

Again, it goes back to competent design. Save space here, waste it
there. SQL Server's bit-flipping technique is still *much* better
than wasting 1 byte for every 512.

> We've already seen wal CRC checking show up at the top of profiles.

Well, when you consider we're performing a CRC for every log record
rather than at the block level, like most other systems, I wouldn't be
that surprised. Don't try and use that example as a reason to kill
the checksum; it's a completely different use case.

> Do you really doubt that memcpy is faster than CRC32 checking? Especially when
> you're already doing memcpy anyways and the only overhead is the few unaligned
> bytes at the end and the 8 one-byte copies?

I'm saying the complexity and implementation of it is going to get you
a bit more than you think. If it didn't, you'd already have the thing
coded and would be proving me wrong with the patch rather than theory.
I can code up a checksum version in an hour or less if you want to
give it a go.

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 16:38:51
Message-ID:	45E5B01B.9050508@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>> Do you really doubt that memcpy is faster than CRC32 checking?
>> Especially when
>> you're already doing memcpy anyways and the only overhead is the few
>> unaligned
>> bytes at the end and the 8 one-byte copies?
>
> I'm saying the complexity and implementation of it is going to get you
> a bit more than you think. If it didn't, you'd already have the thing
> coded and would be proving me wrong with the patch rather than theory.
> I can code up a checksum version in an hour or less if you want to
> give it a go.

LIVE FROM THE WWE, CAGE MATCH!

Jonah (the Theorist) Harris versus Greg (the Brain) Stark.

What is going to happen between these two brothers in arms when they
must both prove their theory!

Sincerely,

Joshua D. Drake

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 17:16:10
Message-ID:	200702281716.l1SHGAP25360@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I think we need to think about when these CRCs would be read and
written. It would be written when it hits the disk, hopefully by the
background writer, and I think after a server crash, all pages would
have to be read and checked. The good news is that both of these are
non-critical paths.

---------------------------------------------------------------------------

Jonah H. Harris wrote:
> On 2/28/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> > But we've already seen that CRC checks can be expensive. Not everyone will
> > want to take the cpu hit. Storing a byte counter in every block is cheap.
>
> CRC checking a page is most certainly the simplest. And, I disagree
> that it would be worse than either a sequence counter or the full page
> write. Block checksumming is done at read/write time... which is
> something that needs to be improved anyway. With a properly tuned
> bgwriter, the write itself should barely be noticeable. How fast is a
> CRC of 8K? Last time I checked it was something on the scale of ~95
> usec for CRC32 and ~33 usec for sb8.
>
> > And the idea came from what someone said MSSQL does, so "like everyone else"
> > -- which isn't a very compelling argument to begin with -- doesn't argue
> > against it.
>
> Rather than basing designs on poor second-hand information, maybe you
> and the person who mentioned this idea should get up-to-date and read
> the SQL Server storage engine architecture.
>
> As of SQL Server 2005, blocks *are* checksummed with CRC32. And, just
> for the record, previous versions of SQL server performed a bit
> flipping technique for every 512 bytes in the page header; it did
> *not* waste a byte for every 512 bytes written.
>
> > I think the way you would work is to have the smgr note the sequential value
> > it found when it read in a page and then when it writes it out increment that
> > value by one. Conveniently the pages would be 16 bytes shorter than an 8kb
> > page so you have 16 bytes available with every buffer to note information like
> > the last sequential tag the buffer used.
>
> This proposed design is overcomplicated and a waste of space. I mean,
> we reduce storage overhead using phantom command id and variable
> varlena, but let's just fill it up again with unnecessary junk bytes.
>
> > That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already
> > suffering a copy due to our use of read/write the difference between
> > read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be
> > non-zero but very small. Thousands of times quicker than the CRC.
>
> Prove it.
>
> --
> Jonah H. Harris, Software Architect | phone: 732.331.1324
> EnterpriseDB Corporation | fax: 732.331.1301
> 33 Wood Ave S, 3rd Floor | jharris(at)enterprisedb(dot)com
> Iselin, New Jersey 08830 | http://www.enterprisedb.com/
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 17:16:39
Message-ID:	87tzx6urrc.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Gregory Stark" <stark(at)enterprisedb(dot)com> writes:

> We've already seen wal CRC checking show up at the top of profiles.
>
> Do you really doubt that memcpy is faster than CRC32 checking? Especially when
> you're already doing memcpy anyways and the only overhead is the few unaligned
> bytes at the end and the 8 one-byte copies?

Well color me surprised, writev is not nearly so much faster than CRC as I had
expected:

lseek+write syscall overhead: 7.95 us
CRC32 32.54 us
writev 26.56 us

The reason there's lseek overhead in there is because I had it seek back to
the same block repeatedly to (hopefully) avoid any i/o. It seems to have
worked as I find it hard to believe these numbers could be so low if there's
any i/o being included.

I think part of the reason writev is slow is because I'm including the time it
took to set up the iovec array. That's 64 word copies write there. And then
writev has to read those 64 words back and do 64 extra branches and so on...

This is on an Intel T2500 (2Ghz).

There is a side issue that tagging each sector is 100% guaranteed to detect
torn pages whereas checksums still have a chance of missing them. But usually
the scenario where that comes into play is where you have many checksum
failures and are ignoring them assuming they never fail. In the case of torn
pages there'll only be one torn page and we're going to scream bloody murder
if we see it so I don't think that's a big issue.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 17:19:20
Message-ID:	87ps7uurmv.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:

> LIVE FROM THE WWE, CAGE MATCH!
>
> Jonah (the Theorist) Harris versus Greg (the Brain) Stark.
>
> What is going to happen between these two brothers in arms when they
> must both prove their theory!

Darn, I wish I had seen this post before I posted the results of my testing.

So what happens to the loser in WWE cage matches?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 17:20:17
Message-ID:	36e682920702280920n128c545dr70d7cadc4a0bbd34@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:
> LIVE FROM THE WWE, CAGE MATCH!
>
> Jonah (the Theorist) Harris versus Greg (the Brain) Stark.
>
> What is going to happen between these two brothers in arms when they
> must both prove their theory!

Heh, I like it :)

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 17:28:57
Message-ID:	87lkiiur6u.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Bruce Momjian" <bruce(at)momjian(dot)us> writes:

> I think we need to think about when these CRCs would be read and
> written. It would be written when it hits the disk, hopefully by the
> background writer, and I think after a server crash, all pages would
> have to be read and checked. The good news is that both of these are
> non-critical paths.

If you're protecting against torn pages then yes, if the system is shut down
uncleanly by a system crash or power failure you would in theory have to scan
every page of every table and index before starting up.

But if the system was shut down uncleanly as the result of a Postgres crash or
fast shutdown of Postgres then that isn't an issue. And many users may prefer
to bring the system up as soon as possible as long as they know any corrupt
pages will be spotted and throw errors as soon as it's seen.

So I think you need a mode that only checks checksums when a page is read from
disk. That would protect against torn pages (but not necessarily before
bringing up the system) and against bad i/o hardware.

Unfortunately memory errors are far more common than disk errors and I it
would be much harder to protect against them. You can't check it when someone
may be writing to the buffer, which limits you to checking it only when you
acquire some form of lock on the buffer. It also means you would have to write
it before you release a lock if you've made any changes.

Worse, I'm not sure how to handle hint bits though. We currently don't require
any lock at all to set hint bits which means someone may think they can check
a checksum while or after you've fiddled some bits.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 17:31:27
Message-ID:	45E5BC6F.4050106@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>
>> LIVE FROM THE WWE, CAGE MATCH!
>>
>> Jonah (the Theorist) Harris versus Greg (the Brain) Stark.
>>
>> What is going to happen between these two brothers in arms when they
>> must both prove their theory!
>
> Darn, I wish I had seen this post before I posted the results of my testing.
>
> So what happens to the loser in WWE cage matches?

Usually the are left bloody and staring at the winner through the cage.
Occasionally they are carried out on a stretcher.

The good news is, the loser gets a rematch at the next pay per view.

Sincerely,

Joshua D. Drake

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 17:34:52
Message-ID:	200702281734.l1SHYqM29382@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> "Bruce Momjian" <bruce(at)momjian(dot)us> writes:
>
> > I think we need to think about when these CRCs would be read and
> > written. It would be written when it hits the disk, hopefully by the
> > background writer, and I think after a server crash, all pages would
> > have to be read and checked. The good news is that both of these are
> > non-critical paths.
>
> If you're protecting against torn pages then yes, if the system is shut down
> uncleanly by a system crash or power failure you would in theory have to scan
> every page of every table and index before starting up.
>
> But if the system was shut down uncleanly as the result of a Postgres crash or
> fast shutdown of Postgres then that isn't an issue. And many users may prefer
> to bring the system up as soon as possible as long as they know any corrupt
> pages will be spotted and throw errors as soon as it's seen.

I don't think we should start up a system and only detect the errors
later.

> So I think you need a mode that only checks checksums when a page is read from
> disk. That would protect against torn pages (but not necessarily before
> bringing up the system) and against bad i/o hardware.
>
> Unfortunately memory errors are far more common than disk errors and I it
> would be much harder to protect against them. You can't check it when someone
> may be writing to the buffer, which limits you to checking it only when you
> acquire some form of lock on the buffer. It also means you would have to write
> it before you release a lock if you've made any changes.
>
> Worse, I'm not sure how to handle hint bits though. We currently don't require
> any lock at all to set hint bits which means someone may think they can check
> a checksum while or after you've fiddled some bits.

Yep, a problem.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:10:11
Message-ID:	36e682920702281110o287aae37i69d1e0bdfc97495f@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > But if the system was shut down uncleanly as the result of a Postgres crash or
> > fast shutdown of Postgres then that isn't an issue. And many users may prefer
> > to bring the system up as soon as possible as long as they know any corrupt
> > pages will be spotted and throw errors as soon as it's seen.
>
> I don't think we should start up a system and only detect the errors
> later.

Which is, of course, how everyone else does it. On block access, the
checksum is verified (if you've turned checksum checking on). I
*really* doubt you want to pull in every page in the database at
startup time to verify the checksum or sequence. Even pages from the
last checkpoint would be a killer.

All of the databases (Oracle, SQL Server, DB2) have a way to perform a
database corruption check which does go out and verify all checksums.

If consistency is stored at the block-level, which is pretty much the
only way to avoid full page writes, you have to accept some level of
possible corruption.

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:12:24
Message-ID:	200702281912.l1SJCON22868@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > > But if the system was shut down uncleanly as the result of a Postgres crash or
> > > fast shutdown of Postgres then that isn't an issue. And many users may prefer
> > > to bring the system up as soon as possible as long as they know any corrupt
> > > pages will be spotted and throw errors as soon as it's seen.
> >
> > I don't think we should start up a system and only detect the errors
> > later.
>
> Which is, of course, how everyone else does it. On block access, the
> checksum is verified (if you've turned checksum checking on). I
> *really* doubt you want to pull in every page in the database at
> startup time to verify the checksum or sequence. Even pages from the
> last checkpoint would be a killer.
>
> All of the databases (Oracle, SQL Server, DB2) have a way to perform a
> database corruption check which does go out and verify all checksums.
>
> If consistency is stored at the block-level, which is pretty much the
> only way to avoid full page writes, you have to accept some level of
> possible corruption.

Am am not comfortable starting and having something fail later. How
other databases do it is not an issue for me.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:27:08
Message-ID:	36e682920702281127p3df42e2fv7eee56b79d11ef0@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Am am not comfortable starting and having something fail later.

Then do you have some other idea for protecting pages from being torn
without storing an entire backup copy or performing a block-level
consistency check?

> How other databases do it is not an issue for me.

Not saying it is, just stating the facts.

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:35:09
Message-ID:	200702281935.l1SJZ9625120@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > Am am not comfortable starting and having something fail later.
>
> Then do you have some other idea for protecting pages from being torn
> without storing an entire backup copy or performing a block-level
> consistency check?

We have several methods suggested to check the blocks, like CRC. My
point was that, whatever check method we use, we should be prepared to
check on startup, or at least make it the default for a crash restart.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:41:42
Message-ID:	36e682920702281141y229ed97fx84a25a22f9a03928@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> We have several methods suggested to check the blocks, like CRC. My
> point was that, whatever check method we use, we should be prepared to
> check on startup, or at least make it the default for a crash restart.

Sounds like it should be a guc. I most certainly wouldn't check the
entire database, especially if it was over a gigabyte.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:48:10
Message-ID:	1172692090.13722.59.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-02-28 at 14:10 -0500, Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > > But if the system was shut down uncleanly as the result of a Postgres crash or
> > > fast shutdown of Postgres then that isn't an issue. And many users may prefer
> > > to bring the system up as soon as possible as long as they know any corrupt
> > > pages will be spotted and throw errors as soon as it's seen.
> >
> > I don't think we should start up a system and only detect the errors
> > later.
>
> Which is, of course, how everyone else does it. On block access, the
> checksum is verified (if you've turned checksum checking on). I
> *really* doubt you want to pull in every page in the database at
> startup time to verify the checksum or sequence. Even pages from the
> last checkpoint would be a killer.
>

Under normal operations, shutting down the database does a checkpoint,
right? So unless you're in recovery mode, there's no additional cost.
And I can't think of any reason you'd need to see any pages before the
last checkpoint (unless you don't trust your disk and just want to check
all the pages, which is more than we can do now anyway).

So the additional cost of doing CRCs every time would be the CPU cost,
and also the cost during recovery of reading in all the data pages since
the last checkpoint. That's 5 minutes of data, in the default
configuration.

Regards,
Jeff Davis

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:48:45
Message-ID:	200702281948.l1SJmjG26774@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > We have several methods suggested to check the blocks, like CRC. My
> > point was that, whatever check method we use, we should be prepared to
> > check on startup, or at least make it the default for a crash restart.
>
> Sounds like it should be a guc. I most certainly wouldn't check the
> entire database, especially if it was over a gigabyte.

Keep in mind if you don't check it on startup, you will be checking it
for every read, which for rare crashes, might not be wise.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:52:11
Message-ID:	36e682920702281152u3020c7f6k3c567b873ae08d12@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Keep in mind if you don't check it on startup, you will be checking it
> for every read, which for rare crashes, might not be wise.

Well understood. That's how most everyone configures their database
systems; they certainly don't optimize for torn page detection on a
crash.

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 19:54:16
Message-ID:	36e682920702281154j28e708bcmda6872a5dd3e1d08@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> That's 5 minutes of data, in the default configuration.

Right, but I don't know anyone that keeps checkpoints at 5 minutes.
At least not on OLTP configurations.

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 20:08:00
Message-ID:	45E5E120.3010300@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> > But if the system was shut down uncleanly as the result of a
>> Postgres crash or
>> > fast shutdown of Postgres then that isn't an issue. And many users
>> may prefer
>> > to bring the system up as soon as possible as long as they know any
>> corrupt
>> > pages will be spotted and throw errors as soon as it's seen.
>>
>> I don't think we should start up a system and only detect the errors
>> later.
>
> Which is, of course, how everyone else does it. On block access, the
> checksum is verified (if you've turned checksum checking on). I
> *really* doubt you want to pull in every page in the database at
> startup time to verify the checksum or sequence. Even pages from the
> last checkpoint would be a killer.

AFAICS pages from the last checkpoint would be sufficient, and not that
expensive given that we have to pull all the pages touched since last
checkpoint from disk to do the WAL replay anyway.

> All of the databases (Oracle, SQL Server, DB2) have a way to perform a
> database corruption check which does go out and verify all checksums.

I think that's to protect from random disk errors rather than just torn
pages. Which might be a useful thing but I think that level of
protection belongs to the filesystem or storage hardware.

What's the use case for the torn-page detection, anyway? If you don't
trust your OS/hardware to protect you from torn-pages, but you do care
about your data, surely you would use full_pages_writes = on. If you do
trust your OS/hardware, just turn it off. And if you don't care about
your data, what's the point?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 20:10:24
Message-ID:	45E5E1B0.2080202@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jonah H. Harris wrote:
> On 2/28/07, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>> That's 5 minutes of data, in the default configuration.
>
> Right, but I don't know anyone that keeps checkpoints at 5 minutes.
> At least not on OLTP configurations.

Uhmm... most do because most don't ever touch the postgresql.conf and
those that do, don't touch checkpoints because they don't understand it.

Joshua D. Drake

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 20:17:02
Message-ID:	1172693822.13722.73.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-02-28 at 14:54 -0500, Jonah H. Harris wrote:
> On 2/28/07, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > That's 5 minutes of data, in the default configuration.
>
> Right, but I don't know anyone that keeps checkpoints at 5 minutes.
> At least not on OLTP configurations.
>

It's got a hard maximum of 1 hour. Also, the need to make checkpoints
far apart has been reduced with bgwriter. It will be further reduced
with the patch that allows people to tune the bgwriter for their needs.

Recovery has to run through all those WAL segments anyway. It's not like
we're making a 5 second recovery take 20 minutes, we'd be taking an
already long recovery and making it longer (I'm not sure how much
longer, but it can't be more than twice as long).

I'm not saying there's no cost, but the extra recovery cost seems lower
to me than the CRC cost on every data page read during operation.

Also, if we find an error, do we even have the ability to correct it? A
CRC doesn't tell us which pages were written and which weren't, so we
could detect the error but not correct it, right?

Regards,
Jeff Davis

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 20:37:11
Message-ID:	36e682920702281237g287c564ck59f8f825998e01c3@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> I'm not saying there's no cost, but the extra recovery cost seems lower
> to me than the CRC cost on every data page read during operation.

I agree, I just think it should be configurable.

> Also, if we find an error, do we even have the ability to correct it? A
> CRC doesn't tell us which pages were written and which weren't, so we
> could detect the error but not correct it, right?

Correct.

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 20:39:18
Message-ID:	36e682920702281239y388fbb53x41fdc82130b03549@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:
> > Right, but I don't know anyone that keeps checkpoints at 5 minutes.
> > At least not on OLTP configurations.
>
> Uhmm... most do because most don't ever touch the postgresql.conf and
> those that do, don't touch checkpoints because they don't understand it.

Yes, I guess I should've qualified that to specify people who do
understand the system better.

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 20:44:36
Message-ID:	45E5E9B4.5000303@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jonah H. Harris wrote:
> On 2/28/07, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:
>> > Right, but I don't know anyone that keeps checkpoints at 5 minutes.
>> > At least not on OLTP configurations.
>>
>> Uhmm... most do because most don't ever touch the postgresql.conf and
>> those that do, don't touch checkpoints because they don't understand it.
>
> Yes, I guess I should've qualified that to specify people who do
> understand the system better.

I am not trying to pick on the issue but I do think it is important to
recognize that literally only those in the know, are going to ever touch
the postgresql.conf.

I deal with lots of customers, who employ lots of programmers that I
have often heard, "There is a postgresql.conf?". Why? Because they are
user space programmers using an ORM or other such tech that allows them
to never actually login to postgresql or a shell or anything once they
edit the pg_hba.conf.

Joshua D. Drake

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 20:55:13
Message-ID:	36e682920702281255g7f004928xb0bb624e46ddf29e@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:
> I am not trying to pick on the issue but I do think it is important to
> recognize that literally only those in the know, are going to ever touch
> the postgresql.conf.

I agree.

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 21:13:08
Message-ID:	87d53uugt7.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:

> Which is, of course, how everyone else does it.

I happen to agree with your conclusion but this line of argument is
exceptionally unconvincing. In fact in this crowd you'll tend to turn people
off and lose people if you say things like that rather than convince anyone of
anything.

> Even pages from the last checkpoint would be a killer.

Hm that's an interesting thought. We only really have to check pages that
would have received a full page write since the last checkpoint. So if we made
turning full page writes off still record the page ids of the pages it *would*
have written then we just need the code that normally replays full page writes
to check the checksum if the page data isn't available.

I can't see how that would be a killer. No matter how large a system you're
talking about you're going to tune checkpoints to be occurring at about the
same interval anyways. So the amount of time the wal replay checksum checking
takes will be more or less constant.

In fact we're already reading in most, if not all, of those pages anyways
since we're replaying wal records that touch them after all. Would we even
have to do anything extra? If we check checksums whenever we read in a page
surely the wal replay code would automatically detect any torn pages without
any special attention.

That also makes it clear just how awful full page writes are for scalability.
As you scale up the system but try to keep checkpoint intervals constant
you're less and less likely to ever see the same page twice between two
checkpoints. So as you scale the system up more and more of the wal will
consist of full page writes.

> All of the databases (Oracle, SQL Server, DB2) have a way to perform a
> database corruption check which does go out and verify all checksums.

Which is pretty poor design. If we implemented a fsck-like tool I would be far
more interested in checking things like "tuples don't overlap" or "hint bits
are set correctly" and so on. Checksums do nothing to protect against software
failures which is the only kind of failure with a good rationale for being in
an external tool.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 21:46:26
Message-ID:	36e682920702281346l38bceedbpd79f15db90b373fd@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> "Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
>
> > Which is, of course, how everyone else does it.
>
> I happen to agree with your conclusion but this line of argument is
> exceptionally unconvincing. In fact in this crowd you'll tend to turn people
> off and lose people if you say things like that rather than convince anyone of
> anything.

Rather than reinventing the wheel, it often pays to piggyback on the
solutions others in similar situations have encountered. I'm just
stating how others provide similar functionality or capabilities. If
someone dislikes an idea just because the major vendors have done it
that way, that's their own problem. It's up to the community to
decide how to proceed given the information at hand.

> > Even pages from the last checkpoint would be a killer.
>
> Hm that's an interesting thought. We only really have to check pages that
> would have received a full page write since the last checkpoint.

That's the only way I see that it could possibly be acceptable from a
time-to-recover performance standpoint. I would still prefer a guc.

> Which is pretty poor design. If we implemented a fsck-like tool I would be far
> more interested in checking things like "tuples don't overlap" or "hint bits
> are set correctly" and so on. Checksums do nothing to protect against software
> failures which is the only kind of failure with a good rationale for being in
> an external tool.

Regardless of whether it's better as a separate tool or in the
database itself, they provide a
corruption-finding/consistency-checking capability. As far as other
checks that could be performed, SQL Server and Oracle do have their
own internal structure checks; many of which execute at runtime, not
as a separate tool or process.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 22:36:38
Message-ID:	1172702198.13722.79.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-02-28 at 21:13 +0000, Gregory Stark wrote:
> Hm that's an interesting thought. We only really have to check pages that
> would have received a full page write since the last checkpoint. So if we made

Do we ever do a partial page write, or is what you're saying equivalent
to "we only have to check pages that have been written to since the last
checkpoint"?

And if it is the same, can't we get the pages that were written to from
the ctids in the wal records?

Regards,
Jeff Davis

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Josh Berkus" <josh(at)agliodbs(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 23:23:32
Message-ID:	1172705012.3760.1068.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2007-02-27 at 09:32 -0800, Josh Berkus wrote:
> Simon,

I think if you address me in a mail, it would be best not to explicitly
*remove* my name from the address list.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-02-28 23:49:59
Message-ID:	200702282349.l1SNnxY11935@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I want the startup to fail.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 00:02:50
Message-ID:	874pp5vnit.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:

> On 2/28/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>> "Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
>>
>> > Which is, of course, how everyone else does it.
>>
>> I happen to agree with your conclusion but this line of argument is
>> exceptionally unconvincing. In fact in this crowd you'll tend to turn people
>> off and lose people if you say things like that rather than convince anyone of
>> anything.
>
> Rather than reinventing the wheel, it often pays to piggyback on the
> solutions others in similar situations have encountered. I'm just
> stating how others provide similar functionality or capabilities. If
> someone dislikes an idea just because the major vendors have done it
> that way, that's their own problem. It's up to the community to
> decide how to proceed given the information at hand.

Except that's not what you're doing. There's nothing wrong with saying "foo
does this clever thing I think we should copy because <insert argument>". Nor
even "foo does this thing, would that help us?" But what you seem to be saying
is "*Because* foo does this thing we can conclude it's a clever thing and we
should do it".

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 00:06:47
Message-ID:	200702281606.47865.josh@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon,

> I think if you address me in a mail, it would be best not to explicitly
> *remove* my name from the address list.

I was trying to remove everyone but the list address.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Josh Berkus" <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 00:07:54
Message-ID:	87zm6xu8px.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jeff Davis" <pgsql(at)j-davis(dot)com> writes:

> On Wed, 2007-02-28 at 21:13 +0000, Gregory Stark wrote:
>> Hm that's an interesting thought. We only really have to check pages that
>> would have received a full page write since the last checkpoint. So if we made
>
> Do we ever do a partial page write, or is what you're saying equivalent
> to "we only have to check pages that have been written to since the last
> checkpoint"?

"full page writes" is a term of art in Postgres. When we first modify a page
we stuff a copy of the entire page in the WAL so that recovery can restore it.
We never write only part of a page to data files intentionally but if the
system crashes while it's in the process of being written only part of it
might end up getting written. Normally only the modified bits get put in WAL.

> And if it is the same, can't we get the pages that were written to from
> the ctids in the wal records?

yes, though it may still be convenient to use the full page write fields

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 00:40:56
Message-ID:	36e682920702281640y60c151cdmbc26f4e83c696bce@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> Except that's not what you're doing. There's nothing wrong with saying "foo
> does this clever thing I think we should copy because <insert argument>". Nor
> even "foo does this thing, would that help us?" But what you seem to be saying
> is "*Because* foo does this thing we can conclude it's a clever thing and we
> should do it".

Oracle, Microsoft, IBM, Sybase, Teradata, MySQL, and Firebird have a
clever feature called page checksumming which I think we should copy
because it's simple and effective at detecting page-level corruption
due to torn pages and/or faulty storage hardware.

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 02:16:21
Message-ID:	36e682920702281816s2ef95fe8xde20f64782f84a62@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> Well color me surprised, writev is not nearly so much faster than CRC as I had
> expected:

All fun aside, are you going to be submitting a patch for this?

From:	"J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 02:22:58
Message-ID:	98A04B82-95A4-44C7-BED3-64F5F1DB2E6E@neopolitan.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 28, 2007, at 4:40 PM, Jonah H. Harris wrote:
> Oracle, Microsoft, IBM, Sybase, Teradata, MySQL, and Firebird have a
> clever feature called page checksumming which I think we should copy
> because it's simple and effective at detecting page-level corruption
> due to torn pages and/or faulty storage hardware.

Any system that does enough I/O will occasionally generate corrupted
pages in the complete absence of any other detectable fault or
hardware failure. A fair amount has been written about it with
respect to large-scale computing and it happens often enough when
systems start getting large that just about everyone implements
software I/O checksumming eventually to deal with the problem. I
simply assumed that PostgreSQL was doing the same since it is
definitely used for systems that are large enough that this becomes a
statistically significant issue.

A popular alternative to CRC32 for this purpose is the significantly
cheaper and almost as effective is the Adler32 algorithm. I know
Google used this algorithm when they added checksumming to their
database to tame inexplicable transient corruption.

Cheers,

J. Andrew Rogers

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 03:06:00
Message-ID:	36e682920702281906g5971e1ex1209142c451b02e3@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, J. Andrew Rogers <jrogers(at)neopolitan(dot)com> wrote:
> A popular alternative to CRC32 for this purpose is the significantly
> cheaper and almost as effective is the Adler32 algorithm. I know
> Google used this algorithm when they added checksumming to their
> database to tame inexplicable transient corruption.

Last time I checked, Teradata used a modified Fletcher checksum as
well; I wasn't aware of Adler32.

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 03:34:21
Message-ID:	36e682920702281934v7ba7041dl6c16ca6ae19a4224@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com> wrote:
> Last time I checked, Teradata used a modified Fletcher checksum as
> well; I wasn't aware of Adler32.

I added adler32 to Gurjeet Singh's crc testing utility, compiled it
with gcc 4.0.0 on a single-core Opteron running FC4, and configured it
for 8K block sizes.

GCC options were -mtune=opteron and the optimizations stated below:

Generic optimization:
test parameters:: SF=1, block_size=8, list_size=1000
Generating data... done. Time taken : 93 milliseconds
Generating CRC using PG (vanilla) algo... done. Time taken : 117 milliseconds
Verifying CRC using PG (vanilla) algo... done. Time taken : 89 milliseconds
Generating CRC using SliceBy8 algo... done. Time taken : 44 milliseconds
Verifying CRC using SliceBy8 algo... done. Time taken : 44 milliseconds
Generating CRC using Adler32 algo... done. Time taken : 52 milliseconds
Verifying CRC using Adler32 algo... done. Time taken : 53 milliseconds

Using -O2
test parameters:: SF=1, block_size=8, list_size=1000
Generating data... done. Time taken : 85 milliseconds
Generating CRC using PG (vanilla) algo... done. Time taken : 31 milliseconds
Verifying CRC using PG (vanilla) algo... done. Time taken : 30 milliseconds
Generating CRC using SliceBy8 algo... done. Time taken : 30 milliseconds
Verifying CRC using SliceBy8 algo... done. Time taken : 30 milliseconds
Generating CRC using Adler32 algo... done. Time taken : 14 milliseconds
Verifying CRC using Adler32 algo... done. Time taken : 14 milliseconds

Using -O3
test parameters:: SF=1, block_size=8, list_size=1000
Generating data... done. Time taken : 87 milliseconds
Generating CRC using PG (vanilla) algo... done. Time taken : 31 milliseconds
Verifying CRC using PG (vanilla) algo... done. Time taken : 31 milliseconds
Generating CRC using SliceBy8 algo... done. Time taken : 30 milliseconds
Verifying CRC using SliceBy8 algo... done. Time taken : 30 milliseconds
Generating CRC using Adler32 algo... done. Time taken : 14 milliseconds
Verifying CRC using Adler32 algo... done. Time taken : 15 milliseconds

Interesting...

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 05:20:07
Message-ID:	36e682920702282120l37f76b45tbd5e8932accf304@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/28/07, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com> wrote:
> I added adler32 to Gurjeet Singh's crc testing utility, compiled it
> with gcc 4.0.0 on a single-core Opteron running FC4, and configured it
> for 8K block sizes.

Tested Fletcher and seeing similar results ratio-wise. Using O3, it's:

Generating CRC using Fletcher algo... done. Time taken : 11 milliseconds
Verifying CRC using Fletcher algo... done. Time taken : 11 milliseconds

From:	"Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 09:45:42
Message-ID:	E1539E0ED7043848906A8FF995BDA57901CAFF11@m0143.s-mxs.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> Under normal operations, shutting down the database does a
> checkpoint, right? So unless you're in recovery mode, there's
> no additional cost.
> And I can't think of any reason you'd need to see any pages
> before the last checkpoint (unless you don't trust your disk
> and just want to check all the pages, which is more than we
> can do now anyway).

The problem is you don't necessarily have the info which pages changed
*after* the last checkpoint. Especially since in the mentioned scenario
you don't have full-page images. Maybe a suitable replacement for
full-page would be to sync the first WAL record for a page change before
writing the buffer (startup recovery would then check all pages it
touches). This syncing does not sound promising though.

And it is also not only one page we are looking for, but one per
controller (for battery backed) or one per phys disk.

Andreas

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
Cc:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Josh Berkus" <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 10:04:20
Message-ID:	87ps7ts2jf.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at> writes:

> Maybe a suitable replacement for full-page would be to sync the first WAL
> record for a page change before writing the buffer

We *always* sync WAL records for page changes before writing the buffer for
the page.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Josh Berkus" <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 13:24:47
Message-ID:	E1539E0ED7043848906A8FF995BDA57901CAFFBF@m0143.s-mxs.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> > Maybe a suitable replacement for full-page would be to sync the
first
> > WAL record for a page change before writing the buffer
>
> We *always* sync WAL records for page changes before writing
> the buffer for the page.

Um, is that so ? And how is that done ? (e.g. bgwriter would need to
wait for or even initiate a WAL record sync)
Seems this is not clear to others eighter.

Why else would people think we need a full data scan after crash without
full-pages ?
If this is true, all pages with possibly pending IO during crash will be
touched and could thus be checked during startup recovery.

I can see that we need to sync wal before write for certain changes, but
I don't see why we would currently need it for e.g. a heap insert, that
gets a free slot.

Andreas

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 13:51:43
Message-ID:	45E6DA6F.8080603@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Zeugswetter Andreas ADI SD wrote:
>>> Maybe a suitable replacement for full-page would be to sync the
> first
>>> WAL record for a page change before writing the buffer
>> We *always* sync WAL records for page changes before writing
>> the buffer for the page.
>
> Um, is that so ? And how is that done ? (e.g. bgwriter would need to
> wait for or even initiate a WAL record sync)
> Seems this is not clear to others eighter.

Yes. The bgwriter or backend that flushes the buffer indeed initiates a
WAL sync if the WAL record corresponding the LSN of the buffer hasn't
been flushed yet.

> Why else would people think we need a full data scan after crash without
> full-pages ?

They're confused. ;-)

> If this is true, all pages with possibly pending IO during crash will be
> touched and could thus be checked during startup recovery.

That's right. Torn page detection wouldn't require any extra I/O, just CPU.

> I can see that we need to sync wal before write for certain changes, but
> I don't see why we would currently need it for e.g. a heap insert, that
> gets a free slot.

The corresponding index insert might hit the disk but the heap insert
might not. Now if someone after the crash uses the same heap slot, you'd
end up with the index tuple of the aborted transaction pointing to wrong
heap tuple. There might be other error scenarios, that's just the first
example I could think of.

In practice, it's extremely rare for a writer to need to do the WAL
flush. When a buffer is chosen as a victim, it's been unused for some
time, and the WAL has already been flushed. It does happen when you're
vacuuming, see Simon's recent thread, but we should fix that anyway. It
also happens on checkpoints, but it's not a big performance issue.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-01 15:24:22
Message-ID:	20070301152421.GJ15006@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Feb 28, 2007 at 12:16:10PM -0500, Bruce Momjian wrote:
> background writer, and I think after a server crash, all pages would
> have to be read and checked. The good news is that both of these are

Would they? If you're doing recovery you'd have to read all pages
dirtied since the last checkpoint... could there be pages other than
those that had been torn?
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	"A(dot)M(dot)" <agentm(at)themactionfaction(dot)com>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-04 04:19:00
Message-ID:	200703032319.00744.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Monday 26 February 2007 19:27, A.M. wrote:
> On Feb 26, 2007, at 18:58 , Simon Riggs wrote:
> > On Mon, 2007-02-26 at 23:25 +0000, Richard Huxton wrote:
> >> Simon Riggs wrote:
> >>> Proposal: Implement a new option for COMMIT, for enhancing
> >>> performance,
> >>> providing a MySQL-like trade-off between performance and
> >>> robustness for
> >>> *only* those that want it.
> >>>
> >>> COMMIT NOWAIT
>
> Isn't the domain of transient data relegated to certain tables
> instead of specific transactions? Wouldn't it be easier to create un-
> wal-logged global temp tables?
>

A similar idea we've been kicking around would be having a set storage
parameter = nologging option for alter table which would, as it's name
implies, cause the system to ignore writing wal logs for the table, much like
it does for temp tables now. One cavaet would be you would probably need to
forbid such a table from being the parent side of a FK relationship, but
otherwise this should be fairly safe even for replay since alter table needs
an exclusive lock, so you have finite points where data for the table would
be written or ignored.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

From:	"A(dot)M(dot)" <agentm(at)themactionfaction(dot)com>
To:	PostgreSQL-development hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: COMMIT NOWAIT Performance Option
Date:	2007-03-05 20:55:49
Message-ID:	91760ECD-BE2F-4DBD-A134-399F5CF891CE@themactionfaction.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mar 3, 2007, at 23:19 , Robert Treat wrote:

>
> A similar idea we've been kicking around would be having a set storage
> parameter = nologging option for alter table which would, as it's name
> implies, cause the system to ignore writing wal logs for the table,
> much like
> it does for temp tables now. One cavaet would be you would
> probably need to
> forbid such a table from being the parent side of a FK
> relationship, but
> otherwise this should be fairly safe even for replay since alter
> table needs
> an exclusive lock, so you have finite points where data for the
> table would
> be written or ignored.

I can't think of any reason why a global temp table couldn't be
referenced by another global temp table, but maybe I'm missing
something. Whenever postgres starts, it would simply truncate the
tables to ensure "temporary" compliance (the no-foot-gun approach). I
could then place such tables in a ramdisk tablespace and make
postgresql a transaction-safe memcached replacement. This would also
be great for materialized views.

Cheers,
M