SRA Win32 sync() code

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: SRA Win32 sync() code
Date: 2003-11-16 05:00:56
Message-ID: 200311160500.hAG50u701539@candle.pha.pa.us
Lists: pgsql-hackers pgsql-hackers-win32 pgsql-patches

Here is the SRA sync() code for Win32. As you might know, fsync on
Win32 is _commit, and sync() is _flushall. However, _flushall flushes
only _open_ kernel buffers, not dirty buffers that have been closed.
Therefore, on Win32, during checkpoint, you have to open,
fsync (_commit), and close every file that has been modified since the
previous checkpoint.

Not sure how we are going to do this in Win32, but somehow we will have
to record all open files between checkpoints in an area that the
checkpoint process can read during a checkpoint.

Here is the SRA code that records the dirty file and the code that
cycles through the list and fsync's each one.
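
As a rough illustration of the record-then-fsync idea described above, a
minimal sketch might look like the following. It is not the SRA attachment.
The names record_dirty_file and checkpoint_sync_dirty_files, the fixed-size
array standing in for shared memory, and the capacity limits are all
assumptions made for the example. It uses POSIX fsync(); per the message
above, the Win32 equivalent would be _commit.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAX_DIRTY_FILES 64      /* hypothetical capacity */
#define MAX_PATH_LEN    256     /* hypothetical path limit */

/* Stand-in for an area the checkpoint process can read (assumption:
 * a real implementation would place this in shared memory). */
static char dirty_files[MAX_DIRTY_FILES][MAX_PATH_LEN];
static int  n_dirty = 0;

/* Remember that 'path' was modified since the previous checkpoint.
 * Returns 0 on success, -1 if the path is too long or the list is full. */
int record_dirty_file(const char *path)
{
    assert(path != NULL);
    if (strlen(path) >= MAX_PATH_LEN)
        return -1;
    for (int i = 0; i < n_dirty; i++)
        if (strcmp(dirty_files[i], path) == 0)
            return 0;           /* already recorded */
    if (n_dirty >= MAX_DIRTY_FILES)
        return -1;
    strcpy(dirty_files[n_dirty++], path);
    return 0;
}

/* At checkpoint: open, fsync, and close each recorded file.
 * Returns the number of files synced, or -1 on the first failure
 * (the list is left intact in that case so the caller can retry). */
int checkpoint_sync_dirty_files(void)
{
    int synced = 0;

    for (int i = 0; i < n_dirty; i++) {
        int fd = open(dirty_files[i], O_RDWR);
        if (fd < 0)
            return -1;
        if (fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);
        synced++;
    }
    n_dirty = 0;                /* start a fresh list for the next cycle */
    return synced;
}
```

Whether fsync on a freshly opened descriptor flushes data written through
another, already closed descriptor is exactly the platform question raised
later in this thread, so the sketch assumes that it does.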

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Attachment Content-Type Size
unknown_filename text/plain 2.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 16:32:59
Message-ID: 28203.1069000379@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Not sure how we are going to do this in Win32, but somehow we will have
> to record all open files between checkpoints in an area that the
> checkpoint process can read during a checkpoint.

One reason I like the idea of adopting a sync-when-you-write policy is
that it eliminates the need for anything as messy as that.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 16:36:54
Message-ID: 200311161636.hAGGatT23547@candle.pha.pa.us

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Not sure how we are going to do this in Win32, but somehow we will have
> > to record all open files between checkpoints in an area that the
> > checkpoint process can read during a checkpoint.
>
> One reason I like the idea of adopting a sync-when-you-write policy is
> that it eliminates the need for anything as messy as that.

Yes, but can we do it without causing a performance degradation, and I
would hate to change something to make things easier on Win32 while
penalizing all platforms.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 16:58:12
Message-ID: 28414.1069001892@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> One reason I like the idea of adopting a sync-when-you-write policy is
>> that it eliminates the need for anything as messy as that.

> Yes, but can we do it without causing a performance degradation, and I
> would hate to change something to make things easier on Win32 while
> penalizing all platforms.

Having to keep a list of modified files in shared memory isn't a penalty?

Seriously though, if we can move the bulk of the writing work into
background processes then I don't believe that there will be any
significant penalty for regular backends. And I believe that it would
be a huge advantage from a correctness point of view if we could stop
depending on sync(). The fact that Windows hasn't got sync() is merely
another reason we should stop using it.

regards, tom lane


From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 17:19:05
Message-ID: 3FB7B189.50308@colorfullife.com

Tom Lane wrote:

>Seriously though, if we can move the bulk of the writing work into
>background processes then I don't believe that there will be any
>significant penalty for regular backends. And I believe that it would
>be a huge advantage from a correctness point of view if we could stop
>depending on sync().
>
Which function guarantees that renames of WAL files arrived on the disk?
AFAIK sync() is the only function that guarantees that.

What about the sync app from sysinternals? It seems Mark Russinovich
figured out how to implement sync on Win32:
http://www.sysinternals.com/ntw2k/source/misc.shtml#Sync

It requires administrative privileges, but it shouldn't be that
difficult to write a tiny service that runs in the LocalSystem account,
listens to a pipe and syncs all disks when asked.

--
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Spraul <manfred(at)colorfullife(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 17:29:52
Message-ID: 28666.1069003792@sss.pgh.pa.us

Manfred Spraul <manfred(at)colorfullife(dot)com> writes:
> Which function guarantees that renames of WAL files arrived on the disk?

The OS itself is supposed to guarantee that; that's what a journaling
file system is for. In any case, I don't think we care. Renaming would
apply only to WAL segments that are not currently needed where they are,
and would only be needed under their new names at some future time.
If the rename gets lost shortly after it's done, it can be redone.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 17:38:52
Message-ID: 3FB7B62C.9000007@dunslane.net

Manfred Spraul wrote:

> Tom Lane wrote:
>
>> Seriously though, if we can move the bulk of the writing work into
>> background processes then I don't believe that there will be any
>> significant penalty for regular backends. And I believe that it would
>> be a huge advantage from a correctness point of view if we could stop
>> depending on sync().
>>
> Which function guarantees that renames of WAL files arrived on the
> disk? AFAIK sync() is the only function that guarantees that.
>
> What about the sync app from sysinternals? It seems Mark Russinovich
> figured out how to implement sync on Win32:
> http://www.sysinternals.com/ntw2k/source/misc.shtml#Sync
>
> It requires administrative privileges, but it shouldn't be that
> difficult to write a tiny service that runs in the LocalSystem
> account, listens to a pipe and syncs all disks when asked.

I think we'd have to do it from scratch, because of these license terms:

-------------------------------

There is no charge to use any of the software published on this Web site
at home or at work, so long as each user downloads and installs the
product directly from www.sysinternals.com.

A commercial license is required to redistribute any of these utilities
directly (whether by computer media, a file server, an email attachment,
etc.) or to embed them in- or link them to- another program.
------------------------------

Also, do we want to force a broad brush sync() or just fsync our own files?

cheers

andrew


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 18:06:48
Message-ID: 200311161806.hAGI6mu11492@candle.pha.pa.us

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> One reason I like the idea of adopting a sync-when-you-write policy is
> >> that it eliminates the need for anything as messy as that.
>
> > Yes, but can we do it without causing a performance degradation, and I
> > would hate to change something to make things easier on Win32 while
> > penalizing all platforms.
>
> Having to keep a list of modified files in shared memory isn't a penalty?
>
> Seriously though, if we can move the bulk of the writing work into
> background processes then I don't believe that there will be any
> significant penalty for regular backends. And I believe that it would
> be a huge advantage from a correctness point of view if we could stop
> depending on sync(). The fact that Windows hasn't got sync() is merely
> another reason we should stop using it.

If the background writer starts using fsync(), we can have normal
backends that do a write() set a shared memory boolean. We can then
test that boolean and do sync() only if other backends had to do their
own writes.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 18:21:38
Message-ID: 29231.1069006898@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> Seriously though, if we can move the bulk of the writing work into
>> background processes then I don't believe that there will be any
>> significant penalty for regular backends.

> If the background writer starts using fsync(), we can have normal
> backends that do a write() set a shared memory boolean. We can then
> test that boolean and do sync() only if other backends had to do their
> own writes.

That seems like the worst of both worlds --- you still are depending on
sync() for correctness.

Also, as long as backends only *seldom* do writes, making them fsync a
write when they do make one will be less of an impact on overall system
performance than having a sync() ensue shortly afterwards. I think you
are focusing too narrowly on the idea that backends shouldn't ever wait
for writes, and failing to see the bigger picture. What we need to
optimize is overall system performance, not an arbitrary restriction
that certain processes never wait for certain things.

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 20:00:24
Message-ID: 3FB7D758.20608@Yahoo.com

Tom Lane wrote:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> Tom Lane wrote:
>>> Seriously though, if we can move the bulk of the writing work into
>>> background processes then I don't believe that there will be any
>>> significant penalty for regular backends.
>
>> If the background writer starts using fsync(), we can have normal
>> backends that do a write() set a shared memory boolean. We can then
>> test that boolean and do sync() only if other backends had to do their
>> own writes.
>
> That seems like the worst of both worlds --- you still are depending on
> sync() for correctness.
>
> Also, as long as backends only *seldom* do writes, making them fsync a
> write when they do make one will be less of an impact on overall system
> performance than having a sync() ensue shortly afterwards. I think you
> are focusing too narrowly on the idea that backends shouldn't ever wait
> for writes, and failing to see the bigger picture. What we need to
> optimize is overall system performance, not an arbitrary restriction
> that certain processes never wait for certain things.

Removing sync() entirely requires very accurate fsync()'ing in the
background writer, the checkpointer and the backends. Basically none of
them can mark a block "clean" if he fails to fsync() the relation later!
This will be a mess to code.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 20:17:08
Message-ID: 9687.1069013828@sss.pgh.pa.us

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> Removing sync() entirely requires very accurate fsync()'ing in the
> background writer, the checkpointer and the backends. Basically none of
> them can mark a block "clean" if he fails to fsync() the relation later!
> This will be a mess to code.

Not really. The O_SYNC solution for example would be trivial to code.

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 20:43:27
Message-ID: 3FB7E16F.8090701@Yahoo.com

Tom Lane wrote:

> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> Removing sync() entirely requires very accurate fsync()'ing in the
>> background writer, the checkpointer and the backends. Basically none of
>> them can mark a block "clean" if he fails to fsync() the relation later!
>> This will be a mess to code.
>
> Not really. The O_SYNC solution for example would be trivial to code.

Well, the bgwriter has basically the same chance the checkpointer has
... mdblindwrt() in the end, because he doesn't have the relcache handy.
So you want to open(O_SYNC), write(), close() every single block? I
don't expect that to be much better than a global sync().
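
To make the per-block pattern in question concrete, here is a minimal sketch
of "open(O_SYNC), write(), close()" for a single block. The function name
write_block_osync and its parameters are hypothetical and not taken from the
PostgreSQL source; it only illustrates the cost being discussed, namely that
every block pays for an open, a synchronous write, and a close.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write one block at 'offset' through a descriptor
 * opened with O_SYNC, so that pwrite() does not return until the data has
 * been handed to stable storage (as far as the OS reports it).
 * Returns 0 on success, -1 on any error. */
int write_block_osync(const char *path, off_t offset,
                      const void *block, size_t len)
{
    assert(path != NULL && block != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = pwrite(fd, block, len, offset);
    int rc = (n == (ssize_t) len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Keeping the descriptor open across blocks would avoid the repeated
open/close, which is the direction the reply below about a lower-level
notion of "open relation" points toward.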

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-16 22:46:17
Message-ID: 15635.1069022777@sss.pgh.pa.us

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> Well, the bgwriter has basically the same chance the checkpointer has
> ... mdblindwrt() in the end, because he doesn't have the relcache handy.

We could easily get rid of mdblindwrt --- there is no very good reason
that we use the relcache for I/O. There could and should be a
lower-level notion of "open relation" that bgwriter and checkpoint
could use. See recent discussion with Neil, for example. Vadim had
always wanted to do that, IIRC.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-17 01:58:48
Message-ID: 200311170158.hAH1wmc06667@candle.pha.pa.us

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> Seriously though, if we can move the bulk of the writing work into
> >> background processes then I don't believe that there will be any
> >> significant penalty for regular backends.
>
> > If the background writer starts using fsync(), we can have normal
> > backends that do a write() set a shared memory boolean. We can then
> > test that boolean and do sync() only if other backends had to do their
> > own writes.
>
> That seems like the worst of both worlds --- you still are depending on
> sync() for correctness.
>
> Also, as long as backends only *seldom* do writes, making them fsync a
> write when they do make one will be less of an impact on overall system
> performance than having a sync() ensue shortly afterwards. I think you
> are focusing too narrowly on the idea that backends shouldn't ever wait
> for writes, and failing to see the bigger picture. What we need to
> optimize is overall system performance, not an arbitrary restriction
> that certain processes never wait for certain things.

OK, let me give you my logic and you can tell me where I am wrong.

First, how many backends can a single write process support if all the
backends are doing insert/update/deletes? 5? 10? Let's assume 10.
Second, once we change write to write/fsync, how much slower will that
be? 100x, 1000x? Let's say 10x.

So, by my logic, if we have 100 backends all doing updates, we will need
10 * 100 or 1000 writer processes or threads to keep up with that load.
That seems quite excessive to me from a context switching and process
overhead perspective.

Where am I wrong?

Also, if we go with the fsync only at checkpoint, we are doing fsync's
once every minute (at checkpoint time) rather than several times a
second potentially.

Do we know that having the background writer fsync a file that was
written by a backend causes all the data to fsync? I think I could write
a program to test this by timing each of these tests:

create an empty file
open file
time fsync
close

open file
write 2mb into the file
time fsync
close

open file
write 2mb into the file
close
open file
time fsync
close

If I do the write via system(), I am doing it in a separate process so
the test should work. Should I try this?
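
The three timing cases listed above could be sketched roughly as follows.
This is not the program attached later in the thread. The function name
time_fsync, its parameters, and the 8 KB write chunk are assumptions chosen
for the example, and the write happens in the same process rather than via
system().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in seconds. */
static double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

/* Create/truncate 'path', write 'nbytes' into it, optionally close and
 * reopen the file, then time only the fsync() call.
 * Returns the elapsed seconds, or -1.0 on any error. */
double time_fsync(const char *path, size_t nbytes, int reopen_before_fsync)
{
    char chunk[8192];

    assert(path != NULL);
    memset(chunk, 'a', sizeof chunk);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    for (size_t left = nbytes; left > 0; ) {
        size_t want = left < sizeof chunk ? left : sizeof chunk;

        if (write(fd, chunk, want) != (ssize_t) want) {
            close(fd);
            return -1.0;
        }
        left -= want;
    }

    if (reopen_before_fsync) {
        /* Third case: the dirty data was written via a closed descriptor. */
        close(fd);
        fd = open(path, O_RDWR);
        if (fd < 0)
            return -1.0;
    }

    double start = now_seconds();
    int rc = fsync(fd);
    double elapsed = now_seconds() - start;

    close(fd);
    return rc == 0 ? elapsed : -1.0;
}
```

Calling time_fsync(path, 0, 0), time_fsync(path, 2 * 1024 * 1024, 0) and
time_fsync(path, 2 * 1024 * 1024, 1) corresponds to the empty-file, write
then fsync, and write, close, reopen then fsync cases. If the third timing
is close to the second, that suggests (on that platform only) that fsync on
the reopened file still had the earlier writes to flush.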

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-17 03:25:34
Message-ID: 4769.1069039534@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Where am I wrong?

I don't think any of this is relevant. There are a certain number of
blocks we have to get down to disk before we can declare a transaction
committed, and there are a certain number that we have to get down to
disk before we can declare a checkpoint complete. You are focusing too
much on the question of whether a particular process performs an fsync
operation, and ignoring the fact that ultimately it's got to wait for
I/O to complete --- directly or indirectly. If it blocks waiting for
some other process to declare a buffer clean, rather than writing for
itself, what's the difference?

Sure, fsync serializes the particular process that's doing it, but we
can deal with that by spreading the fsyncs across multiple processes,
and trying to ensure that they are mostly background processes rather
than foreground ones.

I don't claim that immediate-fsync-on-write is the only answer, but
I cannot follow your reasoning for dismissing it out of hand ... and I
certainly cannot buy *any* logic that says that sync() is a good answer
to any of these issues. AFAICS sync() means that we abandon
responsibility.

> Do we know that having the background writer fsync a file that was
> written by a backend causes all the data to fsync? I think I could write
> a program to test this by timing each of these tests:

That might prove something about the particular platform you tested it
on; but it would not speak to the real problem, which is what we can
assume is true on every platform...

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-17 04:33:46
Message-ID: 200311170433.hAH4Xki03920@candle.pha.pa.us

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Where am I wrong?
>
> I don't think any of this is relevant. There are a certain number of
> blocks we have to get down to disk before we can declare a transaction
> committed, and there are a certain number that we have to get down to
> disk before we can declare a checkpoint complete. You are focusing too
> much on the question of whether a particular process performs an fsync
> operation, and ignoring the fact that ultimately it's got to wait for
> I/O to complete --- directly or indirectly. If it blocks waiting for
> some other process to declare a buffer clean, rather than writing for
> itself, what's the difference?

The difference is two-fold. First, there might be 10 other backends
asking for writes, so it isn't clear that asking someone else to do the
write is as fast. Second, that other writer is doing fsync, so it is
100x or 1000x slower than our current setup.

> Sure, fsync serializes the particular process that's doing it, but we
> can deal with that by spreading the fsyncs across multiple processes,
> and trying to ensure that they are mostly background processes rather
> than foreground ones.

How many? That was my point, that it might require 1000 backend
processes _and_ it would be slower because we are doing write/fsync rather
than write. However, I think we could fix that by doing the write and
returning OK to the backend, then doing the fsync whenever we want ---
perhaps that was already your plan.

> I don't claim that immediate-fsync-on-write is the only answer, but
> I cannot follow your reasoning for dismissing it out of hand ... and I
> certainly cannot buy *any* logic that says that sync() is a good answer
> to any of these issues. AFAICS sync() means that we abandon
> responsibility.

sync() means we group the fsyncs into one massive one that syncs all
other processes' I/O too --- clearly bad, but I am hoping for something as
good as what we currently have, because that sync hopefully happens only
every few minutes.

> > Do we know that having the background writer fsync a file that was
> > written by a backend causes all the data to fsync? I think I could write
> > a program to test this by timing each of these tests:
>
> That might prove something about the particular platform you tested it
> on; but it would not speak to the real problem, which is what we can
> assume is true on every platform...

Yes, it would only be per platform. I was thinking we could have a
platform test and enable this behavior on platforms that support it
(all?) and use sync on the others.

Also, let me say I am glad we are delving into this. Our buffer system
has needed an overhaul for a while, and right now we already have some
major improvements for 7.5, and this discussion is just a continuation
of those improvements.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-17 05:46:34
Message-ID: 200311170546.hAH5kYn17867@candle.pha.pa.us

Tom Lane wrote:
> > Do we know that having the background writer fsync a file that was
> > written by a backend causes all the data to fsync? I think I could write
> > a program to test this by timing each of these tests:
>
> That might prove something about the particular platform you tested it
> on; but it would not speak to the real problem, which is what we can
> assume is true on every platform...

The attached program does test if fsync can be used on a file descriptor
after the file is closed and then reopened. I see:

write 0.000613
write & fsync 0.001727
write, close & fsync 0.001633

This shows that fsync works even after the file is closed and reopened.
I could test by writing using a subprocess, but I don't see how that
would be different, and it would mess up my timings.

Anyway, if we find all our platforms can pass this test, we might be
able to allow backends to do their own writes and just record the file
name somewhere for the checkpointer to fsync. It also shows write/fsync
was 3x slower than simple write.

Does anyone have a platform where the last duration is significantly
different from the middle timing?

I am keeping this discussion on patches because of the C program
attachment.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Attachment Content-Type Size
unknown_filename text/plain 2.1 KB

From: Shridhar Daithankar <shridhar_daithankar(at)persistent(dot)co(dot)in>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [pgsql-hackers-win32] SRA Win32 sync() code
Date: 2003-11-17 08:15:49
Message-ID: 200311171345.49674.shridhar_daithankar@persistent.co.in

On Monday 17 November 2003 11:16, Bruce Momjian wrote:
> Tom Lane wrote:
> > > Do we know that having the background writer fsync a file that was
> > > written by a backend causes all the data to fsync? I think I could
> > > write a program to test this by timing each of these tests:
> >
> > That might prove something about the particular platform you tested it
> > on; but it would not speak to the real problem, which is what we can
> > assume is true on every platform...
>
> The attached program does test if fsync can be used on a file descriptor
> after the file is closed and then reopened. I see:
>
> write 0.000613
> write & fsync 0.001727
> write, close & fsync 0.001633

ArchLinux, maxtor IDE HDD, write cache enabled.

[shridhar(at)daithan tmp]$ gcc -o test_fsync test_fsync.c
[shridhar(at)daithan tmp]$ ./test_fsync
write 0.002403
write & fsync 0.009423
write, close & fsync 0.006457
[shridhar(at)daithan tmp]$ uname -a
Linux daithan 2.4.21 #1 SMP Tue Jul 8 19:41:52 PDT 2003 i686 unknown

> Anyway, if we find all our platforms can pass this test, we might be
> able to allow backends to do their own writes and just record the file
> name somewhere for the checkpointer to fsync. It also shows write/fsync
> was 3x slower than simple write.
>
> Does anyone have a platform where the last duration is significantly
> different from the middle timing?

Does the 30% difference above count as significant?

Assuming fsync on a file descriptor flushes dirty buffers of that file, from
all processes, would the following be sufficient?

1. Open WAL with O_SYNC|O_DIRECT (the latter wherever possible) and issue
fsync on WAL files whenever required.

2. Use regular writes for data files and fsync them in the background.

Maybe if the background process is the only one that issues any fsync on data
files, that could maximize overall system throughput.

Say all backends write to a datafile and signal the background writer that
they are blocked waiting for this write to complete. BGWriter could batch all
such requests and flush/fsync them when there is enough disk activity.
Hopefully none of them would be stalled for too long. That way the slowest
part of the system, i.e. the disk, would be kept fully loaded.

Besides, since WAL writes are synchronous, backends can safely push a write
and move on to further business most of the time. I guess BGWriter has to
fsync the data files anyway to recycle a WAL segment.

In idle conditions, this mechanism should not be a problem.

Just a thought. Does this take care of sync?

> I am keeping this discussion on patches because of the C program
> attachment.

I dropped win32 list. I am not subscribed to it. Just getting thread out of
it.

I will write a short program which writes to a file in different processes and
attempts to fsync them from only one. Let's see what that turns out.

Shridhar


From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-17 10:28:34
Message-ID: 1069064914.20092.41.camel@fuji.krosing.net

Bruce Momjian wrote on Mon, 17.11.2003 at 03:58:

>
> OK, let me give you my logic and you can tell me where I am wrong.
>
> First, how many backends can a single write process support if all the
> backends are doing insert/update/deletes? 5? 10? Let's assume 10.
> Second, once we change write to write/fsync, how much slower will that
> be? 100x, 1000x? Let's say 10x.
>
> So, by my logic, if we have 100 backends all doing updates, we will need
> 10 * 100 or 1000 writer processes or threads to keep up with that load.
> That seems quite excessive to me from a context switching and process
> overhead perspective.
>
> Where am I wrong?

Maybe you meant 100/10 instead of 100*10 ;)

------------
Hannu


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgreSQL(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-17 15:32:11
Message-ID: 200311171532.hAHFWB305530@candle.pha.pa.us

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Where am I wrong?
>
> I don't think any of this is relevant. There are a certain number of
> blocks we have to get down to disk before we can declare a transaction
> committed, and there are a certain number that we have to get down to
> disk before we can declare a checkpoint complete. You are focusing too
> much on the question of whether a particular process performs an fsync
> operation, and ignoring the fact that ultimately it's got to wait for
> I/O to complete --- directly or indirectly. If it blocks waiting for
> some other process to declare a buffer clean, rather than writing for
> itself, what's the difference?
>
> Sure, fsync serializes the particular process that's doing it, but we
> can deal with that by spreading the fsyncs across multiple processes,
> and trying to ensure that they are mostly background processes rather
> than foreground ones.
>
> I don't claim that immediate-fsync-on-write is the only answer, but
> I cannot follow your reasoning for dismissing it out of hand ... and I
> certainly cannot buy *any* logic that says that sync() is a good answer
> to any of these issues. AFAICS sync() means that we abandon
> responsibility.

[ Discussion moved to hackers/win32.]

I was thinking about sync() --- one of its problems is that it schedules
writes to disk but returns before they are completed. If we do sync(),
then open, write, fsync, close a file, aren't we then pretty sure all
the scheduled sync writes have completed too? I know SCSI has tagged
queueing, but I figured creating a new file and writing/fsync would come
after the sync writes.



From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)tm(dot)ee>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-17 15:33:49
Message-ID: 200311171533.hAHFXns05677@candle.pha.pa.us

Hannu Krosing wrote:
> Bruce Momjian wrote on Mon, 17.11.2003 at 03:58:
>
> >
> > OK, let me give you my logic and you can tell me where I am wrong.
> >
> > First, how many backends can a single write process support if all the
> > backends are doing insert/update/deletes? 5? 10? Let's assume 10.
> > Second, once we change write to write/fsync, how much slower will that
> > be? 100x, 1000x? Let's say 10x.
> >
> > So, by my logic, if we have 100 backends all doing updates, we will need
> > 10 * 100 or 1000 writer processes or threads to keep up with that load.
> > That seems quite excessive to me from a context switching and process
> > overhead perspective.
> >
> > Where am I wrong?
>
> Maybe you meant 100/10 instead of 100*10 ;)

I figured 10 backends, but using fsync, they are not 100x slower (10 *
100). However, testing shows fsync is only 3x slower.



From: Kurt Roeckx <Q(at)ping(dot)be>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-19 20:07:34
Message-ID: 20031119200734.GA8685@ping.be

On Mon, Nov 17, 2003 at 12:46:34AM -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > > Do we know that having the background writer fsync a file that was
> > > written by a backend causes all the data to be fsynced? I think I could
> > > write a program to test this by timing each of these tests:
> >
> > That might prove something about the particular platform you tested it
> > on; but it would not speak to the real problem, which is what we can
> > assume is true on every platform...
>
> The attached program does test if fsync can be used on a file descriptor
> after the file is closed and then reopened. I see:
>
> write 0.000613
> write & fsync 0.001727
> write, close & fsync 0.001633

> Does anyone have a platform where the last duration is significantly
> different from the middle timing?

write 0.002807
write & fsync 0.015248
write, close & fsync 0.004696

This is a Linux 2.6.0-test5 on an old IDE disk.

The results change a lot. Another run shows:
write 0.002737
write & fsync 0.006658
write, close & fsync 0.008431

The first time is stable, the other 2 aren't.

On average, write & fsync would be about twice as slow as write,
close & fsync.

PS: Please specify some modes when creating files.

Kurt


From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: Shridhar Daithankar <shridhar_daithankar(at)persistent(dot)co(dot)in>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [pgsql-hackers-win32] SRA Win32 sync() code
Date: 2003-11-19 21:07:57
Message-ID: 3FBBDBAD.8050903@colorfullife.com

Shridhar Daithankar wrote:

> Does 30% difference above count as significant?

No. It's Linux, we can look at the sources: there is no per-fd cache,
the page cache is global. Thus fsync() flushes all of the file's dirty pages
to disk, regardless of which descriptor wrote them.
A problem could only occur if the file cache is not global - perhaps a
per-node file cache on NUMA systems - IRIX on an Origin 2000 cluster or
something similar.

But as I read the Unix spec, fsync is guaranteed to sync all data to disk.
Draft 6 of the POSIX-200x spec says:

  [SIO] If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
  force all currently queued I/O operations associated with the file
  indicated by file descriptor fildes to the synchronized I/O completion
  state. All I/O operations shall be completed as defined for synchronized
  I/O file integrity completion.

"All I/O operations associated with the file", not all operations
associated with the file descriptor.

--
Manfred


From: "Steve Tibbett" <stevex(at)stevex(dot)org>
To:
Cc: "'PostgreSQL Win32 port list'" <pgsql-hackers-win32(at)postgresql(dot)org>
Subject: Re: [PATCHES] SRA Win32 sync() code
Date: 2003-11-20 00:35:03
Message-ID: 3F61636001D91D89@mta3.wss.scd.yahoo.com

> I was thinking about sync() --- one of its problems is that it schedules
> writes to disk but returns before they are completed. If we do sync(),
> then open, write, fsync, close a file, aren't we then pretty sure all
> the scheduled sync writes have completed too? I know SCSI has tagged
> queueing, but I figured creating a new file and writing/fsync would come
> after the sync writes.

Sorry to jump in in the middle of this, but I'm just wondering if some of the
Win32 file creation flags might help out here. If the problem is figuring
out how to flush stuff to the disk, what about using FILE_FLAG_NO_BUFFERING?
I'm not sure how you'd create a stdio FILE with custom CreateFile flags, but
there's probably a way. I can see if I can figure it out if it looks like it
would be useful.

There are quite a few flags that are Win32 specific but that might help for
various things, like..

FILE_FLAG_DELETE_ON_CLOSE (you can't delete a file while it's open, but you
can open it specifying that it go away when you close it, in the case where
you know it's a temp file).

FILE_ATTRIBUTE_TEMPORARY (an optimization that hints to the filesystem that
this is a temporary file so don't bother writing it to disk (keep it in the
cache) unless you need to).

FILE_FLAG_WRITE_THROUGH (kinda like FILE_FLAG_NO_BUFFERING but different..
the SDK docs describe how they interact).

- Steve