Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

Lists: pgsql-hackers pgsql-performance
From: Michael Clemmons <glassresistor(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-10 20:41:08
Message-ID: 4427a97a0912101241s68a83ee2pdea64f7478d4d92b@mail.gmail.com

Hey,
I've got a computer which runs both 8.3 and 8.4. Creating a db takes 4s
on 8.3 and 9s on 8.4. I have many unit tests which create databases all
of the time, and they now run much slower than on 8.3. Even the 8.3 time
seems long, as I remember that at one point I considered creating a
database an instantaneous thing. Does anyone on the list know why this is
and whether I can get it back to normal?
-Michael


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Cc: Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-10 21:56:59
Message-ID: 200912102256.59868.andres@anarazel.de

On Thursday 10 December 2009 21:41:08 Michael Clemmons wrote:
> Hey,
> I've got a computer which runs but 8.3 and 8.4. To create a db it takes 4s
> for 8.3 and 9s for 8.4. I have many unit tests which create databases all
> of the time and now run much slower than 8.3 but it seems to be much longer
> as I remember at one point creating databases I considered an instantaneous
> thing. Does any on the list know why this is true and if I can get it back
> to normal.
Possibly you had fsync=off at the time?

Andres


From: Michael Clemmons <glassresistor(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-10 22:01:08
Message-ID: 4427a97a0912101401kb29e1a5ic5424d6d0b7efe62@mail.gmail.com

I'm not sure what that means. People in my office with slower hd speeds using
8.4 can create a db in 2s vs. my 8-12s. Could using md5 instead of ident do it?

On Thu, Dec 10, 2009 at 4:56 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On Thursday 10 December 2009 21:41:08 Michael Clemmons wrote:
> > Hey,
> > I've got a computer which runs but 8.3 and 8.4. To create a db it takes 4s
> > for 8.3 and 9s for 8.4. I have many unit tests which create databases all
> > of the time and now run much slower than 8.3 but it seems to be much longer
> > as I remember at one point creating databases I considered an instantaneous
> > thing. Does any on the list know why this is true and if I can get it back
> > to normal.
> Possibly you had fsync=off at the time?
>
> Andres
>


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Cc: Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-10 22:09:03
Message-ID: 200912102309.04057.andres@anarazel.de

Hi,

On Thursday 10 December 2009 23:01:08 Michael Clemmons wrote:
> Im not sure what that means ppl in my office with slower hd speeds using
> 8.4 can create a db in 2s vs my 8-12s.
- Possibly their config is different - they could have disabled the "fsync"
parameter, which makes the database no longer crash-safe but much faster
in some circumstances.

- Possibly you have much data in your template1 database?
You could check whether

CREATE DATABASE speedtest TEMPLATE template1; takes more time than
CREATE DATABASE speedtest TEMPLATE template0;.

You should issue both multiple times to ensure caching of the template
database doesn't play a role.

> Could using md5 instead of ident do it?
Seems unlikely.
Is starting psql near-instantaneous? Are you using "createdb" or are you
issuing "CREATE DATABASE ..."?

Andres
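
One way to repeat the template1 vs. template0 comparison above is a small timing helper. The following is a minimal Python sketch and was not posted in the thread. It assumes `psql` is on the PATH and can connect to the `postgres` maintenance database with default credentials; the helper names (`time_repeated`, `summarize`, `run_sql`, `compare_templates`) are made up for illustration.

```python
import statistics
import subprocess
import time


def time_repeated(action, cleanup=None, runs=5):
    """Call action() `runs` times and return each run's wall-clock seconds.

    cleanup(), if given, runs after every timed call and is not timed.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
        if cleanup is not None:
            cleanup()
    return durations


def summarize(durations):
    """Report the first run separately, since it may include cold-cache cost."""
    rest = durations[1:] or durations
    return {"first": durations[0], "median_rest": statistics.median(rest)}


def run_sql(statement):
    """Run one statement via psql (assumption: psql on PATH, default login)."""
    subprocess.run(["psql", "-d", "postgres", "-c", statement], check=True)


def compare_templates(runs=5):
    """Time CREATE DATABASE from each template; the DROP is left untimed."""
    results = {}
    for template in ("template1", "template0"):
        timings = time_repeated(
            lambda: run_sql(f"CREATE DATABASE speedtest TEMPLATE {template}"),
            cleanup=lambda: run_sql("DROP DATABASE speedtest"),
            runs=runs,
        )
        results[template] = summarize(timings)
    return results
```

Each measurement includes the time to start `psql`, which should be roughly the same for both templates, so the difference between the two medians is the number to look at.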


From: Nikolas Everett <nik9000(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 01:38:25
Message-ID: d4e11e980912101738t77b654b2j1db1c023eafd9a58@mail.gmail.com

In my limited experience ext4 as presented by Karmic is not db friendly. I
had to carve my swap partition into a swap partition and an xfs partition to
get better db performance. Try fsync=off first, but if that doesn't work
then try a mini xfs.

On Thu, Dec 10, 2009 at 5:09 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On Thursday 10 December 2009 23:01:08 Michael Clemmons wrote:
> > Im not sure what that means ppl in my office with slower hd speeds using
> > 8.4 can create a db in 2s vs my 8-12s.
> - Possibly their config is different - they could have disabled the "fsync"
> parameter which turns the database to be not crashsafe anymore but much
> faster
> in some circumstances.
>
> - Possibly you have much data in your template1 database?
> You could check whether
>
> CREATE DATABASE speedtest TEMPLATE template1; takes more time than
> CREATE DATABASE speedtest TEMPLATE template0;.
>
> You should issue both multiple times to ensure caching on the template
> database doesnt play a role.
>
> > Could using md5 instead of ident do it?
> Seems unlikely.
> Is starting psql near-instantaneus? Are you using "createdb" or are you
> issuing "CREATE DATABASE ..."?
>
> Andres
>


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Nikolas Everett <nik9000(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 17:58:39
Message-ID: 1260554319.2611.10.camel@jd-desktop.unknown.charter.com

On Thu, 2009-12-10 at 20:38 -0500, Nikolas Everett wrote:
> In my limited experience ext4 as presented by Karmic is not db
> friendly. I had to carve my swap partition into a swap partition and
> an xfs partition to get better db performance. Try fsync=off first,
> but if that doesn't work then try a mini xfs.

Do not turn fsync off. That is bad advice. I would not suggest ext4 at
this point for database operations. Use ext3. It is backward compatible.

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander


From: Nikolas Everett <nik9000(at)gmail(dot)com>
To: jd(at)commandprompt(dot)com
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 20:43:59
Message-ID: d4e11e980912111243s6b0a67b1u2c2c22031fadf756@mail.gmail.com

Turning fsync off on a dev database is a bad idea? Sure you might kill it
and have to start over, but that's kind of the point in a dev database.

On Fri, Dec 11, 2009 at 12:58 PM, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:

> On Thu, 2009-12-10 at 20:38 -0500, Nikolas Everett wrote:
> > In my limited experience ext4 as presented by Karmic is not db
> > friendly. I had to carve my swap partition into a swap partition and
> > an xfs partition to get better db performance. Try fsync=off first,
> > but if that doesn't work then try a mini xfs.
>
> Do not turn fsync off. That is bad advice. I would not suggest ext4 at
> this point for database operations. Use ext3. It is backward compatible.
>
> Joshua D. Drake
>
>
> --
> PostgreSQL.org Major Contributor
> Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
> Consulting, Training, Support, Custom Development, Engineering
> If the world pushes look it in the eye and GRR. Then push back harder. -
> Salamander
>
>


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Nikolas Everett <nik9000(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 20:50:10
Message-ID: 1260564610.2611.41.camel@jd-desktop.unknown.charter.com

On Fri, 2009-12-11 at 15:43 -0500, Nikolas Everett wrote:
> Turning fsync off on a dev database is a bad idea? Sure you might
> kill it and have to start over, but thats kind of the point in a dev
> database.

My experience is that bad dev practices turn into bad production
practices, whether intentionally or not.

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander


From: Nikolas Everett <nik9000(at)gmail(dot)com>
To: jd(at)commandprompt(dot)com
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 21:39:34
Message-ID: d4e11e980912111339q355f9fb2rf444f47a208b5b1e@mail.gmail.com

On Fri, Dec 11, 2009 at 3:50 PM, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:

> On Fri, 2009-12-11 at 15:43 -0500, Nikolas Everett wrote:
> > Turning fsync off on a dev database is a bad idea? Sure you might
> > kill it and have to start over, but thats kind of the point in a dev
> > database.
>
> My experience is that bad dev practices turn into bad production
> practices, whether intentionally or not.
>

Fair enough. I'm of the opinion that developers need to have their unit
tests run fast. If they aren't fast then you're just not going to test as
much as you should. If your unit tests *have* to createdb then you have to
do whatever you have to do to get it fast. It'd probably be better if unit
tests don't create databases or alter tables at all though.

Regardless of what is going on on your dev box you really should leave fsync
on on your continuous integration, integration test, and QA machines.
They're what you're really modeling your production on anyway.


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Nikolas Everett <nik9000(at)gmail(dot)com>
Cc: jd(at)commandprompt(dot)com, Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 21:57:56
Message-ID: dcc563d10912111357k538816baybbfdbe3b846f7aa3@mail.gmail.com

On Fri, Dec 11, 2009 at 2:39 PM, Nikolas Everett <nik9000(at)gmail(dot)com> wrote:
>
>
> On Fri, Dec 11, 2009 at 3:50 PM, Joshua D. Drake <jd(at)commandprompt(dot)com>
> wrote:
>>
>> On Fri, 2009-12-11 at 15:43 -0500, Nikolas Everett wrote:
>> > Turning fsync off on a dev database is a bad idea?  Sure you might
>> > kill it and have to start over, but thats kind of the point in a dev
>> > database.
>>
>> My experience is that bad dev practices turn into bad production
>> practices, whether intentionally or not.
>
> Fair enough.  I'm of the opinion that developers need to have their unit
> tests run fast.  If they aren't fast then your just not going to test as
> much as you should.  If your unit tests *have* to createdb then you have to
> do whatever you have to do to get it fast.  It'd probably be better if unit
> tests don't create databases or alter tables at all though.

This is my big issue. dropping / creating databases for unit tests is
overkill. Running any DDL at all for a unit test seems wrong to me
too. Insert a row if you need it, MAYBE. Unit tests should work with
a test database that HAS the structure and database already in place.

What happens if your unit tests get loose in production and drop a
database, or a table? Not good.


From: Scott Mead <scott(dot)lists(at)enterprisedb(dot)com>
To: Nikolas Everett <nik9000(at)gmail(dot)com>
Cc: jd <jd(at)commandprompt(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 21:59:43
Message-ID: d3ab2ec80912111359s47925b0dl7ce54b849767bacc@mail.gmail.com

On Fri, Dec 11, 2009 at 4:39 PM, Nikolas Everett <nik9000(at)gmail(dot)com> wrote:

>
>
>
> Fair enough. I'm of the opinion that developers need to have their unit
> tests run fast. If they aren't fast then your just not going to test as
> much as you should. If your unit tests *have* to createdb then you have to
> do whatever you have to do to get it fast. It'd probably be better if unit
> tests don't create databases or alter tables at all though.
>
> Regardless of what is going on on your dev box you really should leave
> fsync on on your continuous integration, integration test, and QA machines.
> They're what your really modeling your production on anyway.
>

The other common issue is that developers running with something like
'fsync=off' means that they have completely unrealistic expectations of the
performance surrounding something. If your developers see that when fsync
is on, createdb takes x seconds vs. when it's off, then they'll know that
basing their entire process on that probably isn't a good idea. When
developers think something is lightning, they tend to base lots of stuff on
it, whether it's production ready or not.

--Scott


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Nikolas Everett <nik9000(at)gmail(dot)com>
Cc: "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 22:12:45
Message-ID: C74803DD.1B1BF%scott@richrelevance.com


On 12/11/09 1:57 PM, "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com> wrote:

>
> This is my big issue. dropping / creating databases for unit tests is
> overkill. Running any DDL at all for a unit test seems wrong to me
> too. Insert a row if you need it, MAYBE. Unit tests should work with
> a test database that HAS the structure and database already in place.
>
> What happens if your unit tests get lose in production and drop a
> database, or a table. Not good.
>

Production should not have a db with the same username/pw combination as dev
boxes and unit tests . . .

Unfortunately, unit-like (often more than a 'unit') tests can't always rely
on a test db being already set up. If one leaves any cruft around, it might
break later tests non-deterministically. Automated tests that
insert data are absolutely required somewhere if the application inserts
data.

The best way to do this in postgres is to create a template database from
scratch with whatever DDL is needed at the start of the run, and then create
and drop db's as copies of that template per test or test suite.

So no, it's not overkill at all IMO. I do wish to avoid it, and ideally all
tests clean up after themselves, but in practice this does not happen and
results in hard to track down issues where test X fails because of something
that any one of tests A to W did (all of which pass), often wasting time of
the most valuable developers -- those who know the majority of the system
well enough to track down such issues across the whole system.

One thing to consider is putting this temp database in a RAMFS or ramdisk,
since postgres does a lot of file creates and fsyncs when cloning a db from
a template. For almost all such test db's the actual data is small, but the
# of tables is large.
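
The template-per-test approach described above can be sketched as a small context manager. This is a minimal Python illustration under stated assumptions, not code from the thread: `execute` is a hypothetical callable that runs one SQL statement on a connection that is not inside a transaction block (CREATE DATABASE cannot run inside one), and `cloned_database` and the `app_template` name in the usage note are invented for the example.

```python
import contextlib
import re
import uuid

# Only plain lowercase identifiers are accepted, because the names are
# interpolated directly into the SQL text.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


@contextlib.contextmanager
def cloned_database(execute, template, prefix="test"):
    """Create a throwaway copy of `template`, yield its name, then drop it.

    `execute` is a hypothetical callable taking one SQL string; it must run
    the statement outside a transaction block.
    """
    if not _IDENT.match(template) or not _IDENT.match(prefix):
        raise ValueError("template and prefix must be plain identifiers")
    name = f"{prefix}_{uuid.uuid4().hex[:12]}"
    execute(f"CREATE DATABASE {name} TEMPLATE {template}")
    try:
        yield name
    finally:
        # Drop even if the test body raised, so no cruft is left behind.
        execute(f"DROP DATABASE {name}")
```

A test suite would build `app_template` once with its DDL at the start of the run and then wrap each test (or each suite) in `with cloned_database(execute, "app_template") as dbname:`, connecting to `dbname` inside the block.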



From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Scott Mead <scott(dot)lists(at)enterprisedb(dot)com>
Cc: Nikolas Everett <nik9000(at)gmail(dot)com>, jd <jd(at)commandprompt(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 22:12:47
Message-ID: dcc563d10912111412j30b7c0d1n125e9d965053dfa3@mail.gmail.com

On Fri, Dec 11, 2009 at 2:59 PM, Scott Mead
<scott(dot)lists(at)enterprisedb(dot)com> wrote:
> On Fri, Dec 11, 2009 at 4:39 PM, Nikolas Everett <nik9000(at)gmail(dot)com> wrote:
>>
>>
>>
>> Fair enough.  I'm of the opinion that developers need to have their unit
>> tests run fast.  If they aren't fast then your just not going to test as
>> much as you should.  If your unit tests *have* to createdb then you have to
>> do whatever you have to do to get it fast.  It'd probably be better if unit
>> tests don't create databases or alter tables at all though.
>>
>> Regardless of what is going on on your dev box you really should leave
>> fsync on on your continuous integration, integration test, and QA machines.
>> They're what your really modeling your production on anyway.
>
>
>   The other common issue is that developers running with something like
> 'fsync=off' means that they have completely unrealistic expectations of the
> performance surrounding something.  If your developers see that when fsync
> is on, createdb takes x seconds vs. when it's off, then they'll know that
> basing their entire process on that probably isn't a good idea.  When
> developers think something is lightning, they tend to base lots of stuff on
> it, whether it's production ready or not.

Yeah, it's a huge mistake to give development super fast servers to
test on. Keep in mind production may need to handle 10k requests a
minute / second whatever. Developers cannot generate that kind of
load by just pointing and clicking. Our main production is on a
cluster of 8 and 12 core machines with scads of memory and RAID-10
arrays all over the place. Development gets a 4 core machine with 8G
ram and an 8 drive RAID-6. It ain't slow, but it ain't really that
fast either.


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Nikolas Everett <nik9000(at)gmail(dot)com>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 22:19:05
Message-ID: dcc563d10912111419y5074fab3w7f620a28a141012@mail.gmail.com

On Fri, Dec 11, 2009 at 3:12 PM, Scott Carey <scott(at)richrelevance(dot)com> wrote:
>
> On 12/11/09 1:57 PM, "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com> wrote:
>
>>
>> This is my big issue.  dropping / creating databases for unit tests is
>> overkill.  Running any DDL at all for a unit test seems wrong to me
>> too.  Insert a row if you need it, MAYBE.  Unit tests should work with
>> a test database that HAS the structure and database already in place.
>>
>> What happens if your unit tests get lose in production and drop a
>> database, or a table.  Not good.
>>
>
> Production should not have a db with the same username/pw combination as dev
> boxes and unit tests . . .
>
> Unfortunately, unit-like (often more than a 'unit') tests can't always rely
> on a test db being already set up.  If one leaves any cruft around, it might
> break later tests later on non-deterministically.  Automated tests that
> insert data are absolutely required somewhere if the application inserts
> data.
>
> The best way to do this in postgres is to create a template database from
> scratch with whatever DDL is needed at the start of the run, and then create
> and drop db's as copies of that template per test or test suite.

Debatable. Last job we had 44k or so unit tests, and we gave each
dev their own db made from the main qa / unit testing db that they
could refresh at any time, and run the unit tests locally before
committing code. Actual failures like the one you mention were very
rare because of this approach. A simple ant refresh-db and they were
ready to test their code before committing it to the continuous
testing farm.


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Scott Mead <scott(dot)lists(at)enterprisedb(dot)com>
Cc: Nikolas Everett <nik9000(at)gmail(dot)com>, jd <jd(at)commandprompt(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 22:39:54
Message-ID: 4B22CA3A.1040909@2ndquadrant.com

Scott Mead wrote:
> The other common issue is that developers running with something like
> 'fsync=off' means that they have completely unrealistic expectations
> of the performance surrounding something.
Right, but the flip side here is that often the production server will
have hardware such as a caching RAID card that vastly improves
performance in this area. There's some room to cheat in order to
make up for the dev system's lack of such things, while still not giving a
completely unrealistic view of performance.

As far as I'm concerned, using "fsync=off" is almost never excusable if
you're running 8.3 or later where "synchronous_commit=off" is a
possibility. If you use that, it will usually improve the worst part of
commit issues substantially. And it happens in a way that's actually
quite similar to how a caching write production server will run: small
writes happen instantly, but eventually bigger ones will end up
bottlenecked at the disks anyway.

It would improve the average safety of our community members if anytime
someone suggests "fsync=off", we strongly suggest
"synchronous_commit=off" and potentially tuning its interval instead as
a middle ground, while still helping people who need to speed their
systems up. Saying "never turn fsync off" without suggesting this
alternative is counter-productive. If you're in the sort of position
where fsync is killing your performance you'll do anything to speed
things up (I've seen a 100:1 speed improvement) no matter how risky.
I've run a production system on 8.2 with fsync off, a TB of data, and no
safety net if a crash introduced corruption beyond a ZFS snapshot. It
wasn't fun, but it was the only possibility to get bulk loading (there
was an ETL step in the middle after COPY) to happen fast enough. Using
async commit instead is a much better approach now that it's available.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Michael Clemmons <glassresistor(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 22:52:01
Message-ID: 4427a97a0912111452m35552852ub7c3fe30f8191849@mail.gmail.com

Thanks all, this has been a good help.
I don't have control (or easy control) over unit tests creating/deleting
databases since I'm using the Django framework for this job. Createdb takes
12 secs on my system (Ubuntu 9.10, pg 8.4 and ext4), which is impossibly slow
for running 200 unit tests. Turning fsync off got it to .2 secs or so, which
is blazing but also the speed I expected, being used to 8.3 and xfs. This dev
box is my laptop and the data is literally unimportant and doesn't exist
longer than 20 sec, but I'm all about good practices. Will definitely try
synchronous commit tonight once I'm done working for the day. I've got some
massive copying to do later though, so this will probably help in the future
as well.

On Fri, Dec 11, 2009 at 5:39 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

> Scott Mead wrote:
>
>> The other common issue is that developers running with something like
>> 'fsync=off' means that they have completely unrealistic expectations of the
>> performance surrounding something.
>>
> Right, but the flip side here is that often the production server will have
> hardware such as a caching RAID card that vastly improves performance in
> this area. There's some room to cheat in order to accelerate the dev
> systems lack of such things, while still not giving a completely unrealistic
> view of performance.
>
> As far as I'm concerned, using "fsync=off" is almost never excusable if
> you're running 8.3 or later where "synchronous_commit=off" is a possibility.
> If you use that, it will usually improve the worst part of commit issues
> substantially. And it happens in a way that's actually quite similar to how
> a caching write production server will run: small writes happen instantly,
> but eventually bigger ones will end up bottlenecked at the disks anyway.
>
> It would improve the average safety of our community members if anytime
> someone suggests "fsync=off", we strongly suggest "synchronous_commit=off"
> and potentially tuning its interval instead as a middle ground, while still
> helping people who need to speed their systems up. Saying "never turn fsync
> off" without suggesting this alternative is counter-productive. If you're
> in the sort of position where fsync is killing your performance you'll do
> anything to speed things up (I've seen a 100:1 speed improvement) no matter
> how risky. I've ran a production system of 8.2 with fsync off, a TB of
> data, and no safety net if a crash introduced corruption beyond a ZFS
> snapshot. It wasn't fun, but it was the only possibility to get bulk
> loading (there was an ETL step in the middle after COPY) to happen fast
> enough. Using async commit instead is a much better approach now that it's
> available.
>
> --
> Greg Smith 2ndQuadrant Baltimore, MD
> PostgreSQL Training, Services and Support
> greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com
>
>


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Michael Clemmons <glassresistor(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-11 23:59:13
Message-ID: dcc563d10912111559t4e2524di14369c2b4e3d802c@mail.gmail.com

On Fri, Dec 11, 2009 at 3:52 PM, Michael Clemmons
<glassresistor(at)gmail(dot)com> wrote:
> Thanks all this has been a good help.
> I don't have control(or easy control) over unit tests creating/deleting
> databases since Im using the django framework for this job.

Reminds me of the issues we had with Ruby on Rails and its (at the time)
very mysql-centric tools that made us take a fork to large portions of
its brain to get things like this working. Worked with a developer
for a day or two fixing most of the worst mysqlisms in RoR at the time
to just get this kind of stuff working.

>  Createdb takes
> 12secs on my system(9.10 pg8.4 and ext4)  which is impossibly slow for
> running 200unittests.

Wait, so each unit test createdbs by itself? Wow...

>  Fsync got it to .2secs or so which is blazing but
> also the speed I expected being used to 8.3 and xfs.  This dev box is my
> laptop and the data is litterally unimportant and doesn't exist longer than
> 20sec but Im all about good practices.  Will definately try synchronous
> commit tonight once Im done working for the day.  I've got some massive
> copying todo later though so this will probably help in the future as well.

Yeah, I'd probably resort to fsync off in that circumstance too
especially if sync commit off didn't help that much.


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-12 00:19:38
Message-ID: 200912120119.39272.andres@anarazel.de

Hi,

On Saturday 12 December 2009 00:59:13 Scott Marlowe wrote:
> On Fri, Dec 11, 2009 at 3:52 PM, Michael Clemmons
> > Createdb takes
> > 12secs on my system(9.10 pg8.4 and ext4) which is impossibly slow for
> > running 200unittests.
> > Fsync got it to .2secs or so which is blazing but
> > also the speed I expected being used to 8.3 and xfs. This dev box is my
> > laptop and the data is litterally unimportant and doesn't exist longer
> > than 20sec but Im all about good practices. Will definately try
> > synchronous commit tonight once Im done working for the day. I've got
> > some massive copying todo later though so this will probably help in the
> > future as well.
> Yeah, I'd probably resort to fsync off in that circumstance too
> especially if syn commit off didn't help that much.
How should sync commit help with creating databases?

The problem with 8.4 and creating databases is that the number of files
increased hugely because of the introduction of relation forks.
It probably wouldn't be that hard to copy all files first, then reopen and
fsync them. Actually that should be a patch doable in an hour or two.

Andres
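
The ordering proposed above (copy every file first, then reopen and fsync each one in a second pass) can be illustrated outside PostgreSQL. Below is a minimal Python sketch of that ordering, assuming a flat directory of regular files. It is not the patch discussed here (PostgreSQL's own directory copy is written in C), and `copy_then_fsync` is a name invented for this example. The presumed benefit is that the kernel can write back data for many files together instead of being forced to flush after every single file.

```python
import os
import shutil


def copy_then_fsync(src_dir, dst_dir):
    """Copy all regular files from src_dir to a new dst_dir, then fsync them.

    Pass 1 only copies, so no flush is forced between files.
    Pass 2 reopens each copy and fsyncs it.
    Returns the list of destination paths.
    """
    os.makedirs(dst_dir)
    copied = []

    # Pass 1: copy without syncing.
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            dst = os.path.join(dst_dir, name)
            shutil.copyfile(src, dst)
            copied.append(dst)

    # Pass 2: reopen each copied file and force it to stable storage.
    for path in copied:
        with open(path, "rb+") as f:
            os.fsync(f.fileno())

    # On POSIX systems, also sync the directory so the new entries persist.
    if os.name == "posix":
        dir_fd = os.open(dst_dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)

    return copied
```

The crash-safety requirement is unchanged: the copy must not be treated as durable until the second pass has finished.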


From: Michael Clemmons <glassresistor(at)gmail(dot)com>
To: Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance(at)postgresql(dot)org, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-12 20:36:27
Message-ID: 4427a97a0912121236g4c06c562u221d638810ed2b6e@mail.gmail.com

If people think it's worth it, I'll create a ticket.

On Sat, Dec 12, 2009 at 6:09 AM, Hannu Krosing <hannu(at)2ndquadrant(dot)com> wrote:

> On Sat, 2009-12-12 at 01:19 +0100, Andres Freund wrote:
> > Hi,
> >
> > On Saturday 12 December 2009 00:59:13 Scott Marlowe wrote:
> > > On Fri, Dec 11, 2009 at 3:52 PM, Michael Clemmons
> > > > Createdb takes
> > > > 12secs on my system(9.10 pg8.4 and ext4) which is impossibly slow for
> > > > running 200unittests.
> > > > Fsync got it to .2secs or so which is blazing but
> > > > also the speed I expected being used to 8.3 and xfs. This dev box is my
> > > > laptop and the data is litterally unimportant and doesn't exist longer
> > > > than 20sec but Im all about good practices. Will definately try
> > > > synchronous commit tonight once Im done working for the day. I've got
> > > > some massive copying todo later though so this will probably help in the
> > > > future as well.
> > > Yeah, I'd probably resort to fsync off in that circumstance too
> > > especially if syn commit off didn't help that much.
> >
> > How should syn commit help with creating databases?
>
> It does not help here. Tested ;)
>
> > The problem with 8.4 and creating databases is that the number of files
> > increased hugely because of the introduction of relation forks.
>
> Plus the fact that fsync on ext4 is really slow. some info here:
>
> http://ldn.linuxfoundation.org/article/filesystems-data-preservation-fsync-and-benchmarks-pt-3
>
> > It probably wouldnt be that hard to copy all files first, then reopen and fsync
> > them. Actually that should be a patch doable in an hour or two.
>
> Probably something worth doing, as it will speed this up on all
> filesystems, and doubly so on ext4 and xfs.
>
> --
> Hannu Krosing http://www.2ndQuadrant.com
> PostgreSQL Scalability and Availability
> Services, Consulting and Training
>
>
>


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Cc: Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-12 20:38:41
Message-ID: 200912122138.42404.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> If ppl think its worth it I'll create a ticket
Thanks, no need. I will post a patch tomorrow or so.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Scott Mead <scott(dot)lists(at)enterprisedb(dot)com>, Nikolas Everett <nik9000(at)gmail(dot)com>, jd <jd(at)commandprompt(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>, Michael Clemmons <glassresistor(at)gmail(dot)com>
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-13 03:56:42
Message-ID: 603c8f070912121956l7a634251yf6f544e828961875@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Fri, Dec 11, 2009 at 5:12 PM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> On Fri, Dec 11, 2009 at 2:59 PM, Scott Mead
> <scott(dot)lists(at)enterprisedb(dot)com> wrote:
>> On Fri, Dec 11, 2009 at 4:39 PM, Nikolas Everett <nik9000(at)gmail(dot)com> wrote:
>>>
>>>
>>>
>>> Fair enough.  I'm of the opinion that developers need to have their unit
>>> tests run fast.  If they aren't fast then your just not going to test as
>>> much as you should.  If your unit tests *have* to createdb then you have to
>>> do whatever you have to do to get it fast.  It'd probably be better if unit
>>> tests don't create databases or alter tables at all though.
>>>
>>> Regardless of what is going on on your dev box you really should leave
>>> fsync on on your continuous integration, integration test, and QA machines.
>>> They're what your really modeling your production on anyway.
>>
>>
>>   The other common issue is that developers running with something like
>> 'fsync=off' means that they have completely unrealistic expectations of the
>> performance surrounding something.  If your developers see that when fsync
>> is on, createdb takes x seconds vs. when it's off, then they'll know that
>> basing their entire process on that probably isn't a good idea.  When
>> developers think something is lightning, they tend to base lots of stuff on
>> it, whether it's production ready or not.
>
> Yeah, it's a huge mistake to give development super fast servers to
> test on.  Keep in mind production may need to handle 10k requests a
> minute / second whatever.  Developers cannot generate that kind of
> load by just pointing and clicking.  Our main production is on a
> cluster of 8 and 12 core machines with scads of memory and RAID-10
> arrays all over the place.  Development gets a 4 core machine with 8G
> ram and an 8 drive RAID-6.  It ain't slow, but it ain't really that
> fast either.

My development box at work is a 1.8 GHz Celeron with 256K of CPU
cache, 1 GB of memory, and a single IDE drive... I don't have too
many slow queries in there.

...Robert


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Cc: Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-28 22:54:51
Message-ID: 200912282354.51892.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > If ppl think its worth it I'll create a ticket
> Thanks, no need. I will post a patch tomorrow or so.
Well. It was a long day...

Anyway.
In this patch I delay the fsync done in copy_file and instead do a second pass
over the directory in copy_dir, fsyncing everything in that pass.
That includes the directory itself, which was not done before and might actually
be necessary in some cases.
I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that the
copied file reaches storage before the fsync. Without it, the speed benefits were
quite a bit smaller and essentially random (which seems sensible).

This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s on my
laptop. That is still slower than with fsync off (~0.25s), but quite a worthy
improvement.

The benefits are obviously bigger if the template database contains anything
added to it.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-28 22:59:43
Message-ID: 200912282359.43979.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Monday 28 December 2009 23:54:51 Andres Freund wrote:
> On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> > On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > > If ppl think its worth it I'll create a ticket
> >
> > Thanks, no need. I will post a patch tomorrow or so.
>
> Well. It was a long day...
>
> Anyway.
> In this patch I delay the fsync done in copy_file and simply do a second
> pass over the directory in copy_dir and fsync everything in that pass.
> Including the directory - which was not done before and actually might be
> necessary in some cases.
> I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that the
> copied file reaches storage before the fsync. Without the speed benefits
> were quite a bit smaller and essentially random (which seems sensible).
>
> This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s on
> my laptop. Still slower than with fsync off (~0.25) but quite a worthy
> improvement.
>
> The benefits are obviously bigger if the template database includes
> anything added.
Obviously, the patch itself would be helpful.

Andres

Attachment Content-Type Size
0001-Delay-fsyncing-files-during-copying-in-CREATE-DATABA.patch text/x-patch 3.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-28 23:06:28
Message-ID: 3454.1262041588@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Andres Freund <andres(at)anarazel(dot)de> writes:
> This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s on my
> laptop. Still slower than with fsync off (~0.25) but quite a worthy
> improvement.

I can't help wondering whether that's real or some kind of
platform-specific artifact. I get numbers more like 3.5s (fsync off)
vs 4.5s (fsync on) on a machine where I believe the disks aren't lying
about write-complete. It makes sense that an fsync at the end would be
a little bit faster, because it would give the kernel some additional
freedom in scheduling the required I/O, but it isn't cutting the total
I/O required at all. So I find it really hard to believe a 10x speedup.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-28 23:20:35
Message-ID: 200912290020.35843.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 29 December 2009 00:06:28 Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop. Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
> I can't help wondering whether that's real or some kind of
> platform-specific artifact. I get numbers more like 3.5s (fsync off)
> vs 4.5s (fsync on) on a machine where I believe the disks aren't lying
> about write-complete. It makes sense that an fsync at the end would be
> a little bit faster, because it would give the kernel some additional
> freedom in scheduling the required I/O, but it isn't cutting the total
> I/O required at all. So I find it really hard to believe a 10x speedup.
Well, a template database is about 5.5MB here; that shouldn't take too
long when written near-sequentially?
As I said, the real benefit only occurred after adding posix_fadvise(..,
FADV_DONTNEED), which is somewhat plausible, because e.g. the directory entries
don't need to get scheduled for every file and because the kernel can reorder a
whole directory nearly sequentially. Without the advice the kernel doesn't
know in time that it should write that data back, and it won't do so for 5
seconds or so by default on Linux...

I looked at the strace output and it looks sensible timewise to me. If you're
interested I can send you that output.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-28 23:31:56
Message-ID: 200912290031.57136.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 29 December 2009 00:06:28 Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop. Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
>
> I can't help wondering whether that's real or some kind of
> platform-specific artifact. I get numbers more like 3.5s (fsync off)
> vs 4.5s (fsync on) on a machine where I believe the disks aren't lying
> about write-complete. It makes sense that an fsync at the end would be
> a little bit faster, because it would give the kernel some additional
> freedom in scheduling the required I/O, but it isn't cutting the total
> I/O required at all. So I find it really hard to believe a 10x speedup.
I only have comfortable access to two smaller machines without a BBU from here
(being at Hacker Jeopardy at the CCC congress ;-)) and both show this
behaviour. I guess it's somewhat filesystem dependent.

Andres


From: Thomas Kellerer <spam_eater(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: 8.4.1 ubuntu karmic slow createdb
Date: 2009-12-28 23:57:42
Message-ID: hhbglj$2vc$1@ger.gmane.org
Lists: pgsql-hackers pgsql-performance

Michael Clemmons wrote on 11.12.2009 23:52:
> Thanks all this has been a good help.
> I don't have control(or easy control) over unit tests creating/deleting
> databases since Im using the django framework for this job. Createdb
> takes 12secs on my system(9.10 pg8.4 and ext4) which is impossibly slow
> for running 200unittests.

I wonder if you could simply create one database, and then a new schema for each of the tests.

After creating the schema you could alter the search_path for the "unit test user" and it would look like a completely new database.
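
A minimal sketch of the schema-per-test idea, not a tested recipe. The role name unit_test_user and the test_ prefix are hypothetical; the helper only builds SQL strings and never connects to a server. CREATE SCHEMA, ALTER ROLE ... SET search_path and DROP SCHEMA ... CASCADE are regular PostgreSQL commands.

```python
import re


def schema_statements(test_name, role="unit_test_user"):
    """Build SQL that gives one test its own schema instead of its own database.

    `role` and the `test_` prefix are hypothetical names chosen for this sketch.
    """
    # Keep only characters that are safe in an unquoted identifier.
    schema = "test_" + re.sub(r"[^a-z0-9_]", "_", test_name.lower())
    setup = [
        f"CREATE SCHEMA {schema};",
        # New sessions of the role resolve unqualified names in this schema.
        f"ALTER ROLE {role} SET search_path TO {schema};",
    ]
    teardown = [
        f"ALTER ROLE {role} RESET search_path;",
        f"DROP SCHEMA {schema} CASCADE;",
    ]
    return setup, teardown
```

Note that ALTER ROLE ... SET only takes effect for sessions started afterwards; a test runner that keeps one connection open would probably issue a plain SET search_path in that session instead.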

Thomas


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 00:27:29
Message-ID: 407d949e0912281627q2b857c8cn672e3a9b2cb083a0@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> fsync everything in that pass.
> Including the directory - which was not done before and actually might be
> necessary in some cases.

Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
but it doesn't look like we're going to convince the ext4 developers
they're crazy any day soon and it would really suck for a database
created from a template to have files in it go missing.
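
For illustration, fsyncing a directory on a POSIX system might look like the sketch below. It is not the code from the patch; fsync_dir and create_durably are hypothetical helper names, and it assumes a platform (such as Linux) where a directory can be opened read-only and passed to fsync.

```python
import os


def fsync_dir(path):
    """Flush a directory's own entries (the names of files in it) to storage."""
    # A directory is opened read-only; fsync on the descriptor then asks the
    # kernel to persist the directory entries, not the file contents.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def create_durably(directory, name, data):
    """Write a new file, then fsync both the file and its parent directory."""
    target = os.path.join(directory, name)
    with open(target, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # file contents and inode
    fsync_dir(directory)       # the entry that makes the file findable
    return target
```

Without the second call, the file data may be on disk while the name pointing at it is not, which is the "files go missing" case.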

--
greg


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 00:29:34
Message-ID: 200912290129.35267.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 29 December 2009 01:27:29 Greg Stark wrote:
> On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > fsync everything in that pass.
> > Including the directory - which was not done before and actually might be
> > necessary in some cases.
>
> Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
> but it doesn't look like we're going to convince the ext4 developers
> they're crazy any day soon and it would really suck for a database
> created from a template to have files in it go missin.
Actually it was necessary on ext3 as well; the window for hitting the problem was
just much smaller, wasn't it?

Actually, that part should possibly get backported.

Andres


From: david(at)lang(dot)hm
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 00:30:17
Message-ID: alpine.DEB.2.00.0912281629150.24130@asgard.lang.hm
Lists: pgsql-hackers pgsql-performance

On Tue, 29 Dec 2009, Greg Stark wrote:

> On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> fsync everything in that pass.
>> Including the directory - which was not done before and actually might be
>> necessary in some cases.
>
> Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
> but it doesn't look like we're going to convince the ext4 developers
> they're crazy any day soon and it would really suck for a database
> created from a template to have files in it go missin.

actually, as I understand it you need to do this on all filesystems except
ext3, and on ext3 fsync is horribly slow because it writes out
_everything_ that's pending, not just stuff related to the file you do the
fsync on.

David Lang


From: Andres Freund <andres(at)anarazel(dot)de>
To: david(at)lang(dot)hm
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 00:43:15
Message-ID: 200912290143.15866.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 29 December 2009 01:30:17 david(at)lang(dot)hm wrote:
> On Tue, 29 Dec 2009, Greg Stark wrote:
> > On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> fsync everything in that pass.
> >> Including the directory - which was not done before and actually might
> >> be necessary in some cases.
> >
> > Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
> > but it doesn't look like we're going to convince the ext4 developers
> > they're crazy any day soon and it would really suck for a database
> > created from a template to have files in it go missin.
>
> actually, as I understand it you need to do this on all filesystems except
> ext3, and on ext3 fsync is horribly slow because it writes out
> _everything_ that's pending, not just stuff related to the file you do the
> fsync on.
I don't think it's all filesystems (ext2 should not be affected...), but generally
you're right. At least jfs and xfs are affected as well.

Btw, it's not necessarily nearly-safe and slow on ext3 either (with data=writeback).

Andres


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 00:46:21
Message-ID: 4B39515D.6050309@2ndquadrant.com
Lists: pgsql-hackers pgsql-performance

Andres Freund wrote:
> As I said the real benefit only occurred after adding posix_fadvise(..,
> FADV_DONTNEED) which is somewhat plausible, because i.e. the directory entries
> don't need to get scheduled for every file and because the kernel can reorder a
> whole directory nearly sequentially. Without the advice it the kernel doesn't
> know in time that it should write that data back and it wont do it for 5
> seconds by default on linux or such...
>
I know they just fiddled with the logic in the last release, but for
most of the Linux kernels out there now pdflush wakes up every 5 seconds
by default. But typically it only worries about writing things that
have been in the queue for 30 seconds or more until you've filled quite
a bit of memory, so that's also an interesting number. I tried to
document the main tunables here and describe how they fit together at
http://www.westnet.com/~gsmith/content/linux-pdflush.htm
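
As a rough illustration, the two intervals mentioned above can be read on Linux from /proc/sys/vm. This sketch assumes the usual tunable names dirty_writeback_centisecs (wakeup interval) and dirty_expire_centisecs (age at which dirty data is written); the function names are hypothetical.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")


def centisecs_to_seconds(raw):
    """Convert the text of a *_centisecs tunable into seconds."""
    return int(raw.strip()) / 100.0


def read_writeback_timers(vm_dir=VM_DIR):
    """Return the wakeup interval and dirty-page expiry age in seconds.

    Missing files (non-Linux systems, restricted containers) map to None.
    """
    timers = {}
    for name in ("dirty_writeback_centisecs", "dirty_expire_centisecs"):
        path = Path(vm_dir) / name
        timers[name] = centisecs_to_seconds(path.read_text()) if path.exists() else None
    return timers
```

A value of 500 in the first file corresponds to the 5 second wakeup, and 3000 in the second to the 30 second threshold.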

It would be interesting to graph the "Dirty" and "Writeback" figures in
/proc/meminfo over time with and without this patch in place. That
should make it obvious what the kernel is doing differently in the two
cases.
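
A small sampler along those lines could look like this. It is only a sketch: the function names are made up, the polling interval is an arbitrary choice, and plotting the returned rows is left to whatever tool is at hand.

```python
import time


def parse_meminfo(text, fields=("Dirty", "Writeback")):
    """Extract selected counters (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in fields:
            values[key] = int(rest.split()[0])
    return values


def sample_meminfo(duration=10.0, interval=0.1, path="/proc/meminfo"):
    """Poll the file and return (elapsed_seconds, Dirty_kB, Writeback_kB) rows."""
    rows = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= duration:
            break
        with open(path) as f:
            sample = parse_meminfo(f.read())
        rows.append((round(elapsed, 3), sample.get("Dirty"), sample.get("Writeback")))
        time.sleep(interval)
    return rows
```

Running sample_meminfo() in one terminal while issuing CREATE DATABASE in another, once with and once without the patch, would give the two series to compare.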

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: david(at)lang(dot)hm
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 00:46:26
Message-ID: alpine.DEB.2.00.0912281645040.24130@asgard.lang.hm
Lists: pgsql-hackers pgsql-performance

On Tue, 29 Dec 2009, Andres Freund wrote:

> On Tuesday 29 December 2009 01:30:17 david(at)lang(dot)hm wrote:
>> On Tue, 29 Dec 2009, Greg Stark wrote:
>>> On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>>> fsync everything in that pass.
>>>> Including the directory - which was not done before and actually might
>>>> be necessary in some cases.
>>>
>>> Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
>>> but it doesn't look like we're going to convince the ext4 developers
>>> they're crazy any day soon and it would really suck for a database
>>> created from a template to have files in it go missin.
>>
>> actually, as I understand it you need to do this on all filesystems except
>> ext3, and on ext3 fsync is horribly slow because it writes out
>> _everything_ that's pending, not just stuff related to the file you do the
>> fsync on.
> I dont think its all filesystems (ext2 should not be affected...), but generally
> youre right. At least jfs, xfs are affected as well.

ext2 definitely needs the fsync on the directory as well as the file
(well, if the file metadata, like size, changes)

> Its btw not necessarily nearly-safe and slow on ext3 as well (data=writeback).

no, then it's just unsafe and slow ;-)

David Lang


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 02:05:39
Message-ID: 200912290305.39932.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 29 December 2009 01:46:21 Greg Smith wrote:
> Andres Freund wrote:
> > As I said the real benefit only occurred after adding posix_fadvise(..,
> > FADV_DONTNEED) which is somewhat plausible, because i.e. the directory
> > entries don't need to get scheduled for every file and because the kernel
> > can reorder a whole directory nearly sequentially. Without the advice it
> > the kernel doesn't know in time that it should write that data back and
> > it wont do it for 5 seconds by default on linux or such...
> It would be interesting to graph the "Dirty" and "Writeback" figures in
> /proc/meminfo over time with and without this patch in place. That
> should make it obvious what the kernel is doing differently in the two
> cases.
I did some analysis using blktrace (a useful tool, btw) and the results show that
the I/O pattern is *significantly* different.

For one, with the direct fsyncing nearly no hardware queuing is used, and for
another, nearly no requests are merged on the software side.

Short stats:

OLD:

Total (8,0):
Reads Queued:          2,      8KiB   Writes Queued:      7854,  29672KiB
Read Dispatches:       2,      8KiB   Write Dispatches:   1926,  29672KiB
Reads Requeued:        0              Writes Requeued:       0
Reads Completed:       2,      8KiB   Writes Completed:   2362,  29672KiB
Read Merges:           0,      0KiB   Write Merges:       5492,  21968KiB
PC Reads Queued:       0,      0KiB   PC Writes Queued:      0,      0KiB
PC Read Disp.:       436,      0KiB   PC Write Disp.:        0,      0KiB
PC Reads Req.:         0              PC Writes Req.:        0
PC Reads Compl.:       0              PC Writes Compl.:   2362
IO unplugs:         2395              Timer unplugs:       557

New:

Total (8,0):
Reads Queued:          0,      0KiB   Writes Queued:      1716,   5960KiB
Read Dispatches:       0,      0KiB   Write Dispatches:    324,   5960KiB
Reads Requeued:        0              Writes Requeued:       0
Reads Completed:       0,      0KiB   Writes Completed:    550,   5960KiB
Read Merges:           0,      0KiB   Write Merges:       1166,   4664KiB
PC Reads Queued:       0,      0KiB   PC Writes Queued:      0,      0KiB
PC Read Disp.:       226,      0KiB   PC Write Disp.:        0,      0KiB
PC Reads Req.:         0              PC Writes Req.:        0
PC Reads Compl.:       0              PC Writes Compl.:    550
IO unplugs:          503              Timer unplugs:        30
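
To put the two summaries side by side, the write-side counters can be divided directly. The numbers below are copied from the stats above; the dictionary keys and the ratios helper are just names chosen for this sketch.

```python
# Write-side totals copied from the two blktrace summaries:
# requests queued, requests dispatched, KiB written, timer unplugs.
old = {"queued": 7854, "dispatched": 1926, "kib": 29672, "timer_unplugs": 557}
new = {"queued": 1716, "dispatched": 324, "kib": 5960, "timer_unplugs": 30}


def ratios(before, after):
    """Return before/after for every counter present in both summaries."""
    return {key: before[key] / after[key] for key in before}


for key, value in ratios(old, new).items():
    print(f"{key}: {value:.1f}x fewer with the patch")
```

By these figures the patched run wrote roughly a fifth of the data (29672KiB vs 5960KiB) on this particular machine and filesystem.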

Andres


From: Michael Clemmons <glassresistor(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 02:53:12
Message-ID: 4427a97a0912281853g36e7132al67a73927ae4068c0@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

Andres,
Great job. Looking through the emails and thinking about why this works, I
think this patch should significantly speed up 8.4 on almost any file
system (obviously some more than others), unless the system has significantly
reduced memory or a slow single core. On a Celeron with 256 memory I suspect
it'll crash out or just hit the swap and be a worse bottleneck. Does anyone
have something like this to test on?
-Michael

On Mon, Dec 28, 2009 at 9:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On Tuesday 29 December 2009 01:46:21 Greg Smith wrote:
> > Andres Freund wrote:
> > > As I said the real benefit only occurred after adding posix_fadvise(..,
> > > FADV_DONTNEED) which is somewhat plausible, because i.e. the directory
> > > entries don't need to get scheduled for every file and because the kernel
> > > can reorder a whole directory nearly sequentially. Without the advice it
> > > the kernel doesn't know in time that it should write that data back and
> > > it wont do it for 5 seconds by default on linux or such...
> > It would be interesting to graph the "Dirty" and "Writeback" figures in
> > /proc/meminfo over time with and without this patch in place. That
> > should make it obvious what the kernel is doing differently in the two
> > cases.
> I did some analysis using blktrace (usefull tool btw) and the results show that
> the io pattern is *significantly* different.
>
> For one with the direct fsyncing nearly no hardware queuing is used and for
> another nearly no requests are merged on software side.
>
> Short stats:
>
> OLD:
>
> Total (8,0):
> Reads Queued:          2,      8KiB   Writes Queued:      7854,  29672KiB
> Read Dispatches:       2,      8KiB   Write Dispatches:   1926,  29672KiB
> Reads Requeued:        0              Writes Requeued:       0
> Reads Completed:       2,      8KiB   Writes Completed:   2362,  29672KiB
> Read Merges:           0,      0KiB   Write Merges:       5492,  21968KiB
> PC Reads Queued:       0,      0KiB   PC Writes Queued:      0,      0KiB
> PC Read Disp.:       436,      0KiB   PC Write Disp.:        0,      0KiB
> PC Reads Req.:         0              PC Writes Req.:        0
> PC Reads Compl.:       0              PC Writes Compl.:   2362
> IO unplugs:         2395              Timer unplugs:       557
>
>
> New:
>
> Total (8,0):
> Reads Queued:          0,      0KiB   Writes Queued:      1716,   5960KiB
> Read Dispatches:       0,      0KiB   Write Dispatches:    324,   5960KiB
> Reads Requeued:        0              Writes Requeued:       0
> Reads Completed:       0,      0KiB   Writes Completed:    550,   5960KiB
> Read Merges:           0,      0KiB   Write Merges:       1166,   4664KiB
> PC Reads Queued:       0,      0KiB   PC Writes Queued:      0,      0KiB
> PC Read Disp.:       226,      0KiB   PC Write Disp.:        0,      0KiB
> PC Reads Req.:         0              PC Writes Req.:        0
> PC Reads Compl.:       0              PC Writes Compl.:    550
> IO unplugs:          503              Timer unplugs:        30
>
>
> Andres
>


From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Clemmons <glassresistor(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 02:55:37
Message-ID: 200912290355.38382.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 29 December 2009 03:53:12 Michael Clemmons wrote:
> Andres,
> Great job. Looking through the emails and thinking about why this works I
> think this patch should significantly speedup 8.4 on most any file
> system(obviously some more than others) unless the system has significantly
> reduced memory or a slow single core. On a Celeron with 256 memory I
> suspect it'll crash out or just hit the swap and be a worse bottleneck.
> Anyone have something like this to test on?
Why should it crash? The kernel should just block on writing and write out the
dirty memory before continuing?
Pg is not caching anything here...

Andres


From: Michael Clemmons <glassresistor(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 03:04:06
Message-ID: 4427a97a0912281904qc90817i4edd6852b26dde7f@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

Maybe not crash out, but consider this situation:

N=0
while(N>=0):
    CREATE DATABASE new_db_N;
    N=N+1

Since the fsync is the part which takes the memory and time but is happening
in the background, won't the fsyncs pile up in the background faster than they
can be run, filling up the memory and stack?
This is very likely a misunderstanding on my part about how postgres/processes
actually work.
-Michael

On Mon, Dec 28, 2009 at 9:55 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On Tuesday 29 December 2009 03:53:12 Michael Clemmons wrote:
> > Andres,
> > Great job. Looking through the emails and thinking about why this works I
> > think this patch should significantly speedup 8.4 on most any file
> > system(obviously some more than others) unless the system has significantly
> > reduced memory or a slow single core. On a Celeron with 256 memory I
> > suspect it'll crash out or just hit the swap and be a worse bottleneck.
> > Anyone have something like this to test on?
> Why should it crash? The kernel should just block on writing and write out the
> dirty memory before continuing?
> Pg is not caching anything here...
>
> Andres
>


From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Clemmons <glassresistor(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 03:11:14
Message-ID: 200912290411.15659.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 29 December 2009 04:04:06 Michael Clemmons wrote:
> Maybe not crash out but in this situation.
> N=0
> while(N>=0):
> CREATE DATABASE new_db_N;
> Since the fsync is the part which takes the memory and time but is
> happening in the background, won't the fsyncs pile up in the background
> faster than they can be run, filling up the memory and stack?
> This is very likely a mistake on my part about how postgres/processes
The difference should not be visible outside the "CREATE DATABASE ..." at all.
Simplified, the process currently works like this:

------------
for file in source directory:
    copy_file(source/file, target/file);
    fsync(target/file);
------------

I changed it to:

-------------
for file in source directory:
    copy_file(source/file, target/file);

    /* please, dear kernel, write this out, but don't block */
    posix_fadvise(target/file, FADV_DONTNEED);

for file in source directory:
    fsync(target/file);
-------------

If at any point in time there is not enough cache available to cache anything,
copy_file() will just have to wait for the kernel to write out the data.
fsync() does not use memory itself.
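
To make the two-pass idea above concrete, here is a minimal Python sketch. It is not the patch itself: the function name copy_dir_delayed_fsync is made up for illustration, it only handles regular files in a single directory, and it assumes a Unix-like system (the writeback hint is skipped where os.posix_fadvise is unavailable).

```python
import os
import shutil


def copy_dir_delayed_fsync(src_dir, dst_dir):
    """Copy regular files from src_dir to dst_dir, fsyncing in a second pass.

    Hypothetical helper for illustration only; it mirrors the pseudocode
    above and is not PostgreSQL's copydir().
    """
    os.makedirs(dst_dir, exist_ok=True)
    names = [n for n in sorted(os.listdir(src_dir))
             if os.path.isfile(os.path.join(src_dir, n))]

    # Pass 1: copy every file and hint the kernel to start writing it out.
    for name in names:
        dst = os.path.join(dst_dir, name)
        shutil.copyfile(os.path.join(src_dir, name), dst)
        if hasattr(os, "posix_fadvise"):
            fd = os.open(dst, os.O_RDWR)
            try:
                # offset=0, len=0 covers the whole file; this is advisory only.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            finally:
                os.close(fd)

    # Pass 2: only now block until each file has reached storage.
    for name in names:
        fd = os.open(os.path.join(dst_dir, name), os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    return names
```

The point of the split is that every fsync() in the second pass is issued after all files are already queued for writeback, rather than forcing a synchronous flush after each individual copy.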

Andres


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 10:48:10
Message-ID: 407d949e0912290248y24acd059q1f31d6f0d057bbe6@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Dec 29, 2009 at 2:05 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>  Reads Completed:        2,        8KiB  Writes Completed:     2362,    29672KiB
> New:
>  Reads Completed:        0,        0KiB  Writes Completed:      550,     5960KiB

It looks like the new method is only doing 1/6th as much i/o. Do you
know what's going on there?

--
greg


From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 11:13:21
Message-ID: 200912291213.21721.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 29 December 2009 11:48:10 Greg Stark wrote:
> On Tue, Dec 29, 2009 at 2:05 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Reads Completed: 2, 8KiB Writes Completed: 2362, 29672KiB
> > New:
> > Reads Completed: 0, 0KiB Writes Completed: 550, 5960KiB
>
> It looks like the new method is only doing 1/6th as much i/o. Do you
> know what's going on there?
While I was surprised by the size of the difference, I am not surprised at all
that there is a significant one - currently the fsync will write out a whole
bunch of useless stuff every time it's called (all metadata, directory structure
and so on).

This is reproducible...

6MB sounds sensible for the operation btw - the template database is around
5MB.

Will try to analyze later what exactly causes the additional io.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2009-12-29 18:30:49
Message-ID: 200912291930.50812.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Monday 28 December 2009 23:59:43 Andres Freund wrote:
> On Monday 28 December 2009 23:54:51 Andres Freund wrote:
> > On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> > > On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > > > If ppl think its worth it I'll create a ticket
> > >
> > > Thanks, no need. I will post a patch tomorrow or so.
> >
> > Well. It was a long day...
> >
> > Anyway.
> > In this patch I delay the fsync done in copy_file and simply do a second
> > pass over the directory in copy_dir and fsync everything in that pass.
> > Including the directory - which was not done before and actually might be
> > necessary in some cases.
> > I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that
> > the copied file reaches storage before the fsync. Without the speed
> > benefits were quite a bit smaller and essentially random (which seems
> > sensible).
> >
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop. Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
> >
> > The benefits are obviously bigger if the template database includes
> > anything added.
>
> Obviously the patch would be helpful.
And it would also be helpful not to have annoying oversights in there. A
FreeDir(xldir); is missing at the end of copydir().

Andres


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-18 16:35:59
Message-ID: 407d949e1001180835m79ffe3e4w53e20ce2e1a58f2d@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

Looking at this patch for the commitfest I have a few questions.

1) You said you added an fsync of the new directory -- where is that? I
don't see it anywhere.

2) Why does the second pass to do the fsyncs read through fromdir to
find all the filenames? I find that odd and counterintuitive. It would
be much more natural to just loop through the files in the new
directory. But I suppose it serves as an added paranoia check that the
files are in fact still there and we're not fsyncing any files we
didn't just copy. I think it should still work, we should have an
exclusive lock on the template database so there really ought to be no
differences between the directory trees.

3) It would be tempting to do the posix_fadvise on each chunk as we
copy it. That way we avoid poisoning the filesystem cache even as far
as a 1G file. This might actually be quite significant if we're built
without the 1G file chunk size. I'm a bit concerned that the code will
be a bit more complex having to depend on a good off_t definition
though. Do we only use >1GB files on systems where off_t is capable of
handling >2^32 without gymnastics?

--
greg


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-19 14:52:25
Message-ID: 407d949e1001190652i43f4f276x6a485f375647219a@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Looking at this patch for the commitfest I have a few questions.

So I've touched this patch up a bit:

1) moved the posix_fadvise call to a new fd.c function
pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
waiting on it. Currently it's only implemented with
posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
in the future -- it looks like this call might be good enough for our
checkpoints.

2) advised each 64k chunk as we write it which should avoid poisoning
the cache if you do a large create database on an active system.

3) added the promised but afaict missing fsync of the directory -- i
think we should actually backpatch this.
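
The three changes listed above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the attached patch: flush_hint is a hypothetical stand-in for pg_fsync_start(fd,offset,nbytes), copy_file_chunked and fsync_path are made-up names, and fsyncing a directory by opening it read-only is assumed to work as it does on Linux.

```python
import os

CHUNK = 64 * 1024  # 64 KiB, the chunk size mentioned in point 2


def flush_hint(fd, offset, nbytes):
    """Hypothetical stand-in for pg_fsync_start(); advisory, may do nothing."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, offset, nbytes, os.POSIX_FADV_DONTNEED)


def copy_file_chunked(src, dst):
    """Copy src to dst in CHUNK-sized pieces, hinting after each chunk."""
    offset = 0
    with open(src, "rb") as fin:
        out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            while True:
                buf = fin.read(CHUNK)
                if not buf:
                    break
                view = memoryview(buf)
                while len(view) > 0:  # os.write may write fewer bytes than asked
                    written = os.write(out, view)
                    view = view[written:]
                # Advise only the range just written, so a large copy does not
                # keep the whole file in the page cache.
                flush_hint(out, offset, len(buf))
                offset += len(buf)
        finally:
            os.close(out)
    return offset


def fsync_path(path):
    """fsync a file or a directory by path (directory fsync is Unix-specific)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller would copy each file with copy_file_chunked(), then call fsync_path() on every copied file and finally on the target directory itself, which corresponds to point 3.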

Barring any objections shall I commit it like this?

--
greg

Attachment Content-Type Size
create-database-speedup-using-pfa.diff text/x-patch 4.7 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-19 14:57:14
Message-ID: 407d949e1001190657m7cd289d5s74c3d8c475e0bdc7@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Jan 19, 2010 at 2:52 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Barring any objections shall I commit it like this?

Actually before we get there could someone who demonstrated the
speedup verify that this patch still gets that same speedup?

--
greg


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-19 15:03:16
Message-ID: 201001191603.17013.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 19 January 2010 15:52:25 Greg Stark wrote:
> On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > Looking at this patch for the commitfest I have a few questions.
>
> So I've touched this patch up a bit:
>
> 1) moved the posix_fadvise call to a new fd.c function
> pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
> waiting on it. Currently it's only implemented with
> posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
> in the future -- it looks like this call might be good enough for our
> checkpoints.
>
> 2) advised each 64k chunk as we write it which should avoid poisoning
> the cache if you do a large create database on an active system.
>
> 3) added the promised but afaict missing fsync of the directory -- i
> think we should actually backpatch this.
Yes, that was a bit stupid of me - I added the fsync for directories which
get recursed into (by not checking if it's a file) but not for the uppermost
level.
So all directories should get fsynced right now except the topmost one.

I will review the patch later when I finally have some time off again, in
about 4h.

Thanks!

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-19 15:25:46
Message-ID: 9319.1263914746@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Greg Stark <gsstark(at)mit(dot)edu> writes:
> 1) moved the posix_fadvise call to a new fd.c function
> pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
> waiting on it. Currently it's only implemented with
> posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
> in the future -- it looks like this call might be good enough for our
> checkpoints.

That function *seriously* needs documentation, in particular the fact
that it's a no-op on machines without the right kernel call. The name
you've chosen is very bad for those semantics. I'd pick something
else myself. Maybe "pg_start_data_flush" or something like that?

Other than that quibble it seems basically sane.
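
As an illustration of the kind of documentation being asked for, a wrapper might spell out its best-effort semantics like this. This is a Python sketch only; the name start_data_flush borrows from the suggestion above, and the boolean return value is an assumption added here so callers can tell whether a hint was issued.

```python
import os


def start_data_flush(fd, offset=0, nbytes=0):
    """Ask the kernel to begin writing out a file range, without waiting.

    This is a hint, not a sync:
      * On platforms without posix_fadvise() it does nothing at all.
      * Even where the call exists, the kernel is free to ignore it.
      * Callers that need durability must still call os.fsync() later.

    nbytes=0 means "through the end of the file". Returns True if a hint
    was issued and False if the call was a no-op on this platform.
    """
    advise = getattr(os, "posix_fadvise", None)
    if advise is None:
        return False
    advise(fd, offset, nbytes, os.POSIX_FADV_DONTNEED)
    return True
```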

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-20 04:01:55
Message-ID: 201001200501.55902.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

Hi Greg,

On Monday 18 January 2010 17:35:59 Greg Stark wrote:
> 2) Why does the second pass to do the fsyncs read through fromdir to
> find all the filenames. I find that odd and counterintuitive. It would
> be much more natural to just loop through the files in the new
> directory. But I suppose it serves as an added paranoia check that the
> files are in fact still there and we're not fsyncing any files we
> didn't just copy. I think it should still work, we should have an
> exclusive lock on the template database so there really ought to be no
> differences between the directory trees.
If it weren't safe we would already have a big problem....

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-20 04:02:17
Message-ID: 201001200502.18134.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

Hi Greg,

On Tuesday 19 January 2010 15:52:25 Greg Stark wrote:
> On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > Looking at this patch for the commitfest I have a few questions.
>
> So I've touched this patch up a bit:
>
> 1) moved the posix_fadvise call to a new fd.c function
> pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
> waiting on it. Currently it's only implemented with
> posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
> in the future -- it looks like this call might be good enough for our
> checkpoints.
Why exactly should that depend on fsync? Sure, that's where most of the pain
comes from now, but avoiding that cache poisoning wouldn't hurt otherwise
either.

I would rather have it called pg_flush_cache_range or such...

> 2) advised each 64k chunk as we write it which should avoid poisoning
> the cache if you do a large create database on an active system.
>
> 3) added the promised but afaict missing fsync of the directory -- i
> think we should actually backpatch this.
I think so as well. You need it while recursing as well though (where I had
added it) and not only for the final directory.

> Barring any objections shall I commit it like this?
Other than the two things above it looks fine to me.

Thanks,

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-20 04:13:03
Message-ID: 201001200513.04510.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 19 January 2010 15:57:14 Greg Stark wrote:
> On Tue, Jan 19, 2010 at 2:52 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > Barring any objections shall I commit it like this?
>
> Actually before we get there could someone who demonstrated the
> speedup verify that this patch still gets that same speedup?
At least on the three machines I tested last time the result is still in the
same ballpark.

Andres


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-20 05:21:07
Message-ID: 4B5692C3.6080807@2ndquadrant.com
Lists: pgsql-hackers pgsql-performance

Greg Stark wrote:
> On Tue, Jan 19, 2010 at 2:52 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
>> Barring any objections shall I commit it like this?
>>
>
> Actually before we get there could someone who demonstrated the
> speedup verify that this patch still gets that same speedup?
>

I think the final version of this patch could use at least one more
performance checking report that it does something useful. We got a lot
of data from Andres, but do we know that the improvements here hold for
others too? I can take a look at it later this week, I have some
interest in this area.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-27 07:21:44
Message-ID: 4B5FE988.3070604@2ndquadrant.com
Lists: pgsql-hackers pgsql-performance

Greg Stark wrote:
> Actually before we get there could someone who demonstrated the
> speedup verify that this patch still gets that same speedup?
>

Let's step back a second and get to the bottom of why some people are
seeing this and others aren't. The original report here suggested this
was an ext4 issue. As I pointed out recently on the performance list,
the reason for that is likely that the working write-barrier support for
ext4 means it's passing through the fsync to "lying" hard drives via a
proper cache flush, which didn't happen on your typical ext3 install.
Given that, I'd expect I could see the same issue with ext3 given a
drive with its write cache turned off, so that's the theory I started
trying to prove before seeing the patch operate.

What I did was create a little test program that created 5 databases and
then dropped them:

\timing
create database a;
create database b;
create database c;
create database d;
create database e;
drop database a;
drop database b;
drop database c;
drop database d;
drop database e;

(All of the drop times were very close by the way; around 100ms, nothing
particularly interesting there)

If I have my system's boot drive (attached to the motherboard, not on
the caching controller) in its regular, lying mode with write cache on,
the creates take the following times:

Time: 713.982 ms
Time: 659.890 ms
Time: 590.842 ms
Time: 675.506 ms
Time: 645.521 ms

A second run gives similar results; seems quite repeatable for every
test I ran so I'll just show one run of each.

If I then turn off the write-cache on the drive:

$ sudo hdparm -W 0 /dev/sdb

And repeat, these times show up instead:

Time: 6781.205 ms
Time: 6805.271 ms
Time: 6947.037 ms
Time: 6938.644 ms
Time: 7346.838 ms

So there's the problem case reproduced, right on regular old ext3 and
Ubuntu Jaunty: around 7 seconds to create a database, not real impressive.

Applying the last patch you attached, with the cache on, I see this:

Time: 396.105 ms
Time: 389.984 ms
Time: 469.800 ms
Time: 386.043 ms
Time: 441.269 ms

And if I then turn the write cache off, back to slow times, but much better:

Time: 2162.687 ms
Time: 2174.057 ms
Time: 2215.785 ms
Time: 2174.100 ms
Time: 2190.811 ms

That makes the average times I'm seeing on my server:

HEAD     Cached: 657 ms   Uncached: 6964 ms
Patched  Cached: 417 ms   Uncached: 2183 ms

Modest speedup even with a caching drive, and a huge speedup in the case
when you have one with slow fsync. Looks to me that if you address
Tom's concern about documentation and function naming, committing this
patch will certainly deliver as promised on the performance side. Maybe
2 seconds is still too long for some people, but it's at least a whole
lot better.
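
For anyone who wants to check the arithmetic, the averages above can be recomputed from the individual timings with a few lines of Python. The dictionary layout and variable names are just for illustration; the numbers are the ones reported in this message.

```python
from statistics import mean

# CREATE DATABASE timings in milliseconds, as reported above.
runs_ms = {
    ("HEAD", "cached"): [713.982, 659.890, 590.842, 675.506, 645.521],
    ("HEAD", "uncached"): [6781.205, 6805.271, 6947.037, 6938.644, 7346.838],
    ("patched", "cached"): [396.105, 389.984, 469.800, 386.043, 441.269],
    ("patched", "uncached"): [2162.687, 2174.057, 2215.785, 2174.100, 2190.811],
}

averages = {key: mean(values) for key, values in runs_ms.items()}

# Speedup factor of the patched build over HEAD for each drive mode.
speedup = {
    mode: averages[("HEAD", mode)] / averages[("patched", mode)]
    for mode in ("cached", "uncached")
}

for (build, mode), avg in averages.items():
    print(f"{build:8s} {mode:9s} {avg:8.0f} ms")
for mode, factor in speedup.items():
    print(f"speedup ({mode}): {factor:.2f}x")
```

This works out to a speedup of roughly 1.6x with the write cache on and roughly 3.2x with it off.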

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.co


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-01-29 18:56:23
Message-ID: 407d949e1001291056q22915b1cqbce5fbc918a15d69@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> That function *seriously* needs documentation, in particular the fact
> that it's a no-op on machines without the right kernel call.  The name
> you've chosen is very bad for those semantics.  I'd pick something
> else myself.  Maybe "pg_start_data_flush" or something like that?
>

I would like to make one token argument in favour of the name I
picked. If it doesn't convince I'll change it since we can always
revisit the API down the road.

I envision having two function calls, pg_fsync_start() and
pg_fsync_finish(). The latter will wait until the data synced in the
first call is actually synced. The fall-back if there's no
implementation of this would be for fsync_start() to be a noop (or
something unreliable like posix_fadvise) and fsync_finish() to just be
a regular fsync.

I think we can accomplish this with sync_file_range() but I need to
read up on how it actually works a bit more. In this case it doesn't
make a difference since when we call fsync_finish() it's going to be
for the entire file and nothing else will have been writing to these
files. But for wal writing and checkpointing it might have very
different performance characteristics.

The big objection to this is that then we don't really have an api for
FADV_DONT_NEED which is more about cache policy than about syncing to
disk. So for example a sequential scan might want to indicate that it
isn't planning on reading the buffers it's churning through but
doesn't want to force them to be written sooner than otherwise and is
never going to call fsync_finish().
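
A rough Python sketch of the two-call shape described above, using only the fall-back behaviour: fsync_start() is a best-effort hint (or a no-op), and fsync_finish() is a regular fsync. The function names follow the message, but write_all_then_sync is a made-up driver, and sync_file_range() is not used here because, to my knowledge, Python's os module does not expose it.

```python
import os


def fsync_start(fd):
    """Best effort: nudge the kernel to begin writeback. May be a no-op."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)


def fsync_finish(fd):
    """The actual guarantee: block until the file's data is on storage."""
    os.fsync(fd)


def write_all_then_sync(paths_and_data):
    """Write several small files, start flushing each, then wait for all.

    Hypothetical driver for illustration; a robust version would also
    loop on short writes.
    """
    fds = []
    try:
        for path, data in paths_and_data:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            fds.append(fd)
            os.write(fd, data)
            fsync_start(fd)
        # Every file is already queued for writeback before we start waiting.
        for fd in fds:
            fsync_finish(fd)
    finally:
        for fd in fds:
            os.close(fd)
    return len(fds)
```

Correctness rests entirely on fsync_finish(); fsync_start() only affects how much work is left by the time the waiting begins.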

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 17:36:12
Message-ID: 603c8f071002020936k5723e30kd4eac594092aba3b@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Fri, Jan 29, 2010 at 1:56 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> That function *seriously* needs documentation, in particular the fact
>> that it's a no-op on machines without the right kernel call.  The name
>> you've chosen is very bad for those semantics.  I'd pick something
>> else myself.  Maybe "pg_start_data_flush" or something like that?
>>
>
> I would like to make one token argument in favour of the name I
> picked. If it doesn't convince I'll change it since we can always
> revisit the API down the road.
>
> I envision having two function calls, pg_fsync_start() and
> pg_fsync_finish(). The latter will wait until the data synced in the
> first call is actually synced. The fall-back if there's no
> implementation of this would be for fsync_start() to be a noop (or
> something unreliable like posix_fadvise) and fsync_finish() to just be
> a regular fsync.
>
> I think we can accomplish this with sync_file_range() but I need to
> read up on how it actually works a bit more. In this case it doesn't
> make a difference since when we call fsync_finish() it's going to be
> for the entire file and nothing else will have been writing to these
> files. But for wal writing and checkpointing it might have very
> different performance characteristics.
>
> The big objection to this is that then we don't really have an api for
> FADV_DONT_NEED which is more about cache policy than about syncing to
> disk. So for example a sequential scan might want to indicate that it
> isn't planning on reading the buffers it's churning through but
> doesn't want to force them to be written sooner than otherwise and is
> never going to call fsync_finish().

I took a look at this patch today and I agree with Tom that
pg_fsync_start() is a very confusing name. I don't know what the
right name is, but this doesn't fsync so I don't think it should have
fsync in the name. Maybe something like pg_advise_abandon() or
pg_abandon_cache(). The current name is really wishful thinking:
you're hoping that it will make the kernel start the fsync, but it
might not. I think pg_start_data_flush() is similarly optimistic.

...Robert


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 17:43:15
Message-ID: 201002021843.49983.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
> On Fri, Jan 29, 2010 at 1:56 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> That function *seriously* needs documentation, in particular the fact
> >> that it's a no-op on machines without the right kernel call. The name
> >> you've chosen is very bad for those semantics. I'd pick something
> >> else myself. Maybe "pg_start_data_flush" or something like that?
> >
> > I would like to make one token argument in favour of the name I
> > picked. If it doesn't convince I'll change it since we can always
> > revisit the API down the road.
> >
> > I envision having two function calls, pg_fsync_start() and
> > pg_fsync_finish(). The latter will wait until the data synced in the
> > first call is actually synced. The fall-back if there's no
> > implementation of this would be for fsync_start() to be a noop (or
> > something unreliable like posix_fadvise) and fsync_finish() to just be
> > a regular fsync.
> >
> > I think we can accomplish this with sync_file_range() but I need to
> > read up on how it actually works a bit more. In this case it doesn't
> > make a difference since when we call fsync_finish() it's going to be
> > for the entire file and nothing else will have been writing to these
> > files. But for wal writing and checkpointing it might have very
> > different performance characteristics.
> >
> > The big objection to this is that then we don't really have an api for
> > FADV_DONT_NEED which is more about cache policy than about syncing to
> > disk. So for example a sequential scan might want to indicate that it
> > isn't planning on reading the buffers it's churning through but
> > doesn't want to force them to be written sooner than otherwise and is
> > never going to call fsync_finish().
>
> I took a look at this patch today and I agree with Tom that
> pg_fsync_start() is a very confusing name. I don't know what the
> > right name is, but this doesn't fsync so I don't think it should have
> fsync in the name. Maybe something like pg_advise_abandon() or
> pg_abandon_cache(). The current name is really wishful thinking:
> you're hoping that it will make the kernel start the fsync, but it
> might not. I think pg_start_data_flush() is similarly optimistic.
What about: pg_fsync_prepare()? That gives the reason why we're doing that and
doesn't promise that it is actually doing an fsync.
I really dislike having "cache" in the name, because the primary aim is not to
discard the cache...

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 17:50:15
Message-ID: 23321.1265133015@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Andres Freund <andres(at)anarazel(dot)de> writes:
> On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
>> I took a look at this patch today and I agree with Tom that
>> pg_fsync_start() is a very confusing name. I don't know what the
>> right name is, but this doesn't fsync so I don't think it shuld have
>> fsync in the name. Maybe something like pg_advise_abandon() or
>> pg_abandon_cache(). The current name is really wishful thinking:
>> you're hoping that it will make the kernel start the fsync, but it
>> might not. I think pg_start_data_flush() is similarly optimistic.

> What about: pg_fsync_prepare().

prepare_for_fsync()?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 18:14:40
Message-ID: 603c8f071002021014h4b489f44s9014377e1b38dbc7@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Feb 2, 2010 at 12:50 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
>>> I took a look at this patch today and I agree with Tom that
>>> pg_fsync_start() is a very confusing name.  I don't know what the
>>> right name is, but this doesn't fsync so I don't think it shuld have
>>> fsync in the name.  Maybe something like pg_advise_abandon() or
>>> pg_abandon_cache().  The current name is really wishful thinking:
>>> you're hoping that it will make the kernel start the fsync, but it
>>> might not.  I think pg_start_data_flush() is similarly optimistic.
>
>> What about: pg_fsync_prepare().
>
> prepare_for_fsync()?

It still seems mis-descriptive to me. Couldn't the same routine be
used simply to abandon undirtied data that we no longer care about
caching?

...Robert


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 18:34:07
Message-ID: 201002021934.18444.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 02 February 2010 19:14:40 Robert Haas wrote:
> On Tue, Feb 2, 2010 at 12:50 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> >> On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
> >>> I took a look at this patch today and I agree with Tom that
> >>> pg_fsync_start() is a very confusing name. I don't know what the
> >>> right name is, but this doesn't fsync so I don't think it shuld have
> >>> fsync in the name. Maybe something like pg_advise_abandon() or
> >>> pg_abandon_cache(). The current name is really wishful thinking:
> >>> you're hoping that it will make the kernel start the fsync, but it
> >>> might not. I think pg_start_data_flush() is similarly optimistic.
> >>
> >> What about: pg_fsync_prepare().
> >
> > prepare_for_fsync()?
>
> It still seems mis-descriptive to me. Couldn't the same routine be
> used simply to abandon undirtied data that we no longer care about
> caching?
For now it could - but it very well might be converted to sync_file_range or
similar, which would have different "side effects".

As the potential code duplication is rather small I would prefer to describe
the prime effect, not the side effects...

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 19:06:32
Message-ID: 603c8f071002021106o5eef7b40r1ed21c9bd05251ec@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Feb 2, 2010 at 1:34 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> For now it could - but it very well might be converted to sync_file_range or
> similar, which would have different "sideeffects".
>
> As the potential code duplication is rather small I would prefer to describe
> the prime effect not the sideeffects...

Hmm, in that case, I think the problem is that this function has no
comment explaining its intended charter.

...Robert


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 19:08:12
Message-ID: 201002022008.13898.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Tuesday 02 February 2010 20:06:32 Robert Haas wrote:
> On Tue, Feb 2, 2010 at 1:34 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > For now it could - but it very well might be converted to sync_file_range
> > or similar, which would have different "sideeffects".
> >
> > As the potential code duplication is rather small I would prefer to
> > describe the prime effect not the sideeffects...
>
> Hmm, in that case, I think the problem is that this function has no
> comment explaining its intended charter.
I agree there. Greg, do you want to update the patch with some comments or
shall I?

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 19:33:30
Message-ID: 29129.1265139210@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Hmm, in that case, I think the problem is that this function has no
> comment explaining its intended charter.

That's certainly a big problem, but a comment won't fix the fact that
the name is misleading. We need both a comment and a name change.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-02 19:45:46
Message-ID: 603c8f071002021145x1939c9e7hb2eedc8ae4dfe23e@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Feb 2, 2010 at 2:33 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Hmm, in that case, I think the problem is that this function has no
>> comment explaining its intended charter.
>
> That's certainly a big problem, but a comment won't fix the fact that
> the name is misleading.  We need both a comment and a name change.

I think you're probably right, but it's not clear what the new name
should be until we have a comment explaining what the function is
responsible for.

...Robert


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-03 11:53:58
Message-ID: 407d949e1002030353g522f1afev782a97fd755c5926@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think you're probably right, but it's not clear what the new name
> should be until we have a comment explaining what the function is
> responsible for.

So I wrote some comments but wasn't going to repost the patch with the
unchanged name without explanation... But I think you're right though
I was looking at it the other way around. I want to have an API for a
two-stage sync and of course if I do that I'll comment it to explain
that clearly.

The gist of the comments was that the function is preparing to fsync
to initiate the i/o early and allow the later fsync to be fast -- but
also at the same time have the beneficial side-effect of avoiding
cache poisoning. It's not clear that the two are necessarily linked
though. Perhaps we need two separate apis, though it'll be hard to
keep them separate on all platforms.
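
For readers following along, the two-stage idea can be sketched roughly as below. This is only an illustration under assumptions, not the code from the patch: the names sync_hint_start() and sync_finish() are hypothetical, and the first stage uses posix_fadvise(POSIX_FADV_DONTNEED), the only kernel call discussed in this thread so far. The hint may or may not cause the kernel to start writeback; only the fsync() in the second stage gives a durability guarantee.

```c
#define _POSIX_C_SOURCE 200809L   /* for posix_fadvise() and fsync() */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Stage 1 (hypothetical name): hint that the kernel may start writing
 * back this file's dirty pages and need not keep them cached.
 * Purely advisory: returns 0 on success or an errno value, and callers
 * must not rely on any I/O actually having been started.
 */
static int
sync_hint_start(int fd)
{
    assert(fd >= 0);
#ifdef POSIX_FADV_DONTNEED
    /* offset 0 with len 0 means "the whole file" */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#else
    return 0;                     /* no hint available: do nothing */
#endif
}

/*
 * Stage 2 (hypothetical name): the real durability point.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
sync_finish(int fd)
{
    assert(fd >= 0);
    return fsync(fd);
}
```

A caller copying many files would call sync_hint_start() on each file right after writing it and sync_finish() on each one only at the end, so the final fsyncs have a chance of finding most of the data already on disk.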

--
greg


From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-03 12:03:04
Message-ID: 4B6965F8.7040106@anarazel.de
Lists: pgsql-hackers pgsql-performance

On 02/03/10 12:53, Greg Stark wrote:
> On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>> I think you're probably right, but it's not clear what the new name
>> should be until we have a comment explaining what the function is
>> responsible for.
>
> So I wrote some comments but wasn't going to repost the patch with the
> unchanged name without explanation... But I think you're right though
> I was looking at it the other way around. I want to have an API for a
> two-stage sync and of course if I do that I'll comment it to explain
> that clearly.
>
> The gist of the comments was that the function is preparing to fsync
> to initiate the i/o early and allow the later fsync to fast -- but
> also at the same time have the beneficial side-effect of avoiding
> cache poisoning. It's not clear that the two are necessarily linked
> though. Perhaps we need two separate apis, though it'll be hard to
> keep them separate on all platforms.
I vote for two separate APIs - sure, there will be some unfortunate
overlap on most Unix-like platforms, but it's surely better to be able to
add more platforms later in one centralized place than to have to
analyze every place where the API is used.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-03 13:42:57
Message-ID: 603c8f071002030542r779a4f47k78355c62b0b7853@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Feb 3, 2010 at 6:53 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I think you're probably right, but it's not clear what the new name
>> should be until we have a comment explaining what the function is
>> responsible for.
>
> So I wrote some comments but wasn't going to repost the patch with the
> unchanged name without explanation... But I think you're right though
> I was looking at it the other way around. I want to have an API for a
> two-stage sync and of course if I do that I'll comment it to explain
> that clearly.
>
> The gist of the comments was that the function is preparing to fsync
> to initiate the i/o early and allow the later fsync to fast -- but
> also at the same time have the beneficial side-effect of avoiding
> cache poisoning. It's not clear that the two are necessarily linked
> though. Perhaps we need two separate apis, though it'll be hard to
> keep them separate on all platforms.

Well, maybe we should start with a discussion of what kernel calls
you're aware of on different platforms and then we could try to put an
API around it. I mean, right now all you've got is
POSIX_FADV_DONTNEED, so given just that I feel like the API could
simply be pg_dontneed() or something. It's hard to design a general
framework based on one example.

...Robert


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-03 14:19:49
Message-ID: 4B698605.50506@anarazel.de
Lists: pgsql-hackers pgsql-performance

On 02/03/10 14:42, Robert Haas wrote:
> On Wed, Feb 3, 2010 at 6:53 AM, Greg Stark<gsstark(at)mit(dot)edu> wrote:
>> On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>>> I think you're probably right, but it's not clear what the new name
>>> should be until we have a comment explaining what the function is
>>> responsible for.
>>
>> So I wrote some comments but wasn't going to repost the patch with the
>> unchanged name without explanation... But I think you're right though
>> I was looking at it the other way around. I want to have an API for a
>> two-stage sync and of course if I do that I'll comment it to explain
>> that clearly.
>>
>> The gist of the comments was that the function is preparing to fsync
>> to initiate the i/o early and allow the later fsync to fast -- but
>> also at the same time have the beneficial side-effect of avoiding
>> cache poisoning. It's not clear that the two are necessarily linked
>> though. Perhaps we need two separate apis, though it'll be hard to
>> keep them separate on all platforms.
>
> Well, maybe we should start with a discussion of what kernel calls
> you're aware of on different platforms and then we could try to put an
> API around it.
On Linux there is sync_file_range(). On newer POSIX-ish systems one can
emulate that with mmap() and msync() (in batches obviously).
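
As a rough illustration of the Linux call mentioned above (a sketch under assumptions, not code from the patch): the helper name start_range_writeback() is hypothetical, and on platforms without sync_file_range() it simply does nothing rather than attempting the mmap()/msync() emulation. SYNC_FILE_RANGE_WRITE only initiates writeback of dirty pages in the range; it does not wait and does not flush file metadata, so a later fsync() is still needed for durability.

```c
#define _GNU_SOURCE         /* sync_file_range() is a Linux/glibc extension */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start writeback for a byte
 * range of fd without waiting for it to finish.  Returns 0 on success
 * (or when no mechanism is available), -1 with errno set on failure.
 * This does not make data durable; a later fsync() is still required.
 */
static int
start_range_writeback(int fd, off_t offset, off_t nbytes)
{
    assert(fd >= 0 && offset >= 0 && nbytes >= 0);
#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    /* nbytes == 0 means "from offset to end of file" */
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#else
    (void) offset;
    (void) nbytes;
    return 0;               /* nothing comparable here: silently skip */
#endif
}
```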

No idea about windows.

Andres


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-06 05:03:30
Message-ID: 4B6CF822.9010608@2ndquadrant.com
Lists: pgsql-hackers pgsql-performance

Andres Freund wrote:
> On 02/03/10 14:42, Robert Haas wrote:
>> Well, maybe we should start with a discussion of what kernel calls
>> you're aware of on different platforms and then we could try to put an
>> API around it.
> In linux there is sync_file_range. On newer Posixish systems one can
> emulate that with mmap() and msync() (in batches obviously).
>
> No idea about windows.

There's a series of parameters you can pass into CreateFile:
http://msdn.microsoft.com/en-us/library/aa363858(VS.85).aspx

A lot of these are already mapped inside of src/port/open.c in a pretty
straightforward way from the POSIX-oriented interface:

O_RDWR,O_WRONLY -> GENERIC_WRITE, GENERIC_READ
O_RANDOM -> FILE_FLAG_RANDOM_ACCESS
O_SEQUENTIAL -> FILE_FLAG_SEQUENTIAL_SCAN
O_SHORT_LIVED -> FILE_ATTRIBUTE_TEMPORARY
O_TEMPORARY -> FILE_FLAG_DELETE_ON_CLOSE
O_DIRECT -> FILE_FLAG_NO_BUFFERING
O_DSYNC -> FILE_FLAG_WRITE_THROUGH

You have to read the whole "Caching Behavior" section to see exactly how
all of those interact, and even then notes like
http://support.microsoft.com/kb/99794 are needed to follow the fine
points of things like FILE_FLAG_NO_BUFFERING vs. FILE_FLAG_WRITE_THROUGH.

So anything that's setting those POSIX open flags better than before is
getting the benefit of that improvement on Windows, too. But that's not
quite the same as the changes using fadvise to provide better targeted
cache control hints.

I'm getting the impression that doing much better on Windows might fall
into the same sort of category as Solaris, where the primary interface
for this sort of thing is to use an AIO implementation instead:
http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx

The effective_io_concurrency feature had proof of concept test programs
that worked using AIO, but actually following through on that
implementation would require a major restructuring of how the database
interacts with the OS in terms of reads and writes of blocks. It looks
to me like doing something similar to sync_file_range on Windows would
be similarly difficult.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us


From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-06 12:03:50
Message-ID: 201002061304.01682.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Saturday 06 February 2010 06:03:30 Greg Smith wrote:
> Andres Freund wrote:
> > On 02/03/10 14:42, Robert Haas wrote:
> >> Well, maybe we should start with a discussion of what kernel calls
> >> you're aware of on different platforms and then we could try to put an
> >> API around it.
> >
> > In linux there is sync_file_range. On newer Posixish systems one can
> > emulate that with mmap() and msync() (in batches obviously).
> >
> > No idea about windows.
> The effective_io_concurrency feature had proof of concept test programs
> that worked using AIO, but actually following through on that
> implementation would require a major restructuring of how the database
> interacts with the OS in terms of reads and writes of blocks. It looks
> to me like doing something similar to sync_file_range on Windows would
> be similarly difficult.
Looking around a bit, it seems one could achieve something approximately
similar to pg_prepare_fsync() by using
CreateFileMapping && MapViewOfFile && FlushViewOfFile

If I understand it correctly, that will flush but not wait. Unfortunately you
can't even make it wait, so it's not possible to implement sync_file_range or
similar fully.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-07 05:13:15
Message-ID: 603c8f071002062113k262dd8xcd0cdd482d2c6150@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sat, Feb 6, 2010 at 7:03 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Saturday 06 February 2010 06:03:30 Greg Smith wrote:
>> Andres Freund wrote:
>> > On 02/03/10 14:42, Robert Haas wrote:
>> >> Well, maybe we should start with a discussion of what kernel calls
>> >> you're aware of on different platforms and then we could try to put an
>> >> API around it.
>> >
>> > In linux there is sync_file_range. On newer Posixish systems one can
>> > emulate that with mmap() and msync() (in batches obviously).
>> >
>> > No idea about windows.
>> The effective_io_concurrency feature had proof of concept test programs
>> that worked using AIO, but actually following through on that
>> implementation would require a major restructuring of how the database
>> interacts with the OS in terms of reads and writes of blocks.  It looks
>> to me like doing something similar to sync_file_range on Windows would
>> be similarly difficult.
> Looking a bit arround it seems one could achieve something approximediately
> similar to pg_prepare_fsync() by using
> CreateFileMapping && MapViewOfFile && FlushViewOfFile
>
> If I understand it correctly that will flush, but not wait. Unfortunately you
> cant event make it wait, so its not possible to implement sync_file_range or
> similar fully.

Well it seems that what we're trying to implement is more like
it_would_be_nice_if_you_would_start_syncing_this_file_range_but_its_ok_if_you_dont(),
so maybe that would work.

Anyway, is there something that we can agree on and get committed here
for 9.0, or should we postpone this to 9.1? It seems simple enough
that we ought to be able to get it done, but we're running out of time
and we don't seem to have a clear vision here yet...

...Robert


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-07 09:23:14
Message-ID: 4B6E8682.7030006@2ndquadrant.com
Lists: pgsql-hackers pgsql-performance

Robert Haas wrote:
> Well it seems that what we're trying to implement is more like
> it_would_be_nice_if_you_would_start_syncing_this_file_range_but_its_ok_if_you_dont(),
> so maybe that would work.
>
> Anyway, is there something that we can agree on and get committed here
> for 9.0, or should we postpone this to 9.1? It seems simple enough
> that we ought to be able to get it done, but we're running out of time
> and we don't seem to have a clear vision here yet...
>

This is turning into yet another one of those situations where something
simple and useful is being killed by trying to generalize it way more
than it needs to be, given its current goals and its lack of external
interfaces. There's no catversion bump or API breakage to hinder future
refactoring if this isn't optimally designed internally from day one.

The feature is valuable and there seems at least one spot where it may
be resolving the possibility of a subtle OS interaction bug by being
more thorough in the way that it writes and syncs. The main contention
seems to be over naming and completely optional additional abstraction.
I consider the whole "let's make this cover every type of complicated
sync on every platform" goal interesting and worthwhile, but it's
completely optional for this release. The stuff being fretted over now
is ultimately an internal interface that can be refactored at will in
later releases with no user impact.

If the goal here could be shifted back to finding the minimal level of
abstraction that doesn't seem completely wrong, then updating the
function names and comments to match that more closely, this could
return to committable. That's all I thought was left to do when I moved
it to "ready for committer", and as far as I've seen this expanded scope
of discussion has just moved backwards from that point.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-07 16:24:00
Message-ID: 24184.1265559840@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> This is turning into yet another one of those situations where something
> simple and useful is being killed by trying to generalize it way more
> than it needs to be, given its current goals and its lack of external
> interfaces. There's no catversion bump or API breakage to hinder future
> refactoring if this isn't optimally designed internally from day one.

I agree that it's too late in the cycle for any major redesign of the
patch. But is it too much to ask to use a less confusing name for the
function?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-07 18:23:10
Message-ID: 603c8f071002071023v6d5329at6f80b453f472ebb3@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Smith <greg(at)2ndquadrant(dot)com> writes:
>> This is turning into yet another one of those situations where something
>> simple and useful is being killed by trying to generalize it way more
>> than it needs to be, given its current goals and its lack of external
>> interfaces.  There's no catversion bump or API breakage to hinder future
>> refactoring if this isn't optimally designed internally from day one.
>
> I agree that it's too late in the cycle for any major redesign of the
> patch.  But is it too much to ask to use a less confusing name for the
> function?

+1. Let's just rename the thing, add some comments, and call it good.

...Robert


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-07 18:27:02
Message-ID: 201002071927.27795.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Sunday 07 February 2010 19:23:10 Robert Haas wrote:
> On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> >> This is turning into yet another one of those situations where something
> >> simple and useful is being killed by trying to generalize it way more
> >> than it needs to be, given its current goals and its lack of external
> >> interfaces. There's no catversion bump or API breakage to hinder future
> >> refactoring if this isn't optimally designed internally from day one.
> >
> > I agree that it's too late in the cycle for any major redesign of the
> > patch. But is it too much to ask to use a less confusing name for the
> > function?
>
> +1. Let's just rename the thing, add some comments, and call it good.
Will post an updated patch in the next few hours unless somebody beats me to it.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-08 01:31:42
Message-ID: 201002080231.47206.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Sunday 07 February 2010 19:27:02 Andres Freund wrote:
> On Sunday 07 February 2010 19:23:10 Robert Haas wrote:
> > On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> > >> This is turning into yet another one of those situations where
> > >> something simple and useful is being killed by trying to generalize
> > >> it way more than it needs to be, given its current goals and its lack
> > >> of external interfaces. There's no catversion bump or API breakage
> > >> to hinder future refactoring if this isn't optimally designed
> > >> internally from day one.
> > >
> > > I agree that it's too late in the cycle for any major redesign of the
> > > patch. But is it too much to ask to use a less confusing name for the
> > > function?
> >
> > +1. Let's just rename the thing, add some comments, and call it good.
>
> Will post an updated patch in the next few hours unless somebody beats me
> to it.
Here we go.

I kept my suggested name, pg_fsync_prepare, instead of Tom's
prepare_for_fsync because it seemed more consistent with the naming in the
rest of the file. Obviously feel free to adjust.

I personally think the fsync on the directory should be added to the stable
branches - other opinions?
If wanted I can prepare patches for that.

Andres

Attachment Content-Type Size
faster_createdb_v3.patch text/x-patch 5.8 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-08 03:09:01
Message-ID: 20100208030901.GA7407@alvh.no-ip.org
Lists: pgsql-hackers pgsql-performance

Andres Freund escribió:

> I personally think the fsync on the directory should be added to the stable
> branches - other opinions?
> If wanted I can prepare patches for that.

Yeah, it seems there are two patches here -- one is the addition of
fsync_fname() and the other is the fsync_prepare stuff.
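
For readers following along, the directory-fsync half can be sketched roughly
as below. This is only a minimal illustration, not the code from the patch:
the name fsync_dir_sketch is hypothetical, and error handling is reduced to
returning -1.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the fsync_fname() from the patch): flush a
 * directory's entries to stable storage, so that files created, renamed
 * or removed in it are not lost in a crash.
 * Returns 0 on success, -1 on failure.
 */
int
fsync_dir_sketch(const char *dirpath)
{
    int fd;
    int rc;

    assert(dirpath != NULL);

    /* A directory can be opened read-only and then handed to fsync(). */
    fd = open(dirpath, O_RDONLY);
    if (fd < 0)
        return -1;

    rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would invoke this once on the target directory after all files in it
have been written and fsynced.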

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-08 04:53:23
Message-ID: 603c8f071002072053w16897addtd52fcc4cf880df53@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Andres Freund escribió:
>> I personally think the fsync on the directory should be added to the stable
>> branches - other opinions?
>> If wanted I can prepare patches for that.
>
> Yeah, it seems there are two patches here -- one is the addition of
> fsync_fname() and the other is the fsync_prepare stuff.

Andres, you want to take a crack at splitting this up?

...Robert


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-08 07:13:41
Message-ID: 201002080813.51977.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Monday 08 February 2010 05:53:23 Robert Haas wrote:
> On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
>
> <alvherre(at)commandprompt(dot)com> wrote:
> > Andres Freund escribió:
> >> I personally think the fsync on the directory should be added to the
> >> stable branches - other opinions?
> >> If wanted I can prepare patches for that.
> >
> > Yeah, it seems there are two patches here -- one is the addition of
> > fsync_fname() and the other is the fsync_prepare stuff.
>
> Andres, you want to take a crack at splitting this up?
Will do. Later today or tomorrow morning.

Andres


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-08 18:34:01
Message-ID: 407d949e1002081034n3dd82addw2b93d35db042042c@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
>> Yeah, it seems there are two patches here -- one is the addition of
>> fsync_fname() and the other is the fsync_prepare stuff.

Sorry, I'm just catching up on my mail from FOSDEM this past weekend.

I had come to the same conclusion as Greg that I might as well just
commit it with Tom's "pg_flush_data()" name and we can decide later if
and when we have pg_fsync_start()/pg_fsync_finish() whether it's worth
keeping two apis or not.
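
The idea behind the flush call can be sketched as a two-pass scheme: while
copying, ask the kernel to start writing each file back without waiting for
it, and fsync everything only afterwards. The sketch below assumes the hint
is given with posix_fadvise() and POSIX_FADV_DONTNEED on a system that
provides it; the function names are hypothetical and this is not the
committed code.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600       /* for posix_fadvise() */
#endif

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Pass 1 (hypothetical name): hint that the given range should be written
 * back soon, without blocking the caller.  Assumption: POSIX_FADV_DONTNEED
 * serves as the hint.  Returns 0 on success or an error number, exactly as
 * posix_fadvise() does.
 */
int
flush_hint_sketch(int fd, off_t offset, off_t nbytes)
{
    assert(fd >= 0);
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
}

/*
 * Pass 2 (hypothetical name): once every file has been written and hinted,
 * make each one durable.  Returns 0 if all fsync() calls succeeded, else -1.
 */
int
sync_all_sketch(const int *fds, int nfds)
{
    int i;
    int rc = 0;

    assert(fds != NULL || nfds == 0);

    for (i = 0; i < nfds; i++)
    {
        if (fsync(fds[i]) != 0)
            rc = -1;
    }
    return rc;
}
```

The point of the split is that by the time pass 2 runs, much of the data may
already be on its way to disk, so the fsync calls have less left to wait for.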

So I was just going to commit it like that but I discovered last week
that I don't have cvs write access set up yet. I'll commit it as soon
as I generate a new ssh key and Dave installs it, etc. I intentionally
picked a small simple patch that nobody was waiting on because I knew
there was a risk of delays like this and the paperwork. I'm nearly
there.

--
greg


From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-08 19:29:46
Message-ID: 201002082029.55041.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Monday 08 February 2010 19:34:01 Greg Stark wrote:
> On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
> >
> >> Yeah, it seems there are two patches here -- one is the addition of
> >> fsync_fname() and the other is the fsync_prepare stuff.
>
> Sorry, I'm just catching up on my mail from FOSDEM this past weekend.
>
> I had come to the same conclusion as Greg that I might as well just
> commit it with Tom's "pg_flush_data()" name and we can decide later if
> and when we have pg_fsync_start()/pg_fsync_finish() whether it's worth
> keeping two apis or not.
>
> So I was just going to commit it like that but I discovered last week
> that I don't have cvs write access set up yet. I'll commit it as soon
> as I generate a new ssh key and Dave installs it, etc. I intentionally
> picked a small simple patch that nobody was waiting on because I knew
> there was a risk of delays like this and the paperwork. I'm nearly
> there.
Do you still want me to split the patches into two or do you want to do it
yourself?
One in multiple versions for the directory fsync and another one for 9.0?

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Michael Clemmons <glassresistor(at)gmail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Subject: Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-11 02:27:30
Message-ID: 201002110327.35810.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Monday 08 February 2010 05:53:23 Robert Haas wrote:
> On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
>
> <alvherre(at)commandprompt(dot)com> wrote:
> > Andres Freund escribió:
> >> I personally think the fsync on the directory should be added to the
> >> stable branches - other opinions?
> >> If wanted I can prepare patches for that.
> >
> > Yeah, it seems there are two patches here -- one is the addition of
> > fsync_fname() and the other is the fsync_prepare stuff.
>
> Andres, you want to take a crack at splitting this up?
I hope I didn't duplicate Greg's work, but I didn't hear back from him, so...

Everything <8.1 is hopeless because cp is used there... I didn't think it was
worth replacing that. The patch applies cleanly to 8.1 through 8.4 and
survives the regression tests.

Given pg's heavy commit model I didn't see a point in splitting the patch for
9.0 as well...

Andres

Attachment Content-Type Size
directory-fsync-8.1-to-8.4.patch text/x-patch 1.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <gsstark(at)mit(dot)edu>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-12 15:49:16
Message-ID: 603c8f071002120749o6a2e9e54h5cf21dc91d1790b9@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Feb 10, 2010 at 9:27 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Monday 08 February 2010 05:53:23 Robert Haas wrote:
>> On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
>>
>> <alvherre(at)commandprompt(dot)com> wrote:
>> > Andres Freund escribió:
>> >> I personally think the fsync on the directory should be added to the
>> >> stable branches - other opinions?
>> >> If wanted I can prepare patches for that.
>> >
>> > Yeah, it seems there are two patches here -- one is the addition of
>> > fsync_fname() and the other is the fsync_prepare stuff.
>>
>> Andres, you want to take a crack at splitting this up?
> I hope I didn't duplicate Greg's work, but I didn't hear back from him, so...
>
> Everything <8.1 is hopeless because cp is used there... I didn't think it was
> worth replacing that. The patch applies cleanly to 8.1 through 8.4 and
> survives the regression tests.
>
> Given pg's heavy commit model I didn't see a point in splitting the patch for
> 9.0 as well...

I'd probably argue for committing this patch to both HEAD and the
back-branches, and doing a second commit with the remaining stuff for
HEAD only, but I don't care very much.

Greg Stark, have you managed to get your access issues sorted out? If
you like, I can do the actual commit on this one.

...Robert


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-14 14:03:44
Message-ID: 407d949e1002140603v1c7515d7t697568a3865ac6fe@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Fri, Feb 12, 2010 at 3:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Greg Stark, have you managed to get your access issues sorted out?  If

Yep, will look at this today.

--
greg


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-14 15:31:58
Message-ID: 407d949e1002140731j85a4f97nc053ec1b0f3cc458@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sun, Feb 14, 2010 at 2:03 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Fri, Feb 12, 2010 at 3:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Greg Stark, have you managed to get your access issues sorted out?  If
>
> Yep, will look at this today.

So I think we have a bigger problem than just copydir.c. It seems to
me we should be fsyncing the table space data directories on every
checkpoint. Otherwise any newly created relations or removed relations
could disappear even though the data in them was fsynced. I'm thinking
I should add an _mdfd_opentblspc(reln) call which returns a file
descriptor for the tablespace and have mdsync() use that to sync the
directory whenever it fsyncs a relation. It would be nice to remember
which tablespaces have been fsynced and only fsync them once though,
that would need another hash table just for tablespaces.
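
The "fsync each tablespace directory only once per checkpoint" bookkeeping
could look something like the sketch below. It is only an illustration under
stated assumptions: a small fixed-size array stands in for the hash table
mentioned above, directories are identified by path, and all names are
hypothetical rather than taken from md.c.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAX_PENDING_DIRS 64
#define MAX_DIR_PATH     256

/* Hypothetical bookkeeping: directories touched since the last checkpoint. */
static char pending_dirs[MAX_PENDING_DIRS][MAX_DIR_PATH];
static int  n_pending_dirs = 0;

/*
 * Remember a directory, ignoring duplicates.
 * Returns 0 on success, -1 if the table is full or the path is too long.
 */
int
remember_dir_sketch(const char *dirpath)
{
    int i;

    assert(dirpath != NULL);

    if (strlen(dirpath) >= MAX_DIR_PATH)
        return -1;
    for (i = 0; i < n_pending_dirs; i++)
    {
        if (strcmp(pending_dirs[i], dirpath) == 0)
            return 0;           /* already queued for this checkpoint */
    }
    if (n_pending_dirs == MAX_PENDING_DIRS)
        return -1;
    strcpy(pending_dirs[n_pending_dirs++], dirpath);
    return 0;
}

/*
 * At checkpoint time, fsync every remembered directory exactly once and
 * reset the table.  Returns the number of directories synced, or -1 on the
 * first failure (the table is then left untouched for a retry).
 */
int
sync_pending_dirs_sketch(void)
{
    int i;
    int synced = 0;

    for (i = 0; i < n_pending_dirs; i++)
    {
        int fd = open(pending_dirs[i], O_RDONLY);

        if (fd < 0)
            return -1;
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);
        synced++;
    }
    n_pending_dirs = 0;
    return synced;
}
```

With this shape, many relations living in the same tablespace cost a single
directory fsync per checkpoint rather than one per relation.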

We probably also need to fsync the pg_xlog directory every time we
create or rename an xlog segment.

Are there any other places we do directory operations which we need to
be permanent?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-14 17:11:39
Message-ID: 22295.1266167499@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Greg Stark <gsstark(at)mit(dot)edu> writes:
> So I think we have a bigger problem than just copydir.c. It seems to
> me we should be fsyncing the table space data directories on every
> checkpoint.

Is there any evidence that anyone anywhere has ever lost data because
of a lack of directory fsyncs? I sure don't recall any bug reports
that seem to match that theory.

It seems to me that we're talking about a huge hit in both code
complexity and performance to deal with a problem that doesn't actually
occur in the field; and which furthermore is trivially solved on any
modern filesystem by choosing the right filesystem options. Why don't
we just document those options, instead?

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-14 17:27:00
Message-ID: 201002141827.04993.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Sunday 14 February 2010 18:11:39 Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > So I think we have a bigger problem than just copydir.c. It seems to
> > me we should be fsyncing the table space data directories on every
> > checkpoint.
>
> Is there any evidence that anyone anywhere has ever lost data because
> of a lack of directory fsyncs? I sure don't recall any bug reports
> that seem to match that theory.
I have actually seen the issue, during CREATE DATABASE at least. On
virtualized hardware, though...
~1GB template database, lots and lots of small tables, the crash occurred maybe
a minute after CREATE DB, filesystem was xfs, kernel 2.6.30.y.

> It seems to me that we're talking about a huge hit in both code
> complexity and performance to deal with a problem that doesn't actually
> occur in the field; and which furthermore is trivially solved on any
> modern filesystem by choosing the right filesystem options. Why don't
> we just document those options, instead?
Which options would those be? I am not aware that there are any for any of the
recent linux filesystems.
Well, except "sync" that is, but that sure would be more of a performance hit
than fsyncing the directory...

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-14 17:37:15
Message-ID: 22649.1266169035@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Andres Freund <andres(at)anarazel(dot)de> writes:
> On Sunday 14 February 2010 18:11:39 Tom Lane wrote:
>> It seems to me that we're talking about a huge hit in both code
>> complexity and performance to deal with a problem that doesn't actually
>> occur in the field; and which furthermore is trivially solved on any
>> modern filesystem by choosing the right filesystem options. Why don't
>> we just document those options, instead?

> Which options would those be? I am not aware that there are any for any of the
> recent linux filesystems.

Shouldn't journaling of metadata be sufficient?

regards, tom lane


From: Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync
Date: 2010-02-14 20:24:24
Message-ID: 87wryfpnuv.fsf@mid.deneb.enyo.de
Lists: pgsql-hackers pgsql-performance

* Tom Lane:

>> Which options would those be? I am not aware that there are any for any of the
>> recent linux filesystems.
>
> Shouldn't journaling of metadata be sufficient?

You also need to enforce ordering between the directory update and the
file update. The file metadata is flushed with fsync(), but the
directory isn't. On some systems, all directory operations are
synchronous, but not on Linux.


From: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To: Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync
Date: 2010-02-14 20:41:02
Message-ID: 4B785FDE.8040308@mark.mielke.cc
Lists: pgsql-hackers pgsql-performance

On 02/14/2010 03:24 PM, Florian Weimer wrote:
> * Tom Lane:
>
>>> Which options would those be? I am not aware that there are any for any of the
>>> recent linux filesystems.
>>>
>> Shouldn't journaling of metadata be sufficient?
>>
> You also need to enforce ordering between the directory update and the
> file update. The file metadata is flushed with fsync(), but the
> directory isn't. On some systems, all directory operations are
> synchronous, but not on Linux.
>

dirsync
    All directory updates within the filesystem should be done
    synchronously. This affects the following system calls: creat,
    link, unlink, symlink, mkdir, rmdir, mknod and rename.

The widely reported problems, though, did not tend to be a problem with
directory changes written too late - but directory changes being written
too early. That is, the directory change is written to disk, but the
file content is not. This is likely because of the "ordered journal"
mode widely used in ext3/ext4 where metadata changes are journalled, but
file pages are not journalled. Therefore, it is important for some
operations, that the file pages are pushed to disk using fsync(file),
before the metadata changes are journalled.

In theory there is some open hole where directory updates need to be
synchronized with file updates, as POSIX doesn't enforce this ordering,
and we can't trust that all file systems implicitly order things
correctly, but in practice, I don't see this sort of problem happening.

If you are concerned, enable dirsync.

Cheers,
mark

--
Mark Mielke <mark(at)mielke(dot)cc>


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>, Florian Weimer <fw(at)deneb(dot)enyo(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync
Date: 2010-02-14 20:49:09
Message-ID: 201002142149.12786.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Sunday 14 February 2010 21:41:02 Mark Mielke wrote:
> On 02/14/2010 03:24 PM, Florian Weimer wrote:
> > * Tom Lane:
> >>> Which options would those be? I am not aware that there are any for any
> >>> of the recent linux filesystems.
> >>
> >> Shouldn't journaling of metadata be sufficient?
> >
> > You also need to enforce ordering between the directory update and the
> > file update. The file metadata is flushed with fsync(), but the
> > directory isn't. On some systems, all directory operations are
> > synchronous, but not on Linux.
>
> dirsync
>     All directory updates within the filesystem should be done
>     synchronously. This affects the following system calls: creat,
>     link, unlink, symlink, mkdir, rmdir, mknod and rename.
>
> The widely reported problems, though, did not tend to be a problem with
> directory changes written too late - but directory changes being written
> too early. That is, the directory change is written to disk, but the
> file content is not. This is likely because of the "ordered journal"
> mode widely used in ext3/ext4 where metadata changes are journalled, but
> file pages are not journalled. Therefore, it is important for some
> operations, that the file pages are pushed to disk using fsync(file),
> before the metadata changes are journalled.
Well, but that's not a problem with pg as it fsyncs the file contents.

> In theory there is some open hole where directory updates need to be
> synchronized with file updates, as POSIX doesn't enforce this ordering,
> and we can't trust that all file systems implicitly order things
> correctly, but in practice, I don't see this sort of problem happening.
I can try to reproduce it if you want...

> If you are concerned, enable dirsync.
If the filesystem already behaves that way, an fsync on it should be fairly
cheap. If it doesn't behave that way, doing it is correct...

Besides there is no reason to fsync the directory before the checkpoint, so
dirsync would require a higher cost than doing it correctly.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-14 20:57:08
Message-ID: 603c8f071002141257g2dfb7a09i5c5930e9787783e1@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sun, Feb 14, 2010 at 10:31 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Sun, Feb 14, 2010 at 2:03 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> On Fri, Feb 12, 2010 at 3:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> Greg Stark, have you managed to get your access issues sorted out?  If
>>
>> Yep, will look at this today.
>
> So I think we have a bigger problem than just copydir.c. It seems to
> me we should be fsyncing the table space data directories on every
> checkpoint. Otherwise any newly created relations or removed relations
> could disappear even though the data in them was fsynced. I'm thinking
> I should add an _mdfd_opentblspc(reln) call which returns a file
> descriptor for the tablespace and have mdsync() use that to sync the
> directory whenever it fsyncs a relation. It would be nice to remember
> which tablespaces have been fsynced and only fsync them once though,
> that would need another hash table just for tablespaces.
>
> We probably also need to fsync the pg_xlog directory every time we
> create or rename an xlog segment.
>
> Are there any other places we do directory operations which we need to
> be permanent?

I agree with Tom that we need to see some actual reproducible test
cases where this is an issue before we go too crazy with it. In
theory what you're talking about could also happen when extending a
relation, if we extend into a new file; but I think we need to
convince ourselves that it really happens before we make any more
changes.

On a pragmatic note, if this does turn out to be a problem, it's a
bug: and we can and do fix bugs whenever we discover them. But the
other part of this patch - to speed up createdb - is a feature - and
we are very rapidly running out of time for 9.0 features. So I'd like
to vote for getting the feature part of this committed (assuming it's
in good shape, of course) and we can continue to investigate the other
issues but without quite as much urgency.

...Robert


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-14 21:43:23
Message-ID: 201002142243.24871.andres@anarazel.de
Lists: pgsql-hackers pgsql-performance

On Sunday 14 February 2010 21:57:08 Robert Haas wrote:
> On Sun, Feb 14, 2010 at 10:31 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Sun, Feb 14, 2010 at 2:03 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> >> On Fri, Feb 12, 2010 at 3:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>> Greg Stark, have you managed to get your access issues sorted out? If
> >>
> >> Yep, will look at this today.
> >
> > So I think we have a bigger problem than just copydir.c. It seems to
> > me we should be fsyncing the table space data directories on every
> > checkpoint. Otherwise any newly created relations or removed relations
> > could disappear even though the data in them was fsynced. I'm thinking
> > I should add an _mdfd_opentblspc(reln) call which returns a file
> > descriptor for the tablespace and have mdsync() use that to sync the
> > directory whenever it fsyncs a relation. It would be nice to remember
> > which tablespaces have been fsynced and only fsync them once though,
> > that would need another hash table just for tablespaces.
> >
> > We probably also need to fsync the pg_xlog directory every time we
> > create or rename an xlog segment.
> >
> > Are there any other places we do directory operations which we need to
> > be permanent?
>
> I agree with Tom that we need to see some actual reproducible test
> cases where this is an issue before we go too crazy with it. In
> theory what you're talking about could also happen when extending a
> relation, if we extend into a new file; but I think we need to
> convince ourselves that it really happens before we make any more
> changes.
Ok, will try to reproduce.

> On a pragmatic note, if this does turn out to be a problem, it's a
> bug: and we can and do fix bugs whenever we discover them. But the
> other part of this patch - to speed up createdb - is a feature - and
> we are very rapidly running out of time for 9.0 features. So I'd like
> to vote for getting the feature part of this committed (assuming it's
> in good shape, of course) and we can continue to investigate the other
> issues but without quite as much urgency.
Sounds sensible.

Andres


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Date: 2010-02-14 23:33:54
Message-ID: 407d949e1002141533m406d6ffev383908177ff5a18a@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sun, Feb 14, 2010 at 8:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On a pragmatic note, if this does turn out to be a problem, it's a
> bug: and we can and do fix bugs whenever we discover them.  But the
> other part of this patch - to speed up createdb - is a feature - and
> we are very rapidly running out of time for 9.0 features.  So I'd like
> to vote for getting the feature part of this committed (assuming it's
> in good shape, of course) and we can continue to investigate the other
> issues but without quite as much urgency.

No problem, I already committed the part that overlaps so I can commit
the rest now. I just want to take extra care given how much wine I've
already had tonight...

Incidentally, sorry Andres, I forgot to credit you in the first commit.
--
greg


From: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Weimer <fw(at)deneb(dot)enyo(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync
Date: 2010-02-15 00:08:10
Message-ID: 4B78906A.7020309@mark.mielke.cc
Lists: pgsql-hackers pgsql-performance

On 02/14/2010 03:49 PM, Andres Freund wrote:
> On Sunday 14 February 2010 21:41:02 Mark Mielke wrote:
>
>> The widely reported problems, though, did not tend to be a problem with
>> directory changes written too late - but directory changes being written
>> too early. That is, the directory change is written to disk, but the
>> file content is not. This is likely because of the "ordered journal"
>> mode widely used in ext3/ext4 where metadata changes are journalled, but
>> file pages are not journalled. Therefore, it is important for some
>> operations, that the file pages are pushed to disk using fsync(file),
>> before the metadata changes are journalled.
>>
> Well, but that's not a problem with pg as it fsyncs the file contents.
>

Exactly. Not a problem.

>> If you are concerned, enable dirsync.
>>
> If the filesystem already behaves that way, an fsync on it should be fairly
> cheap. If it doesn't behave that way, doing it is correct...
>

Well, I disagree, as the whole point of this thread is that fsync() is
*not* cheap. :-)

> Besides there is no reason to fsync the directory before the checkpoint, so
> dirsync would require a higher cost than doing it correctly.
>

Using "ordered" metadata journaling has approximately the same effect.
Provided that the data is fsync()'d before the metadata is required,
either the metadata is recorded in the journal, in which case the data
is accessible, or the metadata is NOT recorded in the journal, in which
case, the files will appear missing. The races that theoretically exist
would be in situations where the data of one file references a separate
file that does not yet exist.

You said you would try and reproduce - are you going to try and
reproduce on ext3/ext4 with ordered journalling enabled? I think
reproducing outside of a case such as CREATE DATABASE would be
difficult. It would have to be something like:

1. open(O_CREAT)/write()/fsync()/close() of new data file, where data
   gets written, but directory data is not yet written out to journal
2. open()/.../write()/fsync()/close() of existing file to point to new
   data file, but directory data is still not yet written out to journal
3. crash

In this case, "dirsync" should be effective at closing this hole.
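
For illustration only, here is a minimal Python sketch of the application-side alternative to relying on the mount option: fsync the new file, then fsync its parent directory, so the directory entry is on disk before any other file is made to point at it. This is not PostgreSQL's code; the function name durable_create and the file name are hypothetical, and it assumes a POSIX system where a directory can be opened read-only and passed to fsync().

```python
import os
import tempfile


def durable_create(path: str, data: bytes) -> None:
    """Create `path` with `data`, then fsync the file and its parent directory.

    Hypothetical helper for illustration; assumes POSIX semantics.
    """
    # O_EXCL: fail rather than overwrite, since this models a *new* data file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Push the file contents to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Push the directory entry itself to stable storage. Without this step
    # (or an equivalent guarantee from the filesystem), the new name may not
    # survive a crash even though the file contents were fsync'd.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "new_data_file")
        durable_create(target, b"payload")
        print(os.path.getsize(target))
```

Only after durable_create() returns would step 2 (updating the existing file to reference the new one) be safe without depending on the filesystem's directory-sync behaviour.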

As for cost? Well, most PostgreSQL data is stored within file content,
not directory metadata. I think "dirsync" might slow down some
operations like CREATE DATABASE or "rm -fr", but I would not expect it
to affect day-to-day performance of the database under real load. Many
operating systems enable the equivalent of "dirsync" by default. I
believe Solaris does this, for example, and other than slowing down "rm
-fr", I don't recall any real complaints about the cost of "dirsync".

After writing the above, I'm seriously considering adding "dirsync" to
my /db mounts that hold PostgreSQL and MySQL data.
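
As a rough sketch of how one might check whether a mount point already carries the option: the Python helper below scans text in the Linux /proc/mounts format and looks for "dirsync" in the options field. The helper name and the sample lines are hypothetical, and it ignores the octal escapes /proc/mounts uses for mount points containing spaces.

```python
def mount_has_dirsync(mounts_text: str, mount_point: str) -> bool:
    """Return True if the last entry for `mount_point` lists the dirsync option.

    Hypothetical helper; expects lines of the form
    "device mountpoint fstype options dump pass".
    """
    found = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Later entries shadow earlier ones for the same mount point,
            # so keep overwriting and let the last match win.
            found = "dirsync" in fields[3].split(",")
    return found


if __name__ == "__main__":
    # On Linux, the real data would come from /proc/mounts.
    with open("/proc/mounts") as f:
        print(mount_has_dirsync(f.read(), "/db"))
```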

Cheers,
mark

--
Mark Mielke <mark(at)mielke(dot)cc>