Re: fallocate / posix_fallocate for new WAL file creation (etc...)

Lists: pgsql-hackers
From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-14 01:54:39
Message-ID: CAKuK5J0raLwOiKfSh5d8SxtCY2snJAMsfo6RGTBMfcQYB+-faQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Pertinent to another thread titled
[HACKERS] corrupt pages detected by enabling checksums
I hope to explore the possibility of using fallocate (or
posix_fallocate) for new WAL file creation.

Most modern Linux filesystems support fast fallocate/posix_fallocate,
reducing extent fragmentation (where extents are used) and frequently
offering a pretty significant speed improvement. In my tests, using
posix_fallocate (followed by pg_fsync) is at least 28 times quicker
than using the current method (which writes zeroes followed by
pg_fsync).

I have written up a patch to use posix_fallocate in new WAL file
creation, including configuration by way of a GUC variable, but I've
not contributed to the PostgreSQL project before. Therefore, I'm
fairly certain the patch is not formatted properly or conforms to the
appropriate style guides. Currently, the patch is based on 9.2, and is
quite small in size - 3.6KiB.

Advice on how to proceed is appreciated.

--
Jon


From: David Fetter <david(at)fetter(dot)org>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-14 02:41:37
Message-ID: 20130514024137.GA24304@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, May 13, 2013 at 08:54:39PM -0500, Jon Nelson wrote:
> Pertinent to another thread titled
> [HACKERS] corrupt pages detected by enabling checksums
> I hope to explore the possibility of using fallocate (or
> posix_fallocate) for new WAL file creation.
>
> Most modern Linux filesystems support fast fallocate/posix_fallocate,
> reducing extent fragmentation (where extents are used) and frequently
> offering a pretty significant speed improvement. In my tests, using
> posix_fallocate (followed by pg_fsync) is at least 28 times quicker
> than using the current method (which writes zeroes followed by
> pg_fsync).
>
> I have written up a patch to use posix_fallocate in new WAL file
> creation, including configuration by way of a GUC variable, but I've
> not contributed to the PostgreSQL project before. Therefore, I'm
> fairly certain the patch is not formatted properly or conforms to the
> appropriate style guides. Currently, the patch is based on 9.2, and is
> quite small in size - 3.6KiB.
>
> Advice on how to proceed is appreciated.

Thanks for hopping in!

Please re-base the patch vs. git master, as new features like this go
there. Please also to send along the tests you're doing so others can
riff. Tests that find any weak points are also good.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-15 02:43:06
Message-ID: CA+TgmoZUH69T-uDLE7zUfKcP=sUe07zfj8prMWtsGES9h1XTzg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, May 13, 2013 at 9:54 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> Pertinent to another thread titled
> [HACKERS] corrupt pages detected by enabling checksums
> I hope to explore the possibility of using fallocate (or
> posix_fallocate) for new WAL file creation.
>
> Most modern Linux filesystems support fast fallocate/posix_fallocate,
> reducing extent fragmentation (where extents are used) and frequently
> offering a pretty significant speed improvement. In my tests, using
> posix_fallocate (followed by pg_fsync) is at least 28 times quicker
> than using the current method (which writes zeroes followed by
> pg_fsync).
>
> I have written up a patch to use posix_fallocate in new WAL file
> creation, including configuration by way of a GUC variable, but I've
> not contributed to the PostgreSQL project before. Therefore, I'm
> fairly certain the patch is not formatted properly or conforms to the
> appropriate style guides. Currently, the patch is based on 9.2, and is
> quite small in size - 3.6KiB.
>
> Advice on how to proceed is appreciated.

Make sure to list it here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-15 21:26:15
Message-ID: CAKuK5J151q9viPA1L3C=7DopApN3HZNpg+Ze5JEyGyPSxjDH5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, May 14, 2013 at 9:43 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, May 13, 2013 at 9:54 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>> Pertinent to another thread titled
>> [HACKERS] corrupt pages detected by enabling checksums
>> I hope to explore the possibility of using fallocate (or
>> posix_fallocate) for new WAL file creation.
>>
>> Most modern Linux filesystems support fast fallocate/posix_fallocate,
>> reducing extent fragmentation (where extents are used) and frequently
>> offering a pretty significant speed improvement. In my tests, using
>> posix_fallocate (followed by pg_fsync) is at least 28 times quicker
>> than using the current method (which writes zeroes followed by
>> pg_fsync).
>>
>> I have written up a patch to use posix_fallocate in new WAL file
>> creation, including configuration by way of a GUC variable, but I've
>> not contributed to the PostgreSQL project before. Therefore, I'm
>> fairly certain the patch is not formatted properly or conforms to the
>> appropriate style guides. Currently, the patch is based on 9.2, and is
>> quite small in size - 3.6KiB.

I have re-based and reformatted the code, and basic testing shows a
reduction in WAL-file creation time of a fairly significant amount.
I ran 'make test' and did additional local testing without issue.
Therefore, I am attaching the patch. I will try to add it to the
commitfest page.

--
Jon

Attachment Content-Type Size
0001-enhance-GUC-and-xlog-with-wal_use_fallocate-boolean-.patch application/octet-stream 5.8 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-15 21:34:45
Message-ID: 20130515213445.GB22783@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-05-15 16:26:15 -0500, Jon Nelson wrote:
> >> I have written up a patch to use posix_fallocate in new WAL file
> >> creation, including configuration by way of a GUC variable, but I've
> >> not contributed to the PostgreSQL project before. Therefore, I'm
> >> fairly certain the patch is not formatted properly or conforms to the
> >> appropriate style guides. Currently, the patch is based on 9.2, and is
> >> quite small in size - 3.6KiB.
>
> I have re-based and reformatted the code, and basic testing shows a
> reduction in WAL-file creation time of a fairly significant amount.
> I ran 'make test' and did additional local testing without issue.
> Therefore, I am attaching the patch. I will try to add it to the
> commitfest page.

Some where quick comments, without thinking about this:

* needs a configure check for posix_fallocate. The current version will
e.g. fail to compile on windows or many other non linux systems. Check
how its done for posix_fadvise.
* Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?
* According to the man page posix_fallocate doesn't set errno but rather
returns the error code.
* I wonder whether we ever want to actually disable this? Afair the libc
contains emulation for posix_fadvise if the filesystem doesn't support
it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-15 21:46:33
Message-ID: CAKuK5J1wgybQs_YQms60+8NYsuA5A8sACvhSAhVtPFihjmdGhA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 15, 2013 at 4:34 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> On 2013-05-15 16:26:15 -0500, Jon Nelson wrote:
>> >> I have written up a patch to use posix_fallocate in new WAL file
>> >> creation, including configuration by way of a GUC variable, but I've
>> >> not contributed to the PostgreSQL project before. Therefore, I'm
>> >> fairly certain the patch is not formatted properly or conforms to the
>> >> appropriate style guides. Currently, the patch is based on 9.2, and is
>> >> quite small in size - 3.6KiB.
>>
>> I have re-based and reformatted the code, and basic testing shows a
>> reduction in WAL-file creation time of a fairly significant amount.
>> I ran 'make test' and did additional local testing without issue.
>> Therefore, I am attaching the patch. I will try to add it to the
>> commitfest page.
>
> Some where quick comments, without thinking about this:

Thank you for the kind feedback.

> * needs a configure check for posix_fallocate. The current version will
> e.g. fail to compile on windows or many other non linux systems. Check
> how its done for posix_fadvise.

I will address as soon as I am able.

> * Is wal file creation performance actually relevant? Is the performance
> of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

However, a file allocated with fallocate is (supposed to be) less
fragmented than one created by the traditional means.

> * According to the man page posix_fallocate doesn't set errno but rather
> returns the error code.

That's true. I originally wrote the patch using fallocate(2). What
would be appropriate here? Should I switch on the return value and the
six (6) or so relevant error codes?

> * I wonder whether we ever want to actually disable this? Afair the libc
> contains emulation for posix_fadvise if the filesystem doesn't support
> it.

I know that glibc does, but I don't know about other libc implementations.

--
Jon


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-16 02:14:33
Message-ID: CAKuK5J3WtHSCYnhvtYP3u6VwxuGyijO4383Du03qs8R6avOG1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> On Wed, May 15, 2013 at 4:34 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
..
>> Some where quick comments, without thinking about this:
>
> Thank you for the kind feedback.
>
>> * needs a configure check for posix_fallocate. The current version will
>> e.g. fail to compile on windows or many other non linux systems. Check
>> how its done for posix_fadvise.

The following patch includes the changes to configure.in.
I had to make other changes (not included here) because my local
system uses autoconf 2.69, but I did test this successfully.

> That's true. I originally wrote the patch using fallocate(2). What
> would be appropriate here? Should I switch on the return value and the
> six (6) or so relevant error codes?

I addressed this, hopefully in a reasonable way.

--
Jon

Attachment Content-Type Size
fallocate.patch-v2 application/octet-stream 6.9 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-16 03:17:05
Message-ID: 20130516031705.GE15045@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jon Nelson escribió:
> On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:

> > That's true. I originally wrote the patch using fallocate(2). What
> > would be appropriate here? Should I switch on the return value and the
> > six (6) or so relevant error codes?
>
> I addressed this, hopefully in a reasonable way.

Would it work to just assign the value you got from posix_fallocate (if
nonzero) to errno and then use %m in the errmsg() call in ereport()?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-16 03:36:36
Message-ID: CAKuK5J14rxgd259KK6Xd0HmR+90o3hpi++NFZfh5n84rrGZ5_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Jon Nelson escribió:
>> On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>
>> > That's true. I originally wrote the patch using fallocate(2). What
>> > would be appropriate here? Should I switch on the return value and the
>> > six (6) or so relevant error codes?
>>
>> I addressed this, hopefully in a reasonable way.
>
> Would it work to just assign the value you got from posix_fallocate (if
> nonzero) to errno and then use %m in the errmsg() call in ereport()?

That strikes me as a better way. I'll work something up soon.
Thanks!

--
Jon


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-16 13:16:33
Message-ID: CAKuK5J0D8pb223Y1tq7VAO-pixadG8NXjf3gpFtMb8QGVgJ4+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 15, 2013 at 10:36 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
>> Jon Nelson escribió:
>>> On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>>
>>> > That's true. I originally wrote the patch using fallocate(2). What
>>> > would be appropriate here? Should I switch on the return value and the
>>> > six (6) or so relevant error codes?
>>>
>>> I addressed this, hopefully in a reasonable way.
>>
>> Would it work to just assign the value you got from posix_fallocate (if
>> nonzero) to errno and then use %m in the errmsg() call in ereport()?
>
> That strikes me as a better way. I'll work something up soon.
> Thanks!

Please find attached version 3.
Am I doing this the right way? Should I be posting the full patch each
time, or incremental patches?

--
Jon

Attachment Content-Type Size
fallocate-v3.patch application/octet-stream 6.4 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-16 16:54:18
Message-ID: 20130516165418.GG15045@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jon Nelson escribió:

> Am I doing this the right way? Should I be posting the full patch each
> time, or incremental patches?

Full patch each time is okay. Context-format patch is even better.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-17 00:05:31
Message-ID: 5195744B.2080006@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/16/13 9:16 AM, Jon Nelson wrote:
> Am I doing this the right way? Should I be posting the full patch each
> time, or incremental patches?

There are guidelines for getting your patch in the right format at
https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
that would improve this one. You have some formatting issues with tab
spacing at lines 120 through 133 in your v3 patch. And it looks like
there was a formatting change on line 146 that is making the diff larger
than it needs to be.

The biggest thing missing from this submission is information about what
performance testing you did. Ideally performance patches are submitted
with enough information for a reviewer to duplicate the same test the
author did, as well as hard before/after performance numbers from your
test system. It often turns tricky to duplicate a performance gain, and
being able to run the same test used for initial development eliminates
a lot of the problems.

Second bit of nitpicking. There are already some GUC values that appear
or disappear based on compile time options. They're all debugging
related things though. I would prefer not to see this one go away when
it's implementation isn't available. That's going to break any scripts
that SHOW the setting to see if it's turned on or not as a first
problem. I think the right model to follow here is the IFDEF setup used
for effective_io_concurrency. I wouldn't worry about this too much
though. Having a wal_use_fallocate GUC is good for testing. But if it
works out well, when it's ready for commit I don't see why anyone would
want it turned off on platforms where it works. There are already too
many performance tweaking GUCs. Something has to be very likely to be
changed from the default before its worth adding one for it.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-17 09:47:31
Message-ID: 20130517094731.GB6616@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
> > * Is wal file creation performance actually relevant? Is the performance
> > of a system running on fallocate()d wal files any different?
>
> In my limited testing, I noticed a drop of approx. 100ms per WAL file.
> I do not have a good idea for how to really stress the WAL-file
> creation area without calling pg_start_backup and pg_stop_backup over
> and over (with archiving enabled).

My point is that wal file creation usually isn't all that performance
sensitive. Once the cluster has enough WAL files it will usually recycle
them and thus never allocate new ones. So for this to be really
beneficial it would be interesting to show different performance during
normal running. You could also check out of how many extents a wal file
is made out of with fallocate in comparison to the old style method
(filefrag will give you that for most filesystems).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-17 13:29:29
Message-ID: CAHyXU0z2+TJ0cRMOFv8RDAzebGVJJJQVs8QykjnS7-bdgXLy6A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
>> > * Is wal file creation performance actually relevant? Is the performance
>> > of a system running on fallocate()d wal files any different?
>>
>> In my limited testing, I noticed a drop of approx. 100ms per WAL file.
>> I do not have a good idea for how to really stress the WAL-file
>> creation area without calling pg_start_backup and pg_stop_backup over
>> and over (with archiving enabled).
>
> My point is that wal file creation usually isn't all that performance
> sensitive. Once the cluster has enough WAL files it will usually recycle
> them and thus never allocate new ones. So for this to be really
> beneficial it would be interesting to show different performance during
> normal running. You could also check out of how many extents a wal file
> is made out of with fallocate in comparison to the old style method
> (filefrag will give you that for most filesystems).

But why does it have to be *really* beneficial? We're already making
optional posix_fxxx calls and fallocate seems to do exactly what we
would want in this context. Even if the 100ms drop doesn't show up
all that often, I'd still take it just for the defragmentation
benefits and the patch is fairly tiny.

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-17 20:48:38
Message-ID: CAHyXU0w9s-ibu2TKDRwPzStDVi6ZMk+xveHKjoFay=gRDQ3SxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
>>> > * Is wal file creation performance actually relevant? Is the performance
>>> > of a system running on fallocate()d wal files any different?
>>>
>>> In my limited testing, I noticed a drop of approx. 100ms per WAL file.
>>> I do not have a good idea for how to really stress the WAL-file
>>> creation area without calling pg_start_backup and pg_stop_backup over
>>> and over (with archiving enabled).
>>
>> My point is that wal file creation usually isn't all that performance
>> sensitive. Once the cluster has enough WAL files it will usually recycle
>> them and thus never allocate new ones. So for this to be really
>> beneficial it would be interesting to show different performance during
>> normal running. You could also check out of how many extents a wal file
>> is made out of with fallocate in comparison to the old style method
>> (filefrag will give you that for most filesystems).
>
> But why does it have to be *really* beneficial? We're already making
> optional posix_fxxx calls and fallocate seems to do exactly what we
> would want in this context. Even if the 100ms drop doesn't show up
> all that often, I'd still take it just for the defragmentation
> benefits and the patch is fairly tiny.

Here is sample output of filefrag on a somewhat busy database from our
testing environment that exactly duplicates our production workloads..
It does a lot of batch processing at night and a mix of 80%oltp 20%
olap during the day. This is on ext3. Interestingly, on ext4 servers
I never saw more than 2 extents per file (but those servers are mostly
not as busy).

[root(at)rpisatysw001 pg_xlog]# filefrag *
00000001000006D200000064: 490 extents found, perfection would be 1 extent
00000001000006D200000065: 33 extents found, perfection would be 1 extent
00000001000006D200000066: 43 extents found, perfection would be 1 extent
00000001000006D200000067: 71 extents found, perfection would be 1 extent
00000001000006D200000068: 43 extents found, perfection would be 1 extent
00000001000006D200000069: 156 extents found, perfection would be 1 extent
00000001000006D20000006A: 52 extents found, perfection would be 1 extent
00000001000006D20000006B: 108 extents found, perfection would be 1 extent

merlin


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-17 21:18:26
Message-ID: 20130517211826.GA19654@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-17 15:48:38 -0500, Merlin Moncure wrote:
> On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> > On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
> >>> > * Is wal file creation performance actually relevant? Is the performance
> >>> > of a system running on fallocate()d wal files any different?
> >>>
> >>> In my limited testing, I noticed a drop of approx. 100ms per WAL file.
> >>> I do not have a good idea for how to really stress the WAL-file
> >>> creation area without calling pg_start_backup and pg_stop_backup over
> >>> and over (with archiving enabled).
> >>
> >> My point is that wal file creation usually isn't all that performance
> >> sensitive. Once the cluster has enough WAL files it will usually recycle
> >> them and thus never allocate new ones. So for this to be really
> >> beneficial it would be interesting to show different performance during
> >> normal running. You could also check out of how many extents a wal file
> >> is made out of with fallocate in comparison to the old style method
> >> (filefrag will give you that for most filesystems).
> >
> > But why does it have to be *really* beneficial? We're already making
> > optional posix_fxxx calls and fallocate seems to do exactly what we
> > would want in this context. Even if the 100ms drop doesn't show up
> > all that often, I'd still take it just for the defragmentation
> > benefits and the patch is fairly tiny.

Well, it needs to be tested et al. And its a fairly critical code
path. I seem to remember that there were older glibc versions that
didn't do such a great job at emulating fallocate for example.

> Here is sample output of filefrag on a somewhat busy database from our
> testing environment that exactly duplicates our production workloads..
> It does a lot of batch processing at night and a mix of 80%oltp 20%
> olap during the day. This is on ext3. Interestingly, on ext4 servers
> I never saw more than 2 extents per file (but those servers are mostly
> not as busy).

Ok, that's pretty bad. 490 extents in one file? Really? I'd consider
shutting down the cluster, copying the wal files in a moment where there
is enough free space. Just don't forget to sync afterwards.
EXT4 is notably better at allocating space in growing files than ext3
due to delayed allocation (and other things), so it wouldn't surprise me
similar differences in fragmentation even if the load were comparable.

Ext3 doesn't have fallocate btw, so it wouldn't benefit from such a
patch anyway.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-17 21:52:45
Message-ID: CAHyXU0zwic2=qA9GFv6S4upUGXVWvRLwZNRqJ_v6U+ydBBc_Tg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 17, 2013 at 4:18 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-05-17 15:48:38 -0500, Merlin Moncure wrote:
>> On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> > On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
>> >>> > * Is wal file creation performance actually relevant? Is the performance
>> >>> > of a system running on fallocate()d wal files any different?
>> >>>
>> >>> In my limited testing, I noticed a drop of approx. 100ms per WAL file.
>> >>> I do not have a good idea for how to really stress the WAL-file
>> >>> creation area without calling pg_start_backup and pg_stop_backup over
>> >>> and over (with archiving enabled).
>> >>
>> >> My point is that wal file creation usually isn't all that performance
>> >> sensitive. Once the cluster has enough WAL files it will usually recycle
>> >> them and thus never allocate new ones. So for this to be really
>> >> beneficial it would be interesting to show different performance during
>> >> normal running. You could also check out of how many extents a wal file
>> >> is made out of with fallocate in comparison to the old style method
>> >> (filefrag will give you that for most filesystems).
>> >
>> > But why does it have to be *really* beneficial? We're already making
>> > optional posix_fxxx calls and fallocate seems to do exactly what we
>> > would want in this context. Even if the 100ms drop doesn't show up
>> > all that often, I'd still take it just for the defragmentation
>> > benefits and the patch is fairly tiny.
>
> Well, it needs to be tested et al. And its a fairly critical code
> path. I seem to remember that there were older glibc versions that
> didn't do such a great job at emulating fallocate for example.
>
>> Here is sample output of filefrag on a somewhat busy database from our
>> testing environment that exactly duplicates our production workloads..
>> It does a lot of batch processing at night and a mix of 80%oltp 20%
>> olap during the day. This is on ext3. Interestingly, on ext4 servers
>> I never saw more than 2 extents per file (but those servers are mostly
>> not as busy).
>
> Ok, that's pretty bad. 490 extents in one file? Really? I'd consider
> shutting down the cluster, copying the wal files in a moment where there
> is enough free space. Just don't forget to sync afterwards.
> EXT4 is notably better at allocating space in growing files than ext3
> due to delayed allocation (and other things), so it wouldn't surprise me
> similar differences in fragmentation even if the load were comparable.
>
> Ext3 doesn't have fallocate btw, so it wouldn't benefit from such a
> patch anyway.

yeah -- I see your point. The object lesson isn't so much 'improve
postgres' as it is to 'use a modern filesystem'.

merlin


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-25 18:55:09
Message-ID: CAKuK5J3UPsP64G1f3GSiQyx7W2229yffjPZ4_ctGqvLh2khL+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 16, 2013 at 7:05 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 5/16/13 9:16 AM, Jon Nelson wrote:
>>
>> Am I doing this the right way? Should I be posting the full patch each
>> time, or incremental patches?
>
>
> There are guidelines for getting your patch in the right format at
> https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
> that would improve this one. You have some formatting issues with tab
> spacing at lines 120 through 133 in your v3 patch. And it looks like there
> was a formatting change on line 146 that is making the diff larger than it
> needs to be.

I've corrected the formatting change (end-of-line whitespace was
stripped) on line 146.
The other whitespace changes are - I think - due to newly-indented
code due to a new code block.
Included please find a v4 patch which uses context diffs per the above url.

> The biggest thing missing from this submission is information about what
> performance testing you did. Ideally performance patches are submitted with
> enough information for a reviewer to duplicate the same test the author did,
> as well as hard before/after performance numbers from your test system. It
> often turns tricky to duplicate a performance gain, and being able to run
> the same test used for initial development eliminates a lot of the problems.

This has been a bit of a struggle. While it's true that WAL file
creation doesn't happen with great frequency, and while it's also true
that - with strace and other tests - it can be proven that
fallocate(16MB) is much quicker than writing it zeroes by hand,
proving that in the larger context of a running install has been
challenging.

Attached you'll find a small test script (t.sh) which creates a new
cluster in 'foo', changes some config values, starts the cluster, and
then times how long it takes pgbench to prepare a database. I've used
"wal_level = hot_standby" in the hopes that this generates the largest
number of WAL files (and I set the number of such files to 1024). The
hardware is an AMD 9150e with a 2-disk software RAID1 (SATA disks) on
kernel 3.9.2 and ext4 (x86_64, openSUSE 12.3). The test results are
not that surprising. The longer the test (the larger the scale factor)
the less of a difference using posix_fallocate makes. With a scale
factor of 100, I see an average of 10-11% reduction in the time taken
to initialize the database. With 300, it's about 5.5% and with 900,
it's between 0 and 1.2%. I will be doing more testing but this is what
I started with. I'm very open to suggestions.

> Second bit of nitpicking. There are already some GUC values that appear or
> disappear based on compile time options. They're all debugging related
> things though. I would prefer not to see this one go away when it's
> implementation isn't available. That's going to break any scripts that SHOW
> the setting to see if it's turned on or not as a first problem. I think the
> right model to follow here is the IFDEF setup used for
> effective_io_concurrency. I wouldn't worry about this too much though.
> Having a wal_use_fallocate GUC is good for testing. But if it works out
> well, when it's ready for commit I don't see why anyone would want it turned
> off on platforms where it works. There are already too many performance
> tweaking GUCs. Something has to be very likely to be changed from the
> default before its worth adding one for it.

Ack. I've revised the patch to always have the GUC (for now), default
to false, and if configure can't find posix_fallocate (or the user
disables it by way of pg_config_manual.h) then it remains a GUC that
simply can't be changed.

I'll also be re-running the tests.

--
Jon

Attachment Content-Type Size
image/png 7.6 KB
fallocate-v4.patch application/octet-stream 7.2 KB
t.sh application/x-sh 952 bytes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-28 14:03:58
Message-ID: CA+TgmoYPiNEXonznqH4gsTSob2jcF7T1Qqf+XPWsdUeZgPp0zA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>> The biggest thing missing from this submission is information about what
>> performance testing you did. Ideally performance patches are submitted with
>> enough information for a reviewer to duplicate the same test the author did,
>> as well as hard before/after performance numbers from your test system. It
>> often turns tricky to duplicate a performance gain, and being able to run
>> the same test used for initial development eliminates a lot of the problems.
>
> This has been a bit of a struggle. While it's true that WAL file
> creation doesn't happen with great frequency, and while it's also true
> that - with strace and other tests - it can be proven that
> fallocate(16MB) is much quicker than writing it zeroes by hand,
> proving that in the larger context of a running install has been
> challenging.

It's nice to be able to test things in the context of a running
install, but sometimes a microbenchmark is just as good. I mean, if
posix_fallocate() is faster, then it's just faster, right? It's
likely to be pretty hard to get reproducible numbers for how much this
actually helps in the real world because write tests are inherently
pretty variable depending on a lot of factors we don't control, so
even if Jon has got the best possible test, the numbers may bounce
around so much that you can't really measure the (probably small) gain
from this approach. But that doesn't seem like a reason not to adopt
the approach and take whatever gain there is. At least, not that I
can see.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-28 14:15:24
Message-ID: 20130528141524.GC4274@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> >> The biggest thing missing from this submission is information about what
> >> performance testing you did. Ideally performance patches are submitted with
> >> enough information for a reviewer to duplicate the same test the author did,
> >> as well as hard before/after performance numbers from your test system. It
> >> often turns tricky to duplicate a performance gain, and being able to run
> >> the same test used for initial development eliminates a lot of the problems.
> >
> > This has been a bit of a struggle. While it's true that WAL file
> > creation doesn't happen with great frequency, and while it's also true
> > that - with strace and other tests - it can be proven that
> > fallocate(16MB) is much quicker than writing it zeroes by hand,
> > proving that in the larger context of a running install has been
> > challenging.
>
> It's nice to be able to test things in the context of a running
> install, but sometimes a microbenchmark is just as good. I mean, if
> posix_fallocate() is faster, then it's just faster, right?

Well, it's a bit more complex than that. Fallocate doesn't actually
initializes the disk space in most filesystems, just marks it as
allocated and zeroed which is one of the reasons it can be noticeably
faster. But that can make the runtime overhead of writing to those pages
higher.

I wonder whether noticeably upping checkpoint segments and then
a) COPY in a large table
b) a pgbench on a previously initialized table.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-28 14:19:47
Message-ID: CA+TgmobutWHS9T0MxNJF-ZrNG7wBJYhR-HtiqmPtoKaqn7=HAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, May 28, 2013 at 10:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
>> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>> >> The biggest thing missing from this submission is information about what
>> >> performance testing you did. Ideally performance patches are submitted with
>> >> enough information for a reviewer to duplicate the same test the author did,
>> >> as well as hard before/after performance numbers from your test system. It
>> >> often turns tricky to duplicate a performance gain, and being able to run
>> >> the same test used for initial development eliminates a lot of the problems.
>> >
>> > This has been a bit of a struggle. While it's true that WAL file
>> > creation doesn't happen with great frequency, and while it's also true
>> > that - with strace and other tests - it can be proven that
>> > fallocate(16MB) is much quicker than writing it zeroes by hand,
>> > proving that in the larger context of a running install has been
>> > challenging.
>>
>> It's nice to be able to test things in the context of a running
>> install, but sometimes a microbenchmark is just as good. I mean, if
>> posix_fallocate() is faster, then it's just faster, right?
>
> Well, it's a bit more complex than that. Fallocate doesn't actually
> initializes the disk space in most filesystems, just marks it as
> allocated and zeroed which is one of the reasons it can be noticeably
> faster. But that can make the runtime overhead of writing to those pages
> higher.

Maybe it would be good to measure that impact. Something like this:

1. Write 16MB of zeroes to an empty file in the same size chunks we're
currently using (8kB?). Time that. Rewrite the file with real data.
Time that.
2. posix_fallocate() an empty file out to 16MB. Time that. Rewrite
the fie with real data. Time that.

Personally, I have trouble believing that writing 16MB of zeroes by
hand is "better" than telling the OS to do it for us. If that's so,
the OS is just stupid, because it ought to be able to optimize the
crap out of that compared to anything we can do. Of course, it is
more than possible that the OS is in fact stupid. But I'd like to
hope not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-28 15:12:05
Message-ID: CAKuK5J2juNZRwsiR0+S4iNQLCmCWFErNzkD7SJeCPPJY+_zefA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, May 28, 2013 at 9:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, May 28, 2013 at 10:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
>>> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>>> >> The biggest thing missing from this submission is information about what
>>> >> performance testing you did. Ideally performance patches are submitted with
>>> >> enough information for a reviewer to duplicate the same test the author did,
>>> >> as well as hard before/after performance numbers from your test system. It
>>> >> often turns tricky to duplicate a performance gain, and being able to run
>>> >> the same test used for initial development eliminates a lot of the problems.
>>> >
>>> > This has been a bit of a struggle. While it's true that WAL file
>>> > creation doesn't happen with great frequency, and while it's also true
>>> > that - with strace and other tests - it can be proven that
>>> > fallocate(16MB) is much quicker than writing it zeroes by hand,
>>> > proving that in the larger context of a running install has been
>>> > challenging.
>>>
>>> It's nice to be able to test things in the context of a running
>>> install, but sometimes a microbenchmark is just as good. I mean, if
>>> posix_fallocate() is faster, then it's just faster, right?
>>
>> Well, it's a bit more complex than that. Fallocate doesn't actually
>> initializes the disk space in most filesystems, just marks it as
>> allocated and zeroed which is one of the reasons it can be noticeably
>> faster. But that can make the runtime overhead of writing to those pages
>> higher.
>
> Maybe it would be good to measure that impact. Something like this:
>
> 1. Write 16MB of zeroes to an empty file in the same size chunks we're
> currently using (8kB?). Time that. Rewrite the file with real data.
> Time that.
> 2. posix_fallocate() an empty file out to 16MB. Time that. Rewrite
> the fie with real data. Time that.
>
> Personally, I have trouble believing that writing 16MB of zeroes by
> hand is "better" than telling the OS to do it for us. If that's so,
> the OS is just stupid, because it ought to be able to optimize the
> crap out of that compared to anything we can do. Of course, it is
> more than possible that the OS is in fact stupid. But I'd like to
> hope not.

I wrote a little C program to do something very similar to that (which
I'll hope to post later today).
It opens a new file, fallocates 16MB, calls fdatasync. Then it loops
10 times: seek to the start of the file, writes 16MB of ones, calls
fdatasync.
Then it closes and removes the file, re-opens it, and this time writes
out 16MB of zeroes, calls fdatasync, and then does the same loop as
above. The program times the process from file open to file unlink,
inclusive.

The results - for me - are pretty consistent: using fallocate is
12-13% quicker than writing out zeroes. I used fdatasync twice to
(attempt) to mimic what the WAL writer does.

--
Jon


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-28 15:21:05
Message-ID: 20130528152105.GB16637@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-28 10:12:05 -0500, Jon Nelson wrote:
> On Tue, May 28, 2013 at 9:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Tue, May 28, 2013 at 10:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
> >>> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> >>> >> The biggest thing missing from this submission is information about what
> >>> >> performance testing you did. Ideally performance patches are submitted with
> >>> >> enough information for a reviewer to duplicate the same test the author did,
> >>> >> as well as hard before/after performance numbers from your test system. It
> >>> >> often turns tricky to duplicate a performance gain, and being able to run
> >>> >> the same test used for initial development eliminates a lot of the problems.
> >>> >
> >>> > This has been a bit of a struggle. While it's true that WAL file
> >>> > creation doesn't happen with great frequency, and while it's also true
> >>> > that - with strace and other tests - it can be proven that
> >>> > fallocate(16MB) is much quicker than writing it zeroes by hand,
> >>> > proving that in the larger context of a running install has been
> >>> > challenging.
> >>>
> >>> It's nice to be able to test things in the context of a running
> >>> install, but sometimes a microbenchmark is just as good. I mean, if
> >>> posix_fallocate() is faster, then it's just faster, right?
> >>
> >> Well, it's a bit more complex than that. Fallocate doesn't actually
> >> initializes the disk space in most filesystems, just marks it as
> >> allocated and zeroed which is one of the reasons it can be noticeably
> >> faster. But that can make the runtime overhead of writing to those pages
> >> higher.
> >
> > Maybe it would be good to measure that impact. Something like this:
> >
> > 1. Write 16MB of zeroes to an empty file in the same size chunks we're
> > currently using (8kB?). Time that. Rewrite the file with real data.
> > Time that.
> > 2. posix_fallocate() an empty file out to 16MB. Time that. Rewrite
> > the fie with real data. Time that.
> >
> > Personally, I have trouble believing that writing 16MB of zeroes by
> > hand is "better" than telling the OS to do it for us. If that's so,
> > the OS is just stupid, because it ought to be able to optimize the
> > crap out of that compared to anything we can do. Of course, it is
> > more than possible that the OS is in fact stupid. But I'd like to
> > hope not.
>
> I wrote a little C program to do something very similar to that (which
> I'll hope to post later today).
> It opens a new file, fallocates 16MB, calls fdatasync. Then it loops
> 10 times: seek to the start of the file, writes 16MB of ones, calls
> fdatasync.

You need to call fsync() not fdatasync() the first time round. fdatasync
doesn't guarantee metadata is synced.

> Then it closes and removes the file, re-opens it, and this time writes
> out 16MB of zeroes, calls fdatasync, and then does the same loop as
> above. The program times the process from file open to file unlink,
> inclusive.
>
> The results - for me - are pretty consistent: using fallocate is
> 12-13% quicker than writing out zeroes.

Cool!

> I used fdatasync twice to (attempt) to mimic what the WAL writer does.

Not sure what you mean by that though?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-28 15:36:49
Message-ID: 51A4CF11.1060607@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/28/13 11:12 AM, Jon Nelson wrote:
> It opens a new file, fallocates 16MB, calls fdatasync.

Outside of the run for performance testing, I think it would be good at
this point to validate that there is really a 16MB file full of zeroes
resulting from these operations. I am not really concerned that
posix_fallocate might be slower in some cases; that seems unlikely. I
am concerned that it might result in a file that isn't structurally the
same as the 16MB of zero writes implementation used now.

The timing program you're writing has some aspects that are similar to
the contrib/pg_test_fsync program. You might borrow some code from
there usefully.

To clarify the suggestion I was making before about including
performance test results: that doesn't necessarily mean the testing
code must run using only the database. That's better if possible, but
as Robert says it may not be for some optimizations. The important
thing is to have something measuring the improvement that a reviewer can
duplicate, and if that's a standalone benchmark problem that's still
very useful. The main thing I'm wary of is any "this should be faster"
claims that don't come with any repeatable measurements at all. Very
often theories about the fastest way to do something don't match what's
actually seen in testing.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-29 02:00:18
Message-ID: CAKuK5J1OkqYeZzC1w2yz41h1q9Mz8H=2cH1yz_Lr2EXXsbqDjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, May 28, 2013 at 10:36 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 5/28/13 11:12 AM, Jon Nelson wrote:
>>
>> It opens a new file, fallocates 16MB, calls fdatasync.
>
>
> Outside of the run for performance testing, I think it would be good at this
> point to validate that there is really a 16MB file full of zeroes resulting
> from these operations. I am not really concerned that posix_fallocate might
> be slower in some cases; that seems unlikely. I am concerned that it might
> result in a file that isn't structurally the same as the 16MB of zero writes
> implementation used now.

util-linux comes with fallocate which (might?) suffice for testing in
that respect, no?
If that is a real concern, it could be made part of the autoconf
testing, perhaps.

> The timing program you're writing has some aspects that are similar to the
> contrib/pg_test_fsync program. You might borrow some code from there
> usefully.

Thanks! If it looks like what I'm attaching will not do, then I'll
look at that as a possible next step.

> To clarify the suggestion I was making before about including performance
> test results: that doesn't necessarily mean the testing code must run using
> only the database. That's better if possible, but as Robert says it may not
> be for some optimizations. The important thing is to have something
> measuring the improvement that a reviewer can duplicate, and if that's a
> standalone benchmark problem that's still very useful. The main thing I'm
> wary of is any "this should be faster" claims that don't come with any
> repeatable measurements at all. Very often theories about the fastest way
> to do something don't match what's actually seen in testing.

Ack.
A note: The attached test program uses *fsync* instead of *fdatasync*
after calling fallocate (or writing out 16MB of zeroes), per an
earlier suggestion.

--
Jon

Attachment Content-Type Size
test_fallocate.c text/x-csrc 2.6 KB

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-29 02:10:54
Message-ID: 51A563AE.2090507@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/28/13 10:00 PM, Jon Nelson wrote:
> On Tue, May 28, 2013 at 10:36 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>> On 5/28/13 11:12 AM, Jon Nelson wrote:
>>>
>>> It opens a new file, fallocates 16MB, calls fdatasync.
>>
>>
>> Outside of the run for performance testing, I think it would be good at this
>> point to validate that there is really a 16MB file full of zeroes resulting
>> from these operations. I am not really concerned that posix_fallocate might
>> be slower in some cases; that seems unlikely. I am concerned that it might
>> result in a file that isn't structurally the same as the 16MB of zero writes
>> implementation used now.
>
> util-linux comes with fallocate which (might?) suffice for testing in
> that respect, no?
> If that is a real concern, it could be made part of the autoconf
> testing, perhaps.

I was just thinking of something to run in your test program, not
another build time check. Just run the new allocation sequence, and
then check the resulting WAL file for a) correct length, and b) 16K of
zero bytes. I would like to build some confidence that posix_fallocate
is operating correctly in this context on at least one platform. My
experience with Linux handling this class of functions correctly has
left me skeptical of them working until that's proven to be the case.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-29 14:12:54
Message-ID: 51A60CE6.9020607@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/28/13 11:36 AM, Greg Smith wrote:
> Outside of the run for performance testing, I think it would be good at
> this point to validate that there is really a 16MB file full of zeroes
> resulting from these operations. I am not really concerned that
> posix_fallocate might be slower in some cases; that seems unlikely. I
> am concerned that it might result in a file that isn't structurally the
> same as the 16MB of zero writes implementation used now.

I see nothing in the posix_fallocate() man pages that says that the
allocated space is filled with any kind of data or zeroes. It will
likely be garbage data, but that should be fine for a new WAL file.


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Greg Smith <greg(at)2ndQuadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-29 14:36:07
Message-ID: 20130529143607.GD6434@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Peter Eisentraut (peter_e(at)gmx(dot)net) wrote:
> On 5/28/13 11:36 AM, Greg Smith wrote:
> > Outside of the run for performance testing, I think it would be good at
> > this point to validate that there is really a 16MB file full of zeroes
> > resulting from these operations. I am not really concerned that
> > posix_fallocate might be slower in some cases; that seems unlikely. I
> > am concerned that it might result in a file that isn't structurally the
> > same as the 16MB of zero writes implementation used now.
>
> I see nothing in the posix_fallocate() man pages that says that the
> allocated space is filled with any kind of data or zeroes. It will
> likely be garbage data, but that should be fine for a new WAL file.

I *really* hope that the Linux kernel, and other, folks are smart enough
to realize that they can't just re-use random blocks from an I/O device
without cleaning it first. That would be one massive security hole. I
expect posix_fallocate() actually works more like spase files, except
that it also counts the space as being 'taken', but it doesn't go out
and actually pull blocks to use until you actually go to write to it.
At which point, perhaps there's an optimization that says "if the first
thing done with this is writing, then just write out whatever data is
requested and then fill the rest of the block out with zeros", and a
similar read operation which says "if we havn't formally assigned a
block for this, just return zeros". Hopefully it's smart enough to
avoid writing out all zeros and then turning around and writing out
whatever data is given, though since it'd all be in memory, perhaps
that wouldn't be too bad and might be simpler to implement.

Thanks,

Stephen


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-29 14:42:45
Message-ID: 20130529144245.GC3955@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-29 10:36:07 -0400, Stephen Frost wrote:
> * Peter Eisentraut (peter_e(at)gmx(dot)net) wrote:
> > On 5/28/13 11:36 AM, Greg Smith wrote:
> > > Outside of the run for performance testing, I think it would be good at
> > > this point to validate that there is really a 16MB file full of zeroes
> > > resulting from these operations. I am not really concerned that
> > > posix_fallocate might be slower in some cases; that seems unlikely. I
> > > am concerned that it might result in a file that isn't structurally the
> > > same as the 16MB of zero writes implementation used now.
> >
> > I see nothing in the posix_fallocate() man pages that says that the
> > allocated space is filled with any kind of data or zeroes. It will
> > likely be garbage data, but that should be fine for a new WAL file.
>
> I *really* hope that the Linux kernel, and other, folks are smart enough
> to realize that they can't just re-use random blocks from an I/O device
> without cleaning it first.

FWIW, posix' description about posix_fallocate() doesn't actually say
*anything* about reading. The guarantee it makes is:
"If posix_fallocate() returns successfully, subsequent writes to the
specified file data shall not fail due to the lack of free space on the
file system storage media.".

http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html

So we don't even know whether we can read. I think that means we need to
zero the file anyway...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-29 15:19:33
Message-ID: 51A61C85.60102@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/29/13 10:42 AM, Andres Freund wrote:
> On 2013-05-29 10:36:07 -0400, Stephen Frost wrote:
>> I *really* hope that the Linux kernel, and other, folks are smart enough
>> to realize that they can't just re-use random blocks from an I/O device
>> without cleaning it first.
>
> FWIW, posix' description about posix_fallocate() doesn't actually say
> *anything* about reading. The guarantee it makes is:
> "If posix_fallocate() returns successfully, subsequent writes to the
> specified file data shall not fail due to the lack of free space on the
> file system storage media.".
>
> http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
>
> So we don't even know whether we can read. I think that means we need to
> zero the file anyway...

We could use Linux fallocate(), which does guarantee that the file reads
back as zeroes. Or we use posix_fallocate() and write over the first
few bytes, enough for a subsequent reader to detect that it shouldn't
read any further.

But all of this is getting very complicated for such a marginal improvement.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 10:49:42
Message-ID: CA+Tgmobbf8KnP6fPHHHkCnCzHEcgih+EgBZV+mE8S6X0S9Ru6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 29, 2013 at 10:42 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> FWIW, posix' description about posix_fallocate() doesn't actually say
> *anything* about reading. The guarantee it makes is:
> "If posix_fallocate() returns successfully, subsequent writes to the
> specified file data shall not fail due to the lack of free space on the
> file system storage media.".
>
> http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
>
> So we don't even know whether we can read. I think that means we need to
> zero the file anyway...

Surely this is undue pessimism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 10:55:16
Message-ID: 51A73014.4040908@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/30/13 6:49 AM, Robert Haas wrote:
> On Wed, May 29, 2013 at 10:42 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> So we don't even know whether we can read. I think that means we need to
>> zero the file anyway...
>
> Surely this is undue pessimism.

There have been many occasions where I've found the Linux kernel
defining support for POSIX behavior with a NOP stub that basically says
"we should make this work one day". I don't know whether the fallocate
code is one of those or a fully implemented call. Based on that
history, until I see a reader that validates the resulting files are
good I have to assume they're not.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 11:13:13
Message-ID: 20130530111313.GG4201@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-30 06:49:42 -0400, Robert Haas wrote:
> On Wed, May 29, 2013 at 10:42 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > FWIW, posix' description about posix_fallocate() doesn't actually say
> > *anything* about reading. The guarantee it makes is:
> > "If posix_fallocate() returns successfully, subsequent writes to the
> > specified file data shall not fail due to the lack of free space on the
> > file system storage media.".
> >
> > http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
> >
> > So we don't even know whether we can read. I think that means we need to
> > zero the file anyway...
>
> Surely this is undue pessimism.

Why? The spec doesn't specify that case and that very well allows other
behaviour. Glibc sure does behave sensibly and zeroes the data
(sysdeps/posix/posix_fallocate64.c for the generic implementation) and
so does linux' fallocate() syscall, but that doesn't say much about
other implementations.

None of the manpages I could find, nor the spec says anything about the
file's contents in the extended range. Given there were at least three
manpages of different origins that didn't specify that behaviour I am
not too optimistic. Why they didn't specify that completely obvious
question is hard to understand from my pov.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 11:17:17
Message-ID: 20130530111717.GH4201@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-30 06:55:16 -0400, Greg Smith wrote:
> On 5/30/13 6:49 AM, Robert Haas wrote:
> >On Wed, May 29, 2013 at 10:42 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >>So we don't even know whether we can read. I think that means we need to
> >>zero the file anyway...
> >
> >Surely this is undue pessimism.
>
> There have been many occasions where I've found the Linux kernel defining
> support for POSIX behavior with a NOP stub that basically says "we should
> make this work one day". I don't know whether the fallocate code is one of
> those or a fully implemented call. Based on that history, until I see a
> reader that validates the resulting files are good I have to assume they're
> not.

That argument in contrast I find not very convincing though. What was
the last incidence of such a system call that did not just error out
with ENOTSUPP or such?

The linux fallocate call is fully specified for this behaviour and got
added 2.6.23, there wasn't a stub before, so I am far less worried about
it than about the underspecifiedness of posix_fallocate(). Also, if some
system call doesn't follow its documented specifications it's not fully
our problem anymore. If we rely on undocumented behaviour though...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 11:48:51
Message-ID: 51A73CA3.7000003@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/30/13 7:17 AM, Andres Freund wrote:
> That argument in contrast I find not very convincing though. What was
> the last incidence of such a system call that did not just error out
> with ENOTSUPP or such?

http://linux.die.net/man/2/posix_fadvise talks about POSIX_FADV_NOREUSE
and POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op,
depending on your version. I know there were more examples than just
that one that popped up during the testing of effective_io_concurrency.
My starting position has to assume that posix_fallocate can have the
same sort of surprising behavior that showed up repeatedly when we were
trying to use posix_fadvise more aggressively.

The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same
thing) was a similar issue, and that's the first one that left me
forever skeptical of Linux kernel claims in this area until they are
explicitly validated: http://lwn.net/Articles/350225/

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 11:52:35
Message-ID: 20130530115235.GA13335@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-30 07:48:51 -0400, Greg Smith wrote:
> On 5/30/13 7:17 AM, Andres Freund wrote:
> >That argument in contrast I find not very convincing though. What was
> >the last incidence of such a system call that did not just error out
> >with ENOTSUPP or such?
>
> http://linux.die.net/man/2/posix_fadvise talks about POSIX_FADV_NOREUSE and
> POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op,
> depending on your version. I know there were more examples than just that
> one that popped up during the testing of effective_io_concurrency. My
> starting position has to assume that posix_fallocate can have the same sort
> of surprising behavior that showed up repeatedly when we were trying to use
> posix_fadvise more aggressively.

Uh. How is that a correctness problem? fadvise is a hint which is pretty
different from a fallocate where ignoring would have way much more
severe consequences.
I don't think that's a very meaningful comparison.

> The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same thing)
> was a similar issue, and that's the first one that left me forever skeptical
> of Linux kernel claims in this area until they are explicitly validated:
> http://lwn.net/Articles/350225/

Yea, but that mistake is literally decades old...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 11:57:48
Message-ID: 51A73EBC.5090809@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/30/13 7:52 AM, Andres Freund wrote:
> fadvise is a hint which is pretty
> different from a fallocate where ignoring would have way much more
> severe consequences.

Yes, it will. That's why I want to see it tested. There is more than
enough past examples of bad behavior here to be skeptical that this sort
of API may not work exactly as specified. If you're willing to believe
the spec, that's fine, but I think that's dangerously optimistic.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:02:56
Message-ID: CA+TgmoZwA_T+XyQcpRqUE2AEj3eESBcSr93Z=4Quc0MfmHgCTQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 30, 2013 at 7:13 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Surely this is undue pessimism.
>
> Why? The spec doesn't specify that case and that very well allows other
> behaviour. Glibc sure does behave sensibly and zeroes the data
> (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
> so does linux' fallocate() syscall, but that doesn't say much about
> other implementations.
>
> None of the manpages I could find, nor the spec says anything about the
> file's contents in the extended range. Given there were at least three
> manpages of different origins that didn't specify that behaviour I am
> not too optimistic. Why they didn't specify that completely obvious
> question is hard to understand from my pov.

I think they didn't specify it because it IS obvious. As Stephen
says, it's been understood for decades that allowing unzeroed pages to
be reallocated to some other file is a major security hole. I think
we can assume that no credible OS does that. If there's some OS out
there that chooses to fill the pre-extended pages with 0x55 or cat
/dev/urandom instead of 0x00, they probably deserve what they get.
It's hard for me to be believe that anything that silly actually
exists.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:14:27
Message-ID: 20130530121427.GB7466@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-30 08:02:56 -0400, Robert Haas wrote:
> On Thu, May 30, 2013 at 7:13 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Surely this is undue pessimism.
> >
> > Why? The spec doesn't specify that case and that very well allows other
> > behaviour. Glibc sure does behave sensibly and zeroes the data
> > (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
> > so does linux' fallocate() syscall, but that doesn't say much about
> > other implementations.
> >
> > None of the manpages I could find, nor the spec says anything about the
> > file's contents in the extended range. Given there were at least three
> > manpages of different origins that didn't specify that behaviour I am
> > not too optimistic. Why they didn't specify that completely obvious
> > question is hard to understand from my pov.
>
> I think they didn't specify it because it IS obvious. As Stephen
> says, it's been understood for decades that allowing unzeroed pages to
> be reallocated to some other file is a major security hole. I think
> we can assume that no credible OS does that. If there's some OS out
> there that chooses to fill the pre-extended pages with 0x55 or cat
> /dev/urandom instead of 0x00, they probably deserve what they get.
> It's hard for me to be believe that anything that silly actually
> exists.

I don't think there's much danger of getting uninitialized data or
such. That clearly would be insane. I think somebody might interpret it
as read(2) returning an error until the page has been written to which
isn't completely crazy.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:17:28
Message-ID: 51A74358.2010501@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/30/13 7:13 AM, Andres Freund wrote:
> Why? The spec doesn't specify that case and that very well allows other
> behaviour. Glibc sure does behave sensibly and zeroes the data
> (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
> so does linux' fallocate() syscall, but that doesn't say much about
> other implementations.

glibc actually only writes one byte to every file system block, to make
sure the block is allocated. It doesn't actually zero every byte.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:19:17
Message-ID: 51A743C5.7000804@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/30/13 8:02 AM, Robert Haas wrote:
> If there's some OS out
> there that chooses to fill the pre-extended pages with 0x55 or cat
> /dev/urandom instead of 0x00, they probably deserve what they get.

Even that wouldn't be a problem for our purpose. The only problem would
be if you can't read from the allocated region at all.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:20:53
Message-ID: 20130530122053.GA14029@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote:
> On 5/30/13 8:02 AM, Robert Haas wrote:
> > If there's some OS out
> > there that chooses to fill the pre-extended pages with 0x55 or cat
> > /dev/urandom instead of 0x00, they probably deserve what they get.
>
> Even that wouldn't be a problem for our purpose. The only problem would
> be if you can't read from the allocated region at all.

Well, only as long as we only use it for preallocation of wal files. I
am much, much more interested in doing that for the heap. And there that
surely would be a problem.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:28:20
Message-ID: 20130530122820.GB14029@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-30 08:17:28 -0400, Peter Eisentraut wrote:
> On 5/30/13 7:13 AM, Andres Freund wrote:
> > Why? The spec doesn't specify that case and that very well allows other
> > behaviour. Glibc sure does behave sensibly and zeroes the data
> > (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
> > so does linux' fallocate() syscall, but that doesn't say much about
> > other implementations.
>
> glibc actually only writes one byte to every file system block, to make
> sure the block is allocated. It doesn't actually zero every byte.

Which is fine since that guarantees we can read from those areas... And
unless I misremember something that actually guarantees that the rest of
the data is initialized to zero as well. Yes: "subsequent reads of data
in the gap shall return bytes with the value 0 until data is actually
written into the gap".

But really, I am not at all concerned about some obscure values being
returned, but about a read() not being successful..

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:47:48
Message-ID: 20130530124748.GP6434@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Peter Eisentraut (peter_e(at)gmx(dot)net) wrote:
> On 5/30/13 7:13 AM, Andres Freund wrote:
> > Why? The spec doesn't specify that case and that very well allows other
> > behaviour. Glibc sure does behave sensibly and zeroes the data
> > (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
> > so does linux' fallocate() syscall, but that doesn't say much about
> > other implementations.
>
> glibc actually only writes one byte to every file system block, to make
> sure the block is allocated. It doesn't actually zero every byte.

That goes back to the 'sane implementation' question.. Is there a case
where that would actually be different from writing zeros for the entire
block..? Is there some OS that gives you random data for the 'hole'
when you write a byte, seek to the start of the next block and then
write another byte? That actually *would* be against what's documented
and required by spec, no?

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:50:38
Message-ID: 20130530125038.GQ6434@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote:
> > On 5/30/13 8:02 AM, Robert Haas wrote:
> > > If there's some OS out
> > > there that chooses to fill the pre-extended pages with 0x55 or cat
> > > /dev/urandom instead of 0x00, they probably deserve what they get.
> >
> > Even that wouldn't be a problem for our purpose. The only problem would
> > be if you can't read from the allocated region at all.
>
> Well, only as long as we only use it for preallocation of wal files. I
> am much, much more interested in doing that for the heap. And there that
> surely would be a problem.

Yes, that was my thinking as well. If posix_fallocate is faster than
writing out 8K of zeros, and the block can immediately be read as if it
had actually been written to, then I'd be very interested in using it to
extend heap files. As I mentioned in this thread (or perhaps it was
another), I don't think this solves the locking issue around the
relation extention lock, but it might help some and it may make it
easier to tweak that logic to allocate larger chunks in the future.

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:53:37
Message-ID: 20130530125337.GR6434@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> But really, I am not at all concerned about some obscure values being
> returned, but about a read() not being successful..

Alright, so what do we need to do to test this? We really just need a
short C program written up and then a bunch of folks to run it on
various architectures, right? Gee, sounds like what the buildfarm was
made for (alright, alright, PostgreSQL isn't exactly a 'short C
program', but you get the idea). As I recall, Andrew reworked the
buildfarm code to be more modular too.. Anyone have thoughts about how
we could run these kinds of tests with it? Or do people think that's a
bad idea?

Thanks,

Stephen


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 12:58:26
Message-ID: 20130530125826.GE14029@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-05-30 08:53:37 -0400, Stephen Frost wrote:
> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> > But really, I am not at all concerned about some obscure values being
> > returned, but about a read() not being successful..
>
> Alright, so what do we need to do to test this? We really just need a
> short C program written up and then a bunch of folks to run it on
> various architectures, right?

After a bit of standard perusing writing a single byte to the end of the
file after the fallocate ought to make at least the reading guaranteed
to be defined. If we did seek(last_byte); write(); posix_fallocate() we
should even always have defined content. Yuck.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 13:15:39
Message-ID: 20130530131539.GS6434@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> After a bit of standard perusing writing a single byte to the end of the
> file after the fallocate ought to make at least the reading guaranteed
> to be defined. If we did seek(last_byte); write(); posix_fallocate() we
> should even always have defined content. Yuck.

Alright, but would that actually be any better than just doing what
glibc's posix_fallocate() does in the generic case? And, to be honest,
it makes me a bit nervous to seek/write like that because it looks like
the typical "create a hole" setup, which we certainly aren't intending,
yet if the posix_fallocate() call disappeared, or did nothing, or this
code was copied w/o it, or someone didn't understand what it did, we
could end up with that.

Not a fan. :(

Thanks,

Stephen


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 13:29:24
Message-ID: 51A75434.80609@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/30/13 8:50 AM, Stephen Frost wrote:
> I don't think this solves the locking issue around the
> relation extention lock, but it might help some and it may make it
> easier to tweak that logic to allocate larger chunks in the future.

It might make it a bit faster, but it doesn't make it any easier to
implement. The messy part of extending relations in larger chunks is
how to communicate that back into the buffer manager usefully. The
extension path causing trouble is RelationGetBufferForTuple calling
ReadBufferBI. All of that is passing a single buffer around. There's
no simple way I can see to rewrite it to handle more than one at a time.

I have a test case for relation extension that I'm going to package up
soon. That makes it easy to see where the problem is at. Far as I can
tell the reason it hasn't been fixed before now is that it's a pain to
write the code.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 13:51:28
Message-ID: CA+TgmobUDCVipdq6dck9TM2J0-bQ3LwHZ3TxCOQQc23OdDU+=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 30, 2013 at 8:14 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I don't think there's much danger of getting uninitialized data or
> such. That clearly would be insane. I think somebody might interpret it
> as read(2) returning an error until the page has been written to which
> isn't completely crazy.

In the absence of tangible evidence of some implementation that
behaves that way, I think that's just paranoia.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 15:21:26
Message-ID: 20130530152126.GE2548@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith escribió:

> The messy part of extending relations in larger chunks
> is how to communicate that back into the buffer manager usefully.
> The extension path causing trouble is RelationGetBufferForTuple
> calling ReadBufferBI. All of that is passing a single buffer
> around. There's no simple way I can see to rewrite it to handle
> more than one at a time.

No, but we can have it create several pages and insert them into the
FSM. So they aren't returned to the original caller but are available
to future users.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-05-30 15:45:31
Message-ID: 51A7741B.4010506@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/30/13 11:21 AM, Alvaro Herrera wrote:
> Greg Smith escribió:
>
>> The messy part of extending relations in larger chunks
>> is how to communicate that back into the buffer manager usefully.
>> The extension path causing trouble is RelationGetBufferForTuple
>> calling ReadBufferBI. All of that is passing a single buffer
>> around. There's no simple way I can see to rewrite it to handle
>> more than one at a time.
>
> No, but we can have it create several pages and insert them into the
> FSM. So they aren't returned to the original caller but are available
> to future users.

There's actually a code comment wondering about this topic for the pages
that are already created, in src/backend/access/heap/hio.c :

"Remember the new page as our target for future insertions.
XXX should we enter the new page into the free space map immediately, or
just keep it for this backend's exclusive use in the short run (until
VACUUM sees it)? Seems to depend on whether you expect the current
backend to make more insertions or not, which is probably a good bet
most of the time. So for now, don't add it to FSM yet."

We have to be careful about touching too much at that particular point,
because it's holding a relation extension lock at the obvious spot to
make a change.

There's an interesting overlap with these questions about how files are
extended too, with this comment in that file too, just before the above:

"XXX This does an lseek - rather expensive - but at the moment it is the
only way to accurately determine how many blocks are in a relation. Is
it worth keeping an accurate file length in shared memory someplace,
rather than relying on the kernel to do it for us?"

That whole sequence of code took the easy way forward when it was
written, but it's obvious the harder one (also touching the FSM) was
considered even then. The whole sequence needs to be revisited to pull
off multiple page extension. I wouldn't say it's hard, but it's enough
work that I haven't been able to find a block of time to go through the
whole thing.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-01 02:31:02
Message-ID: 20130601023102.GA252605@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 30, 2013 at 02:58:26PM +0200, Andres Freund wrote:
> > * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> > > But really, I am not at all concerned about some obscure values being
> > > returned, but about a read() not being successful..

> After a bit of standard perusing writing a single byte to the end of the
> file after the fallocate ought to make at least the reading guaranteed
> to be defined. If we did seek(last_byte); write(); posix_fallocate() we
> should even always have defined content. Yuck.

This portion of the posix_fallocate() specification requires the hoped-for
effect on subsequent read() calls:

If the offset+ len is beyond the current file size, then posix_fallocate()
shall adjust the file size to offset+ len. Otherwise, the file size shall
not be changed.
-- http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_fallocate.html

When the file size increases, read()'s defined behavior switches from
returning short to retrieving zeros. There's no need for an additional
write() to ensure that.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-11 15:28:11
Message-ID: CAKuK5J2BA_is7YkK+QpL7UTafMeQ4fPWxx=1Y2w6VzKjqZfkTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

There hasn't been much activity here recently. I'm curious, then, if
there are questions that I can answer.
It may be useful to summarize some things here:

- the purpose of the patch is to use posix_fallocate when creating new
WAL files, because it's (usually) much quicker
- using posix_fallocate is *one* system call versus 2048 calls to write(2)
- additionally, using posix_fallocate /guarantees/ that the filesystem
has space for the WAL file (by spec)
- reportedly (difficult to test or prove), using posix_fallocate *may*
reduce file fragmentation
- the (limited) testing I've done bears this out: the more new WAL
file creation there is, the more the improvement. Once the number of
WAL files reaches a constant point, there does not appear to be either
a positive or a negative performance impact. This is as expected.
- a test program (C) was also written and used which creates,
allocates, and then writes to files as fast as possible. This test
program also shows the expected performance benefits.
- the performance benefits range from a few percent up to about 15 percent

Concerns:
- some were concerned that the spec makes no claims about
posix_fallocate being able to guarantee that the space allocated has
zeroes in it. This was discussed here and on the Linux Kernel mailing
list, wherein the expected behavior is that it does provide zeroes
- most systems don't allocate a great many new WAL files, so the
performance benefit is small
- <your concern here>

Benefits:
- new WAL file allocate is much quicker, more efficient (fewer system calls)
- the patch is (reportedly - I'm not a good judge here!) quite small

--
Jon


From: Stefan Drees <stefan(at)drees(dot)name>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-11 16:08:08
Message-ID: 51B74B68.7090300@drees.name
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11.06 17:28, Jon Nelson wrote:
> There hasn't been much activity here recently. I'm curious, then, if
> there are questions that I can answer.
> It may be useful to summarize some things here:
>
> - the purpose of the patch is to use posix_fallocate when creating new
> WAL files, because it's (usually) much quicker
> - using posix_fallocate is *one* system call versus 2048 calls to write(2)
> - additionally, using posix_fallocate /guarantees/ that the filesystem
> has space for the WAL file (by spec)
> - reportedly (difficult to test or prove), using posix_fallocate *may*
> reduce file fragmentation
> - the (limited) testing I've done bears this out: the more new WAL
> file creation there is, the more the improvement. Once the number of
> WAL files reaches a constant point, there does not appear to be either
> a positive or a negative performance impact. This is as expected.
> - a test program (C) was also written and used which creates,
> allocates, and then writes to files as fast as possible. This test
> program also shows the expected performance benefits.
> - the performance benefits range from a few percent up to about 15 percent

tried the test program of the patch at least a bit.

Retrieved it from:
http://www.postgresql.org/message-id/attachment/29088/test_fallocate.c

on an oldish linux box (Kernel 2.6.32, x86_64) following
$ gcc -o test_fallocate test_fallocate.c
a short
$ ./test_fallocate foo 1 1
yields:
without posix_fallocate: 1 open/close iterations, 1 rewrite in 26.1993s
with posix_fallocate: 1 open/close iterations, 1 rewrite in 13.3299s

on another box (Kernel 3.2.0, x86_64) same procedure yields:
without posix_fallocate: 1 open/close iterations, 1 rewrite in 19.1972s
with posix_fallocate: 1 open/close iterations, 1 rewrite in 9.9280s

Note, when trying gcc -O2 test_fallocate.c fails to compile with:

In file included from /usr/include/fcntl.h:252:0,
from test_fallocate.c:3:
In function ‘open’,
inlined from ‘main’ at test_fallocate.c:68:16:
/usr/include/x86_64-linux-gnu/bits/fcntl2.h:51:24: error: call to
‘__open_missing_mode’ declared with attribute error: open with O_CREAT
in second argument needs 3 arguments

> Concerns:
> - some were concerned that the spec makes no claims about
> posix_fallocate being able to guarantee that the space allocated has
> zeroes in it. This was discussed here and on the Linux Kernel mailing
> list, wherein the expected behavior is that it does provide zeroes
> - most systems don't allocate a great many new WAL files, so the
> performance benefit is small
> - <your concern here>
>
> Benefits:
> - new WAL file allocate is much quicker, more efficient (fewer system calls)
> - the patch is (reportedly - I'm not a good judge here!) quite small

HTH,
Stefan.


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: stefan(at)drees(dot)name
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-11 16:22:52
Message-ID: CAHyXU0yXe5hEcSka-TWwBRBQZ=X6S+j3=oqamUGh-a7CxkEfZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 11, 2013 at 11:08 AM, Stefan Drees <stefan(at)drees(dot)name> wrote:
> On 2013-11.06 17:28, Jon Nelson wrote:
>>
>> There hasn't been much activity here recently. I'm curious, then, if
>> there are questions that I can answer.
>> It may be useful to summarize some things here:
>>
>> - the purpose of the patch is to use posix_fallocate when creating new
>> WAL files, because it's (usually) much quicker
>> - using posix_fallocate is *one* system call versus 2048 calls to write(2)
>> - additionally, using posix_fallocate /guarantees/ that the filesystem
>> has space for the WAL file (by spec)
>> - reportedly (difficult to test or prove), using posix_fallocate *may*
>> reduce file fragmentation
>> - the (limited) testing I've done bears this out: the more new WAL
>> file creation there is, the more the improvement. Once the number of
>> WAL files reaches a constant point, there does not appear to be either
>> a positive or a negative performance impact. This is as expected.
>> - a test program (C) was also written and used which creates,
>> allocates, and then writes to files as fast as possible. This test
>> program also shows the expected performance benefits.
>> - the performance benefits range from a few percent up to about 15 percent
>
>
> tried the test program of the patch at least a bit.
>
> Retrieved it from:
> http://www.postgresql.org/message-id/attachment/29088/test_fallocate.c
>
> on an oldish linux box (Kernel 2.6.32, x86_64) following
> $ gcc -o test_fallocate test_fallocate.c
> a short
> $ ./test_fallocate foo 1 1
> yields:
> without posix_fallocate: 1 open/close iterations, 1 rewrite in 26.1993s
> with posix_fallocate: 1 open/close iterations, 1 rewrite in 13.3299s
>
> on another box (Kernel 3.2.0, x86_64) same procedure yields:
> without posix_fallocate: 1 open/close iterations, 1 rewrite in 19.1972s
> with posix_fallocate: 1 open/close iterations, 1 rewrite in 9.9280s
>
> Note, when trying gcc -O2 test_fallocate.c fails to compile with:
>
> In file included from /usr/include/fcntl.h:252:0,
> from test_fallocate.c:3:
> In function ‘open’,
> inlined from ‘main’ at test_fallocate.c:68:16:
> /usr/include/x86_64-linux-gnu/bits/fcntl2.h:51:24: error: call to
> ‘__open_missing_mode’ declared with attribute error: open with O_CREAT in
> second argument needs 3 arguments

It's understood that posix_fallocate is faster at this -- the question
on the table is 'does this matter in context of postgres?'.
Personally I think this patch should go in regardless -- the concerns
made IMNSHO are specious.

merlin


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: stefan(at)drees(dot)name, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-11 16:58:37
Message-ID: 20130611165837.GQ7200@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Merlin Moncure (mmoncure(at)gmail(dot)com) wrote:
> It's understood that posix_fallocate is faster at this -- the question
> on the table is 'does this matter in context of postgres?'.
> Personally I think this patch should go in regardless -- the concerns
> made IMNSHO are specious.

I've not had a chance to look at this patch, but I tend to agree with
Merlin. My main question is really- would this be useful for extending
*relations*? Apologies if it's already been discussed; I do plan to go
back and read the threads about this more fully, but I wanted to voice
my support for using posix_fallocate, when available, in general.

Thanks,

Stephen


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: stefan(at)drees(dot)name, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-11 17:45:15
Message-ID: 51B7622B.9080706@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/11/13 12:22 PM, Merlin Moncure wrote:

> Personally I think this patch should go in regardless -- the concerns
> made IMNSHO are specious.

That's nice, but we have this process for validating whether features go
in or not that relies on review instead of opinions.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Stefan Drees <stefan(at)drees(dot)name>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndQuadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-11 17:49:55
Message-ID: 51B76343.20105@drees.name
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-06-11 19:45 CEST, Greg Smith wrote:
> On 6/11/13 12:22 PM, Merlin Moncure wrote:
>
>> Personally I think this patch should go in regardless -- the concerns
>> made IMNSHO are specious.
>
> That's nice, but we have this process for validating whether features go
> in or not that relies on review instead of opinions.
>
;-) that's why I played with the test_fallocate.c, as it was easy to do
and I understood, the author (of the patch) wanted to trigger some
reviews ... I do not (yet) know anything about the core codes, so I
leave this to the hackers. My review result was, that with newer gcc's
you should specify an open mode flag as third argument of the fopen
call, as only with the test tool nothing important found.

Stefan.


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: stefan(at)drees(dot)name
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-11 17:52:25
Message-ID: CAKuK5J34ecGN_JN_pgNAx9CCH2s=Oc7Faj0k8hwEd=pct88RBg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 11, 2013 at 12:49 PM, Stefan Drees <stefan(at)drees(dot)name> wrote:
> On 2013-06-11 19:45 CEST, Greg Smith wrote:
>>
>> On 6/11/13 12:22 PM, Merlin Moncure wrote:
>>
>>> Personally I think this patch should go in regardless -- the concerns
>>> made IMNSHO are specious.
>>
>>
>> That's nice, but we have this process for validating whether features go
>> in or not that relies on review instead of opinions.
>>
> ;-) that's why I played with the test_fallocate.c, as it was easy to do and
> I understood, the author (of the patch) wanted to trigger some reviews ... I
> do not (yet) know anything about the core codes, so I leave this to the
> hackers. My review result was, that with newer gcc's you should specify an
> open mode flag as third argument of the fopen call, as only with the test
> tool nothing important found.

If you change line 68 to read:

fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

then it should work just fine (assuming you have not already done so).

--
Jon


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: stefan(at)drees(dot)name, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-11 18:36:57
Message-ID: CAHyXU0y8=njVe=oCObu4M83gEaQV4PDfQjPAXgdUZyPd-f+QLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 11, 2013 at 12:45 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 6/11/13 12:22 PM, Merlin Moncure wrote:
>
>> Personally I think this patch should go in regardless -- the concerns
>> made IMNSHO are specious.
>
> That's nice, but we have this process for validating whether features go in
> or not that relies on review instead of opinions.

Sure: I phrased that badly by 'go in' I meant 'go through the review
process', not commit out of hand.

merlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, stefan(at)drees(dot)name, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-12 16:36:39
Message-ID: CA+TgmoadBa41kFx_J=PPg_k_yCB=cVbkF7QVupN-j8O6j+aUDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 11, 2013 at 12:58 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> * Merlin Moncure (mmoncure(at)gmail(dot)com) wrote:
>> It's understood that posix_fallocate is faster at this -- the question
>> on the table is 'does this matter in context of postgres?'.
>> Personally I think this patch should go in regardless -- the concerns
>> made IMNSHO are specious.
>
> I've not had a chance to look at this patch, but I tend to agree with
> Merlin.

I also think this is a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-14 17:06:14
Message-ID: 1371229574.27844.31.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
> Ack. I've revised the patch to always have the GUC (for now), default
> to false, and if configure can't find posix_fallocate (or the user
> disables it by way of pg_config_manual.h) then it remains a GUC that
> simply can't be changed.

Why have a GUC here at all? Perhaps this was already discussed, and I
missed it? Is it just for testing purposes, or did you intend for it to
be in the final version?

If it's supported, it seems like we always want it. I doubt there are
cases where it hurts performance; but if there are, it's pretty hard for
a DBA to know what those cases are, anyway.

Also:

* The other code assumes that no errno means ENOSPC. We should be
consistent about that assumption, and do the same thing in the fallocate
case.

* You check for the presence of posix_fallocate at configure time, but
don't #ifdef the call site. It looks like you removed this from the v2
patch, was there a reason for that? Won't that cause build errors for
platforms without it?

I started looking at this patch and it looks like we are getting a
consensus that it's the right approach. Microbenchmarks appear to show a
benefit, and (thanks to Noah's comment) it seems like the change is
safe. Are there any remaining questions or objections?

Regards,
Jeff Davis


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-14 17:21:38
Message-ID: 51BB5122.1060809@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/14/13 1:06 PM, Jeff Davis wrote:
> Why have a GUC here at all? Perhaps this was already discussed, and I
> missed it? Is it just for testing purposes, or did you intend for it to
> be in the final version?

You have guessed correctly! I suggested it stay in there only to make
review benchmarking easier.

> I started looking at this patch and it looks like we are getting a
> consensus that it's the right approach. Microbenchmarks appear to show a
> benefit, and (thanks to Noah's comment) it seems like the change is
> safe. Are there any remaining questions or objections?

I'm planning to duplicate Jon's test program on a few machines here, and
then see if that turns into a useful latency improvement for clients.
I'm trying to get this pgbench rate limit stuff working first though,
because one of the tests I had in mind for WAL creation overhead would
benefit from it.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, stefan(at)drees(dot)name, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-14 17:24:52
Message-ID: 1371230692.27844.36.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2013-06-11 at 12:58 -0400, Stephen Frost wrote:
> My main question is really- would this be useful for extending
> *relations*? Apologies if it's already been discussed; I do plan to go
> back and read the threads about this more fully, but I wanted to voice
> my support for using posix_fallocate, when available, in general.

+1, though separate from this patch.

Andres also pointed out that we can try to track a point in the file
that is below any place where a zero page might still exist. That will
allow us to call zero pages invalid unless they are related to a recent
extension, which is a weakness in the current checksums code.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-14 18:03:52
Message-ID: 1371233032.27844.42.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2013-06-14 at 13:21 -0400, Greg Smith wrote:
> I'm planning to duplicate Jon's test program on a few machines here, and
> then see if that turns into a useful latency improvement for clients.
> I'm trying to get this pgbench rate limit stuff working first though,
> because one of the tests I had in mind for WAL creation overhead would
> benefit from it.

Great, I'll wait on those results.

Regards,
Jeff Davis


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-17 02:59:39
Message-ID: CAKuK5J26Ls90Ry2Ldgj+uoU3_DbfsJsO4yKxx_0a-pw3f8Q8VQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
>> Ack. I've revised the patch to always have the GUC (for now), default
>> to false, and if configure can't find posix_fallocate (or the user
>> disables it by way of pg_config_manual.h) then it remains a GUC that
>> simply can't be changed.
>
> Why have a GUC here at all? Perhaps this was already discussed, and I
> missed it? Is it just for testing purposes, or did you intend for it to
> be in the final version?

As Greg Smith noted, right now it's only for testing purposes.

> * The other code assumes that no errno means ENOSPC. We should be
> consistent about that assumption, and do the same thing in the fallocate
> case.

Unlike write(2), posix_fallocate does *not* set errno, but instead
returns a subset of the errno values as it's return code. The current
approach was suggested to me by Andres Freund and Alvaro Herrera.

> * You check for the presence of posix_fallocate at configure time, but
> don't #ifdef the call site. It looks like you removed this from the v2
> patch, was there a reason for that? Won't that cause build errors for
> platforms without it?

Indeed! I believe I have corrected the error and will be sending out a
revised patch (v5), once the performance and back-testing have
completed.

--
Jon


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-17 04:53:21
Message-ID: CAKuK5J2O7SjiyuZHdE62=N3xL2vSYQVLOU7eFXjeuFsi8Pk6hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 16, 2013 at 9:59 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>> On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
..
>> * You check for the presence of posix_fallocate at configure time, but
>> don't #ifdef the call site. It looks like you removed this from the v2
>> patch, was there a reason for that? Won't that cause build errors for
>> platforms without it?
>
> Indeed! I believe I have corrected the error and will be sending out a
> revised patch (v5), once the performance and back-testing have
> completed.

Please find attached v5 of the patch, with the above correction in place.
The only change from the v4 patch is wrapping the if
(wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE.
Thanks!

--
Jon

Attachment Content-Type Size
fallocate-v5.patch application/octet-stream 7.3 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-21 02:19:04
Message-ID: 1371781144.2349.1.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2013-06-16 at 23:53 -0500, Jon Nelson wrote:
> Please find attached v5 of the patch, with the above correction in place.
> The only change from the v4 patch is wrapping the if
> (wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE.
> Thanks!

Thank you.

Greg, are you still working on those tests?

Regards,
Jeff Davis


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-28 18:38:34
Message-ID: 51CDD82A.9060005@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06/20/2013 07:19 PM, Jeff Davis wrote:
> On Sun, 2013-06-16 at 23:53 -0500, Jon Nelson wrote:
>> Please find attached v5 of the patch, with the above correction in place.
>> The only change from the v4 patch is wrapping the if
>> (wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE.
>> Thanks!
>
> Thank you.
>
> Greg, are you still working on those tests?

Since Greg seems to be busy, what needs to be done to test this?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-30 18:01:53
Message-ID: 1372615313.19747.13.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2013-05-28 at 22:10 -0400, Greg Smith wrote:
> I was just thinking of something to run in your test program, not
> another build time check. Just run the new allocation sequence, and
> then check the resulting WAL file for a) correct length, and b) 16K of
> zero bytes. I would like to build some confidence that posix_fallocate
> is operating correctly in this context on at least one platform. My
> experience with Linux handling this class of functions correctly has
> left me skeptical of them working until that's proven to be the case.

As I understand it, you are basically asking if posix_fallocate() works
at all anywhere.

Simple test program attached, which creates two files and fills them:
one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then,
I just cmp the resulting files (and also "ls" them, to make sure they
are 16MB).

Passes on my workstation:
$ uname -a
Linux jdavis 3.5.0-34-generic #55-Ubuntu SMP Thu Jun 6 20:18:19 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux

Regards,
Jeff Davis

Attachment Content-Type Size
fallocate.c text/x-csrc 341 bytes

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-30 18:11:58
Message-ID: 1372615918.19747.22.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2013-06-28 at 11:38 -0700, Josh Berkus wrote:
> Since Greg seems to be busy, what needs to be done to test this?

As I understand it, he was mainly asking if posix_fallocate works at
all. I tried to address that question with a simple test, which behaves
as I expected it to:

http://www.postgresql.org/message-id/1372615313.19747.13.camel@jdavis

Unless something surprising comes up, or someone thinks and objection
has been missed, I am going to commit this soon.

(Of course, to avoid your wrath, I'll get rid of the GUC which was added
for testing.)

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-30 19:50:28
Message-ID: 1372621828.19747.32.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2013-06-30 at 11:11 -0700, Jeff Davis wrote:
> Unless something surprising comes up, or someone thinks and objection
> has been missed, I am going to commit this soon.

Quick question to anyone who happens to know:

What is the standard procedure for changes to pg_config.h.win32? I
looked at an old patch of mine that Tom (CC'd) committed:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b966dd6c4228d696b291c1cdcb5ab8c8475fefa8

and I see that it modifies pg_config.h.win32. Should I modify it in a
similar way for this fallocate patch?

Right now, pg_config.in.win32 seems a little inconsistent because it has
an entry for HAVE_POSIX_SIGNALS but not HAVE_POSIX_FADVISE. It says it's
a generated file, but I don't have a windows environment or MingW.

Regards,
Jeff Davis


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-30 20:09:12
Message-ID: 51D09068.8020404@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 06/30/2013 03:50 PM, Jeff Davis wrote:
> On Sun, 2013-06-30 at 11:11 -0700, Jeff Davis wrote:
>> Unless something surprising comes up, or someone thinks and objection
>> has been missed, I am going to commit this soon.
> Quick question to anyone who happens to know:
>
> What is the standard procedure for changes to pg_config.h.win32? I
> looked at an old patch of mine that Tom (CC'd) committed:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b966dd6c4228d696b291c1cdcb5ab8c8475fefa8
>
> and I see that it modifies pg_config.h.win32. Should I modify it in a
> similar way for this fallocate patch?
>
> Right now, pg_config.in.win32 seems a little inconsistent because it has
> an entry for HAVE_POSIX_SIGNALS but not HAVE_POSIX_FADVISE. It says it's
> a generated file, but I don't have a windows environment or MingW.
>

It was originally generated. Since then it's been maintained by hand.

If you need a Windows environment to test on, see the Amazon recipe at
<http://wiki.postgresql.org/wiki/Building_With_MinGW>

cheers

andrew

>
>
>


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-30 22:55:39
Message-ID: 51D0B76B.6040706@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/30/13 2:01 PM, Jeff Davis wrote:
> Simple test program attached, which creates two files and fills them:
> one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then,
> I just cmp the resulting files (and also "ls" them, to make sure they
> are 16MB).

This makes platform level testing a lot easier, thanks. Attached is an
updated copy of that program with some error checking. If the files it
creates already existed, the code didn't notice, and a series of write
errors happened. If you set the test up right it's not a problem, but
it's better if a bad setup is caught. I wrapped the whole test with a
shell script, also attached, which insures the right test sequence and
checks.

Your C test program compiles and passes on RHEL5/6 here, doesn't on OS X
Darwin. No surprises there, there's a long list of platforms that don't
support this call at
https://www.gnu.org/software/gnulib/manual/html_node/posix_005ffallocate.html
and the Mac is on it. Many other platforms I was worried about don't
support it too--older FreeBSD, HP-UX 11, Solaris 10, mingw, MSVC--so
that cuts down on testing quite a bit. If it runs faster on Linux,
that's the main target here, just like the existing
effective_io_concurrency fadvise code.

The specific thing I was worried about is that this interface might have
a stub that doesn't work perfectly in older Linux kernels. After being
surprised to find this interface worked on RHEL5 with your test program,
I dug into this more. It works there, but it may actually be slower.

posix_fallocate is actually implemented by glibc on Linux. Been there
since 2.1.94 according to the Linux man pages. But Linux itself didn't
add the feature until kernel 2.6.20: http://lwn.net/Articles/226436/
The biggest thing I was worried about--the call might be there in early
kernels but with a non-functional implementation--that's not the case.
Looking at the diff, before that patch there's no fallocate at all.

So what happened in earlier kernels, where there was no kernel level
fallocate available? According to
https://www.redhat.com/archives/fedora-devel-list/2009-April/msg00110.html
what glibc does is check for kernel fallocate(), and if it's not there
it writes a bunch of zeros to create the file instead. What is actually
happening on a RHEL5 system (with kernel 2.6.18) is that calling
posix_fallocate does this fallback behavior, where it basically does the
same thing the existing WAL clearing code does.

I can even prove that's the case. On RHEL5, if you run "strace -o out
./fallocate" the main write loop looks like this:

write(3,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192

But when you call posix_fallocate, you still get a bunch of writes, but
4 bytes at a time:

pwrite(4, "\0", 1, 16769023) = 1
pwrite(4, "\0", 1, 16773119) = 1
pwrite(4, "\0", 1, 16777215) = 1

That's glibc helpfully converting your call to posix_fallocate into
small writes, because the OS doesn't provide a better way in that
kernel. It's not hard to imagine this being slower than what the WAL
code is doing right now. I'm not worried about correctness issues
anymore, but my gut paranoia about this not working as expected on older
systems was justified. Everyone who thought I was just whining owes me
a cookie.

This is what I plan to benchmark specifically next. If the
posix_fallocate approach is actually slower than what's done now when
it's not getting kernel acceleration, which is the case on RHEL5 era
kernels, we might need to make the configure time test more complicated.
Whether posix_fallocate is defined isn't sensitive enough; on Linux it
may be the case that this only is usable when fallocate() is also there.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment Content-Type Size
testfallocate text/plain 536 bytes
fallocate.c text/x-csrc 544 bytes

From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-30 23:41:57
Message-ID: CAKuK5J2QK0M1ON=DiB00Yp4_0B__Ks1J=axAYeJYBiA55FUqbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 30, 2013 at 5:55 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>
>
> pwrite(4, "\0", 1, 16769023) = 1
> pwrite(4, "\0", 1, 16773119) = 1
> pwrite(4, "\0", 1, 16777215) = 1
>
> That's glibc helpfully converting your call to posix_fallocate into small
> writes, because the OS doesn't provide a better way in that kernel. It's
> not hard to imagine this being slower than what the WAL code is doing right
> now. I'm not worried about correctness issues anymore, but my gut paranoia
> about this not working as expected on older systems was justified. Everyone
> who thought I was just whining owes me a cookie.

I had noted in the very early part of the thread that glibc emulates
posix_fallocate when the (Linux-specific) 'fallocate' systemcall
fails. In this case, it's writing 4 bytes of zeros and then
essentially seeking forward 4092 (4096-4) bytes. This prevents files
with holes in them because the holes have to be at least 4kiB in size,
if I recall properly. It's *not* writing out 16MiB in 4 byte
increments.

--
Jon


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-30 23:49:20
Message-ID: 51D0C400.1080001@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/28/13 10:00 PM, Jon Nelson wrote:

> A note: The attached test program uses *fsync* instead of *fdatasync*
> after calling fallocate (or writing out 16MB of zeroes), per an
> earlier suggestion.

I tried this out on the RHEL5 platform I'm worried about now. There's
something weird about the test program there. If I run it once it shows
posix_fallocate running much faster:

without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

The problem is that I'm seeing the gap between the two get smaller the
more iterations I run, which makes me wonder if the test is completely fair:

without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s

without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s

without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s

You didn't show any output from the latest program on your system, so
I'm not sure how it behaved for you here.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 01:28:11
Message-ID: CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 30, 2013 at 6:49 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 5/28/13 10:00 PM, Jon Nelson wrote:
>
>> A note: The attached test program uses *fsync* instead of *fdatasync*
>> after calling fallocate (or writing out 16MB of zeroes), per an
>> earlier suggestion.
>
>
> I tried this out on the RHEL5 platform I'm worried about now. There's
> something weird about the test program there. If I run it once it shows
> posix_fallocate running much faster:
>
> without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
> with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

Assuming the platform chosen is using the glibc approach of pwrite(4
bytes) every 4KiB, then the results ought to be similar, and I'm at a
loss to explain why it's performing better (unless - grasping at
straws - simply the *volume* of data transferred from userspace to the
kernel is at play, in which case posix_fallocate will result in 4096
calls to pwrite but at 4 bytes each versus 2048 calls to write at 8KiB
each.) Ultimately the same amount of data gets written to disk (one
would imagine), but otherwise I can't really think of much.

I have also found several errors test_fallocate.c program I posted,
corrected below.
One of them is: it is missing two pairs of parentheses around two #defines:

#define SIXTEENMB 1024*1024*16
#define EIGHTKB 1024*8

should be:

#define SIXTEENMB (1024*1024*16)
#define EIGHTKB (1024*8)

Otherwise the program will end up writing (131072) 8KiB blocks instead of 2048.

This actually makes the comparison between writing 8KiB blocks and
using posix_fallocate favor the latter more strongly in the results
(also seen below).

> The problem is that I'm seeing the gap between the two get smaller the more
> iterations I run, which makes me wonder if the test is completely fair:
>
> without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
> with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s
>
>
> without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
> with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s
>
> without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
> with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s
>
> You didn't show any output from the latest program on your system, so I'm
> not sure how it behaved for you here.

On the the platform I use - openSUSE (12.3, x86_64, kernel 3.9.7
currently) I never see posix_fadvise perform worse. Typically better,
sometimes much better.

To set the number of times the file is overwritten to just 1 (one):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done

I am including a revised version of test_fallocate.c that corrects the
above noted error, one typo (from when I changed fdatasync to fsync)
that did not alter program behavior, corrects a mis-placed
gettimeofday (which does change the results) and includes a new test
that aims (perhaps poorly) to emulate the glibc style of pwrite(4
bytes) for every 4KiB, and tests the resulting file size to make sure
it is 16MiB in size.

The performance of the latter (new) test sometimes seems to perform
worse and sometimes seems to perform better (usually worse) than
either of the other two. In all cases, posix_fallocate performs
better, but I don't have a sufficiently old kernel to test with.

The new results on one machine are below.

With 0 (zero) rewrites (testing *just*
open/some_type_of_allocation/fsync/close):

method: classic. 100 open/close iterations, 0 rewrite in 29.6060s
method: posix_fallocate. 100 open/close iterations, 0 rewrite in 2.1054s
method: glibc emulation. 100 open/close iterations, 0 rewrite in 31.7445s

And with the same number of rewrites as open/close cycles:

method: classic. 1 open/close iterations, 1 rewrite in 0.6297s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3028s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.5521s

method: classic. 2 open/close iterations, 2 rewrite in 1.6455s
method: posix_fallocate. 2 open/close iterations, 2 rewrite in 1.0409s
method: glibc emulation. 2 open/close iterations, 2 rewrite in 1.5604s

method: classic. 5 open/close iterations, 5 rewrite in 7.5916s
method: posix_fallocate. 5 open/close iterations, 5 rewrite in 6.9177s
method: glibc emulation. 5 open/close iterations, 5 rewrite in 8.1137s

method: classic. 10 open/close iterations, 10 rewrite in 29.2816s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 28.4400s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 31.2693s

--
Jon

Attachment Content-Type Size
test_fallocate.c text/x-csrc 3.5 KB

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 04:52:08
Message-ID: 51D10AF8.2010707@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/30/13 9:28 PM, Jon Nelson wrote:
> The performance of the latter (new) test sometimes seems to perform
> worse and sometimes seems to perform better (usually worse) than
> either of the other two. In all cases, posix_fallocate performs
> better, but I don't have a sufficiently old kernel to test with.

This updated test program looks reliable now. The numbers are very
tight when I'd expect them to be, and there's nowhere with the huge
differences I saw in the earlier test program.

Here's results from a few sets of popular older platforms:

RHEL5, ext3

method: classic. 10 open/close iterations, 10 rewrite in 22.6949s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 23.0113s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 22.4921s

method: classic. 10 open/close iterations, 10 rewrite in 23.2808s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.4736s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.9871s

method: classic. 10 open/close iterations, 10 rewrite in 22.4812s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.2393s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.6940s

RHEL6, ext4

method: classic. 10 open/close iterations, 10 rewrite in 56.0483s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 61.5092s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 53.8569s

method: classic. 10 open/close iterations, 10 rewrite in 57.0361s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 55.9840s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 64.9437sb

RHEL6, ext3

method: classic. 10 open/close iterations, 10 rewrite in 14.4080s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.1395s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 16.9657s

method: classic. 10 open/close iterations, 10 rewrite in 15.2825s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.5315s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 14.8115s

The win for posix_fallocate is there in most cases, but it's pretty hard
to see in these older systems. That could be OK. As long as the
difference is no more than noise, and that is the case, this could be
good enough to commit. If there are significantly better results on the
new platforms, the old ones need to just not get worse.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 15:24:44
Message-ID: CAKuK5J0sCoLOZ=COdptKQbjO-GTSznB2G5Sc_7-=2bgC1vjSRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 30, 2013 at 11:52 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 6/30/13 9:28 PM, Jon Nelson wrote:
>>
>> The performance of the latter (new) test sometimes seems to perform
>> worse and sometimes seems to perform better (usually worse) than
>> either of the other two. In all cases, posix_fallocate performs
>> better, but I don't have a sufficiently old kernel to test with.
>
>
> This updated test program looks reliable now. The numbers are very tight
> when I'd expect them to be, and there's nowhere with the huge differences I
> saw in the earlier test program.
>
> Here's results from a few sets of popular older platforms:

If you found yourself with a spare moment, could you run these again
with the number of open/close cycles set high (say, 100) and the
number of rewrites set to 0 and also to 1? Most of the time spent is
actually spent overwriting the files so by reducing or eliminating
that aspect it might be easier to get a handle on the actual
performance difference.

--
Jon


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 16:34:46
Message-ID: 1372696486.19747.36.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2013-06-30 at 16:09 -0400, Andrew Dunstan wrote:
> It was originally generated. Since then it's been maintained by hand.

What is the procedure for maintaining it by hand? Why are
HAVE_POSIX_SIGNALS and HAVE_SYNC_FILE_RANGE in there (though commented
out), but not HAVE_POSIX_FADVISE?

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 16:55:51
Message-ID: 1372697751.19747.51.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2013-06-30 at 18:55 -0400, Greg Smith wrote:
> This makes platform level testing a lot easier, thanks. Attached is an
> updated copy of that program with some error checking. If the files it
> creates already existed, the code didn't notice, and a series of write
> errors happened. If you set the test up right it's not a problem, but
> it's better if a bad setup is caught. I wrapped the whole test with a
> shell script, also attached, which insures the right test sequence and
> checks.

Thank you.

> That's glibc helpfully converting your call to posix_fallocate into
> small writes, because the OS doesn't provide a better way in that
> kernel. It's not hard to imagine this being slower than what the WAL
> code is doing right now. I'm not worried about correctness issues
> anymore, but my gut paranoia about this not working as expected on older
> systems was justified. Everyone who thought I was just whining owes me
> a cookie.

So your theory is that it may be slower because there are twice as many
syscalls (one per 4K page rather than one per 8K page)? Interesting
observation.

> This is what I plan to benchmark specifically next.

In the interest of keeping this patch moving forward, do you have an
estimate for when this testing will be complete?

> If the
> posix_fallocate approach is actually slower than what's done now when
> it's not getting kernel acceleration, which is the case on RHEL5 era
> kernels, we might need to make the configure time test more complicated.
> Whether posix_fallocate is defined isn't sensitive enough; on Linux it
> may be the case that this only is usable when fallocate() is also there.

I'd say that if posix_fallocate is slower than the existing code on
pretty much any platform, we shouldn't commit the patch at all. I would
be quite surprised if that was the case, however.

Regards,
Jeff Davis


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 17:13:22
Message-ID: CAHGQGwGnYcX=Rp4f-YQqamHcLYMcskUxXa+k_3nm1VnOVdX3+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 2, 2013 at 1:55 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Sun, 2013-06-30 at 18:55 -0400, Greg Smith wrote:
>> This makes platform level testing a lot easier, thanks. Attached is an
>> updated copy of that program with some error checking. If the files it
>> creates already existed, the code didn't notice, and a series of write
>> errors happened. If you set the test up right it's not a problem, but
>> it's better if a bad setup is caught. I wrapped the whole test with a
>> shell script, also attached, which insures the right test sequence and
>> checks.
>
> Thank you.
>
>> That's glibc helpfully converting your call to posix_fallocate into
>> small writes, because the OS doesn't provide a better way in that
>> kernel. It's not hard to imagine this being slower than what the WAL
>> code is doing right now. I'm not worried about correctness issues
>> anymore, but my gut paranoia about this not working as expected on older
>> systems was justified. Everyone who thought I was just whining owes me
>> a cookie.
>
> So your theory is that it may be slower because there are twice as many
> syscalls (one per 4K page rather than one per 8K page)? Interesting
> observation.
>
>> This is what I plan to benchmark specifically next.
>
> In the interest of keeping this patch moving forward, do you have an
> estimate for when this testing will be complete?
>
>> If the
>> posix_fallocate approach is actually slower than what's done now when
>> it's not getting kernel acceleration, which is the case on RHEL5 era
>> kernels, we might need to make the configure time test more complicated.
>> Whether posix_fallocate is defined isn't sensitive enough; on Linux it
>> may be the case that this only is usable when fallocate() is also there.
>
> I'd say that if posix_fallocate is slower than the existing code on
> pretty much any platform, we shouldn't commit the patch at all.

Even in that case, if a user can easily know which platform posix_fallocate
should be used in, we can commit the patch with the configurable GUC
parameter.

Regards,

--
Fujii Masao


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 18:17:26
Message-ID: 1372702646.19747.75.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2013-07-02 at 02:13 +0900, Fujii Masao wrote:
> Even in that case, if a user can easily know which platform posix_fallocate
> should be used in, we can commit the patch with the configurable GUC
> parameter.

I disagree here. We're not talking about a huge win; this speedup may
not even be detectable on a running system.

I think Robert summarized the reason for the patch best: "I mean, if
posix_fallocate() is faster, then it's just faster, right?". But if we
need a new GUC, and DBAs now have one more thing they need to test about
their platform, then that argument goes out the window.

Regards,
Jeff Davis


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 19:44:19
Message-ID: CA+TgmoYjkHh1+QV48q9orWc8APwEYK-6LHFyubsLtG-BFC5MnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 1, 2013 at 2:17 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Tue, 2013-07-02 at 02:13 +0900, Fujii Masao wrote:
>> Even in that case, if a user can easily know which platform posix_fallocate
>> should be used in, we can commit the patch with the configurable GUC
>> parameter.
>
> I disagree here. We're not talking about a huge win; this speedup may
> not even be detectable on a running system.
>
> I think Robert summarized the reason for the patch best: "I mean, if
> posix_fallocate() is faster, then it's just faster, right?". But if we
> need a new GUC, and DBAs now have one more thing they need to test about
> their platform, then that argument goes out the window.

Yeah. If the patch isn't going to be a win on RHEL 5, I'd consider
that a good reason to scrap it for now and revisit it in 3 years.
There are still a LOT of people running RHEL 5, and the win isn't big
enough to engineer a more complex solution. But ultimately RHEL 5,
like any other old system, will go away.

However, Greg's latest testing results seem to show that there isn't
much to worry about, so we may be fine anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 20:04:46
Message-ID: 51D1E0DE.9090009@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 7/1/13 12:34 PM, Jeff Davis wrote:
> On Sun, 2013-06-30 at 16:09 -0400, Andrew Dunstan wrote:
>> It was originally generated. Since then it's been maintained by hand.
>
> What is the procedure for maintaining it by hand?

Edit away.

> Why are
> HAVE_POSIX_SIGNALS and HAVE_SYNC_FILE_RANGE in there (though commented
> out), but not HAVE_POSIX_FADVISE?

Because maintaining it by hand is prone to inconsistencies. I wouldn't
worry about it.


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 21:07:36
Message-ID: 51D1EF98.5090603@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 7/1/13 3:44 PM, Robert Haas wrote:
> Yeah. If the patch isn't going to be a win on RHEL 5, I'd consider
> that a good reason to scrap it for now and revisit it in 3 years.
> There are still a LOT of people running RHEL 5, and the win isn't big
> enough to engineer a more complex solution.

I'm still testing, expect to have this wrapped up on my side by
tomorrow. So much of the runtime here is the file setup/close that
having a 2:1 difference in number of writes, what happens on the old
platforms, it is hard to get excited about.

I don't think the complexity to lock out RHEL5 here is that bad even if
it turns out to be a good idea. Just add another configure check for
fallocate, and on Linux if it's not there don't use posix_fallocate
either. Maybe 5 lines of macro code? RHEL5 sure isn't going anyway
anytime soon, but at the same time there won't be that many 9.4
deployments on that version.

I've been digging into the main situation where this feature helps, and
it won't be easy to duplicate in a benchmark situation. Using Linux's
fallocate works as a hint that the whole 16MB should be allocated at
once, and therefore together on disk if feasible. The resulting WAL
files should be less prone to fragmentation. That's actually the
biggest win of this approach, but I can't easily duplicate the sort of
real-world fragmentation I see on live servers here. Given that, I'm
leaning toward saying that unless there's a clear regression on older
platforms, above the noise floor, this is still the right thing to do.

I fully agree that this needs to fully automatic--no GUC--before it's
worth committing. If we can't figure out the right thing to do now,
there's little hope anyone else will in a later tuning expedition.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-05 06:50:01
Message-ID: 1373007001.19747.155.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2013-07-01 at 00:52 -0400, Greg Smith wrote:
> On 6/30/13 9:28 PM, Jon Nelson wrote:
> > The performance of the latter (new) test sometimes seems to perform
> > worse and sometimes seems to perform better (usually worse) than
> > either of the other two. In all cases, posix_fallocate performs
> > better, but I don't have a sufficiently old kernel to test with.
>
> This updated test program looks reliable now. The numbers are very
> tight when I'd expect them to be, and there's nowhere with the huge
> differences I saw in the earlier test program.
>
> Here's results from a few sets of popular older platforms:

...

> The win for posix_fallocate is there in most cases, but it's pretty hard
> to see in these older systems. That could be OK. As long as the
> difference is no more than noise, and that is the case, this could be
> good enough to commit. If there are significantly better results on the
> new platforms, the old ones need to just not get worse.

I ran my own little test on my workstation[1] with the attached
programs. One does what we do now, another mimics the glibc emulation
you described earlier, and another uses posix_fallocate(). It does an
allocation phase, an fsync, a single rewrite, and then another fsync.
The program runs this 64 times for 64 different 16MB files.

write1 and write2 are almost exactly even at 25.5s. write3 is about
14.5s, which is a pretty big improvement.

So, my simple conclusion is that glibc emulation should be about the
same as what we're doing now, so there's no reason to avoid it. That
means, if posix_fallocate() is present, we should use it, because it's
either the same (if emulated in glibc) or significantly faster (if
implemented in the kernel).

If that is your conclusion, as well, it looks like this patch is about
ready for commit. What do you think?

Regards,
Jeff Davis

[1] workstation specs (ubuntu 12.10, ext4):
$ uname -a
Linux jdavis 3.5.0-34-generic #55-Ubuntu SMP Thu Jun 6 20:18:19 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux

Attachment Content-Type Size
write1.c text/x-csrc 576 bytes
write3.c text/x-csrc 552 bytes
write2.c text/x-csrc 591 bytes

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-05 07:23:20
Message-ID: 51D67468.1080705@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 7/5/13 2:50 AM, Jeff Davis wrote:
> So, my simple conclusion is that glibc emulation should be about the
> same as what we're doing now, so there's no reason to avoid it. That
> means, if posix_fallocate() is present, we should use it, because it's
> either the same (if emulated in glibc) or significantly faster (if
> implemented in the kernel).

That's what I'm seeing everywhere too. I'm happy that we've spent
enough time chasing after potential issues without finding anything now.
Pull out the GUC that was added for default and this is ready to commit.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-05 13:21:01
Message-ID: CAKuK5J0JObq_y-LF311PVOaN-qx16O5j1BVNeErVdCQyE+WryQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 5, 2013 at 2:23 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 7/5/13 2:50 AM, Jeff Davis wrote:
>>
>> So, my simple conclusion is that glibc emulation should be about the
>> same as what we're doing now, so there's no reason to avoid it. That
>> means, if posix_fallocate() is present, we should use it, because it's
>> either the same (if emulated in glibc) or significantly faster (if
>> implemented in the kernel).
>
>
> That's what I'm seeing everywhere too. I'm happy that we've spent enough
> time chasing after potential issues without finding anything now. Pull out
> the GUC that was added for default and this is ready to commit.

Wonderful. Is removing the GUC something that I should do or should
that be done by somebody that knows more about what they are doing? (I
am happy to give it a go!)

Should the small test program that I made also be included somewhere
in the source tree?

--
Jon


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-05 18:27:49
Message-ID: 1373048869.5630.2.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2013-07-05 at 08:21 -0500, Jon Nelson wrote:
> Wonderful. Is removing the GUC something that I should do or should
> that be done by somebody that knows more about what they are doing? (I
> am happy to give it a go!)

I'll take care of that.

> Should the small test program that I made also be included somewhere
> in the source tree?

I don't think that's necessary, the fact that it's in the mailing list
archives is enough.

Regards,
Jeff Davis