Re: 9.4 regression

From: Thom Brown <thom(at)linux(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: 9.4 regression
Date: 2013-08-07 16:21:01
Message-ID: CAA-aLv7tYHDzMGg4HtDZh0RQZjJc2v2weJ-Obm4yvkw6ePe9Qw@mail.gmail.com
Lists: pgsql-hackers

Hi all,

I recently tried a simple benchmark to see how far 9.4 had come since
8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
suggestion), I found that those were fine, so the issue must be in
9.4devel. I used identical configurations for each test, and used
9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
The common config changes were:

max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
checkpoint_segments = 32
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
commit_delay = 50
commit_siblings = 15

System info:
8GB DDR3 RAM (yes, I know the config isn't optimal here)
64-bit Linux Mint 15 (3.8.0 kernel)
ext4

The only build option used was --enable-depend. I did have
--enable-cassert for the shorter 5 min benchmarks, but it was removed for
the 30 min tests.

Here are the results:

pgbench -j 80 -c 80 -T 300:

8.4 - 535.990042
9.2 - 820.798141
9.3 - 828.395498
9.4 - 197.851720

pgbench -j 20 -c 20 -T 300:

8.4 - 496.529343
9.2 - 569.626729
9.3 - 575.831264
9.4 - 385.658893

pgbench -j 20 -c 400 -T 300:

8.4 - 580.186868
9.2 - 824.590252
9.3 - 784.638848
9.4 - 524.493308

These were only run for 5 minutes each, but subsequent ones were run
for 30 mins. All tests were run with -s 20.

Given a few key commits Robert Haas directed me to, I narrowed down
the regression to a time period, so I benchmarked a few select commits.

pgbench -j 80 -c 80 -T 1800:

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

Ironically, that was intended to be a performance improvement.

Regards

Thom


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Thom Brown <thom(at)linux(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 16:49:58
Message-ID: 20130807164958.GD4503@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-08-07 17:21:01 +0100, Thom Brown wrote:
> The only build option used was --enable-depend. I did have
> --enable-cassert for the shorter 5 min benchmarks, but it was removed for
> the 30 min tests.

> pgbench -j 80 -c 80 -T 300:
>
> 8.4 - 535.990042
> 9.2 - 820.798141
> 9.3 - 828.395498
> 9.4 - 197.851720
>
> pgbench -j 80 -c 80 -T 1800:
>
> 8.4: 812.482108
> 9.4 HEAD: 355.397658
> 9.4 e5592c (9th July): 356.485625
> 9.4 537227 (7th July): 365.992518
> 9.4 9b2543 (7th July): 362.587339
> 9.4 269e78 (5th July): 359.439143
> 9.4 8800d8 (5th July): 821.933082
> 9.4 568d41 (2nd July): 822.991160

The differences between those runs look too small for enable/disable
cassert to me. Are you sure you properly rebuilt for that?

> 269e78 was the commit immediately after 8800d8, so it appears that
> introduced the regression.
>
> "Use posix_fallocate() for new WAL files, where available."

This is curious. Could you either run a longer test before/after the
commit or reduce checkpoint_timeout to something like 3min?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Thom Brown <thom(at)linux(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 16:54:07
Message-ID: CAKuK5J2gK=a8vi=XXAELU=GtgELxnyTOLCktMFQvesK2T3Hc1w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom(at)linux(dot)com> wrote:
> Hi all,
>
> I recently tried a simple benchmark to see how far 9.4 had come since
> 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
> performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
> suggestion), I found that those were fine, so the issue must be in
> 9.4devel. I used identical configurations for each test, and used
> 9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
> The common config changes were:

...

> 8.4: 812.482108
> 9.4 HEAD: 355.397658
> 9.4 e5592c (9th July): 356.485625
> 9.4 537227 (7th July): 365.992518
> 9.4 9b2543 (7th July): 362.587339
> 9.4 269e78 (5th July): 359.439143
> 9.4 8800d8 (5th July): 821.933082
> 9.4 568d41 (2nd July): 822.991160
>
> 269e78 was the commit immediately after 8800d8, so it appears that
> introduced the regression.
>
> "Use posix_fallocate() for new WAL files, where available."
>
> Ironically, that was intended to be a performance improvement.

Would it be possible for you to download, compile, and run the test
program as described and located in this email:

http://www.postgresql.org/message-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com

I also wonder if there is a problem with the 3.8.0 kernel specifically.

--
Jon


From: Thom Brown <thom(at)linux(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 16:54:09
Message-ID: CAA-aLv4N6=yZ+3tgCpXAX65U8EFserxs8OCjA=iYCTy0tUo9hw@mail.gmail.com
Lists: pgsql-hackers

On 7 August 2013 17:49, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-08-07 17:21:01 +0100, Thom Brown wrote:
>> The only build option used was --enable-depend. I did have
>> --enable-cassert for the shorter 5 min benchmarks, but it was removed for
>> the 30 min tests.
>
>> pgbench -j 80 -c 80 -T 300:
>>
>> 8.4 - 535.990042
>> 9.2 - 820.798141
>> 9.3 - 828.395498
>> 9.4 - 197.851720
>>
>> pgbench -j 80 -c 80 -T 1800:
>>
>> 8.4: 812.482108
>> 9.4 HEAD: 355.397658
>> 9.4 e5592c (9th July): 356.485625
>> 9.4 537227 (7th July): 365.992518
>> 9.4 9b2543 (7th July): 362.587339
>> 9.4 269e78 (5th July): 359.439143
>> 9.4 8800d8 (5th July): 821.933082
>> 9.4 568d41 (2nd July): 822.991160
>
> The differences between those runs look too small for enable/disable
> cassert to me. Are you sure you properly rebuilt for that?
>
>> 269e78 was the commit immediately after 8800d8, so it appears that
>> introduced the regression.
>>
>> "Use posix_fallocate() for new WAL files, where available."
>
> This is curious. Could you either run a longer test before/after the
> commit or reduce checkpoint_timeout to something like 3min?

Okay, I'll rerun the test for both those commits at 1 hour each with
checkpoint_timeout set at 3mins, but with all other configuration
settings the same.

--
Thom


From: Thom Brown <thom(at)linux(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 17:42:08
Message-ID: CAA-aLv5Gz+mSyxN5gjB960Tj6Bj3k+eiyw3TXRWhESb+tCqLXA@mail.gmail.com
Lists: pgsql-hackers

On 7 August 2013 17:54, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom(at)linux(dot)com> wrote:
>> Hi all,
>>
>> I recently tried a simple benchmark to see how far 9.4 had come since
>> 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
>> performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
>> suggestion), I found that those were fine, so the issue must be in
>> 9.4devel. I used identical configurations for each test, and used
>> 9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
>> The common config changes were:
>
> ...
>
>> 8.4: 812.482108
>> 9.4 HEAD: 355.397658
>> 9.4 e5592c (9th July): 356.485625
>> 9.4 537227 (7th July): 365.992518
>> 9.4 9b2543 (7th July): 362.587339
>> 9.4 269e78 (5th July): 359.439143
>> 9.4 8800d8 (5th July): 821.933082
>> 9.4 568d41 (2nd July): 822.991160
>>
>> 269e78 was the commit immediately after 8800d8, so it appears that
>> introduced the regression.
>>
>> "Use posix_fallocate() for new WAL files, where available."
>>
>> Ironically, that was intended to be a performance improvement.
>
> Would it be possible for you to download, compile, and run the test
> program as described and located in this email:
>
> http://www.postgresql.org/message-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com

I shall do after the 2 x 1 hour benchmarks are complete.

> I also wonder if there is a problem with the 3.8.0 kernel specifically.

Well, my laptop has the same kernel (and also 64-bit Linux Mint), so I
took 3 quick sample benchmarks on those two commits, and I get the
following (all without --enable-cassert):

269e78: 1162.593665 / 1158.644302 / 1153.955611
8800d8: 2446.087618 / 2443.797252 / 2321.876431

And running your test program gives the following (again, just on my laptop):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s

--
Thom


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Thom Brown <thom(at)linux(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 17:49:21
Message-ID: CAKuK5J3t7MvwspzGhmh28HepfTLWT2urZFiOiCH-gjOeu3TAsw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown <thom(at)linux(dot)com> wrote:
> On 7 August 2013 17:54, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>> On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom(at)linux(dot)com> wrote:
>>> Hi all,
>>>
>>> I recently tried a simple benchmark to see how far 9.4 had come since
>>> 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
>>> performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
>>> suggestion), I found that those were fine, so the issue must be in
>>> 9.4devel. I used identical configurations for each test, and used
>>> 9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
>>> The common config changes were:
>>
>> ...
>>
>>> 8.4: 812.482108
>>> 9.4 HEAD: 355.397658
>>> 9.4 e5592c (9th July): 356.485625
>>> 9.4 537227 (7th July): 365.992518
>>> 9.4 9b2543 (7th July): 362.587339
>>> 9.4 269e78 (5th July): 359.439143
>>> 9.4 8800d8 (5th July): 821.933082
>>> 9.4 568d41 (2nd July): 822.991160
>>>
>>> 269e78 was the commit immediately after 8800d8, so it appears that
>>> introduced the regression.
>>>
>>> "Use posix_fallocate() for new WAL files, where available."
>>>
>>> Ironically, that was intended to be a performance improvement.
>>
>> Would it be possible for you to download, compile, and run the test
>> program as described and located in this email:
>>
>> http://www.postgresql.org/message-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com
>
> I shall do after the 2 x 1 hour benchmarks are complete.
>
>> I also wonder if there is a problem with the 3.8.0 kernel specifically.
>
> Well, my laptop has the same kernel (and also 64-bit Linux Mint), so I
> took 3 quick sample benchmarks on those two commits, and I get the
> following (all without --enable-cassert):
>
> 269e78: 1162.593665 / 1158.644302 / 1153.955611
> 8800d8: 2446.087618 / 2443.797252 / 2321.876431
>
> And running your test program gives the following (again, just on my laptop):
>
> for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
> method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
> method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
> method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
> method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
> method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
> method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
> method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s

OK. That's interesting, and a good data point.

One thing you could try is manually disabling the use of posix_fallocate in 269e78.
After running ./configure --stuff-here,
edit src/include/pg_config.h and comment out the following line (on or
around line 374):
#define HAVE_POSIX_FALLOCATE 1

*then* build PostgreSQL and see if the performance hit is still there.

--
Jon


From: Thom Brown <thom(at)linux(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 18:04:43
Message-ID: CAA-aLv6OzvPDDR6nDdHzP73m4C+nux4NSk1VyiQ4yOy9x95LqQ@mail.gmail.com
Lists: pgsql-hackers

On 7 August 2013 18:49, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown <thom(at)linux(dot)com> wrote:
>> for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
>> method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
>> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
>> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
>> method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
>> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
>> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
>> method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
>> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
>> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
>> method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
>> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
>> method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
>> method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
>> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
>> method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s
>
> OK. That's interesting, and a good data point.
>
> One thing you could try is manually disabling the use of posix_fallocate in 269e78.
> After running ./configure --stuff-here,
> edit src/include/pg_config.h and comment out the following line (on or
> around line 374):
> #define HAVE_POSIX_FALLOCATE 1
>
> *then* build PostgreSQL and see if the performance hit is still there.

Okay, done that. The TPS increases again:

2308.807568 / 2554.264572 / 2563.190204

And I did run ./configure... before removing the line, and built it
after the change.

--
Thom


From: Thom Brown <thom(at)linux(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 19:23:55
Message-ID: CAA-aLv5qOJ7TAVK7QZmxPfs_kzaaBAcX9qpPX7=Dx5weNLm0sA@mail.gmail.com
Lists: pgsql-hackers

On 7 August 2013 17:54, Thom Brown <thom(at)linux(dot)com> wrote:
> On 7 August 2013 17:49, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2013-08-07 17:21:01 +0100, Thom Brown wrote:
>>> The only build option used was --enable-depend. I did have
>>> --enable-cassert for the shorter 5 min benchmarks, but it was removed for
>>> the 30 min tests.
>>
>>> pgbench -j 80 -c 80 -T 300:
>>>
>>> 8.4 - 535.990042
>>> 9.2 - 820.798141
>>> 9.3 - 828.395498
>>> 9.4 - 197.851720
>>>
>>> pgbench -j 80 -c 80 -T 1800:
>>>
>>> 8.4: 812.482108
>>> 9.4 HEAD: 355.397658
>>> 9.4 e5592c (9th July): 356.485625
>>> 9.4 537227 (7th July): 365.992518
>>> 9.4 9b2543 (7th July): 362.587339
>>> 9.4 269e78 (5th July): 359.439143
>>> 9.4 8800d8 (5th July): 821.933082
>>> 9.4 568d41 (2nd July): 822.991160
>>
>> The differences between those runs look too small for enable/disable
>> cassert to me. Are you sure you properly rebuilt for that?
>>
>>> 269e78 was the commit immediately after 8800d8, so it appears that
>>> introduced the regression.
>>>
>>> "Use posix_fallocate() for new WAL files, where available."
>>
>> This is curious. Could you either run a longer test before/after the
>> commit or reduce checkpoint_timeout to something like 3min?
>
> Okay, I'll rerun the test for both those commits at 1 hour each with
> checkpoint_timeout set at 3mins, but with all other configuration
> settings the same

Results
(checkpoint_timeout = 3min)

pgbench -j 80 -c 80 -T 3600

269e78: 606.268013
8800d8: 779.583129

Jon, here are the test results you asked for:

$ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s

--
Thom


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 21:18:13
Message-ID: 1375910293.3381.YahooMailNeo@web162905.mail.bf1.yahoo.com
Lists: pgsql-hackers

Thom Brown <thom(at)linux(dot)com> wrote:

> pgbench -j 80 -c 80 -T 3600
>
> 269e78: 606.268013
> 8800d8: 779.583129

As another data point I duplicated Thom's original tests:

max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
checkpoint_segments = 32
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
commit_delay = 50
commit_siblings = 15

pgbench -i -s 20 pgbench
pgbench -j 80 -c 80 -T 1800 pgbench
(note the high contention pgbench setup)

On this system:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
  (hyperthreading enabled)
2 x 15k RPM drives in software RAID 1, ext4, write cache disabled
3.5.0-37-generic #58-Ubuntu SMP Mon Jul 8 22:07:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

commit 8800d8061dd151d6556f5f8d58f8211fd830169f
number of transactions actually processed: 1298864
tps = 721.384612 (including connections establishing)
tps = 721.388938 (excluding connections establishing)

commit 269e780822abb2e44189afaccd6b0ee7aefa7ddd
number of transactions actually processed: 875900
tps = 486.524741 (including connections establishing)
tps = 486.527891 (excluding connections establishing)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <stark(at)mit(dot)edu>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 22:40:17
Message-ID: CAM-w4HNms75Rkbxegu5Aw=2Kcg-Pu5H4LbaqfJeA_fzaLkrVFQ@mail.gmail.com
Lists: pgsql-hackers

Did you report information about the system affected? What filesystem is it
on? If it's ext4 does it have extents enabled?

I think on ext3 or ext4 without extents it won't have any benefit but it
shouldn't really be any slower either since the libc implementation is very
similar to what we used to do.

I wouldn't expect the benefit from it to have a huge effect either. The
main benefit is actually the reduced fragmentation, so it would be pretty
dependent on the filesystem it's on anyway. Avoiding stalls when creating
new WAL files is also nice, but that shouldn't happen often enough in your
benchmarks to cause an effect on the average numbers.


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 22:44:45
Message-ID: CAKuK5J00j4SOyE-b=k4Vx0OUZ3-U+9uBV=o-NArTW6xh43_g+w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 7, 2013 at 4:18 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Thom Brown <thom(at)linux(dot)com> wrote:
>
>> pgbench -j 80 -c 80 -T 3600
>>
>> 269e78: 606.268013
>> 8800d8: 779.583129

I have also been running some tests and - as yet - they are
inconclusive. What I can say about them so far is that - at times -
they agree with these results and do show a performance hit.

I took the settings as posted and adjusted them slightly for my much
lower-powered personal laptop, changing checkpoint_completion_target
to 1.0 and checkpoint_timeout to 1min.

I am testing with the latest git HEAD, turning fallocate support on
and off by editing xlog.c directly. Furthermore, before each run I
would try to reduce the number of existing WAL files by making a "pre"
run with checkpoint_segments = 3 before changing it to 32.

For reasons I don't entirely understand, when WAL files are being
created (rather than recycled) I found a performance hit, but when
they were being recycled I got a slight performance gain (this may be
due to reduced fragmentation) but the gain was never more than 10% and
frequently less than that.

I can't explain - yet (if ever!) - why my initial tests (and many of
those done subsequently) showed improvement - it may very well have
something to do with how the code interacts with these settings.

--
Jon


From: Thom Brown <thom(at)linux(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-07 23:04:01
Message-ID: CAA-aLv52sD-2b3_9cDHoNtcvrw=8szTmbV1PM834U3hDSY57ow@mail.gmail.com
Lists: pgsql-hackers

On 7 August 2013 23:40, Greg Stark <stark(at)mit(dot)edu> wrote:
> Did you report information about the system affected? What filesystem is it
> on? If it's ext4 does it have extents enabled?

Yes, ext4. It's using whatever the default options are, but running
lsattr on my data dir shows that extents are being used.
--
Thom


From: Thom Brown <thom(at)linux(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-08 00:40:24
Message-ID: CAA-aLv4MiCiNc1m9JcPY80pZmez-jZ5BkrJshC+vAfhaRS4Nug@mail.gmail.com
Lists: pgsql-hackers

On 8 August 2013 00:04, Thom Brown <thom(at)linux(dot)com> wrote:
> On 7 August 2013 23:40, Greg Stark <stark(at)mit(dot)edu> wrote:
>> Did you report information about the system affected? What filesystem is it
>> on? If it's ext4 does it have extents enabled?
>
> Yes, ext4. It's using whatever the default options are, but running
> lsattr on my data dir shows that extents are being used.

I've just created an ext2 partition (so no extents), and did some
quick benchmarks of those 2 commits. This is on the original system
(not my laptop), with the same config settings as before. All tests
initialised with -s 20 and tested with -j 80 -c 80.

30 seconds:

8800d8: 1327.983018 / 1317.848514
269e78: 1388.009819 / 1285.929100

5 minutes:

8800d8: 1744.446592 / 1729.602706
269e78: 1677.561277 / 1677.967873

10 minutes:

8800d8: 1753.147609 / 1806.600335
269e78: 1815.862842 / 1804.906769

So this appears to be okay.

--
Thom


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.4 regression
Date: 2013-08-08 00:58:06
Message-ID: 5202ED1E.7040807@agliodbs.com
Lists: pgsql-hackers

On 08/07/2013 05:40 PM, Thom Brown wrote:
> On 8 August 2013 00:04, Thom Brown <thom(at)linux(dot)com> wrote:
>> On 7 August 2013 23:40, Greg Stark <stark(at)mit(dot)edu> wrote:
>>> Did you report information about the system affected? What filesystem is it
>>> on? If it's ext4 does it have extents enabled?
>>
>> Yes, ext4. It's using whatever the default options are, but running
>> lsattr on my data dir shows that extents are being used.

Would it be worth testing XFS? If so, any particular options I should use?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Thom Brown <thom(at)linux(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-08 03:05:12
Message-ID: 20130808030512.GC6350@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-08-07 20:23:55 +0100, Thom Brown wrote:
> >>> 269e78 was the commit immediately after 8800d8, so it appears that
> >>> introduced the regression.
> >>>
> >>> "Use posix_fallocate() for new WAL files, where available."
> >>
> >> This is curious. Could you either run a longer test before/after the
> >> commit or reduce checkpoint_timeout to something like 3min?
> >
> > Okay, I'll rerun the test for both those commits at 1 hour each with
> > checkpoint_timeout set at 3mins, but with all other configuration
> > settings the same
>
> Results
> (checkpoint_timeout = 3min)
>
> pgbench -j 80 -c 80 -T 3600
>
> 269e78: 606.268013
> 8800d8: 779.583129

Ok, so the performance difference is lower. But it seems to be still too
high to be explainable by performance differences during
initialization/fallocate.
In a very quick test, I don't see the same on my laptop. I am currently
travelling and I don't have convenient access to anything else.

Could you:
- run filefrag on a couple of WAL segments of both clusters after the
test and post the results here?
- enable log_checkpoints, post the lines spat out together with the results
- run pgbench without reinitializing the cluster/pgbench tables
in between?

Basically, I'd like to know whether it's the initialization that's slow
(measurable because we're regularly creating new files because of a too
low checkpoint_segments) or whether it's the actual writes.

> Jon, here are the test results you asked for:
>
> $ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
> method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
> method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
> method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
> method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
> method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
> method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
> method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s

Ok, this seems to indicate that fallocate works nicely. Jon, wasn't
there a version of the test that rewrote the file afterwards?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-08 04:55:22
Message-ID: CAKuK5J3amFQijNqFNR+QDuMjuHweLT2YSeuW5VMQh+jgbwUbHg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 7, 2013 at 10:05 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-08-07 20:23:55 +0100, Thom Brown wrote:
>> >>> 269e78 was the commit immediately after 8800d8, so it appears that
>> >>> introduced the regression.
>> >>>
>> >>> "Use posix_fallocate() for new WAL files, where available."
>> >>
>> >> This is curious. Could you either run a longer test before/after the
>> >> commit or reduce checkpoint_timeout to something like 3min?
>> >
>> > Okay, I'll rerun the test for both those commits at 1 hour each with
>> > checkpoint_timeout set at 3mins, but with all other configuration
>> > settings the same
>>
>> Results
>> (checkpoint_timeout = 3min)
>>
>> pgbench -j 80 -c 80 -T 3600
>>
>> 269e78: 606.268013
>> 8800d8: 779.583129
>
> Ok, so the performance difference is lower. But it seems to be still too
> high to be explainable by performance differences during
> initialization/fallocate.
> In a very quick test, I don't see the same on my laptop. I am currently
> travelling and I don't have convenient access to anything else.
>
> Could you:
> - run filefrag on a couple of wal segments of both clusters after the
> test and post the results here?

For me, there is no meaningful difference, but I have a relatively
fresh filesystem (ext4) with lots of free space.

> - enable log_checkpoints, post the lines spat out together with the results
> - run pgbench without reinitializing the cluster/pgbench tables
> inbetween?

When I do this (as I note below), the performance difference (for me)
disappears.

> Basically, I'd like to know whether it's the initialization that's slow
> (measurable because we're regularly creating new files because of a too
> low checkpoint_segments) or whether it's the actual writes.

What I've found so far is very confusing.
I start by using initdb to initialize a temporary cluster, copy in a
postgresql.conf (with the modifications from Thom Brown tweaked for my
small laptop), start the cluster, create a test database, initialize
it with pgbench, and then run. I'm also only running for two minutes
at this time.

Every time I test, the non-fallocate version is faster. I modified
xlog.c to use posix_fallocate (or not) based on an environment
variable.
Once the WAL files have been rewritten at least once, it doesn't
seem to matter which method is used to allocate them (although
posix_fallocate seems to have a slight edge). I even went so far as to
modify the code to posix_fallocate the file *and then zero it out
anyway*, and it was almost as fast as zeroing alone, and faster than
using posix_fallocate alone.

>> Jon, here are the test results you asked for:
>>
>> $ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
>> method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
>> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
>> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
>> method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
>> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
>> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
>> method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
>> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
>> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
>> method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
>> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
>> method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
>> method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
>> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
>> method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s
>
> Ok, this seems to indicate that fallocate works nicely. Jon, wasn't
> there a version of the test that rewrote the file afterwards?

Yes. If you use a different number besides '1' as the third argument
in the command line above, it will rewrite the file that many times.

--
Jon


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-08 06:18:44
Message-ID: 20130808061844.GB14729@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-08-07 23:55:22 -0500, Jon Nelson wrote:
> > - enable log_checkpoints, post the lines spat out together with the results
> > - run pgbench without reinitializing the cluster/pgbench tables
> > inbetween?
>
> When I do this (as I note below), the performance difference (for me)
> disappears.

Ok. In this case the files have been written to and fsync()ed before the
test is run.

> > Basically, I'd like to know whether it's the initialization that's slow
> > (measurable because we're regularly creating new files because of a too
> > low checkpoint_segments) or whether it's the actual writes.
>
> What I've found so far is very confusing.
> I start by using initdb to initialize a temporary cluster, copy in a
> postgresql.conf (with the modifications from Thom Brown tweaked for my
> small laptop), start the cluster, create a test database, initialize
> it with pgbench, and then run. I'm also only running for two minutes
> at this time.

Hm. In a two-minute run there won't be any checkpoints...

> Every time I test, the non-fallocate version is faster. I modified
> xlog.c to use posix_fallocate (or not) based on an environment
> variable.
> Once the WAL files have been rewritten at least once, it doesn't
> seem to matter which method is used to allocate them (although
> posix_fallocate seems to have a slight edge). I even went so far as to
> modify the code to posix_fallocate the file *and then zero it out
> anyway*, and it was almost as fast as zeroing alone, and faster than
> using posix_fallocate alone.

Ok, so it seems there are performance issues with writing to the
allocated data even though the allocation has supposedly already been
made. Access by several backends may play a role here.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-08 15:28:55
Message-ID: CAKuK5J3-bj3ugphNW1ENXVEeiprEAF2EfY6D3jd3TpRr1BcvwA@mail.gmail.com
Lists: pgsql-hackers

A follow-up.
I found this thread that seems to explain some things:

http://comments.gmane.org/gmane.comp.file-systems.ext4/33024

Short version: if we are writing into the "middle" of the
newly-fallocated file on ext4 (with extents) the extent tree can
fragment badly, causing poor performance due to extent merging.

I also ran some tests with xfs, and the results were even stranger:

xfs performed very slightly better with fallocate than without:
xfs (fallocate) 216 tps vs. (no fallocate) 206 tps
ext4 (fallocate) 134 tps vs. (no fallocate) 605 tps

I made an ext4 filesystem without extents using the same block device
as above - a real, single hard drive with nothing else on it.
On this filesystem, the performance remained the same (or perhaps
improved very slightly): 633 tps (with fallocate) vs. 607 tps.

Using the following additions to postgresql.conf:

max_connections = 500
shared_buffers = 1GB
effective_cache_size = 1GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
commit_delay = 50
commit_siblings = 15
checkpoint_segments = 256
log_checkpoints = on

and this partial script for testing:

pg_ctl initdb -D tt -w
cp ~/postgresql.conf tt
pg_ctl -D tt -w start
createdb -p 54320 pgb
pgbench -s 20 -p 54320 -d pgb -i
pgbench -j 80 -c 80 -T 120 -p 54320 pgb
pg_ctl -D tt -w stop

--
Jon


From: Thom Brown <thom(at)linux(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-08 16:31:36
Message-ID: CAA-aLv5pG738sWfBEa5+Hh_c23UU50Rjpzi+AbXwOXnaf3__ZQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 8 August 2013 04:05, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-08-07 20:23:55 +0100, Thom Brown wrote:
>> >>> 269e78 was the commit immediately after 8800d8, so it appears that
>> >>> introduced the regression.
>> >>>
>> >>> "Use posix_fallocate() for new WAL files, where available."
>> >>
>> >> This is curious. Could you either run a longer test before/after the
>> >> commit or reduce checkpoint_timeout to something like 3min?
>> >
>> > Okay, I'll rerun the test for both those commits at 1 hour each with
>> > checkpoint_timeout set at 3mins, but with all other configuration
>> > settings the same
>>
>> Results
>> (checkpoint_timeout = 3min)
>>
>> pgbench -j 80 -c 80 -T 3600
>>
>> 269e78: 606.268013
>> 8800d8: 779.583129
>
> Ok, so the performance difference is lower. But, it seems to be still too
> high to be explainable by performance differences during
> initialization/fallocate.
> In a very quick test, I don't see the same on my laptop. I am currently
> travelling and I don't have convenient access to anything else.
>
> Could you:
> - run filefrag on a couple of wal segments of both clusters after the
> test and post the results here?

I ran a benchmark for 10 minutes. I ran filefrag against all files in
pg_xlog, and they all say "1 extent found".

> - enable log_checkpoints, post the lines spat out together with the
>   results

LOG: database system was shut down at 2013-08-08 16:38:28 BST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.041 s, sync=0.034 s,
total=0.132 s; sync files=3, longest=0.017 s, average=0.011 s
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.015 s, sync=0.000 s,
total=0.071 s; sync files=0, longest=0.000 s, average=0.000 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 39137 buffers (7.5%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=143.757 s, sync=5.659 s,
total=149.620 s; sync files=44, longest=3.814 s, average=0.128 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 21141 buffers (4.0%); 0 transaction log
file(s) added, 0 removed, 15 recycled; write=143.082 s, sync=13.144 s,
total=156.421 s; sync files=13, longest=10.336 s, average=1.011 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 27702 buffers (5.3%); 0 transaction log
file(s) added, 0 removed, 15 recycled; write=143.739 s, sync=4.470 s,
total=148.337 s; sync files=12, longest=2.545 s, average=0.372 s
LOG: checkpoint starting: time

> - run pgbench without reinitializing the cluster/pgbench tables
> inbetween?

Set 1 (10 mins each):
First run after initialisation: 289.410795
Re-run without re-initialisation: 519.694545

These additional lines were added to the log for the re-run:

LOG: checkpoint complete: wrote 14793 buffers (2.8%); 0 transaction log
file(s) added, 0 removed, 16 recycled; write=143.299 s, sync=2.318 s,
total=145.691 s; sync files=13, longest=1.910 s, average=0.178 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 37236 buffers (7.1%); 0 transaction log
file(s) added, 0 removed, 13 recycled; write=143.410 s, sync=7.363 s,
total=150.981 s; sync files=16, longest=4.871 s, average=0.460 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 28271 buffers (5.4%); 0 transaction log
file(s) added, 0 removed, 20 recycled; write=143.597 s, sync=11.761 s,
total=155.472 s; sync files=13, longest=8.494 s, average=0.904 s
LOG: checkpoint starting: time

Set 2 (10 mins each):
First run after initialisation: 295.925985
Re-run without re-initialisation: 503.239516

Here's the log for set 2 after the first run:

LOG: database system was shut down at 2013-08-08 17:06:41 BST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.041 s, sync=0.034 s,
total=0.153 s; sync files=3, longest=0.017 s, average=0.011 s
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.014 s, sync=0.000 s,
total=0.057 s; sync files=0, longest=0.000 s, average=0.000 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 39148 buffers (7.5%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=143.322 s, sync=5.519 s,
total=149.256 s; sync files=44, longest=3.691 s, average=0.125 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 21447 buffers (4.1%); 0 transaction log
file(s) added, 0 removed, 15 recycled; write=143.161 s, sync=13.459 s,
total=156.843 s; sync files=13, longest=10.915 s, average=1.035 s
LOG: checkpoint starting: time

After the re-run in set 2, these lines were added:

LOG: checkpoint complete: wrote 26609 buffers (5.1%); 0 transaction log
file(s) added, 0 removed, 15 recycled; write=143.253 s, sync=6.588 s,
total=150.036 s; sync files=15, longest=2.606 s, average=0.439 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 16931 buffers (3.2%); 0 transaction log
file(s) added, 0 removed, 16 recycled; write=143.171 s, sync=16.489 s,
total=160.014 s; sync files=14, longest=12.660 s, average=1.177 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 35345 buffers (6.7%); 0 transaction log
file(s) added, 0 removed, 14 recycled; write=143.170 s, sync=10.596 s,
total=153.977 s; sync files=16, longest=7.996 s, average=0.662 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 27907 buffers (5.3%); 0 transaction log
file(s) added, 0 removed, 19 recycled; write=143.184 s, sync=12.562 s,
total=155.969 s; sync files=14, longest=7.844 s, average=0.897 s
LOG: checkpoint starting: time

I rechecked WAL file fragmentation after each run, and it always says "1
extent found".

--
Thom


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Subject: Re: 9.4 regression
Date: 2013-08-08 19:50:33
Message-ID: 5203F689.1010204@2ndQuadrant.com
Lists: pgsql-hackers

On 08/08/2013 05:28 PM, Jon Nelson wrote:
> A follow-up.
> I found this thread that seems to explain some things:
>
> http://comments.gmane.org/gmane.comp.file-systems.ext4/33024
>
> Short version: if we are writing into the "middle" of the
> newly-fallocated file on ext4 (with extents) the extent tree can
> fragment badly, causing poor performance due to extent merging.
>
> I also ran some tests with xfs, and the results were even stranger:
>
> xfs performed very slightly better with fallocate than without:
> xfs (fallocate) 216 tps vs. (no fallocate) 206 tps
> ext4 (fallocate) 134 tps vs. (no fallocate) 605 tps
>
> I made an ext4 filesystem without extents using the same block device
> as above - a real, single hard drive with nothing else on it.
> On this filesystem, the performance remained the same (or perhaps
> improved very slightly): 633 tps (with fallocate) vs. 607 tps.
>
>
> Using the following additions to postgresql.conf:
>
> max_connections = 500
> shared_buffers = 1GB
> effective_cache_size = 1GB
> random_page_cost = 2.0
> cpu_tuple_cost = 0.03
> wal_buffers = 32MB
> work_mem = 100MB
> maintenance_work_mem = 512MB
> commit_delay = 50
> commit_siblings = 15
> checkpoint_segments = 256
> log_checkpoints = on
>
> and this partial script for testing:
>
> pg_ctl initdb -D tt -w
> cp ~/postgresql.conf tt
> pg_ctl -D tt -w start
> createdb -p 54320 pgb
> pgbench -s 20 -p 54320 -d pgb -i
Just an idea - can you check if using a fillfactor different from 100
changes anything?

pgbench -s 20 -p 54320 -d pgb -F 90 -i

> pgbench -j 80 -c 80 -T 120 -p 54320 pgb
> pg_ctl -D tt -w stop

That is, does extending tables and indexes to add updated tuples play
any role here?

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 21:12:06
Message-ID: CAKuK5J0s9ZikBxUTdLW4zF=yXYrBzk0WwiM=-fX_xJ3Hes5jrg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com> wrote:
> On 08/08/2013 05:28 PM, Jon Nelson wrote:
...

> Just an idea - can you check if using a fillfactor different from 100
> changes anything
>
> pgbench -s 20 -p 54320 -d pgb -F 90 -i
>
>
>> pgbench -j 80 -c 80 -T 120 -p 54320 pgb
>> pg_ctl -D tt -w stop
>
> That is, does extending tables and indexes to add updated tuples play
> any role here

I gave it a go - it didn't make any difference at all.
At this point I'm convinced that the issue is a pathological case in
ext4. The performance impact disappears as soon as the unwritten
extent(s) are written to with real data. Thus, even though allocating
files with posix_fallocate is - frequently - orders of magnitude
quicker than doing it with write(2), the subsequent re-write can be
more expensive. At least, that's what I'm gathering from the various
threads. Why this issue didn't crop up in earlier testing and why I
can't seem to make test_fallocate do it (even when I modify
test_fallocate to write to the newly-allocated file in a mostly-random
fashion) has me baffled. Should this feature be reconsidered?

--
Jon


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 21:24:26
Message-ID: 20130808212426.GB23572@momjian.us
Lists: pgsql-hackers

On Thu, Aug 8, 2013 at 04:12:06PM -0500, Jon Nelson wrote:
> On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com> wrote:
> > On 08/08/2013 05:28 PM, Jon Nelson wrote:
> ...
>
> > Just an idea - can you check if using a fillfactor different from 100
> > changes anything
> >
> > pgbench -s 20 -p 54320 -d pgb -F 90 -i
> >
> >
> >> pgbench -j 80 -c 80 -T 120 -p 54320 pgb
> >> pg_ctl -D tt -w stop
> >
> > That is, does extending tables and indexes to add updated tuples play
> > any role here
>
> I gave it a go - it didn't make any difference at all.
> At this point I'm convinced that the issue is a pathological case in
> ext4. The performance impact disappears as soon as the unwritten
> extent(s) are written to with real data. Thus, even though allocating
> files with posix_fallocate is - frequently - orders of magnitude
> quicker than doing it with write(2), the subsequent re-write can be
> more expensive. At least, that's what I'm gathering from the various
> threads. Why this issue didn't crop up in earlier testing and why I
> can't seem to make test_fallocate do it (even when I modify
> test_fallocate to write to the newly-allocated file in a mostly-random
> fashion) has me baffled. Should this feature be reconsidered?

How much slower would it be if we wrote it with zeros after
posix_fallocate() --- that would still give us single extents. Has
anyone tested to see if the write without test_fallocate() still gives
us one extent?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 21:33:05
Message-ID: CAKuK5J2PqwyJXgNN9P0GKyBESH67pPLGacNAE8JnoodWWyr6SQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 8, 2013 at 4:24 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Thu, Aug 8, 2013 at 04:12:06PM -0500, Jon Nelson wrote:
>> On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com> wrote:
>> > On 08/08/2013 05:28 PM, Jon Nelson wrote:
>> ...
>>
>> > Just an idea - can you check if using a fillfactor different from 100
>> > changes anything
>> >
>> > pgbench -s 20 -p 54320 -d pgb -F 90 -i
>> >
>> >
>> >> pgbench -j 80 -c 80 -T 120 -p 54320 pgb
>> >> pg_ctl -D tt -w stop
>> >
>> > That is, does extending tables and indexes to add updated tuples play
>> > any role here
>>
>> I gave it a go - it didn't make any difference at all.
>> At this point I'm convinced that the issue is a pathological case in
>> ext4. The performance impact disappears as soon as the unwritten
>> extent(s) are written to with real data. Thus, even though allocating
>> files with posix_fallocate is - frequently - orders of magnitude
>> quicker than doing it with write(2), the subsequent re-write can be
>> more expensive. At least, that's what I'm gathering from the various
>> threads. Why this issue didn't crop up in earlier testing and why I
>> can't seem to make test_fallocate do it (even when I modify
>> test_fallocate to write to the newly-allocated file in a mostly-random
>> fashion) has me baffled. Should this feature be reconsidered?
>
> How much slower would it be if we wrote it with zeros after
> posix_fallocate() --- that would still give us single extents. Has
> anyone tested to see if the write without test_fallocate() still gives
> us one extent?

Actually, I did that test - I found that there was no longer a
performance drop, and possibly a slight performance benefit.
I couldn't quite parse your last sentence; I assume you mean "does
skipping the posix_fallocate altogether still produce one extent?"
Sadly, I can't really test that properly, as the ext4 filesystem I'm
testing on is quite fresh and has lots and lots of free space.

--
Jon


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 21:42:02
Message-ID: 8037.1375998122@sss.pgh.pa.us
Lists: pgsql-hackers

Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
> At this point I'm convinced that the issue is a pathological case in
> ext4. The performance impact disappears as soon as the unwritten
> extent(s) are written to with real data. Thus, even though allocating
> files with posix_fallocate is - frequently - orders of magnitude
> quicker than doing it with write(2), the subsequent re-write can be
> more expensive. At least, that's what I'm gathering from the various
> threads. Why this issue didn't crop up in earlier testing and why I
> can't seem to make test_fallocate do it (even when I modify
> test_fallocate to write to the newly-allocated file in a mostly-random
> fashion) has me baffled.

Does your test program use all the same writing options that the real
WAL writes do (like O_DIRECT)?

> Should this feature be reconsidered?

Well, ext4 isn't the whole world, but it's sure a big part of it.
The real point though is that obviously we didn't do enough performance
testing, so we'd better do more before deciding what to do.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 21:46:18
Message-ID: 20130808214618.GC23572@momjian.us
Lists: pgsql-hackers

On Thu, Aug 8, 2013 at 04:33:05PM -0500, Jon Nelson wrote:
> > How much slower would it be if we wrote it with zeros after
> > posix_fallocate() --- that would still give us single extents. Has
> > anyone tested to see if the write without test_fallocate() still gives
> > us one extent?
>
> Actually, I did that test - I found that there was no longer a
> performance drop and possibly a slight performance benefit.
> I couldn't quite parse your last sentence, I assume you mean "does
> skipping the posix_fallocate altogether still produce one extent?"
> Sadly, I can't really test that properly as the ext4 filesystem I'm
> testing on is quite fresh and has lots and lots of free space.

Yes, that was the question. I meant to say:

Has anyone tested to see if the write without posix_fallocate() still
creates one extent per 16MB file, meaning the posix_fallocate()
is unnecessary?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 21:46:46
Message-ID: CAKuK5J1ViL=JDHXVP0TO8ACF-HD+vgWf=d_or1UZbAFu-gK54A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
>> At this point I'm convinced that the issue is a pathological case in
>> ext4. The performance impact disappears as soon as the unwritten
>> extent(s) are written to with real data. Thus, even though allocating
>> files with posix_fallocate is - frequently - orders of magnitude
>> quicker than doing it with write(2), the subsequent re-write can be
>> more expensive. At least, that's what I'm gathering from the various
>> threads. Why this issue didn't crop up in earlier testing and why I
>> can't seem to make test_fallocate do it (even when I modify
>> test_fallocate to write to the newly-allocated file in a mostly-random
>> fashion) has me baffled.
>
> Does your test program use all the same writing options that the real
> WAL writes do (like O_DIRECT)?

I believe so.

From xlog.c:

/* do not use get_sync_bit() here --- want to fsync only at end of fill */
fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
                   S_IRUSR | S_IWUSR);

and from the test program:

fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

PG_BINARY expands to 0 on non-Windows. I also tried using O_WRONLY in
xlog.c without change.

--
Jon


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 22:23:49
Message-ID: 8351.1376000629@sss.pgh.pa.us
Lists: pgsql-hackers

Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
> On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Does your test program use all the same writing options that the real
>> WAL writes do (like O_DIRECT)?

> I believe so.

>> From xlog.c:

> /* do not use get_sync_bit() here --- want to fsync only at end of fill */
> fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
> S_IRUSR | S_IWUSR);

> and from the test program:

> fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

Maybe I misunderstood, but I thought the performance complaint had to do
with the actual writes of WAL data, not with the pre-fill. That is, you
should not just be measuring how long the pre-fill takes, but what is the
speed of real writes to the file later on (which will be using
get_sync_bit, for various values of the sync settings).

regards, tom lane


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 22:25:42
Message-ID: CAKuK5J3jOZF6_xuMz0e7iYCpgSh00tm6jO-qwXjSodXP0RgECg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 8, 2013 at 5:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
>> On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Does your test program use all the same writing options that the real
>>> WAL writes do (like O_DIRECT)?
>
>> I believe so.
>
>>> From xlog.c:
>
>> /* do not use get_sync_bit() here --- want to fsync only at end of fill */
>> fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
>> S_IRUSR | S_IWUSR);
>
>> and from the test program:
>
>> fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
>
> Maybe I misunderstood, but I thought the performance complaint had to do
> with the actual writes of WAL data, not with the pre-fill. That is, you
> should not just be measuring how long the pre-fill takes, but what is the
> speed of real writes to the file later on (which will be using
> get_sync_bit, for various values of the sync settings).

Ah, no, I misunderstood your question.
I'm fairly certain the test program doesn't open files with any
sort of sync bit set.
I'll have to see what PostgreSQL is using, exactly, and get back to you.

--
Jon


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-08 22:44:05
Message-ID: CAKuK5J0C5KAw0=uGiZN3nOL+pf6h4PkJ9n+J3RCzifCwObZxGw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 8, 2013 at 5:25 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> On Thu, Aug 8, 2013 at 5:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
>>> On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> Does your test program use all the same writing options that the real
>>>> WAL writes do (like O_DIRECT)?
>>
>>> I believe so.
>>
>>>> From xlog.c:
>>
>>> /* do not use get_sync_bit() here --- want to fsync only at end of fill */
>>> fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
>>> S_IRUSR | S_IWUSR);
>>
>>> and from the test program:
>>
>>> fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
>>
>> Maybe I misunderstood, but I thought the performance complaint had to do
>> with the actual writes of WAL data, not with the pre-fill. That is, you
>> should not just be measuring how long the pre-fill takes, but what is the
>> speed of real writes to the file later on (which will be using
>> get_sync_bit, for various values of the sync settings).
>
> Ah, no, I misunderstood your question.
> I'm fairly certain the test program doesn't open up files with any
> sort of sync. bit set.
> I'll have to see what PostgreSQL is using, exactly, and get back to you.

My reading of xlog.c seems to indicate that if the sync mode is
fdatasync, then the file is not opened with any special sync bit, but
that an fdatasync occurs at the end. That is also what the test
program does.

--
Jon


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-09 02:27:31
Message-ID: 20130809022731.GL14729@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-08-08 16:12:06 -0500, Jon Nelson wrote:
> On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com> wrote:
> > On 08/08/2013 05:28 PM, Jon Nelson wrote:
> ...
>
> > Just an idea - can you check if using a fillfactor different form 100
> > changes anything
> >
> > pgbench -s 20 -p 54320 -d pgb -F 90 -i
> >
> >
> >> pgbench -j 80 -c 80 -T 120 -p 54320 pgb
> >> pg_ctl -D tt -w stop
> >
> > That is, does extending tables and indexes to add updated tuples play
> > any role here
>
> I gave it a go - it didn't make any difference at all.
> At this point I'm convinced that the issue is a pathological case in
> ext4. The performance impact disappears as soon as the unwritten
> extent(s) are written to with real data. Thus, even though allocating
> files with posix_fallocate is - frequently - orders of magnitude
> quicker than doing it with write(2), the subsequent re-write can be
> more expensive. At least, that's what I'm gathering from the various
> threads.

> Why this issue didn't crop up in earlier testing and why I
> can't seem to make test_fallocate do it (even when I modify
> test_fallocate to write to the newly-allocated file in a mostly-random
> fashion) has me baffled.

It might be kernel version specific and concurrency seems to play a
role. If you reproduce the problem, could you run a "perf record -ga" to
collect a systemwide profile?

There are some more things to test:
- Is the slowdown dependent on the scale? I.e. is it visible with -j 1 -c 1?
- Does it also occur in synchronous_commit=off configurations? Those
don't fdatasync() from so many backends, which might play a role.
- Do bulk loads see it? E.g. the initial pgbench load?

> Should this feature be reconsidered?

Let's try to get to the ground of this for a bit more. From the outset
we've said that we're using this as a testbed for extending the
fallocate() usage.
It's not like 9.4 beta1 is going to be released anytime soon.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-09 03:58:42
Message-ID: CAKuK5J0UUQh25HcK6cGgXi2kCdFjfQSWvhOu9th-FoVqJhSaWg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 8, 2013 at 9:27 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-08-08 16:12:06 -0500, Jon Nelson wrote:
...

>> At this point I'm convinced that the issue is a pathological case in
>> ext4. The performance impact disappears as soon as the unwritten
>> extent(s) are written to with real data. Thus, even though allocating
>> files with posix_fallocate is - frequently - orders of magnitude
>> quicker than doing it with write(2), the subsequent re-write can be
>> more expensive. At least, that's what I'm gathering from the various
>> threads.
>
>
>> Why this issue didn't crop up in earlier testing and why I
>> can't seem to make test_fallocate do it (even when I modify
>> test_fallocate to write to the newly-allocated file in a mostly-random
>> fashion) has me baffled.
>
> It might be kernel version specific and concurrency seems to play a
> role. If you reproduce the problem, could you run a "perf record -ga" to
> collect a systemwide profile?

Finally, an excuse to learn how to use 'perf'! I'll try to provide
that info when I am able.

> There's some more things to test:
> - is the slowdown dependent on the scale? I.e is it visible with -j 1 -c
> 1?

scale=1 (-j 1 -c 1):
with fallocate: 685 tps
without: 727

scale=20
with fallocate: 129
without: 402

scale=40
with fallocate: 163
without: 511

> - Does it also occur in synchronous_commit=off configurations? Those
> don't fdatasync() from so many backends, that might play a role.

With synchronous_commit=off, the performance is vastly improved.
Interestingly, the fallocate case is (immaterially) faster than the
non-fallocate case: 3766tps vs 3700tps.

I tried a few other wal_sync_methods besides the default of fdatasync,
all with scale=80.

fsync:
198 tps (with fallocate) vs 187.

open_sync:
195 tps (with fallocate) vs. 192

> - do bulkloads see it? E.g. the initial pgbench load?

time pgbench -s 200 -p 54320 -d pgb -i

with fallocate: 2m47s
without: 2m50s

Hopefully the above is useful.

--
Jon


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-08-09 06:20:02
Message-ID: 20130809062002.GN14729@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-08-08 22:58:42 -0500, Jon Nelson wrote:
> On Thu, Aug 8, 2013 at 9:27 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2013-08-08 16:12:06 -0500, Jon Nelson wrote:
> ...
>
> >> At this point I'm convinced that the issue is a pathological case in
> >> ext4. The performance impact disappears as soon as the unwritten
> >> extent(s) are written to with real data. Thus, even though allocating
> >> files with posix_fallocate is - frequently - orders of magnitude
> >> quicker than doing it with write(2), the subsequent re-write can be
> >> more expensive. At least, that's what I'm gathering from the various
> >> threads.
> >
> >
> >> Why this issue didn't crop up in earlier testing and why I
> >> can't seem to make test_fallocate do it (even when I modify
> >> test_fallocate to write to the newly-allocated file in a mostly-random
> >> fashion) has me baffled.
> >
> > It might be kernel version specific and concurrency seems to play a
> > role. If you reproduce the problem, could you run a "perf record -ga" to
> > collect a systemwide profile?
>
> Finally, an excuse to learn how to use 'perf'! I'll try to provide
> that info when I am able.

Running perf record as above during the first minute and then doing a
perf report > somefile (redirected to get the noninteractive version)
should get you started.

> > There's some more things to test:
> > - is the slowdown dependent on the scale? I.e is it visible with -j 1 -c
> > 1?
>
> scale=1 (-j 1 -c 1):
> with fallocate: 685 tps
> without: 727
>
> scale=20
> with fallocate: 129
> without: 402
>
> scale=40
> with fallocate: 163
> without: 511

Ok, so there's some clear correlation with the amount of writers.

> > - Does it also occur in synchronous_commit=off configurations? Those
> > don't fdatasync() from so many backends, that might play a role.
>
> With synchronous_commit=off, the performance is vastly improved.
> Interestingly, the fallocate case is (immaterially) faster than the
> non-fallocate case: 3766tps vs 3700tps.

That's interesting because in the synchronous_commit=off case most of
the writing and syncing should be done by the wal writer. So there's
another hint that there's some scalability issue at play,
presumably in the kernel.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-14 20:36:38
Message-ID: 20130814203638.GA7124@momjian.us
Lists: pgsql-hackers

On Wed, Aug 7, 2013 at 07:04:43PM +0100, Thom Brown wrote:
> On 7 August 2013 18:49, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> > On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown <thom(at)linux(dot)com> wrote:
> >> for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
> >> method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
> >> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
> >> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
> >> method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
> >> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
> >> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
> >> method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
> >> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
> >> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
> >> method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
> >> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
> >> method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
> >> method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
> >> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
> >> method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s
> >
> > OK. That's interesting, and a good data point.
> >
> > One thing you could try is manually disabling the use of posix_fallocate in 269e78.
> > After running ./configure --stuff-here
> > edit src/include/pg_config.h and comment out the following line (on or
> > around line 374)
> > #define HAVE_POSIX_FALLOCATE 1
> >
> > *then* build postgresql and see if the performance hit is still there.
>
> Okay, done that. The TPS increases again:
>
> 2308.807568 / 2554.264572 / 2563.190204
>
> And I did run ./configure... before removing the line, and built it
> after the change.

Where are we on this issue?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-15 17:08:57
Message-ID: CAKuK5J3R-oBh+9f23Ko-0-gt5Zi1REgg7ng-awQuUsgiY2B7GQ@mail.gmail.com
Lists: pgsql-hackers

> Where are we on this issue?

I've been able to replicate it pretty easily with PostgreSQL and
continue to look into it. I've contacted Theodore Ts'o and have gotten
some useful information, however I'm unable to replicate the behavior
with the test program (even one that's been modified). What I've
learned is:

- XLogWrite appears to take approx. 2.5 times longer when writing to a
file allocated with posix_fallocate, but only the first time the file
contents are overwritten. This is partially explained by how ext4
handles extents and uninitialized data, but 2.5x is MUCH more
expensive than anticipated or expected here.
- Writing zeroes to a file allocated with posix_fallocate (essentially
adding a posix_fallocate step before the usual write-zeroes-in-a-loop
approach) not only doesn't seem to hurt performance, it seems to help
or at least have parity, *and* the space is guaranteed to exist on
disk. At the very least that seems useful.

--
Jon


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-16 20:57:25
Message-ID: 20130816205725.GE8972@momjian.us
Lists: pgsql-hackers

On Thu, Aug 15, 2013 at 12:08:57PM -0500, Jon Nelson wrote:
> > Where are we on this issue?
>
> I've been able to replicate it pretty easily with PostgreSQL and
> continue to look into it. I've contacted Theodore Ts'o and have gotten
> some useful information, however I'm unable to replicate the behavior
> with the test program (even one that's been modified). What I've
> learned is:
>
> - XLogWrite appears to take approx. 2.5 times longer when writing to a
> file allocated with posix_fallocate, but only the first time the file
> contents are overwritten. This is partially explained by how ext4
> handles extents and uninitialized data, but 2.5x is MUCH more
> expensive than anticipated or expected here.
> - Writing zeroes to a file allocated with posix_fallocate (essentially
> adding a posix_fallocate step before the usual write-zeroes-in-a-loop
> approach) not only doesn't seem to hurt performance, it seems to help
> or at least have parity, *and* the space is guaranteed to exist on
> disk. At the very least that seems useful.

Is it time to revert this patch until we know more?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-19 19:40:07
Message-ID: CAKuK5J0cG43gy2xOD5=GOhdBKsJB-0XzwOPZfYYZeSG3pWnHOA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 16, 2013 at 3:57 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Thu, Aug 15, 2013 at 12:08:57PM -0500, Jon Nelson wrote:
>> > Where are we on this issue?
>>
>> I've been able to replicate it pretty easily with PostgreSQL and
>> continue to look into it. I've contacted Theodore Ts'o and have gotten
>> some useful information, however I'm unable to replicate the behavior
>> with the test program (even one that's been modified). What I've
>> learned is:
>>
>> - XLogWrite appears to take approx. 2.5 times longer when writing to a
>> file allocated with posix_fallocate, but only the first time the file
>> contents are overwritten. This is partially explained by how ext4
>> handles extents and uninitialized data, but 2.5x is MUCH more
>> expensive than anticipated or expected here.
>> - Writing zeroes to a file allocated with posix_fallocate (essentially
>> adding a posix_fallocate step before the usual write-zeroes-in-a-loop
>> approach) not only doesn't seem to hurt performance, it seems to help
>> or at least have parity, *and* the space is guaranteed to exist on
>> disk. At the very least that seems useful.
>
> Is it time to revert this patch until we know more?

While I'm not qualified to say, my inclination is to say yes. It can
always be added back later. The only caveat there would be that -
perhaps - a small modification of the patch would be warranted.
Specifically, with posix_fallocate, I saw no undesirable behavior
when the (newly allocated) file was manually zeroed anyway. The only
advantages (that I can see) to doing it this way versus not using
posix_fallocate at all are (a) a potential reduction in the number of
extents and (b) the space is guaranteed to be on disk if
posix_fallocate succeeds. My reading of the patch is that even if
posix_fallocate fails due to out of space conditions, we will still
try to create the file by writing out zeroes, so perhaps the
out-of-disk-space scenario isn't all that useful anyway.

I'm awaiting more information from Theodore Ts'o, but I don't expect
things to materially change in the near future.

--
Jon


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-19 19:49:10
Message-ID: 20130819194910.GF26775@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-08-19 14:40:07 -0500, Jon Nelson wrote:
> On Fri, Aug 16, 2013 at 3:57 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > On Thu, Aug 15, 2013 at 12:08:57PM -0500, Jon Nelson wrote:
> >> > Where are we on this issue?
> >>
> >> I've been able to replicate it pretty easily with PostgreSQL and
> >> continue to look into it. I've contacted Theodore Ts'o and have gotten
> >> some useful information, however I'm unable to replicate the behavior
> >> with the test program (even one that's been modified). What I've
> >> learned is:
> >>
> >> - XLogWrite appears to take approx. 2.5 times longer when writing to a
> >> file allocated with posix_fallocate, but only the first time the file
> >> contents are overwritten. This is partially explained by how ext4
> >> handles extents and uninitialized data, but 2.5x is MUCH more
> >> expensive than anticipated or expected here.
> >> - Writing zeroes to a file allocated with posix_fallocate (essentially
> >> adding a posix_fallocate step before the usual write-zeroes-in-a-loop
> >> approach) not only doesn't seem to hurt performance, it seems to help
> >> or at least have parity, *and* the space is guaranteed to exist on
> >> disk. At the very least that seems useful.
> >
> > Is it time to revert this patch until we know more?
>
> While I'm not qualified to say, my inclination is to say yes. It can
> always be added back later. The only caveat there would be that -
> perhaps - a small modification of the patch would be warranted.
> Specifically, with posix_fallocate, I saw no undesirable behavior
> when the (newly allocated) file was manually zeroed anyway. The only
> advantages (that I can see) to doing it this way versus not using
> posix_fallocate at all is (a) a potential reduction in the number of
> extents

I vote for adapting the patch to additionally zero out the file via
write(). In your tests that seemed to perform at least as well as the
old method... It also has the advantage that we can use it a little bit
more as a testbed for possibly using it for heap extensions one day.
We're pretty early in the cycle, so I am not worried about this too much...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-20 01:27:57
Message-ID: 20130820012757.GT2706@tamriel.snowman.net
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> I vote for adapting the patch to additionally zero out the file via
> write(). In your tests that seemed to perform at least as good as the
> old method... It also has the advantage that we can use it a littlebit
> more as a testbed for possibly using it for heap extensions one day.
> We're pretty early in the cycle, so I am not worried about this too much...

I dunno, I'm pretty disappointed that this doesn't actually improve
things. Just following this casually, it looks like it might be some
kind of locking issue in the kernel that's causing it to be slower, or
at least some code path that isn't exercised terribly much and therefore
hasn't been given the love that it should have.

Definitely interested in what Ts'o says, but if we can't figure out why
it's slower *without* writing out the zeros, I'd say we punt on this
until Linux and the other OS folks improve the situation.

Thanks,

Stephen


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-08-21 01:44:33
Message-ID: 20130821014433.GA8193@momjian.us
Lists: pgsql-hackers

On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote:
> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> > I vote for adapting the patch to additionally zero out the file via
> > write(). In your tests that seemed to perform at least as good as the
> > old method... It also has the advantage that we can use it a littlebit
> > more as a testbed for possibly using it for heap extensions one day.
> > We're pretty early in the cycle, so I am not worried about this too much...
>
> I dunno, I'm pretty disappointed that this doesn't actually improve
> things. Just following this casually, it looks like it might be some
> kind of locking issue in the kernel that's causing it to be slower; or
> at least some code path that isn't exercise terribly much and therefore
> hasn't been given the love that it should.
>
> Definitely interested in what Ts'o says, but if we can't figure out why
> it's slower *without* writing out the zeros, I'd say we punt on this
> until Linux and the other OS folks improve the situation.

Agreed. Anyone with an affected kernel really can't be doing
performance tests right now, and that isn't good.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: 9.4 regression
Date: 2013-09-04 15:01:30
Message-ID: 1378306890.23979.18.camel@jdavis
Lists: pgsql-hackers

On Thu, 2013-08-08 at 17:42 -0400, Tom Lane wrote:
> Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
> > At this point I'm convinced that the issue is a pathological case in
> > ext4. The performance impact disappears as soon as the unwritten
> > extent(s) are written to with real data. Thus, even though allocating
> > files with posix_fallocate is - frequently - orders of magnitude
> > quicker than doing it with write(2), the subsequent re-write can be
> > more expensive. At least, that's what I'm gathering from the various
> > threads. Why this issue didn't crop up in earlier testing and why I
> > can't seem to make test_fallocate do it (even when I modify
> > test_fallocate to write to the newly-allocated file in a mostly-random
> > fashion) has me baffled.
>
> Does your test program use all the same writing options that the real
> WAL writes do (like O_DIRECT)?
>
> > Should this feature be reconsidered?
>
> Well, ext4 isn't the whole world, but it's sure a big part of it.
> The real point though is that obviously we didn't do enough performance
> testing, so we'd better do more before deciding what to do.

Greg Smith,

Do you have some thoughts about this? More tests we should run? We may
need to revert the patch.

Regards,
Jeff Davis


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-09-04 15:04:47
Message-ID: 20130904150447.GD11189@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-09-04 08:01:30 -0700, Jeff Davis wrote:
> On Thu, 2013-08-08 at 17:42 -0400, Tom Lane wrote:
> > Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
> > > At this point I'm convinced that the issue is a pathological case in
> > > ext4. The performance impact disappears as soon as the unwritten
> > > extent(s) are written to with real data. Thus, even though allocating
> > > files with posix_fallocate is - frequently - orders of magnitude
> > > quicker than doing it with write(2), the subsequent re-write can be
> > > more expensive. At least, that's what I'm gathering from the various
> > > threads. Why this issue didn't crop up in earlier testing and why I
> > > can't seem to make test_fallocate do it (even when I modify
> > > test_fallocate to write to the newly-allocated file in a mostly-random
> > > fashion) has me baffled.
> >
> > Does your test program use all the same writing options that the real
> > WAL writes do (like O_DIRECT)?
> >
> > > Should this feature be reconsidered?
> >
> > Well, ext4 isn't the whole world, but it's sure a big part of it.
> > The real point though is that obviously we didn't do enough performance
> > testing, so we'd better do more before deciding what to do.
>
> Greg Smith,
>
> Do you have some thoughts about this? More tests we should run? We may
> need to revert the patch.

I'd vote for adding zeroing *after* the fallocate() first. That's what's
suggested by kernel hackers and what several other large applications
do. It also looks like what we would have to do if we ever get to use
fallocate() for relation extension, where we would have actual
benefits from it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-09-04 15:15:37
Message-ID: 20130904151537.GF2706@tamriel.snowman.net
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> I'd vote for adding zeroing *after* the fallocate() first. That's what's
> suggested by kernel hackers and what several other large applications
> do. As it looks like it's what we would have to do if we ever get to use
> fallocate for relation extension where we would have actual benefits
> from it.

Does that actually end up doing anything different from what we were
doing pre-patch here? At best, it *might* end up using a larger extent,
but unless we can actually be confident that it does, I'm not convinced
the additional complexity is worth it and would rather see this simply
reverted.

One might ask why the kernel guys aren't doing this themselves or
figuring out why it's necessary to make it worthwhile.

Thanks,

Stephen


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-09-04 15:23:25
Message-ID: 20130904152325.GF11189@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2013-09-04 11:15:37 -0400, Stephen Frost wrote:
> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> > I'd vote for adding zeroing *after* the fallocate() first. That's what's
> > suggested by kernel hackers and what several other large applications
> > do. As it looks like it's what we would have to do if we ever get to use
> > fallocate for relation extension where we would have actual benefits
> > from it.
>
> Does that actually end up doing anything different from what we were
> doing pre-patch here? At best, it *might* end up using a larger extent,
> but unless we can actually be confident that it does, I'm not convinced
> the additional complexity is worth it and would rather see this simply
> reverted.

It does end up with larger extents. That's why e.g. several of the
virtualization solutions do it for container files.

I was sceptical of the idea of using fallocate() for WAL file allocation
from the get-go and voiced that repeatedly, but the argument was made
that it's a convenient testcase to familiarize ourselves with
fallocate(). As that argument won, we should stick to it and learn the
ropes. Part of that is zeroing, so we should do that.

It's a single write() + error checking, so I don't really see much
complexity that way.

> One might ask why the kernel guys aren't doing this themselves or
> figuring out why it's necessary to make it worthwhile.

So, if I understood Ted correctly, the reason it's slower is that
fallocate() only marks the extents as containing 'zero' without actually
touching the data. When a partial write into such an area is then done,
the rest of the page has to be actively zeroed on disk.
I am not sure that explains all the slowdown, but according to the tests
published in this thread, the additional zeroing gets us back to the
former performance plus a small win...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-09-04 15:26:59
Message-ID: 2781.1378308419@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
>> I'd vote for adding zeroing *after* the fallocate() first. That's what's
>> suggested by kernel hackers and what several other large applications
>> do. As it looks like it's what we would have to do if we ever get to use
>> fallocate for relation extension where we would have actual benefits
>> from it.

> Does that actually end up doing anything different from what we were
> doing pre-patch here? At best, it *might* end up using a larger extent,
> but unless we can actually be confident that it does, I'm not convinced
> the additional complexity is worth it and would rather see this simply
> reverted.

> One might ask why the kernel guys aren't doing this themselves or
> figuring out why it's necessary to make it worthwhile.

The larger picture is that that isn't the committed behavior,
but a different one, one which would need performance testing.

At this point, I vote for reverting the patch and allowing it to be
resubmitted for a fresh round of testing with the zeroing added.
And this time we'll need to do the testing more carefully.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-09-04 16:52:09
Message-ID: CA+TgmoY=L54r=uX2hPrrdhxuzAxPPFGfGuakBJyjOHE=6CE9zg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 4, 2013 at 11:26 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
>> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
>>> I'd vote for adding zeroing *after* the fallocate() first. That's what's
>>> suggested by kernel hackers and what several other large applications
>>> do. As it looks like it's what we would have to do if we ever get to use
>>> fallocate for relation extension where we would have actual benefits
>>> from it.
>
>> Does that actually end up doing anything different from what we were
>> doing pre-patch here? At best, it *might* end up using a larger extent,
>> but unless we can actually be confident that it does, I'm not convinced
>> the additional complexity is worth it and would rather see this simply
>> reverted.
>
>> One might ask why the kernel guys aren't doing this themselves or
>> figuring out why it's necessary to make it worthwhile.
>
> The larger picture is that that isn't the committed behavior,
> but a different one, one which would need performance testing.
>
> At this point, I vote for reverting the patch and allowing it to be
> resubmitted for a fresh round of testing with the zeroing added.
> And this time we'll need to do the testing more carefully.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: 9.4 regression
Date: 2013-09-05 07:41:28
Message-ID: 522835A8.1040701@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/09/05 0:04), Andres Freund wrote:
> I'd vote for adding zeroing *after* the fallocate() first.
+1, with the FALLOC_FL_KEEP_SIZE flag.

At least in my in-development sorted-checkpoint work, fallocate() with the
FALLOC_FL_KEEP_SIZE flag is faster than not using it. I applied it to
relation files, so I don't know about WAL files.

With the FALLOC_FL_KEEP_SIZE flag, fallocate() does not change the file
size and does not zero-fill; it appears only to reserve contiguous
physical disk space in the filesystem.
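
[The KEEP_SIZE behavior described above can be illustrated with a minimal, hypothetical demo (Linux-specific; this is not the attached patch):]

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_KEEP_SIZE (Linux) */
#include <sys/stat.h>
#include <unistd.h>

/* Reserve 'len' bytes for a fresh file using FALLOC_FL_KEEP_SIZE and
 * return the file's apparent size afterwards.  KEEP_SIZE allocates
 * (ideally contiguous) disk blocks but leaves st_size unchanged and
 * performs no zero-fill, so a newly created file still reports size 0. */
static long long
reserve_keep_size(const char *path, off_t len)
{
    int         fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    struct stat st;

    if (fd < 0)
        return -1;

    /* May fail with EOPNOTSUPP on filesystems without fallocate(). */
    (void) fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len);

    if (fstat(fd, &st) != 0)
        st.st_size = -1;
    close(fd);
    unlink(path);
    return (long long) st.st_size;
}
```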

Here is a draft patch.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment Content-Type Size
wal-fallocate_keep_size_v0.patch text/x-diff 628 bytes

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-09-05 21:24:10
Message-ID: 20130905212410.GD27195@momjian.us
Lists: pgsql-hackers

On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote:
> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> > I vote for adapting the patch to additionally zero out the file via
> > write(). In your tests that seemed to perform at least as good as the
> > old method... It also has the advantage that we can use it a little bit
> > more as a testbed for possibly using it for heap extensions one day.
> > We're pretty early in the cycle, so I am not worried about this too much...
>
> I dunno, I'm pretty disappointed that this doesn't actually improve
> things. Just following this casually, it looks like it might be some
> kind of locking issue in the kernel that's causing it to be slower; or
> at least some code path that isn't exercised terribly much and therefore
> hasn't been given the love that it should.
>
> Definitely interested in what Ts'o says, but if we can't figure out why
> it's slower *without* writing out the zeros, I'd say we punt on this
> until Linux and the other OS folks improve the situation.

FYI, the patch has been reverted.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Thom Brown <thom(at)linux(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-10-24 10:41:11
Message-ID: CAA-aLv7ywgGub=S-T4enPDya8azEKC9XhNx8bTF8jJEHgqOFiA@mail.gmail.com
Lists: pgsql-hackers

On 5 September 2013 22:24, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote:
>> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
>> > I vote for adapting the patch to additionally zero out the file via
>> > write(). In your tests that seemed to perform at least as good as the
>> > old method... It also has the advantage that we can use it a little bit
>> > more as a testbed for possibly using it for heap extensions one day.
>> > We're pretty early in the cycle, so I am not worried about this too much...
>>
>> I dunno, I'm pretty disappointed that this doesn't actually improve
>> things. Just following this casually, it looks like it might be some
>> kind of locking issue in the kernel that's causing it to be slower; or
>> at least some code path that isn't exercised terribly much and therefore
>> hasn't been given the love that it should.
>>
>> Definitely interested in what Ts'o says, but if we can't figure out why
>> it's slower *without* writing out the zeros, I'd say we punt on this
>> until Linux and the other OS folks improve the situation.
>
> FYI, the patch has been reverted.

Is there an updated patch available for this? And did anyone hear from Ts'o?

--
Thom


From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 regression
Date: 2013-10-24 19:35:49
Message-ID: CAKuK5J1wRUWaB3sbTKtAVRarzhNt4YY+YUh2FtCiXhDWFsXb0A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 24, 2013 at 5:41 AM, Thom Brown <thom(at)linux(dot)com> wrote:
> On 5 September 2013 22:24, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote:
>>> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
>>> > I vote for adapting the patch to additionally zero out the file via
>>> > write(). In your tests that seemed to perform at least as good as the
>>> > old method... It also has the advantage that we can use it a little bit
>>> > more as a testbed for possibly using it for heap extensions one day.
>>> > We're pretty early in the cycle, so I am not worried about this too much...
>>>
>>> I dunno, I'm pretty disappointed that this doesn't actually improve
>>> things. Just following this casually, it looks like it might be some
>>> kind of locking issue in the kernel that's causing it to be slower; or
>>> at least some code path that isn't exercised terribly much and therefore
>>> hasn't been given the love that it should.
>>>
>>> Definitely interested in what Ts'o says, but if we can't figure out why
>>> it's slower *without* writing out the zeros, I'd say we punt on this
>>> until Linux and the other OS folks improve the situation.
>>
>> FYI, the patch has been reverted.
>
> Is there an updated patch available for this? And did anyone hear from Ts'o?

After the patch was reverted, it was not re-submitted. I have tried three
or four times to get more information out of Ts'o, without luck.

--
Jon