DELETE vs TRUNCATE explanation

Lists: pgsql-hackers pgsql-performance
From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: DELETE vs TRUNCATE explanation
Date: 2012-07-11 00:37:24
Message-ID: 4FFCCAC4.4030503@ringerc.id.au
Lists: pgsql-hackers pgsql-performance

Hi

After seeing a few discussions here and on Stack Overflow I've put
together a quick explanation of why "DELETE FROM table;" may be faster
than "TRUNCATE table" for people doing unit testing on lots of tiny
tables, people who're doing this so often they care how long it takes.

I'd love it if a few folks who know the guts were to take a look and
verify its correctness:

http://stackoverflow.com/a/11423886/398670
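
For anyone skimming, the kind of per-test reset being discussed looks
roughly like this (a minimal sketch; table names are purely illustrative):

-- reset between tests with DELETE (cheap for tiny tables)
DELETE FROM child_table;
DELETE FROM parent_table;

-- reset between tests with TRUNCATE (fixed per-statement overhead, but it
-- also resets sequences and leaves the tables with fresh, empty files)
TRUNCATE parent_table, child_table RESTART IDENTITY CASCADE;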

--
Craig Ringer


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 05:22:27
Message-ID: CAAZKuFZ9tYeC1K5ZayZgztm3re_7ytbsEYFa0wwXirB6Hfm+EQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Jul 10, 2012 at 5:37 PM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:
> Hi
>
> After seeing a few discussions here and on Stack Overflow I've put together
> a quick explanation of why "DELETE FROM table;" may be faster than "TRUNCATE
> table" for people doing unit testing on lots of tiny tables, people who're
> doing this so often they care how long it takes.
>
> I'd love it if a few folks who know the guts were to take a look and verify
> its correctness:

I haven't said this before, but think it every time someone asks me
about this, so I'll say it now:

This is a papercut that should be solved with improved mechanics.
TRUNCATE should simply be very nearly the fastest way to remove data
from a table while retaining its type information, and if that means
doing DELETE without triggers when the table is small, then it should.
The only person who could thwart me is someone who badly wants their
128K table to be exactly 8 or 0K, which seems unlikely given the 5MB
of catalog anyway.

Does that sound reasonable? As in, would anyone object if TRUNCATE
learned this behavior?

--
fdr


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 14:05:48
Message-ID: 23853.1342015548@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Daniel Farina <daniel(at)heroku(dot)com> writes:
> TRUNCATE should simply be very nearly the fastest way to remove data
> from a table while retaining its type information, and if that means
> doing DELETE without triggers when the table is small, then it should.
> The only person who could thwart me is someone who badly wants their
> 128K table to be exactly 8 or 0K, which seems unlikely given the 5MB
> of catalog anyway.

> Does that sound reasonable? As in, would anyone object if TRUNCATE
> learned this behavior?

Yes, I will push back on that.

(1) We don't need the extra complexity.

(2) I don't believe that you know where the performance crossover point
would be (according to what metric, anyway?).

(3) The performance of the truncation itself should not be viewed in
isolation; subsequent behavior also needs to be considered. An example
of possible degradation is that index bloat would no longer be
guaranteed to be cleaned up over a series of repeated truncations.
(You might argue that if the table is small then the indexes couldn't
be very bloated, but I don't think that holds up over a long series.)

IOW, I think it's fine as-is. I'd certainly wish to see many more
than one complainant before we expend effort in this area.

regards, tom lane


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 14:19:54
Message-ID: 20120711141954.GJ15829@aart.rice.edu
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 11, 2012 at 10:05:48AM -0400, Tom Lane wrote:
> Daniel Farina <daniel(at)heroku(dot)com> writes:
> > TRUNCATE should simply be very nearly the fastest way to remove data
> > from a table while retaining its type information, and if that means
> > doing DELETE without triggers when the table is small, then it should.
> > The only person who could thwart me is someone who badly wants their
> > 128K table to be exactly 8 or 0K, which seems unlikely given the 5MB
> > of catalog anyway.
>
> > Does that sound reasonable? As in, would anyone object if TRUNCATE
> > learned this behavior?
>
> Yes, I will push back on that.
>
> (1) We don't need the extra complexity.
>
> (2) I don't believe that you know where the performance crossover point
> would be (according to what metric, anyway?).
>
> (3) The performance of the truncation itself should not be viewed in
> isolation; subsequent behavior also needs to be considered. An example
> of possible degradation is that index bloat would no longer be
> guaranteed to be cleaned up over a series of repeated truncations.
> (You might argue that if the table is small then the indexes couldn't
> be very bloated, but I don't think that holds up over a long series.)
>
> IOW, I think it's fine as-is. I'd certainly wish to see many more
> than one complainant before we expend effort in this area.
>
> regards, tom lane
>

+1. TRUNCATE needs to keep the same properties independent of the size
of the table. Smearing it into a DELETE would not be good at all. If
there are optimizations that can be made while keeping its current
behavior, those might be worth pursuing, but the complexity may not be
worthwhile for a relative corner case.

Regards,
Ken


From: Matthew Woodcraft <matthew(at)woodcraft(dot)me(dot)uk>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 18:10:37
Message-ID: 20120711181037.GF11608@golux.woodcraft.me.uk
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:
> (3) The performance of the truncation itself should not be viewed in
> isolation; subsequent behavior also needs to be considered. An example
> of possible degradation is that index bloat would no longer be
> guaranteed to be cleaned up over a series of repeated truncations.
> (You might argue that if the table is small then the indexes couldn't
> be very bloated, but I don't think that holds up over a long series.)
>
> IOW, I think it's fine as-is. I'd certainly wish to see many more
> than one complainant before we expend effort in this area.

I think a documentation change would be worthwhile.

At the moment the TRUNCATE page says, with no caveats, that it is faster than
unqualified DELETE.

It surprised me to find that this wasn't true (with 7.2, again with small
tables in a testsuite), and evidently it's still surprising people today.

-M-


From: Craig James <cjames(at)emolecules(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 20:18:32
Message-ID: CAFwQ8rcR5mimJO5FumcYzkgoxkNiP3g3XOaHB1H4oHe-QnLRPQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 11, 2012 at 7:05 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Daniel Farina <daniel(at)heroku(dot)com> writes:
> > TRUNCATE should simply be very nearly the fastest way to remove data
> > from a table while retaining its type information, and if that means
> > doing DELETE without triggers when the table is small, then it should.
> > The only person who could thwart me is someone who badly wants their
> > 128K table to be exactly 8 or 0K, which seems unlikely given the 5MB
> > of catalog anyway.
>
> > Does that sound reasonable? As in, would anyone object if TRUNCATE
> > learned this behavior?
>
> Yes, I will push back on that.
>
> (1) We don't need the extra complexity.
>
> (2) I don't believe that you know where the performance crossover point
> would be (according to what metric, anyway?).
>
> (3) The performance of the truncation itself should not be viewed in
> isolation; subsequent behavior also needs to be considered. An example
> of possible degradation is that index bloat would no longer be
> guaranteed to be cleaned up over a series of repeated truncations.
> (You might argue that if the table is small then the indexes couldn't
> be very bloated, but I don't think that holds up over a long series.)
>
> IOW, I think it's fine as-is. I'd certainly wish to see many more
> than one complainant before we expend effort in this area.
>

It strikes me as a contrived case rather than a use case. What sort of app
repeatedly fills and truncates a small table thousands of times ... other
than a test app to see whether you can do it or not?

The main point of truncate is to provide a more efficient mechanism to
delete all data from large tables. If your app developers don't know within
a couple orders of magnitude how much data your tables hold, and can't
figure out whether to use delete or truncate, I can't find much sympathy in
my heart.

Craig


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Craig James <cjames(at)emolecules(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 20:47:21
Message-ID: 4FFDE659.7040104@optionshouse.com
Lists: pgsql-hackers pgsql-performance

On 07/11/2012 03:18 PM, Craig James wrote:

> It strikes me as a contrived case rather than a use case. What sort of
> app repeatedly fills and truncates a small table thousands of times ...
> other than a test app to see whether you can do it or not?

Test systems. Any company with even a medium-size QA environment will
have continuous integration systems that run unit tests on a trash
database hundreds or thousands of times through the day. Aside from
dropping/creating the database via template, which would be *really*
slow, truncate is the easiest/fastest way to reset between tests.

If TRUNCATE suddenly started defaulting to DELETE on small table-sets
and several iterations led to exponential index growth, that would be
rather unfortunate.
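
For what it's worth, anyone worried about creeping index size between
runs can keep an eye on it with something like this (a rough sketch; the
index name is purely illustrative):

SELECT pg_size_pretty(pg_relation_size('my_table_pkey')) AS index_size;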

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: sthomas(at)optionshouse(dot)com
Cc: Craig James <cjames(at)emolecules(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 21:04:39
Message-ID: 4FFDEA67.6000801@dunslane.net
Lists: pgsql-hackers pgsql-performance


On 07/11/2012 04:47 PM, Shaun Thomas wrote:
> On 07/11/2012 03:18 PM, Craig James wrote:
>
>> It strikes me as a contrived case rather than a use case. What sort of
>> app repeatedly fills and truncates a small table thousands of times ...
>> other than a test app to see whether you can do it or not?
>
> Test systems. Any company with even a medium-size QA environment will
> have continuous integration systems that run unit tests on a trash
> database hundreds or thousands of times through the day. Aside from
> dropping/creating the database via template, which would be *really*
> slow, truncate is the easiest/fastest way to reset between tests.

Why is recreating the test db from a (populated) template going to be
slower than truncating all the tables and repopulating from an external
source? I had a client who achieved a major improvement in speed and
reduction in load by moving to this method of test db setup.
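
The per-run reset then becomes something like this (a sketch only; the
database names are illustrative, and it has to run from a maintenance
connection with no other sessions attached to either database):

DROP DATABASE IF EXISTS test_run;
CREATE DATABASE test_run TEMPLATE test_template;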

cheers

andrew


From: Mark Thornton <mthornton(at)optrak(dot)com>
To: pgsql-performance(at)postgresql(dot)org, cjames(at)emolecules(dot)com
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 21:32:33
Message-ID: 4FFDF0F1.9010005@optrak.com
Lists: pgsql-hackers pgsql-performance

On 11/07/12 21:18, Craig James wrote:
>
> It strikes me as a contrived case rather than a use case. What sort
> of app repeatedly fills and truncates a small table thousands of times
> ... other than a test app to see whether you can do it or not?
If I have a lot of data which updates/inserts an existing table but I
don't know if a given record will be an update or an insert, then I
write all the 'new' data to a temporary table and then use sql
statements to achieve the updates and inserts on the existing table.

Is there a better way of doing this in standard SQL?

Mark


From: Craig James <cjames(at)emolecules(dot)com>
To: Mark Thornton <mthornton(at)optrak(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 22:09:56
Message-ID: CAFwQ8rdZWMHvmEEPyOKCgC0bMtPhaRGiY1nSVLA_tORAk1SVPg@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 11, 2012 at 2:32 PM, Mark Thornton <mthornton(at)optrak(dot)com> wrote:

> On 11/07/12 21:18, Craig James wrote:
>
>>
>> It strikes me as a contrived case rather than a use case. What sort of
>> app repeatedly fills and truncates a small table thousands of times ...
>> other than a test app to see whether you can do it or not?
>>
> If I have a lot of data which updates/inserts an existing table but I
> don't know if a given record will be an update or an insert, then I write
> all the 'new' data to a temporary table and then use sql statements to
> achieve the updates and inserts on the existing table.
>
> Is there a better way of doing this in standard SQL?
>

If it's a single session, use a temporary table. It is faster to start
with (temp tables aren't logged), and it's automatically dropped at the end
of the session (or at the end of the transaction if that's what you
specified when you created it). This doesn't work if your insert/update
spans more than one session.
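
A rough sketch of that staging pattern, with purely illustrative table and
column names:

BEGIN;
CREATE TEMPORARY TABLE staging (LIKE target INCLUDING DEFAULTS) ON COMMIT DROP;
-- load the incoming rows into "staging" here (COPY or INSERTs)
UPDATE target t SET val = s.val
FROM staging s WHERE t.id = s.id;
INSERT INTO target (id, val)
SELECT s.id, s.val FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id);
COMMIT;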

Another trick that works (depending on how big your tables are) is to scan
the primary key before you start, and build a hash table of the keys. That
instantly tells you whether each record should be an insert or update.

Craig

>
> Mark
>
>
>


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-11 22:51:40
Message-ID: CAAZKuFYsK+ZQuC=zf0xE=M-LNa02tB7RBy3CHEUyq7LWCkK6jQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 11, 2012 at 7:05 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Daniel Farina <daniel(at)heroku(dot)com> writes:
>> TRUNCATE should simply be very nearly the fastest way to remove data
>> from a table while retaining its type information, and if that means
>> doing DELETE without triggers when the table is small, then it should.
>> The only person who could thwart me is someone who badly wants their
>> 128K table to be exactly 8 or 0K, which seems unlikely given the 5MB
>> of catalog anyway.
>
>> Does that sound reasonable? As in, would anyone object if TRUNCATE
>> learned this behavior?
>
> Yes, I will push back on that.
>
> (1) We don't need the extra complexity.

Well, a "need" is justified by the gains, no? It seems like this
follows from the thoughts presented afterwards, so I'll discuss those.

> (2) I don't believe that you know where the performance crossover point
> would be (according to what metric, anyway?).

Nope. I don't. But an exact crossover is a level of precision I don't
really need, because here is where things stand on a completely
unremarkable test suite on the closest project to me that meets the
"regular web-app" profile case:

With en-masse DELETE:
rake 41.89s user 3.08s system 76% cpu 58.629 total

With TRUNCATE:
rake 49.86s user 2.93s system 5% cpu 15:17.88 total

15x slower. This is a Macbook Air with full disk encryption and SSD
disk with fsync off, e.g. a very typical developer configuration.
This is a rather small schema -- probably a half a dozen tables, and
probably about a dozen indexes. This application is entirely
unremarkable in its test-database workload: it wants to load a few
records, do a few things, and then clear those handful of records.

> (3) The performance of the truncation itself should not be viewed in
> isolation; subsequent behavior also needs to be considered. An example
> of possible degradation is that index bloat would no longer be
> guaranteed to be cleaned up over a series of repeated truncations.
> (You might argue that if the table is small then the indexes couldn't
> be very bloated, but I don't think that holds up over a long series.)

I'm not entirely convinced of the mechanism; it was simply the most
obvious one, but I bet one that is better in every respect is also
possible. It did occur to me that bloat might be a sticky point.

> IOW, I think it's fine as-is. I'd certainly wish to see many more
> than one complainant before we expend effort in this area.

I've seen way more than one complaint, and I'm quite sure there are
thousands of man hours (or more) spent on people who don't even know
to complain about such atrocious performance (or maybe it's so bad
that most people run a web search and find out, probably being left
really annoyed from having to yak shave as a result). In spite of how
familiar I am with Postgres and its mailing lists, I have glossed over
this for a long time, just thinking "wow, that really sucks" and only
now -- by serendipity of having skimmed this post -- have seen fit to
complain on behalf of quite a few rounds of dispensing workaround
advice to other people. It was only when this was brought to the fore
of my mind that I stopped to consider how much wasted time I've seen in
people trying to figure this out over and over again (granted, they
tend to remember after the first time).

Perhaps a doc fix is all we need (TRUNCATE is constant-time on large
tables, but can be very slow compared to DELETE on small tables), but
I completely and enthusiastically reject any notion from people
calling this "contrived" or an "edge case," because people writing
software against PostgreSQL that have unit tests have this use case
constantly, often dozens or even hundreds of times a day.

What I don't know is how many people figure out that they should use
DELETE instead, and after how long. Even though the teams I work with
are very familiar with many of the finer points of Postgres, doing
some probing for the first time took a little while.

If we're going to live with it, I contend that we should own it as a
real and substantial weakness for development productivity, and not
sweep it under the rug as some "contrived" or "corner" case.

--
fdr


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Matthew Woodcraft <matthew(at)woodcraft(dot)me(dot)uk>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-12 01:23:16
Message-ID: 4FFE2704.7020106@ringerc.id.au
Lists: pgsql-hackers pgsql-performance

On 07/12/2012 02:10 AM, Matthew Woodcraft wrote:
> I think a documentation change would be worthwhile. At the moment the
> TRUNCATE page says, with no caveats, that it is faster than
> unqualified DELETE.

+1 to updating the docs to reflect the fact that TRUNCATE may have a
higher fixed cost than DELETE FROM table; but also prevents bloat.

It's a weird little corner case, but with database-backed unit testing
it's going to become a more significant one whether or not it feels like
it makes any sense.

--
Craig Ringer


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-12 01:26:14
Message-ID: 4FFE27B6.7000500@ringerc.id.au
Lists: pgsql-hackers pgsql-performance

On 07/11/2012 01:22 PM, Daniel Farina wrote:
> On Tue, Jul 10, 2012 at 5:37 PM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:
>> Hi
>>
>> After seeing a few discussions here and on Stack Overflow I've put together
>> a quick explanation of why "DELETE FROM table;" may be faster than "TRUNCATE
>> table" for people doing unit testing on lots of tiny tables, people who're
>> doing this so often they care how long it takes.
>>
>> I'd love it if a few folks who know the guts were to take a look and verify
>> its correctness:
> I haven't said this before, but think it every time someone asks me
> about this, so I'll say it now:
>
> This is a papercut that should be solved with improved mechanics.
> TRUNCATE should simply be very nearly the fastest way to remove data
> from a table while retaining its type information, and if that means
> doing DELETE without triggers when the table is small, then it should.
> The only person who could thwart me is someone who badly wants their
> 128K table to be exactly 8 or 0K, which seems unlikely given the 5MB
> of catalog anyway.
>
> Does that sound reasonable? As in, would anyone object if TRUNCATE
> learned this behavior?
Yep, I'd object. It's more complicated and less predictable. Also, as I
strongly and repeatedly highlighted in my post, DELETE FROM table; does
a different job to TRUNCATE. You'd at minimum need the effect of DELETE
followed by a VACUUM on the table and its indexes to be acceptable and
avoid the risk of rapid table + index bloat - and that'd be lots slower
than a TRUNCATE. You could be clever and lock the table then DELETE and
set xmax at the same time I guess, but I suspect that'd be a bit of work
and still wouldn't take care of the indexes.
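
That is, to get something with roughly the same after-effects out of
DELETE you'd be looking at about this much work per table (a sketch; the
table name is illustrative):

DELETE FROM some_small_table;
VACUUM some_small_table;  -- also processes its indexes; can't run inside a transaction block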

It's also too complicated, not least because AFAIK util commands and
CRUD commands go through very different paths in PostgreSQL.

I guess you could propose and post a prototype patch for a new command
that tried to empty the table via whatever method it thought would be
fastest. Such a new command wouldn't be bound by the accepted and
expected rules followed by TRUNCATE so it could vary its behaviour based
on the table, doing a real truncate on big tables and a
delete-then-vacuum on small tables. I suspect you'd land up writing the
fairly complicated code for the potentially multi-table
delete-and-vacuum yourself.

Honestly, though, it might be much better to start with "how can
TRUNCATE of empty or near-empty tables be made faster?" and start
examining where the time goes.

--
Craig Ringer


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-12 01:41:54
Message-ID: 4FFE2B62.2040502@ringerc.id.au
Lists: pgsql-hackers pgsql-performance

On 07/12/2012 06:51 AM, Daniel Farina wrote:
> 15x slower. This is a Macbook Air with full disk encryption and SSD
> disk with fsync off, e.g. a very typical developer configuration.
Don't use full disk encryption for throwaway test data if you care about
how long those tests take. It's a lot like tuning the engine in your car
while ignoring the fact that the handbrake is jammed on and you're
dragging a parachute. Use a ramdisk or un-encrypted partition, something
that doesn't take three weeks to fsync().

That said, this performance gap makes me wonder if TRUNCATE is forcing
metadata synchronisation even with fsync=off, causing the incredibly
glacially awesomely slow disk access of your average FDE system to kick
in, possibly even once per table or even once per file (index, table,
toast, etc). If so, it may be worth:

- Allowing TRUNCATE to skip synchronization when fsync=off. Pg is
already allowed to eat all your data if it feels like it in this
configuration, so there's no point flushing filesystem metadata to make
sure files are really swapped.

- When fsync=on, trying to flush all changes to all files out at once
rather than once per file as it could be doing (haven't checked) right
now. How to do this without also flushing all other pending I/O on the
whole system (with a global "sync()") would be somewhat OS/filesystem
dependent, unfortunately.

You could help progress this issue constructively by doing some
profiling on your system, tracing Pg's system calls, and determining
what exactly it's doing with DELETE vs TRUNCATE and where the time goes.
On Linux you'd use OProfile for this and on Solaris you'd use DTrace.
Dunno what facilities Mac OS X has but there must be something similar.

Once you've determined why it's slow, you have a useful starting point
for making it faster, first for test systems with fsync=off then, once
that's tracked down, maybe for robust systems with fsync=on.

> I've seen way more than one complaint, and I'm quite sure there are
> thousands of man hours (or more) spent on people who don't even know
> to complain about such atrocious performance (or maybe it's so bad
> that most people run a web search and find out, probably being left
> really annoyed from having to yak shave as a result).
I suspect you're right - as DB based unit testing becomes more
commonplace this is turning up a lot more. As DB unit tests were first
really popular in the ruby/rails crowd they've probably seen the most
pain, but as someone who doesn't move in those circles I wouldn't have
known. They certainly don't seem to have been making noise about it
here, and I've only recently seen some SO questions about it.

> Perhaps a doc fix is all we need (TRUNCATE is constant-time on large
> tables, but can be very slow compared to DELETE on small tables), but
> I completely and enthusiastically reject any notion from people
> calling this "contrived" or an "edge case," because people writing
> software against PostgreSQL that have unit tests have this use case
> constantly, often dozens or even hundreds of times a day.
I have to agree with this - it may have been an edge case in the past,
but it's becoming mainstream and is worth being aware of.

That said, the group of people who care about this most are not well
represented as active contributors to PostgreSQL. I'd love it if you
could help start to change that by stepping in and taking a little time
to profile exactly what's going on with your system so we can learn
what, exactly, is slow.

--
Craig Ringer


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-12 06:12:35
Message-ID: CAAZKuFZdVTx35o4c0=ZJLf5Rd+Z3pEHqj__eO+5+q9eRy3bVgw@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 11, 2012 at 6:41 PM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:
> On 07/12/2012 06:51 AM, Daniel Farina wrote:
>>
>> 15x slower. This is a Macbook Air with full disk encryption and SSD
>> disk with fsync off, e.g. a very typical developer configuration.
>
> Don't use full disk encryption for throwaway test data if you care about how
> long those tests take. It's a lot like tuning the engine in your car while
> ignoring the fact that the handbrake is jammed on and you're dragging a
> parachute. Use a ramdisk or un-encrypted partition, something that doesn't
> take three weeks to fsync().

No. Full disk encryption is not that slow. And as we see, there is a
workaround that works "just fine" (maybe it could be faster, who
knows?) in this exact configuration. The greater problem is more
likely to be HFS+, the file system.

If someone produces and gets adoption of a wonderfully packaged
test configuration of Postgres using a ram-based block device that
somehow has a good user experience living alongside the persistent
version, this problem can go away completely. In fact, that would be
*phenomenal*, because so many things could be so much faster. But
that's surprisingly challenging: for example, last I checked,
Postgres.app, principally written by one of my colleagues, does *not*
disable fsync because we don't know of a great way to communicate the
relaxed expectations of durability, even though Postgres.app is
targeted towards developers: for example, it does not run until you
log in, so it's more like a foreground application. Maybe if the
connection had an option that said "x-test=true", or
something...deposit your idea here.

Until then, this is at the level of an is-ought problem: there is
no immediate to even moderately distant future where people are not
going to click the full disk encryption button their OS vendor gives
them (nor should they *not* click that: people love to download bits
of data from production to their local machine to figure out problems,
and I think the world is a better place for it), and people are going
to use HFS+ in large numbers, so talking about how many people "just"
ought to reconfigure is tantamount to blaming the victim, especially
when we have a sound and workable workaround in hand to at least prove
definitively that the problem is not intractable.

> That said, this performance gap makes me wonder if TRUNCATE is forcing
> metadata synchronisation even with fsync=off, causing the incredibly
> glacially awesomely slow disk access of your average FDE system to kick in,
> possibly even once per table or even once per file (index, table, toast,
> etc).

Lousy file system is my guess. HFS is not that great. I bet ext3
would be a reasonable model of this amount of pain as well.

> You could help progress this issue constructively by doing some profiling on
> your system, tracing Pg's system calls, and determining what exactly it's
> doing with DELETE vs TRUNCATE and where the time goes. On Linux you'd use
> OProfile for this and on Solaris you'd use DTrace. Dunno what facilities Mac
> OS X has but there must be something similar.

I'm sure I could, but first I want to put to complete rest the notion
that this is an "edge case." It's only an edge case if the only
database you have runs in production. An understanding by more people
that this is a problem of at least moderate impact is a good first
step. I'll ask some of my more Macintosh-adept colleagues for advice.

>> I've seen way more than one complaint, and I'm quite sure there are
>> thousands of man hours (or more) spent on people who don't even know
>> to complain about such atrocious performance (or maybe it's so bad
>> that most people run a web search and find out, probably being left
>> really annoyed from having to yak shave as a result).
>
> I suspect you're right - as DB based unit testing becomes more commonplace
> this is turning up a lot more. As DB unit tests were first really popular in
> the ruby/rails crowd they've probably seen the most pain, but as someone who
> doesn't move in those circles I wouldn't have known. They certainly don't
> seem to have been making noise about it here, and I've only recently seen
> some SO questions about it.

Well, here's another anecdotal data point to show how this can sneak
under the radar: because this was a topic of discussion in the office
today, a colleague in the Department of Data discovered his 1.5 minute
testing cycle could be cut to thirty seconds. We conservatively
estimate he runs the tests 30 times a day when working on his project,
and probably more. Multiply that over a few weeks (not even counting
the cost of more broken concentration) and we're talking a real loss
of productivity and satisfaction.

Here's an example of a person that works on a Postgres-oriented
project at his day job, has multi-year experience with it, and can
write detailed articles like these:
https://devcenter.heroku.com/articles/postgresql-concurrency . If he
didn't know to get this right without having it called out as a
caveat, what number of people have but the most slim chance? Our best
asset is probably the relative obscurity of TRUNCATE vs. DELETE for
those who are less familiar with the system.

I'm sure he would have found it eventually when starting to profile
his tests when they hit the 3-4 minute mark, although he might just as
easily said "well, TRUNCATE, that's the fast one...nothing to do
there...".

> That said, the group of people who care about this most are not well
> represented as active contributors to PostgreSQL. I'd love it if you could
> help start to change that by stepping in and taking a little time to profile
> exactly what's going on with your system so we can learn what, exactly, is
> slow.

It's not my platform of choice, per se, but on my Ubuntu Precise on
ext4 with fsync off and no disk encryption:

$ rake
55.37user 2.36system 1:15.33elapsed 76%CPU (0avgtext+0avgdata
543120maxresident)k
0inputs+2728outputs (0major+85691minor)pagefaults 0swaps

$ rake
53.85user 1.97system 2:04.38elapsed 44%CPU (0avgtext+0avgdata
547904maxresident)k
0inputs+2640outputs (0major+100226minor)pagefaults 0swaps

Which is a not-as-pathetic slowdown, but still pretty substantial,
being somewhat shy of 2x. I'll ask around for someone who is
Macintosh-OS-inclined (not as a user, but as a developer) about a good
way to get a profile.

--
fdr


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Daniel Farina <daniel(at)heroku(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-12 07:45:53
Message-ID: 4FFE80B1.2000403@ringerc.id.au
Lists: pgsql-hackers pgsql-performance

On 07/12/2012 02:12 PM, Daniel Farina wrote:
> On Wed, Jul 11, 2012 at 6:41 PM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:
>> On 07/12/2012 06:51 AM, Daniel Farina wrote:
>>> 15x slower. This is a Macbook Air with full disk encryption and SSD
>>> disk with fsync off, e.g. a very typical developer configuration.
>> Don't use full disk encryption for throwaway test data if you care about how
>> long those tests take. It's a lot like tuning the engine in your car while
>> ignoring the fact that the handbrake is jammed on and you're dragging a
>> parachute. Use a ramdisk or un-encrypted partition, something that doesn't
>> take three weeks to fsync().
> No. Full disk encryption is not that slow. And as we see, there is a
> workaround that works "just fine" (maybe it could be faster, who
> knows?) in this exact configuration. The greater problem is more
> likely to be HFS+, the file system.

The two go somewhat hand in hand in any case.

"Three weeks" is of course hyperbole. Nonetheless, I haven't seen a full
disk encryption system that doesn't dramatically slow down synchronous
operations by forcing a lot more work to be done than would be the case
without disk encryption. Perhaps the Mac OS X / HFS+ solution is an
exception to this, but I doubt it.

Given a small program that repeats the following sequence:

- Creates a file
- Writes few bytes to it
- fsync()s and closes it
- deletes it
- fsync()s the directory to ensure the metadata change is flushed

... and times it, it'd be interesting to do test runs with and without
encryption on HFS+.

> But
> that's surprisingly challenging: for example, last I checked,
> Postgres.app, principally written by one of my colleagues, does *not*
> disable fsync because we don't know of a great way to communicate the
> relaxed expectations of durability, even though Postgres.app is
> targeted towards developers

I think this is an issue of developer and user responsibility. Proper
test/dev separation from production, and a bit of thought, is all it
takes. After all, Pg can't stop you running your unit tests (full of all
those slow TRUNCATEs) against your production database, either.
Durability isn't worth a damn if you just deleted all your data.

About the only technical aid I can see for this would be some kind of
GUC that the app could proactively check against. Set it to "production"
for your production DB, and "test" for your throwaways. If the unit
tests see "production" they refuse to run; if the app proper sees "test"
it warns about data durability. Have it default to unset or "test" so
admins must explicitly set it to "production".

Handily, this is already possible. You can add whatever custom GUCs you
want. If you want to make your unit tests require that a GUC called
"stage.is_production" be off in order to run, just add to postgresql.conf:

custom_variable_classes = 'stage'
stage.is_production = off

now, you can see the new GUC:

regress=# SHOW stage.is_production;
stage.is_production
---------------------
off
(1 row)

... so your unit tests and app can check for it. Since you're producing
custom installers, this is something you can bundle as part of the
generated postgresql.conf for easy differentiation between test and
production DBs.
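
A sketch of the guard a test harness could run up front (using the GUC
name above; adapt to taste):

DO $$
BEGIN
  IF current_setting('stage.is_production') = 'on' THEN
    RAISE EXCEPTION 'refusing to run destructive tests against a production database';
  END IF;
END
$$;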

If requirements like this were integrated into common unit testing
frameworks some of these worries would go away. That's not something Pg
can make happen, though.

How would you want to see it work? How would you solve this problem?

> Until then, this is at the level of an is-ought problem: there is
> no immediate to even moderately distant future where people are not
> going to click the full disk encryption button their OS vendor gives
> them (nor should they *not* click that: people love to download bits
> of data from production to their local machine to figure out problems,
> and I think the world is a better place for it), and people are going
> to use HFS+ in large numbers, so talking about how many people "just"
> ought to reconfigure is tantamount to blaming the victim, especially
> when we have a sound and workable workaround in hand to at least prove
> definitively that the problem is not intractable.

Yes, people do work on production data in test envs, and FDE is overall
a plus. I'd rather they not turn it off - and rather they not have to.
That's why I suggested using a ramdisk as an alternative; it's
completely non-durable and just gets tossed out, so there's no more
worry about data leakage than there is for access to the disk cache
buffered in RAM or the mounted disks of a FDE machine when it's unlocked.

Setting up Pg to run off a ramdisk isn't a one-click trivial operation,
and it sounds like the group you're mainly interested in are the
database-as-a-utility crowd that prefer not to see, think about, or
touch the database directly, hence Postgres.app etc. If so this is much
more of a packaging problem than a core Pg problem. I take your point
about needing to be able to indicate lack of durability to clients, but
think it's relatively easily done with a custom GUC as shown above.

Of course, Pg on a ramdisk has other issues that quickly become apparent
when you "COPY" that 2GB CSV file into your DB...

> Lousy file system is my guess. HFS is not that great. I bet ext3 would
> be a reasonable model of this amount of pain as well.

Hey, HFS+ Journaled/Extended, which is all that you're ever likely to
see, is merely bad :-P

The original HFS, now that was a monster. Not-so-fond memories of
regular Norton tools defrag runs resurfacing from my Mac OS 7 days...

> I'm sure I could, but first I want to put to complete rest the notion
> that this is an "edge case." It's only an edge case if the only
> database you have runs in production. An understanding by more people
> that this is a problem of at least moderate impact is a good first
> step. I'll ask some of my more Macintosh-adept colleagues for advice.

That'd be great; as this is an issue having real world impact, people
with mac equipment and knowledge need to get involved in helping to
solve it. It's not confined to mac, but seems to be worse there.

The other way you could help would be by providing canned self-contained
test cases that can be used to demonstrate the big performance gaps
you're reporting and test them on other platforms / OSes / file systems.
Something with a "I've never used Ruby" quickstart.

> Here's an example of a person that works on a Postgres-oriented
> project at his day job, has multi-year experience with it, and can
> write detailed articles like these:
> https://devcenter.heroku.com/articles/postgresql-concurrency . If he
> didn't know to get this right without having it called out as a
> caveat, what number of people have but the most slim chance? Our best
> asset is probably the relative obscurity of TRUNCATE vs. DELETE for
> those who are less familiar with the system.

Yep. This whole issue was new to me until last week too. I run tests
against my DB but it's fast enough here. In any case, for my tests other
costs are greatly more significant than a few fractions of a second
difference in one DB operation. Clearly that's not the case for some DB
unit testing designs.

Other than ruby/rails/rake, what other systems are you aware of that're
affected by these issues? I'm not dismissing ruby, I just want to know
if you know of other groups or techs that're ALSO affected.

> Which is a not-as-pathetic slowdown, but still pretty substantial,
> being somewhat shy of 2x. I'll ask around for someone who is
> Macintosh-OS-inclined (not as a user, but as a developer) about a good
> way to get a profile.

That'd be great. Get them onto the list and involved, because if you
want to see this improved it's going to take some back and forth and
someone who can interpret the profile results, test changes, etc.

I only have a limited ability and willingness to drive this forward; I
have to focus on other things. You'll need to be willing to be proactive
and push this a bit. Figuring out what part of truncation is taking the
time would be a big plus, as would determining how much worse FDE makes
it vs an unencrypted disk.

Hopefully others are interested and following along too.

--
Craig Ringer


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-12 19:15:15
Message-ID: CAMkU=1ydaypgAeZxyaMc=KhJJw5riGMBiLpvSqOO42=qArRZmQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 11, 2012 at 3:51 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
>
> Nope. I don't. But an exact crossover is a level of precision I don't
> really need, because here is where things stand on a completely
> unremarkable test suite on the closest project to me that meets the
> "regular web-app" profile case:
>
> With en-masse DELETE:
> rake 41.89s user 3.08s system 76% cpu 58.629 total
>
> With TRUNCATE:
> rake 49.86s user 2.93s system 5% cpu 15:17.88 total
>
> 15x slower. This is a Macbook Air with full disk encryption and SSD
> disk with fsync off, e.g. a very typical developer configuration.

What is shared_buffers?

> This is a rather small schema -- probably a half a dozen tables, and
> probably about a dozen indexes. This application is entirely
> unremarkable in its test-database workload: it wants to load a few
> records, do a few things, and then clear those handful of records.

How many rounds of truncation does one rake do? I.e. how many
truncations are occurring over the course of that 1 minute or 15
minutes?

Cheers,

Jeff


From: Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-performance(at)postgresql(dot)org <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-12 23:21:13
Message-ID: 32078D4B40DD40ECB739D0ECD7D95996@gmail.com
Lists: pgsql-hackers pgsql-performance

Hi,

I work with Daniel Farina and was the other engineer who "discovered" this, once again. That is, I got bit by it and have been running TRUNCATE on my test suites for years.

On Thursday, July 12, 2012 at 12:15 PM, Jeff Janes wrote:

> On Wed, Jul 11, 2012 at 3:51 PM, Daniel Farina <daniel(at)heroku(dot)com (mailto:daniel(at)heroku(dot)com)> wrote:
> >
> > Nope. I don't. But an exact crossover is a level of precision I don't
> > really need, because here is where things stand on a completely
> > unremarkable test suite on the closest project to me that meets the
> > "regular web-app" profile case:
> >
> > With en-masse DELETE:
> > rake 41.89s user 3.08s system 76% cpu 58.629 total
> >
> > With TRUNCATE:
> > rake 49.86s user 2.93s system 5% cpu 15:17.88 total
> >
> > 15x slower. This is a Macbook Air with full disk encryption and SSD
> > disk with fsync off, e.g. a very typical developer configuration.
> >
>
>
> What is shared_buffers?

1600kB

Not sure this will make much difference with such small data, but of course I could be dead wrong here.
>
> > This is a rather small schema -- probably a half a dozen tables, and
> > probably about a dozen indexes. This application is entirely
> > unremarkable in its test-database workload: it wants to load a few
> > records, do a few things, and then clear those handful of records.
> >
>
>
> How many rounds of truncation does one rake do? I.e. how many
> truncations are occurring over the course of that 1 minute or 15
> minutes?
>
>

All tables are cleared out after every test. On this particular project, I'm running 200+ tests in 1.5 minutes (or 30 seconds with DELETE instead of TRUNCATE). For another, bigger project it's running 1700+ tests in about a minute. You can do the math from there.

I'd say this is not atypical at all, so I too encourage teaching TRUNCATE about small tables and optimizing for that, as well as a section in the docs about postgres tweaks for test suites. I'm sure many people have done independent research in this area, and it'd be great to have it documented in one place.

-Harold
>
>
> Cheers,
>
> Jeff
>


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-07-13 01:00:49
Message-ID: CAMkU=1zJK6WiS9-6eO8rNmRpKSSB4=uw2mTsD=k64mNh__=Dcg@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Thu, Jul 12, 2012 at 4:21 PM, Harold A. Giménez
<harold(dot)gimenez(at)gmail(dot)com> wrote:
>
> > What is shared_buffers?
>
>
> 1600kB

That is really small, so the buffer flushing should not be a problem.
Unless you mean 1600MB.

> > > This is a rather small schema -- probably a half a dozen tables, and
> > > probably about a dozen indexes. This application is entirely
> > > unremarkable in its test-database workload: it wants to load a few
> > > records, do a few things, and then clear those handful of records.
> >
> > How many rounds of truncation does one rake do? I.e. how many
> > truncations are occurring over the course of that 1 minute or 15
> > minutes?
>
> All tables are cleared out after every test. On this particular project, I'm
> running 200+ tests in 1.5 minutes (or 30 seconds with DELETE instead of
> TRUNCATE). For another, bigger project it's running 1700+ tests in about a
> minute. You can do the math from there.

so 1700 rounds * 18 relations = 30,600 truncates per minute.

That is actually faster than I get truncates to go when I am purely
limited by CPU.

I think the problem is in the Fsync Absorption queue. Every truncate
adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
those leads to sequential scanning the checkpointer's pending ops hash
table, which is quite large. It is almost entirely full of other
requests which have already been canceled, but it still has to dig
through them all. So this is essentially an N^2 operation.

I'm not sure why we don't just delete the entry instead of marking it
as cancelled. It looks like the only problem is that you can't delete
an entry other than the one just returned by hash_seq_search. Which
would be fine, as that is the entry that we would want to delete;
except that mdsync might have a different hash_seq_search open, and so
it wouldn't be safe to delete.

If the segno was taken out of the hash key and handled some other way,
then the forgetting could be done with a simple hash look up rather
than a full scan.

Maybe we could just turn off the pending ops table altogether when
fsync=off, but since fsync is PGC_SIGHUP it is not clear how you could
safely turn it back on.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-13 04:55:22
Message-ID: CAMkU=1w7vLA63hf-+Uc_j61CvOdKixsOHph9d0H-aLNBYgfX=w@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

I've moved this thread from performance to hackers.

The topic was poor performance when truncating lots of small tables
repeatedly on test environments with fsync=off.

On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> I think the problem is in the Fsync Absorption queue. Every truncate
> adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
> those leads to sequential scanning the checkpointer's pending ops hash
> table, which is quite large. It is almost entirely full of other
> requests which have already been canceled, but it still has to dig
> through them all. So this is essentially an N^2 operation.

My attached Proof of Concept patch reduces the run time of the
benchmark at the end of this message from 650sec to 84sec,
demonstrating that this is in fact the problem. Which doesn't mean
that my patch is the right answer to it, of course.

(The delete option is still faster than truncate, coming in at around 55sec)

> I'm not sure why we don't just delete the entry instead of marking it
> as cancelled. It looks like the only problem is that you can't delete
> an entry other than the one just returned by hash_seq_search. Which
> would be fine, as that is the entry that we would want to delete;
> except that mdsync might have a different hash_seq_search open, and so
> it wouldn't be safe to delete.
>
> If the segno was taken out of the hash key and handled some other way,
> then the forgetting could be done with a simple hash look up rather
> than a full scan.

The above two ideas might be the better solution, as they would work
even when fsync=on. Since BBUs are becoming so popular I think the
fsync queue could be a problem even with fsync on if the fsync is fast
enough. But I don't immediately know how to implement them.

> Maybe we could just turn off the pending ops table altogether when
> fsync=off, but since fsync is PGC_SIGHUP it is not clear how you could
> safely turn it back on.

Now that I think about it, I don't see how turning fsync from off to
on can ever be known to be safe, until a system wide sync has
intervened. After all a segment that was dirtied and added to the
pending ops table while fsync=off might also be removed from the
pending ops table the microsecond before fsync is turned on, so how is
that different from never adding it in the first place?

The attached Proof Of Concept patch implements this in two ways, one
of which is commented out. The commented out way omits the overhead
of sending the request to the checkpointer in the first place, but
breaks modularity a bit.

The benchmark used on 9.3devel head is:

fsync=off, all other defaults.

## one time initialization
perl -le 'print "create schema foo$_; create table foo$_.foo$_ (k
integer, v integer);" $ARGV[0]..$ARGV[0]+$ARGV[1]-1' 0 10 |psql

## actual benchmark.
perl -le 'print "set client_min_messages=warning;";
foreach (1..10000) {
print "BEGIN;\n";
print "insert into foo$_.foo$_ select * from
generate_series(1,10); " foreach $ARGV[0]..$ARGV[0]+$ARGV[1]-1;
print "COMMIT;\nBEGIN;\n";
print "truncate table foo$_.foo$_; " foreach
$ARGV[0]..$ARGV[0]+$ARGV[1]-1;
#print "delete from foo$_.foo$_; " foreach
$ARGV[0]..$ARGV[0]+$ARGV[1]-1;
print "COMMIT;\n"
} ' 0 10 | time psql > /dev/null

Cheers,

Jeff

Attachment: fsync_queue_POC.patch (application/octet-stream, 1.1 KB)

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-15 00:10:18
Message-ID: CAMkU=1yLXvODRZZ_=fgrEeJfk2tvZPTTD-8n8BwrAhNz_WBT0A@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Thu, Jul 12, 2012 at 9:55 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> I've moved this thread from performance to hackers.
>
> The topic was poor performance when truncating lots of small tables
> repeatedly on test environments with fsync=off.
>
> On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> I think the problem is in the Fsync Absorption queue. Every truncate
>> adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
>> those leads to sequential scanning the checkpointer's pending ops hash
>> table, which is quite large. It is almost entirely full of other
>> requests which have already been canceled, but it still has to dig
>> through them all. So this is essentially an N^2 operation.
...
>
>> I'm not sure why we don't just delete the entry instead of marking it
>> as cancelled. It looks like the only problem is that you can't delete
>> an entry other than the one just returned by hash_seq_search. Which
>> would be fine, as that is the entry that we would want to delete;
>> except that mdsync might have a different hash_seq_search open, and so
>> it wouldn't be safe to delete.

The attached patch addresses this problem by deleting the entry when
it is safe to do so, and flagging it as canceled otherwise.

I thought of using has_seq_scans to determine when it is safe, but
dynahash.c does not make that function public, and I was afraid it
might be too slow, anyway.

So instead I used a static variable, plus the knowledge that the only
time there are two scans on the table is when mdsync starts one and
then calls RememberFsyncRequest indirectly. There is one other place
that does a seq scan, but there is no way for control to pass from
that loop to reach RememberFsyncRequest.
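
In code form, the scheme amounts to something like this minimal sketch
(variable and field names are illustrative, not necessarily what the patch
uses):

/* Hypothetical sketch: delete outright unless mdsync has a scan open. */
static bool mdsync_scan_in_progress = false;

static void
cancel_or_delete_entry(HTAB *pendingOpsTable, PendingOperationEntry *entry)
{
    if (!mdsync_scan_in_progress)
    {
        /* No concurrent hash_seq_search: safe to remove the entry. */
        (void) hash_search(pendingOpsTable, &entry->tag, HASH_REMOVE, NULL);
    }
    else
    {
        /* mdsync's scan is live; just flag it and let mdsync skip it. */
        entry->canceled = true;
    }
}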

I've added code to disclaim the scan if mdsync errors out. I don't
think that this should be a problem because at that point the scan object
is never going to be used again, so if its internal state gets screwed
up it shouldn't matter. However, I wonder if it should also call
hash_seq_term, otherwise the pending ops table will be permanently
prevented from expanding (this is a pre-existing condition, nothing to do
with my patch). Since I don't know what can make mdsync error out
without being catastrophic, I don't know how to test this out.

One concern is that if the ops table ever does become bloated, it can
never recover while under load. The bloated table will cause mdsync
to take a long time to run, and as long as mdsync is in the call stack
the antibloat feature is defeated--so we have crossed a tipping point
and cannot get back. I don't see that occurring in the current use
case, however. With my current benchmark, the anti-bloat is effective
enough that mdsync never takes very long to execute, so a virtuous
circle exists.

As an aside, the comments in dynahash.c seem to suggest that one can
always delete the entry returned by hash_seq_search, regardless of the
existence of other sequential searches. I'm pretty sure that this is
not true. Also, shouldn't this contract about when one is allowed to
delete entries be in the hsearch.h file, rather than the dynahash.c
file?

Also, I still wonder if it is worth memorizing fsyncs (under
fsync=off) that may or may not ever take place. Is there any
guarantee that we can make by doing so, that couldn't be made
otherwise?

Cheers,

Jeff

Attachment Content-Type Size
FsyncRequest_delete_v1.patch application/octet-stream 4.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-15 18:29:27
Message-ID: 19949.1342376967@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> On Thu, Jul 12, 2012 at 9:55 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> The topic was poor performance when truncating lots of small tables
>> repeatedly on test environments with fsync=off.
>>
>> On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>> I think the problem is in the Fsync Absorption queue. Every truncate
>>> adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
>>> those leads to sequential scanning the checkpointer's pending ops hash
>>> table, which is quite large. It is almost entirely full of other
>>> requests which have already been canceled, but it still has to dig
>>> through them all. So this is essentially an N^2 operation.

> The attached patch addresses this problem by deleting the entry when
> it is safe to do so, and flagging it as canceled otherwise.

I don't like this patch at all. It seems ugly and not terribly safe,
and it won't help at all when the checkpointer is in the midst of an
mdsync scan, which is a nontrivial part of its cycle.

I think what we ought to do is bite the bullet and refactor the
representation of the pendingOps table. What I'm thinking about
is reducing the hash key to just RelFileNodeBackend + ForkNumber,
so that there's one hashtable entry per fork, and then storing a
bitmap to indicate which segment numbers need to be sync'd. At
one gigabyte to the bit, I think we could expect the bitmap would
not get terribly large. We'd still have a "cancel" flag in each
hash entry, but it'd apply to the whole relation fork not each
segment.

If we did this then the FORGET_RELATION_FSYNC code path could use
a hashtable lookup instead of having to traverse the table
linearly; and that would get rid of the O(N^2) performance issue.
The performance of FORGET_DATABASE_FSYNC might still suck, but
DROP DATABASE is a pretty heavyweight operation anyhow.
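
To make that concrete, a rough sketch of what such an entry and the
FORGET_RELATION_FSYNC path might look like (struct and function names here
are purely illustrative, not proposed code):

/* Hypothetical key: one entry per relation fork. */
typedef struct
{
    RelFileNodeBackend rnode;           /* relation */
    ForkNumber         forknum;         /* fork */
} PendingFsyncKey;

typedef struct
{
    PendingFsyncKey key;                /* hash key; must be first */
    Bitmapset      *segments;           /* segment numbers needing fsync */
    bool            canceled;           /* cancel covers the whole fork */
} PendingFsyncEntry;

/* FORGET_RELATION_FSYNC becomes one O(1) lookup per fork. */
static void
forget_relation_fork(HTAB *pendingOpsTable,
                     RelFileNodeBackend rnode, ForkNumber forknum)
{
    PendingFsyncKey key;
    PendingFsyncEntry *entry;

    MemSet(&key, 0, sizeof(key));
    key.rnode = rnode;
    key.forknum = forknum;

    entry = (PendingFsyncEntry *)
        hash_search(pendingOpsTable, &key, HASH_FIND, NULL);
    if (entry != NULL)
        entry->canceled = true;
}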

I'm willing to have a go at coding this design if it sounds sane.
Comments?

> Also, I still wonder if it is worth memorizing fsyncs (under
> fsync=off) that may or may not ever take place. Is there any
> guarantee that we can make by doing so, that couldn't be made
> otherwise?

Yeah, you have a point there. It's not real clear that switching fsync
from off to on is an operation that we can make any guarantees about,
short of executing something like the code recently added to initdb
to force-sync the entire PGDATA tree. Perhaps we should change fsync
to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
skip forwarding fsync requests when it's off?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-15 22:37:12
Message-ID: 24158.1342391832@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

... btw, in the penny wise and pound foolish department, I observe that
smgrdounlink calls mdunlink separately for each possibly existing fork
of a relation to be dropped. That means we are queuing a separate fsync
queue entry for each fork, and could immediately save a factor of four
in FORGET_RELATION_FSYNC traffic if we were to redefine those queue
entries as applying to all forks. The only reason to have a per-fork
variant, AFAICS, is for smgrdounlinkfork(), which is used nowhere and
exists only because I was too chicken to remove the functionality
outright in commit ece01aae479227d9836294b287d872c5a6146a11. But given
that we know the fsync queue can be a bottleneck, my vote is to refactor
mdunlink to apply to all forks and send only one message.

I am also wondering whether it's really necessary to send fsync request
messages for backend-local relations. If rnode.backend says it's local,
can't we skip sending the fsync request? All local relations are
flush-on-crash anyway, no?
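
Roughly, a sketch of both ideas together; the function name and the use of
InvalidForkNumber as an "all forks" marker are assumptions here, not the
existing md.c API:

/* Hypothetical sketch: one request for all forks, none for local rels. */
static void
register_forget_relation(RelFileNodeBackend rnode)
{
    /* Backend-local (temp) relations are wiped after a crash anyway. */
    if (rnode.backend != InvalidBackendId)
        return;

    /* InvalidForkNumber standing in for "every fork of this relation". */
    forward_fsync_request(rnode, InvalidForkNumber,     /* hypothetical */
                          FORGET_RELATION_FSYNC);
}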

regards, tom lane


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 00:22:59
Message-ID: 50035EE3.8030607@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 07/16/2012 02:29 AM, Tom Lane wrote:
> Yeah, you have a point there. It's not real clear that switching fsync
> from off to on is an operation that we can make any guarantees about,
> short of executing something like the code recently added to initdb
> to force-sync the entire PGDATA tree.

There's one way that doesn't have any housekeeping cost to Pg. It's
pretty bad manners if there's anybody other than Pg on the system though:

sync()

Let the OS do the housekeeping.

It's possible to do something similar on Windows, in that there are
utilities for the purpose:

http://technet.microsoft.com/en-us/sysinternals/bb897438.aspx

This probably uses:

http://msdn.microsoft.com/en-us/library/s9xk9ehd%28VS.71%29.aspx

from COMMODE.OBJ (unfortunate name), which has existed since win98.

> Perhaps we should change fsync
> to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
> skip forwarding fsync requests when it's off?

Personally, I didn't even know it was runtime switchable.

fsync=off is much less necessary with async commits, group commit via
commit delay, WAL improvements, etc. To me it's mostly of utility when
testing, particularly on SSDs. I don't see a DB restart requirement as a
big issue. It'd be interesting to see what -general has to say, if there
are people depending on this.

If it's necessary to retain the ability to switch it at runtime, making
the switch trigger a somewhat rude sync() in exchange for boosted
performance the rest of the time may well be worthwhile anyway. It'd be
interesting to see.

All this talk of synchronisation is making me really frustrated that
there seems to be very poor support in OSes for syncing a set of files
in a single pass, potentially saving a lot of time and thrashing. A way
to relax the ordering guarantee from "Files are synced in the order
fsync() is called on each" to "files are all synced when this call
completes" would be great. I've been running into this issue in some
non-Pg-related work and it's been bugging me.
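
One partial workaround on Linux is to kick off writeback for the whole set
with sync_file_range() before fsync()ing each file, so the waits overlap;
a minimal sketch (not something Pg does today, and it only helps, it
doesn't relax the guarantee):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Start writeback on every file first, then wait for each one, so the
 * kernel can schedule the whole set as one burst instead of N strictly
 * ordered flushes. */
static void
sync_file_set(int *fds, int nfds)
{
    int     i;

    for (i = 0; i < nfds; i++)
        (void) sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);

    for (i = 0; i < nfds; i++)
        (void) fsync(fds[i]);
}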

--
Craig Ringer


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 01:37:41
Message-ID: 27245.1342402661@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> writes:
> On 07/16/2012 02:29 AM, Tom Lane wrote:
>> Yeah, you have a point there. It's not real clear that switching fsync
>> from off to on is an operation that we can make any guarantees about,
>> short of executing something like the code recently added to initdb
>> to force-sync the entire PGDATA tree.

> There's one way that doesn't have any housekeeping cost to Pg. It's
> pretty bad manners if there's anybody other than Pg on the system though:
> sync()

Yeah, I thought about that: if we could document that issuing a manual
sync after turning fsync on leaves you in a guaranteed-good state once
the sync is complete, it'd probably be fine. However, I'm not convinced
that we could promise that with a straight face. In the first place,
PG has only very weak guarantees about how quickly all processes in the
system will absorb a GUC update. In the second place, I'm not entirely
sure that there aren't race conditions around checkpoints and the fsync
request queue (particularly if we do what Jeff is suggesting and
suppress queuing requests at the upstream end). It might be all right,
or it might be all right after expending some work, but the whole thing
is not an area where I think anyone wants to spend time. I think it'd
be much safer to document that the correct procedure is "stop the
database, do a manual sync, enable fsync in postgresql.conf, restart the
database". And if that's what we're documenting, we lose little or
nothing by marking fsync as PGC_POSTMASTER.

regards, tom lane


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 01:43:02
Message-ID: 500371A6.5010906@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 07/16/2012 09:37 AM, Tom Lane wrote:
>> There's one way that doesn't have any housekeeping cost to Pg. It's
>> pretty bad manners if there's anybody other than Pg on the system though:
>> sync()
> Yeah, I thought about that: if we could document that issuing a manual
> sync after turning fsync on leaves you in a guaranteed-good state once
> the sync is complete, it'd probably be fine. However, I'm not convinced
> that we could promise that with a straight face. In the first place,
> PG has only very weak guarantees about how quickly all processes in the
> system will absorb a GUC update. In the second place, I'm not entirely
> sure that there aren't race conditions around checkpoints and the fsync
> request queue (particularly if we do what Jeff is suggesting and
> suppress queuing requests at the upstream end). It might be all right,
> or it might be all right after expending some work, but the whole thing
> is not an area where I think anyone wants to spend time. I think it'd
> be much safer to document that the correct procedure is "stop the
> database, do a manual sync, enable fsync in postgresql.conf, restart the
> database". And if that's what we're documenting, we lose little or
> nothing by marking fsync as PGC_POSTMASTER.
Sounds reasonable to me; I tend to view fsync=off as a testing feature
anyway. Will clone onto -general and see if anyone yells.

--
Craig Ringer


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Proposed change for 9.3(?): Require full restart to change fsync parameter, not just pg_ctl reload
Date: 2012-07-16 01:54:44
Message-ID: 50037464.7020503@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Hi all

Some performance improvements have been proposed - probably for 9.3 -
that will mean the `fsync' GUC can only be changed with a full cluster
restart. See quoted, at end of message.

It is currently possible to change `fsync' by altering postgresql.conf
and issuing a `pg_ctl reload' . It is not clear how safe this really is
even now, and changes proposed to reduce the amount of expensive
bookkeeping done when fsync is set to 'off' will make it even less safe.
Consequently, it is proposed that the ability to change the fsync
setting while Pg is running be removed.

fsync=off is very unsafe anyway, and these days production setups are
able to get similar results with async commits and group commit.

Is there anyone here relying on being able to change fsync=off to
fsync=on at runtime? If so, what for, and what does it gain you over use
of group/async commit?

For related discussion see the -hackers thread:

"DELETE vs TRUNCATE explanation"


http://archives.postgresql.org/message-id/CAMkU=1yLXvODRZZ_=fgrEeJfk2tvZPTTD-8n8BwrAhNz_WBT0A@mail.gmail.com

and the background threads:

"PostgreSQL db, 30 tables with number of rows < 100 (not huge) - the
fastest way to clean each non-empty table and reset unique identifier
column of empty ones."


http://archives.postgresql.org/message-id/CAFXpGYbgmZYij4TgCbOF24-usoiDD0ASQeaVAkYtB7E2TYm8Wg@mail.gmail.com

"DELETE vs TRUNCATE explanation"

http://archives.postgresql.org/message-id/4FFCCAC4.4030503@ringerc.id.au

On 07/16/2012 09:37 AM, Tom Lane wrote:
> Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> writes:
>> On 07/16/2012 02:29 AM, Tom Lane wrote:
>>> Yeah, you have a point there. It's not real clear that switching fsync
>>> from off to on is an operation that we can make any guarantees about,
>>> short of executing something like the code recently added to initdb
>>> to force-sync the entire PGDATA tree.
>
>> There's one way that doesn't have any housekeeping cost to Pg. It's
>> pretty bad manners if there's anybody other than Pg on the system though:
>> sync()
>
> Yeah, I thought about that: if we could document that issuing a manual
> sync after turning fsync on leaves you in a guaranteed-good state once
> the sync is complete, it'd probably be fine. However, I'm not convinced
> that we could promise that with a straight face. In the first place,
> PG has only very weak guarantees about how quickly all processes in the
> system will absorb a GUC update. In the second place, I'm not entirely
> sure that there aren't race conditions around checkpoints and the fsync
> request queue (particularly if we do what Jeff is suggesting and
> suppress queuing requests at the upstream end). It might be all right,
> or it might be all right after expending some work, but the whole thing
> is not an area where I think anyone wants to spend time. I think it'd
> be much safer to document that the correct procedure is "stop the
> database, do a manual sync, enable fsync in postgresql.conf, restart the
> database". And if that's what we're documenting, we lose little or
> nothing by marking fsync as PGC_POSTMASTER.
>
> regards, tom lane
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 15:58:59
Message-ID: CA+TgmoaTf=zjP37-oG7fr1VbGfuvHSfpajgLZ65EeSJX2mmquA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Sun, Jul 15, 2012 at 2:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I think what we ought to do is bite the bullet and refactor the
> representation of the pendingOps table. What I'm thinking about
> is reducing the hash key to just RelFileNodeBackend + ForkNumber,
> so that there's one hashtable entry per fork, and then storing a
> bitmap to indicate which segment numbers need to be sync'd. At
> one gigabyte to the bit, I think we could expect the bitmap would
> not get terribly large. We'd still have a "cancel" flag in each
> hash entry, but it'd apply to the whole relation fork not each
> segment.

I think this is a good idea.

>> Also, I still wonder if it is worth memorizing fsyncs (under
>> fsync=off) that may or may not ever take place. Is there any
>> guarantee that we can make by doing so, that couldn't be made
>> otherwise?
>
> Yeah, you have a point there. It's not real clear that switching fsync
> from off to on is an operation that we can make any guarantees about,
> short of executing something like the code recently added to initdb
> to force-sync the entire PGDATA tree. Perhaps we should change fsync
> to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
> skip forwarding fsync requests when it's off?

I am emphatically opposed to making fsync PGC_POSTMASTER. Being able
to change parameters on the fly without having to shut down the system
is important, and we should be looking for ways to make it possible to
change more things on-the-fly, not arbitrarily restricting GUCs that
already exist. This is certainly one I've changed on the fly, and I'm
willing to bet there are real-world users out there who have done the
same (e.g. to survive an unexpected load spike).

I would argue that such a change adds no measure of safety, anyway.
Suppose we have the following sequence of events, starting with
fsync=off:

T0: write
T1: checkpoint (fsync of T0 skipped since fsync=off)
T2: write
T3: fsync=on
T4: checkpoint (fsync of T2 performed)

Why is it OK to fsync the write at T2 but not the one at T0? In order
for the system to become crash-safe, the user will need to guarantee,
at some point following T3, that the entire OS buffer cache has been
flushed to disk. Whether or not the fsync of T2 happened is
irrelevant. Had we chosen not to send an fsync request at all at time
T2, the user's obligations following T3 would be entirely unchanged.
Thus, I see no reason why we need to restrict the fsync setting in
order to implement the proposed optimization.

But, at a broader level, I am not very excited about this
optimization. It seems to me that if this is hurting enough to be
noticeable, then it's hurting us when fsync=on as well, and we had
maybe better think a little harder about how to cut down on the IPC overhead.
If the bgwriter comm lock is contended, we could partition it - e.g.
by giving each backend a small queue protected by the backendLock,
which is flushed into the main queue when it fills and harvested by
the bgwriter once per checkpoint cycle. (This is the same principle
as the fast-path locking stuff that we used to eliminate lmgr
contention on short read-only queries in 9.2.) If we only fix it for
the fsync=off case, then what about people who are running with
fsync=on but have extremely fast fsyncs? Most of us probably don't
have the hardware to test that today but it's certainly out there and
will probably become more common in the future.
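
A minimal sketch of that partitioned-queue scheme, with purely illustrative
type, lock, and function names:

#define BACKEND_FSYNC_SLOTS 32

typedef struct                      /* hypothetical request record */
{
    RelFileNodeBackend rnode;
    ForkNumber         forknum;
    BlockNumber        segno;
} FsyncRequest;

typedef struct                      /* one per backend, in shared memory */
{
    LWLock        *lock;            /* "backendLock"-style per-backend lock */
    int            nrequests;
    FsyncRequest   requests[BACKEND_FSYNC_SLOTS];
} BackendFsyncQueue;

static BackendFsyncQueue *BackendFsyncQueues;

static void
queue_fsync_request(int backend_id, FsyncRequest req)
{
    BackendFsyncQueue *q = &BackendFsyncQueues[backend_id];

    LWLockAcquire(q->lock, LW_EXCLUSIVE);   /* contended only with harvester */
    if (q->nrequests == BACKEND_FSYNC_SLOTS)
    {
        /* Slot full: push the whole batch into the main shared queue. */
        flush_into_main_queue(q->requests, q->nrequests);   /* hypothetical */
        q->nrequests = 0;
    }
    q->requests[q->nrequests++] = req;
    LWLockRelease(q->lock);
}

/* The bgwriter/checkpointer would walk BackendFsyncQueues once per cycle. */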

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 16:08:39
Message-ID: 9603.1342454919@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Jul 15, 2012 at 2:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yeah, you have a point there. It's not real clear that switching fsync
>> from off to on is an operation that we can make any guarantees about,
>> short of executing something like the code recently added to initdb
>> to force-sync the entire PGDATA tree. Perhaps we should change fsync
>> to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
>> skip forwarding fsync requests when it's off?

> I would argue that such a change adds no measure of safety, anyway.

Well, yes it does, and the reason was explained further down in the
thread: since we have no particular guarantees as to how quickly
postmaster children will absorb postgresql.conf updates, there could be
individual processes still running with fsync = off long after the user
thinks he's turned it on. A forced restart solves that. I believe the
reason for the current coding in the fsync queuing stuff is so that you
only have to worry about how long it takes the checkpointer to notice
the GUC change, and not any random backend that's running a forty-hour
query.

> But, at a broader level, I am not very excited about this
> optimization. It seems to me that if this is hurting enough to be
> noticeable, then it's hurting us when fsync=on as well, and we had
> maybe better think a little harder about how to cut down on the IPC overhead.

Uh, that's exactly what's under discussion. Not sending useless fsync
requests when fsync is off is just one part of it; a part that happens
to be quite useful for some test scenarios, even if not so much for
production. (IIRC, the original complainant in this thread was running
fsync off.)

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 16:26:06
Message-ID: CA+Tgmoa4ta60Y-rZpEPDRCQKJV8-Uqe0P610Jgj2q7e+fondig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Mon, Jul 16, 2012 at 12:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Sun, Jul 15, 2012 at 2:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Yeah, you have a point there. It's not real clear that switching fsync
>>> from off to on is an operation that we can make any guarantees about,
>>> short of executing something like the code recently added to initdb
>>> to force-sync the entire PGDATA tree. Perhaps we should change fsync
>>> to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
>>> skip forwarding fsync requests when it's off?
>
>> I would argue that such a change adds no measure of safety, anyway.
>
> Well, yes it does, and the reason was explained further down in the
> thread: since we have no particular guarantees as to how quickly
> postmaster children will absorb postgresql.conf updates, there could be
> individual processes still running with fsync = off long after the user
> thinks he's turned it on. A forced restart solves that. I believe the
> reason for the current coding in the fsync queuing stuff is so that you
> only have to worry about how long it takes the checkpointer to notice
> the GUC change, and not any random backend that's running a forty-hour
> query.

Hrmf, I guess that's a fair point. But if we believe that reasoning
then I think it's an argument for sending fsync requests even when
fsync=off, not for making fsync PGC_POSTMASTER. Or maybe we could
store the current value of the fsync flag in shared memory somewhere
and have backends check it before deciding whether to enqueue a
request. With proper use of memory barriers it should be possible to
make this work without requiring a lock.
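
A minimal sketch of that idea; the struct and function names are
illustrative, only the barrier primitive is the real one:

typedef struct
{
    bool    fsync_enabled;          /* current value of the fsync GUC */
} FsyncSharedState;

static FsyncSharedState *FsyncShared;   /* lives in shared memory */

/* Publisher side: e.g. run by the checkpointer when it absorbs SIGHUP. */
static void
publish_fsync_setting(bool newval)
{
    FsyncShared->fsync_enabled = newval;
    pg_memory_barrier();        /* no loads/stores reordered across this */
}

/* Consumer side: checked by backends before enqueueing a request. */
static bool
fsync_currently_enabled(void)
{
    pg_memory_barrier();        /* pair with the publisher's fence */
    return FsyncShared->fsync_enabled;
}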

>> But, at a broader level, I am not very excited about this
>> optimization. It seems to me that if this is hurting enough to be
>> noticeable, then it's hurting us when fsync=on as well, and we had
>> maybe better think a little harder about how to cut down on the IPC overhead.
>
> Uh, that's exactly what's under discussion. Not sending useless fsync
> requests when fsync is off is just one part of it; a part that happens
> to be quite useful for some test scenarios, even if not so much for
> production. (IIRC, the original complainant in this thread was running
> fsync off.)

My point is that if sending fsync requests is cheap enough, then not
sending them won't save anything meaningful. And I don't see why it
can't be made just that cheap, thereby benefiting people with fsync=on
as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 16:36:31
Message-ID: 11952.1342456591@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Jul 16, 2012 at 12:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Uh, that's exactly what's under discussion. Not sending useless fsync
>> requests when fsync is off is just one part of it; a part that happens
>> to be quite useful for some test scenarios, even if not so much for
>> production. (IIRC, the original complainant in this thread was running
>> fsync off.)

> My point is that if sending fsync requests is cheap enough, then not
> sending them won't save anything meaningful.

Well, that argument is exactly why the code is designed the way it is...
but we are now finding out that sending useless fsync requests isn't as
cheap as all that.

The larger point here, in any case, is that I don't believe anyone wants
to expend a good deal of skull sweat and possibly performance on
ensuring that transitioning from fsync off to fsync on in an active
database is a reliable operation. It does not seem like something we
are ever going to recommend, and we have surely got nine hundred ninety
nine other things that are more useful to spend development time on.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 16:53:02
Message-ID: CA+TgmoYRXMWQ_DGmmqu3Zp8oW3Lt-9Lc1q3EHaV_ZmjWqaPkuA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Mon, Jul 16, 2012 at 12:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Jul 16, 2012 at 12:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Uh, that's exactly what's under discussion. Not sending useless fsync
>>> requests when fsync is off is just one part of it; a part that happens
>>> to be quite useful for some test scenarios, even if not so much for
>>> production. (IIRC, the original complainant in this thread was running
>>> fsync off.)
>
>> My point is that if sending fsync requests is cheap enough, then not
>> sending them won't save anything meaningful.
>
> Well, that argument is exactly why the code is designed the way it is...
> but we are now finding out that sending useless fsync requests isn't as
> cheap as all that.

I agree, but I think the problem can be solved for a pretty modest
amount of effort without needing to make fsync PGC_POSTMASTER. Your
proposal to refactor the pendingOpsTable representation seems like it
will help a lot. Perhaps you should do that first and then we can
reassess.

> The larger point here, in any case, is that I don't believe anyone wants
> to expend a good deal of skull sweat and possibly performance on
> ensuring that transitioning from fsync off to fsync on in an active
> database is a reliable operation. It does not seem like something we
> are ever going to recommend, and we have surely got nine hundred ninety
> nine other things that are more useful to spend development time on.

We may not recommend it, but I am sure that people will do it anyway,
and requiring them to bounce the server in that situation seems
unfortunate, especially since it will also require them to bounce the
server in order to go the other direction.

In my view, the elephant in the room here is that it's dramatically
inefficient for every backend to send an fsync request on every block
write. For many users, in many workloads, all of those requests will
be for just a tiny handful of relation segments. The fsync queue
compaction code works as well as it does for precisely that reason -
when it triggers, we typically can compact a list of thousands or
millions of entries down to less than two dozen. In other words, as I
see it, the issue here is not so much that 100% of the fsync requests
are useless when fsync=off, but rather that 99.9% of them are useless
even when fsync=on.
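
In the abstract, that compaction is just deduplication; a sketch of the
idea with hypothetical helper names (the real logic lives in the
checkpointer's request-queue compaction code):

/* FsyncRequest: hypothetical (rnode, forknum, segno) record. */
static int
compact_request_queue(FsyncRequest *queue, int nrequests)
{
    RequestHashSet *seen = request_hashset_create(nrequests);  /* hypothetical */
    int     nkept = 0;
    int     i;

    for (i = 0; i < nrequests; i++)
    {
        /* Keep only the first occurrence of each distinct request. */
        if (request_hashset_add(seen, &queue[i]))   /* true if not seen before */
            queue[nkept++] = queue[i];
    }

    request_hashset_destroy(seen);
    return nkept;               /* typically a tiny fraction of nrequests */
}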

In any case, I'm still of the opinion that we ought to try making one
fix (your proposed refactoring of the pendingOpsTable) and then see
where we're at.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 16:57:39
Message-ID: 13011.1342457859@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> In my view, the elephant in the room here is that it's dramatically
> inefficient for every backend to send an fsync request on every block
> write.

Yeah. This was better before the decision was taken to separate
bgwriter from checkpointer; before that, only local communication was
involved for the bulk of write operations (or at least so we hope).
I remain less than convinced that that split was really a great idea.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 18:39:58
Message-ID: CA+TgmoaEPLo5fmBX1JSkGK3umAamFVghwY7qDb4_QqzteZgUPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Mon, Jul 16, 2012 at 12:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> In my view, the elephant in the room here is that it's dramatically
>> inefficient for every backend to send an fsync request on every block
>> write.
>
> Yeah. This was better before the decision was taken to separate
> bgwriter from checkpointer; before that, only local communication was
> involved for the bulk of write operations (or at least so we hope).
> I remain less than convinced that that split was really a great idea.

Unfortunately, there are lots of important operations (like bulk
loading, SELECT * FROM bigtable, and VACUUM notverybigtable) that
inevitably end up writing out their own dirty buffers. And even when
the background writer does write something, it's not always clear that
this is a positive thing. Here's Greg Smith commenting on the
more-is-worse phenomenon:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00564.php

Jeff Janes and I came up with what I believe to be a plausible
explanation for the problem:

http://archives.postgresql.org/pgsql-hackers/2012-03/msg00356.php

I kinda think we ought to be looking at fixing that for 9.2, and
perhaps even back-patching further, but nobody else seemed terribly
excited about it.

At any rate, I'm somewhat less convinced that the split was a good
idea than I was when we did it, mostly because we haven't really gone
anywhere with it subsequently. But I do think there's a good argument
that any process which is responsible for running a system call that
can take >30 seconds to return had better not be responsible for
anything else that matters very much. If background writing is one of
the things we do that doesn't matter very much, then we need to figure
out what's wrong with it (see above) and make it matter more. If it
already matters, then it needs to happen continuously and not get
suppressed while other tasks (like long fsyncs) are happening, at
least not without some evidence that such suppression is the right
choice from a performance standpoint.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 19:03:03
Message-ID: 15645.1342465383@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Unfortunately, there are lots of important operations (like bulk
> loading, SELECT * FROM bigtable, and VACUUM notverybigtable) that
> inevitably end up writing out their own dirty buffers. And even when
> the background writer does write something, it's not always clear that
> this is a positive thing. Here's Greg Smith commenting on the
> more-is-worse phenomenon:

> http://archives.postgresql.org/pgsql-hackers/2012-02/msg00564.php

> Jeff Janes and I came up with what I believe to be a plausible
> explanation for the problem:

> http://archives.postgresql.org/pgsql-hackers/2012-03/msg00356.php

> I kinda think we ought to be looking at fixing that for 9.2, and
> perhaps even back-patching further, but nobody else seemed terribly
> excited about it.

I'd be fine with back-patching something like that into 9.2 if we had
(a) a patch and (b) experimental evidence that it made things better.
Unless I missed something, we have neither. Also, I read the above
two messages to say that you, Greg, and Jeff have three different ideas
about exactly what should be done, which is less than comforting for
a last-minute patch...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 19:18:53
Message-ID: 15955.1342466333@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> At any rate, I'm somewhat less convinced that the split was a good
> idea than I was when we did it, mostly because we haven't really gone
> anywhere with it subsequently.

BTW, while we are on the subject: hasn't this split completely broken
the statistics about backend-initiated writes? I don't see anything
in ForwardFsyncRequest that distinguishes whether it's being called in
the bgwriter or a regular backend.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 19:26:38
Message-ID: CA+TgmoYnqPf03rYJo3G6dMrqShQF2iMtfpYBPgWdxtEdacaX-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> At any rate, I'm somewhat less convinced that the split was a good
>> idea than I was when we did it, mostly because we haven't really gone
>> anywhere with it subsequently.
>
> BTW, while we are on the subject: hasn't this split completely broken
> the statistics about backend-initiated writes?

Yes, it seems to have done just that. The comment for
ForwardFsyncRequest is a few bricks short of a load too:

* Whenever a backend is compelled to write directly to a relation
* (which should be seldom, if the checkpointer is getting its job done),
* the backend calls this routine to pass over knowledge that the relation
* is dirty and must be fsync'd before next checkpoint. We also use this
* opportunity to count such writes for statistical purposes.

Line 2 seems to have been mechanically changed from "background
writer" to "checkpointer", but of course it should still say
"background writer" in this case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-16 19:46:07
Message-ID: 25169.1342467967@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Yes, it seems to have done just that. The comment for
> ForwardFsyncRequest is a few bricks short of a load too:
> ...
> Line 2 seems to have been mechanically changed from "background
> writer" to "checkpointer", but of course it should still say
> "background writer" in this case.

Yeah, found that one already (it's probably my fault).

Will see about fixing the stats in a separate patch. I just wanted to
know if the issue had been dealt with in some non-obvious fashion.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-17 22:56:50
Message-ID: 9247.1342565810@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> BTW, while we are on the subject: hasn't this split completely broken
>> the statistics about backend-initiated writes?

> Yes, it seems to have done just that.

So I went to fix this in the obvious way (attached), but while testing
it I found that the number of buffers_backend events reported during
a regression test run barely changed; which surprised the heck out of
me, so I dug deeper. The cause turns out to be extremely scary:
ForwardFsyncRequest isn't getting called at all in the bgwriter process,
because the bgwriter process has a pendingOpsTable. So it just queues
its fsync requests locally, and then never acts on them, since it never
runs any checkpoints anymore.

This implies that nobody has done pull-the-plug testing on either HEAD
or 9.2 since the checkpointer split went in (2011-11-01), because even
a modicum of such testing would surely have shown that we're failing to
fsync a significant fraction of our write traffic.

Furthermore, I would say that any performance testing done since then,
if it wasn't looking at purely read-only scenarios, isn't worth the
electrons it's written on. In particular, any performance gain that
anybody might have attributed to the checkpointer splitup is very
probably hogwash.

This is not giving me a warm feeling about our testing practices.

As far as fixing the bug is concerned, the reason for the foulup
is that mdinit() looks to IsBootstrapProcessingMode() to decide
whether to create a pendingOpsTable. That probably was all right
when it was coded, but what it means today is that *any* process
started via AuxiliaryProcessMain will have one; thus not only do
bgwriters have one, but so do walwriter and walreceiver processes;
which might not represent a bug today but it's pretty scary anyway.
I think we need to fix that so it's more directly dependent on the
auxiliary process type. We can't use flags set by the respective
FooMain() functions, such as am_bg_writer, because mdinit is called
from BaseInit() which happens before reaching those functions.
My suggestion is that bootstrap.c ought to make the process's
AuxProcType value available and then mdinit should consult that to
decide what to do. (Having done that, we might consider getting rid
of the "retail" process-type flags am_bg_writer etc.)
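
A minimal sketch of that arrangement; the exported variable and helper are
hypothetical, per the suggestion above:

/* Hypothetical: bootstrap.c exports the aux process type it was started as. */
extern AuxProcType MyAuxProcType;

/*
 * Does this process absorb and execute fsync requests itself (and hence
 * need a pendingOpsTable), rather than forwarding them?
 */
static bool
process_handles_own_fsyncs(void)
{
    /* Bootstrap and standalone backends have nobody to forward to. */
    if (IsBootstrapProcessingMode() || !IsUnderPostmaster)
        return true;

    /* Under the postmaster, only the checkpointer executes fsyncs. */
    return MyAuxProcType == CheckpointerProcess;
}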

regards, tom lane

Attachment Content-Type Size
bgwriter-stats-fix-incomplete.patch text/x-patch 3.1 KB

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-17 23:48:50
Message-ID: CAEYLb_WBjrhRPuMgnUGHvXdp+VOaRKxbJ0_vYooemMQKx=QjHA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 17 July 2012 23:56, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01), because even
> a modicum of such testing would surely have shown that we're failing to
> fsync a significant fraction of our write traffic.
>
> Furthermore, I would say that any performance testing done since then,
> if it wasn't looking at purely read-only scenarios, isn't worth the
> electrons it's written on. In particular, any performance gain that
> anybody might have attributed to the checkpointer splitup is very
> probably hogwash.
>
> This is not giving me a warm feeling about our testing practices.

The checkpointer split-up was not justified as a performance
optimisation so much as a re-factoring effort that might have some
concomitant performance benefits. While I agree that it is regrettable
that this was allowed to go undetected for so long, I do not find it
especially surprising that some performance testing results post-split
didn't strike anybody as fool's gold. Much of the theory surrounding
checkpoint tuning, if followed, results in relatively little work
being done during the sync phase of a checkpoint, especially if an I/O
scheduler like deadline is used.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 00:13:19
Message-ID: 5005FF9F.4040706@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 07/18/2012 06:56 AM, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> BTW, while we are on the subject: hasn't this split completely broken
>>> the statistics about backend-initiated writes?
>> Yes, it seems to have done just that.
> So I went to fix this in the obvious way (attached), but while testing
> it I found that the number of buffers_backend events reported during
> a regression test run barely changed; which surprised the heck out of
> me, so I dug deeper. The cause turns out to be extremely scary:
> ForwardFsyncRequest isn't getting called at all in the bgwriter process,
> because the bgwriter process has a pendingOpsTable. So it just queues
> its fsync requests locally, and then never acts on them, since it never
> runs any checkpoints anymore.
>
> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01)

That makes me wonder if on top of the buildfarm, extending some
buildfarm machines into a "crashfarm" is needed:

- Keep kvm instances with copy-on-write snapshot disks and the build env
on them
- Fire up the VM, do a build, and start the server
- From outside the vm have the test controller connect to the server and
start a test run
- Hard-kill the OS instance at a random point in time.
- Start the OS instance back up
- Start Pg back up and connect to it again
- From the test controller, test the Pg install for possible corruption
by reading the indexes and tables, doing some test UPDATEs, etc.

The main challenge would be coming up with suitable tests to run, ones
that could then be checked to make sure nothing was broken. The test
controller would know how far a test got before the OS got killed and
would know which test it was running, so it'd be able to check for
expected data if provided with appropriate test metadata. Use of enable_
flags should permit scans of indexes and table heaps to be forced.

What else should be checked? The main thing that comes to mind for me is
something I've worried about for a while: that Pg might not always
handle out-of-disk-space anywhere near as gracefully as it's often
claimed to. There's no automated testing for that, so it's hard to
really know. A harnessed VM could be used to test that. Instead of
virtual plug pull tests it could generate a virtual disk of constrained
random size, run its tests until out-of-disk caused failure, stop Pg,
expand the disk, restart Pg, and run its checks.

Variants where WAL was on a separate disk and only WAL or only the main
non-WAL disk run out of space would also make sense and be easy to
produce with such a harness.

I've written some automated kvm test harnesses, so I could have a play
with this idea. I would probably need some help with the test design,
though, and the guest OS would be Linux, Linux, or Linux at least to
start with.

Opinions?

--
Craig Ringer


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 00:31:52
Message-ID: 11099.1342571512@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> writes:
> On 07/18/2012 06:56 AM, Tom Lane wrote:
>> This implies that nobody has done pull-the-plug testing on either HEAD
>> or 9.2 since the checkpointer split went in (2011-11-01)

> That makes me wonder if on top of the buildfarm, extending some
> buildfarm machines into a "crashfarm" is needed:

Not sure if we need a whole "farm", but certainly having at least one
machine testing this sort of stuff on a regular basis would make me feel
a lot better.

> The main challenge would be coming up with suitable tests to run, ones
> that could then be checked to make sure nothing was broken.

One fairly simple test scenario could go like this:

* run the regression tests
* pg_dump the regression database
* run the regression tests again
* hard-kill immediately upon completion
* restart database, allow it to perform recovery
* pg_dump the regression database
* diff previous and new dumps; should be the same

The main thing this wouldn't cover is discrepancies in user indexes,
since pg_dump doesn't do anything that's likely to result in indexscans
on user tables. It ought to be enough to detect the sort of system-wide
problem we're talking about here, though.

In general I think the hard part is automated reproduction of an
OS-crash scenario, but your ideas about how to do that sound promising.
Once we have that going, it shouldn't be hard to come up with tests
of the form "do X, hard-crash, recover, check X still looks sane".

> What else should be checked? The main thing that comes to mind for me is
> something I've worried about for a while: that Pg might not always
> handle out-of-disk-space anywhere near as gracefully as it's often
> claimed to.

+1

regards, tom lane


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-18 03:22:22
Message-ID: 50062BEE.80700@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 07/16/2012 02:39 PM, Robert Haas wrote:
> Unfortunately, there are lots of important operations (like bulk
> loading, SELECT * FROM bigtable, and VACUUM notverybigtable) that
> inevitably end up writing out their own dirty buffers. And even when
> the background writer does write something, it's not always clear that
> this is a positive thing. Here's Greg Smith commenting on the
> more-is-worse phenomenon:
>
> http://archives.postgresql.org/pgsql-hackers/2012-02/msg00564.php

You can add "crash recovery" to the list of things where the interaction
with the OS write cache matters a lot too, something I just took a
beating and learned from recently. Since the recovery process is
essentially one giant unified backend, how effectively the background
writer and/or checkpointer move writes from recovery to themselves is
really important. It's a bit easier to characterize than a complicated
mixed set of clients, which has given me a couple of ideas to chase down.

What I've been doing for much of the last month (instead of my original
plan of reviewing patches) is getting to the bottom of characterizing
that behavior under high pressure. It provides an even easier way to compare
multiple write strategies at the OS level than regular pgbench-like
benchmarks. Recovery playback with a different tuning becomes as simple
as rolling back to a simple base backup and replaying all the WAL,
possibly including some number of bulk operations that showed up. You
can measure that speed instead of transaction-level throughput. I'm
seeing the same ~100% difference in performance between various Linux
tunings on recovery as I was getting on VACUUM tests, and it's a whole
lot easier to setup and (ahem) replicate the results. I'm putting
together a playback time benchmark based on this observation.

The fact that I have servers all over the place now with >64GB worth of
RAM has turned the topic of how much dirty memory should be used for
write caching into a hot item for me again in general too. If I live
through 9.3 development, I expect to have a lot more ideas about how to
deal with this whole area play out in the upcoming months. I could
really use a cool day to sit outside thinking about it right now.

> Jeff Janes and I came up with what I believe to be a plausible
> explanation for the problem:
>
> http://archives.postgresql.org/pgsql-hackers/2012-03/msg00356.php
>
> I kinda think we ought to be looking at fixing that for 9.2, and
> perhaps even back-patching further, but nobody else seemed terribly
> excited about it.

FYI, I never rejected any of that thinking, I just haven't chewed on
what you two were proposing. If that's still something you think should
be revisited for 9.2, I'll take a longer look at it. My feeling on this
so far has really been that the write blocking issues are much larger
than the exact logic used by the background writer in the code path you
were highlighting, which I always saw as more active/important during
idle periods. This whole area needs to get a complete overhaul during
9.3 though, especially since there are plenty of people who want to fit
checksum writes into that path too.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 04:00:08
Message-ID: 500634C8.8030302@2ndQuadrant.com
Lists: pgsql-hackers pgsql-performance

On 07/17/2012 06:56 PM, Tom Lane wrote:
> So I went to fix this in the obvious way (attached), but while testing
> it I found that the number of buffers_backend events reported during
> a regression test run barely changed; which surprised the heck out of
> me, so I dug deeper. The cause turns out to be extremely scary:
> ForwardFsyncRequest isn't getting called at all in the bgwriter process,
> because the bgwriter process has a pendingOpsTable.

When I did my testing early this year to look at checkpointer
performance (among other 9.2 write changes like group commit), I did see
some cases where buffers_backend was dramatically different on 9.2 vs.
9.1. There were plenty of cases where the totals across a 10-minute
pgbench were almost identical though, so this issue didn't stick out
then. That's a very different workload than the regression tests though.

> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01), because even
> a modicum of such testing would surely have shown that we're failing to
> fsync a significant fraction of our write traffic.

Ugh. Most of my pull-the-plug testing over the last six months has been
focused on SSD tests with older versions. I want to duplicate this (and
any potential fix) now that you've highlighted it.

> Furthermore, I would say that any performance testing done since then,
> if it wasn't looking at purely read-only scenarios, isn't worth the
> electrons it's written on. In particular, any performance gain that
> anybody might have attributed to the checkpointer splitup is very
> probably hogwash.

There hasn't been any performance testing that suggested the
checkpointer splitup was justified. The stuff I did showed it being
flat out negative for a subset of pgbench oriented cases, which didn't
seem real-world enough to disprove it as the right thing to do though.

I thought there were two valid justifications for the checkpointer split
(which is not a feature I have any corporate attachment to--I'm as
isolated from how it was developed as you are). The first is that it
seems like the right architecture to allow reworking checkpoints and
background writes for future write path optimization. A good chunk of
the time when I've tried to improve one of those (like my spread sync
stuff from last year), the code was complicated by the background writer
needing to follow the drum of checkpoint timing, and vice-versa. Being
able to hack on those independently got a sigh of relief from me. And
while this adds some code duplication in things like the process setup,
I thought the result would be cleaner for people reading the code to
follow too. This problem is terrible, but I think part of how it crept
in is that the single checkpoint+background writer process was doing way
too many things to even follow all of them some days.

The second justification for the split was that it seems easier to get a
low power result from, which I believe was the angle Peter Geoghegan was
working when this popped up originally. The checkpointer has to run
sometimes, but only at a 50% duty cycle as it's tuned out of the box.
It seems nice to be able to approach that in a way that's power
efficient without coupling it to whatever heartbeat the BGW is running
at. I could even see people changing the frequencies for each
independently depending on expected system load. Tune for lower power
when you don't expect many users, that sort of thing.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 04:20:39
Message-ID: 50063997.9030509@ringerc.id.au
Lists: pgsql-hackers pgsql-performance

On 07/18/2012 12:00 PM, Greg Smith wrote:

> The second justification for the split was that it seems easier to get
> a low power result from, which I believe was the angle Peter Geoghegan
> was working when this popped up originally. The checkpointer has to
> run sometimes, but only at a 50% duty cycle as it's tuned out of the
> box. It seems nice to be able to approach that in a way that's power
> efficient without coupling it to whatever heartbeat the BGW is running
> at. I could even see people changing the frequencies for each
> independently depending on expected system load. Tune for lower power
> when you don't expect many users, that sort of thing.
>
Yeah - I'm already seeing benefits from that on my laptop, with much
less need to stop Pg when I'm not using it.

--
Craig Ringer


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 04:57:53
Message-ID: 50064251.5010908@ringerc.id.au
Lists: pgsql-hackers pgsql-performance

On 07/18/2012 08:31 AM, Tom Lane wrote:
> Not sure if we need a whole "farm", but certainly having at least one
> machine testing this sort of stuff on a regular basis would make me feel
> a lot better.

OK. That's something I can actually be useful for.

My current qemu/kvm test harness control code is in Python since that's
what all the other tooling for the project I was using it for is in. Is
it likely to be useful for me to adapt that code for use for a Pg
crash-test harness, or will you need a particular tool/language to be
used? If so, which/what? I'll do pretty much anything except Perl. I'll
have a result for you more quickly working in Python, though I'm happy
enough to write it in C (or Java, but I'm guessing that won't get any
enthusiasm around here).

> One fairly simple test scenario could go like this:
>
> * run the regression tests
> * pg_dump the regression database
> * run the regression tests again
> * hard-kill immediately upon completion
> * restart database, allow it to perform recovery
> * pg_dump the regression database
> * diff previous and new dumps; should be the same
>
> The main thing this wouldn't cover is discrepancies in user indexes,
> since pg_dump doesn't do anything that's likely to result in indexscans
> on user tables. It ought to be enough to detect the sort of system-wide
> problem we're talking about here, though.

It also won't detect issues that only occur during certain points in
execution, under concurrent load, etc. Still, a start, and I could look
at extending it into some kind of "crash fuzzing" once the basics were
working.

> In general I think the hard part is automated reproduction of an
> OS-crash scenario, but your ideas about how to do that sound promising.

It's worked well for other testing I've done. Any writes that're still
in the guest OS's memory, write queues, etc are lost when kvm is killed,
just like a hard crash. Anything the kvm guest has flushed to "disk" is
on the host and preserved - either on the host's disks
(cache=writethrough) or at least in dirty writeback buffers in ram
(cache=writeback).

kvm can even do a decent job of simulating a BBU-equipped write-through
volume by allowing the host OS to do write-back caching of KVM's backing
device/files. You don't get to set a max write-back cache size directly,
but Linux I/O writeback settings provide some control.

My favourite thing about kvm is that it's just another command. It can
be run headless and controlled via virtual serial console and/or its
monitor socket. It doesn't require special privileges and can operate on
ordinary files. It's very well suited for hooking into test harnesses.

The only challenge with using kvm/qemu is that there have been some
breaking changes and a couple of annoying bugs that mean I won't be able
to support anything except pretty much the latest versions initially.
kvm is easy to compile and has limited dependencies, so I don't expect
that to be an issue, but thought it was worth raising.

--
Craig Ringer


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 05:56:05
Message-ID: 17623.1342590965@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Greg Smith <greg(at)2ndQuadrant(dot)com> writes:
> On 07/17/2012 06:56 PM, Tom Lane wrote:
>> Furthermore, I would say that any performance testing done since then,
>> if it wasn't looking at purely read-only scenarios, isn't worth the
>> electrons it's written on. In particular, any performance gain that
>> anybody might have attributed to the checkpointer splitup is very
>> probably hogwash.

> There hasn't been any performance testing that suggested the
> checkpointer splitup was justified. The stuff I did showed it being
> flat out negative for a subset of pgbench oriented cases, which didn't
> seem real-world enough to disprove it as the right thing to do though.

Just to clarify, I'm not saying that this means we should revert the
checkpointer split. What I *am* worried about is that we may have been
hacking other things on the basis of faulty performance tests.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 06:00:43
Message-ID: 17694.1342591243@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> writes:
> On 07/18/2012 08:31 AM, Tom Lane wrote:
>> Not sure if we need a whole "farm", but certainly having at least one
>> machine testing this sort of stuff on a regular basis would make me feel
>> a lot better.

> OK. That's something I can actually be useful for.

> My current qemu/kvm test harness control code is in Python since that's
> what all the other tooling for the project I was using it for is in. Is
> it likely to be useful for me to adapt that code for use for a Pg
> crash-test harness, or will you need a particular tool/language to be
> used? If so, which/what? I'll do pretty much anything except Perl. I'll
> have a result for you more quickly working in Python, though I'm happy
> enough to write it in C (or Java, but I'm guessing that won't get any
> enthusiasm around here).

If we were talking about code that was going to end up in the PG
distribution, I'd kind of want it to be in C or Perl, just to keep down
the number of languages we're depending on. However, it's not obvious
that a tool like this would ever go into our distribution. I'd suggest
working with what you're comfortable with, and we can worry about
translation when and if there's a reason to.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 07:30:40
Message-ID: 50066620.7090501@enterprisedb.com
Lists: pgsql-hackers pgsql-performance

On 18.07.2012 02:48, Peter Geoghegan wrote:
> On 17 July 2012 23:56, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> This implies that nobody has done pull-the-plug testing on either HEAD
>> or 9.2 since the checkpointer split went in (2011-11-01), because even
>> a modicum of such testing would surely have shown that we're failing to
>> fsync a significant fraction of our write traffic.
>>
>> Furthermore, I would say that any performance testing done since then,
>> if it wasn't looking at purely read-only scenarios, isn't worth the
>> electrons it's written on. In particular, any performance gain that
>> anybody might have attributed to the checkpointer splitup is very
>> probably hogwash.
>>
>> This is not giving me a warm feeling about our testing practices.
>
> The checkpointer slit-up was not justified as a performance
> optimisation so much as a re-factoring effort that might have some
> concomitant performance benefits.

Agreed, but it means that we need to re-run the tests that were done to
make sure the extra fsync-request traffic is not causing a performance
regression,
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01321.php.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 12:26:02
Message-ID: CA+TgmoYTv0J6QMwurM8kR6gR_OfDZ5vkw2wsQE0e+5-Oqf3A5g@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Jul 17, 2012 at 6:56 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> So I went to fix this in the obvious way (attached), but while testing
> it I found that the number of buffers_backend events reported during
> a regression test run barely changed; which surprised the heck out of
> me, so I dug deeper. The cause turns out to be extremely scary:
> ForwardFsyncRequest isn't getting called at all in the bgwriter process,
> because the bgwriter process has a pendingOpsTable. So it just queues
> its fsync requests locally, and then never acts on them, since it never
> runs any checkpoints anymore.

:-(

> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01), because even
> a modicum of such testing would surely have shown that we're failing to
> fsync a significant fraction of our write traffic.
>
> Furthermore, I would say that any performance testing done since then,
> if it wasn't looking at purely read-only scenarios, isn't worth the
> electrons it's written on. In particular, any performance gain that
> anybody might have attributed to the checkpointer splitup is very
> probably hogwash.

I don't think anybody thought that was going to result in a direct
performance gain, but I agree the performance testing needs to be
redone. I suspect that the impact on my testing is limited, because I
do mostly pgbench testing, and the lost fsync requests were probably
duplicated by non-lost fsync requests from backend writes. But I
agree that it needs to be redone once this is fixed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-18 21:17:26
Message-ID: 2358.1342646246@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Jul 16, 2012 at 12:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, that argument is exactly why the code is designed the way it is...
>> but we are now finding out that sending useless fsync requests isn't as
>> cheap as all that.

> I agree, but I think the problem can be solved for a pretty modest
> amount of effort without needing to make fsync PGC_POSTMASTER. Your
> proposal to refactor the pendingOpsTable representation seems like it
> will help a lot. Perhaps you should do that first and then we can
> reassess.
> ...
> In my view, the elephant in the room here is that it's dramatically
> inefficient for every backend to send an fsync request on every block
> write. For many users, in many workloads, all of those requests will
> be for just a tiny handful of relation segments. The fsync queue
> compaction code works as well as it does for precisely that reason -
> when it triggers, we typically can compact a list of thousands or
> millions of entries down to less than two dozen. In other words, as I
> see it, the issue here is not so much that 100% of the fsync requests
> are useless when fsync=off, but rather that 99.9% of them are useless
> even when fsync=on.

> In any case, I'm still of the opinion that we ought to try making one
> fix (your proposed refactoring of the pendingOpsTable) and then see
> where we're at.

I've been chewing on this issue some more, and no longer like my
previous proposal, which was

>>> ... What I'm thinking about
>>> is reducing the hash key to just RelFileNodeBackend + ForkNumber,
>>> so that there's one hashtable entry per fork, and then storing a
>>> bitmap to indicate which segment numbers need to be sync'd. At
>>> one gigabyte to the bit, I think we could expect the bitmap would
>>> not get terribly large. We'd still have a "cancel" flag in each
>>> hash entry, but it'd apply to the whole relation fork not each
>>> segment.

The reason that's not so attractive is the later observation that what
we really care about optimizing is FORGET_RELATION_FSYNC for all the
forks of a relation at once, which we could produce just one request
for with trivial refactoring of smgrunlink/mdunlink. The above
representation doesn't help for that. So what I'm now thinking is that
we should create a second hash table, with key RelFileNode only,
carrying two booleans: a cancel-previous-fsyncs bool and a
please-unlink-after-checkpoint bool. (The latter field would allow us
to drop the separate pending-unlinks data structure.) Entries would
be made in this table when we got a FORGET_RELATION_FSYNC or
UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get
both message types for each relation, since they're both created during
DROP. (Maybe we could even combine these request types.) To use the
table, as we scan the existing per-fork-and-segment hash table, we'd
have to do a lookup in the per-relation table to see if there was a
later cancel message for that relation. Now this does add a few cycles
to the processing of each pendingOpsTable entry in mdsync ... but
considering that the major work in that loop is an fsync call, it is
tough to believe that anybody would notice an extra hashtable lookup.
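
For concreteness, here is a minimal self-contained C sketch of that second
table; every name is invented for the illustration, and a flat array stands
in for what would really be a dynahash table keyed by RelFileNode:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { unsigned spc, db, rel; } RelKey;  /* stand-in for RelFileNode */

    typedef struct
    {
        RelKey key;                 /* hash key: one entry per relation */
        bool   cancel_fsyncs;       /* a later FORGET_RELATION_FSYNC was seen */
        bool   unlink_after_ckpt;   /* a later UNLINK_RELATION_REQUEST was seen */
    } RelCancelEntry;

    static RelCancelEntry cancel_table[64];
    static int n_cancel = 0;

    static RelCancelEntry *
    lookup_cancel(RelKey key)
    {
        for (int i = 0; i < n_cancel; i++)
            if (memcmp(&cancel_table[i].key, &key, sizeof(RelKey)) == 0)
                return &cancel_table[i];
        return NULL;
    }

    /* called when a FORGET/UNLINK message arrives for a relation */
    static void
    remember_cancel(RelKey key, bool unlink_too)
    {
        RelCancelEntry *e = lookup_cancel(key);

        if (e == NULL)
        {
            e = &cancel_table[n_cancel++];
            e->key = key;
            e->cancel_fsyncs = false;
            e->unlink_after_ckpt = false;
        }
        e->cancel_fsyncs = true;
        if (unlink_too)
            e->unlink_after_ckpt = true;
    }

    int
    main(void)
    {
        RelKey rel = {1663, 16384, 24576};

        remember_cancel(rel, true);

        /*
         * The loop over per-fork-and-segment entries in mdsync would add one
         * lookup like this per entry -- cheap next to the fsync() itself.
         */
        RelCancelEntry *e = lookup_cancel(rel);
        printf("skip pending fsyncs for this relation? %s\n",
               (e && e->cancel_fsyncs) ? "yes" : "no");
        return 0;
    }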

However, I also came up with an entirely different line of thought,
which unfortunately seems incompatible with either of the improved
table designs above. It is this: instead of having a request queue
that feeds into a hash table hidden within the checkpointer process,
what about storing the pending-fsyncs table as a shared hash table
in shared memory? That is, ForwardFsyncRequest would not simply
try to add the request to a linear array, but would do a HASH_ENTER
call on a shared hash table. This means the de-duplication occurs
for free and we no longer need CompactCheckpointerRequestQueue at all.
Basically, this would amount to saying that the original design was
wrong to try to micro-optimize the time spent in ForwardFsyncRequest,
and that we'd rather pay a little more per ForwardFsyncRequest call
to avoid the enormous response-time spike that will occur when
CompactCheckpointerRequestQueue has to run. (Not to mention that
the checkpointer would eventually have to do HASH_ENTER anyway.)
I think this would address your observation above that the request
queue tends to contain an awful lot of duplicates.

But I only see how to make that work with the existing hash table
structure, because with either of the other table designs, it's
difficult to set a predetermined limit on the amount of shared
memory needed. The segment-number bitmaps could grow uncomfortably
large in the first design, while in the second there's no good way
to know how large the per-relation table has to be to cover a given
size for the per-fork-and-segment table. (The sore spot here is that
once we've accepted a per-fork entry, failing to record a relation-level
cancel for it is not an option, so we can't just return failure.)

So if we go that way it seems like we still have the problem of
having to do hash_seq_search to implement a cancel. We could
possibly arrange for that to be done under shared rather than
exclusive lock of the hash table, but nonetheless it's not
really fixing the originally complained-of O(N^2) problem.

Another issue, which might be fatal to the whole thing, is that
it's not clear that a shared hash table similar in size to the
existing request array is big enough. The entries basically need
to live for about one checkpoint cycle, and with a slow cycle
you could need an arbitrarily large number of them.

A variant that might work a little better is to keep the main
request table still in checkpointer private memory, but to have
*both* a small hash table and a request queue in shared memory.
The idea is that you first try to enter your request in the hash
table; if successful, done (and de-duping has happened automatically).
If no room left in the hash table, add it to the request queue as
normal. The checkpointer periodically empties both the hash table
and the queue. The hash table probably doesn't have to be too huge
to be effective at de-duping requests ... but having said that,
I have no idea exactly how to size it.
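
As a purely illustrative, single-process C sketch of that variant (invented
names and sizes; the real structures would live in shared memory and be
manipulated under CheckpointerCommLock), it could look something like:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { unsigned spc, db, rel, fork, seg; } FsyncTag;

    #define DEDUP_SLOTS 8       /* deliberately small; sizing is the open question */
    #define QUEUE_SLOTS 64

    static FsyncTag dedup[DEDUP_SLOTS];
    static bool     dedup_used[DEDUP_SLOTS];
    static FsyncTag queue[QUEUE_SLOTS];
    static int      queue_len = 0;

    static bool
    tag_equal(FsyncTag a, FsyncTag b)
    {
        return a.spc == b.spc && a.db == b.db && a.rel == b.rel &&
               a.fork == b.fork && a.seg == b.seg;
    }

    /* forward one fsync request: try the de-dup table first, then the queue */
    static bool
    forward_fsync_request(FsyncTag tag)
    {
        unsigned slot = (tag.rel ^ (tag.seg * 2654435761u)) % DEDUP_SLOTS;

        if (dedup_used[slot] && tag_equal(dedup[slot], tag))
            return true;                /* duplicate: de-duped "for free" */
        if (!dedup_used[slot])
        {
            dedup_used[slot] = true;    /* claim the slot */
            dedup[slot] = tag;
            return true;
        }
        if (queue_len < QUEUE_SLOTS)
        {
            queue[queue_len++] = tag;   /* no room: fall back to the plain queue */
            return true;
        }
        return false;                   /* queue full: caller compacts or waits */
    }

    int
    main(void)
    {
        FsyncTag t = {1663, 16384, 24576, 0, 3};

        /* repeated requests for one segment collapse into a single slot */
        for (int i = 0; i < 5; i++)
            forward_fsync_request(t);
        printf("overflow queue entries: %d\n", queue_len);
        return 0;
    }

The checkpointer would periodically drain both structures; nothing in the
sketch answers how big DEDUP_SLOTS should be, which is exactly the open
point above.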

So that's a brain dump of some half-baked ideas. Thoughts, anyone?

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 21:45:08
Message-ID: CA+U5nM+GTCjEcpSvb=L-=bZ+6ngnhCsRmVuXEYih_4q2CcMcBw@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On 17 July 2012 23:56, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> BTW, while we are on the subject: hasn't this split completely broken
>>> the statistics about backend-initiated writes?
>
>> Yes, it seems to have done just that.
>
> So I went to fix this in the obvious way (attached), but while testing
> it I found that the number of buffers_backend events reported during
> a regression test run barely changed; which surprised the heck out of
> me, so I dug deeper. The cause turns out to be extremely scary:
> ForwardFsyncRequest isn't getting called at all in the bgwriter process,
> because the bgwriter process has a pendingOpsTable. So it just queues
> its fsync requests locally, and then never acts on them, since it never
> runs any checkpoints anymore.
>
> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01), because even
> a modicum of such testing would surely have shown that we're failing to
> fsync a significant fraction of our write traffic.

That problem was reported to me on list some time ago, and I made a note
to fix it after the last CF.

I added a note to 9.2 open items about it myself, but it appears my
fix was too simple and fixed only the reported problem not the
underlying issue. Reading your patch gave me strong deja vu, so not
sure what happened there.

Not very good from me. Feel free to thwack me to fix such things if I
seem not to respond quickly enough.

I'm now looking at the other open items in my area.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 12:56:51
Message-ID: CA+TgmoYg1t3jAOVQLwiCCG6eNcJ8DbVdi_3Vfr=-8-cvxm3zag@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 18, 2012 at 5:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I've been chewing on this issue some more, and no longer like my
> previous proposal, which was
>
>>>> ... What I'm thinking about
>>>> is reducing the hash key to just RelFileNodeBackend + ForkNumber,
>>>> so that there's one hashtable entry per fork, and then storing a
>>>> bitmap to indicate which segment numbers need to be sync'd. At
>>>> one gigabyte to the bit, I think we could expect the bitmap would
>>>> not get terribly large. We'd still have a "cancel" flag in each
>>>> hash entry, but it'd apply to the whole relation fork not each
>>>> segment.
>
> The reason that's not so attractive is the later observation that what
> we really care about optimizing is FORGET_RELATION_FSYNC for all the
> forks of a relation at once, which we could produce just one request
> for with trivial refactoring of smgrunlink/mdunlink. The above
> representation doesn't help for that. So what I'm now thinking is that
> we should create a second hash table, with key RelFileNode only,
> carrying two booleans: a cancel-previous-fsyncs bool and a
> please-unlink-after-checkpoint bool. (The latter field would allow us
> to drop the separate pending-unlinks data structure.) Entries would
> be made in this table when we got a FORGET_RELATION_FSYNC or
> UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get
> both message types for each relation, since they're both created during
> DROP. (Maybe we could even combine these request types.) To use the
> table, as we scan the existing per-fork-and-segment hash table, we'd
> have to do a lookup in the per-relation table to see if there was a
> later cancel message for that relation. Now this does add a few cycles
> to the processing of each pendingOpsTable entry in mdsync ... but
> considering that the major work in that loop is an fsync call, it is
> tough to believe that anybody would notice an extra hashtable lookup.

Seems a bit complex, but it might be worth it. Keep in mind that I
eventually want to be able to make an unlogged table logged or a vice
versa, which will probably entail unlinking just the init fork (for
the logged -> unlogged direction).

> However, I also came up with an entirely different line of thought,
> which unfortunately seems incompatible with either of the improved
> table designs above. It is this: instead of having a request queue
> that feeds into a hash table hidden within the checkpointer process,
> what about storing the pending-fsyncs table as a shared hash table
> in shared memory? That is, ForwardFsyncRequest would not simply
> try to add the request to a linear array, but would do a HASH_ENTER
> call on a shared hash table. This means the de-duplication occurs
> for free and we no longer need CompactCheckpointerRequestQueue at all.
> Basically, this would amount to saying that the original design was
> wrong to try to micro-optimize the time spent in ForwardFsyncRequest,
> and that we'd rather pay a little more per ForwardFsyncRequest call
> to avoid the enormous response-time spike that will occur when
> CompactCheckpointerRequestQueue has to run. (Not to mention that
> the checkpointer would eventually have to do HASH_ENTER anyway.)
> I think this would address your observation above that the request
> queue tends to contain an awful lot of duplicates.

I'm not concerned about the queue *containing* a large number of
duplicates; I'm concerned about the large number of duplicate
*requests*. Under either the current system or this proposal, every
time we write a block, we must take and release CheckpointerCommLock.
Now, I have no evidence that there's actually a bottleneck there, but
if there is, this proposal won't fix it. In fact, I suspect on the
whole it would make things worse, because while it's true that
CompactCheckpointerRequestQueue is expensive, it shouldn't normally be
happening at all, because the checkpointer should be draining the
queue regularly enough to prevent it from filling. So except when the
system is in the pathological state where the checkpointer becomes
unresponsive because it's blocked in-kernel on a very long fsync and
there is a large amount of simultaneous write activity, each process
that acquires CheckpointerCommLock holds it for just long enough to
slam a few bytes of data into the queue, which is very cheap. I
suspect that updating a hash table would be significantly more
expensive, and we'd pay whatever that extra overhead is on every fsync
request, not just in the unusual case where we manage to fill the
queue. So I don't think this is likely to be a win.

If you think about the case of an UPDATE statement that hits a large
number of blocks in the same relation, it sends an fsync request for
every single block. Really, it's only necessary to send a new fsync
request if the checkpointer has begun a new checkpoint cycle in the
meantime; otherwise, the old request is still pending and will cover
the new write as well. But there's no way for the backend doing the
writes to know whether that's happened, so it just sends a request
every time. That's not necessarily a problem, because, again, I have
no evidence whatsoever that CheckpointerCommLock is contented, or that
the overhead of sending those requests is significant. But if it is
then we need a solution that does not require acquisition of a
system-wide lwlock on every block write.

> But I only see how to make that work with the existing hash table
> structure, because with either of the other table designs, it's
> difficult to set a predetermined limit on the amount of shared
> memory needed. The segment-number bitmaps could grow uncomfortably
> large in the first design, while in the second there's no good way
> to know how large the per-relation table has to be to cover a given
> size for the per-fork-and-segment table. (The sore spot here is that
> once we've accepted a per-fork entry, failing to record a relation-level
> cancel for it is not an option, so we can't just return failure.)

Moreover, even if it were technically an option, we know from
experience that failure to absorb fsync requests has disastrous
performance consequences.

> So if we go that way it seems like we still have the problem of
> having to do hash_seq_search to implement a cancel. We could
> possibly arrange for that to be done under shared rather than
> exclusive lock of the hash table, but nonetheless it's not
> really fixing the originally complained-of O(N^2) problem.

Yep. In fact it's making it worse, because AIUI the existing
hash_seq_search calls are happening in backend-private memory while
holding no lock. Doing it on a shared-memory hash table while holding
a high-traffic LWLock figures to be much worse.

> Another issue, which might be fatal to the whole thing, is that
> it's not clear that a shared hash table similar in size to the
> existing request array is big enough. The entries basically need
> to live for about one checkpoint cycle, and with a slow cycle
> you could need an arbitrarily large number of them.

Yep.

> A variant that might work a little better is to keep the main
> request table still in checkpointer private memory, but to have
> *both* a small hash table and a request queue in shared memory.
> The idea is that you first try to enter your request in the hash
> table; if successful, done (and de-duping has happened automatically).
> If no room left in the hash table, add it to the request queue as
> normal. The checkpointer periodically empties both the hash table
> and the queue. The hash table probably doesn't have to be too huge
> to be effective at de-duping requests ... but having said that,
> I have no idea exactly how to size it.

I think this is just over-engineered. The originally complained-of
problem was all about the inefficiency of manipulating the
checkpointer's backend-private data structures, right? I don't see
any particular need to mess with the shared memory data structures at
all. If you wanted to add some de-duping logic to retail fsync
requests, you could probably accomplish that more cheaply by having
each such request look at the last half-dozen or so items in the queue
and skip inserting the new request if any of them match the new
request. But I think that'd probably be a net loss, because it would
mean holding the lock for longer.
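
As a concrete sketch of that look-back idea (invented names, single process,
no locking shown), where checking only the immediately previous entry is
simply the lookback-of-one case:

    #include <stdio.h>
    #include <string.h>

    typedef struct { unsigned spc, db, rel, fork, seg; } FsyncTag;

    static FsyncTag queue[1024];
    static int      queue_len = 0;

    static void
    enqueue_fsync_request(FsyncTag tag)
    {
        /* scan back over at most the last half-dozen queued requests */
        int lookback = (queue_len < 6) ? queue_len : 6;

        for (int i = 1; i <= lookback; i++)
            if (memcmp(&queue[queue_len - i], &tag, sizeof(FsyncTag)) == 0)
                return;                 /* recent duplicate: skip the insert */

        queue[queue_len++] = tag;
    }

    int
    main(void)
    {
        FsyncTag t = {1663, 16384, 24576, 0, 7};

        /* a bulk-load-like run of identical requests collapses to one entry */
        for (int i = 0; i < 500; i++)
            enqueue_fsync_request(t);
        printf("entries actually queued: %d\n", queue_len);
        return 0;
    }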

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 14:09:26
Message-ID: 20408.1342706966@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Seems a bit complex, but it might be worth it. Keep in mind that I
>> eventually want to be able to make an unlogged table logged or a vice
> versa, which will probably entail unlinking just the init fork (for
> the logged -> unlogged direction).

Well, as far as that goes, I don't see a reason why you couldn't unlink
the init fork immediately on commit. The checkpointer should not have
to be involved at all --- there's no reason to send it a FORGET FSYNC
request either, because there shouldn't be any outstanding writes
against an init fork, no?

But having said that, this does serve as an example that we might
someday want the flexibility to kill individual forks. I was
intending to kill smgrdounlinkfork altogether, but I'll refrain.

> I think this is just over-engineered. The originally complained-of
> problem was all about the inefficiency of manipulating the
> checkpointer's backend-private data structures, right? I don't see
> any particular need to mess with the shared memory data structures at
> all. If you wanted to add some de-duping logic to retail fsync
> requests, you could probably accomplish that more cheaply by having
> each such request look at the last half-dozen or so items in the queue
> and skip inserting the new request if any of them match the new
> request. But I think that'd probably be a net loss, because it would
> mean holding the lock for longer.

What about checking just the immediately previous entry? This would
at least fix the problem for bulk-load situations, and the cost ought
to be about negligible compared to acquiring the LWLock.

I have also been wondering about de-duping on the backend side, but
the problem is that if a backend remembers its last few requests,
it doesn't know when that cache has to be cleared because of a new
checkpoint cycle starting. We could advertise the current cycle
number in shared memory, but you'd still need to take a lock to
read it. (If we had memory fence primitives it could be a bit
cheaper, but I dunno how much.)
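
A rough sketch of that backend-side cache, with invented names; the lock or
fence needed to read the advertised counter safely is precisely the part
elided here:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { unsigned spc, db, rel, fork, seg; } FsyncTag;

    /* would live in shared memory, bumped when a new checkpoint cycle starts */
    static unsigned shared_ckpt_cycle = 0;

    /* backend-private memo of the last request this backend forwarded */
    static FsyncTag last_tag;
    static unsigned last_cycle;
    static bool     have_last = false;

    static bool
    need_to_forward(FsyncTag tag)
    {
        unsigned cycle = shared_ckpt_cycle;     /* needs a lock or fence in reality */

        if (have_last && cycle == last_cycle &&
            memcmp(&tag, &last_tag, sizeof(tag)) == 0)
            return false;                       /* already sent during this cycle */

        last_tag = tag;
        last_cycle = cycle;
        have_last = true;
        return true;
    }

    int
    main(void)
    {
        FsyncTag t = {1663, 16384, 24576, 0, 2};
        int sent = 0;

        for (int i = 0; i < 1000; i++)          /* e.g. a big UPDATE on one segment */
            if (need_to_forward(t))
                sent++;
        shared_ckpt_cycle++;                    /* a new checkpoint cycle begins */
        if (need_to_forward(t))
            sent++;
        printf("requests actually forwarded: %d\n", sent);
        return 0;
    }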

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 16:17:12
Message-ID: CA+TgmoagwPYG8QO3ykccp4_dpYy_Y_KKzF4QQBt_TPaFKXV9pg@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Seems a bit complex, but it might be worth it. Keep in mind that I
>> eventually want to be able to make an unlogged table logged or a vice
>> versa, which will probably entail unlinking just the init fork (for
>> the logged -> unlogged direction).
>
> Well, as far as that goes, I don't see a reason why you couldn't unlink
> the init fork immediately on commit. The checkpointer should not have
> to be involved at all --- there's no reason to send it a FORGET FSYNC
> request either, because there shouldn't be any outstanding writes
> against an init fork, no?

Well, it gets written when it gets created. Some of those writes go
through shared_buffers.

> But having said that, this does serve as an example that we might
> someday want the flexibility to kill individual forks. I was
> intending to kill smgrdounlinkfork altogether, but I'll refrain.

If you want to remove it, it's OK with me. We can always put it back
later if it's needed. We have an SCM that allows us to revert
patches. :-)

> What about checking just the immediately previous entry? This would
> at least fix the problem for bulk-load situations, and the cost ought
> to be about negligible compared to acquiring the LWLock.

Well, two things:

1. If a single bulk load is the ONLY activity on the system, or more
generally if only one segment in the system is being heavily written,
then that would reduce the number of entries that get added to the
queue, but if you're doing two bulk loads on different tables at the
same time, then it might not do much. From Greg Smith's previous
comments on this topic, I understand that having two or three entries
alternating in the queue is a fairly common pattern.

2. You say "fix the problem" but I'm not exactly clear what problem
you think this fixes. It's true that the compaction code is a lot
slower than an ordinary queue insertion, but I think it generally
doesn't happen enough to matter, and when it does happen the system is
generally I/O bound anyway, so who cares? One possible argument in
favor of doing something along these lines is that it would reduce the
amount of data that the checkpointer would have to copy while holding
the lock, thus causing less disruption for other processes trying to
insert into the request queue. But I don't know whether that effect
is significant enough to matter.

> I have also been wondering about de-duping on the backend side, but
> the problem is that if a backend remembers its last few requests,
> it doesn't know when that cache has to be cleared because of a new
> checkpoint cycle starting. We could advertise the current cycle
> number in shared memory, but you'd still need to take a lock to
> read it. (If we had memory fence primitives it could be a bit
> cheaper, but I dunno how much.)

Well, we do have those, as of 9.2. They're not being used for anything
yet, but I've been looking for an opportunity to put them into use.
sinvaladt.c's msgnumLock is an obvious candidate, but the 9.2 changes
to reduce the impact of sinval synchronization work sufficiently well
that I haven't been motivated to tinker with it any further. Maybe it
would be worth doing just to exercise that code, though.

Or, maybe we can use them here. But after some thought I can't see
exactly how we'd do it. Memory barriers prevent a value from being
prefetched too early or written back to main memory too late, relative
to other memory operations by the same process, but the definition of
"too early" and "too late" is not quite clear to me here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 18:57:26
Message-ID: 10862.1342724246@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> What about checking just the immediately previous entry? This would
>> at least fix the problem for bulk-load situations, and the cost ought
>> to be about negligible compared to acquiring the LWLock.

> 2. You say "fix the problem" but I'm not exactly clear what problem
> you think this fixes.

What I'm concerned about is that there is going to be a great deal more
fsync request queue traffic in 9.2 than there ever was before, as a
consequence of the bgwriter/checkpointer split. The design expectation
for this mechanism was that most fsync requests would be generated
locally inside the bgwriter and thus go straight into the hash table
without having to go through the shared-memory queue. I admit that
we have seen no benchmarks showing that there's a problem, but that's
because up till yesterday the bgwriter was failing to transmit such
messages at all. So I'm looking for ways to cut the overhead.

But having said that, maybe we should not panic until we actually see
some benchmarks showing the problem.

Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
I have been looking at the two-hash-tables design I suggested before,
and realized that there's a timing issue: if we just stuff "forget"
requests into a separate table, there is no method for determining
whether a given fsync request arrived before or after a given forget
request. This is problematic if the relfilenode gets recycled: we
need to be able to guarantee that a previously-posted forget request
won't cancel a valid fsync for the new relation. I believe this is
soluble though, if we merge the "forget" requests with unlink requests,
because a relfilenode can't be recycled until we do the unlink.
So as far as the code goes:

1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
RelFileNode. It acts the same as before, and shouldn't be materially
slower to process, but now we can determine in O(1) time whether there
is a pending unlink for a relfilenode.

2. Treat the existence of a pending unlink request as a relation-wide
fsync cancel; so the loop in mdsync needs one extra hashtable lookup
to determine validity of a PendingOperationEntry. As before, this
should not matter much considering that we're about to do an fsync().

3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
message if it is sending an UNLINK_RELATION_REQUEST. (A side benefit
is that this gives us another 2X reduction in fsync queue traffic,
and not just any queue traffic but the type of traffic that we must
not fail to queue.)

The FORGET_RELATION_FSYNC code path will still exist, and will still
require a full hashtable scan, but we don't care because it isn't
being used in common situations. It would only be needed for stuff
like killing an init fork.

The argument that this is safe involves these points:

* mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
ftruncate on the main fork's first segment, because otherwise that
segment could theoretically get unlinked from under it before it can do
the truncate. But this is okay since the ftruncate won't cause any
fsync the checkpointer might concurrently be doing to fail. The
request *will* be sent before we unlink any other files, so mdsync
will be able to recover if it gets an fsync failure due to concurrent
unlink.

* Because a relfilenode cannot be recycled until we process and delete
the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
new fsync requests to arrive while the PendingUnlinkEntry still exists
to cause them to be considered canceled.

* Because we only process and delete PendingUnlinkEntrys that have been
there since before the checkpoint started, we can be sure that any
PendingOperationEntrys referring to the relfilenode will have been
scanned and deleted by mdsync before we remove the PendingUnlinkEntry.

Unless somebody sees a hole in this logic, I'll go make this happen.
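
Compressed into a stand-alone C sketch (invented names, and a flat array
standing in for the hash table), the three steps fit together roughly like
this:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { unsigned spc, db, rel; } RelKey;   /* stand-in for RelFileNode */

    /* step 1: pending unlinks become a lookup structure keyed by RelFileNode */
    static RelKey pending_unlinks[64];
    static int    n_pending_unlinks = 0;

    static bool
    has_pending_unlink(RelKey key)
    {
        for (int i = 0; i < n_pending_unlinks; i++)
            if (memcmp(&pending_unlinks[i], &key, sizeof(RelKey)) == 0)
                return true;
        return false;
    }

    /* step 3: the drop path queues only the unlink request */
    static void
    mdunlink_sketch(RelKey key)
    {
        pending_unlinks[n_pending_unlinks++] = key;     /* UNLINK_RELATION_REQUEST */
        /* no separate FORGET_RELATION_FSYNC message is sent */
    }

    /* step 2: mdsync-style validity check, one extra lookup per pending entry */
    static bool
    fsync_still_needed(RelKey key)
    {
        return !has_pending_unlink(key);
    }

    int
    main(void)
    {
        RelKey rel = {1663, 16384, 24576};

        mdunlink_sketch(rel);
        printf("fsync still needed for rel %u? %s\n",
               rel.rel, fsync_still_needed(rel) ? "yes" : "no");
        return 0;
    }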

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 20:03:20
Message-ID: CA+TgmobkLaBJa5+HG9VsV_MNJGiVEBymio-msdKETZiPk1q36Q@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Thu, Jul 19, 2012 at 2:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> What about checking just the immediately previous entry? This would
>>> at least fix the problem for bulk-load situations, and the cost ought
>>> to be about negligible compared to acquiring the LWLock.
>
>> 2. You say "fix the problem" but I'm not exactly clear what problem
>> you think this fixes.
>
> What I'm concerned about is that there is going to be a great deal more
> fsync request queue traffic in 9.2 than there ever was before, as a
> consequence of the bgwriter/checkpointer split. The design expectation
> for this mechanism was that most fsync requests would be generated
> locally inside the bgwriter and thus go straight into the hash table
> without having to go through the shared-memory queue. I admit that
> we have seen no benchmarks showing that there's a problem, but that's
> because up till yesterday the bgwriter was failing to transmit such
> messages at all. So I'm looking for ways to cut the overhead.
>
> But having said that, maybe we should not panic until we actually see
> some benchmarks showing the problem.

+1 for not panicking. I'm prepared to believe that there could be a
problem here, but I'm not prepared to believe that we've characterized
it well enough to be certain that any changes we choose to make will
make things better not worse.

> Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
> I have been looking at the two-hash-tables design I suggested before,
> and realized that there's a timing issue: if we just stuff "forget"
> requests into a separate table, there is no method for determining
> whether a given fsync request arrived before or after a given forget
> request. This is problematic if the relfilenode gets recycled: we
> need to be able to guarantee that a previously-posted forget request
> won't cancel a valid fsync for the new relation. I believe this is
> soluble though, if we merge the "forget" requests with unlink requests,
> because a relfilenode can't be recycled until we do the unlink.
> So as far as the code goes:
>
> 1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
> RelFileNode. It acts the same as before, and shouldn't be materially
> slower to process, but now we can determine in O(1) time whether there
> is a pending unlink for a relfilenode.
>
> 2. Treat the existence of a pending unlink request as a relation-wide
> fsync cancel; so the loop in mdsync needs one extra hashtable lookup
> to determine validity of a PendingOperationEntry. As before, this
> should not matter much considering that we're about to do an fsync().
>
> 3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
> message if it is sending an UNLINK_RELATION_REQUEST. (A side benefit
> is that this gives us another 2X reduction in fsync queue traffic,
> and not just any queue traffic but the type of traffic that we must
> not fail to queue.)
>
> The FORGET_RELATION_FSYNC code path will still exist, and will still
> require a full hashtable scan, but we don't care because it isn't
> being used in common situations. It would only be needed for stuff
> like killing an init fork.
>
> The argument that this is safe involves these points:
>
> * mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
> ftruncate on the main fork's first segment, because otherwise that
> segment could theoretically get unlinked from under it before it can do
> the truncate. But this is okay since the ftruncate won't cause any
> fsync the checkpointer might concurrently be doing to fail. The
> request *will* be sent before we unlink any other files, so mdsync
> will be able to recover if it gets an fsync failure due to concurrent
> unlink.
>
> * Because a relfilenode cannot be recycled until we process and delete
> the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
> new fsync requests to arrive while the PendingUnlinkEntry still exists
> to cause them to be considered canceled.
>
> * Because we only process and delete PendingUnlinkEntrys that have been
> there since before the checkpoint started, we can be sure that any
> PendingOperationEntrys referring to the relfilenode will have been
> scanned and deleted by mdsync before we remove the PendingUnlinkEntry.
>
> Unless somebody sees a hole in this logic, I'll go make this happen.

What if we change the hash table to have RelFileNode as the key and an
array of MAX_FORKNUM bitmapsets as the value? Then when you get a
"forget" request, you can just zap all the sets to empty. That seems
a whole lot simpler than your proposal and I don't see any real
downside. I can't actually poke a hole in your logic at the moment
but a simpler system that requires no assumptions about filesystem
behavior seems preferable to me.

You can still make an unlink request imply a corresponding
forget-request if you want, but now that's a separate optimization.
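
A self-contained sketch of that table shape (stand-in types throughout; the
real code would key a hash table by RelFileNode and use a Bitmapset per fork
rather than the fixed-width mask used here):

    #include <stdio.h>

    #define N_FORKS  4      /* main, fsm, vm and init forks at the time of writing */
    #define MAX_SEGS 64     /* toy limit so a plain bitmask can stand in for a Bitmapset */

    typedef struct { unsigned spc, db, rel; } RelKey;

    typedef struct
    {
        RelKey        key;              /* hash key: one entry per relation */
        unsigned long segs[N_FORKS];    /* bit i set => segment i of that fork needs fsync */
    } RelFsyncEntry;

    static void
    remember_fsync(RelFsyncEntry *e, int fork, int segno)
    {
        if (segno < MAX_SEGS)
            e->segs[fork] |= 1UL << segno;
    }

    /* a "forget" request just zaps every fork's set; nothing is unlinked */
    static void
    forget_relation(RelFsyncEntry *e)
    {
        for (int f = 0; f < N_FORKS; f++)
            e->segs[f] = 0;
    }

    int
    main(void)
    {
        RelFsyncEntry e = {{1663, 16384, 24576}, {0}};

        remember_fsync(&e, 0, 3);
        remember_fsync(&e, 1, 0);
        forget_relation(&e);
        printf("main-fork segments still pending: %lx\n", e.segs[0]);
        return 0;
    }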

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 21:02:08
Message-ID: 13265.1342731728@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> What if we change the hash table to have RelFileNode as the key and an
> array of MAX_FORKNUM bitmapsets as the value? Then when you get a
> "forget" request, you can just zap all the sets to empty.

Hm ... the only argument I can really make against that is that there'll
be no way to move such a table into shared memory; but there's probably
little hope of that anyway, given points made upthread. The bitmapset
manipulations are a bit tricky but solvable, and I agree there's
something to be said for not tying this stuff so closely to the
mechanism for relfilenode recycling.

regards, tom lane


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-23 04:37:33
Message-ID: 20120723043733.GA13020@fetter.org
Lists: pgsql-hackers pgsql-performance

On Tue, Jul 17, 2012 at 06:56:50PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> BTW, while we are on the subject: hasn't this split completely
> >> broken the statistics about backend-initiated writes?
>
> > Yes, it seems to have done just that.
>
> This implies that nobody has done pull-the-plug testing on either
> HEAD or 9.2 since the checkpointer split went in (2011-11-01),
> because even a modicum of such testing would surely have shown that
> we're failing to fsync a significant fraction of our write traffic.
>
> Furthermore, I would say that any performance testing done since
> then, if it wasn't looking at purely read-only scenarios, isn't
> worth the electrons it's written on. In particular, any performance
> gain that anybody might have attributed to the checkpointer splitup
> is very probably hogwash.
>
> This is not giving me a warm feeling about our testing practices.

Is there any part of this that the buildfarm, or some other automation
framework, might be able to handle?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: David Fetter <david(at)fetter(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-23 12:29:16
Message-ID: 500D439C.50608@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance


On 07/23/2012 12:37 AM, David Fetter wrote:
> On Tue, Jul 17, 2012 at 06:56:50PM -0400, Tom Lane wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> BTW, while we are on the subject: hasn't this split completely
>>>> broken the statistics about backend-initiated writes?
>>> Yes, it seems to have done just that.
>> This implies that nobody has done pull-the-plug testing on either
>> HEAD or 9.2 since the checkpointer split went in (2011-11-01),
>> because even a modicum of such testing would surely have shown that
>> we're failing to fsync a significant fraction of our write traffic.
>>
>> Furthermore, I would say that any performance testing done since
>> then, if it wasn't looking at purely read-only scenarios, isn't
>> worth the electrons it's written on. In particular, any performance
>> gain that anybody might have attributed to the checkpointer splitup
>> is very probably hogwash.
>>
>> This is not giving me a warm feeling about our testing practices.
> Is there any part of this that the buildfarm, or some other automation
> framework, might be able to handle?
>

I'm not sure how you automate testing a pull-the-plug scenario.

The buildfarm is not at all designed to test performance. That's why we
want a performance farm.

cheers

andrew


From: David Fetter <david(at)fetter(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-23 12:41:23
Message-ID: 20120723124123.GA20499@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Mon, Jul 23, 2012 at 08:29:16AM -0400, Andrew Dunstan wrote:
>
> On 07/23/2012 12:37 AM, David Fetter wrote:
> >On Tue, Jul 17, 2012 at 06:56:50PM -0400, Tom Lane wrote:
> >>Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >>>On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>>>BTW, while we are on the subject: hasn't this split completely
> >>>>broken the statistics about backend-initiated writes?
> >>>Yes, it seems to have done just that.
> >>This implies that nobody has done pull-the-plug testing on either
> >>HEAD or 9.2 since the checkpointer split went in (2011-11-01),
> >>because even a modicum of such testing would surely have shown that
> >>we're failing to fsync a significant fraction of our write traffic.
> >>
> >>Furthermore, I would say that any performance testing done since
> >>then, if it wasn't looking at purely read-only scenarios, isn't
> >>worth the electrons it's written on. In particular, any performance
> >>gain that anybody might have attributed to the checkpointer splitup
> >>is very probably hogwash.
> >>
> >>This is not giving me a warm feeling about our testing practices.
> >Is there any part of this that the buildfarm, or some other automation
> >framework, might be able to handle?
> >
>
> I'm not sure how you automate testing a pull-the-plug scenario.

I have a dim memory of how the FreeBSD project was alleged to have
done it, namely by rigging a serial port (yes, it was that long ago)
to the power supply of another machine and randomly cycling the power.

> The buildfarm is not at all designed to test performance. That's why
> we want a performance farm.

Right. Apart from hardware, what are we stalled on?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: David Fetter <david(at)fetter(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-23 12:56:38
Message-ID: 500D4A06.2050404@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance


On 07/23/2012 08:41 AM, David Fetter wrote:
>> The buildfarm is not at all designed to test performance. That's why
>> we want a performance farm.
> Right. Apart from hardware, what are we stalled on?
>

Software :-)

I am trying to find some cycles to get something going.

cheers

andrew


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-23 13:04:46
Message-ID: 500D4BEE.7010205@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 07/23/2012 08:29 PM, Andrew Dunstan wrote:

> I'm not sure how you automate testing a pull-the-plug scenario.

fire up kvm or qemu instances, then kill 'em.

--
Craig Ringer


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-23 13:47:16
Message-ID: 500D55E4.40002@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance


On 07/23/2012 09:04 AM, Craig Ringer wrote:
> On 07/23/2012 08:29 PM, Andrew Dunstan wrote:
>
>> I'm not sure how you automate testing a pull-the-plug scenario.
>
> fire up kvm or qemu instances, then kill 'em.
>
>

Yeah, maybe. Knowing just when to kill them might be an interesting
question.

I'm also unsure how much nice cleanup the host supervisor does in such
cases. VMs are wonderful things, but they aren't always the answer. I'm
not saying they aren't here, just wondering.

cheers

andrew


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-23 13:58:47
Message-ID: 500D5897.705@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 07/23/2012 09:47 PM, Andrew Dunstan wrote:
>
> On 07/23/2012 09:04 AM, Craig Ringer wrote:
>> On 07/23/2012 08:29 PM, Andrew Dunstan wrote:
>>
>>> I'm not sure how you automate testing a pull-the-plug scenario.
>>
>> fire up kvm or qemu instances, then kill 'em.
>>
>>
>
> Yeah, maybe. Knowing just when to kill them might be an interesting
> question.
>
> I'm also unsure how much nice cleanup the host supervisor does in such
> cases. VMs are wonderful things, but they aren't always the answer.
> I'm not saying they aren't here, just wondering.
I've done some testing with this, and what it boils down to is that any
data that made it to the virtual disk is persistent after a VM kill.
Anything in dirty buffers on the VM guest is lost. It's a very close
match for real hardware. I haven't tried to examine the details of the
handling of virtualised disk hardware write caches, but disks should be
in write-through mode anyway; in any case a `kill -9` wipes out
anything cached by the emulated hardware, since the guest gets no
chance to do any cleanup.

One of the great things about kvm and qemu for this sort of testing is
that it's just another program. There's very little magic, and it's
quite easy to test and trace.

I have a qemu/kvm test harness I've been using for another project that
I need to update and clean up as it'd be handy for this. It's just a
matter of making the time, as it's been a busy few days.

--
Craig Ringer


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-23 15:02:51
Message-ID: CAMkU=1weLk1dKC7XUv8kPWKw0FQtE25ztPZLZEXyFUXKhqP1AA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Mon, Jul 23, 2012 at 5:41 AM, David Fetter <david(at)fetter(dot)org> wrote:
> On Mon, Jul 23, 2012 at 08:29:16AM -0400, Andrew Dunstan wrote:
>>
>>
>> I'm not sure how you automate testing a pull-the-plug scenario.
>
> I have a dim memory of how the FreeBSD project was alleged to have
> done it, namely by rigging a serial port (yes, it was that long ago)
> to the power supply of another machine and randomly cycling the power.

On Linux,

echo b > /proc/sysrq-trigger

is supposed to take it down instantly, with no flushing of dirty buffers.
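
So a harness could crash the victim box programmatically with something
as small as this (a hypothetical helper, not an existing tool; it has to
run as root):

#!/usr/bin/env python3
# Trigger an immediate reboot: no sync, no unmount, so dirty OS buffers
# are lost much as if the plug had been pulled.  Linux only, root only.
with open("/proc/sysrq-trigger", "w") as f:
    f.write("b")    # 'b' = reboot the machine immediately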

Cheers,

Jeff


From: Greg Stark <stark(at)mit(dot)edu>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-24 13:33:26
Message-ID: CAM-w4HMeTjbVKooUq_sEqjnEp+7fSZQj823XaE5geQ7-in4pdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 18, 2012 at 1:13 AM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:

> That makes me wonder if on top of the buildfarm, extending some buildfarm
> machines into a "crashfarm" is needed:
>
> - Keep kvm instances with copy-on-write snapshot disks and the build env
> on them
> - Fire up the VM, do a build, and start the server
> - From outside the vm have the test controller connect to the server and
> start a test run
> - Hard-kill the OS instance at a random point in time.
>

For what it's worth, you don't need to do a hard kill of the VM and start
over repeatedly to kill at different times. You could take a snapshot of
the disk storage and keep running, and you could take many snapshots from
a single run. Each snapshot would represent the storage that would exist
if the machine had crashed at the moment the snapshot was taken.

You do want the snapshots to be taken by something outside the virtual
machine: either the kvm storage layer, or LVM on the host, but not LVM
inside the guest.
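
As a sketch only (the volume group and LV names are made up, and it
assumes the guest's disk is backed by an LVM logical volume on the host),
a test controller might do something like:

#!/usr/bin/env python3
# Illustrative only: while a workload runs inside the guest, take periodic
# host-side LVM snapshots of the volume backing its virtual disk.  Each
# snapshot captures the on-disk state the guest would have been left with
# had it lost power at that instant; the guest gets no chance to flush.
import subprocess
import time

VG = "vg0"               # host volume group (assumption)
GUEST_LV = "guest_disk"  # logical volume backing the guest's disk (assumption)

for i in range(10):
    time.sleep(30)       # let the workload make some progress
    subprocess.run(
        ["lvcreate", "--snapshot", "--name", f"crash_{i}",
         "--size", "1G", f"/dev/{VG}/{GUEST_LV}"],
        check=True,
    )
    # later: activate and mount each snapshot, start a postmaster on it,
    # and check that crash recovery leaves the data consistent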

And yes, the hard part that always stopped me from looking at this was
having any way to test the correctness of the data.

--
greg


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: DELETE vs TRUNCATE explanation
Date: 2012-08-09 18:06:54
Message-ID: CAMkU=1x36-xLdUORNZCXZh4yt+qC0z+Qqcyj7Y-gfPgFt2w2FA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, Jul 12, 2012 at 4:21 PM, Harold A. Giménez
<harold(dot)gimenez(at)gmail(dot)com> wrote:
> Hi,
>
> I work with Daniel Farina and was the other engineer who "discovered" this,
> once again. That is, I got bit by it and have been running TRUNCATE on my
> test suites for years.

Hi Daniel and Harold,

I don't know if you followed this thread over into the -hacker mailing list.

There was some bookkeeping code that was N^2 in the number of
truncations performed during any given checkpoint cycle. That has
been fixed in 9.2Beta3.

I suspect that this was the root cause of the problem you encountered.

If you are in a position to retest using 9.2Beta3
(http://www.postgresql.org/about/news/1405/), I'd be interested to
know if it does make truncations comparable in speed to unqualified
deletes.

Thanks,

Jeff