Quick Links

Do we need so many hint bits?

Lists:	pgsql-hackers

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Do we need so many hint bits?
Date:	2012-11-16 00:42:57
Message-ID:	1353026577.14335.91.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

It occurred to me recently that many of the hint bits aren't terribly
important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly
has a purpose, and we'd expect it to be used many times following the
initial CLOG lookup.

But the other tuple hint bits seem to be there just for symmetry,
because they shouldn't last long. If HEAP_XMIN_INVALID or
HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed
soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it
should just be changed to InvalidTransactionId.

Removing those 3 hints would give us 3 more flag bits (eventually, after
we are sure they aren't just leftover), and it would also reduce the
chance that a page is dirtied for no other reason than to set them. It
might even take a few cycles out of the tqual.c routines, or at least
reduce the code size. Not a huge win, but I don't see much downside
either.

Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced
in the visibility map patch, apparently as a way to know when to clear
the VM bit when doing an update. It was then also used for scans, which
showed a significant speedup. But I wonder: why not just use the
visibilitymap directly from those places? It can be used for the scan
because it is crash safe now (not possible before). And since it's only
one lookup per scanned page, then I don't think it would be a measurable
performance loss there. Inserts/updates/deletes also do a significant
amount of work, so again, I doubt it's a big drop in performance there
-- maybe under a lot of concurrency or something.

The benefit of removing PD_ALL_VISIBLE would be significantly higher.
It's quite common to load a lot of data, and then do some reads for a
while (setting hint bits and flushing them to disk), and then do a
VACUUM a while later, setting PD_ALL_VISIBLE and writing all of the
pages again. Also, if I remember correctly, Robert went to significant
effort when making the VM crash-safe to keep the PD_ALL_VISIBLE and VM
bits consistent. Maybe this was all discussed before?

All of these hint bits will have a bit more of a performance impact
after checksums are introduced (for those that use them in conjunction
with large data loads), so I'm looking for some simple ways to mitigate
those effects. What kind of worst-case tests could I construct to see if
there are worrying performance effects to removing these hint bits?

Regards,
Jeff Davis

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 01:26:42
Message-ID:	CA+U5nMK92PLjP-6fRArv0VDw7zikrZ7kMoQvdjiCEkSVhxhPAw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 15 November 2012 19:42, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> many of the hint bits aren't terribly important

The truth is that nobody knows because there is no way of knowing.

A good way forwards would be to collect detailed stats by table. We
can then run various loads and see if any of this is worthwhile.

I'm sure we'd learn enough to understand how to optimise things.

track_hints?
track_buffers?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 03:08:00
Message-ID:	2488.1353035280@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On 15 November 2012 19:42, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>> many of the hint bits aren't terribly important

> The truth is that nobody knows because there is no way of knowing.

We had a discussion awhile back in which the idea of *no* hint bits
was advocated, and someone (I think Robert) did some preliminary
tests that pretty much shot it down. However, I don't recall
anyone suggesting before that the four existing bits might not all
be equally worthwhile. It's worth looking into. The hard part is
probably agreeing on the test case or cases to measure behavior for.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 03:21:10
Message-ID:	2718.1353036070@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> It occurred to me recently that many of the hint bits aren't terribly
> important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly
> has a purpose, and we'd expect it to be used many times following the
> initial CLOG lookup.

Right.

> But the other tuple hint bits seem to be there just for symmetry,
> because they shouldn't last long. If HEAP_XMIN_INVALID or
> HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed
> soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it
> should just be changed to InvalidTransactionId.

Hm. It is not cheaper to change xmax to 0 than it is to set the hint
bit --- you still need a write, and there are also added locking and
atomicity worries --- so I'm not convinced by your argument there.
But you might be right that the expected number of wins from the other
two bits is a lot less.

> Removing those 3 hints would give us 3 more flag bits (eventually, after
> we are sure they aren't just leftover), and it would also reduce the
> chance that a page is dirtied for no other reason than to set them.

We aren't pressed for flag bits particularly. I think the main
attraction of this idea is precisely to reduce unnecessary page dirties,
and so that leads me to suggest a variant: keep the four bits defined as
now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the
page is already dirty. This would make it a straight-up trade of more
clog consultation for fewer page dirties.

Added contention for access to the visibility map pages would be a
problem.

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 04:12:49
Message-ID:	CA+U5nMLjREqC3u-oEt1FF7FEt0_7Ye5yMmyi=9H3qA2HVFNofA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 15 November 2012 22:08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On 15 November 2012 19:42, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>>> many of the hint bits aren't terribly important
>
>> The truth is that nobody knows because there is no way of knowing.
>
> We had a discussion awhile back in which the idea of *no* hint bits
> was advocated, and someone (I think Robert) did some preliminary
> tests that pretty much shot it down. However, I don't recall
> anyone suggesting before that the four existing bits might not all
> be equally worthwhile. It's worth looking into.

Itsn't that what I said? In case of doubt, Yes, its a great idea and
worth looking into.

The question is *how* we look into it.

> The hard part is
> probably agreeing on the test case or cases to measure behavior for.

I think thats impossible. There are just too many possible cases.
Measuring top-level performance without measuring low level stats just
means we can't tell the difference between a test that didn't exercise
the code and a test where there was no difference.

We need detailed stats that allow many people to make their own tests
and to report on what they find.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 04:14:49
Message-ID:	CABOikdPXKgnkMdcmq8F7K6R9ZbLeQdZZqy4DjyDrxLwqGPJ0gg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 16, 2012 at 8:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>
> > Removing those 3 hints would give us 3 more flag bits (eventually, after
> > we are sure they aren't just leftover), and it would also reduce the
> > chance that a page is dirtied for no other reason than to set them.
>
> We aren't pressed for flag bits particularly. I think the main
> attraction of this idea is precisely to reduce unnecessary page dirties,
> and so that leads me to suggest a variant: keep the four bits defined as
> now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the
> page is already dirty.

Another approach could be to set those additional bits, but don't dirty the
page. So if the page is already dirty or gets dirty later on before its
eviction, those hint bits will get recorded. We can also try some other
variants like: set the bits and dirty the page if the XID is beyond the
horizon that CLOG buffers can track. A very old XID will most likely cause
a CLOG page read later on.

I forgot if we have a way to differentiate between critical and
non-critical dirtiness of a page. If we don't, that might be another
worthwhile thing to try and test. Setting hint bits is a non-critical
operation. So such a dirty page may be discarded if the system is under
pressure.

Well, all of the above may have been discussed and rejected in the past.
What we need is a thorough benchmarking to know what is better in which
situation.

Thanks,
Pavan

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 04:37:49
Message-ID:	CA+U5nM+ptW9AxDYob0p5E1PoO4CtYWsCXcNzLaw5KxL5ZOmoag@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 15 November 2012 22:21, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>> Removing those 3 hints would give us 3 more flag bits (eventually, after
>> we are sure they aren't just leftover), and it would also reduce the
>> chance that a page is dirtied for no other reason than to set them.
>
> We aren't pressed for flag bits particularly. I think the main
> attraction of this idea is precisely to reduce unnecessary page dirties,
> and so that leads me to suggest a variant: keep the four bits defined as
> now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the
> page is already dirty. This would make it a straight-up trade of more
> clog consultation for fewer page dirties.

Hmm, I thought Alvaro wanted an extra flag bit for foreign key locks
but couldn't find it.

Come to think of it, why do we have XMIN_INVALID and XMAX_INVALID? We
never need both at the same time, so we can just use one...
XMIN_INVALID -> XACT_INVALID
XMAX_INVALID == XMIN_COMMITTED | XACT_INVALID

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 06:44:23
Message-ID:	1353048263.10198.11.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2012-11-15 at 22:21 -0500, Tom Lane wrote:
> We aren't pressed for flag bits particularly. I think the main
> attraction of this idea is precisely to reduce unnecessary page dirties,
> and so that leads me to suggest a variant: keep the four bits defined as
> now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the
> page is already dirty. This would make it a straight-up trade of more
> clog consultation for fewer page dirties.

Sounds reasonable.

> > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced
> > in the visibility map patch, apparently as a way to know when to clear
> > the VM bit when doing an update. It was then also used for scans, which
> > showed a significant speedup. But I wonder: why not just use the
> > visibilitymap directly from those places?
>
> Added contention for access to the visibility map pages would be a
> problem.

Can you think of a reasonable (or worst-case) test scenario that would
illustrate this problem? Would I need a many-core box to actually test
this?

Also, what locks do you expect to be contended? The buffer header lock
while pinning the VM page?

Regards,
Jeff Davis

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 09:40:07
Message-ID:	CA+U5nMJ5_7WqYjopVLQRwNto1QyESvVjv=PJ7P80YWuUFwpbWg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 15 November 2012 23:12, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> We need detailed stats that allow many people to make their own tests
> and to report on what they find.

To work out how to proceed we need

* wait event times on clog accesses at table-level
* numbers of blocks dirtied by hint bit settings at table-level

That way we can see which tables need to have hint bits set and which don't.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 14:43:12
Message-ID:	CAHyXU0z7Wso4pFWS7L_6wnTn6YQ+zK867cZKiD3CU4E_Y-aHgw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 15, 2012 at 9:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>> It occurred to me recently that many of the hint bits aren't terribly
>> important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly
>> has a purpose, and we'd expect it to be used many times following the
>> initial CLOG lookup.
>
> Right.
>
>> But the other tuple hint bits seem to be there just for symmetry,
>> because they shouldn't last long. If HEAP_XMIN_INVALID or
>> HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed
>> soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it
>> should just be changed to InvalidTransactionId.
>
> Hm. It is not cheaper to change xmax to 0 than it is to set the hint
> bit --- you still need a write, and there are also added locking and
> atomicity worries --- so I'm not convinced by your argument there.
> But you might be right that the expected number of wins from the other
> two bits is a lot less.

Is that true in a post checksum world though? Given that we are
logging changes can we relax atomicity expectations? IIRC xmin/xmax
are aligned, how come you can't just set InvalidTransactionId for
INVALID and 'FrozenTransactionId' for COMMITTED? Why can't you do
this now?

merlin

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 14:58:27
Message-ID:	20121116145827.GD6505@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2012-11-16 08:43:12 -0600, Merlin Moncure wrote:
> On Thu, Nov 15, 2012 at 9:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> >> It occurred to me recently that many of the hint bits aren't terribly
> >> important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly
> >> has a purpose, and we'd expect it to be used many times following the
> >> initial CLOG lookup.
> >
> > Right.
> >
> >> But the other tuple hint bits seem to be there just for symmetry,
> >> because they shouldn't last long. If HEAP_XMIN_INVALID or
> >> HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed
> >> soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it
> >> should just be changed to InvalidTransactionId.
> >
> > Hm. It is not cheaper to change xmax to 0 than it is to set the hint
> > bit --- you still need a write, and there are also added locking and
> > atomicity worries --- so I'm not convinced by your argument there.
> > But you might be right that the expected number of wins from the other
> > two bits is a lot less.
>
> Is that true in a post checksum world though? Given that we are
> logging changes can we relax atomicity expectations? IIRC xmin/xmax
> are aligned, how come you can't just set InvalidTransactionId for
> INVALID and 'FrozenTransactionId' for COMMITTED? Why can't you do
> this now?

Uhm. The latter doesn't really work if you have any transactions that
might not see that row or am I missing something?

Greetings,

Andres Freund

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 15:00:15
Message-ID:	CAHyXU0yXRTGBssqXXL-PPTyik+ZgtmhPYe8jrVKmYQq0nzVVag@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 16, 2012 at 8:58 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2012-11-16 08:43:12 -0600, Merlin Moncure wrote:
>> On Thu, Nov 15, 2012 at 9:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> > Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>> >> It occurred to me recently that many of the hint bits aren't terribly
>> >> important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly
>> >> has a purpose, and we'd expect it to be used many times following the
>> >> initial CLOG lookup.
>> >
>> > Right.
>> >
>> >> But the other tuple hint bits seem to be there just for symmetry,
>> >> because they shouldn't last long. If HEAP_XMIN_INVALID or
>> >> HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed
>> >> soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it
>> >> should just be changed to InvalidTransactionId.
>> >
>> > Hm. It is not cheaper to change xmax to 0 than it is to set the hint
>> > bit --- you still need a write, and there are also added locking and
>> > atomicity worries --- so I'm not convinced by your argument there.
>> > But you might be right that the expected number of wins from the other
>> > two bits is a lot less.
>>
>> Is that true in a post checksum world though? Given that we are
>> logging changes can we relax atomicity expectations? IIRC xmin/xmax
>> are aligned, how come you can't just set InvalidTransactionId for
>> INVALID and 'FrozenTransactionId' for COMMITTED? Why can't you do
>> this now?
>
> Uhm. The latter doesn't really work if you have any transactions that
> might not see that row or am I missing something?

nope. you're right.

merlin

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 15:08:51
Message-ID:	CAHyXU0zPx5uitO-W=WE1Z6ayRYn7cvEDogFQAPMmGxcLi+TTnQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 15, 2012 at 10:14 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> Another approach could be to set those additional bits, but don't dirty the
> page. So if the page is already dirty or gets dirty later on before its
> eviction, those hint bits will get recorded. We can also try some other
> variants like: set the bits and dirty the page if the XID is beyond the
> horizon that CLOG buffers can track. A very old XID will most likely cause a
> CLOG page read later on.
>
> I forgot if we have a way to differentiate between critical and non-critical
> dirtiness of a page. If we don't, that might be another worthwhile thing to
> try and test. Setting hint bits is a non-critical operation. So such a dirty
> page may be discarded if the system is under pressure.
>
> Well, all of the above may have been discussed and rejected in the past.
> What we need is a thorough benchmarking to know what is better in which
> situation.

Yeah: all previous ideas attempts manage hint bits through trade-offs
have failed because of general nervousness about who loses. The
unlucky guy who has high transaction rates that suddenly has a big
increase in clog activity is in for a world of hurt.

Also, AIUI 'set but don't dirty' strategies are less effective if/when
checksums are implemented as currently proposed. Previously hint bit
mitigation was all about saving some i/o. Now, everything is getting
rammed through the WAL lock.

merlin

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 15:09:42
Message-ID:	20121116150942.GE6505@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2012-11-15 16:42:57 -0800, Jeff Davis wrote:
> Related to discussion here:
> http://archives.postgresql.org/message-id/CAHyXU0zn5emePLedoZUGrAQiF92F-YjvFr-P5vUh6n0WpKZ6PQ@mail.gmail.com
>
> It occurred to me recently that many of the hint bits aren't terribly
> important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly
> has a purpose, and we'd expect it to be used many times following the
> initial CLOG lookup.

Wrt HEAP_XMAX_COMMITTED:
It can take an *awfully* long time till autovacuum crosses the
thresholds the next time for a big table.
I also think we cannot dismiss the case of longrunning transactions
because vacuum won't be able to cleanup those rows in that case.

Wrt HEAP_(XMIN|XMAX)_INVALID: yes, if we are in need of new flag bits
those sound like a good target to me.

> Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced
> in the visibility map patch, apparently as a way to know when to clear
> the VM bit when doing an update. It was then also used for scans, which
> showed a significant speedup. But I wonder: why not just use the
> visibilitymap directly from those places? It can be used for the scan
> because it is crash safe now (not possible before). And since it's only
> one lookup per scanned page, then I don't think it would be a measurable
> performance loss there. Inserts/updates/deletes also do a significant
> amount of work, so again, I doubt it's a big drop in performance there
> -- maybe under a lot of concurrency or something.
>
> The benefit of removing PD_ALL_VISIBLE would be significantly higher.
> It's quite common to load a lot of data, and then do some reads for a
> while (setting hint bits and flushing them to disk), and then do a
> VACUUM a while later, setting PD_ALL_VISIBLE and writing all of the
> pages again. Also, if I remember correctly, Robert went to significant
> effort when making the VM crash-safe to keep the PD_ALL_VISIBLE and VM
> bits consistent. Maybe this was all discussed before?

As far as I understand the code the crash-safety aspects of the
visibilitymap currently rely on on having the knowledge that ALL_VISIBLE
has been cleared during a heap_(insert|update|delete). That allows
management of the visibilitymap without it being xlogged itself which
seems pretty important to me.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 16:58:23
Message-ID:	CA+Tgmoag0BYeEinvNWxFpSiAMTWjK6XiCQiP-U6+GHVn9UdqWw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 15, 2012 at 7:42 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> But the other tuple hint bits seem to be there just for symmetry,
> because they shouldn't last long. If HEAP_XMIN_INVALID or
> HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed
> soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it
> should just be changed to InvalidTransactionId.

"Soon" is a relative term. I doubt that we can really rely on vacuum
to be timely enough to avoid pain here - you can easily have tens of
thousands of hits on the same tuple before vacuum gets around to
dealing with it. Now, we might be able to rejigger things to avoid
that. For example, maybe it'd be possible to arrange things so that
when we see an invalid xmin, we set the flag that triggers a HOT prune
instead of setting the hint bit. That would probably be good enough
to dispense with the hint bit, and maybe better altogether better than
the current system, because now the next time someone (including us)
locks the buffer we'll nuke the entire tuple, which would not only
make it cheaper to scan but also frees up space in the buffer sooner.

However, that solution only works for invalid-xmin. For
committed-xmax, there could actually be quite a long time before the
page can be pruned, because there can be some other backend holding an
old snapshot open. A one-minute reporting query in another database,
which is hardly an unreasonable scenario, could result in many, many
additional CLOG lookups, which are already a major contention point at
high concurrencies. I think that bit is probably pretty important,
and I don't see a viable way to get rid of it, though maybe someone
can think of one. For invalid-xmax, I agree that we could probably
just change xmax to InvalidTransactionId, if we need to save
bit-space. In the past Tom and I think also Alvaro have been
skeptical about anything that would overwrite xmin/xmax values too
quickly for forensic reasons, but maybe it's worth considering.

Well, you'd have to look up, lock and pin the page to do that. I
suspect that overhead is pretty significant. The benefit of noticing
that the flag is set is that you need not call HeapTupleSatisfiesMVCC
for each tuple on the page: checking one bit in the page header is a
lot cheaper than calling that function for every tuple. However, if
you had to lock and pin a second page in order to check whether the
page is all-visible, I suspect it wouldn't be a win; you'd probably be
better off just doing the HeapTupleSatisfiesMVCC checks for each
tuple.

One of the main advantages of PD_ALL_VISIBLE is that if you do an
insert, update, or delete on a page where that bit isn't set, you need
not lock and pin the visibility map page, because you already know
that the bit will be clear in the visibility map. If the data is
being rapidly modified, you'll get the benefit of this optimization
most of the time, only losing it when vacuum has visited recently. I
hope that's not premature optimization because I sure sweat a lot of
blood last release cycle to keep it working like that. I had a few
doubts at the time about how much we were winning there, but I don't
actually have any hard data either way, so I would be reluctant to
assume it doesn't matter.

Even if it doesn't, the sequential-scan optimization definitely
matters a LOT, as you can easily verify.

One approach that I've been hoping to pursue is to find a way to make
CLOG lookups cheaper and more concurrent. I started to work on some
concurrent hash table code, which you can find here:

http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash

The concurrency properties of this code are vastly better than what we
have now, but there are cases where it loses vs. dynahash when there's
no concurrency. That might be fixable or just not a big deal, though.
A bigger problem is that I got sucked off into other things before I
was able to get as far with it as I wanted to; in particular, I have
only unit test results for it, and haven't tried to integrate it into
the SLRU code yet.

But I'm not sure any of this is going to fundamentally chip away at
the need for hint bits all that much. Making CLOG lookups cheaper or
less frequent is all to the good, but the prognosis for improving
things enough that we can dump some or all of the hint bits completely
seems uncertain at best. Even if we COULD dump everything but
heap-xmin-committed, how much would that really help with the
disk-write problem? I bet heap-xmin-committed gets set far more
often than the other three put together.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-16 18:07:55
Message-ID:	1353089275.10198.22.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

When I tried a simple test on my laptop, where I removed the
XMIN_INVALID and XMAX_COMMITTED bits on a simple SELECT test (no aborts,
no deletes, fits in shared_buffers) I saw a pretty consistent 2%
speedup. Not earth-shattering, but it would be nice if we could figure
out a way to do that.

Not sure where that speedup came from, to be honest, it doesn't look
like much difference in the work being done in the visibility check.

Regards,
Jeff Davis

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-17 00:58:50
Message-ID:	1353113930.14335.185.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2012-11-16 at 11:58 -0500, Robert Haas wrote:
> > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced
> > in the visibility map patch, apparently as a way to know when to clear
> > the VM bit when doing an update. It was then also used for scans, which
> > showed a significant speedup. But I wonder: why not just use the
> > visibilitymap directly from those places?
>
> Well, you'd have to look up, lock and pin the page to do that. I
> suspect that overhead is pretty significant. The benefit of noticing
> that the flag is set is that you need not call HeapTupleSatisfiesMVCC
> for each tuple on the page: checking one bit in the page header is a
> lot cheaper than calling that function for every tuple. However, if
> you had to lock and pin a second page in order to check whether the
> page is all-visible, I suspect it wouldn't be a win; you'd probably be
> better off just doing the HeapTupleSatisfiesMVCC checks for each
> tuple.

That's pretty easy to test. Here's what I got on a 10M record table
(Some runs got some strangely high numbers around 1700ms, which I assume
is because it's difficult to keep the data in shared buffers, so I took
the lower numbers.):

PD_ALL_VISIBLE: 661ms
VM Lookup: 667ms
Neither: 740ms

Even if pinning the vm buffer were slow, we could keep the pin longer
during a scan (it seems like the VM API is designed for that kind of a
use case), so I don't think scans are a problem at all, even if there is
a lot of concurrency.

> One of the main advantages of PD_ALL_VISIBLE is that if you do an
> insert, update, or delete on a page where that bit isn't set, you need
> not lock and pin the visibility map page, because you already know
> that the bit will be clear in the visibility map. If the data is
> being rapidly modified, you'll get the benefit of this optimization
> most of the time, only losing it when vacuum has visited recently. I
> hope that's not premature optimization because I sure sweat a lot of
> blood last release cycle to keep it working like that. I had a few
> doubts at the time about how much we were winning there, but I don't
> actually have any hard data either way, so I would be reluctant to
> assume it doesn't matter.

This is a more plausible concern, because it's per-tuple rather than
per-page.

However, I think we might be able to save the current VM buffer across
multiple calls of heap_insert|update|delete in nodeModifyTable. At first
glance, it doesn't look very difficult.

It's unfortunate that I didn't think much about this while you were
working on it before; but difficult-to-write code is also difficult to
maintain, so simplifying it would be nice.

> Even if it doesn't, the sequential-scan optimization definitely
> matters a LOT, as you can easily verify.

I am in no way suggesting that we remove that optimization; I am
suggesting that we can get the exact same benefit by looking at the VM
directly.

> One approach that I've been hoping to pursue is to find a way to make
> CLOG lookups cheaper and more concurrent. I started to work on some
> concurrent hash table code, which you can find here:
>
> http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash
>
> The concurrency properties of this code are vastly better than what we
> have now, but there are cases where it loses vs. dynahash when there's
> no concurrency. That might be fixable or just not a big deal, though.
> A bigger problem is that I got sucked off into other things before I
> was able to get as far with it as I wanted to; in particular, I have
> only unit test results for it, and haven't tried to integrate it into
> the SLRU code yet.

Great work!

> But I'm not sure any of this is going to fundamentally chip away at
> the need for hint bits all that much. Making CLOG lookups cheaper or
> less frequent is all to the good, but the prognosis for improving
> things enough that we can dump some or all of the hint bits completely
> seems uncertain at best. Even if we COULD dump everything but
> heap-xmin-committed, how much would that really help with the
> disk-write problem? I bet heap-xmin-committed gets set far more
> often than the other three put together.

Right now I'm setting my sights on PD_ALL_VISIBLE. I know that causes a
lot of extra page dirties, and it seems like it doesn't serve all that
much purpose after the VM became crash-safe. And it would simplify the
code.

If we can't get rid of the other hint bits, then so be it.

> > But the other tuple hint bits seem to be there just for symmetry,
> > because they shouldn't last long. If HEAP_XMIN_INVALID or
> > HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed
> > soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it
> > should just be changed to InvalidTransactionId.
>
> However, that solution only works for invalid-xmin. For
> committed-xmax, there could actually be quite a long time before the
> page can be pruned, because there can be some other backend holding an
> old snapshot open.

I think I'll set aside this part of the idea for now, because it doesn't
seem like a huge win after I thought about it, anyway. If
HEAP_XMIN_INVALID or HEAP_XMAX_COMMITTED are set, it's likely to be
dirty anyway, unless it's a bulk delete or something.

Regards,
Jeff Davis

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-17 01:04:26
Message-ID:	CAMkU=1xh=gLncf1RfWiH38UGOGM6Sn9ZfhVy1ci82RK0L4am+w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 15, 2012 at 4:42 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced
> in the visibility map patch, apparently as a way to know when to clear
> the VM bit when doing an update. It was then also used for scans, which
> showed a significant speedup. But I wonder: why not just use the
> visibilitymap directly from those places? It can be used for the scan
> because it is crash safe now (not possible before). And since it's only
> one lookup per scanned page,

Wouldn't you need to lock two pages simultaneously, for each table
page, in order to ensure that there are no races?

For a full table scan, that might not be so good.

> then I don't think it would be a measurable
> performance loss there. Inserts/updates/deletes also do a significant
> amount of work, so again, I doubt it's a big drop in performance there
> -- maybe under a lot of concurrency or something.

Your question prompts me to post something I had been wondering.
Might it be worthwhile to break the PD_ALL_VISIBLE / vm equivalence?
Should the vm bit get cleared by a HOT update?

Currently the vm is used only for non-freezing vacuuming and
index-only scans, as far as I know.

For the IOS, it doesn't care which tuple in the chain is visible, as
long as exactly one of them is. So everything it cares about is still
visible.

And anyone can vacuum a block that has only had HOT updates, you don't
need to be dedicated vacuum worker to do that.

I guess one limitation of this as that if the block was populated by
large tuples which then systematically got made smaller by HOT
updates, the free space map would never get updated until the next
vacuum freeze kicked in.

And obviously this would be incompatible with removing the
PD_ALL_VISIBLE, unless we also wanted to eliminate the ability to
short-cut hint bit checks.

Cheers,

Jeff

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-17 01:19:23
Message-ID:	1353115163.14335.196.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2012-11-16 at 16:09 +0100, Andres Freund wrote:
> As far as I understand the code the crash-safety aspects of the
> visibilitymap currently rely on on having the knowledge that ALL_VISIBLE
> has been cleared during a heap_(insert|update|delete). That allows
> management of the visibilitymap without it being xlogged itself which
> seems pretty important to me.

It looks like the xlrec does contain a "cleared all visible" flag in it,
and it uses that to clear the VM as well as PD_ALL_VISIBLE.

So, it still seems achievable, and maybe not even a large patch, to do
away with PD_ALL_VISIBLE without a lot of performance risk.

Regards,
Jeff Davis

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-17 01:35:10
Message-ID:	1353116110.14335.206.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2012-11-16 at 17:04 -0800, Jeff Janes wrote:
> On Thu, Nov 15, 2012 at 4:42 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> >
> > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced
> > in the visibility map patch, apparently as a way to know when to clear
> > the VM bit when doing an update. It was then also used for scans, which
> > showed a significant speedup. But I wonder: why not just use the
> > visibilitymap directly from those places? It can be used for the scan
> > because it is crash safe now (not possible before). And since it's only
> > one lookup per scanned page,
>
> Wouldn't you need to lock two pages simultaneously, for each table
> page, in order to ensure that there are no races?
>
> For a full table scan, that might not be so good.

Can you explain? The table scan share locks the heap page during the
visibility check (which should suffice as a memory barrier) and then
could test the VM bit with only a pin on the VM page.

> Your question prompts me to post something I had been wondering.
> Might it be worthwhile to break the PD_ALL_VISIBLE / vm equivalence?
> Should the vm bit get cleared by a HOT update?

To clarify: are you saying that a hot update should clear the
PD_ALL_VISIBLE bit, but not the VM bit?

> And anyone can vacuum a block that has only had HOT updates, you don't
> need to be dedicated vacuum worker to do that.

> And obviously this would be incompatible with removing the
> PD_ALL_VISIBLE, unless we also wanted to eliminate the ability to
> short-cut hint bit checks.

I'm still a little unclear on what the benefit is.

It sounds like a slightly different kind of hint, so maybe we should
just treat it as a completely different thing after removing
PD_ALL_VISIBLE. If it's related to HOT updates, then the page will
probably be dirty anyway, so that removes my primary complaint about
PD_ALL_VISIBLE.

Regards,
Jeff Davis

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-17 13:24:15
Message-ID:	20121117132415.GA4222@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2012-11-16 17:19:23 -0800, Jeff Davis wrote:
> On Fri, 2012-11-16 at 16:09 +0100, Andres Freund wrote:
> > As far as I understand the code the crash-safety aspects of the
> > visibilitymap currently rely on on having the knowledge that ALL_VISIBLE
> > has been cleared during a heap_(insert|update|delete). That allows
> > management of the visibilitymap without it being xlogged itself which
> > seems pretty important to me.
>
> It looks like the xlrec does contain a "cleared all visible" flag in it,
> and it uses that to clear the VM as well as PD_ALL_VISIBLE.

I think the point is that to check whether the visibilitymap bit needs
to be unset - and thus locked exlusively - no locks have to be acquired
but those heap_* already has. Given that in a the large amount of cases
ALL_VISIBLE does *not* need to be reset I think that this is a very
important property for concurrency purposes. If you consider the large
amount of pages that are covered by a single visibilitymap page we don't
want more locking in that path...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-17 21:30:07
Message-ID:	1353187807.10198.28.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2012-11-17 at 14:24 +0100, Andres Freund wrote:
> I think the point is that to check whether the visibilitymap bit needs
> to be unset

What's the problem with that? If you already have the VM buffer pinned
(which should be possible if we keep the VM buffer in a longer-lived
structure), then doing the test is almost as cheap as checking
PD_ALL_VISIBLE, because you don't need any locks.

So, the proposal is:
1. Keep the VM buffer around in a longer-lived structure for scans and
nodeModifyTable.
2. Replace all tests of PD_ALL_VISIBLE with tests directly against the
VM, hopefully using a buffer that we already have a pin on.

I haven't really dug into this yet, but I don't see any obvious problem.

Regards,
Jeff Davis

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-17 21:53:31
Message-ID:	14867.1353189211@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> What's the problem with that? If you already have the VM buffer pinned
> (which should be possible if we keep the VM buffer in a longer-lived
> structure), then doing the test is almost as cheap as checking
> PD_ALL_VISIBLE, because you don't need any locks.

Really? What about race conditions? Specifically, I think what you
suggest is likely to be unreliable on machines with weak memory
ordering. Consider possibility that someone else just changed the VM
bit. Getting a lock ensures synchronization. (Yeah, it's possible that
we could use some primitive cheaper than a lock ... but it's not going
to be free.)

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-18 00:35:34
Message-ID:	CA+U5nMKb0b0u8ifbMEkUY1+9m18NhvbZPcdUyuY_sJiXd1cttw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 16 November 2012 19:58, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Fri, 2012-11-16 at 11:58 -0500, Robert Haas wrote:
>> > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced
>> > in the visibility map patch, apparently as a way to know when to clear
>> > the VM bit when doing an update. It was then also used for scans, which
>> > showed a significant speedup. But I wonder: why not just use the
>> > visibilitymap directly from those places?
>>
>> Well, you'd have to look up, lock and pin the page to do that. I
>> suspect that overhead is pretty significant. The benefit of noticing
>> that the flag is set is that you need not call HeapTupleSatisfiesMVCC
>> for each tuple on the page: checking one bit in the page header is a
>> lot cheaper than calling that function for every tuple. However, if
>> you had to lock and pin a second page in order to check whether the
>> page is all-visible, I suspect it wouldn't be a win; you'd probably be
>> better off just doing the HeapTupleSatisfiesMVCC checks for each
>> tuple.
>
> That's pretty easy to test. Here's what I got on a 10M record table
> (Some runs got some strangely high numbers around 1700ms, which I assume
> is because it's difficult to keep the data in shared buffers, so I took
> the lower numbers.):
>
> PD_ALL_VISIBLE: 661ms
> VM Lookup: 667ms
> Neither: 740ms
>
> Even if pinning the vm buffer were slow, we could keep the pin longer
> during a scan (it seems like the VM API is designed for that kind of a
> use case), so I don't think scans are a problem at all, even if there is
> a lot of concurrency.

The biggest problem with hint bits is SeqScans on a table that ends up
dirtying many pages. Repeated checks against clog and hint bit setting
are massive overheads for the user that hits that, plus it generates
an unexpected surge of database writes. Even without checksums that is
annoying.

ISTM that we should tune that specifically by performing a VM lookup
for next 32 pages (or more), so we reduce the lookups well below 1 per
page. That way the overhead of using the VM will be similar to using
the PD_ALL_VISIBLE. Also, if we pass through a flag to
HeapTupleSateisfies indicating we are not interested in setting hints
on a SeqScan then we can skip individual tuple hints also. If the
whole page becomes visible then we can set the VM.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-18 02:07:01
Message-ID:	1353204421.10198.60.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2012-11-17 at 16:53 -0500, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > What's the problem with that? If you already have the VM buffer pinned
> > (which should be possible if we keep the VM buffer in a longer-lived
> > structure), then doing the test is almost as cheap as checking
> > PD_ALL_VISIBLE, because you don't need any locks.
>
> Really? What about race conditions? Specifically, I think what you
> suggest is likely to be unreliable on machines with weak memory
> ordering. Consider possibility that someone else just changed the VM
> bit. Getting a lock ensures synchronization. (Yeah, it's possible that
> we could use some primitive cheaper than a lock ... but it's not going
> to be free.)

There's already a similar precedent in IndexOnlyNext, which calls
visibilitymap_test with no lock.

I am not authoritative on these kinds of lockless accesses, but it looks
like we can satisfy those memory barrier requirements in the places we
need to.

Here is my analysis:

Process A (process that clears a VM bit for a data page):
1. Acquires exclusive lock on data buffer
2. Acquires exclusive lock on VM buffer
3. clears VM bit
4. Releases VM buffer lock
5. Releases data buffer lock

Process B (process that tests the VM bit for the same data page):
1. Acquires shared lock (if it's a scan doing a visibility test) or an
exclusive lock (if it's an I/U/D that wants to know whether to clear the
bit or not) on the data buffer.
2. Tests bit using an already-pinned VM buffer.
3. Releases data buffer lock.

Process A and B must be serialized, because A takes an exclusive lock on
the data buffer and B takes at least a shared lock on the data buffer.
The only dangerous case is when A happens right before B. So, the
question is: are there enough memory barriers between A-3 and B-2? And I
think the answer is yes. A-4 should act as a write barrier after
clearing the bit, and B-1 should act as a read barrier before reading
the bit.

Let me know if there is a flaw with this analysis.

If not, then I still agree with you that it's not as cheap as testing
PD_ALL_VISIBLE, but I am skeptical that memory-ordering constraints
we're imposing on the CPU are expensive enough to matter in these cases.
If you have a test case in mind that might exercise this, then I will
try to run it (although my workstation is only 4 cores, and the most I
can probably get access to is 16 cores).

Regards,
Jeff Davis

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-18 02:20:33
Message-ID:	1353205233.10198.74.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2012-11-17 at 19:35 -0500, Simon Riggs wrote:
> The biggest problem with hint bits is SeqScans on a table that ends up
> dirtying many pages. Repeated checks against clog and hint bit setting
> are massive overheads for the user that hits that, plus it generates
> an unexpected surge of database writes. Even without checksums that is
> annoying.

Yeah. I am nowhere close to a general solution for that, but I am
targeting the PD_ALL_VISIBLE hint for removal (which is one part of the
problem), and I think I am close to an approach with no measurable
downside.

> ISTM that we should tune that specifically by performing a VM lookup
> for next 32 pages (or more), so we reduce the lookups well below 1 per
> page. That way the overhead of using the VM will be similar to using
> the PD_ALL_VISIBLE.

That's another potential way to mitigate the effects during a scan, but
it does add a little complexity. Right now, it share locks a buffer, and
uses an array with one element for each tuple in the page. If
PD_ALL_VISIBLE is set, then it marks all of the tuples *currently
present* on the page as visible in the array, and then releases the
share lock. Then, when reading the page, if another tuple is added
(because we released the share lock and only have a pin), it doesn't
matter because it's already invisible according to the array.

With this approach, we'd need to keep a larger array to represent many
pages. And it sounds like we'd need to share lock the pages ahead, and
find out which items are currently present, in order to properly fill in
the array. Not quite sure what to do there, but would require some more
thought.

I'm inclined to avoid going down this path unless there is some
performance reason to do so. We can keep a VM buffer pinned and do some
lockless testing (similar to that in IndexOnlyNext; see my response to
Tom), which will hopefully be fast enough that we don't need anything
else.

> Also, if we pass through a flag to
> HeapTupleSateisfies indicating we are not interested in setting hints
> on a SeqScan then we can skip individual tuple hints also. If the
> whole page becomes visible then we can set the VM.

Hmm, that's an idea. Maybe we shouldn't bother setting the hints if it's
already all-visible in the VM? Something else to think about.

Regards,
Jeff Davis

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-18 05:24:02
Message-ID:	CAMkU=1xyG=22ZRWqpSPcOhq2gD9YDY3eFN1dF3csK0cAKbRUKw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 16, 2012 at 5:35 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Fri, 2012-11-16 at 17:04 -0800, Jeff Janes wrote:

>
>> Your question prompts me to post something I had been wondering.
>> Might it be worthwhile to break the PD_ALL_VISIBLE / vm equivalence?
>> Should the vm bit get cleared by a HOT update?
>
> To clarify: are you saying that a hot update should clear the
> PD_ALL_VISIBLE bit, but not the VM bit?

Yes.

>
>> And anyone can vacuum a block that has only had HOT updates, you don't
>> need to be dedicated vacuum worker to do that.
>
>> And obviously this would be incompatible with removing the
>> PD_ALL_VISIBLE, unless we also wanted to eliminate the ability to
>> short-cut hint bit checks.
>
> I'm still a little unclear on what the benefit is.

The benefit would be that index only scans would be more likely to not
need to visit the heap page, if it only had HOT updates done to it
since the last time it was all-visible.

Also some reduced vacuuming, but I don't know if that benefit would be
beneficial.

> It sounds like a slightly different kind of hint, so maybe we should
> just treat it as a completely different thing after removing
> PD_ALL_VISIBLE. If it's related to HOT updates, then the page will
> probably be dirty anyway, so that removes my primary complaint about
> PD_ALL_VISIBLE.

Right, but if the HOT-only bit was on the page itself, it would no
longer help out index-only-scans.

Cheers,

Jeff

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-18 13:52:10
Message-ID:	CA+U5nM+pfN15VQi2HyobFpqbd5=WsB6HZ7pM9ehyLKrTHRwPbA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 17 November 2012 21:20, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

>> ISTM that we should tune that specifically by performing a VM lookup
>> for next 32 pages (or more), so we reduce the lookups well below 1 per
>> page. That way the overhead of using the VM will be similar to using
>> the PD_ALL_VISIBLE.
>
> That's another potential way to mitigate the effects during a scan, but
> it does add a little complexity. Right now, it share locks a buffer, and
> uses an array with one element for each tuple in the page. If
> PD_ALL_VISIBLE is set, then it marks all of the tuples *currently
> present* on the page as visible in the array, and then releases the
> share lock. Then, when reading the page, if another tuple is added
> (because we released the share lock and only have a pin), it doesn't
> matter because it's already invisible according to the array.
>
> With this approach, we'd need to keep a larger array to represent many
> pages. And it sounds like we'd need to share lock the pages ahead, and
> find out which items are currently present, in order to properly fill in
> the array. Not quite sure what to do there, but would require some more
> thought.

Hmm, that's too much and not really what I was thinking, but I concede
that was a little vague. No need for bigger arrays etc..

If we check the VM for next N blocks, then we know that all completed
transactions are commited. Yes, the VM can change, but that is not a
problem.

What I mean is that we keep an array of boolean[N] that simply tracks
what the VM said last time we checked it. If that is true for a block
then we do special processing, similar to the current all-visible path
and yet different, desribed below.

What we want is to do a HeapTupleVisibility check that does not rely
on tuple hints AND yet avoids all clog access. So when we scan a
buffer in page mode and we know the VM said it was all visible we
still check each tuple's visibility. If xid is below snapshot xmin
then the xid is known committed and the tuple is visible to this scan
(not necessarily all scans). We know this because the VM said this
page was all-visible AFTER our snapshot was taken. If tuple xid is
within snapshot or greater than snapshot xmax then the tuple is
invisible to our snapshot and we don't need to check clog. So once we
know the VM said the page was all visible we do not need to check clog
to establish visibility, we only need to check the tuple xmin against
our snapshot xmin.

So the VM can change under us and it doesn't matter. We don't need a
pin or lock on the VM, we just read it and let go. No race conditions,
no fuss.

The difference here is that we still need to check visibility of each
tuple, but that can be a very cheap check and never involves clog, nor
does it dirty the page. Tuple access is reasonably expensive in
comparison with a clog-less check on tuple xmin against snapshot xmin,
so the extra work is negligible.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-18 14:19:31
Message-ID:	201211181519.33154.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sunday, November 18, 2012 03:07:01 AM Jeff Davis wrote:
> Process A (process that clears a VM bit for a data page):
> 1. Acquires exclusive lock on data buffer
> 2. Acquires exclusive lock on VM buffer
> 3. clears VM bit
> 4. Releases VM buffer lock
> 5. Releases data buffer lock

Well, but right this is a rather big difference. If vm pages get
unconditionally locked all the time we will have a huge source of new
contention as they are shared between so many heap pages.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-18 14:58:16
Message-ID:	CA+U5nM+Nbvjxtn43WoD8KznMAHPoMKX4mBQC_PxwERa=Y6kbQA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 18 November 2012 08:52, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> The difference here is that we still need to check visibility of each
> tuple, but that can be a very cheap check and never involves clog, nor
> does it dirty the page. Tuple access is reasonably expensive in
> comparison with a clog-less check on tuple xmin against snapshot xmin,
> so the extra work is negligible.

The attached *partial* patch shows how the tuple checks would work.

This should fit in neatly with the vismap skip code you've got already.

Looks to me possible to skip the all-vis hint completely, as you
suggest, plus avoid repeated checks of VM or clog.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
all_visible_cheap_check.v1.patch	application/octet-stream	3.3 KB

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-18 17:13:19
Message-ID:	1353258799.10198.84.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2012-11-18 at 15:19 +0100, Andres Freund wrote:
> On Sunday, November 18, 2012 03:07:01 AM Jeff Davis wrote:
> > Process A (process that clears a VM bit for a data page):
> > 1. Acquires exclusive lock on data buffer
> > 2. Acquires exclusive lock on VM buffer
> > 3. clears VM bit
> > 4. Releases VM buffer lock
> > 5. Releases data buffer lock
>
> Well, but right this is a rather big difference. If vm pages get
> unconditionally locked all the time we will have a huge source of new
> contention as they are shared between so many heap pages.

No, that is only for the process *clearing* the bit, and this already
happens. I am not planning on introducing any new locks, aside from the
buffer header lock when acquiring a pin. And I plan to keep those pins
for long enough that those don't matter, either.

Regards,
Jeff Davis

From:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-19 18:20:24
Message-ID:	CAOeZVieq+1wLqgqk7n+L4vYAYin9cu=vq0PEYUtWtKi3dENAaQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Nov 18, 2012 at 10:43 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Sun, 2012-11-18 at 15:19 +0100, Andres Freund wrote:
> > On Sunday, November 18, 2012 03:07:01 AM Jeff Davis wrote:
> > > Process A (process that clears a VM bit for a data page):
> > > 1. Acquires exclusive lock on data buffer
> > > 2. Acquires exclusive lock on VM buffer
> > > 3. clears VM bit
> > > 4. Releases VM buffer lock
> > > 5. Releases data buffer lock
> >
> > Well, but right this is a rather big difference. If vm pages get
> > unconditionally locked all the time we will have a huge source of new
> > contention as they are shared between so many heap pages.
>
> No, that is only for the process *clearing* the bit, and this already
> happens. I am not planning on introducing any new locks, aside from the
> buffer header lock when acquiring a pin. And I plan to keep those pins
> for long enough that those don't matter, either.
>
> Regards,
> Jeff Davis
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Sorry If I am being a bit naive, but shouldnt a simple mutex work in the
case when a process wants to change the VM bit in cache?

Mutex would be cheaper than locks.

Atri

--
Regards,

Atri
*l'apprenant*

From:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-19 18:29:16
Message-ID:	CAOeZVie44Edc2xPZtp9DH8yruHyjB3g=Xk8k8Jy0CXcPCyC0Gg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>
> It's quite common to load a lot of data, and then do some reads for a
> while (setting hint bits and flushing them to disk), and then do a
> VACUUM a while later, setting PD_ALL_VISIBLE and writing all of the
> pages again. Also, if I remember correctly, Robert went to significant
> effort when making the VM crash-safe to keep the PD_ALL_VISIBLE and VM
> bits consistent. Maybe this was all discussed before?
>
> All of these hint bits will have a bit more of a performance impact
> after checksums are introduced (for those that use them in conjunction
> with large data loads), so I'm looking for some simple ways to mitigate
> those effects. What kind of worst-case tests could I construct to see if
> there are worrying performance effects to removing these hint bits?
>
> Regards,
> Jeff Davis
>
>
>
>
I completely agree.In fact, that is the problem that we are trying to solve
in our patch(https://commitfest.postgresql.org/action/patch_view?id=991).
Essentially, we are trying to mitigate the expense of maintaining hint bits
in the cases when the user loads a lot of data, does some operations such
as SELECT, and deletes them all.We maintain a cache that can be used to
fetch the commit status of XMAX or XMIN instead of hint bits.As the cache
is single frame, it has no issues in replacement algorithm.Cache lookup is
pretty cheap.

I agree with the removal of PD_ALL_VISIBLE.AFAIK(pardon me if I am wrong, I
have been trying to research while following this thread), PD_AL_VISIBLE
was really useful when VM bits were not really safe, and crashes could lead
to redo setting the bit on the heap pages.

Atri

--
Regards,

Atri
*l'apprenant*

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-19 18:38:50
Message-ID:	1353350330.10198.133.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2012-11-19 at 23:50 +0530, Atri Sharma wrote:

> Sorry If I am being a bit naive, but shouldnt a simple mutex work in
> the case when a process wants to change the VM bit in cache?
>
> Mutex would be cheaper than locks.
>
I thought mutexes are locks?

The point is to avoid taking new locks (or mutexes) during a read of the
VM bit, because there is concern that it could cause contention. If we
lock the entire VM page, that represents many, many data pages, so it's
worrisome.

Regards,
Jeff Davis

From:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-19 18:43:06
Message-ID:	CAOeZVicPMe6oP3QpQOn8mv8PpNeNHbACYiwHcb0Csrgdtb=0yQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 20, 2012 at 12:08 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Mon, 2012-11-19 at 23:50 +0530, Atri Sharma wrote:
>
> > Sorry If I am being a bit naive, but shouldnt a simple mutex work in
> > the case when a process wants to change the VM bit in cache?
> >
> > Mutex would be cheaper than locks.
> >
> I thought mutexes are locks?
>
> The point is to avoid taking new locks (or mutexes) during a read of the
> VM bit, because there is concern that it could cause contention. If we
> lock the entire VM page, that represents many, many data pages, so it's
> worrisome.
>
> Regards,
> Jeff Davis
>
>
>
My mistake...I thought we were more concerned about the cost of locking.

I agree, locking many data pages simultaneously could be hazardous.

This requires more thought.Maybe removing PD_ALL_VISIBLE isnt such a great
idea after all...

Atri

--
Regards,

Atri
*l'apprenant*

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-19 18:46:37
Message-ID:	1353350797.10198.134.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2012-11-20 at 00:13 +0530, Atri Sharma wrote:

> My mistake...I thought we were more concerned about the cost of
> locking.
>
> I agree, locking many data pages simultaneously could be hazardous.
>
> This requires more thought.Maybe removing PD_ALL_VISIBLE isnt such a
> great idea after all...

As I said elsewhere in the thread, I'm not planning to introduce any
additional locking. There is already precedent in IndexOnlyNext.

Regards,
Jeff Davis

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-19 18:49:27
Message-ID:	20121119184927.GC3252@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2012-11-19 10:46:37 -0800, Jeff Davis wrote:
> On Tue, 2012-11-20 at 00:13 +0530, Atri Sharma wrote:
>
> > My mistake...I thought we were more concerned about the cost of
> > locking.
> >
> > I agree, locking many data pages simultaneously could be hazardous.
> >
> > This requires more thought.Maybe removing PD_ALL_VISIBLE isnt such a
> > great idea after all...
>
> As I said elsewhere in the thread, I'm not planning to introduce any
> additional locking. There is already precedent in IndexOnlyNext.

Yea, but that precedent is only valid because IOS does check for a MVCC
snapshots. Also it doesn't set anything.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-19 19:46:51
Message-ID:	11108.1353354411@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> As I said elsewhere in the thread, I'm not planning to introduce any
> additional locking. There is already precedent in IndexOnlyNext.

Of course, that just begs the question of whether the code in
IndexOnlyNext is correct. And after looking at it, it seems pretty
broken, or at least the argument for its correctness is broken.
That comment seems to consider only the case where the target tuple has
just been inserted --- but what of the case where the target tuple is in
process of being deleted? The deleter will not make a new index entry
for that. Of course, there's a pretty long code path from marking a
tuple deleted to committing the deletion, and it seems safe to assume
that the deleter will hit a write barrier in that path. But I don't
believe the argument here that the reader must hit a read barrier in
between.

regards, tom lane

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-20 21:42:25
Message-ID:	CAHyXU0wr-8LT9R9e0__CM-S+B=tTxu4uEFgerYayo-fHfMSXtg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 15, 2012 at 10:37 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 15 November 2012 22:21, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>>> Removing those 3 hints would give us 3 more flag bits (eventually, after
>>> we are sure they aren't just leftover), and it would also reduce the
>>> chance that a page is dirtied for no other reason than to set them.
>>
>> We aren't pressed for flag bits particularly. I think the main
>> attraction of this idea is precisely to reduce unnecessary page dirties,
>> and so that leads me to suggest a variant: keep the four bits defined as
>> now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the
>> page is already dirty. This would make it a straight-up trade of more
>> clog consultation for fewer page dirties.
>
> Hmm, I thought Alvaro wanted an extra flag bit for foreign key locks
> but couldn't find it.
>
> Come to think of it, why do we have XMIN_INVALID and XMAX_INVALID? We
> never need both at the same time, so we can just use one...
> XMIN_INVALID -> XACT_INVALID
> XMAX_INVALID == XMIN_COMMITTED | XACT_INVALID

Hm, I wonder if you could squeeze two bits out. ISTM here are the
interesting cases enumerated:

0: xmin unknown
1: xmin invalid
2: xmin valid, xmax unknown
3: xmin valid, xmax invalid
4: xmin valid, xmax valid

Did I miss any? If not, I think case #3 could be covered by utilizing
xmax == InvalidTransactionId or simply ignored. That makes the check
a little dirtier than a bit test though, but you could be sneaky and
map both xmin=valid cases to a bit.

merlin

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-20 21:58:46
Message-ID:	15965.1353448726@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
> Hm, I wonder if you could squeeze two bits out. ISTM here are the
> interesting cases enumerated:

> 0: xmin unknown
> 1: xmin invalid
> 2: xmin valid, xmax unknown
> 3: xmin valid, xmax invalid
> 4: xmin valid, xmax valid

> Did I miss any?

Yes. xmin unknown, xmax unknown is possible and different from all the
above, ie a tuple can be deleted by the creating transaction. (But it
could still be visible to some of that transaction's snapshots, so you
can't equate this state to "xmin invalid".)

There's a fairly big problem with any of these ideas, and it's not
even on-disk compatibility. It is that we assume that hint bits can be
set without exclusive lock on the buffer. If any of the transitions
xmin unknown -> xmin committed, xmin unknown -> xmin aborted,
xmax unknown -> xmax committed, xmax unknown -> xmax aborted
aren't expressed by setting a bit that wasn't set before, we probably
lose that property, and thereby a whole lot of concurrency.

regards, tom lane

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-21 22:20:16
Message-ID:	1353536416.11440.14.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2012-11-19 at 14:46 -0500, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > As I said elsewhere in the thread, I'm not planning to introduce any
> > additional locking. There is already precedent in IndexOnlyNext.
>
> Of course, that just begs the question of whether the code in
> IndexOnlyNext is correct. And after looking at it, it seems pretty
> broken, or at least the argument for its correctness is broken.
> That comment seems to consider only the case where the target tuple has
> just been inserted --- but what of the case where the target tuple is in
> process of being deleted? The deleter will not make a new index entry
> for that. Of course, there's a pretty long code path from marking a
> tuple deleted to committing the deletion, and it seems safe to assume
> that the deleter will hit a write barrier in that path. But I don't
> believe the argument here that the reader must hit a read barrier in
> between.

After digging in a little, this is bothering me now, too. I think it's
correct, and I agree with your analysis, but it seems a little complex
to explain why it works.

It also seems like a fortunate coincidence that we can detect clearing
the bit due to an insert, which we need to know about immediately; but
not detect a delete, which we don't care about until it commits.

I will try to update the comment along with my submission.

Regards,
Jeff Davis

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject:	Re: Do we need so many hint bits?
Date:	2012-11-27 15:47:58
Message-ID:	CA+TgmoY53B98Ba06qduy2sbBVLr58h_dj5Pb8xV6ypj4M-7bng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 19, 2012 at 2:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>> As I said elsewhere in the thread, I'm not planning to introduce any
>> additional locking. There is already precedent in IndexOnlyNext.
>
> Of course, that just begs the question of whether the code in
> IndexOnlyNext is correct. And after looking at it, it seems pretty
> broken, or at least the argument for its correctness is broken.
> That comment seems to consider only the case where the target tuple has
> just been inserted --- but what of the case where the target tuple is in
> process of being deleted? The deleter will not make a new index entry
> for that. Of course, there's a pretty long code path from marking a
> tuple deleted to committing the deletion, and it seems safe to assume
> that the deleter will hit a write barrier in that path. But I don't
> believe the argument here that the reader must hit a read barrier in
> between.

Ouch. I don't know how I missed that case when writing that comment,
but I agree it should have been discussed there. The timing
requirements for updates and deletes seem to be generally less strict,
because the update or delete won't generally be visible to our
snapshot anyway, so it doesn't really matter if we think the page is
all-visible for slightly longer than it really is. However, it would
be a problem if we thought the page was still all-visible when in fact
it contained a tuple that was not visible to our snapshot. For that
to happen, the updating or deleting transaction would have to commit
before we took our snapshot, yet the visibility map update would have
to still be visible after we took our snapshot. That clearly can't
happen, since both committing and taking a snapshot require LWLock
acquire/release.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company