From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: visibility map
Date: 2010-06-14 03:08:13
Message-ID: AANLkTilxav5NzXHdZ_K8M8gi0ARXt4jOsumLcIImXZRv@mail.gmail.com
Lists: pgsql-hackers

visibilitymap.c begins with a long and useful comment - but this part
seems to have a bit of split personality disorder.

* Currently, the visibility map is not 100% correct all the time.
* During updates, the bit in the visibility map is cleared after releasing
* the lock on the heap page. During the window between releasing the lock
* and clearing the bit in the visibility map, the bit in the visibility map
* is set, but the new insertion or deletion is not yet visible to other
* backends.
*
* That might actually be OK for the index scans, though. The newly inserted
* tuple wouldn't have an index pointer yet, so all tuples reachable from an
* index would still be visible to all other backends, and deletions wouldn't
* be visible to other backends yet. (But HOT breaks that argument, no?)

I believe that the answer to the parenthesized question here is "yes"
(in which case we might want to just delete this paragraph).

* There's another hole in the way the PD_ALL_VISIBLE flag is set. When
* vacuum observes that all tuples are visible to all, it sets the flag on
* the heap page, and also sets the bit in the visibility map. If we then
* crash, and only the visibility map page was flushed to disk, we'll have
* a bit set in the visibility map, but the corresponding flag on the heap
* page is not set. If the heap page is then updated, the updater won't
* know to clear the bit in the visibility map. (Isn't that prevented by
* the LSN interlock?)

I *think* that the answer to this parenthesized question is "no".
When we vacuum a page, we set the LSN on both the heap page and the
visibility map page. Therefore, neither of them can get written to
disk until the WAL record is flushed, but they could get flushed in
either order. So the visibility map page could get flushed before the
heap page, as the non-parenthesized portion of the comment indicates.
However, at least in theory, it seems like we could fix this up during
redo.
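
To spell out the interlock I'm describing, here's a toy model of the rule
(not code from the tree; all the names here are made up):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

typedef struct ToyPage
{
    ToyLsn lsn;          /* LSN of the last WAL record that touched this page */
    /* page contents elided */
} ToyPage;

static ToyLsn wal_flushed_upto;   /* how far WAL has been flushed to disk */

/* Vacuum stamps the same record LSN on the heap page and the map page. */
static void
toy_vacuum_mark_all_visible(ToyPage *heap_page, ToyPage *vm_page,
                            ToyLsn vacuum_record_lsn)
{
    heap_page->lsn = vacuum_record_lsn;
    vm_page->lsn = vacuum_record_lsn;
}

/*
 * The buffer manager may write out a page only once WAL is flushed past
 * the page's LSN, but it imposes no ordering between the two pages: the
 * map page can reach disk before the heap page, or vice versa.
 */
static bool
toy_page_may_be_written(const ToyPage *page)
{
    return page->lsn <= wal_flushed_upto;
}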

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: visibility map
Date: 2010-06-14 05:19:38
Message-ID: 4C15BBEA.500@enterprisedb.com
Lists: pgsql-hackers

On 14/06/10 06:08, Robert Haas wrote:
> visibilitymap.c begins with a long and useful comment - but this part
> seems to have a bit of split personality disorder.
>
> * Currently, the visibility map is not 100% correct all the time.
> * During updates, the bit in the visibility map is cleared after releasing
> * the lock on the heap page. During the window between releasing the lock
> * and clearing the bit in the visibility map, the bit in the visibility map
> * is set, but the new insertion or deletion is not yet visible to other
> * backends.
> *
> * That might actually be OK for the index scans, though. The newly inserted
> * tuple wouldn't have an index pointer yet, so all tuples reachable from an
> * index would still be visible to all other backends, and deletions wouldn't
> * be visible to other backends yet. (But HOT breaks that argument, no?)
>
> I believe that the answer to the parenthesized question here is "yes"
> (in which case we might want to just delete this paragraph).

A HOT update can only update non-indexed columns, so I think we're still
OK with HOT. To an index-only scan, it doesn't matter which tuple in a
HOT update chain you consider as live, because they all must have the
same value in the indexed columns. Subtle.

> * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
> * vacuum observes that all tuples are visible to all, it sets the flag on
> * the heap page, and also sets the bit in the visibility map. If we then
> * crash, and only the visibility map page was flushed to disk, we'll have
> * a bit set in the visibility map, but the corresponding flag on the heap
> * page is not set. If the heap page is then updated, the updater won't
> * know to clear the bit in the visibility map. (Isn't that prevented by
> * the LSN interlock?)
>
> I *think* that the answer to this parenthesized question is "no".
> When we vacuum a page, we set the LSN on both the heap page and the
> visibility map page. Therefore, neither of them can get written to
> disk until the WAL record is flushed, but they could get flushed in
> either order. So the visibility map page could get flushed before the
> heap page, as the non-parenthesized portion of the comment indicates.

Right.

> However, at least in theory, it seems like we could fix this up during
> redo.

Setting a bit in the visibility map is currently not WAL-logged, but yes,
once we add WAL-logging, that's straightforward to fix.
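
For illustration, something like this toy redo routine would keep the two
in sync after a crash (nothing like this exists yet; every name below is
made up):

/* Stand-ins for the real operations on the map fork and the heap page. */
extern void toy_vm_set_bit(unsigned heap_block);
extern void toy_heap_set_pd_all_visible(unsigned heap_block);

/* Hypothetical "heap page became all-visible" WAL record. */
typedef struct ToyVisibleRecord
{
    unsigned heap_block;
} ToyVisibleRecord;

/*
 * Replaying the record re-applies the change to *both* pages, so after
 * recovery the map bit and PD_ALL_VISIBLE cannot disagree, regardless of
 * which page happened to reach disk before the crash.
 */
static void
toy_redo_set_visible(const ToyVisibleRecord *rec)
{
    toy_vm_set_bit(rec->heap_block);
    toy_heap_set_pd_all_visible(rec->heap_block);
}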

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: visibility map
Date: 2010-11-22 19:24:55
Message-ID: AANLkTimuJy6tQXG4hhocRPhEqvrYODuMV6aN9+S6E15o@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 14, 2010 at 1:19 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I *think* that the answer to this parenthesized question is "no".
>> When we vacuum a page, we set the LSN on both the heap page and the
>> visibility map page.  Therefore, neither of them can get written to
>> disk until the WAL record is flushed, but they could get flushed in
>> either order.  So the visibility map page could get flushed before the
>> heap page, as the non-parenthesized portion of the comment indicates.
>
> Right.
>
>> However, at least in theory, it seems like we could fix this up during
>> redo.
>
> Setting a bit in the visibility map is currently not WAL-logged, but yes
> once we add WAL-logging, that's straightforward to fix.

Eh, so. Suppose - for the sake of argument - we do the following:

1. Allocate an additional infomask(2) bit that means "xmin is frozen,
no need to call XidInMVCCSnapshot()". When we freeze a tuple, we set
this bit in lieu of overwriting xmin. Note that freezing pages is
already WAL-logged, so redo is possible.

2. Modify VACUUM so that, when the page is observed to be all-visible,
it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the
visibility map bit, writing a single XLOG record for the whole
operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same
vacuum already removed tuples; otherwise and/or when no tuples were
removed writing XLOG_HEAP2_FREEZE or some new record type). This
loses no forensic information because of (1). (If the page is NOT
observed to be all-visible, we freeze individual tuples only when they
hit the current age thresholds.)

Setting the visibility map bit is now crash-safe.
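
To make (1) a bit more concrete, here's a toy sketch; the bit name, the
bit value, and everything else below are made up for illustration, not a
patch:

#include <stdbool.h>
#include <stdint.h>

#define TOY_XMIN_FROZEN  0x0001          /* hypothetical infomask2 bit */

typedef struct ToyTupleHeader
{
    uint32_t xmin;                       /* preserved even after freezing */
    uint16_t infomask2;
} ToyTupleHeader;

/*
 * Freezing sets the bit instead of overwriting xmin, so no forensic
 * information is lost; the change rides along in the (already existing)
 * freeze WAL record.
 */
static void
toy_freeze_tuple(ToyTupleHeader *tup)
{
    tup->infomask2 |= TOY_XMIN_FROZEN;
}

/* Stand-in for XidInMVCCSnapshot(): true if xid is still in progress
 * according to the caller's snapshot. */
extern bool toy_xid_in_snapshot(uint32_t xid);

/* The visibility check can short-circuit for frozen tuples. */
static bool
toy_xmin_visible(const ToyTupleHeader *tup)
{
    if (tup->infomask2 & TOY_XMIN_FROZEN)
        return true;                     /* frozen: no snapshot lookup needed */
    return !toy_xid_in_snapshot(tup->xmin);
}

The point is just that the original xmin stays readable while the common
case becomes a single bit test.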

Please poke holes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: 高增琦 <pgf00a(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: visibility map
Date: 2010-11-23 08:13:50
Message-ID: AANLkTinCZYET70ojyp+y2YXuVO4H4ZNY+_tiZVPoDS70@mail.gmail.com
Lists: pgsql-hackers

Can we just log the change of VM in log_heap_clean() for redo?
Thanks

--
GaoZengqi
pgf00a(at)gmail(dot)com
zengqigao(at)gmail(dot)com

On Tue, Nov 23, 2010 at 3:24 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Jun 14, 2010 at 1:19 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> >> I *think* that the answer to this parenthesized question is "no".
> >> When we vacuum a page, we set the LSN on both the heap page and the
> >> visibility map page. Therefore, neither of them can get written to
> >> disk until the WAL record is flushed, but they could get flushed in
> >> either order. So the visibility map page could get flushed before the
> >> heap page, as the non-parenthesized portion of the comment indicates.
> >
> > Right.
> >
> >> However, at least in theory, it seems like we could fix this up during
> >> redo.
> >
> > Setting a bit in the visibility map is currently not WAL-logged, but yes
> > once we add WAL-logging, that's straightforward to fix.
>
> Eh, so. Suppose - for the sake of argument - we do the following:
>
> 1. Allocate an additional infomask(2) bit that means "xmin is frozen,
> no need to call XidInMVCCSnapshot()". When we freeze a tuple, we set
> this bit in lieu of overwriting xmin. Note that freezing pages is
> already WAL-logged, so redo is possible.
>
> 2. Modify VACUUM so that, when the page is observed to be all-visible,
> it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the
> visibility map bit, writing a single XLOG record for the whole
> operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same
> vacuum already removed tuples; otherwise and/or when no tuples were
> removed writing XLOG_HEAP2_FREEZE or some new record type). This
> loses no forensic information because of (1). (If the page is NOT
> observed to be all-visible, we freeze individual tuples only when they
> hit the current age thresholds.)
>
> Setting the visibility map bit is now crash-safe.
>
> Please poke holes.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: visibility map
Date: 2010-11-23 08:42:13
Message-ID: 4CEB7E65.6000007@enterprisedb.com
Lists: pgsql-hackers

On 22.11.2010 21:24, Robert Haas wrote:
> Eh, so. Suppose - for the sake of argument - we do the following:
>
> 1. Allocate an additional infomask(2) bit that means "xmin is frozen,
> no need to call XidInMVCCSnapshot()". When we freeze a tuple, we set
> this bit in lieu of overwriting xmin. Note that freezing pages is
> already WAL-logged, so redo is possible.
>
> 2. Modify VACUUM so that, when the page is observed to be all-visible,
> it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the
> visibility map bit, writing a single XLOG record for the whole
> operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same
> vacuum already removed tuples; otherwise and/or when no tuples were
> removed writing XLOG_HEAP2_FREEZE or some new record type). This
> loses no forensic information because of (1). (If the page is NOT
> observed to be all-visible, we freeze individual tuples only when they
> hit the current age thresholds.)
>
> Setting the visibility map bit is now crash-safe.

That's an interesting idea. You piggyback setting the vm bit on the
freeze WAL record, on the assumption that you have to write the freeze
record anyway. However, if that assumption doesn't hold, because the
tuples are deleted before they reach vacuum_freeze_min_age, it's no
better than the naive approach of WAL-logging the vm bit set separately.
Whether that's acceptable or not, I don't know.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: visibility map
Date: 2010-11-23 15:51:01
Message-ID: AANLkTimGPG+D=7g=MLDw+Yi7jhE6Tg3RphV+Z8PBJNNd@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 23, 2010 at 3:42 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> That's an interesting idea. You piggyback setting the vm bit on the freeze
> WAL record, on the assumption that you have to write the freeze record
> anyway. However, if that assumption doesn't hold, because the tuples are
> deleted before they reach vacuum_freeze_min_age, it's no better than the
> naive approach of WAL-logging the vm bit set separately. Whether that's
> acceptable or not, I don't know.

I don't know, either. I was trying to think of the cases where this
would generate a net increase in WAL before I sent the email, but
couldn't fully wrap my brain around it at the time. Thanks for
summarizing.

Here's another design to poke holes in:

1. Imagine that the visibility map is divided into granules. For the
sake of argument let's suppose there are 8K bits per granule; thus
each granule covers 64M of the underlying heap and 1K of space in the
visibility map itself.

2. In shared memory, create a new array called the visibility vacuum
array (VVA), each element of which has room for a backend ID, a
relfilenode, a granule number, and an LSN. Before setting bits in the
visibility map, a backend is required to allocate a slot in this
array, XLOG the slot allocation, and fill in its backend ID,
relfilenode number, and the granule number whose bits it will be
manipulating, plus the LSN of the slot allocation XLOG record. It
then sets as many bits within that granule as it likes. When done, it
sets the backend ID of the VVA slot to InvalidBackendId but does not
remove it from the array immediately; such a slot is said to have been
"released".

3. When visibility map bits are set, the LSN of the page is set to the
new-VVA-slot XLOG record, so that the visibility map page can't hit
the disk before the new-VVA-slot XLOG record. Also, the contents of
the VVA, sans backend IDs, are XLOG'd at each checkpoint. Thus, on
redo, we can compute a list of all VVA slots for which visibility-bit
changes might already be on disk; we go through and clear both the
visibility map bit and the PD_ALL_VISIBLE bits on the underlying
pages.

4. To free a VVA slot that has been released, we must flush XLOG as far
as the record that allocated the slot and sync the visibility map and
heap segments containing that granule. Thus, all slots released
before a checkpoint starts can be freed after it completes.
Alternatively, an individual backend can free a previously-released
slot by performing the XLOG flush and the syncs itself. (This might
require a few more bookkeeping details to be stored in the VVA, but it
seems manageable.)
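
To make the bookkeeping in (2)-(4) concrete, here's a toy declaration of
what a VVA slot might hold; all names and sizes are illustrative only:

#include <stdint.h>

#define TOY_BITS_PER_GRANULE  8192   /* 8K bits: 64MB of heap, 1KB of map */
#define TOY_INVALID_BACKEND   (-1)

typedef struct ToyVVASlot
{
    int32_t   backend_id;    /* owner, or TOY_INVALID_BACKEND once released */
    uint32_t  relfilenode;   /* relation whose map bits are being set */
    uint32_t  granule;       /* which granule of that relation's map */
    uint64_t  alloc_lsn;     /* LSN of the slot-allocation XLOG record */
} ToyVVASlot;

/*
 * Per (2): a backend fills in the slot and XLOGs the allocation before
 * touching any bits in the granule.  Per (3): the map pages it then
 * dirties carry alloc_lsn, and the slot contents (sans backend_id) go
 * into each checkpoint.  Per (4): releasing just clears the owner; the
 * slot is freed only after WAL is flushed past alloc_lsn and the
 * affected map and heap segments are synced.
 */
static void
toy_vva_release(ToyVVASlot *slot)
{
    slot->backend_id = TOY_INVALID_BACKEND;
}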

One problem with this design is that the visibility map bits never get
set on standby servers. If we don't XLOG setting the bit then I
suppose that doesn't happen now either, but it's more sucky (that's
the technical term) if you're relying on it for index-only scans
(which are also relevant on the standby, either during HS or if
promoted) versus if you're only relying on it for vacuum (which
doesn't happen on the standby anyway unless and until it's promoted).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company