Quick Links

Re: crash-safe visibility map, take three

Lists:	pgsql-hackers

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	crash-safe visibility map, take three
Date:	2010-11-30 04:57:25
Message-ID:	AANLkTik49aAd_AY3-HyT+3vmSUuGW7Qy5LaT+q3aLQEi@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Last week, I posted a couple of possible designs for making the
visibility map crash-safe, which did not elicit much comment. Since
this is an important prerequisite to index-only scans, I'm trying
again.

http://archives.postgresql.org/pgsql-hackers/2010-11/msg01474.php
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01529.php

Here's a third possible design.

Instead of representing each page with a single bit in the visibility
map, use two bits. One bit indicates whether all tuples on the page
are all-visible (same as the current bit) - call this VM_ALL_VISIBLE.
The other bit is only used during VACUUM and indicates whether VACUUM
is trying to set the all-visible bit - call this
VM_BECOMING_ALL_VISIBLE. We observe the rule that any operation that
clears PD_ALL_VISIBLE must clear both the VM_ALL_VISIBLE and
VM_BECOMING_ALL_VISIBLE bits for that page in the visibility map.
VACUUM precedes as follows:

1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock for long enough to clear
them.
2. Scan the heap. When a page is observed to be all-visible, set
VM_BECOMING_ALL_VISIBLE and PD_ALL_VISIBLE.
3. Loop over shared buffers and write out every page to the OS which
belongs to the target relation, was marked all-visible in step 2, and
is still dirty. Note that this may require a WAL flush.
4. fsync() the heap.
5. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock, clear each such bit, set the
corresponding VM_ALL_VISIBLE bits and XLOG the page.

One might actually want to do steps 2-5 incrementally, in 1GB chunks,
so that you don't fsync() too much of the relation all at once.

If you tilt your head just right, the recurring problem in all of this
is that the heap page and the visibility map page can go to disk in
either order, and we have no way of knowing which one. A more radical
solution to this problem (and, thus, a fourth possible design) would
be to add a field to the buffer descriptor allowing one page to "wire"
another page into shared buffers. If the value is >0, it's the number
of a buffer it's currently wiring. If the value is <0, it's the
number of other buffers that have wired this buffer. A buffer both
wire another buffer and itself be wired at the same time. If the
value is =0, everything's normal. To make this work, though, you'd
have to complicate the checkpoint logic pretty considerably - make
sure all the unwired buffers are written and fsync'd first, thus
unwiring the remaining ones to be written and fsync'd in a second
pass; and there are also possible problems with very small relations,
where the number of wired buffers might grow to an uncomfortably high
percentage of the total. Waving my hands at all this complexity, you
could then make the heap pages wire the visibility map pages.

I can't say I'm totally in love with any of these designs. Anyone
else have any ideas, or any opinions about which one is best?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 07:34:04
Message-ID:	4CF4A8EC.2070408@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2010 06:57, Robert Haas wrote:
> I can't say I'm totally in love with any of these designs. Anyone
> else have any ideas, or any opinions about which one is best?

Well, the design I've been pondering goes like this:

At vacuum:

1. Write an "intent" XLOG record listing a chunk of visibility map bits
that are not currently set, that we are going to try to set. A chunk of
say 100 bits would be about right.

2. Scan the 100 heap pages as we currently do, setting the visibility
map bits as we go.

3. After the scan, lock the visibility map page, check which of the bits
that we set in step 2 are still set (concurrent updates might've cleared
some), and write a final XLOG record listing the set bits. This step
isn't necessary for correctness, BTW, but without it you lose all the
set bits if you crash before next checkpoint.

At replay, when we see the intent XLOG record, clear all the bits listed
in it. This ensures that if we crashed and some of the visibility map
bits were flushed to disk but the corresponding changes to the heap
pages were not, the bits are cleared. When we see the final XLOG record,
we set the bits.

Some care is needed with checkpoints. Setting visibility map bits in
step 2 is safe because crash recovery will replay the intent XLOG record
and clear any incorrectly set bits. But if a checkpoint has happened
after the intent XLOG record was written, that's not true. This can be
avoided by checking RedoRecPtr in step 2, and writing a new intent XLOG
record if it has changed since the last intent XLOG record was written.

There's a small race condition in the way a visibility map bit is
currently cleared. When a heap page is updated, it is locked, the update
is WAL-logged, and the lock is released. The visibility map page is
updated only after that. If the final vacuum XLOG record is written just
after updating the heap page, but before the visibility map bit is
cleared, replaying the final XLOG record will set a bit that should not
have been set.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Rob Wultsch <wultsch(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 13:18:12
Message-ID:	AANLkTinQeiV9HhkFDOUB=uewdNR4TryOLBieJ3x28PAD@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 29, 2010 at 9:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> 1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits
> are set, take the exclusive content lock for long enough to clear
> them.

I wonder what the performance hit will be to workloads with contention
and if this feature should be optional.

--
Rob Wultsch
wultsch(at)gmail(dot)com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 15:32:15
Message-ID:	AANLkTinZDakWkLQ9D=b7hrBB1A3bE2pPj1NoyT=KusyJ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Some care is needed with checkpoints. Setting visibility map bits in step 2
> is safe because crash recovery will replay the intent XLOG record and clear
> any incorrectly set bits. But if a checkpoint has happened after the intent
> XLOG record was written, that's not true. This can be avoided by checking
> RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
> since the last intent XLOG record was written.

It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?

> There's a small race condition in the way a visibility map bit is currently
> cleared. When a heap page is updated, it is locked, the update is
> WAL-logged, and the lock is released. The visibility map page is updated
> only after that. If the final vacuum XLOG record is written just after
> updating the heap page, but before the visibility map bit is cleared,
> replaying the final XLOG record will set a bit that should not have been
> set.

Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it. Crashes
are not so common that loss of even a full hour's visibility map bits
in the event that we have one seems worth killing ourselves over. And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires. Seems like we might be better off
writing that much less WAL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 15:38:02
Message-ID:	26198.1291131482@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 30.11.2010 06:57, Robert Haas wrote:
>> I can't say I'm totally in love with any of these designs. Anyone
>> else have any ideas, or any opinions about which one is best?

> Well, the design I've been pondering goes like this:

Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions? That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values. All of these other schemes seem too complicated and not
provably correct.

Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.

regards, tom lane

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 15:43:47
Message-ID:	4CF51BB3.10207@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2010 17:32, Robert Haas wrote:
> On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Some care is needed with checkpoints. Setting visibility map bits in step 2
>> is safe because crash recovery will replay the intent XLOG record and clear
>> any incorrectly set bits. But if a checkpoint has happened after the intent
>> XLOG record was written, that's not true. This can be avoided by checking
>> RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
>> since the last intent XLOG record was written.
>
> It seems like you'll need to hold some kind of lock between the time
> you examine RedoRecPtr and the time you actually examine the bit.
> WALInsertLock in shared mode, maybe?

It's enough to hold an exclusive lock on the visibility map page. You
have to set the bit first, and then check RedoRecPtr, and if it changed,
write the XLOG record before releasing the lock. If RedoRecPtr changes
any time before we check RedoRecPtr, we'll write the XLOG record so
we're safe. If it changes after that, we're safe because the checkpoint
will flush the updated heap page and visibility map page.

>> There's a small race condition in the way a visibility map bit is currently
>> cleared. When a heap page is updated, it is locked, the update is
>> WAL-logged, and the lock is released. The visibility map page is updated
>> only after that. If the final vacuum XLOG record is written just after
>> updating the heap page, but before the visibility map bit is cleared,
>> replaying the final XLOG record will set a bit that should not have been
>> set.
>
> Well, if that final XLOG record isn't necessary for correctness
> anyway, the obvious thing to do seems to be - don't write it. Crashes
> are not so common that loss of even a full hour's visibility map bits
> in the event that we have one seems worth killing ourselves over. And
> not everybody sets checkpoint_timeout to an hour, and not all
> checkpoints are triggered by checkpoint_timeout, and not all crashes
> happen just before it expires. Seems like we might be better off
> writing that much less WAL.

Yeah, possibly. It also means that the set bits will not propagate to
standby servers, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 15:46:44
Message-ID:	AANLkTim_SX_6UfU+DNkmfmFWuDQnN2Pd51qtP=c9t8xi@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 10:38 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 30.11.2010 06:57, Robert Haas wrote:
>>> I can't say I'm totally in love with any of these designs. Anyone
>>> else have any ideas, or any opinions about which one is best?
>
>> Well, the design I've been pondering goes like this:
>
> Wouldn't it be easier and more robust to just consider VM bit changes to
> be part of the WAL-logged actions? That would include updating LSNs on
> VM pages and flushing VM pages to disk during checkpoint based on their
> LSN values. All of these other schemes seem too complicated and not
> provably correct.

What WAL-logged actions?

The problem case is where a page has no tuples or line pointers that
need to be removed, and all we need to do is mark it all-visible. We
don't current WAL-log anything in that case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 15:53:25
Message-ID:	4CF51DF5.8060107@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2010 17:38, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 30.11.2010 06:57, Robert Haas wrote:
>>> I can't say I'm totally in love with any of these designs. Anyone
>>> else have any ideas, or any opinions about which one is best?
>
>> Well, the design I've been pondering goes like this:
>
> Wouldn't it be easier and more robust to just consider VM bit changes to
> be part of the WAL-logged actions? That would include updating LSNs on
> VM pages and flushing VM pages to disk during checkpoint based on their
> LSN values. All of these other schemes seem too complicated and not
> provably correct.

The vm bit can be set once all the tuples on the page become visible to
everyone. There is no WAL-logged action at that point we could piggyback on.

Clearing the bit is already handled like that - replay of heap
insert/update/delete records clear the visibility map bit.

> Of course, that'd mean doing the bit changes inside the critical
> sections for the related actions, so it's not a trivial change
> code-wise, but neither are these other ideas.

Yeah, I'm not terribly excited about any of these schemes. The "intent"
record seems like the simplest one, but even that is quite different
from the traditional WAL-logging we do that it makes me slightly nervous.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 15:54:50
Message-ID:	AANLkTin2nGKtRoT0mRzWxsnnnWShthiJ_WdW48t8ZuJ3@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 10:43 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> It seems like you'll need to hold some kind of lock between the time
>> you examine RedoRecPtr and the time you actually examine the bit.
>> WALInsertLock in shared mode, maybe?
>
> It's enough to hold an exclusive lock on the visibility map page. You have
> to set the bit first, and then check RedoRecPtr, and if it changed, write
> the XLOG record before releasing the lock. If RedoRecPtr changes any time
> before we check RedoRecPtr, we'll write the XLOG record so we're safe. If it
> changes after that, we're safe because the checkpoint will flush the updated
> heap page and visibility map page.

Brilliant. I assume that we need to call GetRedoRecPtr() after taking
the exclusive lock on the page, though?

> Yeah, possibly. It also means that the set bits will not propagate to
> standby servers, though.

That's definitely sucky, but in some ways it would be more complicated
if they did, because I don't think all-visible on the master implies
all-visible on the standby.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:02:05
Message-ID:	4CF51FFD.8030006@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Here's one more idea:

The trivial solution to this is to WAL-log setting the visibility map
bit, like we WAL-log any other operation. Lock the heap page, lock the
visibility map page, write WAL-record, and release locks. That works,
but the problem is that it creates quite a lot of new WAL traffic.

We could reduce the WAL traffic by simply updating multiple pages at a
time. Lock N pages, lock the visibility map page, write one WAL record,
and release locks. If N=10, for example, we only need to WAL-log a
couple of bytes per page, so the WAL volume should be acceptable. The
downside is that you need to keep more pages locked at the same time,
but maybe that's not too bad.

This wouldn't require anything special, which means fewer hard-to-debug
visibility & recovery bugs.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:10:13
Message-ID:	26979.1291133413@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 30.11.2010 17:38, Tom Lane wrote:
>> Wouldn't it be easier and more robust to just consider VM bit changes to
>> be part of the WAL-logged actions? That would include updating LSNs on
>> VM pages and flushing VM pages to disk during checkpoint based on their
>> LSN values. All of these other schemes seem too complicated and not
>> provably correct.

> The vm bit can be set once all the tuples on the page become visible to
> everyone. There is no WAL-logged action at that point we could piggyback on.

So you start emitting a WAL entry for the act of setting the VM bit
(and I guess the page header hint bit would be included in that too).

> Yeah, I'm not terribly excited about any of these schemes. The "intent"
> record seems like the simplest one, but even that is quite different
> from the traditional WAL-logging we do that it makes me slightly nervous.

I'm not convinced it works at all. Consider write intent record,
checkpoint, set bit, crash before completing vacuum. There will be
no second intent record at which you could clean up if things are
inconsistent.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:16:33
Message-ID:	27121.1291133793@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> The trivial solution to this is to WAL-log setting the visibility map
> bit, like we WAL-log any other operation. Lock the heap page, lock the
> visibility map page, write WAL-record, and release locks. That works,
> but the problem is that it creates quite a lot of new WAL traffic.

How much is "quite a lot"? Do we have any real reason to think that
this solution is unacceptable performance-wise?

I'd also suggest that if you want to prevent torn-page syndrome on VM
pages (and if you want to rely on their LSN values, you do) then you
probably don't have any choice anyway. VM pages will have to adhere to
the same write-full-page-on-first-mod-after-checkpoint rule as any other
page. I'd guess that this will swamp any savings from cutesy schemes
for reducing the number of WAL records.

> We could reduce the WAL traffic by simply updating multiple pages at a
> time. Lock N pages, lock the visibility map page, write one WAL record,
> and release locks.

I don't think that will work, because you have to hold the lock on a
page from the time you check that it's all-visible to the time you apply
the update. The loss of concurrency against updates would be pretty
bad, and I think you'd be creating significant risk of deadlocks from
holding multiple buffer locks at once.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:22:03
Message-ID:	AANLkTi=7St=B=MONtpwD7eTQe9o2Vzd2j5Vfs0wnmj0b@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> How much is "quite a lot"? Do we have any real reason to think that
> this solution is unacceptable performance-wise?

Well, let's imagine a 1GB insert-only table. It has 128K pages. If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.

But you did just dirty a gigabyte of data.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:27:04
Message-ID:	AANLkTinVXHXSi0Bgtz5_WHPnOqGcZj_DT6cB2c2_HS3Z@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 11:22 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> How much is "quite a lot"? Do we have any real reason to think that
>> this solution is unacceptable performance-wise?
>
> Well, let's imagine a 1GB insert-only table. It has 128K pages. If
> you XLOG setting the bit on each page, you'll need to write 128K WAL
> records, each containing a 12-byte relfilenode and a 4-byte block
> offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.
>
> But you did just dirty a gigabyte of data.

Oh, but it's worse than that. When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
So now you've turned 1GB of data to write into 2+ GB of data to
write.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:27:30
Message-ID:	4CF525F2.3030806@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2010 18:22, Robert Haas wrote:
> On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> How much is "quite a lot"? Do we have any real reason to think that
>> this solution is unacceptable performance-wise?
>
> Well, let's imagine a 1GB insert-only table. It has 128K pages. If
> you XLOG setting the bit on each page, you'll need to write 128K WAL
> records, each containing a 12-byte relfilenode and a 4-byte block
> offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.

Plus WAL headers, I think it's something like 32 or 40 bytes of WAL per
page.

> But you did just dirty a gigabyte of data.

Good point.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:30:53
Message-ID:	4CF526BD.7050203@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2010 18:10, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Yeah, I'm not terribly excited about any of these schemes. The "intent"
>> record seems like the simplest one, but even that is quite different
>> from the traditional WAL-logging we do that it makes me slightly nervous.
>
> I'm not convinced it works at all. Consider write intent record,
> checkpoint, set bit, crash before completing vacuum. There will be
> no second intent record at which you could clean up if things are
> inconsistent.

That's why you need to check the RedoRecPtr when you set the bit. If it
has changed, ie. a checkpoint has happened, the set bit step will write
a new intent record.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:33:18
Message-ID:	27556.1291134798@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Oh, but it's worse than that. When you XLOG a WAL record for each of
> those pages, you're going to trigger full-page writes for all of them.
> So now you've turned 1GB of data to write into 2+ GB of data to
> write.

No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:37:13
Message-ID:	27659.1291135033@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 30.11.2010 18:10, Tom Lane wrote:
>> I'm not convinced it works at all. Consider write intent record,
>> checkpoint, set bit, crash before completing vacuum. There will be
>> no second intent record at which you could clean up if things are
>> inconsistent.

> That's why you need to check the RedoRecPtr when you set the bit. If it
> has changed, ie. a checkpoint has happened, the set bit step will write
> a new intent record.

Oh, you explained the proposal poorly then. I thought you meant recheck
and write another intent record just once, immediately before sending
the final xlog record.

It still seems rickety and not clearly correct, especially when you
start thinking about all the other constraints we have on xlog behavior
(eg, does this work while taking a base backup).

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:40:10
Message-ID:	27743.1291135210@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> That's definitely sucky, but in some ways it would be more complicated
> if they did, because I don't think all-visible on the master implies
> all-visible on the standby.

Ouch. That seems like it could shoot down all these proposals. There
definitely isn't any way to make VM crash-safe if there is no WAL-driven
mechanism for setting the bits.

I guess what we need is a way to delay the application of such a WAL
record on the slave until it's safe, which means the record also has to
carry some indication of the youngest XMIN on the page.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:47:40
Message-ID:	AANLkTimD2ACTWWiCJowuoh9t5w79bxFVmnsvaqLxBdTi@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 11:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Oh, but it's worse than that. When you XLOG a WAL record for each of
>> those pages, you're going to trigger full-page writes for all of them.
>> So now you've turned 1GB of data to write into 2+ GB of data to
>> write.
>
> No, because only the first mod of each VM page would trigger a full page
> write, at least assuming a reasonable ordering of the operations.

I'm not worried about the full-page writes from updating the
visibility map - I'm worried about the full-page writes from updating
the heap. It doesn't matter a whit if we fail to set a bit in the
visibility map. What matters is if we DO set the bit in the visibility
map but FAIL TO set the bit in the heap, because then a subsequent
update to the heap page won't check the visibility map and clear the
bit. The *heap* updates are the ones that have to be guaranteed to
make it to disk.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:48:08
Message-ID:	4CF52AC8.8080004@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2010 18:40, Tom Lane wrote:
> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>> That's definitely sucky, but in some ways it would be more complicated
>> if they did, because I don't think all-visible on the master implies
>> all-visible on the standby.
>
> Ouch. That seems like it could shoot down all these proposals. There
> definitely isn't any way to make VM crash-safe if there is no WAL-driven
> mechanism for setting the bits.

Note that this is only a problem for *hot* standby. After failover, all
the tuples that were visible to everyone in the master are also visible
to all new transactions in the standby.

We dealt with this in 9.0 already, with the "killed" flag in index
tuples and the PD_ALL_VISIBLE flag in heap scans. We simply don't
believe them in hot standby mode, and check visibility even if the flag
is set.

> I guess what we need is a way to delay the application of such a WAL
> record on the slave until it's safe, which means the record also has to
> carry some indication of the youngest XMIN on the page.

Something like that would certainly be nice. With index-only scans, it
can be a big disappointment if you can't do an index-only scan in hot
standby.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:49:34
Message-ID:	4CF52B1E.3040106@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2010 18:33, Tom Lane wrote:
> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>> Oh, but it's worse than that. When you XLOG a WAL record for each of
>> those pages, you're going to trigger full-page writes for all of them.
>> So now you've turned 1GB of data to write into 2+ GB of data to
>> write.
>
> No, because only the first mod of each VM page would trigger a full page
> write, at least assuming a reasonable ordering of the operations.

If you change the LSN on the heap pages, you have to write full page
images of those as well.

Let's recap what happens when a VM bit is set: You set the
PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
usually isn't), and then set the bit in the VM while keeping the heap
page locked.

Can we get away with not setting the LSN on the heap page, even though
we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page
can be flushed to disk before the WAL record, but I think that's fine
because it's OK to have the flag set in the heap page even if the VM bit
is not set.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:54:43
Message-ID:	AANLkTi=q1YQdwU_7TpLT9kOjz5pMDbsq1=bOwehzhaVZ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> That's definitely sucky, but in some ways it would be more complicated
>> if they did, because I don't think all-visible on the master implies
>> all-visible on the standby.
>
> Ouch. That seems like it could shoot down all these proposals. There
> definitely isn't any way to make VM crash-safe if there is no WAL-driven
> mechanism for setting the bits.

Heikki's intent method works fine, because the WAL record only clears
the visibility map bits on redo; it never sets them.

I'm unexcited about inventing more ways to delay XLOG application on
the standby. We have enough of those already.

We could actually allow the slave to set the visibility map bits based
on its own xmin horizon. The only problem is that you wouldn't be
able to write the intent XLOG records. I suppose you could have a
separate file which is just used to store the intent records, and
designate a range of very-high numbered LSNs to mean blocks of the
intent file rather than a position in the regular WAL stream. VACUUM
is so much fun on the master, let's have it on the standby too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:55:44
Message-ID:	28180.1291136144@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Can we get away with not setting the LSN on the heap page, even though
> we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page
> can be flushed to disk before the WAL record, but I think that's fine
> because it's OK to have the flag set in the heap page even if the VM bit
> is not set.

Why is that fine? It's certainly not fine from the standpoint of
someone wondering why his index-only scan performs so badly.

I think all this hair-splitting about cases where it's okay to have one
bit set and not the other is misguided. To me, crash-safety of the VM
means that its copy of the page-header bit is right. Period. Yes, it
will cost something to ensure that; so what? If we don't get more than
enough compensating performance gain from index-only scans, the whole
patch is going to end up reverted.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:59:20
Message-ID:	28288.1291136360@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Ouch. That seems like it could shoot down all these proposals. There
>> definitely isn't any way to make VM crash-safe if there is no WAL-driven
>> mechanism for setting the bits.

> Heikki's intent method works fine, because the WAL record only clears
> the visibility map bits on redo; it never sets them.

Uh, no, because he also had that final WAL record that would set the
bits.

> We could actually allow the slave to set the visibility map bits based
> on its own xmin horizon.

Not in a crash-safe way, which is exactly the problem here.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 16:59:47
Message-ID:	AANLkTinOnXyYg9spEu01JHyUy+DhEJ-aa6bw7xjz0rv-@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 11:49 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 30.11.2010 18:33, Tom Lane wrote:
>>
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>>
>>> Oh, but it's worse than that. When you XLOG a WAL record for each of
>>> those pages, you're going to trigger full-page writes for all of them.
>>> So now you've turned 1GB of data to write into 2+ GB of data to
>>> write.
>>
>> No, because only the first mod of each VM page would trigger a full page
>> write, at least assuming a reasonable ordering of the operations.
>
> If you change the LSN on the heap pages, you have to write full page images
> of those as well.
>
> Let's recap what happens when a VM bit is set: You set the PD_ALL_VISIBLE
> flag on the heap page (assuming it's not set already, it usually isn't), and
> then set the bit in the VM while keeping the heap page locked.
>
> Can we get away with not setting the LSN on the heap page, even though we
> set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be
> flushed to disk before the WAL record, but I think that's fine because it's
> OK to have the flag set in the heap page even if the VM bit is not set.

I don't immediately see why that wouldn't work. As long as you bump
the LSN on the visibility map page, and also bump the LSN of the
visibility map page every time you clear a bit, I think you should be
OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:00:29
Message-ID:	AANLkTi=5A9oA4Uuk_6ceHVwBaCszfOBv-ZUN05t0_j5_@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 11:55 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Can we get away with not setting the LSN on the heap page, even though
>> we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page
>> can be flushed to disk before the WAL record, but I think that's fine
>> because it's OK to have the flag set in the heap page even if the VM bit
>> is not set.
>
> Why is that fine? It's certainly not fine from the standpoint of
> someone wondering why his index-only scan performs so badly.
>
> I think all this hair-splitting about cases where it's okay to have one
> bit set and not the other is misguided. To me, crash-safety of the VM
> means that its copy of the page-header bit is right. Period. Yes, it
> will cost something to ensure that; so what? If we don't get more than
> enough compensating performance gain from index-only scans, the whole
> patch is going to end up reverted.

We're not going to double the cost of VACUUM to get index-only scans.
And that's exactly what will happen if you do full-page writes of
every heap page to set a single bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:01:56
Message-ID:	AANLkTinMLD+V5uf6178YTB-Q4eqrt7Zvg+pObCuwehaJ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 11:59 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Ouch. That seems like it could shoot down all these proposals. There
>>> definitely isn't any way to make VM crash-safe if there is no WAL-driven
>>> mechanism for setting the bits.
>
>> Heikki's intent method works fine, because the WAL record only clears
>> the visibility map bits on redo; it never sets them.
>
> Uh, no, because he also had that final WAL record that would set the
> bits.

Well, as already discussed upthread, that WAL record causes some other
problems, so make it Heikki's intent method, without the final WAL
record that breaks things.

>> We could actually allow the slave to set the visibility map bits based
>> on its own xmin horizon.
>
> Not in a crash-safe way, which is exactly the problem here.

Brilliant selective quoting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:10:49
Message-ID:	28581.1291137049@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> We're not going to double the cost of VACUUM to get index-only scans.
> And that's exactly what will happen if you do full-page writes of
> every heap page to set a single bit.

It's ridiculous to claim that that "doubles the cost of VACUUM". In the
worst case, it will add 25% to the cost of setting an all-visible bit on
a page where there is no other work to do. (You already are writing out
the heap page and the VM page, plus a WAL image of the heap page, so a
WAL image of the VM page adds 25%. But only if you did not set any
other bits on the same VM page, which is probably not a real common
case.) Given that VACUUM has a lot of other cleanup besides visibility
bit setting, I'm not convinced that this would even be noticeable.

I think the burden is on people who are proposing complicated mechanisms
to show that there's actually a strong need for them.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:12:35
Message-ID:	AANLkTikAoV4DzuwSts4HZkSi65hsEGs5tX7WBYmdpq7z@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> We're not going to double the cost of VACUUM to get index-only scans.
>> And that's exactly what will happen if you do full-page writes of
>> every heap page to set a single bit.
>
> It's ridiculous to claim that that "doubles the cost of VACUUM". In the
> worst case, it will add 25% to the cost of setting an all-visible bit on
> a page where there is no other work to do. (You already are writing out
> the heap page and the VM page,

True.

> plus a WAL image of the heap page, so a

False. That is exactly what we are NOT doing now and what we must
find a way to avoid doing.

> WAL image of the VM page adds 25%. But only if you did not set any
> other bits on the same VM page, which is probably not a real common
> case.) Given that VACUUM has a lot of other cleanup besides visibility
> bit setting, I'm not convinced that this would even be noticeable.
>
> I think the burden is on people who are proposing complicated mechanisms
> to show that there's actually a strong need for them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:22:02
Message-ID:	28879.1291137722@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> It's ridiculous to claim that that "doubles the cost of VACUUM". In the
>> worst case, it will add 25% to the cost of setting an all-visible bit on
>> a page where there is no other work to do. (You already are writing out
>> the heap page and the VM page,

> True.

>> plus a WAL image of the heap page, so a

> False. That is exactly what we are NOT doing now and what we must
> find a way to avoid doing.

I do not accept that argument. You can't make an omelette without
breaking eggs, and the cost of index-only scans is going to be that
it costs more to get the visibility bits set in the first place.

But having said that, I wonder whether we need a full-page image for
a WAL-logged action that is known to involve only setting a single bit
and updating LSN. Would omitting the FPI be any more risky than what
happens now (ie, the page does get written back to disk at some point,
without any image from which it can be rewritten if the write fails...)

regards, tom lane

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:25:18
Message-ID:	4CF5337E.8000505@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2010 19:22, Tom Lane wrote:
> But having said that, I wonder whether we need a full-page image for
> a WAL-logged action that is known to involve only setting a single bit
> and updating LSN. Would omitting the FPI be any more risky than what
> happens now (ie, the page does get written back to disk at some point,
> without any image from which it can be rewritten if the write fails...)

You have to write a full-page image if you update the LSN, because
otherwise the next update that comes along will not write a full page image.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:25:41
Message-ID:	AANLkTinM57GpFRy78HniGQb+ojdMTtx-u7U2T3n9b0Ge@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> But having said that, I wonder whether we need a full-page image for
> a WAL-logged action that is known to involve only setting a single bit
> and updating LSN. Would omitting the FPI be any more risky than what
> happens now (ie, the page does get written back to disk at some point,
> without any image from which it can be rewritten if the write fails...)

That's pretty much exactly what Heikki proposed 35 minutes ago, and
you objected 6 minutes later. I still think it might work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:26:08
Message-ID:	AANLkTimPZJ6200fAdOz1RwGh3hegw+ZZrHpkMui7ChKt@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 12:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> But having said that, I wonder whether we need a full-page image for
>> a WAL-logged action that is known to involve only setting a single bit
>> and updating LSN. Would omitting the FPI be any more risky than what
>> happens now (ie, the page does get written back to disk at some point,
>> without any image from which it can be rewritten if the write fails...)
>
> That's pretty much exactly what Heikki proposed 35 minutes ago, and
> you objected 6 minutes later. I still think it might work.

Oh, I see the difference now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-11-30 17:36:07
Message-ID:	29237.1291138567@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 30.11.2010 19:22, Tom Lane wrote:
>> But having said that, I wonder whether we need a full-page image for
>> a WAL-logged action that is known to involve only setting a single bit
>> and updating LSN.

> You have to write a full-page image if you update the LSN, because
> otherwise the next update that comes along will not write a full page image.

Um. Drat. I was thinking about the replay side, where I think it would
actually work --- but you're right, it would break the logic on the
generation side. Unless you want to put in some kind of flag saying
"this was only a visibility bit update, any bigger update still needs
to write an FPI".

regards, tom lane

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 01:35:56
Message-ID:	201012010135.oB11Zud04761@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> On 30.11.2010 18:33, Tom Lane wrote:
> > Robert Haas<robertmhaas(at)gmail(dot)com> writes:
> >> Oh, but it's worse than that. When you XLOG a WAL record for each of
> >> those pages, you're going to trigger full-page writes for all of them.
> >> So now you've turned 1GB of data to write into 2+ GB of data to
> >> write.
> >
> > No, because only the first mod of each VM page would trigger a full page
> > write, at least assuming a reasonable ordering of the operations.
>
> If you change the LSN on the heap pages, you have to write full page
> images of those as well.
>
> Let's recap what happens when a VM bit is set: You set the
> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
> usually isn't), and then set the bit in the VM while keeping the heap
> page locked.

What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
happen so the heap page is guaranteed to be on disk, then on next read,
if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
VM bit.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 07:35:40
Message-ID:	4CF5FACC.6040404@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01.12.2010 03:35, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> Let's recap what happens when a VM bit is set: You set the
>> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
>> usually isn't), and then set the bit in the VM while keeping the heap
>> page locked.
>
> What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
> happen so the heap page is guaranteed to be on disk, then on next read,
> if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
> VM bit.

Hmm, you'd somehow have to know if a checkpoint has happened since the
flag was set. And it might be a long wait, which makes it less
attractive for index-only scans.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 13:39:07
Message-ID:	201012011339.oB1Dd7L22590@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> On 01.12.2010 03:35, Bruce Momjian wrote:
> > Heikki Linnakangas wrote:
> >> Let's recap what happens when a VM bit is set: You set the
> >> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
> >> usually isn't), and then set the bit in the VM while keeping the heap
> >> page locked.
> >
> > What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
> > happen so the heap page is guaranteed to be on disk, then on next read,
> > if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
> > VM bit.
>
> Hmm, you'd somehow have to know if a checkpoint has happened since the
> flag was set. And it might be a long wait, which makes it less

Well, doesn't the page LSN tell you that already?

> attractive for index-only scans.

My assumption is that this page will remain read-only for a while, so I
don't see the big downside in a delay of max 5-10 minutes. For sites
where pages go frequently in and out of read-only status, I don't think
index-only scans are every going to be a big win. Long-running queries
alone are going to delay how quickly we can set PD_ALL_VISIBLE, so I
don't see an additional 5-10 minutes as a big problem.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 13:53:12
Message-ID:	4CF65348.7070903@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01.12.2010 15:39, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> On 01.12.2010 03:35, Bruce Momjian wrote:
>>> Heikki Linnakangas wrote:
>>>> Let's recap what happens when a VM bit is set: You set the
>>>> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
>>>> usually isn't), and then set the bit in the VM while keeping the heap
>>>> page locked.
>>>
>>> What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
>>> happen so the heap page is guaranteed to be on disk, then on next read,
>>> if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
>>> VM bit.
>>
>> Hmm, you'd somehow have to know if a checkpoint has happened since the
>> flag was set. And it might be a long wait, which makes it less
>
> Well, doesn't the page LSN tell you that already?

If we update the LSN when we set the flag, then we have to write the
full-page-image. That's very expensive. If we don't update the LSN, then
the LSN says nothing about when the flag was set.

>> attractive for index-only scans.
>
> My assumption is that this page will remain read-only for a while, so I
> don't see the big downside in a delay of max 5-10 minutes. For sites
> where pages go frequently in and out of read-only status, I don't think
> index-only scans are every going to be a big win. Long-running queries
> alone are going to delay how quickly we can set PD_ALL_VISIBLE, so I
> don't see an additional 5-10 minutes as a big problem.

Perhaps we could live with it, but it would be annoying to have to
checkpoint after a data load, before index-only scans kick in.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 14:57:51
Message-ID:	4CF60E0F0200002500037FEE@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> it would be annoying to have to checkpoint after a data load

Heck, in my world it's currently pretty much a necessity to run
VACUUM FREEZE ANALYZE on a table after a data load before it's
reasonable to expose the table to production use. It would hardly
be an inconvenience to also run a CHECKPOINT.

-Kevin

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 15:36:51
Message-ID:	201012011536.oB1Fapf16029@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> On 01.12.2010 15:39, Bruce Momjian wrote:
> > Heikki Linnakangas wrote:
> >> On 01.12.2010 03:35, Bruce Momjian wrote:
> >>> Heikki Linnakangas wrote:
> >>>> Let's recap what happens when a VM bit is set: You set the
> >>>> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
> >>>> usually isn't), and then set the bit in the VM while keeping the heap
> >>>> page locked.
> >>>
> >>> What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
> >>> happen so the heap page is guaranteed to be on disk, then on next read,
> >>> if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
> >>> VM bit.
> >>
> >> Hmm, you'd somehow have to know if a checkpoint has happened since the
> >> flag was set. And it might be a long wait, which makes it less
> >
> > Well, doesn't the page LSN tell you that already?
>
> If we update the LSN when we set the flag, then we have to write the
> full-page-image. That's very expensive. If we don't update the LSN, then
> the LSN says nothing about when the flag was set.

Oh, we don't update the LSN when we set the PD_ALL_VISIBLE flag? OK,
please let me think some more. Thanks.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 15:46:21
Message-ID:	AANLkTikU5VtO1XW5-=gLvJA=cX66CsMo42AKVWYMGHZW@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 1, 2010 at 9:57 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> it would be annoying to have to checkpoint after a data load
>
> Heck, in my world it's currently pretty much a necessity to run
> VACUUM FREEZE ANALYZE on a table after a data load before it's
> reasonable to expose the table to production use. It would hardly
> be an inconvenience to also run a CHECKPOINT.

The goal here is to make that better, not worse!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 16:25:36
Message-ID:	AANLkTi=m063OeTLPWzixSDahsO9Aep9uRSs0DHYNMdgp@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 1, 2010 at 10:36 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Oh, we don't update the LSN when we set the PD_ALL_VISIBLE flag? OK,
> please let me think some more. Thanks.

As far as I can tell, there are basically two viable solutions on the
table here.

1. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
bit in the visibility map page, bumping the LSN as usual, and (c) emit
a WAL record indicating the relation and block number. On redo of
this record, set both the page-level bit and the visibility map bit.
The heap page may hit the disk before the WAL record, but that's OK;
it just might result in a little extra work until some subsequent
operation gets the visibility map bit set. The visibility map page
page may hit the disk before the heap page, but that's OK too, because
the WAL record will already be on disk due to the LSN interlock. If a
crash occurs before the heap page is flushed, redo will fix the heap
page. (The heap page will get flushed as part of the next checkpoint,
if not sooner, so by the time the redo pointer advances past the WAL
record, there's no longer a risk.)

2. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN, (b) set the
bit in the visibility map page, bumping the LSN if a WAL record is
issued (which only happens sometimes, read on), and (c) emit a WAL
record indicating the "chunk" of 128 visibility map bits which
contains the bit we just set - but only if we're now dealing with a
new group of 128 visibility map bits or if a checkpoint has intervened
since the last such record we emitted. On redo of this record, clear
the visibility map bits in each chunk. The heap page may hit the disk
before the WAL record, but that's OK for the same reasons as in plan
#1. The visibility map page may hit the disk before the heap page,
but that's OK too, because the WAL record will already be on disk to
due the LSN interlock. If a crash occurs before the heap page makes
it to disk, then redo will clear the visibility map bits, leaving them
to be reset by a subsequent VACUUM.

As is typical with good ideas, neither of these seems terribly
complicated in retrospect. Kudos to Heikki for thinking them up and
explaining them.

After some thought, I think that approach #1 is probably better,
because it propagates visibility map bits to the standby. During
index-only scans, the standby will have to ignore them during HS
operation just as it currently ignores the PD_ALL_VISIBLE page-level
bit, but if and when the standby is promoted to master, it's important
to have those bits already set, both for index-only scans and also
because, absent that, the first autovacuum on each table will end up
scanning the whole things and dirtying tremendous gobs of data setting
all those bits, which is just the sort of ugly surprise that we don't
want to give people right after they've been forced to perform a
failover.

I think we can improve this a bit further by also introducing a
HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
FrozenXID. This allows us to freeze tuples aggressively - if we want
- without losing any forensic information. We can then modify the
above algorithm slightly, so that when we observe that a page is all
visible, we not only set PD_ALL_VISIBLE on the page but also
HEAP_XMIN_FROZEN on each tuple. The WAL record marking the page as
all-visible then doubles as a WAL record marking it frozen,
eliminating the need to dirty the page yet again at anti-wraparound
vacuum time. It'll still be a net increase in WAL volume (as Heikki
pointed out) but the added WAL volume is small compared with the I/O
involved in writing out the dirty heap pages (as Tom pointed out), so
it should hopefully be OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 16:40:39
Message-ID:	4CF67A87.5040101@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01.12.2010 18:25, Robert Haas wrote:
> I think we can improve this a bit further by also introducing a
> HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
> FrozenXID. This allows us to freeze tuples aggressively - if we want
> - without losing any forensic information. We can then modify the
> above algorithm slightly, so that when we observe that a page is all
> visible, we not only set PD_ALL_VISIBLE on the page but also
> HEAP_XMIN_FROZEN on each tuple.

Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page
header over the xmin/xmax on the tuples, we could simply not bother
doing anti-wraparound vacuums for pages that have the flag set. I'm not
sure what changes that would require outside heapam.c, as we'd have to
be careful to not trust the xmin/xmax if the flag was set.

The first update on the page that clears the flag would need to freeze
all the tuples in that scheme.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 16:40:46
Message-ID:	17462.1291221646@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> As far as I can tell, there are basically two viable solutions on the
> table here.

> 1. Every time we observe a page as all-visible, (a) set the
> PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
> bit in the visibility map page, bumping the LSN as usual, and (c) emit
> a WAL record indicating the relation and block number. On redo of
> this record, set both the page-level bit and the visibility map bit.
> The heap page may hit the disk before the WAL record, but that's OK;

Um, no it isn't. Suppose the heap page gets to disk but we crash before
the WAL record does. Now we have a persistent state where the heap page
is marked PD_ALL_VISIBLE but the corresponding VM bit is not set. The
VM bit will never become set, either, because operations on the heap
page will see PD_ALL_VISIBLE and assume it already is set. This state
of affairs might be acceptable from a correctness standpoint, but not
from a performance standpoint.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 16:43:39
Message-ID:	17527.1291221819@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page
> header over the xmin/xmax on the tuples, we could simply not bother
> doing anti-wraparound vacuums for pages that have the flag set. I'm not
> sure what changes that would require outside heapam.c, as we'd have to
> be careful to not trust the xmin/xmax if the flag was set.

That seems pretty ugly/dangerous. If we're going to try to do something
here, I much prefer Robert's approach of marking each tuple in the tuple
header.

regards, tom lane

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 16:44:12
Message-ID:	4CF67B5C.9030003@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01.12.2010 18:40, Tom Lane wrote:
> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>> As far as I can tell, there are basically two viable solutions on the
>> table here.
>
>> 1. Every time we observe a page as all-visible, (a) set the
>> PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
>> bit in the visibility map page, bumping the LSN as usual, and (c) emit
>> a WAL record indicating the relation and block number. On redo of
>> this record, set both the page-level bit and the visibility map bit.
>> The heap page may hit the disk before the WAL record, but that's OK;
>
> Um, no it isn't. Suppose the heap page gets to disk but we crash before
> the WAL record does. Now we have a persistent state where the heap page
> is marked PD_ALL_VISIBLE but the corresponding VM bit is not set. The
> VM bit will never become set, either, because operations on the heap
> page will see PD_ALL_VISIBLE and assume it already is set. This state
> of affairs might be acceptable from a correctness standpoint, but not
> from a performance standpoint.

The next vacuum will fix it. We already handle that. It's no different
from the situation where neither change makes it to the disk.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 16:48:03
Message-ID:	17653.1291222083@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 01.12.2010 18:40, Tom Lane wrote:
>> Um, no it isn't. Suppose the heap page gets to disk but we crash before
>> the WAL record does. Now we have a persistent state where the heap page
>> is marked PD_ALL_VISIBLE but the corresponding VM bit is not set. The
>> VM bit will never become set, either, because operations on the heap
>> page will see PD_ALL_VISIBLE and assume it already is set. This state
>> of affairs might be acceptable from a correctness standpoint, but not
>> from a performance standpoint.

> The next vacuum will fix it. We already handle that. It's no different
> from the situation where neither change makes it to the disk.

Well, as long as we're careful not to optimize away setting the VM bit
on the grounds that PD_ALL_VISIBLE is already set, I suppose that will
work.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 17:22:41
Message-ID:	18183.1291224161@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I think we can improve this a bit further by also introducing a
> HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
> FrozenXID. This allows us to freeze tuples aggressively - if we want
> - without losing any forensic information.

So far so good ...

> We can then modify the
> above algorithm slightly, so that when we observe that a page is all
> visible, we not only set PD_ALL_VISIBLE on the page but also
> HEAP_XMIN_FROZEN on each tuple. The WAL record marking the page as
> all-visible then doubles as a WAL record marking it frozen,
> eliminating the need to dirty the page yet again at anti-wraparound
> vacuum time.

but this seems a lot more dubious/fragile. The basic problem is that
it's not clear whether HEAP_XMIN_FROZEN is a hint bit or essential
data. If you want to set it without the overhead of an LSN bump or a
possible FPI in WAL, then it's a hint bit. But if you're using it to
protect clog truncation then it's essential data. Perhaps you can make
this work but there are some nonobvious requirements:

1. Seeing PD_ALL_VISIBLE set does not excuse vacuum from having to
iterate through all the tuples on the page checking for
HEAP_XMIN_FROZEN. This is because the non-logged update of the page
might have been torn on the way to disk, such that PD_ALL_VISIBLE got
set but not all of the FROZEN bits did.

2. During an anti-wraparound vacuum, you *need to* emit a WAL record
when setting HEAP_XMIN_FROZEN. It's not a hint, any more than writing
FrozenXID is now.

Actually, #2 isn't even good enough. What if vacuum passes over a page
and finds all the FROZEN bits set, but the reason they're set is that
somebody else updated them in hint fashion microseconds before? It
seems possible that those bits might not make it to disk before a
subsequent crash. The only way to be really sure those bits are set is
to emit a WAL record that says to set them, whether or not they seem to
be set already. While the WAL record could be small, you'd need one for
every page, making the argument that this saves I/O somewhat dubious.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 17:23:50
Message-ID:	AANLkTi=d=8oJzvLbUmuoDdViSG9O+1uamC3Gj3OUs+r6@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 1, 2010 at 11:40 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 01.12.2010 18:25, Robert Haas wrote:
>>
>> I think we can improve this a bit further by also introducing a
>> HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
>> FrozenXID. This allows us to freeze tuples aggressively - if we want
>> - without losing any forensic information. We can then modify the
>> above algorithm slightly, so that when we observe that a page is all
>> visible, we not only set PD_ALL_VISIBLE on the page but also
>> HEAP_XMIN_FROZEN on each tuple.
>
> Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page header
> over the xmin/xmax on the tuples, we could simply not bother doing
> anti-wraparound vacuums for pages that have the flag set. I'm not sure what
> changes that would require outside heapam.c, as we'd have to be careful to
> not trust the xmin/xmax if the flag was set.
>
> The first update on the page that clears the flag would need to freeze all
> the tuples in that scheme.

That seems more complex for no particular gain. I guess it saves an
infomask bit, but I'm willing to burn one to reduce code complexity.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 17:31:36
Message-ID:	AANLkTinjMjut9Xv2RV8W=8tUAsk+hCEj0QXNA1s=EdmE@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 1, 2010 at 12:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think we can improve this a bit further by also introducing a
>> HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
>> FrozenXID. This allows us to freeze tuples aggressively - if we want
>> - without losing any forensic information.
>
> So far so good ...
>
>> We can then modify the
>> above algorithm slightly, so that when we observe that a page is all
>> visible, we not only set PD_ALL_VISIBLE on the page but also
>> HEAP_XMIN_FROZEN on each tuple. The WAL record marking the page as
>> all-visible then doubles as a WAL record marking it frozen,
>> eliminating the need to dirty the page yet again at anti-wraparound
>> vacuum time.
>
> but this seems a lot more dubious/fragile. The basic problem is that
> it's not clear whether HEAP_XMIN_FROZEN is a hint bit or essential
> data. If you want to set it without the overhead of an LSN bump or a
> possible FPI in WAL, then it's a hint bit. But if you're using it to
> protect clog truncation then it's essential data. Perhaps you can make
> this work but there are some nonobvious requirements:
>
> 1. Seeing PD_ALL_VISIBLE set does not excuse vacuum from having to
> iterate through all the tuples on the page checking for
> HEAP_XMIN_FROZEN. This is because the non-logged update of the page
> might have been torn on the way to disk, such that PD_ALL_VISIBLE got
> set but not all of the FROZEN bits did.

Good point. If we see the bit set in the visibility map set, it
should be safe to infer that the PD_ALL_VISIBLE bit and all
HEAP_XMIN_FROZEN bits are set. But if the visibility map bit is NOT
set, we must check PD_ALL_VISIBLE and, whether it's set or not, each
individual HEAP_XMIN_FROZEN bit.

> 2. During an anti-wraparound vacuum, you *need to* emit a WAL record
> when setting HEAP_XMIN_FROZEN. It's not a hint, any more than writing
> FrozenXID is now.
>
> Actually, #2 isn't even good enough. What if vacuum passes over a page
> and finds all the FROZEN bits set, but the reason they're set is that
> somebody else updated them in hint fashion microseconds before? It
> seems possible that those bits might not make it to disk before a
> subsequent crash. The only way to be really sure those bits are set is
> to emit a WAL record that says to set them, whether or not they seem to
> be set already. While the WAL record could be small, you'd need one for
> every page, making the argument that this saves I/O somewhat dubious.

I think that we would only ever allow HEAP_XMIN_FROZEN to be set as
part of a WAL-logged operation. Either we are marking the page
all-visible - in which case we're emitting the new WAL record type
XLOG_HEAP_ALLVISIBLE - or we're freezing individual tuples on a page
where very old and very new tuples are intermixed - in which case we
emit the existing XLOG_HEAP2_FREEZE.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 20:31:27
Message-ID:	1291235487.13479.70.camel@jdavis-ux.asterdata.local
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-12-01 at 11:25 -0500, Robert Haas wrote:
> 1. Every time we observe a page as all-visible, (a) set the
> PD_ALL_VISIBLE bit on the page, without bumping the LSN;

...

> 2. Every time we observe a page as all-visible, (a) set the
> PD_ALL_VISIBLE bit on the page, without bumping the LSN,

My concern here is that both of these proposals introduce something new
that is hint-bit-like, in the sense that we change a data page and
potentially write it to disk without necessarily writing WAL. This will
make it even harder to do CRCs in the future, which are also an
important feature.

I understand the motivation: we don't want to write a page to disk yet
one more time. I think this is only a problem for bulk loading, where it
does seem pretty ridiculous. But for cases other than bulk loading, it
doesn't seem like a problem (likely, there will be other changes to the
page anyway).

Is it worth taking a step back, and saying that we're trying to hard to
use the exact same mechanism for OLTP and bulk loading?

I know bulk loading poses a number of problems. But it might be easier
to make a passable bulk-loading mechanism (not necessarily a general
one) than to continue to fight the WAL-before-data invariant on every
new feature[1].

I'm not proposing anything in specific, just a potential re-framing of
the problem.

Regards,
Jeff Davis

[1] If we obey WAL-before-data everywhere WAL is required, the
correctness can be reasonably understood by mere mortals. But when we
start breaking it for special cases, observe how few people can even
participate in such discussions.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 20:59:49
Message-ID:	AANLkTikaULiXb_p2o2PJebEZ3runAYp9ovoBFHaggWJo@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 1, 2010 at 3:31 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Wed, 2010-12-01 at 11:25 -0500, Robert Haas wrote:
>> 1. Every time we observe a page as all-visible, (a) set the
>> PD_ALL_VISIBLE bit on the page, without bumping the LSN;
>
> ...
>
>> 2. Every time we observe a page as all-visible, (a) set the
>> PD_ALL_VISIBLE bit on the page, without bumping the LSN,
>
> My concern here is that both of these proposals introduce something new
> that is hint-bit-like, in the sense that we change a data page and
> potentially write it to disk without necessarily writing WAL. This will
> make it even harder to do CRCs in the future, which are also an
> important feature.

Well, there's an intermediate concept between "logged" and "unlogged",
which is an update that is WAL-logged but doesn't require torn-page
protection because writing some but not all of the 512-byte sectors in
the page won't break anything. We're using the LSN to track both
whether the page is logged and whether the update necessitates
torn-page protection. That might be fixable.

As for CRCs, there's a pretty direct chain of inference here:

1. CRCs are hard (really impossible) because we have hint bits.
2. Hint bits are necessary because an old XID can't be viewed as
guaranteed committed.
3. An old XID can't be viewed as guaranteed committed because we clean
up aborted transactions lazily rather than eagerly.

Changing (3) would amount to a rewrite of our whole MVCC architecture.
If we switched from per-tuple MVCC based on XIDs to per-page MVCC
based on LSNs and a rollback segment, all of this stuff would go out
the window. Hint bits, gone. Anti-wraparound VACUUM, gone. CRCs,
feasible. Visibility map... we might still need that, but the
page-level bits go away.

Of course, it would also create new problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 21:27:28
Message-ID:	21728.1291238848@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> If we switched from per-tuple MVCC based on XIDs to per-page MVCC
> based on LSNs and a rollback segment, all of this stuff would go out
> the window. Hint bits, gone. Anti-wraparound VACUUM, gone. CRCs,
> feasible. Visibility map... we might still need that, but the
> page-level bits go away.

> Of course, it would also create new problems.

Yup, we've seen that proposal before. It's called Oracle. There's
no good reason to believe that we'd have a net win after we were done
switching over ... not to mention the likelihood that they hold a ton
of patents about particular aspects of doing things that way.

regards, tom lane

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 22:24:22
Message-ID:	1291242262.21972.35.camel@jdavis-ux.asterdata.local
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-12-01 at 15:59 -0500, Robert Haas wrote:
> As for CRCs, there's a pretty direct chain of inference here:
>
> 1. CRCs are hard (really impossible) because we have hint bits.

I would disagree with "impossible". If we don't set hint bits during
reading; and when we do set them, we log them (including full page
writes); then we can do CRCs.

Those things have costs, but we might be willing to pay them if we had a
bulk loading strategy that avoids or mitigates the costs.

The reason we can't do CRCs now is because hint bits violate the
WAL-before-data rule; not because of hint bits themselves. We're talking
about adding another feature that breaks the rule, in a more complex way
than hint bits.

I just wanted to step back for a second and consider the problem from a
different angle before we committed to that.

Regards,
Jeff Davis

From:	Jim Nasby <jim(at)nasby(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-01 23:41:33
Message-ID:	CCE55EDE-8354-4E2B-A815-E5A29018D33F@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 1, 2010, at 2:59 PM, Robert Haas wrote:
> 2. Hint bits are necessary because an old XID can't be viewed as
> guaranteed committed.

Hmm... I thought hint bits were necessary because it's too expensive to query CLOG for every tuple. If my understanding is correct then if we fix the CLOG performance issue we don't need hint bits anymore.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-02 04:22:12
Message-ID:	AANLkTi=DOvFWWZFNxJObeAWiEu9dyAcTxLUQCgA1fNdt@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 1, 2010 at 5:24 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Wed, 2010-12-01 at 15:59 -0500, Robert Haas wrote:
>> As for CRCs, there's a pretty direct chain of inference here:
>>
>> 1. CRCs are hard (really impossible) because we have hint bits.
>
> I would disagree with "impossible". If we don't set hint bits during
> reading; and when we do set them, we log them (including full page
> writes); then we can do CRCs.
>
> Those things have costs, but we might be willing to pay them if we had a
> bulk loading strategy that avoids or mitigates the costs.
>
> The reason we can't do CRCs now is because hint bits violate the
> WAL-before-data rule; not because of hint bits themselves. We're talking
> about adding another feature that breaks the rule, in a more complex way
> than hint bits.
>
> I just wanted to step back for a second and consider the problem from a
> different angle before we committed to that.

Well, let's think about what we'd need to do to make CRCs work
reliably. There are two problems.

1. Currently, hint bits are not vulnerable to the torn-page problem,
because the hint bit change is to single byte, and neither of the two
possible values for the affected byte invalidate the contents of the
block. Thus, they do not need to be WAL-logged - we're happy if they
all make it to disk, but if some or none of them make it to disk,
that's OK. If we CRC the entire page, the torn pages are never
acceptable, so every action that modifies the page must be WAL-logged.

2. Currently, we allow hint bits on a page to be updated while holding
a shared-content lock; we also allow the page to be written while
holding only a shared-content lock. This makes it a bit
nondeterministic whether the hint bit update is included in the write,
but we don't care. If we were to compute a CRC and write that into
the page before writing it out to the OS, it would be unacceptable for
the page contents to change thereafter in any way.

So, to make CRCs work, we'd need to (a) WAL-log every hint bit update
and (b) change either hint bit updates or page write-outs to require
an exclusive content lock rather than a shared one. The first would
result in an increase in I/O, while the second would result in a
reduction in concurrency. Thinking about it a bit, I wonder if we
couldn't mitigate (b) quite a bit by adding a new level for buffer
content locks, share exclusive. This would conflict with itself and
with exclusive but not with share locks, and would be required to set
hint bits or write the buffer. When setting hint bits with only a
share lock, we'd attempt to do a non-blocking upgrade to share
exclusive. If that failed - because someone else already held a
share-exclusive lock - we'd just skip the hint bit update. I have no
idea what to do about (a), though.

*thinks some more*

Or maybe I do. One other thing I've been thinking about with regard
to hint bit updates is that we might choose to mark that are
hint-bit-updated as "untidy" rather than "dirty". The background
writer could treat these pages as dirty, but checkpoints and backends
doing desperation-buffer-reclamation could treat them as clean. This
would allow hint bit updates to trickle out to disk in the background,
without letting them bottleneck anything on the critical path. Maybe
we could do this - if CRCs are enabled and we are the background
writer cleaning scan, write dirty buffers in the usual way and write
untidy buffers to a "double-write buffer" (to borrow a page from
InnoDB) along with the current LSN. At the conclusion of the scan,
fsync() the double-write buffer and then write the buffers a second
time in the normal fashion if their mappings haven't changed and they
are still untidy. On redo, when you reach an LSN recorded in the
double-write buffer, restore the FPI. In general, a double-write
buffer is inferior to our existing FPI system, because you end up
needing to fsync both the double-write buffer and the WAL stream. But
it might be OK in this case, if it's all happening as background work.

With respect to your concerns about this method, after some thought, I
think #2 isn't an issue at all, because I don't believe we can risk
having our update to HEAP_XMIN_FROZEN stomped on by someone else
trying to set HEAP_XMIN_COMMITTED, so I think that when making a page
all-visible we'll need an exclusive (or share-exclusive) content lock
anyway. As to #1, I think we could restore the WAL-before-data rules
if we kept a bit somewhere in the buffer descriptor indicating whether
a given buffer has had an FPI since the last checkpoint. Then,
perhaps, WAL records that are torn-page-safe could bump the TLI
without emitting a FPI. The next WAL record to come along would be
able to determine that one was still needed. Of course, to make CRCs
work with this, you still need to emit FPIs or use a double-write
buffer. That sucks, and I don't know what to do about it. Since our
current hint-bit updates are not WAL-logged, a CRC implementation over
it could try to get by with chunking untidy buffers (either all the
time or just sometimes) without actually writing them. But these
updates WILL be WAL-logged, so you can't just refuse to write them
after the fact. Hmm...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jim Nasby <jim(at)nasby(dot)net>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-02 04:36:42
Message-ID:	AANLkTimtNC4P+0pGqSHMAx9mVnuJ9UA=ZkTivEwsoULd@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 1, 2010 at 6:41 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On Dec 1, 2010, at 2:59 PM, Robert Haas wrote:
>> 2. Hint bits are necessary because an old XID can't be viewed as
>> guaranteed committed.
>
> Hmm... I thought hint bits were necessary because it's too expensive to query CLOG for every tuple. If my understanding is correct then if we fix the CLOG performance issue we don't need hint bits anymore.

My point is - in InnoDB, when an XID aborts, it's not removed from
their equivalent of the ProcArray until it has been fully rolled back.
So if you see an XID prior to GlobalXmin, you don't need a hint bit
to tell you whether it's committed. Go directly to yes.

It is also correct to say that it's too expensive to query CLOG for
every tuple. But InnoDB isn't solving that by having a cheaper CLOG;
they're solving it by not having CLOG at all. I doubt that it's
possible to make CLOG lookups so cheap that we don't need hint bits
any more. Take a look at HeapTupleSatisfiesMVCC(). Those hint bit
tests are a single machine instruction. It's tough to beat that.
It's tough to get within two orders of magnitude. I'd like to, but I
don't see how.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-02 11:37:57
Message-ID:	87d3pk1k2y.fsf@hi-media-techno.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Or maybe I do. One other thing I've been thinking about with regard
> to hint bit updates is that we might choose to mark that are
> hint-bit-updated as "untidy" rather than "dirty". The background

Please review archives, you'll find the idea discussed and some patches
to implement it, by Simon. I suppose you could begin here:

http://archives.postgresql.org/pgsql-patches/2008-06/msg00113.php

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-02 14:37:37
Message-ID:	AANLkTimMEgSXBnns7jbaYRsqOO3mZ2Lg940MQQMowOSs@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Dec 2, 2010 at 6:37 AM, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Or maybe I do. One other thing I've been thinking about with regard
>> to hint bit updates is that we might choose to mark that are
>> hint-bit-updated as "untidy" rather than "dirty". The background
>
> Please review archives, you'll find the idea discussed and some patches
> to implement it, by Simon. I suppose you could begin here:
>
> http://archives.postgresql.org/pgsql-patches/2008-06/msg00113.php

Thanks for the pointer. I guess that demonstrates that good ideas
will keep floating back up to the surface. It seems like the idea was
met with generally positive feedback, except that most people seemed
to want a slightly simpler system than what Simon was proposing. I
suspect that this is mostly suffering from a lack of round tuits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-02 19:01:43
Message-ID:	1291316503.18031.300.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-12-01 at 23:22 -0500, Robert Haas wrote:
> Well, let's think about what we'd need to do to make CRCs work
> reliably. There are two problems.
>
> 1. [...] If we CRC the entire page, the torn pages are never
> acceptable, so every action that modifies the page must be WAL-logged.
>
> 2. Currently, we allow hint bits on a page to be updated while holding

[...]

The way I see it, here are the rules we are breaking, and why:

* We don't get an exclusive lock when dirtying a page with hint bits
- Why: we write while reading, and we want good concurrency.
- Why': because after a bulk load, we don't have any hint bits, and the
only way to get them set without VACUUM is to write while reading. I've
never been entirely sure why VACUUM isn't good enough in this case,
aside from the fact that a user might not run VACUUM (and autovacuum
might not either, if it was only a bulk load and no updates/deletes).

* We don't WAL log setting hint bits (which dirties a page)
- Why: because after a bulk load, we don't want to write the data a 4th
time

Hypothetically, if we had a bulk loading strategy, these problems would
go away, and we could follow the rules. Right? Is there a case other
than bulk loading which demands that we break these rules?

And, if we had a bulk loading path, we could probably get away with
writing the data only twice (today, we write it 3 times including the
hint bits) or maybe once if WAL archiving is off.

So, is there a case other than bulk loading for which we need to break
these rules? If not, perhaps we should consider bulk loading a different
problem, and simplify the design of all of these other features (and
allow new storage-touching features to come about, like CRCs, without
exponentially increasing the complexity with each one).

Regards,
Jeff Davis

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Robert Haas" <robertmhaas(at)gmail(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-02 19:16:09
Message-ID:	4CF79C19020000250003819A@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> And, if we had a bulk loading path, we could probably get away
> with writing the data only twice (today, we write it 3 times
> including the hint bits) or maybe once if WAL archiving is off.

If you're counting WAL writes, you're low. If you don't go out of
your way to avoid it, you are likely to write the data to the table
once during the bulk load, a second time on first read to set the
hint bits, and a third time to freeze data to prevent wrap-around.
The initial write may or may not be WAL-logged. The freezing
typically is WAL-logged.

So, you can easily write the data to disk four or five times. With
luck these are spread out uniformly or happen during off-peak
periods. Unmanaged, a WAL-logged freeze of bulk-loaded data is
somewhat more likely to occur, however, during hours of heavy OLTP
load, because transaction numbers are consumed so quickly.

Currently, a VACUUM FREEZE after a bulk load collapses at least two
of those writes to one. With luck, some pages might still be dirty
in cache, and you can save two of the writes.

-Kevin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-02 22:00:18
Message-ID:	AANLkTikNcbySP_HDS0ZoaEWmaA=JBRWhssstD7xTSmNc@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Dec 2, 2010 at 2:01 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> * We don't get an exclusive lock when dirtying a page with hint bits
> - Why: we write while reading, and we want good concurrency.
> - Why': because after a bulk load, we don't have any hint bits, and the
> only way to get them set without VACUUM is to write while reading. I've
> never been entirely sure why VACUUM isn't good enough in this case,
> aside from the fact that a user might not run VACUUM (and autovacuum
> might not either, if it was only a bulk load and no updates/deletes).
>
> * We don't WAL log setting hint bits (which dirties a page)
> - Why: because after a bulk load, we don't want to write the data a 4th
> time
>
> Hypothetically, if we had a bulk loading strategy, these problems would
> go away, and we could follow the rules. Right? Is there a case other
> than bulk loading which demands that we break these rules?

I'm not really convinced that this problem is confined to bulk
loading. Every INSERT or UPDATE results in a new tuple that may need
hit bits set and eventually to be frozen. A bulk load is just a time
when you do lots of inserts all at once; it seems to me that a large
update would cause all the same problems, plus bloat. The triple I/O
problem exists for small transactions as well (and isn't desirable
there either); it's just less noticeable because the second and third
writes are, like the first one, small.

> And, if we had a bulk loading path, we could probably get away with
> writing the data only twice (today, we write it 3 times including the
> hint bits) or maybe once if WAL archiving is off.

It seems to me that a COPY command executed in a transaction with no
other open snapshots writing to a table created or truncated within
the same transaction should be able to write frozen tuples from the
get-go, regardless of anything else we do.

> So, is there a case other than bulk loading for which we need to break
> these rules? If not, perhaps we should consider bulk loading a different
> problem, and simplify the design of all of these other features (and
> allow new storage-touching features to come about, like CRCs, without
> exponentially increasing the complexity with each one).

I don't think we're exponentially increasing complexity - I think
we're incrementally improving our algorithms. If you want to propose
a bulk loading path, great. Propose away! But without something a
bit more concrete, I don't think it would be appropriate to hold off
making the visibility map crash-safe, on the off chance that our
design for so doing might complicate something else we want to do
later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-02 23:37:05
Message-ID:	1291333025.30909.69.camel@jdavis-ux.asterdata.local
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-12-02 at 17:00 -0500, Robert Haas wrote:
> I'm not really convinced that this problem is confined to bulk
> loading. Every INSERT or UPDATE results in a new tuple that may need
> hit bits set and eventually to be frozen. A bulk load is just a time
> when you do lots of inserts all at once; it seems to me that a large
> update would cause all the same problems, plus bloat.

A big UPDATE does a lot of work, I don't see any near-term approach for
solving that. Avoiding WAL (and full page writes) for the updates to
PD_ALL_VISIBLE, hint bits, VM bits, etc., would probably be the least of
my concerns in that case.

> The triple I/O
> problem exists for small transactions as well (and isn't desirable
> there either); it's just less noticeable because the second and third
> writes are, like the first one, small.

Bulk loading poses some unique challenges because there is no
opportunity to set PD_ALL_VISIBLE or hint bits before the loading is
complete; and by that time, many checkpoints will have already happened,
and the pages have already hit disk. That means we need to re-read them,
modify them, and write them again (plus WAL, if we were following the
rules).

Small transactions don't suffer from the same problems. They generally
begin and end without an intervening checkpoint. That means that you
have an opportunity to set PD_ALL_VISIBLE or hint bits before the
checkpoint happens, thus avoiding unnecessary extra writes.

Additionally, small transaction workloads will generally have, to some
approximation, some working set of pages. So, even if you do a read of a
single tuple, write PD_ALL_VISIBLE and hint bits (thus dirtying the
page), there is a reasonable chance that someone will come by later and
do an insert/update/delete (thus forcing WAL anyway).

And if the small transaction workload is completely random and you
aren't touching the same pages between checkpoints, then setting hint
bits one-at-a-time is not a good strategy anyway. It would be much
better to do it in bulk with a VACUUM. And if VACUUM does anything
significant to a page, it's going to WAL anyway.

I'm having trouble seeing a case other than bulk-loading which causes a
real problem. Maybe a small-transaction workload with a few long-running
transactions? Even that doesn't seem so bad.

> It seems to me that a COPY command executed in a transaction with no
> other open snapshots writing to a table created or truncated within
> the same transaction should be able to write frozen tuples from the
> get-go, regardless of anything else we do.

Well, some transaction might pick up a snapshot between the time you
begin the copy and the time it commits. We'd need to prevent such a
transaction from actually reading the table.

> I don't think it would be appropriate to hold off
> making the visibility map crash-safe, on the off chance that our
> design for so doing might complicate something else we want to do
> later.

I'm not suggesting we hold off on it at all. To the contrary, I'm
suggesting that we simply log updates of PD_ALL_VISIBLE as well as VM
bits, at least until a performance problem presents itself. That will
_simplify_ the design.

Then, when a performance problem does present itself for a certain use
case, we can see how to fix it. If many cases are affected, then we
might choose one of these more creative solutions that breaks the rules
in controlled ways, understanding the trade-offs. If only bulk loading
is affected, we might choose to address that case directly.

Regards,
Jeff Davis

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-03 00:06:29
Message-ID:	AANLkTi=GOm-dMrMa4WrRSZ84zEj9TtfvY5nfM3mA0JuP@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Dec 2, 2010 at 6:37 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>> It seems to me that a COPY command executed in a transaction with no
>> other open snapshots writing to a table created or truncated within
>> the same transaction should be able to write frozen tuples from the
>> get-go, regardless of anything else we do.
>
> Well, some transaction might pick up a snapshot between the time you
> begin the copy and the time it commits. We'd need to prevent such a
> transaction from actually reading the table.

Oh, hmm. That's awkward. I guess under present semantics it can see
the table - but not its contents - once the inserting transaction has
committed. That stinks.

>> I don't think it would be appropriate to hold off
>> making the visibility map crash-safe, on the off chance that our
>> design for so doing might complicate something else we want to do
>> later.
>
> I'm not suggesting we hold off on it at all. To the contrary, I'm
> suggesting that we simply log updates of PD_ALL_VISIBLE as well as VM
> bits, at least until a performance problem presents itself. That will
> _simplify_ the design.
>
> Then, when a performance problem does present itself for a certain use
> case, we can see how to fix it. If many cases are affected, then we
> might choose one of these more creative solutions that breaks the rules
> in controlled ways, understanding the trade-offs. If only bulk loading
> is affected, we might choose to address that case directly.

I don't think that you can seriously suggest that emitting that volume
of FPIs isn't going to be a problem immediately. We have to have some
solution to that problem out of the gate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-03 00:40:44
Message-ID:	1291336844.32238.4.camel@jdavis-ux.asterdata.local
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-12-02 at 19:06 -0500, Robert Haas wrote:
> I don't think that you can seriously suggest that emitting that volume
> of FPIs isn't going to be a problem immediately. We have to have some
> solution to that problem out of the gate.

Fair enough. I think you understand my point, and it's easy enough to
remove complexity later if it makes sense to do so.

Regards,
Jeff Davis

From:	Florian Weimer <fweimer(at)bfk(dot)de>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jim Nasby <jim(at)nasby(dot)net>, Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2010-12-03 14:25:20
Message-ID:	82oc93q6gf.fsf@mid.bfk.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Robert Haas:

> Those hint bit tests are a single machine instruction. It's tough
> to beat that. It's tough to get within two orders of magnitude.
> I'd like to, but I don't see how.

For some scans, it might be possible to hoist the checks out of inner
loops. (At least in principle, I'm not sure how much that would
interfere with the executor architecture.)

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

From:	Jesper Krogh <jesper(at)krogh(dot)cc>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2011-01-05 20:22:54
Message-ID:	4D24D31E.2040906@krogh.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2010-11-30 05:57, Robert Haas wrote:
> Last week, I posted a couple of possible designs for making the
> visibility map crash-safe, which did not elicit much comment. Since
> this is an important prerequisite to index-only scans, I'm trying
> again.

The logic seems to be:

* If the visibillity map should be crash-safe if should be WAL-logged.
* PD_ALL_VISIBLE is currently not being WAL-logged when vacuum sets it.
* WAL logging the visibillity map bit is not "that" bad (small size).
* WAL-logging the PD_ALL_VISIBLE bit would can WAL-record for the entire
relation to be written out (potentially huge).

Would the problem not be solved by not "trying to keep the two bits in
sync" but
simply removing the PD_ALL_VISIBLE bit in the page-header in favor
for the bit in the visibillity map, that is now WAL-logged and thus safe
to trust?

Then vacuum could emit WAL records for setting the visibillity map bits,
combined
with changes on the page could clear it?

The question probably boils down to:

Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE
bit serve?

I've probably just missed some logic?

Having index-only scans per-table ajustable would make quite some sense..

I have a couple of tables with a high turn-over rate that never get out
of the OS-cache
anyway, the benefit of index-only scans are quite small, especially if
they come with
additional overhead on INSERT/UPDATE/DELETE operations, whereas I also have
huge tables with a very small amount of changes. Just the saved IO of
not having to
go to the heap in some cases would be good.

I could see some benefits in having the index-only scan work on
tuple-level visibillity information
and not page-level, but that would require a bigger map
(allthough still less than 1% of the heap size if my calculations are
correct), but
would enable visibillity testing of a tuple without going to the heap
even other (unrelated)
changes happend on the same page.

Jesper

--
Jesper

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jesper Krogh <jesper(at)krogh(dot)cc>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2011-01-06 02:10:15
Message-ID:	AANLkTi=K=26Nkx48H9nmUcmJNcuWhL60VqCy8OXCRr28@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh <jesper(at)krogh(dot)cc> wrote:
> Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit
> serve?

If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
attempt to update the visibility map. In theory, this is an important
optimization to reduce contention on the visibility map page, since
there are something like 64K heap pages per visibility map page. In
practice, I'm not sure in what workloads it matters or by how much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jesper Krogh <jesper(at)krogh(dot)cc>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: crash-safe visibility map, take three
Date:	2011-01-06 07:00:29
Message-ID:	4D25688D.2010706@krogh.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2011-01-06 03:10, Robert Haas wrote:
> On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh<jesper(at)krogh(dot)cc> wrote:
>> Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit
>> serve?
> If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
> attempt to update the visibility map. In theory, this is an important
> optimization to reduce contention on the visibility map page, since
> there are something like 64K heap pages per visibility map page. In
> practice, I'm not sure in what workloads it matters or by how much

If that is the only reason, I would suggest just making the visibillity map
pages more sparse. If you just had 500 or 1K heap pages per visibillity map
page, then it would not change a thing for the performance. With 1K heap
pages
per VM map page a relation with 100GB of data would have a VM map of
the VM map of 100MB, even 100 heap pages per VM-map page would still
enable the database to have visibillity information of 100GB data stored
in 1GB
of memory.

But it would need testing and benchmarking to find out.

Jesper

--
Jesper

From:	Jim Nasby <jim(at)nasby(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: crash-safe visibility map, take three
Date:	2011-01-07 18:28:54
Message-ID:	7502A9A8-B5C8-4C82-8843-4AE0F638D207@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jan 5, 2011, at 8:10 PM, Robert Haas wrote:
> On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh <jesper(at)krogh(dot)cc> wrote:
>> Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit
>> serve?
>
> If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
> attempt to update the visibility map. In theory, this is an important
> optimization to reduce contention on the visibility map page, since
> there are something like 64K heap pages per visibility map page. In
> practice, I'm not sure in what workloads it matters or by how much.

What specific locking are you worried about? The page locks themselves? Isn't changing the bit essentially a single instruction operation?

This is sounding like premature optimization... ;)
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jim Nasby <jim(at)nasby(dot)net>
Cc:	PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: crash-safe visibility map, take three
Date:	2011-01-08 02:59:36
Message-ID:	AANLkTinThcNA==iAuhOq9rrFT63Afer0k6LeYN1pJTUu@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 7, 2011 at 1:28 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On Jan 5, 2011, at 8:10 PM, Robert Haas wrote:
>> On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh <jesper(at)krogh(dot)cc> wrote:
>>> Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit
>>> serve?
>>
>> If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
>> attempt to update the visibility map. In theory, this is an important
>> optimization to reduce contention on the visibility map page, since
>> there are something like 64K heap pages per visibility map page. In
>> practice, I'm not sure in what workloads it matters or by how much.
>
> What specific locking are you worried about? The page locks themselves? Isn't changing the bit essentially a single instruction operation?
>
> This is sounding like premature optimization... ;)

I'm not quite invested enough in this to get worried about it, but if
I were, I'd probably start with the buffer content lock, and move on
to the buffer header spinlock and the buf mapping locks. Changing the
bit is a single instruction once you've got the page pinned and
locked, but that's not free.

(And even if you could hypothetically figure out some clever lock-free
algorithm to avoid some of this work, there's still going to be cache
line contention, which is quite expensive as it turns out. See the
relatively recent discussions of why our backend startup cost is so
high.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company