Re: Protecting against unexpected zero-pages: proposal

From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Protecting against unexpected zero-pages: proposal
Date: 2010-11-06 10:42:40
Message-ID: AANLkTimt2xZDDUiRqMS3aTTRxVQY6ALZNhF5ou1_736w@mail.gmail.com
Lists: pgsql-hackers

A customer of ours is quite bothered about finding zero pages in an index after
a system crash. The task now is to improve the diagnosability of such an issue
and be able to definitively point to the source of the zero pages.

The proposed solution below has been vetted in-house at EnterpriseDB, and I am
posting it here to surface any problems we may have missed, and to see whether
the community would be interested in incorporating this capability.

Background:
-----------
SUSE Linux, ATCA board, 4 dual-core CPUs => 8 cores, 24 GB RAM, 140 GB disk,
PG 8.3.11. RAID-1 SAS, with SCSIinfo reporting that write caching is disabled.

The corrupted index's file contents, based on hexdump:

The file has a total of 525 pages (cluster block size is 8K, per pg_controldata).
Blocks 0 to 278 look sane.
Blocks 279 to 518 are full of zeroes.
Blocks 519 to 522 look sane.
Block 523 is filled with zeroes.
Block 524 looks sane.

The tail ends of blocks 278 and 522 have some non-zero data, meaning that those
index pages have some valid 'Special space' contents. Also, the heads of blocks
519 and 524 look sane. These two findings imply that the zeroing happened at an
8K page boundary. This is a standard ext3 FS with a 4K block size, so this
raises the question of how we can ascertain that this was indeed a hardware/FS
malfunction. And if it was a hardware/FS problem, why didn't we see zeroes
starting or ending at a 1/2K boundary (generally the disk's sector size) or a
4K boundary (the default ext3 FS block size) that does not coincide with an 8K
boundary?
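
The kind of inspection described above (finding which 8K blocks are entirely zero) can be sketched as a small standalone scanner. This is an illustrative diagnostic only, not PostgreSQL code; the function names are made up:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192  /* PostgreSQL's default page size */

/* Return true if a BLCKSZ-sized page buffer is entirely zero. */
bool block_is_zero(const unsigned char *buf)
{
    for (size_t i = 0; i < BLCKSZ; i++)
        if (buf[i] != 0)
            return false;
    return true;
}

/* Scan an open relation segment file and report each all-zero block,
 * similar in spirit to the hexdump analysis above. */
void report_zero_blocks(FILE *fp)
{
    unsigned char buf[BLCKSZ];
    long blkno = 0;

    while (fread(buf, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (block_is_zero(buf))
            printf("block %ld is all zeroes\n", blkno);
        blkno++;
    }
}
```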

The backup from before the crash does not have these zero-pages.

Disk Page Validity Check Using Magic Number
===========================================

Requirement:
------------
We have encountered quite a few zero pages in an index after a machine crash,
rendering the index unusable. Although REINDEX is an option, we have no way of
telling whether these zero pages were caused by the hardware, the filesystem,
or Postgres. Code analysis shows that Postgres being the culprit is a very low
probability; similarly, since our hardware is of good quality, with
hardware-level RAID-1 over 2 disks, it is difficult to consider the hardware a
problem. The ext3 filesystem is also a time-tested piece of software, so it
becomes very difficult to point fingers at any of these 3 components for this
corruption.

Postgres is being deployed as a component of a carrier-grade platform, and it
is required to run unattended as much as possible. There is a High Availability
monitoring component that is tasked with performing switchover to a standby
node in the event of any problem with the primary node. This HA component needs
to perform regular health checks on all the other components, including
Postgres, and take corrective actions.

With the zero pages comes the difficulty of ascertaining whether these are
legitimate zero pages (Postgres considers zero pages valid, e.g. left over from
a file extension followed by a crash), or whether they are the result of an
FS/hardware failure.

We are required to definitively differentiate between zero pages written by
Postgres and zero pages caused by hardware failure. Obviously this is not
possible by the very nature of the problem, so we explored a few ideas,
including per-block checksums (in-block or in a checksum fork), S.M.A.R.T.
monitoring of disk drives, PageInit() before smgrextend() in
ReadBuffer_common(), and an additional member in PageHeader for a magic number.

Following is the approach which we think is least invasive and does not
threaten code breakage, yet provides definitive detection of
corruption/data-loss outside Postgres with the least performance penalty.

Implementation:
---------------

.) The basic idea is to have a magic number in every PageHeader before it is
written to disk, and to check for this magic number when performing page
validity checks.

.) To avoid adding a new field to PageHeader, and any code breakage, we reuse
an existing member of the structure.

.) We exploit the following facts and assumptions:
-) Relations/files are extended 8 KB (BLCKSZ) at a time.
-) Every I/O unit contains a PageHeader structure (table/index/fork files),
which in turn contains pd_lsn as its first member.
-) Every newly written block is considered to be zero-filled.
-) PageIsNew() assumes that if pd_upper is 0 then the page is zero.
-) PageHeaderIsValid() allows zero-filled pages to be considered valid.
-) Anyone wishing to use a new page has to do PageInit() on the page.
-) PageInit() does a MemSet(0) on the whole page.
-) XLogRecPtr={x,0} is considered invalid.
-) XLogRecPtr={x, ~((uint32)0)} is not valid either (i.e. the last byte of an
xlog file (not segment)); we'll use this as the magic number.
... The above is my assumption, since it is not mentioned anywhere in the
code. The XLogFileSize calculation seems to support this assumption.
... If this assumption doesn't hold good, then the previous one, {x,0}, can
also be used to implement the magic number (with x > 0).
-) There's only one implementation of the Storage Manager, i.e. md.c.
-) smgr_extend() -> mdextend() is the only place where a relation is extended.
-) Writing beyond EOF in a file causes the intermediate space to become a
hole, and any read from such a hole returns zero-filled pages.
-) Anybody trying to extend a file makes sure that there's no concurrent
extension going on from somewhere else.
... This is ensured either by the implicit nature of the calling code, or
by calling LockRelationForExtension().

.) In mdextend(), if the buffer being written is zero-filled, then we write
the magic number into that page's pd_lsn.
... This check can be optimized to examine just sizeof(pd_lsn) worth of the
buffer.

.) In mdextend(), if the buffer is being written beyond the current EOF, then
we forcibly write the intermediate blocks too, and write the magic number into
each of those.
... This needs an _mdnblocks() call, plus FileSeek(SEEK_END)+FileWrite()
calls for every block in the hole.
... Creation of holes is assumed to be a very limited corner case, hence
this performance hit is acceptable. Tests are being planned using a real
application to check how often this occurs.

.) PageHeaderIsValid() needs to be modified to treat
MagicNumber-followed-by-zeroes as a valid page (rather than a completely zero
page).
... If a page is completely filled with zeroes, this confirms that either
the filesystem or the disk storage zeroed it, since Postgres never wrote zero
pages to disk.

.) PageInit() and PageIsNew() require no change.

.) XLByteLT(), XLByteLE() and XLByteEQ() may be changed to contain
AssertMacro( !MagicNumber(a) && !MagicNumber(b) ).

.) I haven't analyzed the effects of this change on the recovery code, but I
have a feeling that we might not need to change anything there.

.) We can create a contrib module (a standalone binary or a loadable module)
that goes through each disk page, checks whether it is zero-filled, and raises
an alarm if it finds any.
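
The stamping and validity-check steps described above can be sketched roughly as follows. The structures here are simplified stand-ins for PostgreSQL's PageHeaderData and XLogRecPtr, and the function names are hypothetical, not actual backend code:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Simplified stand-ins; the real PageHeaderData has more fields,
 * but pd_lsn is its first member, which is all this sketch relies on. */
typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;
typedef struct { XLogRecPtr pd_lsn; /* ... rest of header ... */ } PageHeaderData;

/* Proposed magic LSN: {x, 0xFFFFFFFF} is assumed never to be a real LSN. */
static const XLogRecPtr MagicLSN = { 0, 0xFFFFFFFFu };

bool page_is_all_zero(const char *page)
{
    for (size_t i = 0; i < BLCKSZ; i++)
        if (page[i] != 0)
            return false;
    return true;
}

/* What mdextend() would do just before writing a new block:
 * stamp the magic number into an otherwise zero-filled page. */
void stamp_magic_if_zero(char *page)
{
    if (page_is_all_zero(page))
        memcpy(&((PageHeaderData *) page)->pd_lsn, &MagicLSN, sizeof(MagicLSN));
}

/* PageHeaderIsValid() counterpart: magic-followed-by-zeroes is a valid
 * "new" page; a completely zero page now implicates the FS or hardware. */
bool page_is_valid_new(const char *page)
{
    if (memcmp(page, &MagicLSN, sizeof(MagicLSN)) != 0)
        return false;
    for (size_t i = sizeof(XLogRecPtr); i < BLCKSZ; i++)
        if (page[i] != 0)
            return false;
    return true;
}
```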

Thoughts welcome.
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh(dot)gurjeet(at){ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-06 15:48:27
Message-ID: 19817.1289058507@sss.pgh.pa.us
Lists: pgsql-hackers

Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com> writes:
> .) The basic idea is to have a magic number in every PageHeader before it is
> written to disk, and check for this magic number when performing page
> validity
> checks.

Um ... and exactly how does that differ from the existing behavior?

> .) To avoid adding a new field to PageHeader, and any code breakage, we
> reuse
> an existing member of the structure.

The amount of fragility introduced by the assumptions you have to make
for this seems to me to be vastly riskier than the risk you are trying
to respond to.

regards, tom lane


From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-07 04:23:15
Message-ID: AANLkTikOxS94M3SVrMFWL+LNJUCHEEepG17=KD6nF9wn@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 6, 2010 at 11:48 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com> writes:
> > .) The basic idea is to have a magic number in every PageHeader before it
> is
> > written to disk, and check for this magic number when performing page
> > validity
> > checks.
>
> Um ... and exactly how does that differ from the existing behavior?
>

Right now a zero-filled page is considered valid, and is treated as a new page;
see PageHeaderIsValid() -> /* Check all-zeroes case */, and PageIsNew(). This
means that looking at a zero-filled page on disk (say, after a crash) does not
give us any clue whether it was indeed left zeroed by Postgres, or whether the
FS/storage failed to do their job.

With the proposed change, if it is a valid page (a page actually written by
Postgres) it will have either a sensible LSN or the magic LSN; the LSN will
never be zero. OTOH, if we encounter a zero-filled page ( => LSN={0,0} ), it
clearly implicates elements outside Postgres in zeroing that page.
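
The resulting three-way classification could look like this. An illustrative sketch only; the enum and function names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;

typedef enum {
    PAGE_NORMAL,            /* sensible, nonzero LSN: an ordinary page */
    PAGE_NEW_FROM_POSTGRES, /* magic LSN: Postgres wrote it as a new page */
    PAGE_ZEROED_EXTERNALLY  /* LSN == {0,0}: FS/hardware zeroed the page */
} PageOrigin;

PageOrigin classify_page(const char *page)
{
    XLogRecPtr lsn;
    memcpy(&lsn, page, sizeof(lsn));   /* pd_lsn is the first header field */

    if (lsn.xlogid == 0 && lsn.xrecoff == 0)
        return PAGE_ZEROED_EXTERNALLY;
    if (lsn.xrecoff == 0xFFFFFFFFu)
        return PAGE_NEW_FROM_POSTGRES;
    return PAGE_NORMAL;
}
```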

> The amount of fragility introduced by the assumptions you have to make
> for this seems to me to be vastly riskier than the risk you are trying
> to respond to.
>
>
I understand that it is a pretty low-level change, but IMHO the change is
minimal and is being applied in well-understood places. All the assumptions
listed have held for quite a while, and I don't see them being affected in the
near future. The most crucial assumptions we have to work with are that
XLogPtr{n, 0xFFFFFFFF} will never be used, and that mdextend() is the only
place that extends a relation (until we implement an md.c sibling, say flash.c
or tape.c; the last change to md.c regarding mdextend() was in January 2007).

Only mdextend() and PageHeaderIsValid() need to know about this change in
behaviour; all the other APIs work and behave the same as they do now.

This change would increase the diagnosability of zero-page issues, and help
users point fingers at the right places.

Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh(dot)gurjeet(at){ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-07 05:04:27
Message-ID: AANLkTi=p_p2_QPbtHVVcUQzPk7LDwiWr7ixxxW81pTQz@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 7, 2010 at 4:23 AM, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com> wrote:
> I understand that it is a pretty low-level change, but IMHO the change is
> minimal and is being applied in well understood places. All the assumptions
> listed have been effective for quite a while, and I don't see these
> assumptions being affected in the near future. Most crucial assumptions we
> have to work with are, that XLogPtr{n, 0xFFFFFFFF} will never be used, and
> that mdextend() is the only place that extends a relation (until we
> implement an md.c sibling, say flash.c or tape.c; the last change to md.c
> regarding mdextend() was in January 2007).

I think the assumption that isn't tested here is what happens if the server
crashes. The logic may work fine as long as nothing goes wrong, but if
something does, it has to be fool-proof.

I think having zero-filled blocks at the end of the file if it has
been extended but hasn't been fsynced is an expected failure mode of a
number of filesystems. The log replay can't assume seeing such a block
is a problem since that may be precisely the result of the crash that
caused the replay. And if you disable checking for this during WAL
replay then you've lost your main chance to actually detect the
problem.

Another issue -- though I think a manageable one -- is that I expect we'll
want to be using posix_fallocate() sometime soon. That will allow efficient,
guaranteed pre-allocated space with better contiguous layout than we get
currently. But ext4 can only pretend to give zero-filled blocks, not any
random bit pattern we request. I can see this being an optional feature that
is just not compatible with using posix_fallocate(), though.

It does seem like this is kind of part and parcel of adding checksums to
blocks. It's arguably kind of silly to add checksums to blocks but have a
commonly produced bit pattern in corruption cases go undetected.

--
greg


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-08 15:10:18
Message-ID: AANLkTi=JtCYr1eKXaCCV-h6H9noXWtfjsLs61uqXpH8q@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 7, 2010 at 1:04 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> It does seem like this is kind of part and parcel of adding checksums
> to blocks. It's arguably kind of silly to add checksums to blocks but
> have an commonly produced bitpattern in corruption cases go
> undetected.

Getting back to the checksum debate (and this seems like a semi-version of the
checksum debate), now that we have forks, could we easily add block
checksumming to a fork? It would mean writing to 2 files, but that shouldn't
be a problem, because until the checkpoint is done (and thus both writes), the
full-page write in WAL is going to take precedence on recovery.
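
For scale: assuming a 4-byte checksum per data page, one 8K page of such a fork would cover 2048 data pages (16 MB of heap/index data). A sketch of the offset arithmetic, with hypothetical names (no such API exists in PostgreSQL):

```c
#include <stdint.h>

#define BLCKSZ 8192
#define CHECKSUM_SIZE 4                          /* assumed 4-byte CRC per data page */
#define SUMS_PER_PAGE (BLCKSZ / CHECKSUM_SIZE)   /* 2048 checksums per fork page */

/* Where the checksum for data block `blkno` would live inside a
 * hypothetical checksum fork: fork block number and byte offset within it. */
void checksum_location(uint32_t blkno, uint32_t *fork_blk, uint32_t *offset)
{
    *fork_blk = blkno / SUMS_PER_PAGE;
    *offset   = (blkno % SUMS_PER_PAGE) * CHECKSUM_SIZE;
}
```

Note the flip side of this density, which comes up below: losing a single fork page casts doubt on a couple of thousand data pages at once.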

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-08 16:16:54
Message-ID: 25661.1289233014@sss.pgh.pa.us
Lists: pgsql-hackers

Aidan Van Dyk <aidan(at)highrise(dot)ca> writes:
> Getting back to the checksum debate (and this seems like a
> semi-version of the checksum debate), now that we have forks, could we
> easily add block checksumming to a fork? IT would mean writing to 2
> files but that shouldn't be a problem, because until the checkpoint is
> done (and thus both writes), the full-page-write in WAL is going to
> take precedence on recovery.

Doesn't seem like a terribly good design: damage to a checksum page
would mean that O(1000) data pages are now thought to be bad.

More generally, this re-opens the question of whether data in secondary
forks is authoritative or just hints. Currently, we treat it as just
hints, for both FSM and VM, and thus sidestep the problem of
guaranteeing its correctness. To use a secondary fork for checksums,
you'd need to guarantee correctness of writes to it. This is the same
problem that index-only scans are hung up on, ie making the VM reliable.
I forget whether Heikki had a credible design sketch for making that
happen, but in any case it didn't look easy.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-08 16:32:12
Message-ID: 25798.1289233932@sss.pgh.pa.us
Lists: pgsql-hackers

Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com> writes:
> On Sat, Nov 6, 2010 at 11:48 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Um ... and exactly how does that differ from the existing behavior?

> Right now a zero filled page considered valid, and is treated as a new page;
> PageHeaderIsValid()->/* Check all-zeroes case */, and PageIsNew(). This
> means that looking at a zero-filled page on disk (say after a crash) does
> not give us any clue if it was indeed left zeroed by Postgres, or did
> FS/storage failed to do their job.

I think this is really a non-problem. You said earlier that the
underlying filesystem uses 4K blocks. Filesystem misfeasance would
therefore presumably affect 4K at a time. If you see that both halves
of an 8K block are zero, it's far more likely that Postgres left it that
way than that the filesystem messed up. Of course, if only one half of
an 8K page went to zeroes, you know the filesystem or disk did it.

There are also crosschecks that you can apply: if it's a heap page, are
there any index pages with pointers to it? If it's an index page, are
there downlink or sibling links to it from elsewhere in the index?
A page that Postgres left as zeroes would not have any references to it.

IMO there are a lot of methods that can separate filesystem misfeasance
from Postgres errors, probably with greater reliability than this hack.
I would also suggest that you don't really need to prove conclusively
that any particular instance is one or the other --- a pattern across
multiple instances will tell you what you want to know.

> This change would increase the diagnosability of zero-page issues, and help
> the users point fingers at right places.

[ shrug... ] If there were substantial user clamor for diagnosing zero-page
issues, I might be for this. As is, I think it's a non-problem. What's more,
if I did believe that this was a safe and reliable technique, I'd be unhappy
about the opportunity cost of reserving it for zero-page testing rather than
other purposes.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Greg Stark <gsstark(at)mit(dot)edu>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-08 17:00:42
Message-ID: 26131.1289235642@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Aidan Van Dyk <aidan(at)highrise(dot)ca> writes:
>> Getting back to the checksum debate (and this seems like a
>> semi-version of the checksum debate), now that we have forks, could we
>> easily add block checksumming to a fork?

> More generally, this re-opens the question of whether data in secondary
> forks is authoritative or just hints. Currently, we treat it as just
> hints, for both FSM and VM, and thus sidestep the problem of
> guaranteeing its correctness. To use a secondary fork for checksums,
> you'd need to guarantee correctness of writes to it.

... but wait a minute. What if we treated the checksum as a hint ---
namely, on checksum failure, we just log a warning rather than doing
anything drastic? A warning is probably all you want to happen anyway.

A corrupted page of checksums would then show up as warnings for most or
all of a range of data pages, and it'd be pretty obvious (if the data
seemed OK) where the failure had been.

So maybe Aidan's got a good idea here. It would sure be a lot easier
to shoehorn checksum checking in as an optional feature if the checksums
were kept someplace else.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-08 17:53:44
Message-ID: AANLkTin+ku5XCVn80ceO7marozf3Qenb9PHGBCe4QJM1@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 8, 2010 at 5:00 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> So maybe Aidan's got a good idea here.  It would sure be a lot easier
> to shoehorn checksum checking in as an optional feature if the checksums
> were kept someplace else.

Would it? I thought the only problem was the hint bits being set behind the
checksummer's back. That'll still happen even if the checksum is written to a
different place.

--
greg


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-08 17:59:45
Message-ID: AANLkTingLW8aM=S3O0q+2iQj6-KU2TECJguW3tUAfnb8@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 8, 2010 at 12:53 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Mon, Nov 8, 2010 at 5:00 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> So maybe Aidan's got a good idea here.  It would sure be a lot easier
>> to shoehorn checksum checking in as an optional feature if the checksums
>> were kept someplace else.
>
> Would it? I thought the only problem was the hint bits being set
> behind the checksummers back. That'll still happen even if it's
> written to a different place.

The problem that putting checksums in a different place solves is the
page-layout (binary upgrade) problem. You're still going to need to "buffer"
the page as you calculate the checksum and write it out. Buffering that page
is absolutely necessary no matter where you put the checksum, unless you've
got an exclusive lock that blocks even hint updates on the page.

But if we can start using forks to hold "other data", that means that keeping
the page layouts is easier, and thus binary upgrades are much more feasible.

At least, that was my thought WRT checksums being out-of-page.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 13:45:01
Message-ID: AANLkTimTwMwEBtod=gjufKtTC=_YBJ+Ei=aGV2-8d157@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 8, 2010 at 5:59 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> The problem that putting checksums in a different place solves is the
> page layout (binary upgrade) problem.  You're still doing to need to
> "buffer" the page as you calculate the checksum and write it out.
> buffering that page is absolutely necessary no mater where you put the
> checksum, unless you've got an exclusive lock that blocks even hint
> updates on the page.

But buffering the page only means you've got some consistent view of the page.
It doesn't mean the checksum will actually match the data in the page that
gets written out. So when you read it back in, the checksum may be invalid.

I wonder if we could get by with some global counter on the page which you
increment when you set a hint bit. That way, when you read the page back in,
you could compare the counter on the page with the counter for the checksum,
and if the checksum counter is behind, ignore the checksum. It would be nice
to do better, but I'm not sure we can.
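
The counter idea above could be sketched like this. The field names, struct, and verdict enum are hypothetical, purely to illustrate the comparison being proposed:

```c
#include <stdint.h>

/* Sketch: a per-page counter bumped on every hint-bit update; a checksum
 * stamped with an older counter value is ignored rather than treated as
 * corruption. None of these fields exist in PostgreSQL's page header. */
typedef struct {
    uint32_t hint_counter;      /* incremented on each hint-bit set */
    uint32_t checksum;          /* CRC of page contents */
    uint32_t checksum_counter;  /* hint_counter value when checksum was computed */
} PageMeta;

typedef enum { CHECKSUM_OK, CHECKSUM_STALE, CHECKSUM_FAILED } ChecksumVerdict;

ChecksumVerdict verify_checksum(const PageMeta *m, uint32_t actual_crc)
{
    if (m->checksum_counter < m->hint_counter)
        return CHECKSUM_STALE;   /* hint bits set after checksumming: ignore */
    if (m->checksum == actual_crc)
        return CHECKSUM_OK;
    return CHECKSUM_FAILED;      /* genuine mismatch: likely corruption */
}
```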

>
> But if we can start using forks to put "other data", that means that
> keeping the page layouts is easier, and thus binary upgrades are much
> more feasible.
>

The difficulty with the page layout didn't come from the checksum itself. We
can add 4 or 8 bytes to the page header easily enough. The difficulty came
from trying to move the hint bits for all the tuples to a dedicated area. That
means three resizable areas, so either one of them would have to be
relocatable or we'd need some other solution (like not checksumming the line
pointers and putting the hint bits in the line pointers). If we're willing to
have invalid checksums whenever the hint bits get set, then this wouldn't be
necessary.

--
greg


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 14:28:36
Message-ID: AANLkTikuuLMFL0gv9XL7=jfxHQ8CYRfgXC_UoLC9N-KA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 8:45 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> But buffering the page only means you've got some consistent view of
> the page. It doesn't mean the checksum will actually match the data in
> the page that gets written out. So when you read it back in the
> checksum may be invalid.

I was assuming that if the code went through the trouble of buffering the
shared page to get a "stable, non-changing" copy to use for
checksumming/writing, it would write() the buffered copy it just made, not the
original in shared memory... I'm not sure how that write could be
inconsistent.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 15:25:23
Message-ID: AANLkTi=mfepkzVBPySd9faugJvjdkg=7A-2NsX3tOvZx@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 2:28 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> On Tue, Nov 9, 2010 at 8:45 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
>> But buffering the page only means you've got some consistent view of
>> the page. It doesn't mean the checksum will actually match the data in
>> the page that gets written out. So when you read it back in the
>> checksum may be invalid.
>
> I was assuming that if the code went through the trouble to buffer the
> shared page to get a "stable, non-changing" copy to use for
> checksumming/writing it, it would write() the buffered copy it just
> made, not the original in shared memory...  I'm not sure how that
> write could be in-consistent.

Oh, I'm mistaken. The problem was that buffering the writes was insufficient
to deal with torn pages. Even if you buffer the writes, if the machine crashes
having written only half the buffer out, then the checksum won't match. If the
only changes on the page were hint-bit updates, then there will be no
full-page write in the WAL to repair the block.

It's possible that *that* situation is rare enough to let the checksum
raise a warning but not an error.

But personally I'm pretty loath to buffer every page write. The state of the
art is zero-copy processing, and we should be looking to reduce copies rather
than increase them. Though I suppose if we did a zero-copy CRC, that might
actually get us this buffered write for free.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 15:27:53
Message-ID: AANLkTimu5Ac21xu366f+8sXYwWXs_RPmJVd84XSqLFEx@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Oh, I'm mistaken. The problem was that buffering the writes was
> insufficient to deal with torn pages. Even if you buffer the writes if
> the machine crashes while only having written half the buffer out then
> the checksum won't match. If the only changes on the page were hint
> bit updates then there will be no full page write in the WAL log to
> repair the block.

Huh, this implies that if we did go through all the work of segregating the
hint bits, and could arrange that they all appear in the same 512-byte sector,
and buffered them so that we were writing the same bits we checksummed, then
we actually *could* include them in the CRC after all, since even a torn page
will almost certainly not tear an individual sector.

--
greg


From: Jim Nasby <jim(at)nasby(dot)net>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 16:26:56
Message-ID: CEE702B9-D762-4BCD-A0A2-B1947C016F10@nasby.net
Lists: pgsql-hackers

On Nov 9, 2010, at 9:27 AM, Greg Stark wrote:
> On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> Oh, I'm mistaken. The problem was that buffering the writes was
>> insufficient to deal with torn pages. Even if you buffer the writes if
>> the machine crashes while only having written half the buffer out then
>> the checksum won't match. If the only changes on the page were hint
>> bit updates then there will be no full page write in the WAL log to
>> repair the block.
>
> Huh, this implies that if we did go through all the work of
> segregating the hint bits and could arrange that they all appear on
> the same 512-byte sector and if we buffered them so that we were
> writing the same bits we checksummed then we actually *could* include
> them in the CRC after all since even a torn page will almost certainly
> not tear an individual sector.

If there's a torn page then we've crashed, which means we go through crash recovery, which puts a valid page (with valid CRC) back in place from the WAL. What am I missing?

BTW, I agree that at minimum we need to leave the option of only raising a warning when we hit a checksum failure. Some people might want Postgres to treat it as an error by default, but most folks will at least want the option to look at their (corrupt) data.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 16:44:04
Message-ID: AANLkTi=NFr9kP6bhfwfmB5TEmdwwidbXe9UyqR7z30mF@mail.gmail.com

On Tue, Nov 9, 2010 at 12:32 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> There are also crosschecks that you can apply: if it's a heap page, are
> there any index pages with pointers to it? If it's an index page, are
> there downlink or sibling links to it from elsewhere in the index?
> A page that Postgres left as zeroes would not have any references to it.
>
> IMO there are a lot of methods that can separate filesystem misfeasance
> from Postgres errors, probably with greater reliability than this hack.
> I would also suggest that you don't really need to prove conclusively
> that any particular instance is one or the other --- a pattern across
> multiple instances will tell you what you want to know.
>

Doing this postmortem on a regular deployment and fixing the problem would
not be too difficult. But this platform, which Postgres is a part of, would
be mostly left unattended once deployed (pardon me for not sharing the
details, as I am not sure if I can).

An external HA component is supposed to detect any problems (by querying
Postgres or by external means) and take evasive action. It is this
automation of problem detection that we are seeking.

As Greg pointed out, even with this hack in place, we might still get zero
pages from the FS (say, when ext3 does metadata journaling but not block
journaling). In that case we'd rely on recovery's WAL replay of relation
extension to reintroduce the magic number in pages.

> What's more, if I did believe that this was a safe and
> reliable technique, I'd be unhappy about the opportunity cost of
> reserving it for zero-page testing rather than other purposes.
>
>
This is one of those times where you are a bit too terse for me. What does
zero-page imply that this hack wouldn't?

Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh(dot)gurjeet(at){ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 17:01:45
Message-ID: AANLkTik5fTNgM9y3i5zXgCgHW1dRVbJY5dQVyPj8MKOL@mail.gmail.com

On Tue, Nov 9, 2010 at 4:26 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>> On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>>> Oh, I'm mistaken. The problem was that buffering the writes was
>>> insufficient to deal with torn pages. Even if you buffer the writes if
>>> the machine crashes while only having written half the buffer out then
>>> the checksum won't match. If the only changes on the page were hint
>>> bit updates then there will be no full page write in the WAL log to
>>> repair the block.
>
> If there's a torn page then we've crashed, which means we go through crash recovery, which puts a valid page (with valid CRC) back in place from the WAL. What am I missing?

"If the only changes on the page were hint bit updates then there will
be no full page write in the WAL to repair the block"

--
greg


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 17:06:41
Message-ID: AANLkTi=ypc=nd4opsVntRdM+OvCbcKDkxjeH=oVMizQe@mail.gmail.com

On Tue, Nov 9, 2010 at 11:26 AM, Jim Nasby <jim(at)nasby(dot)net> wrote:

>> Huh, this implies that if we did go through all the work of
>> segregating the hint bits and could arrange that they all appear on
>> the same 512-byte sector and if we buffered them so that we were
>> writing the same bits we checksummed then we actually *could* include
>> them in the CRC after all since even a torn page will almost certainly
>> not tear an individual sector.
>
> If there's a torn page then we've crashed, which means we go through crash recovery, which puts a valid page (with valid CRC) back in place from the WAL. What am I missing?

The problem case is where hint-bits have been set. Hint bits have
always been "we don't really care, but we write them".

A torn page on hint-bit-only writes is OK, because with a torn page
(assuming you don't get zeroed pages), you get the old or new chunks
of the complete 8K buffer, but they are identical except for the
hint bits, for which either the old or the new state is sufficient.

But with a checksum, a torn page with only hint-bit updates now gets
noticed. Before, it might have happened, but we wouldn't have noticed
or cared.

So, for getting checksums, we have to offer up a few things:
1) zero-copy writes, we need to buffer the write to get a consistent
checksum (or lock the buffer tight)
2) saving hint-bits on an otherwise unchanged page. We either need to
just not write that page, and lose the work the hint-bits did, or do
a full-page WAL of it, so the torn-page checksum is fixed

Both of these are theoretical performance tradeoffs. How badly do we
want to verify on read that it is *exactly* what we thought we wrote?
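Point 1 above can be sketched as a buffered write: checksum a private snapshot of the shared buffer and hand that same snapshot to the write, so the bits on disk always match the checksum even if hint bits keep changing concurrently. This is purely illustrative (toy FNV-1a hash, invented names), not PostgreSQL's actual buffer-manager code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Toy checksum (FNV-1a) standing in for a real page CRC. */
static uint32_t toy_checksum(const uint8_t *buf, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = (h ^ buf[i]) * 16777619u;
    return h;
}

/*
 * Snapshot the (possibly still hint-bit-mutating) shared buffer into a
 * private staging copy, and checksum the copy.  The caller then writes
 * the staging copy, so what lands on disk matches the checksum without
 * holding an exclusive lock on the buffer for the duration of the I/O.
 * The memcpy is the extra cost this tradeoff is about.
 */
static uint32_t checksum_and_stage_write(const uint8_t *shared_buf,
                                         uint8_t *staging_buf)
{
    memcpy(staging_buf, shared_buf, BLCKSZ);
    return toy_checksum(staging_buf, BLCKSZ);
}
```

A concurrent hint-bit flip after the snapshot leaves the staged copy, and hence its checksum, untouched.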

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 17:15:13
Message-ID: 23601.1289322913@sss.pgh.pa.us

Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com> writes:
> On Tue, Nov 9, 2010 at 12:32 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> IMO there are a lot of methods that can separate filesystem misfeasance
>> from Postgres errors, probably with greater reliability than this hack.

> Doing this postmortem on a regular deployment and fixing the problem would
> not be too difficult. But this platform, which Postgres is a part of, would
> be mostly left unattended once deployed (pardon me for not sharing the
> details, as I am not sure if I can).

> An external HA component is supposed to detect any problems (by querying
> Postgres or by external means) and take an evasive action. It is this
> automation of problem detection that we are seeking.

To be blunt, this argument is utter nonsense. The changes you propose
would still require manual analysis of any detected issues in order to
do anything useful about them. Once you know that there is, or isn't,
a filesystem-level error involved, what are you going to do next?
You're going to go try to debug the component you know is at fault,
that's what. And that problem is still AI-complete.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 17:31:01
Message-ID: AANLkTimvrd89rdg4UdXRpXH378RJnPt0WevNLSL6vR19@mail.gmail.com

On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> So, for getting checksums, we have to offer up a few things:
> 1) zero-copy writes, we need to buffer the write to get a consistent
> checksum (or lock the buffer tight)
> 2) saving hint-bits on an otherwise unchanged page.  We either need to
>> just not write that page, and lose the work the hint-bits did, or do
> a full-page WAL of it, so the torn-page checksum is fixed

Actually the consensus the last go-around on this topic was to
segregate the hint bits into a single area of the page and skip them
in the checksum. That way we don't have to do any of the above. It's
just that that's a lot of work.
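A minimal sketch of that consensus scheme: if all hint bits were segregated into one fixed area of the page, the checksum could simply neutralize that area. The offsets and sizes here are invented for illustration; designing the real on-page layout is the "lot of work" part:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ        8192
#define HINT_AREA_OFF 24   /* hypothetical: all hint bits segregated here */
#define HINT_AREA_LEN 128  /* hypothetical size of the hint-bit area */

/* Toy checksum (FNV-1a) standing in for a real page CRC. */
static uint32_t toy_page_hash(const uint8_t *buf, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = (h ^ buf[i]) * 16777619u;
    return h;
}

/*
 * Checksum the page with the hint-bit area zeroed out, so a torn write
 * that only changed hint bits still verifies.
 */
static uint32_t page_checksum_skip_hints(const uint8_t *page)
{
    uint8_t copy[BLCKSZ];

    memcpy(copy, page, BLCKSZ);
    memset(copy + HINT_AREA_OFF, 0, HINT_AREA_LEN); /* ignore hint bits */
    return toy_page_hash(copy, BLCKSZ);
}
```

Flipping bits inside the hint area leaves the checksum stable; flipping anything else changes it.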

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 19:05:57
Message-ID: AANLkTinkLkAQNYV5PL+6h=qOBmRbQ++A6Bc9y20vOE-T@mail.gmail.com

On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
>> So, for getting checksums, we have to offer up a few things:
>> 1) zero-copy writes, we need to buffer the write to get a consistent
>> checksum (or lock the buffer tight)
>> 2) saving hint-bits on an otherwise unchanged page.  We either need to
>> just not write that page, and lose the work the hint-bits did, or do
>> a full-page WAL of it, so the torn-page checksum is fixed
>
> Actually the consensus the last go-around on this topic was to
> segregate the hint bits into a single area of the page and skip them
> in the checksum. That way we don't have to do any of the above. It's
> just that that's a lot of work.

And it still allows silent data corruption, because bogusly clearing a
hint bit is, at the moment, harmless, but bogusly setting one is not.
I really have to wonder how other products handle this. PostgreSQL
isn't the only database product that uses MVCC - not by a long shot -
and the problem of detecting whether an XID is visible to the current
snapshot can't be ours alone. So what do other people do about this?
They either don't cache the information about whether the XID is
committed in-page (in which case, are they just slower or do they have
some other means of avoiding the performance hit?) or they cache it in
the page (in which case, they either WAL log it or they don't checksum
it). I mean, there aren't any other options, are there?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 19:14:29
Message-ID: 20101109191429.GF27429@aart.is.rice.edu

On Tue, Nov 09, 2010 at 02:05:57PM -0500, Robert Haas wrote:
> On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> >> So, for getting checksums, we have to offer up a few things:
> >> 1) zero-copy writes, we need to buffer the write to get a consistent
> >> checksum (or lock the buffer tight)
> >> 2) saving hint-bits on an otherwise unchanged page. We either need to
> >> just not write that page, and lose the work the hint-bits did, or do
> >> a full-page WAL of it, so the torn-page checksum is fixed
> >
> > Actually the consensus the last go-around on this topic was to
> > segregate the hint bits into a single area of the page and skip them
> > in the checksum. That way we don't have to do any of the above. It's
> > just that that's a lot of work.
>
> And it still allows silent data corruption, because bogusly clearing a
> hint bit is, at the moment, harmless, but bogusly setting one is not.
> I really have to wonder how other products handle this. PostgreSQL
> isn't the only database product that uses MVCC - not by a long shot -
> and the problem of detecting whether an XID is visible to the current
> snapshot can't be ours alone. So what do other people do about this?
> They either don't cache the information about whether the XID is
> committed in-page (in which case, are they just slower or do they have
> some other means of avoiding the performance hit?) or they cache it in
> the page (in which case, they either WAL log it or they don't checksum
> it). I mean, there aren't any other options, are there?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

That would imply that we need to have a CRC for just the hint bit
section or some type of ECC calculation that can detect bad hint
bits independent of the CRC for the rest of the page.

Regards,
Ken


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 19:15:01
Message-ID: 1289329940-sup-5862@alvh.no-ip.org

Excerpts from Robert Haas's message of Tue Nov 09 16:05:57 -0300 2010:

> And it still allows silent data corruption, because bogusly clearing a
> hint bit is, at the moment, harmless, but bogusly setting one is not.
> I really have to wonder how other products handle this. PostgreSQL
> isn't the only database product that uses MVCC - not by a long shot -
> and the problem of detecting whether an XID is visible to the current
> snapshot can't be ours alone. So what do other people do about this?
> They either don't cache the information about whether the XID is
> committed in-page (in which case, are they just slower or do they have
> some other means of avoiding the performance hit?) or they cache it in
> the page (in which case, they either WAL log it or they don't checksum
> it). I mean, there aren't any other options, are there?

Maybe allocate enough shared memory for pg_clog buffers back to the
freeze horizon, and just don't use hint bits? Maybe some intermediate
solution, i.e. allocate a large bunch of pg_clog buffers, and do
WAL-logged setting of hint bits only for tuples that go further back.

I remember someone had a patch to set all the bits in a page that passed
a threshold of some kind. Ah, no, that was for freezing tuples.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 19:37:05
Message-ID: 4CD9A2E1.8090708@agliodbs.com


> PostgreSQL
> isn't the only database product that uses MVCC - not by a long shot -
> and the problem of detecting whether an XID is visible to the current
> snapshot can't be ours alone. So what do other people do about this?
> They either don't cache the information about whether the XID is
> committed in-page (in which case, are they just slower or do they have
> some other means of avoiding the performance hit?) or they cache it in
> the page (in which case, they either WAL log it or they don't checksum
> it).

Well, most of the other MVCC-in-table DBMSes simply don't deal with
large, on-disk databases. In fact, I can't think of one which does,
currently; while MVCC has been popular for the New Databases, they're
all focused on "in-memory" databases. Oracle and InnoDB use rollback
segments.

Might be worth asking the BDB folks.

Personally, I think we're headed inevitably towards having a set of
metadata bitmaps for each table, like we do currently for the FSM.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 20:05:45
Message-ID: AANLkTikDz-kheq4e3WcfPcyLwrmJ0iCJwHh2ZKznSonK@mail.gmail.com

On Tue, Nov 9, 2010 at 7:37 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Well, most of the other MVCC-in-table DBMSes simply don't deal with
> large, on-disk databases.  In fact, I can't think of one which does,
> currently; while MVCC has been popular for the New Databases, they're
> all focused on "in-memory" databases.  Oracle and InnoDB use rollback
> segments.

Well rollback segments are still MVCC. However Oracle's MVCC is
block-based. So they only have to do the visibility check once per
block, not once per row. Once they find the right block version they
can process all the rows on it.

Also Oracle's snapshots are just the log position. Instead of having
to check whether every transaction committed or not, they just find
the block version which was last modified before the log position for
when their transaction started.
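The contrast can be sketched abstractly: an LSN-style snapshot needs one comparison per block version, while an XID-list snapshot needs a range check plus a membership test per tuple (all names invented for illustration, and the Postgres side ignores clog/abort handling):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Oracle-style: a snapshot is just a log position. */
static bool version_visible_lsn(uint64_t version_lsn, uint64_t snapshot_lsn)
{
    /* The block version is usable if it was produced at or before the
     * log position at which our transaction took its snapshot. */
    return version_lsn <= snapshot_lsn;
}

/* Postgres-style: test one XID against a snapshot's running-XID list
 * (simplified: treats every pre-xmin XID as committed). */
static bool xid_visible(uint32_t xid, uint32_t xmin, uint32_t xmax,
                        const uint32_t *running, int nrunning)
{
    if (xid < xmin)
        return true;            /* finished before the snapshot */
    if (xid >= xmax)
        return false;           /* started after the snapshot */
    for (int i = 0; i < nrunning; i++)
        if (running[i] == xid)
            return false;       /* in progress at snapshot time */
    return true;
}
```

The per-tuple membership test is what hint bits (and clog lookups) exist to make cheap.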

> Might be worth asking the BDB folks.
>
> Personally, I think we're headed inevitably towards having a set of
> metadata bitmaps for each table, like we do currently for the FSM.

Well, we already have a metadata bitmap for transaction visibility.
It's called the clog. There's no point in having another one structured
differently around the table.

The whole point of the hint bits is that it's in the same place as the data.

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 20:12:23
Message-ID: 4CD9AB27.7000509@agliodbs.com


> The whole point of the hint bits is that it's in the same place as the data.

Yes, but the hint bits are currently causing us trouble on several
features or potential features:

* page-level CRC checks
* eliminating vacuum freeze for cold data
* index-only access
* replication
* this patch
* etc.

At a certain point, it's worth the trouble to handle them differently
because of the other features that enables or makes much easier.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 20:25:25
Message-ID: AANLkTi=Yq1x0SNQiTd3-0STWbcf0FUfor_UeFK+1Urwv@mail.gmail.com

On Tue, Nov 9, 2010 at 8:12 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> The whole point of the hint bits is that it's in the same place as the data.
>
> Yes, but the hint bits are currently causing us trouble on several
> features or potential features:

Then we might have to get rid of hint bits. But they're hint bits for
a metadata file that already exists; creating another metadata file
doesn't solve anything.

Though incidentally all of the other items you mentioned are generic
problems caused by MVCC, not hint bits.

--
greg


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 21:23:39
Message-ID: AANLkTin2+w7UW-xVvTs2N4+uoNZMMg2SVqophoR7TkCR@mail.gmail.com

On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> Then we might have to get rid of hint bits. But they're hint bits for
> a metadata file that already exists, creating another metadata file
> doesn't solve anything.

Is there any way to instrument the writes of dirty buffers from
shared memory, and see how many of the pages normally being written are
not backed by WAL (hint-only updates)? Just "dumping" those buffers
without writes would allow at least *checksums* to go through without
losing all the benefits of the hint bits.

I've got a hunch (with no proof) that the penalty of not writing them
will be borne largely by small database installs. Large OLTP databases
probably won't have pages without a WAL'ed change and hint-bits set,
and large data warehouse ones will probably vacuum freeze big tables
on load to avoid the huge write penalty the first time they scan the
tables...

</waving hands>

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 21:42:46
Message-ID: 4CD9C056.1070704@agliodbs.com


> Though incidentally all of the other items you mentioned are generic
> problems caused by with MVCC, not hint bits.

Yes, but the hint bits prevent us from implementing workarounds.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 21:50:23
Message-ID: AANLkTikHoaJK_yy4OJWD0BCM86MuvkJDjOKCWLqf8knv@mail.gmail.com

On Tue, Nov 9, 2010 at 2:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
>>> So, for getting checksums, we have to offer up a few things:
>>> 1) zero-copy writes, we need to buffer the write to get a consistent
>>> checksum (or lock the buffer tight)
>>> 2) saving hint-bits on an otherwise unchanged page.  We either need to
>>> just not write that page, and lose the work the hint-bits did, or do
>>> a full-page WAL of it, so the torn-page checksum is fixed
>>
>> Actually the consensus the last go-around on this topic was to
>> segregate the hint bits into a single area of the page and skip them
>> in the checksum. That way we don't have to do any of the above. It's
>> just that that's a lot of work.
>
> And it still allows silent data corruption, because bogusly clearing a
> hint bit is, at the moment, harmless, but bogusly setting one is not.
> I really have to wonder how other products handle this.  PostgreSQL
> isn't the only database product that uses MVCC - not by a long shot -
> and the problem of detecting whether an XID is visible to the current
> snapshot can't be ours alone.  So what do other people do about this?
> They either don't cache the information about whether the XID is
> committed in-page (in which case, are they just slower or do they have
> some other means of avoiding the performance hit?) or they cache it in
> the page (in which case, they either WAL log it or they don't checksum
> it).  I mean, there aren't any other options, are there?

An examination of the MySQL source code reveals their answer. In
row_vers_build_for_semi_consistent_read(), which I can't swear is the
right place but seems to be, there is this comment:

/* We assume that a rolled-back transaction stays in
TRX_ACTIVE state until all the changes have been
rolled back and the transaction is removed from
the global list of transactions. */

Which makes sense. If you never leave rows from aborted transactions
in the heap forever, then the list of aborted transactions that you
need to remember for MVCC purposes will remain relatively small and
you can just include those XIDs in your MVCC snapshot. Our problem is
that we have no particular bound on the number of aborted transactions
whose XIDs may still be floating around, so we can't do it that way.

<dons asbestos underpants>

To impose a similar bound in PostgreSQL, you'd need to maintain the
set of aborted XIDs and the relations that need to be vacuumed for
each one. As you vacuum, you prune any tuples with aborted xmins
(which is WAL-logged already anyway) and additionally WAL-log clearing
the xmax for each tuple with an aborted xmax. Thus, when you
finish vacuuming the relation, the aborted XID is no longer present
anywhere in it. When you vacuum the last relation for a particular
XID, that XID no longer exists in the relation files anywhere and you
can remove it from the list of aborted XIDs. I think that WAL logging
the list of XIDs and list of unvacuumed relations for each at each
checkpoint would be sufficient for crash safety. If you did this, you
could then assume that any XID which precedes your snapshot's xmin is
committed.
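Under that scheme, the pre-xmin visibility test would become, roughly (all names hypothetical; the real structure would live in shared memory and be kept small by aggressive vacuuming):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Small bounded set of aborted XIDs not yet vacuumed out of all relations. */
typedef struct
{
    const uint32_t *xids;
    int             n;
} AbortedSet;

static bool aborted_contains(const AbortedSet *s, uint32_t xid)
{
    for (int i = 0; i < s->n; i++)
        if (s->xids[i] == xid)
            return true;
    return false;
}

/*
 * With aborted tuples guaranteed to be pruned promptly, any XID older
 * than the snapshot's xmin is committed unless it appears in the small
 * aborted set -- no clog lookup, and hence no hint bit, is needed.
 */
static bool old_xid_committed(uint32_t xid, uint32_t snapshot_xmin,
                              const AbortedSet *aborted)
{
    assert(xid < snapshot_xmin);    /* caller handles newer XIDs normally */
    return !aborted_contains(aborted, xid);
}
```

Problem #1 below corresponds to this set (and every snapshot carrying it) staying large after a big abort.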

1. When a big abort happens, you may have to carry that XID around in
every snapshot - and avoid advancing RecentGlobalXmin - for quite a
long time.
2. You have to WAL log marking the XMAX of an aborted transaction invalid.
3. You have to WAL log the not-yet-cleaned-up XIDs and the relations
each one needs vacuumed at each checkpoint.
4. There would presumably be some finite limit on the size of the
shared memory structure for aborted transactions. I don't think
there'd be any reason to make it particularly small, but if you sat
there and aborted transactions at top speed you might eventually run
out of room, at which point any transactions you started wouldn't be
able to abort until vacuum made enough progress to free up an entry.
5. It would be pretty much impossible to run with autovacuum turned
off, and in fact you would likely need to make it a good deal more
aggressive in the specific case of aborted transactions, to mitigate
problems #1, #3, and #4.

I'm not sure how bad those things would be, or if there are more that
I'm missing (besides the obvious "it would be a lot of work").

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 22:03:09
Message-ID: 4CD9C51D.4030301@agliodbs.com

On 11/9/10 1:50 PM, Robert Haas wrote:
> 5. It would be pretty much impossible to run with autovacuum turned
> off, and in fact you would likely need to make it a good deal more
> aggressive in the specific case of aborted transactions, to mitigate
> problems #1, #3, and #4.

6. This would require us to be more aggressive about VACUUMing old-cold
relations/pages, e.g. VACUUM FREEZE. This would make one of our worst
issues for data warehousing even worse.

What about having this map (and other hintbits) be per-relation? Hmmm.
That wouldn't work for DDL, I suppose ...

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 22:15:45
Message-ID: 4CD973B102000025000374C7@gw.wicourts.gov

Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> 6. This would require us to be more aggressive about VACUUMing
> old-cold relations/pages, e.g. VACUUM FREEZE. This would make
> one of our worst issues for data warehousing even worse.

I continue to feel that it is insane that when a table is populated
within the same database transaction which created it (e.g., a bulk
load of a table or partition), we don't write the tuples with
hint bits set for commit and xmin frozen. By the time any but the
creating transaction can see the tuples, *if* any other transaction
is ever able to see the tuples, these will be the correct values;
we really should be able to deal with it within the creating
transaction somehow.

If we ever handle that, would #6 be a moot point, or do you think
it's still a significant issue?

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 22:30:09
Message-ID: AANLkTin58_+fvGC=O=PZq28m=uzgTPEvh5B181kZSfjo@mail.gmail.com

On Tue, Nov 9, 2010 at 5:03 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 11/9/10 1:50 PM, Robert Haas wrote:
>> 5. It would be pretty much impossible to run with autovacuum turned
>> off, and in fact you would likely need to make it a good deal more
>> aggressive in the specific case of aborted transactions, to mitigate
>> problems #1, #3, and #4.
>
> 6. This would require us to be more aggressive about VACUUMing old-cold
> relations/pages, e.g. VACUUM FREEZE.  This would make one of our worst
> issues for data warehousing even worse.

Uh, no it doesn't. It only requires you to be more aggressive about
vacuuming the transactions that are in the aborted-XIDs array. It
doesn't affect transaction wraparound vacuuming at all, either
positively or negatively. You still have to freeze xmins before they
flip from being in the past to being in the future, but that's it.

> What about having this map (and other hintbits) be per-relation?  Hmmm.
>  That wouldn't work for DDL, I suppose ...

"This map"? I suppose you could track aborted XIDs per relation
instead of globally, but I don't see why that would affect DDL any
differently than anything else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 22:30:35
Message-ID: AANLkTi=1_ucX0z_wB5hmHomPG14zZrvoEcN6xssGwaDP@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 5:15 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> 6. This would require us to be more aggressive about VACUUMing
>> old-cold relations/page, e.g. VACUUM FREEZE.  This would make
>> one of our worst issues for data warehousing even worse.
>
> I continue to feel that it is insane that when a table is populated
> within the same database transaction which created it (e.g., a bulk
> load of a table or partition), that we don't write the tuples with
> hint bits set for commit and xmin frozen.  By the time any but the
> creating transaction can see the tuples, *if* any other transaction
> is ever able to see the tuples, these will be the correct values;
> we really should be able to deal with it within the creating
> transaction somehow.

I agree.

> If we ever handle that, would #6 be a moot point, or do you think
> it's still a significant issue?

I think it's a moot point anyway, per previous email.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 22:41:26
Message-ID: AANLkTintPQKjFBEMGc3Ww_db7wmTt8FeF=VqLgeQ69AG@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 3:05 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Tue, Nov 9, 2010 at 7:37 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Well, most of the other MVCC-in-table DBMSes simply don't deal with
>> large, on-disk databases.  In fact, I can't think of one which does,
>> currently; while MVCC has been popular for the New Databases, they're
>> all focused on "in-memory" databases.  Oracle and InnoDB use rollback
>> segments.
>
> Well rollback segments are still MVCC. However Oracle's MVCC is
> block-based. So they only have to do the visibility check once per
> block, not once per row. Once they find the right block version they
> can process all the rows on it.
>
> Also Oracle's snapshots are just the log position. Instead of having
> to check whether every transaction committed or not, they just find
> the block version which was last modified before the log position for
> when their transaction started.

That is cool. One problem is that it might sometimes result in
additional I/O. A transaction begins and writes a tuple. We must
write a preimage of the page (or at least, sufficient information to
reconstruct a preimage of the page) to the undo segment. If the
transaction commits relatively quickly, and all transactions which
took their snapshots before the commit end either by committing or by
aborting, we can discard that information from the undo segment
without ever writing it to disk. However, if that doesn't happen, the
undo log page may get evicted, and we're now doing three writes (WAL,
page, undo) rather than just two (WAL, page). That's no worse than an
update where the old and new tuples land on different pages, but it IS
worse than an update where the old and new tuples are on the same
page, or at least I think it is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 22:45:44
Message-ID: 4CD9CF18.2080509@agliodbs.com
Lists: pgsql-hackers

Robert,

> Uh, no it doesn't. It only requires you to be more aggressive about
> vacuuming the transactions that are in the aborted-XIDs array. It
> doesn't affect transaction wraparound vacuuming at all, either
> positively or negatively. You still have to freeze xmins before they
> flip from being in the past to being in the future, but that's it.

Sorry, I was trying to say that it's similar to the freeze issue, not
that it affects freeze. Sorry for the lack of clarity.

What I was getting at is that this could cause us to vacuum
relations/pages which would otherwise never be vacuumed (or at least,
not until freeze). Imagine a very large DW table which is normally
insert-only and seldom queried, but once a month or so the insert aborts
and rolls back.

I'm not saying that your proposal isn't worth testing. I'm just saying
that it may prove to be a net loss to overall system efficiency.

>> If we ever handle that, would #6 be a moot point, or do you think
>> > it's still a significant issue?

Kevin, the case which your solution doesn't fix is the common one of
"log tables" which keep adding records continuously, with < 5% inserts
or updates. That may seem like a "corner case" but such a table,
partitioned or unpartitioned, exists in around 1/3 of the commercial
applications I've worked on, so it's a common pattern.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 23:30:58
Message-ID: 20421.1289345458@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Though incidentally all of the other items you mentioned are generic
>> problems caused by MVCC, not hint bits.

> Yes, but the hint bits prevent us from implementing workarounds.

If we got rid of hint bits, we'd need workarounds for the ensuing
massive performance loss. There is no reason whatsoever to imagine
that we'd come out ahead in the end.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Jim Nasby <jim(at)nasby(dot)net>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 23:42:09
Message-ID: 20570.1289346129@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> <dons asbestos underpants>
> 4. There would presumably be some finite limit on the size of the
> shared memory structure for aborted transactions. I don't think
> there'd be any reason to make it particularly small, but if you sat
> there and aborted transactions at top speed you might eventually run
> out of room, at which point any transactions you started wouldn't be
> able to abort until vacuum made enough progress to free up an entry.

Um, that bit is a *complete* nonstarter. The possibility of a failed
transaction always has to be allowed. What if vacuum itself gets an
error for example? Or, what if the system crashes?

I thought for a bit about inverting the idea, such that there were a
limit on the number of unvacuumed *successful* transactions rather than
the number of failed ones. But that seems just as unforgiving: what if
you really need to commit a transaction to effect some system state
change? An example might be dropping some enormous table that you no
longer need, but vacuum is going to insist on plowing through before
it'll let you have any more transactions.

I'm of the opinion that any design that presumes it can always fit all
the required transaction-status data in memory is probably not even
worth discussing. There always has to be a way for status data to spill
to disk. What's interesting is how you can achieve enough locality of
access so that most of what you need to look at is usually in memory.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-10 00:04:05
Message-ID: AANLkTikx2Lm7i4_pWFV3h9bZed8Thv7Jq6_eSAsXOz8-@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 5:45 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Robert,
>
>> Uh, no it doesn't.  It only requires you to be more aggressive about
>> vacuuming the transactions that are in the aborted-XIDs array.  It
>> doesn't affect transaction wraparound vacuuming at all, either
>> positively or negatively.  You still have to freeze xmins before they
>> flip from being in the past to being in the future, but that's it.
>
> Sorry, I was trying to say that it's similar to the freeze issue, not
> that it affects freeze.  Sorry for the lack of clarity.
>
> What I was getting at is that this could cause us to vacuum
> relations/pages which would otherwise never be vacuumed (or at least,
> not until freeze).  Imagine a very large DW table which is normally
> insert-only and seldom queried, but once a month or so the insert aborts
> and rolls back.

Oh, I see. In that case, under the proposed scheme, you'd get an
immediate vacuum of everything inserted into the table since the last
failed insert. Everything prior to the last failed insert would be
OK, since the visibility map bits would already be set for those
pages. Yeah, that would be annoying.

There's a related problem with index-only scans. A large DW table
which is normally insert-only, but which IS queried regularly, won't
be able to use index-only scans effectively because, without regular
vacuuming, the visibility map bits won't be set. We've
previously discussed the possibility of having the background writer
set hint bits before writing the pages, and maybe it could even set
the all-visible bit and update the visibility map, too. But that
won't help if the transaction inserts a large enough quantity of data
that it starts spilling buffers to disk before it commits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-10 00:36:35
Message-ID: AANLkTimwMxWe1F94A6ifu0JPJt0zxy=3+CwPwCd3xOZm@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 10, 2010 at 1:15 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Once you know that there is, or isn't,
> a filesystem-level error involved, what are you going to do next?
> You're going to go try to debug the component you know is at fault,
> that's what. And that problem is still AI-complete.
>
>
If we know for sure that Postgres was not at fault, then we have a
standby node to fail over to, where a warm standby is maintained via
streaming replication.

Regards
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh(dot)gurjeet(at){ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Jim Nasby <jim(at)nasby(dot)net>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-10 00:50:47
Message-ID: AANLkTik5G-KeNGu0VqZCAbP1Qn-ZDDc2qZu38q1n_0JE@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 6:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> <dons asbestos underpants>
>> 4. There would presumably be some finite limit on the size of the
>> shared memory structure for aborted transactions.  I don't think
>> there'd be any reason to make it particularly small, but if you sat
>> there and aborted transactions at top speed you might eventually run
>> out of room, at which point any transactions you started wouldn't be
>> able to abort until vacuum made enough progress to free up an entry.
>
> Um, that bit is a *complete* nonstarter.  The possibility of a failed
> transaction always has to be allowed.  What if vacuum itself gets an
> error for example?  Or, what if the system crashes?

I wasn't proposing that it was impossible to abort, only that aborts
might have to block. I admit I don't know what to do about VACUUM
itself failing. A transient failure mightn't be so bad, but if you
find yourself permanently unable to eradicate the XIDs left behind by
an aborted transaction, you'll eventually have to shut down the
database, lest the XID space wrap around.

Actually, come to think of it, there's no reason you COULDN'T spill
the list of aborted-but-not-yet-cleaned-up XIDs to disk. It's just
that XidInMVCCSnapshot() would get reeeeeeally expensive after a
while.

> I thought for a bit about inverting the idea, such that there were a
> limit on the number of unvacuumed *successful* transactions rather than
> the number of failed ones.  But that seems just as unforgiving: what if
> you really need to commit a transaction to effect some system state
> change?  An example might be dropping some enormous table that you no
> longer need, but vacuum is going to insist on plowing through before
> it'll let you have any more transactions.

The number of relevant aborted XIDs tends naturally to decline to zero
as vacuum does its thing, while the number of relevant committed XIDs
tends to grow very, very large (it starts to decline only when we
start freezing things), so remembering the not-yet-cleaned-up aborted
XIDs seems likely to be cheaper. In fact, in many cases, the set of
not-yet-cleaned-up aborted XIDs will be completely empty.

> I'm of the opinion that any design that presumes it can always fit all
> the required transaction-status data in memory is probably not even
> worth discussing.

Well, InnoDB does it.

> There always has to be a way for status data to spill
> to disk.  What's interesting is how you can achieve enough locality of
> access so that most of what you need to look at is usually in memory.

We're not going to get any more locality of reference than we're
already getting from hint bits, are we? The advantage of trying to do
timely cleanup of aborted transactions is that you can assume that any
XID before RecentGlobalXmin is committed, without checking CLOG and
without having to update hint bits and write out the ensuing dirty
pages. If we could make CLOG access cheap enough that we didn't need
hint bits, that would also solve that problem, but nobody (including
me) seems to think that's feasible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-10 03:31:39
Message-ID: AANLkTimtP7=4ke0DY+shMab5k3zAyELEDJGiu7kq4UDX@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 7:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 9, 2010 at 5:45 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Robert,
>>
>>> Uh, no it doesn't.  It only requires you to be more aggressive about
>>> vacuuming the transactions that are in the aborted-XIDs array.  It
>>> doesn't affect transaction wraparound vacuuming at all, either
>>> positively or negatively.  You still have to freeze xmins before they
>>> flip from being in the past to being in the future, but that's it.
>>
>> Sorry, I was trying to say that it's similar to the freeze issue, not
>> that it affects freeze.  Sorry for the lack of clarity.
>>
>> What I was getting at is that this could cause us to vacuum
>> relations/pages which would otherwise never be vacuumed (or at least,
>> not until freeze).  Imagine a very large DW table which is normally
>> insert-only and seldom queried, but once a month or so the insert aborts
>> and rolls back.
>
> Oh, I see.  In that case, under the proposed scheme, you'd get an
> immediate vacuum of everything inserted into the table since the last
> failed insert.  Everything prior to the last failed insert would be
> OK, since the visibility map bits would already be set for those
> pages.  Yeah, that would be annoying.

Ah, but it might be fixable. You wouldn't really need to do a
full-fledged vacuum. It would be sufficient to scan the heap pages
that might contain the XID we're trying to clean up after, without
touching the indexes. Instead of actually removing tuples with an
aborted XMIN, you could just mark the line pointers LP_DEAD. Tuples
with an aborted XMAX don't require touching the indexes anyway. So as
long as you have some idea which segment of the relation was
potentially dirtied by that transaction, you could just scan those
blocks and update the item pointers and/or XMAX values for the
offending tuples without doing anything else (although you'd probably
want to opportunistically grab the buffer cleanup lock and defragment
if possible).

Unfortunately, I'm now realizing another problem. During recovery,
you have to assume that any XIDs that didn't commit are aborted; under
the scheme I proposed upthread, if a transaction that was in-flight at
crash time had begun prior to the last checkpoint, you wouldn't know
which relations it had potentially dirtied. Ouch. But I think this
is fixable, too. Let's invent a new on-disk structure called the
content-modified log. Transactions that want to insert, update, or
delete tuples allocate pages from this structure. The header of each
page stores the XID of the transaction that owns that page and the ID
of the database to which that transaction is bound. Following the
header, there are a series of records of the form: tablespace OID,
table OID, starting page number, ending page number. Each such record
indicates that the given XID may have put its XID on disk within the
given page range of the specified relation. Each checkpoint flushes
the dirty pages of the modified-content log to disk along with
everything else. Thus, on redo, we can reconstruct the additional
entries that need to be added to the log from the contents of WAL
subsequent to the redo pointer.

If a transaction commits, we can remove all of its pages from the
modified-content log; in fact, if a transaction begins and commits
without an intervening checkpoint, the pages never need to hit the
disk at all. If a transaction aborts, its modified-content log pages
must stick around until we've eradicated any copies of its XID in the
relation data files. We maintain a global value for the oldest
aborted XID which is not yet fully cleaned up (let's called this the
OldestNotQuiteDeadYetXID). When we see an XID which precedes
OldestNotQuiteDeadYetXID, we know it's committed. Otherwise, we check
whether the XID precedes the xmin of our snapshot. If it does, we
have to check whether the XID is committed or aborted (it must be one
or the other). If it does not, we use our snapshot, as now. Checking
XIDs between OldestNotQuiteDeadYetXID and our snapshot's xmin is
potentially expensive, but (1) if there aren't many aborted
transactions, this case shouldn't arise very often; (2) if the XID
turns out to be aborted and we can get an exclusive buffer content
lock, we can nuke that copy of the XID to save the next guy the
trouble of examining it; and (3) we can maintain a size-limited
per-backend cache of this information, which should help in the normal
cases where there either aren't that many XIDs that fall into this
category or our transaction doesn't see all that many of them.

This also addresses Tom's concern about needing to store all the
information in memory, and the need to WAL-log not-yet-cleaned-up XIDs
at each checkpoint. You still need to aggressively clean up after
aborted transactions, either using our current vacuum mechanism or the
"just zap the XIDs" shortcut described above.

(An additional interesting point about this design is that you could
potentially also use it to drive vacuum activity for transactions that
commit, especially if we were to also store a flag indicating whether
each page range contained updates/deletes or only inserts.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-14 20:52:39
Message-ID: 4CE04C17.9050601@agliodbs.com
Lists: pgsql-hackers


> If we got rid of hint bits, we'd need workarounds for the ensuing
> massive performance loss. There is no reason whatsoever to imagine
> that we'd come out ahead in the end.

Oh, there's no question that we need something which serves the same
purpose as the existing hint bits. But there's a lot of question about
whether our existing implementation is optimal.

For example, imagine if the hint bits were moved to a separate per-table
bitmap outside the table instead of being stored with each row, as the
current FSM is. Leaving aside the engineering required for this (which
would be considerable, especially when it comes to consistency and
durability), this would potentially allow solutions to the following issues:

* Index-only access
* I/O associated with hint bit setting
* Vacuum freezing old-cold data
* Page-level CRCs
* Rsyncing tables for replication

Alternately, we could attack this one hint-bit purpose at a time. For example, if
we restructured the CLOG so that it was an efficient in-memory index
(yes, I'm being handwavy), then having the XID-is-visible hint bits
might become completely unnecessary. We could then also improve the
visibility map to be reliable and include "frozen" bits as well.

Overall, what I'm pointing out is that our current implementation of
hint bits is blocking not one but several major features and causing our
users performance pain. It's time to look for an implementation which
doesn't have the same problems we're familiar with.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-14 21:40:17
Message-ID: AANLkTinPZTY3dwaCcabRNKfM2xwo_a2O=aDaqiMCdZuV@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 14, 2010 at 8:52 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> For example, imagine if the hint bits were moved to a separate per-table
> bitmap outside the table instead of being stored with each row, as the
> current FSM is.

How many times do we have to keep going around the same block?

We *already* have a separate bitmap outside the table for transaction
commit bits. It's the clog.

The only reason the hint bits exist is to cache that so we don't need
to do extra I/O to check tuple visibility. If the hint bits are moved
outside the table then they serve no purpose whatsoever. Then you have
an additional I/O to attempt to save an additional I/O.

The only difference between the clog and your proposal is that the
clog is two bits per transaction and your proposal is 4 bits per
tuple. The per-tuple idea guarantees that the extra I/O will be very
localized, which isn't necessarily true for the clog, but the clog is
small enough that it probably is true anyway. And even if there's no
I/O, the overhead to consult the clog/per-table fork in memory is
probably significant.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-14 22:15:39
Message-ID: 12163.1289772939@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Sun, Nov 14, 2010 at 8:52 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> For example, imagine if the hint bits were moved to a separate per-table
>> bitmap outside the table instead of being stored with each row, as the
>> current FSM is.

> How many times do we have to keep going around the same block?

> We *already* have a separate bitmap outside the table for transaction
> commit bits. It's the clog.

> The only reason the hint bits exist is to cache that so we don't need
> to do extra I/O to check tuple visibility. If the hint bits are moved
> outside the table then they serve no purpose whatsoever. Then you have
> an additional I/O to attempt to save an additional I/O.

Well, not quite. The case this could improve is index-only scans:
you could go (or so he hopes) directly from the index to the hint
bits given the TID stored by the index. A clog lookup is not possible
without XMIN & XMAX, which we do not keep in index entries.

But I'm just as skeptical as you are about this being a net win.
It'll pessimize too much other stuff.

Josh is ignoring the proposal that is on the table and seems actually
workable, which is to consult the visibility map during index-only
scans. For mostly-static tables this would save trips to the heap for
very little extra I/O. The hard part is to make the VM reliable, but
that is not obviously harder than making separately-stored hint bits
reliable.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-14 23:11:19
Message-ID: 4CE06C97.8000407@dunslane.net
Lists: pgsql-hackers

On 11/14/2010 05:15 PM, Tom Lane wrote:
> Josh is ignoring the proposal that is on the table and seems actually
> workable, which is to consult the visibility map during index-only
> scans. For mostly-static tables this would save trips to the heap for
> very little extra I/O. The hard part is to make the VM reliable, but
> that is not obviously harder than making separately-stored hint bits
> reliable.

I thought we had agreement in the past that this was the way we should
proceed.

cheers

andrew


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-15 18:16:45
Message-ID: 4CE1790D.9070805@agliodbs.com
Lists: pgsql-hackers

Greg, Tom,

> We *already* have a separate bitmap outside the table for transaction
> commit bits. It's the clog.

You didn't read my whole e-mail. I talk about the CLOG further down.

> Josh is ignoring the proposal that is on the table and seems actually
> workable, which is to consult the visibility map during index-only
> scans. For mostly-static tables this would save trips to the heap for
> very little extra I/O. The hard part is to make the VM reliable, but
> that is not obviously harder than making separately-stored hint bits
> reliable.

No, I'm not. I'm pointing out that it doesn't unblock the other 4
features/improvements I mentioned, *all* of which would be unblocked by
not storing the hint bits in the table, whatever means we use to do so.
You, for your part, are consistently ignoring these other issues.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-15 18:24:48
Message-ID: 332.1289845488@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> No, I'm not. I'm pointing out that it doesn't unblock the other 4
> features/improvements I mentioned, *all* of which would be unblocked by
> not storing the hint bits in the table, whatever means we use to do so.
> You, for your part, are consistently ignoring these other issues.

I'm not ignoring them; I just choose to work on other issues, since
there is no viable proposal for fixing them. I don't intend to put
my time into dead ends.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-15 19:06:21
Message-ID: 4CE184AD.5030401@agliodbs.com
Lists: pgsql-hackers


> I'm not ignoring them; I just choose to work on other issues, since
> there is no viable proposal for fixing them. I don't intend to put
> my time into dead ends.

So, that's a "show me a patch and we'll talk"? Understood, then.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-15 19:16:06
Message-ID: AANLkTi=KwBFrbJgbrh_8_CzGXoA7YzLcDR5PBicay=e8@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 15, 2010 at 2:06 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> I'm not ignoring them; I just choose to work on other issues, since
>> there is no viable proposal for fixing them.  I don't intend to put
>> my time into dead ends.
>
> So, that's a "show me a patch and we'll talk"?  Understood, then.

Or even just a proposal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-11-15 19:41:51
Message-ID: 5161.1289850111@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Nov 15, 2010 at 2:06 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> I'm not ignoring them; I just choose to work on other issues, since
>>> there is no viable proposal for fixing them. I don't intend to put
>>> my time into dead ends.

>> So, that's a "show me a patch and we'll talk"? Understood, then.

> Or even just a proposal.

Well, he did have a proposal ... it just wasn't very credible. Moving
the hint bits around is at best a zero-sum game; it seems likely to
degrade cases we now handle well more than it improves cases we don't.
I think what we need is a fundamentally new idea, and I've not seen one.

regards, tom lane


From: Jim Nasby <Jim(at)Nasby(dot)net>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Rethinking hint bits WAS: Protecting against unexpected zero-pages: proposal
Date: 2010-12-06 17:06:49
Message-ID: 7FC2D2C0-C100-460C-A2E3-9090A449C025@Nasby.net
Lists: pgsql-hackers

On Nov 14, 2010, at 3:40 PM, Greg Stark wrote:
> On Sun, Nov 14, 2010 at 8:52 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> For example, imagine if the hint bits were moved to a separate per-table
>> bitmap outside the table instead of being stored with each row, as the
>> current FSM is.
>
> How many times do we have to keep going around the same block?
>
> We *already* have a separate bitmap outside the table for transaction
> commit bits. It's the clog.
>
> The only reason the hint bits exist is to cache that so we don't need
> to do extra I/O to check tuple visibility. If the hint bits are moved
> outside the table then they serve no purpose whatsoever. Then you have
> an additional I/O to attempt to save an additional I/O.

Are you sure hint bits are only for IO savings? Calculating visibility
from CLOG involves a hell of a lot more CPU than checking a hint bit.

It would be extremely interesting if the CPU overhead wasn't very
noticeable, however. That would mean we *only* have to worry about CLOG
IO, and there's probably a lot of ways around that (memory mapping CLOG
is one possibility), especially considering that 4G isn't exactly a
large amount of memory these days.
--
Jim C. Nasby, Database Architect                   jim(at)nasby(dot)net
512.569.9461 (cell)                             http://jim.nasby.net
