Re: Placing hints in line pointers

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Placing hints in line pointers
Date: 2013-06-01 14:45:02
Message-ID: CA+U5nMLzCXuK-hix4OJXFMeu--0W8=vWtLW-U8boOncZ=LMzdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Notes on a longer term idea...

An item pointer (also called line pointer) is used to allow an
external pointer to an item, while allowing us to place the tuple that
anywhere on the page. An ItemId is 4 bytes long and currently consists
of (see src/include/storage/itemid.h)...

typedef struct ItemIdData
{
unsigned lp_off:15, /* offset to tuple (from start of page) */
lp_flags:2, /* state of item pointer, see below */
lp_len:15; /* byte length of tuple */
} ItemIdData;

The offset to the tuple is 15 bits, which is sufficient to point to
32768 separate byte positions, and hence why we limit ourselves to
32kB blocks.

If we use 4 byte alignment for tuples, then that would mean we
wouldn't ever use the lower 2 bits of lp_off, nor would we use the
lower 2 bits of lp_len. They are always set at zero. (Obviously, with
8 byte alignment we would have 3 bits spare in each, but I'm looking
for something that works the same on various architectures for
simplicity).

So my suggestion is to make lp_off and lp_len store the values in
terms of 4 byte chunks, which would allow us to rework the data
structure like this...

typedef struct ItemIdData
{
unsigned lp_off:13, /* offset to tuple (from start of
page), number of 4 byte chunks */
lp_xmin_hint:2, /* committed and invalid hints for xmin */
lp_flags:2, /* state of item pointer, see below */
lp_len:13; /* byte length of tuple, number of 4
byte chunks */
lp_xmax_hint:2, /*committed and invalid hints for xmax */
} ItemIdData;

i.e. we have room for 4 additional bits and we use those to put the
tuple hints for xmin and xmax

Doing this would have two purposes:

* We wouldn't need to follow the pointer if the row is marked aborted.
This would save a random memory access for that tuple

* It would isolate the tuple hint values into a smaller area of the
block, so we would be able to avoid the annoyance of recalculating the
checksums for the whole block when a single bit changes.

We wouldn't need to do a FPW when a hint changes, we would only need
to take a copy of the ItemId array, which is much smaller. And it
could be protected by its own checksum.

(In addition, if we wanted, this could be used to extend block size to
64kB if we used 8-byte alignment for tuples)

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Placing hints in line pointers
Date: 2013-06-10 02:43:58
Message-ID: 1370832238.7746.11.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 2013-06-01 at 15:45 +0100, Simon Riggs wrote:
> Doing this would have two purposes:
>
> * We wouldn't need to follow the pointer if the row is marked aborted.
> This would save a random memory access for that tuple

That's quite similar to LP_DEAD, right? You could potentially set this
new hint sooner, which may be an advantage.

> We wouldn't need to do a FPW when a hint changes, we would only need
> to take a copy of the ItemId array, which is much smaller. And it
> could be protected by its own checksum.

I like the direction of this idea.

I don't see exactly how you want to make this safe, though. During
replay, we can't just replace the ItemId array, because the other
contents on the page might be newer (e.g. some tuples may have been
cleaned out, leaving the item pointers pointing to the wrong place).

> (In addition, if we wanted, this could be used to extend block size to
> 64kB if we used 8-byte alignment for tuples)

I think that supporting 64KB blocks has merit.

Interesting observation about the extra bits available. That would be an
on-disk format change, so perhaps this is something to add to the list
of "waiting for a format change"?

Regards,
Jeff Davis


From: Greg Stark <stark(at)mit(dot)edu>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Placing hints in line pointers
Date: 2013-06-10 09:40:19
Message-ID: CAM-w4HOg3VJSsqTnqss5gdjYRY+LwYtkFiJnoOO0dEbRH5azbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 10, 2013 at 3:43 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>> We wouldn't need to do a FPW when a hint changes, we would only need
>> to take a copy of the ItemId array, which is much smaller. And it
>> could be protected by its own checksum.
>
> I like the direction of this idea.

One of the previous proposals was to move all the hint bits to a
dedicated area. This had a few problems:

1) Three areas would have meant we wold have needed some tricky
ability to relocate the third area since the current "grow from both
ends" technique doesn't scale to three. Throwing them in with the line
pointers might nicely solve this.

2) We kept finding more hint bits lying around. There are hint bits in
the heap page header, there are hint bits in indexes, and I thought we
might have found more hint bits in the tuple headers than the 4 you
note. I'm not sure this is still true incidentally, I think the
PD_ALLVISIBLE flag might no longer be a hint bit? Was that the only
one in the page header? Are the index hint bits also relocatable to
the line pointers in the index pages?

3) The reduction in the checksum coverage. Personally I thought this
was a red herring -- they're hint bits, they're whole raison d'etre is
to be a performance optimization. But if you toss the line pointers in
with them then I see a problem. Protecting the line pointers is
critically important. A flipped bit in a line pointer could cause all
kinds of weirdness. And I don't think it would be easy to protect them
with their own checksum. You would have the same problems you
currently have of a process updating the hint bit behind your back
while calculating the checksum or after your last WAL record but
before the block is flushed.

Now if this is combined with the other idea -- masking out *just* the
hint bits from the checksum I wonder if moving them to the line
pointers doesn't make that more feasible. Since they would be in a
consistent location for every line pointer, instead of having to check
on each iteration if we're looking at the beginning of a tuple, and
the only thing we would be counting on being correct before checking
the checksum would be the number of line pointers (rather than every
line pointer offset).

--
greg