Logical to physical page mapping

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Logical to physical page mapping
Date: 2012-10-27 05:01:50
Message-ID: 508B6ABE.2030801@Yahoo.com
Lists: pgsql-hackers

I just had this thought a few minutes ago, discussed it briefly with
RhodiumToad on #postgresql, and wanted to put it out here for discussion.
Feel free to rip it apart. It is probably a bit "al dente" at this point
and needs more cooking.

The reason why we need full_page_writes is that we need to guard against
torn pages or partial writes. So what if smgr were to manage a mapping
between logical page numbers and their physical location in the relation?

At the point where today we would require a full page write into WAL, we
would instead mark the buffer as "needs relocation". The smgr would then
write this page to another physical location whenever it is time to write
it (via the background writer, hopefully). After that page is flushed, it
would update the page location pointer, or whatever we want to call it.
A physical page location freed this way can be reused once the location
pointer has been flushed to disk. This is a critical ordering of writes:
first the page at the new location, then the pointer to the current
location. Doing so would make write(2) appear atomic to us, which is
exactly what we need for crash recovery.
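
To illustrate the ordering, a minimal sketch (all names and file
descriptors are invented for the sketch; this is not actual smgr code):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192

/*
 * Relocate one "needs relocation" page: the new page image must be
 * durable on disk before the logical->physical pointer is updated.
 * rel_fd is the relation file, map_fd the hypothetical pointer map.
 */
static int
relocate_page(int rel_fd, int map_fd,
              uint32_t logical_pageno, uint32_t new_physical_pageno,
              const char *page_image)
{
    /* 1. Write the page to its new physical location and flush it. */
    if (pwrite(rel_fd, page_image, PAGE_SZ,
               (off_t) new_physical_pageno * PAGE_SZ) != PAGE_SZ)
        return -1;
    if (fsync(rel_fd) != 0)
        return -1;

    /* 2. Only then point the logical page at the new location. */
    if (pwrite(map_fd, &new_physical_pageno, sizeof(uint32_t),
               (off_t) logical_pageno * sizeof(uint32_t)) != sizeof(uint32_t))
        return -1;
    if (fsync(map_fd) != 0)
        return -1;

    /* 3. The old physical slot may be reused from here on. */
    return 0;
}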

In addition to that, vacuum would now be able to tell smgr "hey, this
page is completely empty". Instead of doing the second "empty page for
truncate" scan, smgr could slowly migrate pages on first touch after a
checkpoint towards the head of the file, into these empty pages. This
way it would free pages at the end, and smgr would then be completely at
liberty to truncate them off whenever it sees fit. No extra scan
required, just a little more bookkeeping. This would be the case not
only for heap pages but for empty index pages as well.
Shrinking/truncating indexes is something we are completely unable to do
today. Whenever the buffer manager is asked for such a page that doesn't
exist physically any more, it would just initialize an empty one of that
kind (heap/index) in a buffer and mark it "needs relocation". It would
get recreated physically on eviction/checkpoint without freeing any
previously occupied space.
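
The free-slot bookkeeping could be as simple as a sorted list; a sketch
(again with invented names, just to show the migration towards the head
of the file):

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

typedef struct FreeSlotList
{
    uint32_t   *slots;      /* free physical page numbers, sorted ascending */
    int         nslots;
} FreeSlotList;

/*
 * Pick a destination for a page marked "needs relocation".  Taking the
 * lowest free slot migrates pages towards the head of the file, so that
 * free space collects at the end and can simply be truncated off.
 */
static bool
choose_relocation_target(FreeSlotList *fl, uint32_t cur_physical,
                         uint32_t *target)
{
    if (fl->nslots == 0 || fl->slots[0] >= cur_physical)
        return false;       /* nothing closer to the head; stay in place */

    *target = fl->slots[0];
    fl->nslots--;
    memmove(fl->slots, fl->slots + 1, fl->nslots * sizeof(uint32_t));
    return true;
}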

Comments?
Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-27 13:43:05
Message-ID: 18433.1351345385@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> The reason why we need full_page_writes is that we need to guard against
> torn pages or partial writes. So what if smgr were to manage a mapping
> between logical page numbers and their physical location in the relation?

> At the point where today we would require a full page write into WAL, we
> would instead mark the buffer as "needs relocation". The smgr would then
> write this page to another physical location whenever it is time to write
> it (via the background writer, hopefully). After that page is flushed, it
> would update the page location pointer, or whatever we want to call it.
> A physical page location freed this way can be reused once the location
> pointer has been flushed to disk. This is a critical ordering of writes:
> first the page at the new location, then the pointer to the current
> location. Doing so would make write(2) appear atomic to us, which is
> exactly what we need for crash recovery.

I think you're just moving the atomic-write problem from the data pages
to wherever you keep these pointers.

regards, tom lane


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-27 18:41:51
Message-ID: 508C2AEF.1040004@vmware.com
Lists: pgsql-hackers

On 27.10.2012 16:43, Tom Lane wrote:
> Jan Wieck<JanWieck(at)Yahoo(dot)com> writes:
>> The reason why we need full_page_writes is that we need to guard against
>> torn pages or partial writes. So what if smgr were to manage a mapping
>> between logical page numbers and their physical location in the relation?
>
>> At the point where today we would require a full page write into WAL, we
>> would instead mark the buffer as "needs relocation". The smgr would then
>> write this page to another physical location whenever it is time to write
>> it (via the background writer, hopefully). After that page is flushed, it
>> would update the page location pointer, or whatever we want to call it.
>> A physical page location freed this way can be reused once the location
>> pointer has been flushed to disk. This is a critical ordering of writes:
>> first the page at the new location, then the pointer to the current
>> location. Doing so would make write(2) appear atomic to us, which is
>> exactly what we need for crash recovery.

Hmm, aka copy-on-write.

> I think you're just moving the atomic-write problem from the data pages
> to wherever you keep these pointers.

If the pointers are stored as simple 4-byte integers, you probably could
assume that they're atomic, and won't be torn.

There are a lot of practical problems in adding another level of
indirection to every page access, though. It'll surely add some overhead
to every access, even if the data never changes. And it's not at all
clear to me that it would perform better than full_page_writes. You're
writing and flushing out roughly the same amount of data AFAICS.

What exactly is the problem with full_page_writes that we're trying to
solve?

- Heikki


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jan Wieck <JanWieck(at)yahoo(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-27 20:09:41
Message-ID: CAGTBQpZ5ZRDPbo1ZjH7fogv1Se2PhS=J6otk35LGN8aDH3ESzw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 27, 2012 at 3:41 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>
>> I think you're just moving the atomic-write problem from the data pages
>> to wherever you keep these pointers.
>
>
> If the pointers are stored as simple 4-byte integers, you probably could
> assume that they're atomic, and won't be torn.

That could backfire.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jan Wieck <JanWieck(at)yahoo(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-27 20:57:45
Message-ID: 25570.1351371465@sss.pgh.pa.us
Lists: pgsql-hackers

Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
> On Sat, Oct 27, 2012 at 3:41 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>>> I think you're just moving the atomic-write problem from the data pages
>>> to wherever you keep these pointers.

>> If the pointers are stored as simple 4-byte integers, you probably could
>> assume that they're atomic, and won't be torn.

> That could backfire.

Yeah, the potential loss in resiliency is scary. Assume for the sake of
argument that we're storing these indirection pointers in 8K pages, 2000
or so to the page. If you get a read failure on a regular heap page,
you've lost 8K worth of data. If you get a read failure on an
indirection page, you've lost 16MB worth. (Though perhaps the pointers
could be reconstructed, given additional overhead data on each regular
heap page; but that wouldn't be very cheap.) Also, the write traffic
on a pointer page is 2000X as much as it is on an average heap page,
offering 2000X the opportunities for the disk hardware to drop a bit,
or for SSD storage to wear down, etc. (Perhaps it's not that bad for
typical access patterns, but then again perhaps it's worse, given the
already noted strict requirements for write ordering.) So it seems like
there'd be a rather nasty magnification of consequences for hardware
errors.

You've also got something like a 2000X concentration of access
contention for pointer pages compared to regular pages.
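
For concreteness, the arithmetic behind those figures (a trivial sketch;
the 4-byte pointer size is the assumption from upthread):

#include <stdio.h>

int
main(void)
{
    const int page_sz = 8192;               /* bytes per page */
    const int ptr_sz = 4;                   /* bytes per indirection pointer */
    const int ptrs_per_page = page_sz / ptr_sz;     /* 2048, "2000 or so" */

    printf("pointers per 8K page:   %d\n", ptrs_per_page);
    /* one lost pointer page takes 2048 heap pages with it: 16MB */
    printf("data lost per bad page: %d MB\n",
           ptrs_per_page * page_sz / (1024 * 1024));
    return 0;
}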

regards, tom lane


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-27 21:12:59
Message-ID: 508C4E5B.6090606@archidevsys.co.nz
Lists: pgsql-hackers

On 28/10/12 07:41, Heikki Linnakangas wrote:
> On 27.10.2012 16:43, Tom Lane wrote:
>> Jan Wieck<JanWieck(at)Yahoo(dot)com> writes:
>>> The reason why we need full_page_writes is that we need to guard
>>> against torn pages or partial writes. So what if smgr were to manage
>>> a mapping between logical page numbers and their physical location in
>>> the relation?
>>
>>> At the point where today we would require a full page write into WAL,
>>> we would instead mark the buffer as "needs relocation". The smgr would
>>> then write this page to another physical location whenever it is time
>>> to write it (via the background writer, hopefully). After that page is
>>> flushed, it would update the page location pointer, or whatever we
>>> want to call it. A physical page location freed this way can be reused
>>> once the location pointer has been flushed to disk. This is a critical
>>> ordering of writes: first the page at the new location, then the
>>> pointer to the current location. Doing so would make write(2) appear
>>> atomic to us, which is exactly what we need for crash recovery.
>
> Hmm, aka copy-on-write.
>
>> I think you're just moving the atomic-write problem from the data pages
>> to wherever you keep these pointers.
>
> If the pointers are stored as simple 4-byte integers, you probably
> could assume that they're atomic, and won't be torn.
>
> There are a lot of practical problems in adding another level of
> indirection to every page access, though. It'll surely add some
> overhead to every access, even if the data never changes. And it's not
> at all clear to me that it would perform better than full_page_writes.
> You're writing and flushing out roughly the same amount of data AFAICS.
>
> What exactly is the problem with full_page_writes that we're trying to
> solve?
>
> - Heikki
>
>
Would a 4-byte pointer be adequate on a 64-bit machine with well over
4GB used by Postgres?

Cheers,
Gavin


From: Greg Stark <stark(at)mit(dot)edu>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jan Wieck <JanWieck(at)yahoo(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-28 13:43:07
Message-ID: CAM-w4HNsSVX3qAWqDTPK0iJmpgx3FPwSXcphDgzdH0=rj1ngRg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 27, 2012 at 8:41 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> If the pointers are stored as simple 4-byte integers, you probably could
> assume that they're atomic, and won't be torn.

Actually I think any fixed-size data structure would be fine. We're
worried about storage on disk here, not in memory. As long as the
pointers don't move around on the page, then if part of a page is
written and part not, the only pointer possibly corrupted would be the
one spanning the old and new partial writes. That could be avoided or
detected easily.
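
One way the "avoided easily" case could work: make the entry size divide
the smallest write unit, so no entry ever straddles the torn boundary.
A sketch, assuming 512-byte sectors:

#include <assert.h>
#include <stdint.h>

#define SECTOR_SZ 512           /* assumed smallest atomic write unit */

typedef struct MapEntry
{
    uint32_t    physical_pageno;
} MapEntry;

int
main(void)
{
    /*
     * If the entry size divides the sector size, every entry lies
     * entirely within one sector, so a torn page can leave entries
     * stale but never half-written.
     */
    assert(SECTOR_SZ % sizeof(MapEntry) == 0);
    return 0;
}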

The problem with torn pages on heap pages comes about only because
tuples are variable-sized. As a result the page has line pointers that
need to be in sync with other parts of the page, and vacuum needs to be
able to move tuples around on the page.

(Hm. I wonder if we could actually get away with only doing full page
writes on vacuum if we could detect torn pages from vacuum reliably
and avoid trying to read any data from them until they get restored by
the full page write.)

--
greg


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-28 22:35:19
Message-ID: 508DB327.6000805@Yahoo.com
Lists: pgsql-hackers

On 10/27/2012 2:41 PM, Heikki Linnakangas wrote:
> And it's not at all
> clear to me that it would perform better than full_page_writes. You're
> writing and flushing out roughly the same amount of data AFAICS.

I think this assumption is wrong. full_page_writes=on means we write the
full page content to WAL on the first change after a checkpoint. A
change after a checkpoint logically means that the same page is dirty
now and must also be written out no later than during the next
checkpoint, which means a minimum of 16K written for every page changed
after a checkpoint.

> What exactly is the problem with full_page_writes that we're trying to
> solve?

Full page writes are meant to guard against torn pages. That is the case
when an 8K page is written by the underlying OS/filesystem/HW in smaller
chunks (for example, 512-byte sectors) and, in the case of a crash, some
of these chunks make it to disk while others don't. Without
full_page_writes, crash recovery can work if all 8K made it, or if
nothing made it (aka atomic writes), but it will fail otherwise.

The amount of WAL generated with full_page_writes=on is quite
substantial. For pgbench, for example, the ratio is 20:1, meaning that
with full_page_writes you write 20x the amount you do without.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-29 02:50:27
Message-ID: CAEYLb_Wpvsb3+MQP7WR3fGPrFDWjkrewQb69mbtG+k4fTAejig@mail.gmail.com
Lists: pgsql-hackers

On 28 October 2012 22:35, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> The amount of WAL generated with full_page_writes=on is quite
> substantial. For pgbench, for example, the ratio is 20:1, meaning that
> with full_page_writes you write 20x the amount you do without.

Sure, but as always, pgbench pessimises everything by having writes
follow a uniform distribution, which is completely unrepresentative of
the natural world. This will literally maximise the number of pristine
full page images that need to be included. The actual frequency of
checkpoints is a crucial factor here too, and you didn't mention
anything about that.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-29 03:39:11
Message-ID: 508DFA5F.2000702@Yahoo.com
Lists: pgsql-hackers

On 10/28/2012 10:50 PM, Peter Geoghegan wrote:
> On 28 October 2012 22:35, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>> The amount of WAL generated with full_page_writes=on is quite
>> substantial. For pgbench, for example, the ratio is 20:1, meaning that
>> with full_page_writes you write 20x the amount you do without.
>
> Sure, but as always, pgbench pessimises everything by having writes
> follow a uniform distribution, which is completely unrepresentative of
> the natural world. This will literally maximise the number of pristine
> full page images that need to be included. The actual frequency of
> checkpoints is a crucial factor here too, and you didn't mention
> anything about that.

Higher buffer cache hit rates certainly reduce that ratio.

Well, I guess it was just one of those random thoughts that can't work
in the end or aren't worth the work anyway.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-29 11:05:39
Message-ID: CA+TgmobE70334xF2-xdPQz-xfc_WN9FQYATsD7QJ=8SAaQNhNw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 27, 2012 at 1:01 AM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> The reason why we need full_page_writes is that we need to guard against
> torn pages or partial writes. So what if smgr were to manage a mapping
> between logical page numbers and their physical location in the relation?

This sounds a lot like http://en.wikipedia.org/wiki/Shadow_paging

According to my copy of Gray and Reuter, shadow paging is in fact a
workable way of providing atomicity and durability, but as of its
writing (1992) shadow paging had been essentially abandoned because it
didn't have very good performance characteristics. One of the big
problems is that you lose locality of reference - e.g. there's nothing
at all sequential about a sequential scan if, below the mapping layer,
the blocks are scattered about the disk, which is a likely outcome if
they are frequently updated, or, in the long run, even if they are only
occasionally updated.

It's occurred to me before that this might work if we did it not at the
block level but at some higher level, with say 64MB segments. That
wouldn't impinge too much on sequential access, but it would allow
vacuum to clip out an entire 64MB segment anywhere in the relation if it
happened to be empty, or perhaps to rewrite a 64MB segment of a relation
without rewriting the whole thing. But it wouldn't do anything about
torn pages.
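
A minimal sketch of what that segment-level indirection could look like
(the flat in-memory table and the names are assumptions, purely
illustrative):

#include <stdint.h>

#define PAGE_SZ   8192
#define SEG_PAGES ((64 * 1024 * 1024) / PAGE_SZ)    /* 8192 pages per 64MB */

typedef struct SegmentMap
{
    uint32_t   *phys_seg;   /* logical segment number -> physical segment */
    uint32_t    nsegs;
} SegmentMap;

/*
 * Translate a logical block number.  Within a segment the layout is
 * untouched, so sequential scans stay sequential except at the (rare)
 * 64MB boundaries.
 */
static uint32_t
seg_translate(const SegmentMap *map, uint32_t logical_block)
{
    uint32_t    seg = logical_block / SEG_PAGES;
    uint32_t    off = logical_block % SEG_PAGES;

    return map->phys_seg[seg] * SEG_PAGES + off;
}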

Another idea for torn-page avoidance, previously proposed by VMware for
inclusion in PostgreSQL and used by MySQL, is that of a double-write
buffer - i.e. instead of including full page images in WAL, write them
to the double-write buffer; if we crash, start by restoring all the
pages from the double-write buffer, then replay WAL. This avoids passing
the full-page images through the WAL stream sent from master to slave,
because the slave can have its own double-write buffer. This would
probably also allow slaves to perform restart-points at arbitrary
locations independent of where the master performs checkpoints. In
the patch as proposed, the double-write buffer was kept very small, in
the hopes of keeping it within the presumed BBWC, so that
very-frequent fsyncs would all reference the same pages and therefore
all be absorbed by the cache. This delivers terrible performance
without a BBWC, though, because the fsyncs are so frequent.
Alternatively, you could imagine a large double-write buffer which
only gets flushed once per checkpoint cycle or so - i.e. basically
what we have now, but just separating the FPW traffic from the main
WAL stream.
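
In outline, a flush cycle with such a buffer might look like this (a
hedged sketch, not the actual VMware patch; names and layout are
invented):

#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192

/*
 * Flush a batch of dirty pages with torn-page protection: all images go
 * to the double-write file first; only once they are durable there are
 * the real data pages overwritten.  Recovery copies every intact image
 * from the double-write file back over its data-file location and then
 * replays WAL as usual - a torn data page is harmless, because a good
 * copy exists in the buffer.
 */
static int
dw_write_batch(int dw_fd, int data_fd,
               const char (*pages)[PAGE_SZ],
               const off_t *offsets, int npages)
{
    int         i;

    for (i = 0; i < npages; i++)
        if (pwrite(dw_fd, pages[i], PAGE_SZ, (off_t) i * PAGE_SZ) != PAGE_SZ)
            return -1;
    if (fsync(dw_fd) != 0)      /* images now survive any crash */
        return -1;

    for (i = 0; i < npages; i++)
        if (pwrite(data_fd, pages[i], PAGE_SZ, offsets[i]) != PAGE_SZ)
            return -1;
    return fsync(data_fd);
}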

Indeed, you could extend that a bit further: why throw out the
double-write buffer just because there's been a checkpoint cycle? In
a workload like pgbench, it seems likely that the same pages will be
written over and over again. You could have a checkpoint whose
purpose is only to minimize the recovery time in cases where no pages
are torn. You could then also have a less frequent "super-checkpoint"
cycle and retain WAL back to the last "super-checkpoint". In the
hopefully-unlikely event that we detect a torn page (through a checksum
failure, presumably) then we hunt backwards through WAL (something our
current infrastructure doesn't really support) and find the last FPI
for that torn page and then begin selective replay from that point,
scanning through all of the WAL since the last super-checkpoint and
replaying all and only records pertaining to that page. But when no
pages are torn then you only need to recover from the last "normal"
checkpoint. I have heard reports (on this mailing list, I think) that
Oracle does something like this, but I haven't tried to verify for
myself whether that is in fact the case.

Yet another idea we've tossed around is to make only vacuum records
include FPWs, and have the more common heap insert/update/delete
operations include enough information that they can still be applied
correctly even if the page has been "torn" by the previous replay of
such a record. This would involve modifying the recovery algorithm so
that, until consistency is reached, we replay all records, regardless
of LSN, which would cost some extra I/O, but maybe not too much to
live with? It would also require that, for example, a heap-insert
record mention the line pointer index used for the insertion;
currently, we count on the previous state of the page to tell us that.
For checkpoint cycles of reasonable length, the cost of storing the
line pointer in every WAL record seems like it'll be less than the
cost of needing to write an FPI for the page once per checkpoint cycle,
but this isn't certain to be the case for all workloads.
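
A sketch of such a self-contained insert record and its replay, on a
deliberately toy page layout (everything here is invented for
illustration; bounds checks omitted):

#include <stdint.h>
#include <string.h>

#define MAX_LP   256            /* toy limit on line pointers per page */
#define DATA_SZ  4096           /* toy tuple space per page */

typedef struct ToyPage
{
    uint16_t    lp_off[MAX_LP]; /* tuple offset per slot, 0 = unused */
    uint16_t    lp_len[MAX_LP];
    char        data[DATA_SZ];
} ToyPage;

typedef struct XlHeapInsert
{
    uint16_t    lp_index;       /* slot the original insert used */
    uint16_t    tuple_off;      /* offset the original insert chose */
    uint16_t    tuple_len;
} XlHeapInsert;

/*
 * Idempotent redo: because the record dictates slot, offset and length,
 * the page ends up in the same state whether it already reflected the
 * insert, reflected nothing, or was torn.  Replaying twice is harmless,
 * so recovery can replay regardless of the page LSN.
 */
static void
heap_insert_redo(ToyPage *page, const XlHeapInsert *rec, const char *tuple)
{
    page->lp_off[rec->lp_index] = rec->tuple_off;
    page->lp_len[rec->lp_index] = rec->tuple_len;
    memcpy(page->data + rec->tuple_off, tuple, rec->tuple_len);
}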

OK, I'll stop babbling now...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-30 08:16:30
Message-ID: 508F8CDE.3050605@bluegap.ch
Lists: pgsql-hackers

On 10/29/2012 12:05 PM, Robert Haas wrote:
> OK, I'll stop babbling now...

Not perceived as babbling here. Thanks for that nice round-up of options
and ideas around the torn page problem.

Regards

Markus Wanner


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-11-06 03:11:56
Message-ID: 20121106031156.GL19099@momjian.us
Lists: pgsql-hackers

On Mon, Oct 29, 2012 at 07:05:39AM -0400, Robert Haas wrote:
> Yet another idea we've tossed around is to make only vacuum records
> include FPWs, and have the more common heap insert/update/delete
> operations include enough information that they can still be applied
> correctly even if the page has been "torn" by the previous replay of
> such a record. This would involve modifying the recovery algorithm so
> that, until consistency is reached, we replay all records, regardless
> of LSN, which would cost some extra I/O, but maybe not too much to
> live with? It would also require that, for example, a heap-insert
> record mention the line pointer index used for the insertion;
> currently, we count on the previous state of the page to tell us that.
> For checkpoint cycles of reasonable length, the cost of storing the
> line pointer in every WAL record seems like it'll be less than the
> cost of needing to write an FPI for the page once per checkpoint cycle,
> but this isn't certain to be the case for all workloads.

This last idea has the most promise for me. Vacuum is far less common
than row modification writes.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +