Re: Is anybody actually using XLR_BKP_REMOVABLE?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Is anybody actually using XLR_BKP_REMOVABLE?
Date: 2011-12-13 00:27:39
Message-ID: 14874.1323736059@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> I'll volunteer. Assume you can reuse the flag and I will patch afterwards.

> Thanks for the offer, but after thinking about it a bit more I realized
> that this change is quite trivial, so I just went ahead and did it along
> with the change in XLR_MAX_BKP_BLOCKS. This seems better since both
> related changes are in one commit, and we can't forget to do it.

BTW, just for the archives' sake: some digging in the git history showed
that my memory was faulty about the pre-2007 limit of XLR_MAX_BKP_BLOCKS
having been 4. It was originally 2, and then in 2003 we increased it to
3 (cf. commit 799bc58dc7ed9899facfc8302040749cb0a9af2f). So at the time
the last xl_info bit got taken over for XLR_BKP_REMOVABLE, it had in
fact been unused, and that probably explains why we didn't think harder
about whether there would be a less expensive way to do it.
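For readers less familiar with the bit budget involved, here is a minimal sketch. It assumes the backup-block flags occupy the low four bits of xl_info, one bit per block counting down from 0x08, and that XLR_BKP_REMOVABLE sat on 0x01. The SKETCH_ names and this exact bit assignment are assumptions for illustration, not copied from the PostgreSQL headers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of how the low four bits of a WAL record's
 * xl_info byte can be shared between backup-block flags and another
 * flag. SKETCH_ names are illustrative, not PostgreSQL's macros.
 */

/* Before: three backup-block flags plus one bit for the removable flag. */
#define SKETCH_OLD_MAX_BKP_BLOCKS 3
#define SKETCH_OLD_BKP_REMOVABLE  0x01

/* After: all four low bits are backup-block flags. */
#define SKETCH_NEW_MAX_BKP_BLOCKS 4

/* Flag bit for backup block i: block 0 -> 0x08, block 1 -> 0x04, ... */
static uint8_t sketch_bkp_block_bit(int i)
{
    assert(i >= 0 && i < 4);    /* only four bits are available */
    return (uint8_t) (0x08 >> i);
}

/* Union of the flag bits for backup blocks 0 .. nblocks-1. */
static uint8_t sketch_bkp_mask(int nblocks)
{
    uint8_t mask = 0;

    for (int i = 0; i < nblocks; i++)
        mask |= sketch_bkp_block_bit(i);
    return mask;
}
```

With three blocks the mask is 0x0E and leaves 0x01 free for the removable flag; with four it covers the whole nibble (0x0F), which is why that flag has to give up its bit.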

I still think allowing 4 pages per WAL entry is a good thing, though,
and so am not inclined to withdraw the proposal. But perhaps it would
be worth explaining why this is necessary for SP-GiST. The case where
it comes up is trying to split a list of leaf-page tuples when we need
to add another entry to the list but there's no room on the page.
SP-GiST doesn't allow such lists to cross pages (which I think is a
reasonable restriction, both to avoid excess seeks and because the list
links can thereby be 2 bytes not 6). So what it has to do here is
insert an upper-page tuple ("inner tuple" in the patch's jargon) to
describe the set of leaf page tuples that have now been split into two
or more lists. This requires touching:
1. The leaf page currently holding the list to be modified.
2. Another leaf page that has enough free space for the overrun.
3. An inner page (inner pages and leaf pages are disjoint in SP-GiST)
where there's enough room to put the new inner tuple.
4. The inner page holding the inner tuple that is the parent of the
leaf-page list; we have to update its downlink to point to the
new inner tuple instead of the leaf list.
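To put numbers on the earlier point that same-page list links can be 2 bytes rather than 6, here is a minimal sketch. It assumes a same-page link needs only a 16-bit slot offset, while a cross-page link must also carry a 32-bit block number stored as two 16-bit halves. The Sketch* types are hypothetical stand-ins modeled loosely on PostgreSQL's OffsetNumber and ItemPointerData, not the real definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Slot number within a single page (hypothetical stand-in). */
typedef uint16_t SketchOffsetNumber;

/* Block number split into two 16-bit halves to keep 2-byte alignment. */
typedef struct
{
    uint16_t bi_hi;                     /* high half of block number */
    uint16_t bi_lo;                     /* low half of block number */
} SketchBlockId;

/* Full pointer to a tuple anywhere in the index: page plus slot. */
typedef struct
{
    SketchBlockId      block;           /* which page */
    SketchOffsetNumber offset;          /* which slot on that page */
} SketchItemPointer;

/* Leaf-list link when lists may not cross pages: offset only. */
typedef struct
{
    SketchOffsetNumber next;
} SketchSamePageLink;

/* Leaf-list link if lists were allowed to span pages. */
typedef struct
{
    SketchItemPointer next;
} SketchCrossPageLink;

static_assert(sizeof(SketchSamePageLink) == 2, "same-page link is 2 bytes");
static_assert(sizeof(SketchCrossPageLink) == 6, "cross-page link is 6 bytes");
```

Because every member is a uint16_t, the structs have 2-byte alignment and no padding, so the cross-page link stays at 6 bytes; restricting lists to one page saves 4 bytes per leaf tuple.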

The code will try to put the new inner tuple on the same page as the
original parent tuple, but if there's no room there, there's no way to
get around the fact that there are four different pages involved here.
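The page count for this split can be sketched as follows, assuming each distinct page touched needs its own backup-block slot in the WAL record. SketchSplitPages, sketch_distinct_pages and SKETCH_MAX_BKP_BLOCKS are hypothetical names; the four fields simply mirror the numbered list of pages above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SketchBlockNumber;

/* Assumed per-record limit on backed-up pages after the change. */
#define SKETCH_MAX_BKP_BLOCKS 4

/* Pages touched by a leaf-list split, one field per role. */
typedef struct
{
    SketchBlockNumber leaf_page;        /* 1. leaf page holding the list */
    SketchBlockNumber overflow_leaf;    /* 2. leaf page taking the overrun */
    SketchBlockNumber new_inner_page;   /* 3. inner page for the new inner tuple */
    SketchBlockNumber parent_page;      /* 4. inner page holding the parent */
} SketchSplitPages;

/* Count distinct pages, since each one may need its own backup block. */
static int sketch_distinct_pages(const SketchSplitPages *p)
{
    SketchBlockNumber pages[4] = {p->leaf_page, p->overflow_leaf,
                                  p->new_inner_page, p->parent_page};
    int n = 0;

    for (int i = 0; i < 4; i++)
    {
        int seen = 0;

        for (int j = 0; j < i; j++)
            if (pages[j] == pages[i])
                seen = 1;
        if (!seen)
            n++;
    }
    assert(n >= 1 && n <= SKETCH_MAX_BKP_BLOCKS);
    return n;
}
```

If the new inner tuple fits on the parent's page, roles 3 and 4 collapse onto one page and three slots suffice; when it does not, all four pages are distinct.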

If somebody held a gun to my head and said "do it with only three",
what I'd try to do is make the update of the parent tuple into a
separately logged WAL action. However, this is not trivial, or at least
it's not trivial to recover during replay if the database crashes after
logging the first action --- there is not enough information on-disk to
figure out what needs to be done. Also, I believe we've been trying
to get rid of that sort of recovery-time cleanup requirement, because
of hot standby. So on the whole I think extending the XLogInsert
machinery to allow 4 backed-up pages is the best solution.
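A toy model (not PostgreSQL code) makes the crash window concrete. It assumes the split is logged either as one record covering all four pages, or as two records where the second only relinks the parent, and that a crash can truncate the log after any whole record. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal index state relevant to the split (toy model). */
typedef struct
{
    bool new_inner_tuple;   /* new inner tuple exists on its page */
    bool parent_downlink;   /* parent now points at the new inner tuple */
} SketchIndexState;

typedef enum
{
    SKETCH_REC_SPLIT_AND_RELINK,  /* single record: all four pages */
    SKETCH_REC_SPLIT_ONLY,        /* first of two records: pages 1-3 */
    SKETCH_REC_RELINK_PARENT      /* second of two records: page 4 */
} SketchRecord;

/* Apply one WAL record during replay. */
static void sketch_replay(SketchIndexState *s, SketchRecord r)
{
    switch (r)
    {
        case SKETCH_REC_SPLIT_AND_RELINK:
            s->new_inner_tuple = true;
            s->parent_downlink = true;
            break;
        case SKETCH_REC_SPLIT_ONLY:
            s->new_inner_tuple = true;
            break;
        case SKETCH_REC_RELINK_PARENT:
            s->parent_downlink = true;
            break;
    }
}

/* Consistent: the new inner tuple exists iff the parent points to it. */
static bool sketch_consistent(const SketchIndexState *s)
{
    return s->new_inner_tuple == s->parent_downlink;
}

/* Replay only the first n records, as if a crash cut the log there. */
static SketchIndexState sketch_replay_prefix(const SketchRecord *log, int n)
{
    SketchIndexState s = {false, false};

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        sketch_replay(&s, log[i]);
    return s;
}
```

Any truncation of the single-record log yields a consistent state, while stopping after the first of the two records leaves a new inner tuple that the parent does not yet point to, which is the state recovery would have to detect and finish.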

regards, tom lane
