Quick Links

Re: CompactCheckpointerRequestQueue versus pad bytes

Lists:	pgsql-hackers

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-15 21:36:24
Message-ID:	23130.1342388184@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

CompactCheckpointerRequestQueue supposes that it can use an entry of the
checkpointer request queue directly as a hash table key. This will work
reliably only if there are no pad bytes in the CheckpointerRequest
struct, which means in turn that neither RelFileNodeBackend nor
RelFileNode can contain any pad bytes.

It might have accidentally failed to fail if tested on a compiler that
gives a full 32 bits to enum ForkNumber, but there absolutely would be
padding there if ForkNumber is allocated as short or char.

As best I can tell, a failure from uninitialized padding would not cause
visible misbehavior but only make it not notice that two requests are
identical, so that the queue compaction would not accomplish as much as
it should. Nonetheless, this seems pretty broken.

We could fairly cheaply dodge the problem with padding after ForkNumber
if we were to zero out the whole request array at shmem initialization,
so that any such pad bytes are guaranteed zero. However, padding in
RelFileNodeBackend would be more annoying to deal with, and at least
in the current instantiation of those structs it's probably impossible
anyway. Should we document those structs as required to not contain
any padding, or do what's needful in checkpointer.c to not depend on
there not being padding?

regards, tom lane

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 06:07:05
Message-ID:	005a01cd6319$3a043960$ae0cac20$@kapila@huawei.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
Sent: Monday, July 16, 2012 3:06 AM

> It might have accidentally failed to fail if tested on a compiler that
> gives a full 32 bits to enum ForkNumber, but there absolutely would be
> padding there if ForkNumber is allocated as short or char.

> As best I can tell, a failure from uninitialized padding would not cause
> visible misbehavior but only make it not notice that two requests are
> identical, so that the queue compaction would not accomplish as much as
> it should. Nonetheless, this seems pretty broken.

> We could fairly cheaply dodge the problem with padding after ForkNumber
> if we were to zero out the whole request array at shmem initialization,
> so that any such pad bytes are guaranteed zero. However, padding in
> RelFileNodeBackend would be more annoying to deal with, and at least
> in the current instantiation of those structs it's probably impossible
> anyway. Should we document those structs as required to not contain
> any padding, or do what's needful in checkpointer.c to not depend on
> there not being padding?

If we just document those structs, then how to handle the case where
ForkNumber
is allocated as short or char?

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 15:24:37
Message-ID:	CA+TgmoaJct82yt5HQ2spXb5VKk4QA=0mAyXzttBwAWJUw3i3cA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jul 15, 2012 at 5:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> We could fairly cheaply dodge the problem with padding after ForkNumber
> if we were to zero out the whole request array at shmem initialization,
> so that any such pad bytes are guaranteed zero. However, padding in
> RelFileNodeBackend would be more annoying to deal with, and at least
> in the current instantiation of those structs it's probably impossible
> anyway. Should we document those structs as required to not contain
> any padding, or do what's needful in checkpointer.c to not depend on
> there not being padding?

I would expect that every method we could devise for allocating a
shared memory segment would produce all-zero bytes. There was a time
long ago when the OS would simply hand over previously-freed pages
with their existing contents, but I believe that was recognized as a
security problems more than 20 years ago - maybe 30 - and I can't
believe there is any OS we care about supporting that fails to zero
pages before handing them out. Of course you can't count on malloc()
to return zero'd memory, but that's because the process might get a
page (all zeros) from the OS, allocate it, free it, and then
reallocate it for an unrelated purpose. But we have no method that I
know of for freeing shared memory, and even if we did, the memory used
by the fsync queue is allocated during startup and therefore
presumably prior to any hypothetical ShmemFree operations that might
occur subsequently.

So I'm having a hard time understanding under what imaginable set of
circumstances this might break.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 15:43:04
Message-ID:	9023.1342453384@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Jul 15, 2012 at 5:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> We could fairly cheaply dodge the problem with padding after ForkNumber
>> if we were to zero out the whole request array at shmem initialization,
>> so that any such pad bytes are guaranteed zero. However, padding in
>> RelFileNodeBackend would be more annoying to deal with, and at least
>> in the current instantiation of those structs it's probably impossible
>> anyway. Should we document those structs as required to not contain
>> any padding, or do what's needful in checkpointer.c to not depend on
>> there not being padding?

> I would expect that every method we could devise for allocating a
> shared memory segment would produce all-zero bytes.

Well, it'd likely produce all-something bytes, but I don't believe
shmget is documented to produce zeroes. In any case we are not in
the habit of relying on that and I don't see why we'd do so here rather
than explicitly zeroing the relatively small amount of memory involved.

> So I'm having a hard time understanding under what imaginable set of
> circumstances this might break.

Padding inside RelFileNodeBackend would break it, because
ForwardFsyncRequest copies the rnode as a struct. So that's why I'm
asking whether we want to establish an explicit requirement that that
struct not contain any padding.

It would not be that hard to avoid the problem: we could make
CompactCheckpointerRequestQueue pre-zero a tag variable and then copy
the live fields into it. Unless there is some other place in the system
that depends on RelFileNodeBackend being non-padded, and will break in a
more visible fashion with padding, I'm now inclined to do it like that.

regards, tom lane

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 15:44:29
Message-ID:	500436DD.50904@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 16.07.2012 18:24, Robert Haas wrote:
> On Sun, Jul 15, 2012 at 5:36 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> We could fairly cheaply dodge the problem with padding after ForkNumber
>> if we were to zero out the whole request array at shmem initialization,
>> so that any such pad bytes are guaranteed zero. However, padding in
>> RelFileNodeBackend would be more annoying to deal with, and at least
>> in the current instantiation of those structs it's probably impossible
>> anyway. Should we document those structs as required to not contain
>> any padding, or do what's needful in checkpointer.c to not depend on
>> there not being padding?
>
> I would expect that every method we could devise for allocating a
> shared memory segment would produce all-zero bytes. There was a time
> long ago when the OS would simply hand over previously-freed pages
> with their existing contents, but I believe that was recognized as a
> security problems more than 20 years ago - maybe 30 - and I can't
> believe there is any OS we care about supporting that fails to zero
> pages before handing them out.

I wouldn't rely on that, though. I wouldn't be surprised if there was
some debugging flag or similar that initialized all pages to random
values or 0xdeadbeef or something, before handing them out to the
application. We could easily zero all shared memory on allocation
ourselves, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 15:57:39
Message-ID:	9346.1342454259@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> So I'm having a hard time understanding under what imaginable set of
>> circumstances this might break.

> Padding inside RelFileNodeBackend would break it, because
> ForwardFsyncRequest copies the rnode as a struct. So that's why I'm
> asking whether we want to establish an explicit requirement that that
> struct not contain any padding.

BTW, I'd be a lot happier about assuming that bare RelFileNode contains
no padding, because that's at least got all the fields the same type.
So that brings us back to the question of why this code is supporting
fsync requests for local relations in the first place. Couldn't we have
it ignore those, and then only ship RelFileNode to the checkpointer?

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 16:04:49
Message-ID:	CA+TgmoZCb-5hXNoXk_SUYZN+jFTJ=ySjyfUzq_E9mBPcDFUVng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jul 16, 2012 at 11:57 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> So I'm having a hard time understanding under what imaginable set of
>>> circumstances this might break.
>
>> Padding inside RelFileNodeBackend would break it, because
>> ForwardFsyncRequest copies the rnode as a struct. So that's why I'm
>> asking whether we want to establish an explicit requirement that that
>> struct not contain any padding.
>
> BTW, I'd be a lot happier about assuming that bare RelFileNode contains
> no padding, because that's at least got all the fields the same type.
> So that brings us back to the question of why this code is supporting
> fsync requests for local relations in the first place. Couldn't we have
> it ignore those, and then only ship RelFileNode to the checkpointer?

That's an awfully good point. I think that was just sloppy coding on
my part (cf commit debcec7dc31a992703911a9953e299c8d730c778). +1 for
changing it as you suggest.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 16:16:11
Message-ID:	9744.1342455371@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Jul 16, 2012 at 11:57 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> BTW, I'd be a lot happier about assuming that bare RelFileNode contains
>> no padding, because that's at least got all the fields the same type.
>> So that brings us back to the question of why this code is supporting
>> fsync requests for local relations in the first place. Couldn't we have
>> it ignore those, and then only ship RelFileNode to the checkpointer?

> That's an awfully good point. I think that was just sloppy coding on
> my part (cf commit debcec7dc31a992703911a9953e299c8d730c778). +1 for
> changing it as you suggest.

OK, so I think the current proposal is:

1. Document that RelFileNode must not contain padding.

2. Change the fsync forwarding code to ignore backend-local relations,
and include only RelFileNode not RelFileNodeBackend in requests.

3. Zero the checkpointer requests[] array at shmem init time, so as
to ensure consistency of any pad bytes elsewhere in the request structs.

I will see about making this happen. Since the fsync queue compaction
code got back-patched awhile ago, we need to back-patch the relevant
parts of this too.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 16:16:23
Message-ID:	CA+TgmobBa3cM4eLzVJM+dz88F4x1x5b5ydA9hFOas3uwu6YJJA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jul 16, 2012 at 11:44 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I wouldn't rely on that, though. I wouldn't be surprised if there was some
> debugging flag or similar that initialized all pages to random values or
> 0xdeadbeef or something, before handing them out to the application. We
> could easily zero all shared memory on allocation ourselves, though.

Well, the documentation for mmap (which we're currently using) on Linux says:

MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initial‐
ized to zero. The fd and offset arguments are ignored; however,
some implementations require fd to be -1 if MAP_ANONYMOUS (or
MAP_ANON) is specified, and portable applications should ensure
this. The use of MAP_ANONYMOUS in conjunction with MAP_SHARED
is only supported on Linux since kernel 2.4.

shmget says:

When a new shared memory segment is created, its contents are initial‐
ized to zero values, and its associated data structure, shmid_ds (see
shmctl(2)), is initialized as follows:

And shm_open says:

A new shared memory object initially has zero length — the
size of the object can be set using ftruncate(2). The newly
allocated bytes of a shared memory object are automatically
initialized to 0.

The documentation on MacOS X isn't quite as explicit, but I'd still be
astonished if we found any other behavior. TBH, I'd be kind of
surprised if this is the only place in our code base that relies on
the initial contents of shared memory being all-zeros. If we really
care about that we probably ought to make --enable-cassert write
0xdeadbeef all over the whole shared-memory segment on startup, or
something like that, because otherwise it's only a matter of time
before someone will break it. Personally I'd like to see some
evidence that the problem is more than strictly hypothetical before we
spend time on it, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 16:27:00
Message-ID:	11802.1342456020@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> The documentation on MacOS X isn't quite as explicit, but I'd still be
> astonished if we found any other behavior. TBH, I'd be kind of
> surprised if this is the only place in our code base that relies on
> the initial contents of shared memory being all-zeros.

Maybe so, but if we find any others, I'll be wanting to change them too.
It's bad practice and worse documentation for modules to be silently
assuming that anything has a value they didn't explicitly give it.

A related practice that probably costs us a lot more, in both code space
and time, is that most (all?) places that create Node objects explicitly
initialize every field of the Node struct, even though makeNode() has
a palloc0 underneath it and so setting fields to zero is redundant.
I believe that this is a good practice anyway, for documentation and
code greppability reasons.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 17:39:46
Message-ID:	CA+TgmobYPCLZhg12kYMnUHcmuUjU1WRA7hidxAUqUsdMi3GBgQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jul 16, 2012 at 12:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> The documentation on MacOS X isn't quite as explicit, but I'd still be
>> astonished if we found any other behavior. TBH, I'd be kind of
>> surprised if this is the only place in our code base that relies on
>> the initial contents of shared memory being all-zeros.
>
> Maybe so, but if we find any others, I'll be wanting to change them too.
> It's bad practice and worse documentation for modules to be silently
> assuming that anything has a value they didn't explicitly give it.
>
> A related practice that probably costs us a lot more, in both code space
> and time, is that most (all?) places that create Node objects explicitly
> initialize every field of the Node struct, even though makeNode() has
> a palloc0 underneath it and so setting fields to zero is redundant.
> I believe that this is a good practice anyway, for documentation and
> code greppability reasons.

I don't really agree; I find it nice and clean to assume that
functions that promise to return zero'd memory really do. In my book,
the main reason for keeping things as they are is that it's probably
not costing enough to matter very much. Which is true here, too, so
I'm not going to make a huge stink, but I still think it's a waste of
effort.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 18:47:24
Message-ID:	15273.1342464444@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Jul 16, 2012 at 11:57 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> So that brings us back to the question of why this code is supporting
>>> fsync requests for local relations in the first place. Couldn't we have
>>> it ignore those, and then only ship RelFileNode to the checkpointer?

>> That's an awfully good point. I think that was just sloppy coding on
>> my part (cf commit debcec7dc31a992703911a9953e299c8d730c778). +1 for
>> changing it as you suggest.

> 2. Change the fsync forwarding code to ignore backend-local relations,
> and include only RelFileNode not RelFileNodeBackend in requests.

So I started to do this, and immediately hit a roadblock: although it's
clear that we can discard any attempt to fsync a backend-local relation,
it's not so clear that we don't need to queue UNLINK_RELATION_REQUEST
operations for local relations.

I think that this is in fact okay. The reason for delaying unlink until
after the next checkpoint is that if we did not, and the relfilenode got
re-used for an unrelated relation, and then we crashed and had to replay
WAL, we would replay any WAL referencing the old relation into the
unrelated relation's storage, with potential bad consequences if you're
unlucky. However, no WAL should ever be generated for a backend-local
relation, so there is nothing to guard against and hence no harm in
immediately unlinking backend-local rels when they are deleted. So we
can adjust mdunlink to include SmgrIsTemp() among the reasons to unlink
immediately rather than doing the truncate-and-register_unlink dance.

If anybody sees a hole in this reasoning, speak now ...

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 19:04:47
Message-ID:	CA+Tgmob4oxXE7oY7K-wgGtFzeVzaXW_aZy9QfJ_EGWfu=k=dug@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jul 16, 2012 at 2:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> On Mon, Jul 16, 2012 at 11:57 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> So that brings us back to the question of why this code is supporting
>>>> fsync requests for local relations in the first place. Couldn't we have
>>>> it ignore those, and then only ship RelFileNode to the checkpointer?
>
>>> That's an awfully good point. I think that was just sloppy coding on
>>> my part (cf commit debcec7dc31a992703911a9953e299c8d730c778). +1 for
>>> changing it as you suggest.
>
>> 2. Change the fsync forwarding code to ignore backend-local relations,
>> and include only RelFileNode not RelFileNodeBackend in requests.
>
> So I started to do this, and immediately hit a roadblock: although it's
> clear that we can discard any attempt to fsync a backend-local relation,
> it's not so clear that we don't need to queue UNLINK_RELATION_REQUEST
> operations for local relations.
>
> I think that this is in fact okay. The reason for delaying unlink until
> after the next checkpoint is that if we did not, and the relfilenode got
> re-used for an unrelated relation, and then we crashed and had to replay
> WAL, we would replay any WAL referencing the old relation into the
> unrelated relation's storage, with potential bad consequences if you're
> unlucky. However, no WAL should ever be generated for a backend-local
> relation, so there is nothing to guard against and hence no harm in
> immediately unlinking backend-local rels when they are deleted. So we
> can adjust mdunlink to include SmgrIsTemp() among the reasons to unlink
> immediately rather than doing the truncate-and-register_unlink dance.
>
> If anybody sees a hole in this reasoning, speak now ...

Hmm, yeah, I have a feeling this might be why I didn't do this when I
created RelFileNodeBackend. But I think your reasoning is correct.
Sticking the above text in a comment might not be out of order,
however.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-16 23:17:35
Message-ID:	8687.1342480655@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Jul 16, 2012 at 2:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> So I started to do this, and immediately hit a roadblock: although it's
>> clear that we can discard any attempt to fsync a backend-local relation,
>> it's not so clear that we don't need to queue UNLINK_RELATION_REQUEST
>> operations for local relations.
>>
>> I think that this is in fact okay. The reason for delaying unlink until
>> after the next checkpoint is that if we did not, and the relfilenode got
>> re-used for an unrelated relation, and then we crashed and had to replay
>> WAL, we would replay any WAL referencing the old relation into the
>> unrelated relation's storage, with potential bad consequences if you're
>> unlucky. However, no WAL should ever be generated for a backend-local
>> relation, so there is nothing to guard against and hence no harm in
>> immediately unlinking backend-local rels when they are deleted. So we
>> can adjust mdunlink to include SmgrIsTemp() among the reasons to unlink
>> immediately rather than doing the truncate-and-register_unlink dance.
>>
>> If anybody sees a hole in this reasoning, speak now ...

> Hmm, yeah, I have a feeling this might be why I didn't do this when I
> created RelFileNodeBackend. But I think your reasoning is correct.
> Sticking the above text in a comment might not be out of order,
> however.

Most of this info was already in the comment for mdunlink, so I just
added a bit there.

The attached patch covers everything discussed in this thread, except
for the buggy handling of stats, which I think should be fixed in a
separate patch since it's only relevant to 9.2+.

I had thought that we might get a performance boost here by saving fsync
queue traffic, but I see that md.c was already not calling
register_dirty_segment for temp rels, so there's no joy there. We might
win a bit by not queuing deletion of temp rels, though. There's also
some distributed savings by eliminating one field from the request queue
and hash table, but probably not enough to notice. I think the main
reason to do this is just to tighten up the question of whether pad
bytes matter.

Haven't started on back-patching yet. Some but not all of this will
need to go back as far as we back-patched the queue compaction logic.

regards, tom lane

Attachment	Content-Type	Size
fsync-request-queue-cleanup.patch	text/x-patch	27.3 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-17 00:35:46
Message-ID:	CA+TgmobEGtOGy2efGp26Jc6Uc3k5DoKrO5g0kzqWF=L-vFkb_Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jul 16, 2012 at 7:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The attached patch covers everything discussed in this thread, except
> for the buggy handling of stats, which I think should be fixed in a
> separate patch since it's only relevant to 9.2+.

With respect to this chunk:

+ * We do not need to go through this dance for temp relations, though, because
+ * we never make WAL entries for temp rels, and so a temp rel poses no threat
+ * to the health of a regular rel that has taken over its relfilenode number.

...I would say that a clearer way to put this is that temporary
relations use a different file naming convention than permanent
relations and therefore there can never be any confusion between the
two.

Other than that, looks fine to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-17 01:58:43
Message-ID:	11429.1342490323@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> With respect to this chunk:

> + * We do not need to go through this dance for temp relations, though, because
> + * we never make WAL entries for temp rels, and so a temp rel poses no threat
> + * to the health of a regular rel that has taken over its relfilenode number.

> ...I would say that a clearer way to put this is that temporary
> relations use a different file naming convention than permanent
> relations and therefore there can never be any confusion between the
> two.

Yeah, that's an entirely independent reason why there's probably no
issue in recent releases. The rationale as stated is back-patchable
to earlier releases, though.

BTW, I wonder whether the code that checks for relfilenode conflict
when selecting a pg_class or relfilenode OID tries both file naming
conventions? If not, should we make it do so?

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-17 03:37:59
Message-ID:	CA+TgmoaoVPO9LhJz5k4qFmgZ6TJfSxQroc_Vm_rMdGBHwBheiw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jul 16, 2012 at 9:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> BTW, I wonder whether the code that checks for relfilenode conflict
> when selecting a pg_class or relfilenode OID tries both file naming
> conventions? If not, should we make it do so?

I don't believe it does, nor do I see what we would gain by making it to do so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-17 17:26:43
Message-ID:	13311.1342546003@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Jul 16, 2012 at 9:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> BTW, I wonder whether the code that checks for relfilenode conflict
>> when selecting a pg_class or relfilenode OID tries both file naming
>> conventions? If not, should we make it do so?

> I don't believe it does, nor do I see what we would gain by making it to do so.

What we would gain is ensuring that we aren't using the same relfilenode
for both a regular table and a temp table. Do you really want to assume
that such a conflict is 100% safe? It sounds pretty scary to me, and
even if we were sure the backend would never get confused, what about
client-side code that's looking at relfilenode?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-17 18:01:10
Message-ID:	14060.1342548070@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> I had thought that we might get a performance boost here by saving fsync
> queue traffic, but I see that md.c was already not calling
> register_dirty_segment for temp rels, so there's no joy there.

Actually, wait a second. We were smart enough to not send fsync
requests in the first place for temp rels. But we were not smart enough
to not call ForgetRelationFsyncRequests when deleting a temp rel,
which made for an entirely useless scan through the pending-fsyncs
table. So there could be win there, on top of not forwarding the actual
unlink operation.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: CompactCheckpointerRequestQueue versus pad bytes
Date:	2012-07-18 12:40:57
Message-ID:	CA+TgmoY2S5Ng4SCcg0xWJ2Oy8KDp-0iHrKo7v6X_qbYSg7kv8Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jul 17, 2012 at 1:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Jul 16, 2012 at 9:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> BTW, I wonder whether the code that checks for relfilenode conflict
>>> when selecting a pg_class or relfilenode OID tries both file naming
>>> conventions? If not, should we make it do so?
>
>> I don't believe it does, nor do I see what we would gain by making it to do so.
>
> What we would gain is ensuring that we aren't using the same relfilenode
> for both a regular table and a temp table. Do you really want to assume
> that such a conflict is 100% safe? It sounds pretty scary to me, and
> even if we were sure the backend would never get confused, what about
> client-side code that's looking at relfilenode?

Well, when I was working on that patch, I spent a lot of time trying
to ensure that it was in fact safe. So I hope it is. Also, note that
that could perfectly well happen anyway in any prior release if the
relations happened to live in different tablespaces. Anyone assuming
that <dboid,relfilenode> is unique is kidding themselves, because it
is not guaranteed to be and has never been guaranteed to be. Yes,
there could be client code out there that assumes
<dboid,tsoid,relfilenode> is unique and such code will need adjustment
for 9.1+. But I bet there isn't very much. The thing that broke a
lot of client code in that commit was the replacement of relistemp
with relpersistence; we already decided we didn't care about that (and
it's too late to change it now anyway) so I can't really get excited
about this. I think that code assuming that anything other than a
RelFileNodeBackend is sufficient to uniquely identify a relation is
just buggy, and if there is any, we should fix it. All remaining uses
of RelFileNode rather than RelFileNodeBackend should be cases where we
know that the backend ID has got to be InvalidBackendId.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company