Re: Block write statistics WIP

Lists: pgsql-hackers
From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Block write statistics WIP
Date: 2013-05-19 08:15:09
Message-ID: 51988A0D.5080600@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I have some time now for working on the mystery of why there are no
block write statistics in the database. I hacked out the statistics
collection and reporting side already, rough patch attached if you want
to see the boring parts.

I guessed that there had to be a gotcha behind why this wasn't done
before now, and I've found one so far. All of the read statistics are
collected with a macro that expects to know a Relation number. Callers
to ReadBuffer pass one. On the write side, the two places that
increment the existing write counters (pg_stat_statements, vacuum) are
MarkBufferDirty and MarkBufferDirtyHint. Neither of those is given a
Relation though. Inspecting the Buffer passed can only find the buffer
tag's RelFileNode.

I've thought of two paths to get a block write count out of that so far:

-Provide a function to find the Relation from the RelFileNode. There is
a warning about the perils of assuming you can map that way from a
buftag value in buf_internals.h though:

"Note: the BufferTag data must be sufficient to determine where to write
the block, without reference to pg_class or pg_tablespace entries. It's
possible that the backend flushing the buffer doesn't even believe the
relation is visible yet (its xact may have started before the xact that
created the rel). The storage manager must be able to cope anyway."

-Modify every caller of MarkDirty* to include a relation when that
information is available. There are about 200 callers of those
functions around, so that won't be fun. I noted that many of them are
in the XLog replay functions, which won't have the relation--but those
aren't important to count anyway.

Neither of these options feels very good to me, so selecting between the
two feels like picking the lesser of two evils. I'm fine with chugging
through all of the callers to modify MarkDirty*, but I didn't want to do
that only to have the whole approach rejected as wrong. That's why I
stopped here to get some feedback.

The way that MarkDirty requires this specific underlying storage manager
behavior to work properly strikes me as as a bit of a layering violation
too. I'd like the read and write paths to have a similar API, but here
they don't even operate on the same type of inputs. Addressing that is
probably harder than just throwing a hack on the existing code though.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment Content-Type Size
block-write-stats-v1.patch text/plain 13.8 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block write statistics WIP
Date: 2013-05-20 11:51:08
Message-ID: 519A0E2C.9090809@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 19.05.2013 11:15, Greg Smith wrote:
> I've thought of two paths to get a block write count out of that so far:
>
> -Provide a function to find the Relation from the RelFileNode. There is
> a warning about the perils of assuming you can map that way from a
> buftag value in buf_internals.h though:
>
> "Note: the BufferTag data must be sufficient to determine where to write
> the block, without reference to pg_class or pg_tablespace entries. It's
> possible that the backend flushing the buffer doesn't even believe the
> relation is visible yet (its xact may have started before the xact that
> created the rel). The storage manager must be able to cope anyway."
>
> -Modify every caller of MarkDirty* to include a relation when that
> information is available. There are about 200 callers of those functions
> around, so that won't be fun. I noted that many of them are in the XLog
> replay functions, which won't have the relation--but those aren't
> important to count anyway.
>
> Neither of these options feels very good to me, so selecting between the
> two feels like picking the lesser of two evils. I'm fine with chugging
> through all of the callers to modify MarkDirty*, but I didn't want to do
> that only to have the whole approach rejected as wrong. That's why I
> stopped here to get some feedback.

Adding a Relation argument to all the Mark* calls seems fine to me. I
don't find it ugly at all. The biggest objection would be that if there
are extensions out there that call those functions, they would need to
be changed, but I think that's fine.

> The way that MarkDirty requires this specific underlying storage manager
> behavior to work properly strikes me as as a bit of a layering violation
> too. I'd like the read and write paths to have a similar API, but here
> they don't even operate on the same type of inputs. Addressing that is
> probably harder than just throwing a hack on the existing code though.

To be honest, I don't understand what you mean by that. ?

- Heikki


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block write statistics WIP
Date: 2013-05-23 23:10:44
Message-ID: 519EA1F4.4080809@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
>> The way that MarkDirty requires this specific underlying storage manager
>> behavior to work properly strikes me as as a bit of a layering violation
>> too. I'd like the read and write paths to have a similar API, but here
>> they don't even operate on the same type of inputs. Addressing that is
>> probably harder than just throwing a hack on the existing code though.
>
> To be honest, I don't understand what you mean by that. ?

Let's say you were designing a storage layer API from scratch for what
Postgres does. That might take a relation as its input, like ReadBuffer
does. Hiding the details of how that turns into a physical file
operation would be a useful goal of such a layer. I'd then consider it
a problem if that exposed things like the actual mapping of relations
into files to callers.

What we actually have right now is this MarkDirty function that operates
on BufferTag data, which points directly to the underlying file. I see
that as cutting the storage API in half and calling a function in the
middle of the implementation. It strikes me as kind of weird that the
read side and write side are not more symmetrical.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block write statistics WIP
Date: 2013-05-24 01:33:22
Message-ID: 519EC362.8080008@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 23.05.2013 19:10, Greg Smith wrote:
> On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
>>> The way that MarkDirty requires this specific underlying storage
>>> manager behavior to work properly strikes me as as a bit of a
>>> layering violation too. I'd like the read and write paths to have
>>> a similar API, but here they don't even operate on the same type
>>> of inputs. Addressing that is probably harder than just throwing
>>> a hack on the existing code though.
>>
>> To be honest, I don't understand what you mean by that. ?
>
> Let's say you were designing a storage layer API from scratch for
> what Postgres does. That might take a relation as its input, like
> ReadBuffer does. Hiding the details of how that turns into a physical
> file operation would be a useful goal of such a layer. I'd then
> consider it a problem if that exposed things like the actual mapping
> of relations into files to callers.

Ok, got it.

> What we actually have right now is this MarkDirty function that
> operates on BufferTag data, which points directly to the underlying
> file. I see that as cutting the storage API in half and calling a
> function in the middle of the implementation.

Well, no, the BufferTag struct is internal to the buffer manager
implementation. It's not part of the API; it's an implementation detail
of the buffer manager.

> It strikes me as kind of weird that the read side and write side are
> not more symmetrical.

It might seem weird if you expect the API to be similar to POSIX read()
and write(), where you can read() an arbitrary block at any location,
and write() an arbitrary block at any location. A better comparison
would be e.g open() and close(). open() returns a file descriptor, which
you pass to other functions to operate on the file. When you're done,
you call close(fd). The file descriptor is completely opaque to the user
program, you do all the operations through the functions that take the
fd as argument. Similarly, ReadBuffer() returns a Buffer, which is
completely opaque to the caller, and you do all the operations through
various functions and macros that operate on the Buffer. When you're
done, you release the buffer with ReleaseBuffer().

(sorry for the digression from the original topic, I don't have any
problem with what adding an optional Relation argument to MarkBuffer if
you need that :-) )

- Heikki


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Greg Smith <greg(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block write statistics WIP
Date: 2013-07-01 07:10:39
Message-ID: 51D12B6F.5060306@uptime.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I'm looking into this patch as a reviewer.

(2013/05/24 10:33), Heikki Linnakangas wrote:
> On 23.05.2013 19:10, Greg Smith wrote:
>> On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
>> It strikes me as kind of weird that the read side and write side are
>> not more symmetrical.
>
> It might seem weird if you expect the API to be similar to POSIX read()
> and write(), where you can read() an arbitrary block at any location,
> and write() an arbitrary block at any location. A better comparison
> would be e.g open() and close(). open() returns a file descriptor, which
> you pass to other functions to operate on the file. When you're done,
> you call close(fd). The file descriptor is completely opaque to the user
> program, you do all the operations through the functions that take the
> fd as argument. Similarly, ReadBuffer() returns a Buffer, which is
> completely opaque to the caller, and you do all the operations through
> various functions and macros that operate on the Buffer. When you're
> done, you release the buffer with ReleaseBuffer().
>
> (sorry for the digression from the original topic, I don't have any
> problem with what adding an optional Relation argument to MarkBuffer if
> you need that :-) )

Or should we add some pointer, which is accociated with the Relation,
into BufferDesc? Maybe OID?

I'm thinking of FlushBuffer() too. How can we count physical write
for each relation in FlushBuffer()? Or any other appropriate place?

BTW, Greg's first patch could not be applied to the latest HEAD.
I guess it need to have some fix, I couldn't clafiry it though.

Regards,
--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block write statistics WIP
Date: 2013-07-05 07:51:51
Message-ID: 51D67B17.2080203@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 7/1/13 3:10 AM, Satoshi Nagayasu wrote:
> Or should we add some pointer, which is accociated with the Relation,
> into BufferDesc? Maybe OID?

That is the other option here, I looked at it but didn't like it. The
problem is that at the point when a new page is created, it's not always
clear yet what relation it's going to hold. That means that if the
buffer page itself is where you look up the relation OID, every code
path that manipulates relation IDs will need to worry about maintaining
this information. It's possible to do it that way, but I don't know
that it's any less work than making all the write paths keep track of
it. It would need some extra locking around updating the relation tag
in the buffer pages too, and I'd prefer not to

The other thing that I like about the writing side is that I can
guarantee the code is correct pretty easily. Yes, it's going to take
days worth of modifying writing code. But the compile will force you to
fix all of them, and once they're all updated I'd be surprised if
something unexpected happened.

If instead the data moves onto the buffer page header instead, we could
be chasing bugs similar to the relcache ones forever. Also, as a tie
breaker, buffer pages will get bigger and require more locking this way.
Making writers tell us the relation doesn't need any of that.

> I'm thinking of FlushBuffer() too. How can we count physical write
> for each relation in FlushBuffer()? Or any other appropriate place?

SMgrRelation is available there, so tagging the relation should be easy
in this case.

> BTW, Greg's first patch could not be applied to the latest HEAD.
> I guess it need to have some fix, I couldn't clafiry it though.

Since this is a WIP patch, what I was looking for was general design
approach feedback, with my bit as a simple example. I didn't expect
anyone to spend much time doing a formal review of that quick hack.
You're welcome to try and play with that code to add things, I'm not
very attached to it yet. I've basically gotten what I wanted to here
from this submission; I'm marking it returned with feedback. If anyone
else has comments, I'm still open to discussion here too.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com