Separate BLCKSZ for data and logging

Lists:	pgsql-hackers

From:	Mark Wong <markw(at)osdl(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Separate BLCKSZ for data and logging
Date:	2006-03-16 16:21:32
Message-ID:	200603161619.k2GGJtDZ023827@smtp.osdl.org

Hi all,

I've been wondering if there might be anything to gain by having a
separate block size for logging and data. I thought I might try
defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
myself into.

I wasn't able to find any previous discussion but perhaps 'separate
BLKSZ' were poor parameters to use. Any thoughts?

Thanks,
Mark

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Mark Wong" <markw(at)osdl(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Separate BLCKSZ for data and logging
Date:	2006-03-16 16:30:11
Message-ID:	36e682920603160830i204e82b2o3d61bc26bececb6a@mail.gmail.com

On 3/16/06, Mark Wong <markw(at)osdl(dot)org> wrote:
>
> I've been wondering if there might be anything to gain by having a
> separate block size for logging and data. I thought I might try
> defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> myself into.

If you're going to try it out, here's a starting point based on the block
sizes used by Oracle:

512 bytes on Linux, Solaris, AIX, Windows
1K on HP-UX and Tru64
2K on SCO
4K on MVS

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Mark Wong <markw(at)osdl(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Separate BLCKSZ for data and logging
Date:	2006-03-16 19:37:07
Message-ID:	1142537827.3859.505.camel@localhost.localdomain

On Thu, 2006-03-16 at 08:21 -0800, Mark Wong wrote:

> I've been wondering if there might be anything to gain by having a
> separate block size for logging and data. I thought I might try
> defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> myself into.
>
> I wasn't able to find any previous discussion but perhaps 'separate
> BLKSZ' were poor parameters to use. Any thoughts?

I see your thinking.... presumably a performance tuning thought?

Overall, the two things are fairly separate, apart from the fact that we
do currently log whole data blocks straight to the log. Usually just
one, but possibly two or three. So I have a feeling that things would
become less efficient if you did this, not more.

But it's a good line of thought and I'll have a look at that.

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Mark Wong <markw(at)osdl(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Separate BLCKSZ for data and logging
Date:	2006-03-16 20:21:52
Message-ID:	19810.1142540512@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Overall, the two things are fairly separate, apart from the fact that we
> do currently log whole data blocks straight to the log. Usually just
> one, but possibly two or three. So I have a feeling that things would
> become less efficient if you did this, not more.

> But it's a good line of thought and I'll have a look at that.

I too think reducing the size of WAL blocks might be a win, because
we currently always write whole blocks, and so a series of small
transactions will be rewriting the same 8K block multiple times.
If the filesystem's native block size is less than 8K, matching that
size should theoretically make things faster.

Whether it makes enough difference to be worth the trouble is another
question ...

regards, tom lane
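
To put rough numbers on the rewriting effect described above, here is a minimal sketch. It assumes a simplified model that is not taken from the thread: every commit appends one fixed-size record and forces every WAL block touched by that record to be written out whole, and page headers are ignored. The function name and the 200-byte record size are made up for illustration.

```python
def bytes_written(n_commits, record_size, wal_block_size):
    """Total bytes physically written when each commit flushes whole WAL blocks.

    Simplified model (an assumption, not PostgreSQL's actual behaviour):
    one record per commit, no page headers, one flush per commit.
    """
    total = 0
    pos = 0  # logical end of the WAL, in bytes
    for _ in range(n_commits):
        first_block = pos // wal_block_size        # block where this record starts
        pos += record_size
        last_block = (pos - 1) // wal_block_size   # block holding the record's last byte
        # Every block touched by the record is written in full.
        total += (last_block - first_block + 1) * wal_block_size
    return total


if __name__ == "__main__":
    for block_size in (8192, 512):
        print(block_size, bytes_written(16, 200, block_size))
```

Under this model, 16 commits of 200 bytes each cause 131072 bytes of writes with 8K blocks and 11264 bytes with 512-byte blocks. Whether that difference shows up as a real speedup depends on the filesystem and hardware, which the model does not capture.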

From:	Mark Wong <markw(at)osdl(dot)org>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Separate BLCKSZ for data and logging
Date:	2006-03-16 20:22:58
Message-ID:	200603162021.k2GKLKDZ005213@smtp.osdl.org

On Thu, 16 Mar 2006 19:37:07 +0000
Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Thu, 2006-03-16 at 08:21 -0800, Mark Wong wrote:
>
> > I've been wondering if there might be anything to gain by having a
> > separate block size for logging and data. I thought I might try
> > defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> > myself into.
> >
> > I wasn't able to find any previous discussion but perhaps 'separate
> > BLKSZ' were poor parameters to use. Any thoughts?
>
> I see your thinking.... presumably a performance tuning thought?

Yeah. :)

> Overall, the two things are fairly separate, apart from the fact that we
> do currently log whole data blocks straight to the log. Usually just
> one, but possibly two or three. So I have a feeling that things would
> become less efficient if you did this, not more.

I was hoping that in the case where 2 or more data blocks are written to
the log that they could be written once within a single larger log block.
The log block size must be larger than the data block size, of course.

Thanks,
Mark

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Wong <markw(at)osdl(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Separate BLCKSZ for data and logging
Date:	2006-03-16 20:51:54
Message-ID:	1142542314.3859.534.camel@localhost.localdomain

On Thu, 2006-03-16 at 12:22 -0800, Mark Wong wrote:

> I was hoping that in the case where 2 or more data blocks are written to
> the log that they could be written once within a single larger log block.
> The log block size must be larger than the data block size, of course.

I think Tom's right... the OS blocksize is smaller than BLCKSZ, so
reducing the size might help with a very high transaction load when
commits are required very frequently. At checkpoint it sounds like we
might benefit from a large WAL blocksize because of all the additional
blocks written, but we often write more than one block at a time anyway,
and that still translates to multiple OS blocks whichever way you cut
it, so I'm not convinced yet.

On Thu, 2006-03-16 at 15:21 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Overall, the two things are fairly separate, apart from the fact that we
> > do currently log whole data blocks straight to the log. Usually just
> > one, but possibly two or three. So I have a feeling that things would
> > become less efficient if you did this, not more.
>
> > But it's a good line of thought and I'll have a look at that.
>
> I too think reducing the size of WAL blocks might be a win, because
> we currently always write whole blocks, and so a series of small
> transactions will be rewriting the same 8K block multiple times.
> If the filesystem's native block size is less than 8K, matching that
> size should theoretically make things faster.

Might it be possible to do this: When committing, if the current WAL
page is less than half-full wait for a single spin-lock cycle and then
do the write? (With the spin-lock, I mean on a single CPU we wait zero,
on a multi-CPU we wait a while). This is effectively a modification of
the group commit idea, but not to wait every time - only when it is
write-efficient to do so. (And we'd make that optional, too). We could
then ditch the remnant of the group-commit code.

Best Regards, Simon Riggs
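
The conditional wait proposed above can be sketched as a small decision function. This is only an illustration of the rule as stated in the message (wait only if the current WAL page is less than half full, and never on a single CPU); the function and parameter names are hypothetical and do not correspond to anything in the PostgreSQL source.

```python
def should_delay_flush(bytes_used_in_page, wal_page_size, n_cpus):
    """Decide whether a committing backend should wait briefly before writing.

    Hypothetical helper: returns True only when waiting might let other
    commits share the same page write.
    """
    if n_cpus <= 1:
        # On a single CPU no other backend can make progress while we spin,
        # so the wait is zero.
        return False
    # Wait only when the page is less than half full, i.e. when a later
    # commit is likely to land in the same page.
    return bytes_used_in_page < wal_page_size // 2


if __name__ == "__main__":
    print(should_delay_flush(1000, 8192, 4))   # page mostly empty, multi-CPU
    print(should_delay_flush(5000, 8192, 4))   # page more than half full
    print(should_delay_flush(1000, 8192, 1))   # single CPU
```

How long the wait should be, and whether it should be optional, are left open here just as they are in the proposal.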

From:	Mark Wong <markw(at)osdl(dot)org>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Separate BLCKSZ for data and logging
Date:	2006-03-16 23:29:05
Message-ID:	200603162327.k2GNRRDZ014683@smtp.osdl.org

On Thu, 16 Mar 2006 20:51:54 +0000
Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Thu, 2006-03-16 at 12:22 -0800, Mark Wong wrote:
>
> > I was hoping that in the case where 2 or more data blocks are written to
> > the log that they could be written once within a single larger log block.
> > The log block size must be larger than the data block size, of course.
>
> I think Tom's right... the OS blocksize is smaller than BLCKSZ, so
> reducing the size might help with a very high transaction load when
> commits are required very frequently. At checkpoint it sounds like we
> might benefit from a large WAL blocksize because of all the additional
> blocks written, but we often write more than one block at a time anyway,
> and that still translates to multiple OS blocks whichever way you cut
> it, so I'm not convinced yet.
>
> On Thu, 2006-03-16 at 15:21 -0500, Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > > Overall, the two things are fairly separate, apart from the fact that we
> > > do currently log whole data blocks straight to the log. Usually just
> > > one, but possibly two or three. So I have a feeling that things would
> > > become less efficient if you did this, not more.
> >
> > > But it's a good line of thought and I'll have a look at that.
> >
> > I too think reducing the size of WAL blocks might be a win, because
> > we currently always write whole blocks, and so a series of small
> > transactions will be rewriting the same 8K block multiple times.
> > If the filesystem's native block size is less than 8K, matching that
> > size should theoretically make things faster.
>
> Might it be possible to do this: When committing, if the current WAL
> page is less than half-full wait for a single spin-lock cycle and then
> do the write? (With the spin-lock, I mean on a single CPU we wait zero,
> on a multi-CPU we wait a while). This is effectively a modification of
> the group commit idea, but not to wait every time - only when it is
> write-efficient to do so. (And we'd make that optional, too). We could
> then ditch the remnant of the group-commit code.

Sounds like there is some agreement that this could be an interesting
exercise. I'll see what I can do.

Thanks,
Mark

From:	"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Separate BLCKSZ for data and logging
Date:	2006-03-17 01:51:21
Message-ID:	dvd4ss$82m$1@news.hub.org

"Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote
>
> I think Tom's right... the OS blocksize is smaller than BLCKSZ, so
> reducing the size might help with a very high transaction load when
> commits are required very frequently. At checkpoint it sounds like we
> might benefit from a large WAL blocksize because of all the additional
> blocks written, but we often write more than one block at a time anyway,
> and that still translates to multiple OS blocks whichever way you cut
> it, so I'm not convinced yet.
>

As I have observed in other database systems, they really do something like
this. The sequence of disk write sizes looks something like this:

512
512
2048
4196
32768
512
...

That is, the size of each xlog write is always aligned to the disk sector
size (as required by O_DIRECT), and as much as possible is written at once
(but within an upper bound, like 32768 I guess). As I understand it, this
change would not be too much trouble; a local change in XLogWrite() may be
enough.

Regards,
Qingqing
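
The sizing rule described above (round each write up to the sector size, capped at an upper bound) can be sketched as follows. The 512-byte sector and the 32768-byte cap are the figures mentioned in the message; the function name and defaults are assumptions for illustration, and this is not how XLogWrite() is actually written.

```python
SECTOR_SIZE = 512    # assumed disk sector size
MAX_WRITE = 32768    # assumed upper bound; must be a multiple of SECTOR_SIZE


def next_write_size(pending_bytes, sector_size=SECTOR_SIZE, max_write=MAX_WRITE):
    """Size of the next log write for `pending_bytes` of unwritten WAL.

    The result is a whole number of sectors and never exceeds max_write.
    """
    if pending_bytes <= 0:
        return 0
    # Round up to the next multiple of sector_size using ceiling division.
    aligned = -(-pending_bytes // sector_size) * sector_size
    return min(aligned, max_write)


if __name__ == "__main__":
    for pending in (100, 2000, 4000, 100000):
        print(pending, next_write_size(pending))
```

With these defaults, 100 pending bytes give a 512-byte write, 2000 give 2048, 4000 give 4096, and 100000 is capped at 32768, so the remainder would go out in later writes.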