Relation extension scalability

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Relation extension scalability
Date: 2015-03-29 18:56:19
Message-ID: 20150329185619.GA29062@alap3.anarazel.de
Lists: pgsql-hackers

Hello,

Currently bigger shared_buffers settings don't combine well with
relations being extended frequently, especially if many/most pages have
a high usage count and/or are dirty and the system is IO constrained.

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is called to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized
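
In code, that path currently looks very roughly like this (a heavily
simplified sketch along the lines of hio.c's RelationGetBufferForTuple(),
not the verbatim source):

    LockRelationForExtension(relation, ExclusiveLock);    /* step 1 */
    buffer = ReadBufferBI(relation, P_NEW, bistate);      /* steps 2-6 */
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    UnlockRelationForExtension(relation, ExclusiveLock);
    page = BufferGetPage(buffer);
    PageInit(page, BufferGetPageSize(buffer), 0);         /* step 7 */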

The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits 5) can take a couple of seconds.

I've prototyped solving this for heap relations by moving the smgrnblocks()
+ smgrextend() calls to RelationGetBufferForTuple(). With some care
(including a retry loop) it's possible to only do those two under the
extension lock. That indeed fixes problems in some of my tests.
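
As a rough sketch (simplified; the real prototype has more moving parts),
only the size lookup and the extension itself stay under the lock, with
zeroed_page standing in for an all-zeroes page:

    LockRelationForExtension(relation, ExclusiveLock);
    blockno = smgrnblocks(relation->rd_smgr, MAIN_FORKNUM);
    smgrextend(relation->rd_smgr, MAIN_FORKNUM, blockno,
               (char *) zeroed_page, false);
    UnlockRelationForExtension(relation, ExclusiveLock);

    /* victim buffer search, page init etc. happen outside the lock,
     * retrying if somebody else used the new page first */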

I'm not sure whether the above is the best solution however. For one I
think it's not necessarily a good idea to opencode this in hio.c - I've
not observed it, but this probably can happen for btrees and such as
well. For another, this is still an exclusive lock while we're doing IO:
smgrextend() wants a page to write out, so we have to be careful not to
overwrite things.

There's two things that seem to make sense to me:

First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
and introduce a bufmgr function specifically for extension.
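
Such a function could have roughly this shape (illustrative only; the
attached POC calls it ExtendRelation()):

    Buffer ExtendRelation(Relation relation, ForkNumber forkNum,
                          BufferAccessStrategy strategy);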

Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read
in & locked the page exclusively there's no need for the extension
lock. We probably still should write out the new page to the OS
immediately once we've initialized it, to avoid creating sparse files.

The other reason we need the extension lock is code like
lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
pages that are about to be initialized by the extending backend. I think
we should just remove that code and deal with the problem by retrying in
the extending backend; that's why I think moving extension to a
different file might be helpful.

I've attached my POC for heap extension, but it's really just a POC at
this point.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-WIP-Saner-heap-extension.patch text/x-patch 5.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-03-29 19:21:44
Message-ID: 27689.1427656904@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is called to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
> block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized

> The problems come from 4) and 5) potentially each taking a fair
> while.

Right, so basically we want to get those steps out of the exclusive lock
scope.

> There's two things that seem to make sense to me:

> First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> and introduce a bufmgr function specifically for extension.

I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that.
On the other hand, perhaps all that code would have to be touched
anyway to modify the scope over which the extension lock is held.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and initialize
> it before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend().

I'm afraid this would break stuff rather thoroughly, in particular
handling of out-of-disk-space cases. And I really don't see how you get
consistent behavior at all for multiple concurrent callers if there's no
locking.

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.
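
(Something of roughly this shape, the name invented for illustration:

    /* append a zero-filled block on disk, return its block number */
    BlockNumber smgraddblock(SMgrRelation reln, ForkNumber forknum);
)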

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-03-29 19:52:47
Message-ID: 20150329195247.GA4878@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
> > There's two things that seem to make sense to me:
>
> > First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> > and introduce a bufmgr function specifically for extension.
>
> I think that removing P_NEW is likely to require a fair amount of
> refactoring of calling code, so I'm not thrilled with doing that.
> On the other hand, perhaps all that code would have to be touched
> anyway to modify the scope over which the extension lock is held.

It's not *that* many locations that need to extend relations. In my
playing around it seemed to me they all would need to be modified
anyway if we want to remove/reduce the scope of extension locks, to deal
with the fact that somebody else could have started to use the buffer.

> > Secondly I think we could maybe remove the requirement of needing an
> > extension lock altogether. It's primarily required because we're
> > worried that somebody else can come along, read the page, and initialize
> > it before us. ISTM that could be resolved by *not* writing any data via
> > smgrextend()/mdextend().
>
> I'm afraid this would break stuff rather thoroughly, in particular
> handling of out-of-disk-space cases.

Hm. Not a bad point.

> And I really don't see how you get
> consistent behavior at all for multiple concurrent callers if there's no
> locking.

What I was thinking is something like this:

    while (true)
    {
        targetBuffer = AcquireFromFSMEquivalent();

        if (targetBuffer == InvalidBuffer)
            targetBuffer = ExtendRelation();

        LockBuffer(targetBuffer, BUFFER_LOCK_EXCLUSIVE);

        if (BufferHasEnoughSpace(targetBuffer))
            break;

        LockBuffer(targetBuffer, BUFFER_LOCK_UNLOCK);
    }

where ExtendRelation() would basically work like
    while (true)
    {
        targetBlock = (lseek(fd, BLCKSZ, SEEK_END) - BLCKSZ) / BLCKSZ;
        buffer = ReadBuffer(rel, targetBlock);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buffer);
        if (PageIsNew(page))
        {
            PageInit(page);
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            FlushBuffer(buffer);
            break;
        }
        else
        {
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            continue;
        }
    }

Obviously it's a tad more complex than that pseudocode, but I think that
basically should work. Except that it, as you say, can lead to some
oddities with out of space handling. I think it should actually be ok,
it might just be confusing to the user.

I think we might be able to address those issues by not using
lseek(SEEK_END) but instead something like

    /* sketch: temporarily enable append mode, write a pre-initialized block */
    flags = fcntl(fd, F_GETFL);
    fcntl(fd, F_SETFL, flags | O_APPEND);
    write(fd, pre_init_block, BLCKSZ);
    fcntl(fd, F_SETFL, flags & ~O_APPEND);
    newblock = (lseek(fd, 0, SEEK_CUR) - BLCKSZ) / BLCKSZ;

By using O_APPEND and a pre-initialized block we can be sure to write a
valid block at the end, and we shouldn't run afoul of any out-of-space
issues that we don't already have.

Unfortunately I'm not sure whether fcntl for O_APPEND is portable :(

> One idea that might help is to change smgrextend's API so that it doesn't
> need a buffer to write from, but just has an API of "add a prezeroed block
> on-disk and tell me the number of the block you added". On the other
> hand, that would then require reading in the block after allocating a
> buffer to hold it (I don't think you can safely assume otherwise) so the
> added read step might eat any savings.

Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-03-29 20:07:49
Message-ID: 28823.1427659669@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
>> One idea that might help is to change smgrextend's API so that it doesn't
>> need a buffer to write from, but just has an API of "add a prezeroed block
>> on-disk and tell me the number of the block you added". On the other
>> hand, that would then require reading in the block after allocating a
>> buffer to hold it (I don't think you can safely assume otherwise) so the
>> added read step might eat any savings.

> Yea, I was thinking that as well. We simply could skip the reading step
> by setting up the contents in the buffer manager without a read in this
> case...

No, you can't, at least not if the point is to not be holding any
exclusive lock by the time you go to talk to the buffer manager. There
will be nothing stopping some other backend from writing into that page of
the file before you can get hold of it. If the buffer they used to do the
write has itself gotten recycled, there is nothing left at all to tell you
your page image is out of date.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-03-29 20:13:08
Message-ID: 20150329201308.GB4878@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-03-29 16:07:49 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
> >> One idea that might help is to change smgrextend's API so that it doesn't
> >> need a buffer to write from, but just has an API of "add a prezeroed block
> >> on-disk and tell me the number of the block you added". On the other
> >> hand, that would then require reading in the block after allocating a
> >> buffer to hold it (I don't think you can safely assume otherwise) so the
> >> added read step might eat any savings.
>
> > Yea, I was thinking that as well. We simply could skip the reading step
> > by setting up the contents in the buffer manager without a read in this
> > case...
>
> No, you can't, at least not if the point is to not be holding any
> exclusive lock by the time you go to talk to the buffer manager. There
> will be nothing stopping some other backend from writing into that page of
> the file before you can get hold of it. If the buffer they used to do the
> write has itself gotten recycled, there is nothing left at all to tell you
> your page image is out of date.

That's why I'd proposed restructuring things so that the actual
extension/write to the file only happens once we have the buffer manager's
exclusive lock on the individual buffer. While not trivial to implement, it
doesn't look prohibitively complex.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 00:02:06
Message-ID: CA+Tgmoa8nddwpMA4Emn7sVoNrQ883mn8ZJBiqXa8dm3puKn=qQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is called to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
> block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized
>
> The problems come from 4) and 5) potentially each taking a fair
> while. If the working set mostly fits into shared_buffers 4) can
> require iterating over all shared buffers several times to find a
> victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
> dirty limits 5) can take a couple of seconds.

Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and initialize
> it before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend(). If we instead only do the write once we've read
> in & locked the page exclusively there's no need for the extension
> lock. We probably still should write out the new page to the OS
> immediately once we've initialized it, to avoid creating sparse files.
>
> The other reason we need the extension lock is code like
> lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
> pages that are about to be initialized by the extending backend. I think
> we should just remove that code and deal with the problem by retrying in
> the extending backend; that's why I think moving extension to a
> different file might be helpful.

I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block. Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to
checkpoint any more. If we discover the problem while the block is
still all-zeroes, the transaction that uncovers the problem errors
out, but the system as a whole is still OK.

Or at least, so I think. Maybe I'm misunderstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 00:47:09
Message-ID: 20150330004709.GC4878@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-03-29 20:02:06 -0400, Robert Haas wrote:
> On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > As a quick recap, relation extension basically works like:
> > 1) We lock the relation for extension
> > 2) ReadBuffer*(P_NEW) is called to extend the relation
> > 3) smgrnblocks() is used to find the new target block
> > 4) We search for a victim buffer (via BufferAlloc()) to put the new
> > block into
> > 5) If dirty the victim buffer is cleaned
> > 6) The relation is extended using smgrextend()
> > 7) The page is initialized
> >
> > The problems come from 4) and 5) potentially each taking a fair
> > while. If the working set mostly fits into shared_buffers 4) can
> > require iterating over all shared buffers several times to find a
> > victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
> > dirty limits 5) can take a couple of seconds.
>
> Interesting. I had always assumed the bottleneck was waiting for the
> filesystem to extend the relation.

That might be the case sometimes, but it's not what I've actually
observed so far. I think most modern filesystems, by doing preallocation,
have resolved this to some degree.

> > Secondly I think we could maybe remove the requirement of needing an
> > extension lock altogether. It's primarily required because we're
> > worried that somebody else can come along, read the page, and initialize
> > it before us. ISTM that could be resolved by *not* writing any data via
> > smgrextend()/mdextend(). If we instead only do the write once we've read
> > in & locked the page exclusively there's no need for the extension
> > lock. We probably still should write out the new page to the OS
> > immediately once we've initialized it, to avoid creating sparse files.
> >
> > The other reason we need the extension lock is code like
> > lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
> > pages that are about to be initialized by the extending backend. I think
> > we should just remove that code and deal with the problem by retrying in
> > the extending backend; that's why I think moving extension to a
> > different file might be helpful.
>
> I thought the primary reason we did this is because we wanted to
> write-and-fsync the block so that, if we're out of disk space, any
> attendant failure will happen before we put data into the block.

Well, we only write and register an fsync request; afaics we don't
actually perform the fsync at that point. I don't think having to do the
fsync() necessarily precludes removing the extension lock.

> Once we've initialized the block, a subsequent failure to write or
> fsync it will be hard to recover from;

At the very least the buffer shouldn't become dirty before we've
successfully written it once, right. It seems quite doable to achieve that
without the lock though. We'll have to do the write without going
through the buffer manager, but that seems doable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 01:05:04
Message-ID: 10634.1427677504@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I thought the primary reason we did this is because we wanted to
> write-and-fsync the block so that, if we're out of disk space, any
> attendant failure will happen before we put data into the block. Once
> we've initialized the block, a subsequent failure to write or fsync it
> will be hard to recover from; basically, we won't be able to
> checkpoint any more. If we discover the problem while the block is
> still all-zeroes, the transaction that uncovers the problem errors
> out, but the system as a whole is still OK.

Yeah. As Andres says, the fsync is not an important part of that,
but we do expect that ENOSPC will happen during the initial write()
if it's going to happen.

To some extent that's an obsolete assumption, I'm afraid --- I believe
that some modern filesystems don't necessarily overwrite the previous
version of a block, which would mean that they are capable of failing
with ENOSPC even during a re-write of a previously-written block.
However, the possibility of filesystem misfeasance of that sort doesn't
excuse us from having a clear recovery strategy for failures during
relation extension.

regards, tom lane


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 04:03:57
Message-ID: CAA4eK1J-fWbZZNJKnrB60gh9T_gEjddcPePnb+TiFehoFeuGdg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 30, 2015 at 12:26 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
>
> Hello,
>
> Currently bigger shared_buffers settings don't combine well with
> relations being extended frequently, especially if many/most pages have
> a high usage count and/or are dirty and the system is IO constrained.
>
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is called to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
> block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized
>
>
> The problems come from 4) and 5) potentially each taking a fair
> while. If the working set mostly fits into shared_buffers 4) can
> require iterating over all shared buffers several times to find a
> victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
> dirty limits 5) can take a couple of seconds.
>

In the past, I have observed in one of the write-oriented tests that
backends have to flush the pages by themselves many times, so
in the above situation that can lead to a more severe bottleneck.

> I've prototyped solving this for heap relations by moving the smgrnblocks()
> + smgrextend() calls to RelationGetBufferForTuple(). With some care
> (including a retry loop) it's possible to only do those two under the
> extension lock. That indeed fixes problems in some of my tests.
>

So does this mean that the problem is because of contention on the
extension lock?

> I'm not sure whether the above is the best solution however.

Another thing to note here is that during extension we are extending
just one block; won't it make sense to extend by some bigger
number (we could even take input from the user, specifying how much to
autoextend a relation when the relation doesn't have any empty space)?
During mdextend() we might increase by just one block, but we could
register a request for a background process to increase the size,
similar to what is done for fsync.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 10:45:13
Message-ID: 20150330104513.GD4878@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
> In the past, I have observed in one of the write-oriented tests that
> backends have to flush the pages by themselves many times, so
> in the above situation that can lead to a more severe bottleneck.

Yes.

> > I've prototyped solving this for heap relations by moving the smgrnblocks()
> > + smgrextend() calls to RelationGetBufferForTuple(). With some care
> > (including a retry loop) it's possible to only do those two under the
> > extension lock. That indeed fixes problems in some of my tests.
> >
>
> So does this mean that the problem is because of contention on the
> extension lock?

Yes, at least commonly. Obviously the extension lock would be less of a
problem if we were better at having a clean victim buffer ready.

> > I'm not sure whether the above is the best solution however.
>
> Another thing to note here is that during extension we are extending
> just one block; won't it make sense to extend by some bigger
> number (we could even take input from the user, specifying how much to
> autoextend a relation when the relation doesn't have any empty space)?
> During mdextend() we might increase by just one block, but we could
> register a request for a background process to increase the size,
> similar to what is done for fsync.

I think that's pretty much a separate patch, made easier by moving
things out from under the lock maybe. Other than that I'd prefer not to
mix things. There's a whole bunch of unrelated complexity that I don't
want to attach to the topic at the same time (autovacuum truncating
again and so on).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: David Steele <david(at)pgmasters(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 13:29:23
Message-ID: 55194FB3.60903@pgmasters.net
Lists: pgsql-hackers

On 3/30/15 6:45 AM, Andres Freund wrote:
> On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
>> Another thing to note here is that during extension we are extending
>> just one block; won't it make sense to extend by some bigger
>> number (we could even take input from the user, specifying how much to
>> autoextend a relation when the relation doesn't have any empty space)?
>> During mdextend() we might increase by just one block, but we could
>> register a request for a background process to increase the size,
>> similar to what is done for fsync.
>
> I think that's pretty much a separate patch, made easier by moving
> things out from under the lock maybe. Other than that I'd prefer not to
> mix things. There's a whole bunch of unrelated complexity that I don't
> want to attach to the topic at the same time (autovacuum truncating
> again and so on).

Agreed that it makes more sense for this to be in a separate patch, but
I definitely like the idea.

A user-configurable setting would be fine, but better would be to learn
from the current growth rate of the table and extend based on that.

For instance, if a table is very large but is only growing by a few
rows a day, there's probably no need for a large extent. Conversely, an
initially small table growing by 1GB per minute would definitely benefit
from large extents and it would be good to be able to track growth and
compute extent sizes accordingly.

Of course, a manual setting to start with would cover most use cases.
Large tables in a database are generally in the minority and known in
advance.

--
- David Steele
david(at)pgmasters(dot)net


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 15:27:29
Message-ID: 20150330152728.GZ3663@tamriel.snowman.net
Lists: pgsql-hackers

* David Steele (david(at)pgmasters(dot)net) wrote:
> On 3/30/15 6:45 AM, Andres Freund wrote:
> > On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
> >> Another thing to note here is that during extension we are extending
> >> just one block; won't it make sense to extend by some bigger
> >> number (we could even take input from the user, specifying how much to
> >> autoextend a relation when the relation doesn't have any empty space)?
> >> During mdextend() we might increase by just one block, but we could
> >> register a request for a background process to increase the size,
> >> similar to what is done for fsync.
> >
> > I think that's pretty much a separate patch. Made easier by moving
> > things out of under the lock maybe. Other than that I'd prefer not to
> > mix things. There's a whole bunch of unrelated complexity that I don't
> > want to attach to the topic at the same time (autovacuum truncayting
> > again and so on).
>
> Agreed that it makes more sense for this to be in a separate patch, but
> I definitely like the idea.
>
> A user-configurable setting would be fine, but better would be to learn
> from the current growth rate of the table and extend based on that.
>
> For instance, if a table is very large but is only growing by a few
> rows a day, there's probably no need for a large extent. Conversely, an
> initially small table growing by 1GB per minute would definitely benefit
> from large extents and it would be good to be able to track growth and
> compute extent sizes accordingly.
>
> Of course, a manual setting to start with would cover most use cases.
> Large tables in a database are generally in the minority and known in
> advance.

If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need
to adjust how big the extents are any more. The reason for making
bigger extents is because of the locking problem we have now when lots
of backends want to extend a relation, but, if I'm following correctly,
that'd go away with Andres' approach.

We don't have full patches for either of these and so I don't mind
saying that, basically, I'd prefer to see if we still have a big
bottleneck here with lots of backends trying to extend the same relation
before we work on adding this particular feature in as it might end up
being unnecessary. Now, if someone shows up tomorrow with a patch to do
this and Andres' approach ends up not progressing, then we should
certainly consider it (in due time and with consideration to the
activities going on for 9.5, of course).

Thanks!

Stephen


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-31 03:48:06
Message-ID: CAA4eK1KDgU7--k+GxgYGa=r390OpO8Lp+kLj5wMT=FT17icorA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 30, 2015 at 8:57 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>
>
> If we're able to extend based on page-level locks rather than the global
> relation locking that we're doing now, then I'm not sure we really need
> to adjust how big the extents are any more. The reason for making
> bigger extents is because of the locking problem we have now when lots
> of backends want to extend a relation, but, if I'm following correctly,
> that'd go away with Andres' approach.
>

The benefit of extending in bigger chunks in the background is that a
backend would need to perform such an operation less frequently,
which in itself could be a win.

> We don't have full patches for either of these and so I don't mind
> saying that, basically, I'd prefer to see if we still have a big
> bottleneck here with lots of backends trying to extend the same relation
> before we work on adding this particular feature in as it might end up
> being unnecessary.

Agreed, I think it is better to first see the results of the current
patch on which Andres is working, and then if someone is interested
and can show any real benefit with a patch to extend relations
in bigger chunks, that might be worth consideration.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>
Cc: David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-04-02 00:24:05
Message-ID: 551C8C25.1070904@BlueTreble.com
Lists: pgsql-hackers

On 3/30/15 10:48 PM, Amit Kapila wrote:
> > If we're able to extend based on page-level locks rather than the global
> > relation locking that we're doing now, then I'm not sure we really need
> > to adjust how big the extents are any more. The reason for making
> > bigger extents is because of the locking problem we have now when lots
> > of backends want to extend a relation, but, if I'm following correctly,
> > that'd go away with Andres' approach.
> >
>
> The benefit of extending in bigger chunks in the background is that a
> backend would need to perform such an operation less frequently,
> which in itself could be a win.

The other potential advantage (and I have to think this could be a BIG
advantage) is that extending by a large amount makes it more likely you'll
get contiguous blocks on the storage. That's going to make a big
difference for SeqScan speed. It'd be interesting if someone with access
to some real systems could test that. In particular, seqscan of a
possibly fragmented table vs one of the same size but created at once.
For extra credit, compare to dd bs=8192 of a file of the same size as
the overall table.

What I've seen in the real world is very, very poor SeqScan performance
of tables that were relatively large. So bad that I had to SeqScan 8-16
tables in parallel to max out the IO system the same way I could with a
single dd bs=8k of a large file (in this case, something like 480MB/s).
A single SeqScan would only do something like 30MB/s.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>
Cc: David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-04-02 03:11:30
Message-ID: 551CB362.4080405@lab.ntt.co.jp
Lists: pgsql-hackers

On 02-04-2015 AM 09:24, Jim Nasby wrote:
> The other potential advantage (and I have to think this could be a BIG
> advantage) is that extending by a large amount makes it more likely you'll get
> contiguous blocks on the storage. That's going to make a big difference for
> SeqScan speed. It'd be interesting if someone with access to some real systems
> could test that. In particular, seqscan of a possibly fragmented table vs one
> of the same size but created at once. For extra credit, compare to dd bs=8192
> of a file of the same size as the overall table.
>

Orthogonal to the topic of the thread, but this comment made me recall a
proposal from a couple of years ago [0] to add (posix_)fallocate to
mdextend(). I wonder if it helps this case?

Amit

[0]
http://www.postgresql.org/message-id/CADupcHW1POmSuNoNMdVaWLTq-a3X_A3ZQMuSjHs4rCexiPgxAQ@mail.gmail.com


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-04-02 03:14:59
Message-ID: 20150402031459.GT3663@tamriel.snowman.net
Lists: pgsql-hackers

* Amit Langote (Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp) wrote:
> On 02-04-2015 AM 09:24, Jim Nasby wrote:
> > The other potential advantage (and I have to think this could be a BIG
> > advantage) is that extending by a large amount makes it more likely you'll get
> > contiguous blocks on the storage. That's going to make a big difference for
> > SeqScan speed. It'd be interesting if someone with access to some real systems
> > could test that. In particular, seqscan of a possibly fragmented table vs one
> > of the same size but created at once. For extra credit, compare to dd bs=8192
> > of a file of the same size as the overall table.
>
> Orthogonal to the topic of the thread, but this comment made me recall a
> proposal from a couple of years ago [0] to add (posix_)fallocate to
> mdextend(). I wonder if it helps this case?

As I recall, it didn't, and further, modern filesystems are pretty good
about avoiding fragmentation anyway...

I'm not saying Jim's completely off-base with this idea, I'm just not
sure that it'll really buy us much.

Thanks,

Stephen


From: Qingqing Zhou <zhouqq(dot)postgres(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-04-17 18:19:19
Message-ID: CAJjS0u0HgsTzFKEjg6+yOVgQswS5W0pySCa=nRXJcbhXvi13Uw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 29, 2015 at 11:56 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I'm not sure whether the above is the best solution however. For one I
> think it's not necessarily a good idea to opencode this in hio.c - I've
> not observed it, but this probably can happen for btrees and such as
> well. For another, this is still an exclusive lock while we're doing IO:
> smgrextend() wants a page to write out, so we have to be careful not to
> overwrite things.
>
I think relaxing a global lock will mostly fix the contention.
However, several people suggested that extending by many pages has
other benefits. This hints at a more fundamental change in our
storage model. Currently we map one file per relation. While that is
simple and robust, considering partitioned tables, and maybe columnar
storage integrated into the core later, this model needs some further
thought. Think about a table with 1000 partitions and 100 columns: that
is 100K files, not to speak of other forks - surely we can continue
challenging the file system's limits or playing around with vfds, but we
have a chance now to think ahead.

Most commercial databases employ a DMS storage model, where the database
manages object mapping and free space itself, so different objects share
storage within a few files. Surely that has historical reasons, but it
has several advantages over the current model:
- remove fd pressure
- remove double buffering (by introducing ADIO)
- controlled layout and access pattern (sequential and read-ahead)
- better quota management
- potentially better performance

Considering the platforms supported and the stabilization period needed,
we would have to support both the current storage model and the DMS
model. I will stop here to see if this deserves further discussion.

Regards,
Qingqing


From: Qingqing Zhou <zhouqq(dot)postgres(at)gmail(dot)com>
To:
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-04-30 17:50:47
Message-ID: CAJjS0u3oVpuSAT8NviryeRNqA0nj8WEEXfSx=uRPqWF1=ORSVg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 17, 2015 at 11:19 AM, Qingqing Zhou
<zhouqq(dot)postgres(at)gmail(dot)com> wrote:
>
> Most commercial databases employ a DMS storage model, where the database
> manages object mapping and free space itself, so different objects share
> storage within a few files. Surely that has historical reasons, but it
> has several advantages over the current model:
> - remove fd pressure
> - remove double buffering (by introducing ADIO)
> - controlled layout and access pattern (sequential and read-ahead)
> - better quota management
> - potentially better performance
>
> Considering the platforms supported and the stabilization period needed,
> we would have to support both the current storage model and the DMS
> model. I will stop here to see if this deserves further discussion.
>

Sorry, this might be considered double-posting, but I am wondering: have
we ever discussed this before? If we already have some conclusions on
it, could anyone share a link?

Thanks,
Qingqing


From: Andres Freund <andres(at)anarazel(dot)de>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-07-19 13:58:41
Message-ID: 20150719135841.GG25610@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

Every now and then over the last few weeks, I've spent a bit of time
making this more efficient.

I had a bit of a problem reproducing the issues I'd seen in
production on physical hardware (I found EC2 too variable to
benchmark this), but luckily 2ndQuadrant today allowed me access to
their four-socket machine[1] from the AXLE project. Thanks Simon and
Tomas!

First, some mostly juicy numbers:

My benchmark was a parallel COPY into a single WAL-logged target
table:
CREATE TABLE data(data text);
The source data has been generated with
narrow:
COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY;
wide:
COPY (select repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY;

Between every test I ran a TRUNCATE data; CHECKPOINT;

For each number of clients I ran pgbench for 70 seconds. I'd previously
determined using -P 1 that the numbers are fairly stable. Longer runs
would have been nice, but then I'd not have finished in time.

shared_buffers = 48GB, narrow table contents:
client   tps after:    tps before:
1        180.255577    210.125143
2        338.231058    391.875088
4        638.814300    405.243901
8        1126.852233   370.922271
16       1242.363623   498.487008
32       1229.648854   484.477042
48       1223.288397   468.127943
64       1198.007422   438.238119
96       1201.501278   370.556354
128      1198.554929   288.213032
196      1189.603398   193.841993
256      1144.082291   191.293781
512      643.323675    200.782105

shared_buffers = 1GB, narrow table contents:
client   tps after:    tps before:
1        191.137410    210.787214
2        351.293017    384.086634
4        649.800991    420.703149
8        1103.770749   355.947915
16       1287.192256   489.050768
32       1226.329585   464.936427
48       1187.266489   443.386440
64       1182.698974   402.251258
96       1208.315983   331.290851
128      1183.469635   269.250601
196      1202.847382   202.788617
256      1177.924515   190.876852
512      572.457773    192.413191

shared_buffers = 48GB, wide table contents:
client   tps after:    tps before:
1        59.685215     68.445331
2        102.034688    103.210277
4        179.434065    78.982315
8        222.613727    76.195353
16       232.162484    77.520265
32       231.979136    71.654421
48       231.981216    64.730114
64       230.955979    57.444215
96       228.016910    56.324725
128      227.693947    45.701038
196      227.410386    37.138537
256      224.626948    35.265530
512      105.356439    34.397636

shared_buffers = 1GB, wide table contents:
(ran out of patience)

Note that the peak performance with the patch is significantly better,
but there's currently a noticeable regression in single threaded
performance. That undoubtedly needs to be addressed.

So, to get to the actual meat: My goal was to essentially get rid of an
exclusive lock over relation extension altogether. I think I found a
way to do that which addresses the concerns raised in this thread.

The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
might concurrently have been written by another backend and removed
from shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.
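
In rough C-style pseudocode (the helper names here are invented for the
sketch, not actual bufmgr functions):

    for (;;)
    {
        buf = GetCleanPinnedVictimBuffer();                  /* 1) */
        blockno = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);   /* 2) */

        /* 3) + 4): claim the first block not yet in the buffer table */
        while (!TryInsertBufferTableEntry(rel, MAIN_FORKNUM, blockno, buf))
            blockno++;

        /* 5): the block might already exist on disk, written by another
         * backend and already evicted from shared buffers again */
        if (BlockExistsOnDisk(rel, blockno))
        {
            DeleteBufferTableEntry(rel, MAIN_FORKNUM, blockno);
            continue;                                        /* goto 1) */
        }

        ZeroBlockOnDisk(rel, blockno);                       /* 6) */
        break;
    }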

I think this does handle the concurrency issues.

This patch very clearly is in the POC stage. But I do think the approach
is generally sound. I'd like to see some comments before deciding
whether to carry on.

Greetings,

Andres Freund

PS: Yes, I know that this much precision in the benchmark numbers isn't
warranted, but I'm too lazy to truncate them.

[1]
[10:28:11 PM] Tomas Vondra: 4x Intel Xeon E5-4620 Eight Core 2.2GHz
Processor’s generation Sandy Bridge EP
each core handles 2 threads, so 16 threads total
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5” Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5” Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and Cache Vault Kit (1GB cache)
2 x Nvidia Tesla K20 Active GPU Cards (GK110GL)

Attachment Content-Type Size
0001-WIP-Saner-heap-extension.patch text/x-patch 26.2 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-07-19 14:07:46
Message-ID: 20150719140746.GH25610@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

Eeek, the attached patch included a trivial last minute screwup
(dereferencing bistate unconditionally...). Fixed version attached.

Andres

Attachment Content-Type Size
0001-WIP-Saner-heap-extension.patch text/x-patch 26.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-07-19 15:28:25
Message-ID: 13296.1437319705@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> So, to get to the actual meat: My goal was to essentially get rid of an
> exclusive lock over relation extension altogether. I think I found a
> way to do that which addresses the concerns raised in this thread.

> The new algorithm basically is:
> 1) Acquire victim buffer, clean it, and mark it as pinned
> 2) Get the current size of the relation, save it into blockno
> 3) Try to insert an entry into the buffer table for blockno
> 4) If the page is already in the buffer table, increment blockno by 1,
> goto 3)
> 5) Try to read the page. In most cases it'll not yet exist. But the page
> might concurrently have been written by another backend and removed
> from shared buffers already. If already existing, goto 1)
> 6) Zero out the page on disk.

> I think this does handle the concurrency issues.

The need for (5) kind of destroys my faith in this really being safe: it
says there are non-obvious race conditions here.

For instance, what about this scenario:
* Session 1 tries to extend file, allocates buffer for page 42 (so it's
now between steps 4 and 5).
* Session 2 tries to extend file, sees buffer for 42 exists, allocates
buffer for page 43 (so it's also now between 4 and 5).
* Session 2 tries to read page 43, sees it's not there, and writes out
page 43 with zeroes (now it's done).
* Session 1 tries to read page 42, sees it's there and zero-filled
(not because anybody wrote it, but because holes in files read as 0).

At this point session 1 will go and create page 44, won't it, and you
just wasted a page. Now we do have mechanisms for reclaiming such pages
but they may not kick in until VACUUM, so you could end up with a whole
lot of table bloat.

Also, the file is likely to end up badly physically fragmented when the
skipped pages are finally filled in. One of the good things about the
relation extension lock is that the kernel sees the file as being extended
strictly sequentially, which it should handle fairly well as far as
filesystem layout goes. This way might end up creating a mess on-disk.

Perhaps even more to the point, you've added a read() kernel call that was
previously not there at all, without having removed either the lseek() or
the write(). Perhaps that scales better when what you're measuring is
saturation conditions on a many-core machine, but I have a very hard time
believing that it's not a significant net loss under less-contended
conditions.

I'm inclined to think that a better solution in the long run is to keep
the relation extension lock but find a way to extend files more than
one page per lock acquisition.
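
Roughly, and assuming the extra pages get advertised via the FSM
(extend_by and zeroed_page are illustrative):

    LockRelationForExtension(rel, ExclusiveLock);
    first = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
    for (i = 0; i < extend_by; i++)
        smgrextend(rel->rd_smgr, MAIN_FORKNUM, first + i,
                   (char *) zeroed_page, false);
    UnlockRelationForExtension(rel, ExclusiveLock);
    /* record blocks first+1 .. first+extend_by-1 in the free space map */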

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-07-19 15:50:24
Message-ID: 20150719155024.GI25610@awork2.anarazel.de
Lists: pgsql-hackers

On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > So, to get to the actual meat: My goal was to essentially get rid of an
> > exclusive lock over relation extension altogether. I think I found a
> > way to do that which addresses the concerns raised in this thread.
>
> > The new algorithm basically is:
> > 1) Acquire victim buffer, clean it, and mark it as pinned
> > 2) Get the current size of the relation, save it into blockno
> > 3) Try to insert an entry into the buffer table for blockno
> > 4) If the page is already in the buffer table, increment blockno by 1,
> > goto 3)
> > 5) Try to read the page. In most cases it'll not yet exist. But the page
> > might concurrently have been written by another backend and removed
> > from shared buffers already. If already existing, goto 1)
> > 6) Zero out the page on disk.
>
> > I think this does handle the concurrency issues.
>
> The need for (5) kind of destroys my faith in this really being safe: it
> says there are non-obvious race conditions here.

It's not simple, I agree. But I'm doubtful that a significantly simpler
approach exists.

> For instance, what about this scenario:
> * Session 1 tries to extend file, allocates buffer for page 42 (so it's
> now between steps 4 and 5).
> * Session 2 tries to extend file, sees buffer for 42 exists, allocates
> buffer for page 43 (so it's also now between 4 and 5).
> * Session 2 tries to read page 43, sees it's not there, and writes out
> page 43 with zeroes (now it's done).
> * Session 1 tries to read page 42, sees it's there and zero-filled
> (not because anybody wrote it, but because holes in files read as 0).
>
> At this point session 1 will go and create page 44, won't it, and you
> just wasted a page.

My local code now recognizes that case and uses the page. We just need
to do a PageIsNew() check.

> Also, the file is likely to end up badly physically fragmented when the
> skipped pages are finally filled in. One of the good things about the
> relation extension lock is that the kernel sees the file as being extended
> strictly sequentially, which it should handle fairly well as far as
> filesystem layout goes. This way might end up creating a mess on-disk.

I don't think that'll actually happen with any recent
filesystems. Pretty much all of them do delayed allocation. But it
definitely is a concern with older filesystems.

I've just measured, and with ext4 the number of extents per segment in a
300GB relation doesn't show a significant difference between
the existing and the new code.

We could try to address this by optionally using posix_fallocate() to do
the actual extension - then there won't be sparse regions, but actually
allocated disk blocks.
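
A minimal sketch of that, assuming an mdextend()-style context with an
open file descriptor (curnblocks and nblocks are illustrative):

    off_t offset = (off_t) curnblocks * BLCKSZ;
    int   rc = posix_fallocate(fd, offset, (off_t) nblocks * BLCKSZ);

    /* posix_fallocate() returns an errno value directly (it does not
     * set errno); 0 means the range is now backed by real disk blocks */
    if (rc != 0)
        elog(ERROR, "could not extend file: %s", strerror(rc));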

> Perhaps even more to the point, you've added a read() kernel call that was
> previously not there at all, without having removed either the lseek() or
> the write(). Perhaps that scales better when what you're measuring is
> saturation conditions on a many-core machine, but I have a very hard time
> believing that it's not a significant net loss under less-contended
> conditions.

Yes, this has me worried too.

> I'm inclined to think that a better solution in the long run is to keep
> the relation extension lock but find a way to extend files more than
> one page per lock acquisition.

I doubt that'll help as much. As long as you have to search for and write
out buffers under an exclusive lock that'll be painful. You might be
able to make that an infrequent occurrence by extending in larger
amounts and entering the returned pages into the FSM, but you'll have
rather noticeable latency increases every time that happens. And not just
in the extending relation - all the other relations will wait for the
one doing the extending. We could move that into some background
process, but at that point things have gotten seriously complex.

The more radical solution would be to have some place in memory that'd
store the current number of blocks. Then all the extension-specific
locking we'd need would be around incrementing that. But how and where
to store that isn't easy.

Greetings,

Andres Freund


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-07-19 15:56:47
Message-ID: 14957.1437321407@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
>> At this point session 1 will go and create page 44, won't it, and you
>> just wasted a page.

> My local code now recognizes that case and uses the page. We just need
> to do a PageIsNew() check.

Er, what? How can you tell whether an all-zero page was or was not
just written by another session?

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-07-19 16:07:39
Message-ID: 20150719160739.GJ25610@awork2.anarazel.de
Lists: pgsql-hackers

On 2015-07-19 11:56:47 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
> >> At this point session 1 will go and create page 44, won't it, and you
> >> just wasted a page.
>
> > My local code now recognizes that case and uses the page. We just need
> > to do a PageIsNew() check.
>
> Er, what? How can you tell whether an all-zero page was or was not
> just written by another session?

The check is only done while holding the IO lock on the relevant page
(we have to hold that anyway), after reading it in ourselves, just before
setting BM_VALID. As we can only get to that point when there wasn't any
other entry for the page in the buffer table, that guarantees that no
other backend is currently expanding into that page. Others might
wait to read it, but those'll wait behind the IO lock.

The situation the read() protects us against is that two backends try to
extend to the same block, but after one of them succeeded the buffer is
written out and reused for an independent page. So there is no in-memory
state telling the slower backend that that page has already been used.

Andres


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-12-18 05:21:19
Message-ID: CAFiTN-s642YKPv1mTNCDYBtrZh2OPVz-bWM8TyXGKKKEZ7qNPg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 19, 2015 at 9:37 PM, Andres Freund wrote:

> The situation the read() protects us against is that two backends try to
> extend to the same block, but after one of them succeeded the buffer is
> written out and reused for an independent page. So there is no in-memory
> state telling the slower backend that that page has already been used.

I was looking into this patch and have done some performance testing.

So far I have done the testing on my local machine; later I will test on
a big machine once I get access to one.

Just wanted to share the current results from my local machine
(Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 8 cores and 16GB of RAM).

Test Script:
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Summary of the results:
1. When the data fits into shared buffers the improvement is not visible,
but when it doesn't, some improvement is visible on my local machine (it
still does not seem to scale; maybe we will see different behaviour on
the big machine). That's because in the first case there is no need to
flush buffers out.

2. As per Tom's analysis, since we are doing an extra read it will reduce
performance at lower client counts, where the RelationExtensionLock is
not the bottleneck, and the same is visible in the test results.

As discussed earlier, what about keeping the RelationExtensionLock as it
is and just doing the victim buffer search and buffer flushing outside
the lock? That way we can avoid the extra read. Correct me if I have
missed something here.
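
Roughly, the ordering being proposed (GetVictimBuffer() is a hypothetical
stand-in for the existing buffer-replacement code, and zeroed_page is
assumed to be a block of zeroes):

    /* 1. find and, if dirty, flush a victim buffer BEFORE locking */
    victim = GetVictimBuffer(relation);          /* hypothetical */

    /* 2. only the file extension itself runs under the lock */
    LockRelationForExtension(relation, ExclusiveLock);
    blockNum = smgrnblocks(relation->rd_smgr, MAIN_FORKNUM);
    smgrextend(relation->rd_smgr, MAIN_FORKNUM, blockNum,
               (char *) zeroed_page, false);
    UnlockRelationForExtension(relation, ExclusiveLock);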

Shared Buffer 512 MB
-----------------------------
Client    TPS Base    TPS Patch
1         145         126
2         211         246
4         248         302
8         225         234

Shared Buffer 5GB
-----------------------------
Client    TPS Base    TPS Patch
1         165         156
2         237         244
4         294         296
8         253         247

Also observed one problem with the patch:

@@ -433,63 +434,50 @@ RelationGetBufferForTuple(Relation relation, Size len,
+ while (true)
+ {
+ buffer = ExtendRelation(relation, MAIN_FORKNUM, bistate->strategy);

bistate can be NULL in the direct-insert (non-COPY) case, so this
dereference is unsafe.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-12-31 11:22:50
Message-ID: CAFiTN-sn1HLEOGWm5-U9xmzNJDdXpAMHYeWh3i-u-pP5rcFROQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 18, 2015 at 10:51 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

> On Sun, Jul 19, 2015 at 9:37 PM, Andres Freund wrote:
>
> > The situation the read() protects us against is that two backends try
> > to extend to the same block, but after one of them has succeeded the
> > buffer is written out and reused for an independent page. So there is
> > no in-memory state telling the slower backend that that page has
> > already been used.
>
> I was looking into this patch and have done some performance testing.
>
> So far I have done the testing on my local machine; later I will test
> on a big machine once I get access to one.
>
> Just wanted to share the current results from my local machine
> (Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 8 cores and 16GB of RAM).
>
> Test Script:
> ./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
> 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";
>
> ./psql -d postgres -c "truncate table data"
> ./psql -d postgres -c "checkpoint"
> ./pgbench -f copy_script -T 120 -c$ -j$ postgres
>

This time I have done some testing on a big machine with 64 physical
cores @ 2.13GHz and 50GB of RAM.

This is a performance comparison of the base code, the
extend-without-RelationExtensionLock patch given by Andres, and the
multi-extend patch (which extends by multiple blocks at a time, based on
a configuration parameter).

Problem Analysis:
------------------------
1. With the base code, when I tried to observe the problem using perf and
other methods (gdb), I found that the RelationExtensionLock is the main
bottleneck.
2. Then, after applying the RelationExtensionLock-free patch, I observed
that the contention is now on FileWrite (all backends are trying to
extend the file).

Performance Summary and Analysis:
------------------------------------------------
1. In my performance results, multi-extend showed the best performance
and scalability.
2. I think by extending multiple blocks at a time we solve both problems
(the extension lock and parallel file writes).
3. After extending each block we immediately add it to the FSM, so in
most cases other backends can find it directly without taking the
extension lock.

Currently the patch is at an initial stage; I have only tested
performance and run the regression test suite.

Open problems
-----------------------------
1. After extending a page we are adding it directly to the FSM, so if
vacuum finds this page as new it will give a WARNING.
2. In RelationGetBufferForTuple, when PageIsNew, we do PageInit (a sketch
of the heap case follows this list); the same needs to be considered for
the index cases.
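
A minimal sketch of the PageIsNew handling from point 2 (the exact
placement inside RelationGetBufferForTuple is an assumption):

    page = BufferGetPage(buffer);
    if (PageIsNew(page))
        /* an uninitialized page handed out via the FSM - set it up now */
        PageInit(page, BufferGetPageSize(buffer), 0);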

Test Script:
-------------------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Performance Data:
--------------------------
There are three code bases in the comparison:
1. Base code

2. Lock-free patch: the patch given in the thread
http://www.postgresql.org/message-id/20150719140746.GH25610@awork2.anarazel.de

3. Multi-extend patch, attached to this mail.
extend_num_pages: this is a new config parameter telling how many extra
pages to extend during a normal extension. Maybe it will give the user
more control if we make it a relation property.

I will work on the patch for this CF, so I am adding it to the CF.

Shared Buffer 48 GB

Clients   Base (TPS)   Lock Free Patch   Multi-extend (extend_num_pages=5)
1         142          138               148
2         251          253               280
4         237          416               464
8         168          491               575
16        141          448               404
32        122          337               332

Shared Buffer 64 MB

Clients   Base (TPS)   Multi-extend (extend_num_pages=5)
1         140          148
2         252          266
4         229          437
8         153          475
16        132          364

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v1.patch text/x-patch 5.8 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-06 20:06:20
Message-ID: CA+TgmoadnNuONgAk4Gd=ZiCPeoDUTECEL72B_rkXx2kDrEgS2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 31, 2015 at 6:22 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

> [ full quote of the previous message trimmed ]

I'm not really sure what this email is trying to say.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-07 11:18:53
Message-ID: CAA4eK1Jzg5FN_B0AzW-_uBqmzLH68aB29uqME+vs1ZDqEkFxtQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 7, 2016 at 1:36 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Thu, Dec 31, 2015 at 6:22 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
>
>> [ performance data trimmed ]
>
> I'm not really sure what this email is trying to say.

What I could understand from the above e-mail is that Dilip has tried
extending the relation multiple pages at a time and observed that it
gives comparable or better performance than Andres's idea of lock-free
extension, and that it doesn't regress the low-thread-count case.

Now, I think the point to discuss here is that there could be multiple
ways of extending a relation multiple pages at a time, like:

a. Extend the relation page by page and add each page to the FSM without
initializing it. I think this is what Dilip's current patch is doing. If
we want to go this route, then we need to ensure that whenever we get a
page from the FSM, if it is empty and not initialized, we initialize it.
b. Extend the relation page by page, initialize each page, and add it to
the FSM.
c. Extend the relation n pages at a time (in mdextend, have a provision
to do FileWrite for multiples of BLCKSZ). Here again, we need to evaluate
the better way to add the pages to the FSM (after or before page
initialization).
d. Use some form of group extension, which means only one backend will
extend the relation and others will piggyback their extension requests
onto that backend and wait for the extension.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-07 11:23:41
Message-ID: 20160107112341.5md4r2237hvanoxw@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
> What I could understand from the above e-mail is that Dilip has tried
> extending the relation multiple pages at a time and observed that it
> gives comparable or better performance than Andres's idea of lock-free
> extension, and that it doesn't regress the low-thread-count case.

I think it's a worthwhile approach to pursue. But until it actually
fixes the problem of leaving around uninitialized pages I don't think
it's very meaningful to do performance comparisons.

> Now, I think the point to discuss here is that there could be multiple
> ways of extending a relation multiple pages at a time, like:
>
> a. Extend the relation page by page and add each page to the FSM without
> initializing it. I think this is what Dilip's current patch is doing. If
> we want to go this route, then we need to ensure that whenever we get a
> page from the FSM, if it is empty and not initialized, we initialize it.

I think that's pretty much unacceptable, for the non-error path at
least.

One very simple, Linux-specific approach would be to simply do
fallocate(FALLOC_FL_KEEP_SIZE) to extend the file; that way the space is
pre-allocated, but the reported file size is not yet increased.
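
A sketch of that suggestion (the helper and its parameters are
assumptions; fallocate(2) itself is the real Linux system call):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * Reserve disk space for nblocks new blocks at the current end of
     * the file. With FALLOC_FL_KEEP_SIZE the space is allocated (so no
     * ENOSPC later, and no sparse file), but the file size the kernel
     * reports stays unchanged until the pages are actually written.
     */
    static int
    preallocate_blocks(int fd, off_t current_size, int nblocks, size_t blcksz)
    {
        return fallocate(fd, FALLOC_FL_KEEP_SIZE, current_size,
                         (off_t) nblocks * blcksz);
    }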

Greetings,

Andres Freund


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-12 09:11:47
Message-ID: CAFiTN-uY7kF0RC8MR07sbmUbZQ91bHLmjiUc64AOM4G=VJCeLg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
>
> I think it's a worthwhile approach to pursue. But until it actually
> fixes the problem of leaving around uninitialized pages I don't think
> it's very meaningful to do performance comparisons.
>

The attached patch solves this issue: I am allocating a buffer for each
page and initializing the page, and only after that adding it to the FSM.
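
In outline, the loop is something like this (a sketch based on the
description above, not the patch itself; ReadBufferBI() is hio.c's
bulk-insert-aware ReadBuffer wrapper):

    for (i = 0; i < extend_by_blocks; i++)
    {
        Buffer      buffer = ReadBufferBI(relation, P_NEW, bistate);
        Page        page;
        BlockNumber blockNum;
        Size        freespace;

        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buffer);
        PageInit(page, BufferGetPageSize(buffer), 0);
        freespace = PageGetHeapFreeSpace(page);
        MarkBufferDirty(buffer);
        /* take the block number while the pin is still held */
        blockNum = BufferGetBlockNumber(buffer);
        UnlockReleaseBuffer(buffer);
        RecordPageWithFreeSpace(relation, blockNum, freespace);
    }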

> > a. Extend the relation page by page and add each page to the FSM
> > without initializing it. I think this is what Dilip's current patch
> > is doing. [...]
>
> I think that's pretty much unacceptable, for the non-error path at
> least.
>

Performance results:
----------------------------
Test Case:
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinary' WITH BINARY";

echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Test Summary:
--------------------
1. I have measured the performance of base and patch.
2. With the patch there are multiple results, with different values of
"extend_num_pages" (the parameter which says how many extra blocks to
extend).

Test with Data on magnetic Disk and WAL on SSD
--------------------------------------------------------------------
Shared Buffer : 48GB
max_wal_size : 10GB
Storage : Magnetic Disk
WAL : SSD

TPS with different values of extend_num_pages

------------------------------------------------------------

Client Base 10-Page 20-Page 50-Page

1 105 103 157 129
2 217 219 255 288
4 210 421 494 486
8 166 605 702 701
16 145 484 563 686
32 124 477 480 745

Test with Data and WAL on SSD
-----------------------------------------------

Shared Buffer : 48GB
Max Wal Size : 10GB
Storage : SSD

TPS with different values of extend_num_pages

------------------------------------------------------------

Client Base 10-Page 20-Page 50-Page 100-Page

1 152 153 155 147 157
2 281 281 292 275 287
4 236 505 502 508 514
8 171 662 687 767 764
16 145 527 639 826 907

Note: Test with both data and WAL on magnetic disk: no significant
improvement visible
-- I think the WAL write is becoming the bottleneck in this case.

Currently I have kept extend_num_pages as a session-level parameter, but
I think later we can make it a table property.
Any suggestions on this?

Apart from this approach, I also tried extending the file by multiple
blocks in one extend call, but this approach (extending one page at a
time) performs better.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v2.patch text/x-diff 3.8 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-23 06:49:43
Message-ID: CAA4eK1+RCbb9E8UWPjK6ge1tgL8P4jbLe2bg4D8pA-L3p7Cxsw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 12, 2016 at 2:41 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

> On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
>>
>> I think it's a worthwhile approach to pursue. But until it actually
>> fixes the problem of leaving around uninitialized pages I don't think
>> it's very meaningful to do performance comparisons.
>>
>
> Attached patch solves this issue, I am allocating the buffer for each page
> and initializing the page, only after that adding to FSM.
>

A few comments about the patch:

1. The patch does not compile:

1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier
1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier
1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier

2.
! page = BufferGetPage(buffer);
! PageInit(page, BufferGetPageSize(buf), 0);
!
! freespace = PageGetHeapFreeSpace(page);
!
! MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
!
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

What is the need to mark the page dirty here; won't it automatically
be marked dirty once the page is used? I think it is required
if you wish to WAL-log this action.

3. I think you don't need to multi-extend a relation if
HEAP_INSERT_SKIP_FSM is used, as in that case it anyway tries to
get a new page by extending the relation.

4. Again, why do you need this multi-extend optimization for local
relations (those only accessible to the current backend)?

5. Do we need this for nbtree as well? One way to check that
is by copying a large amount of data into a table that has an index.

> Note: Test with both data and WAL on magnetic disk: no significant
> improvement visible
> -- I think the WAL write is becoming the bottleneck in this case.
>
In that case, can you try the same test with unlogged tables?

Also, it is good to check the performance of the patch with a read-write
workload, to ensure that extending the relation in multiple chunks does
not regress such cases.

> Currently I have kept extend_num_pages as a session-level parameter,
> but I think later we can make it a table property.
> Any suggestions on this?
>
I think we should have a new table-level storage parameter,
extend_by_blocks or something like that, instead of a GUC. The
default value of this parameter should be 1, which means retaining the
current behaviour by default.
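
For illustration, a sketch of what the corresponding intRelOpts entry in
src/backend/access/common/reloptions.c might look like (the description
string and the lock level are assumptions):

    {
        {
            "extend_by_blocks",
            "Number of blocks to add in one relation extension",
            RELOPT_KIND_HEAP,
            AccessExclusiveLock     /* lock level: an assumption */
        },
        1,      /* default: retain current behaviour */
        1,      /* min */
        100     /* max */
    },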

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-23 10:58:32
Message-ID: CAA4eK1+0iXR+jG_wQ_zB=5sLqP9JabocQykMUtsmuxRpx-_6Ng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jan 23, 2016 at 12:19 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Tue, Jan 12, 2016 at 2:41 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
>
>> On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>
>>> On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
>>>
>>> I think it's a worthwhile approach to pursue. But until it actually
>>> fixes the problem of leaving around uninitialized pages I don't think
>>> it's very meaningful to do performance comparisons.
>>>
>>
>> Attached patch solves this issue, I am allocating the buffer for each
>> page and initializing the page, only after that adding to FSM.
>>
>
> Few comments about patch:
>
>
I found one more problem with the patch.

! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

You can't call BufferGetBlockNumber(buffer) after releasing
the pin on the buffer, which UnlockReleaseBuffer() does. Get the
block number before unlocking the buffer.
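
That is, roughly (a sketch of the fix; variable names are assumed from
the patch context):

    /* capture the block number while the buffer pin is still held */
    blockNum = BufferGetBlockNumber(buffer);
    UnlockReleaseBuffer(buffer);
    RecordPageWithFreeSpace(relation, blockNum, freespace);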

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-25 06:29:46
Message-ID: CAFiTN-u2dZiXLP9jMEtA+OanwE8VPUPFmcJkk+y4c64Sht_n=A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jan 23, 2016 at 12:19 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> A few comments about the patch:
>
Thanks for reviewing.

> 1. The patch does not compile:
>
> 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
> identifier
>
Oh, my mistake; it compiled for me because BufferGetPageSize() is a macro
that discards its argument in a non-assert build and simply expands to
BLCKSZ, so the bad identifier was never seen.

I will fix it in the next version of the patch.

> 2.
> ! page = BufferGetPage(buffer);
> ! PageInit(page, BufferGetPageSize(buf), 0);
> !
> ! freespace = PageGetHeapFreeSpace(page);
> !
> ! MarkBufferDirty(buffer);
> ! UnlockReleaseBuffer(buffer);
> !
> ! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);
>
> What is the need to mark the page dirty here; won't it automatically
> be marked dirty once the page is used? I think it is required
> if you wish to WAL-log this action.
>

These pages are not going to be used immediately, and we have done
PageInit, so I think the page should be marked dirty before adding it to
the FSM, so that if the buffer gets evicted the initialized contents are
flushed.

> 3. I think you don't need to multi-extend a relation if
> HEAP_INSERT_SKIP_FSM is used, as in that case it anyway tries to
> get a new page by extending the relation.
>

Yes, if HEAP_INSERT_SKIP_FSM is used together with multi-extend then, at
least within the current transaction, we will never take pages from the
FSM and will do a multi-extend every time. The extra pages can still be
used by parallel backends, but it is still not a good idea for the
current backend to extend by multiple chunks every time.

So I will change this.

> 4. Again, why do you need this multi-extend optimization for local
> relations (those only accessible to the current backend)?
>

I think we can handle this while adding the table-level extend_by_blocks
parameter: for local tables we will not allow this property, so there is
no need to change anything at this place.

What do you think?

> 5. Do we need this for nbtree as well? One way to check that
> is by copying a large amount of data into a table that has an index.
>

OK, I will try this test and update.

>> Note: Test with both data and WAL on magnetic disk: no significant
>> improvement visible
>> -- I think the WAL write is becoming the bottleneck in this case.
>>
> In that case, can you try the same test with unlogged tables?
>

OK.

>
> Also, it is good to check the performance of the patch with a read-write
> workload, to ensure that extending the relation in multiple chunks does
> not regress such cases.
>

OK.

>
>> Currently I have kept extend_num_pages as a session-level parameter,
>> but I think later we can make it a table property.
>> Any suggestions on this?
>>
> I think we should have a new table-level storage parameter,
> extend_by_blocks or something like that, instead of a GUC. The
> default value of this parameter should be 1, which means retaining the
> current behaviour by default.
>

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-25 06:30:56
Message-ID: CAFiTN-uUpJpvc=VkioX3ew3-OzCfhSdsB8e0Y94XpOYUy0PBpQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jan 23, 2016 at 4:28 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> I found one more problem with patch.
>
> ! UnlockReleaseBuffer(buffer);
> ! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer),
> freespace);
>
> You can't call BufferGetBlockNumber(buffer) after releasing
> the pin on buffer which will be released by
> UnlockReleaseBuffer(). Get the block number before unlocking
> the buffer.
>

Good catch, will fix this also in next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-28 11:23:08
Message-ID: CAFiTN-sjzasvJZuZRSgpbLtdDZ95W2HD=htyA3us+5=BQMEvnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 25, 2016 at 11:59 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

>> 1. The patch does not compile:
>>
>> 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
>> identifier
>>
> Oh, my mistake; it compiled for me because BufferGetPageSize() is a
> macro that discards its argument in a non-assert build and simply
> expands to BLCKSZ.
>

FIXED

>> 3. I think you don't need to multi-extend a relation if
>> HEAP_INSERT_SKIP_FSM is used, as in that case it anyway tries to
>> get a new page by extending the relation.
>>
> So I will change this.
>

FIXED

>
>> 4. Again, why do you need this multi-extend optimization for local
>> relations (those only accessible to the current backend)?
>>
> I think we can handle this while adding the table-level
> extend_by_blocks parameter: for local tables we will not allow this
> property, so there is no need to change anything at this place.
> What do you think?
>

Now I have added a table-level parameter for specifying the number of
blocks. So do you think we still need to block it for local relations,
given that the user can control it? Moreover, I think if a user wants to
use it for a local table there is no harm in it: by extending in one shot
we at least avoid multiple acquisitions of the extension lock, even
though there will be no contention.

What is your opinion?

>> 5. Do we need this for nbtree as well? One way to check that
>> is by copying a large amount of data into a table that has an index.
>>
> OK, I will try this test and update.
>

I tried loading data into a table with an index and analyzed the
bottleneck using perf; I found that _bt_compare was taking the most time.
Still, I don't deny that this case could also benefit from multi-extend.

So I tried a quick POC for this, but I realized that even though we
extend the index by multiple pages and add them to the FSM, the FSM is
updated only in the current FSM page; information about the new free
pages is propagated towards the root only during vacuum. And unlike heap,
btree always searches the FSM from the root, so it will not find the
extra added pages.

So I think we can analyze index multi-extend separately, and find out
whether there are cases where it can give a benefit.

>>> Note: Test with both data and WAL on magnetic disk: no significant
>>> improvement visible
>>> -- I think the WAL write is becoming the bottleneck in this case.
>>>
>> In that case, can you try the same test with unlogged tables?
>>

Data with unlogged tables:

Test Init:
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./psql -d postgres -c "create unlogged table data (data text)" --> base
./psql -d postgres -c "create unlogged table data (data text)
with(extend_by_blocks=50)" --patch

test_script:
------------
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Shared Buffer: 48GB
Table: unlogged table
./pgbench -c$ -j$ -f copy_script -M Prepared postgres

Clients   Base   Patch
1         178    180
2         337    338
4         265    601
8         167    805

Also, it is good to check the performance of patch with read-write work
>> load to ensure that extending relation in multiple-chunks should not
>> regress such cases.
>>
>
> Ok
>

I did not find any regression in the normal case.
Note: I tested with the previous patch, extend_num_pages=10 (GUC
parameter), so that we could see any impact on the overall system.

>>> Currently I have kept extend_num_pages as a session-level parameter,
>>> but I think later we can make it a table property.
>>> Any suggestions on this?
>>>
>> I think we should have a new table-level storage parameter,
>> extend_by_blocks or something like that, instead of a GUC. The
>> default value of this parameter should be 1, which means retaining the
>> current behaviour by default.
>>
> +1
>

Changed it to a table-level storage parameter. I kept the maximum limit
at 100; any suggestions on this? Should we increase it?

The latest patch is attached.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v3.patch application/x-patch 4.3 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-01-28 11:27:07
Message-ID: CAFiTN-vcRc5uGFvyrhKWxPFYf7j=Zncko98tZrAF2ZhiZeEnMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 28, 2016 at 4:53 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

> I did not find any regression in the normal case.
> Note: I tested with the previous patch, extend_num_pages=10 (GUC
> parameter), so that we could see any impact on the overall system.
>

I just forgot to mention that I ran the pgbench read-write case.

Scale factor: 300

./pgbench -j $ -c $ -T 1800 -M Prepared postgres

Tested with 1, 8, 16, 32 and 64 threads and did not see any regression
with the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-02 15:12:38
Message-ID: CA+TgmobTZSFBZr+RBSUqOSk5mk0JW=6rT8zbKvRyn8gwRzL7Rw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 28, 2016 at 6:23 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> [ new patch ]

This patch contains a useless hunk and also code not in PostgreSQL
style. Get pgindent set up and it will do it correctly for you, or
look at the style of the surrounding code.

What I'm a bit murky about is *why* this should be any better than the
status quo. I mean, the obvious explanation that pops to mind is that
extending the relation by two pages at a time relieves pressure on the
relation extension lock somehow. One other possible explanation is
that calling RecordPageWithFreeSpace() allows multiple backends to get
access to that page at the same time, while otherwise each backend
would try to conduct a separate extension. But in the first case,
what we ought to do is try to make the locking more efficient; and in
the second case, we might want to think about recording the first page
in the free space map too.

Here's a sketch of another approach to this problem. Get rid of the
relation extension lock. Instead, have an array of, say, 256 lwlocks.
Each one protects the extension of relations where hash(relfilenode) %
256 maps to that lock. To extend a relation, grab the corresponding
lwlock, do the work, then release the lwlock. You might occasionally
have a situation where two relations are both being extended very
quickly and happen to map to the same bucket, but that shouldn't be
much of a problem in practice, and the way we're doing it presently is
worse, not better, since two relation extension locks may very easily
map to the same lock manager partition. The only problem with this is
that acquiring an LWLock holds off interrupts, and we don't want
interrupts to be locked out across a potentially lengthy I/O. We
could partially fix that if we call RESUME_INTERRUPTS() after
acquiring the lock and HOLD_INTERRUPTS() just before releasing it, but
there's still the problem that you might block non-interruptibly while
somebody else has the lock. I don't see an easy solution to that
problem right off-hand, but if something like this performs well we
can probably conjure up some solution to that problem.
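
A schematic of the bucketed-lock idea (array allocation, tranche
registration, and the choice of hash function are all assumptions):

    #define N_RELEXTLOCKS 256

    static LWLockPadded *RelExtLockArray;   /* sized during shmem init */

    static void
    ExtendRelationUnderBucketLock(RelFileNode *rnode)
    {
        uint32  bucket = tag_hash(rnode, sizeof(RelFileNode)) % N_RELEXTLOCKS;

        LWLockAcquire(&RelExtLockArray[bucket].lock, LW_EXCLUSIVE);
        /* ... smgrnblocks() + smgrextend() + page init would go here ... */
        LWLockRelease(&RelExtLockArray[bucket].lock);
    }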

I'm not saying that we need to do that exact thing - but I am saying
that I don't think we can proceed with an approach like this without
first understanding why it works and whether there's some other way
that might be better to address the underlying problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-02 15:49:13
Message-ID: 20160202154913.GV8743@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:
> test_script:
> ------------
> ./psql -d postgres -c "truncate table data"
> ./psql -d postgres -c "checkpoint"
> ./pgbench -f copy_script -T 120 -c$ -j$ postgres
>
> Shared Buffer 48GB
> Table: Unlogged Table
> ./pgbench -c$ -j$ -f copy_script -M Prepared postgres
>
> Clients Base patch
> 1 178 180
> 2 337 338
> 4 265 601
> 8 167 805

Could you also measure how this behaves for an INSERT instead of a COPY
workload? Both throughput and latency. It's quite possible that this
causes latency hikes, because suddenly backends will have to wait for
one another to extend by 50 pages. You'll probably have to use -P 1 or
full statement logging to judge that. I think just having a number of
connections inserting relatively wide rows into one table should do the
trick.

I'm doubtful that anything that does the victim buffer search while
holding the extension lock will actually scale in a wide range of
scenarios. The copy scenario here probably isn't too bad because the
copy ring buffers are in use, and because there are no reads increasing
the usagecount of recent buffers; thus victim buffers are easily found.

Thanks,

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-02 16:00:11
Message-ID: 20160202160011.GW8743@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-02-02 10:12:38 -0500, Robert Haas wrote:
> Here's a sketch of another approach to this problem. Get rid of the
> relation extension lock. Instead, have an array of, say, 256 lwlocks.
> Each one protects the extension of relations where hash(relfilenode) %
> 256 maps to that lock. To extend a relation, grab the corresponding
> lwlock, do the work, then release the lwlock. You might occasionally
> have a situation where two relations are both being extended very
> quickly and happen to map to the same bucket, but that shouldn't be
> much of a problem in practice, and the way we're doing it presently is
> worse, not better, since two relation extension locks may very easily
> map to the same lock manager partition.

I guess you suspect that the performance problems come from the
heavyweight lock overhead? That's not what I *think* I've seen in
profiles, but it's hard to conclusively judge that.

I kinda doubt that really solves the problem, profiles aside,
though. The above wouldn't really get rid of the extension locks, it
just changes the implementation a bit. We'd still do victim buffer
search, and filesystem operations, while holding an exclusive
lock. Batching can solve some of that, but I think primarily we need
more granular locking, or get rid of locks entirely.

> The only problem with this is that acquiring an LWLock holds off
> interrupts, and we don't want interrupts to be locked out across a
> potentially lengthy I/O. We could partially fix that if we call
> RESUME_INTERRUPTS() after acquiring the lock and HOLD_INTERRUPTS()
> just before releasing it, but there's still the problem that you might
> block non-interruptibly while somebody else has the lock. I don't see
> an easy solution to that problem right off-hand, but if something like
> this performs well we can probably conjure up some solution to that
> problem.

Hm. I think to get rid of the HOLD_INTERRUPTS() we'd have to record
what lock we were waiting on, and in which mode, before going into
PGSemaphoreLock(). Then LWLockReleaseAll() could "hand off" the wakeup
to the next waiter in the queue. Without that we'd sometimes end up
absorbing a wakeup without then releasing the lock, causing everyone to
block on a released lock.

There's probably two major questions around that: Will it have a
performance impact, and will there be any impact on existing callers?

Regards,

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-02 16:00:33
Message-ID: CA+TgmoaLMigywn=XdvbJCN16GcegvTgDJASgKTRWn9tDnxvXNQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 2, 2016 at 10:49 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios. The copy scenario here probably isn't too bad because the
> copy ring buffers are in use, and because there are no reads increasing
> the usagecount of recent buffers; thus victim buffers are easily found.

I agree that's an avenue we should try to explore. I haven't had any
time to think much about how it should be done, but it seems like it
ought to be possible somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-02 17:06:50
Message-ID: 20160202170650.GA135006@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund wrote:

> Could you also measure how this behaves for [...]

While we're proposing benchmark cases -- I remember this being an issue
with toast tables getting very large values of xml which causes multiple
toast pages to be extended for each new value inserted. If there are
multiple processes inserting these all the time, things get clogged.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-05 11:20:26
Message-ID: CAA4eK1JE+GP_TjX=cr+oNJoEBrDQz98sgiudJspWNLNQs+qHag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 2, 2016 at 9:19 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:
> > test_script:
> > ------------
> > ./psql -d postgres -c "truncate table data"
> > ./psql -d postgres -c "checkpoint"
> > ./pgbench -f copy_script -T 120 -c$ -j$ postgres
> >
> > Shared Buffer 48GB
> > Table: Unlogged Table
> > ./pgbench -c$ -j$ -f copy_script -M Prepared postgres
> >
> > Clients Base patch
> > 1 178 180
> > 2 337 338
> > 4 265 601
> > 8 167 805
>
> Could you also measure how this behaves for an INSERT instead of a COPY
> workload?
>

I think such a test will be useful.

>
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios.
>

I think the problem with the victim buffer could become visible if the
blocks are dirty and it needs to write them out, especially as the patch
is written now, where after acquiring the extension lock it again
extends the relation without checking whether it can get a page with
space from the FSM. It seems to me that we should re-check the
availability of a page, because while one backend is waiting on the
extension lock another backend might have added pages. To re-check the
availability we might want to use something similar to the
LWLockAcquireOrWait() semantics used during WAL writing.
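
A sketch of that recheck (ConditionalLockRelationForExtension() is a
hypothetical helper here, by analogy to LWLockAcquireOrWait(); len and
saveFreeSpace are the values hio.c already computes):

    if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
    {
        /* we are about to sleep; others may extend the rel meanwhile */
        LockRelationForExtension(relation, ExclusiveLock);

        /* woke up: look in the FSM again before extending ourselves */
        targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
        if (targetBlock != InvalidBlockNumber)
        {
            UnlockRelationForExtension(relation, ExclusiveLock);
            /* retry with the page another backend just added */
        }
    }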

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-10 05:02:44
Message-ID: CAFiTN-uGtS0FbjL-URmRoYYMaMz2DWDkU-rcm3ibRRbHD5+yJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 5, 2016 at 4:50 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> > Could you also measure how this behaves for an INSERT instead of a COPY
> > workload?
>
> I think such a test will be useful.
>

I have measured the performance with INSERT to see the behavior when no
buffer access strategy is used. I have tested multiple options: small
tuples, big tuples, data that fits in shared buffers, and data that
doesn't.

Observation:
------------------
Apart from this test I have also used a tool which can capture many stack
traces at intervals.
1. I observed that with the base code (when the data doesn't fit in
shared buffers) almost all the stack traces are waiting on
"LockRelationForExtension", and many on "FlushBuffer" as well (flushing
the dirty buffer).

Total stacks captured: 204, FlushBuffer: 13, LockRelationForExtension: 187

(This test was run with 8 threads (512MB shared buffers) and a stack was
captured every 5 seconds.)

2. If I change shared buffers to 48GB then obviously FlushBuffer
disappears, but LockRelationForExtension still remains in very high
numbers.

3. The performance of the base code, both when the data fits in shared
buffers and when it doesn't, remains very low and non-scaling (we can see
that in the results below).

Test 1 (big record insert; data fits in shared buffers)
------------------------------------------------------------
setup
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int)
with(extend_by_blocks=$variable)" {create table data (a int) for base code}
echo "insert into data select * from test_data;" >> copy_script

test:
-----
shared_buffers=48GB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres

Client   Base   extend_by_blocks=50   extend_by_blocks=1000
1        113    115                   118
4        50     220                   216
8        43     202                   302

Test 2 (big record insert; data doesn't fit in shared buffers)
------------------------------------------------------------------
setup:
-------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int)
with(extend_by_blocks=1000)"
echo "insert into data select * from test_data;" >> copy_script

test:
------
shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres

Client   Base   extend_by_blocks=1000
1        125    125
4        49     236
8        41     294
16       39     279

Test 3 (small record insert; data fits in shared buffers)
------------------------------------------------------------------
setup:
--------
./psql -d postgres -c "create table test_data(a int)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,10000))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=20)"
echo "insert into data select * from test_data;" >> copy_script

test:
-----
shared_buffers=48GB -c max_wal_size=20GB -c checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres

Client   Base   Patch (extend_by_blocks=20)
1        137    143
2        269    250
4        377    443
8        170    690
16       145    745

* All tests were done with data on magnetic disk and WAL on SSD.

Note: The last patch has a maximum limit of extend_by_blocks=100, so to
measure performance with extend_by_blocks=1000 I changed it locally.
I will send the modified patch once we finalize which approach to
proceed with.

> > I'm doubtful that anything that does the victim buffer search while
> > holding the extension lock will actually scale in a wide range of
> > scenarios.
> >
>
> I think the problem with the victim buffer could become visible if the
> blocks are dirty and it needs to write them out, especially as the
> patch is written now, where after acquiring the extension lock it again
> extends the relation without checking whether it can get a page with
> space from the FSM. It seems to me that we should re-check the
> availability of a page, because while one backend is waiting on the
> extension lock another backend might have added pages. To re-check the
> availability we might want to use something similar to the
> LWLockAcquireOrWait() semantics used during WAL writing.
>

I will work on this in the next version...

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-10 09:54:39
Message-ID: 20160210095439.54gnh55sq5bx2n5n@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-02-10 10:32:44 +0530, Dilip Kumar wrote:
> On Fri, Feb 5, 2016 at 4:50 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > > Could you also measure how this behaves for an INSERT instead of a COPY
> > > workload?
> >
> > I think such a test will be useful.
> >
>
> I have measured the performance with INSERT to see the behavior when no
> buffer access strategy is used. I have tested multiple options: small
> tuples, big tuples, data that fits in shared buffers, and data that
> doesn't.

Could you please also have a look at the influence this has on latency?
I think you unfortunately have to look at the per-transaction logs, and
then see whether the worst case latencies get better or worse.

Greetings,

Andres Freund


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-10 13:36:50
Message-ID: CAFiTN-twocgGtpHGJYz5ggVVEtjaVLutSpP-uUfFNm+-1DgW6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 10, 2016 at 3:24 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Could you please also have a look at the influence this has on latency?
> I think you unfortunately have to look at the per-transaction logs, and
> then see whether the worst case latencies get better or worse.
>

I have quickly measured the per-transaction latency for one case.
(I selected the case below to find the worst-case latency, because here
we are extending by 1000 blocks and the data doesn't fit in shared
buffers.)

Test--2 (big record insert and Data doesn't fits in shared Buffer)
------------------------------------------------------------------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int)
with(extend_by_blocks=1000)"
echo "insert into data select * from test_data;" >> copy_script

shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=10min
./pgbench -c 8 -j 8 -f copy_script -T 120 -l postgres

          Base      Patch (extend 1000)
best      23245     3857
worst     236329    382859
average   190303    35143

I have also attached the pgbench log files:
patch_1000.tar --> log files with the patch, extend by 1000 blocks
base.tar --> log files with the base code

From the attached files we can see that with the patch only a few
transactions have high latency (> 300,000), which is expected given that
we are extending 1000 blocks at a time, while with the base code almost
every transaction's latency is high (> 200,000). The best-case and
average latencies with extend-by-1000 are about 1/5 of the base code's.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
patch_1000.tar application/x-tar 880.0 KB
base.tar application/x-tar 170.0 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-02-29 10:07:44
Message-ID: CAFiTN-s-YdPoixtSEfKNpv0QMgJ_7fVyES143tyx5s58d=byPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

>
I have tested the relation extension patch from various aspects; the
performance results and other statistical data are explained in this mail.

Test 1: Identify whether the heavyweight lock is the problem or the actual
context switches are.
1. I converted the RelationExtensionLock to a simple LWLock and tested with a
single relation. Results are below.

This is a simple script that COPYs 10000 records of size 4 bytes in one
transaction.

Client   Base   LWLock   multi_extend by 50 blocks
1        155    156      160
2        282    276      284
4        248    319      428
8        161    267      675
16       143    241      889

LWLock performance is better than base; the obvious reason may be that we
save some instructions by converting to an LWLock, but it doesn't scale any
better than the base code.

Test 2: Identify whether the improvement in the multi-extend case comes from
avoiding context switches or from some other factor, like reusing blocks
between backends by putting them in the FSM.

1. Test by just extending multiple blocks and reusing them in the extending
backend itself (don't put them in the FSM).
INSERT of 1KB records; data doesn't fit in shared buffers (512MB shared
buffers).

Client   Base   Extend 800 blocks (self use)   Extend 1000 blocks
1        117    131                            118
2        111    203                            140
3        51     242                            178
4        51     231                            190
5        52     259                            224
6        51     263                            243
7        43     253                            254
8        43     240                            254
16       40     190                            243

We can see the same improvement when a backend uses the blocks itself, which
shows that sharing the blocks between backends was not the win; avoiding the
context switches was the major win.

2. Tested the number of ProcSleep calls during the run.
This is the same simple script that COPYs 10000 records of size 4 bytes in
one transaction.

         BASE CODE                        PATCH (MULTI EXTEND)
Client   TPS    ProcSleep count           TPS (extend by 10)   ProcSleep count
2        280    457,506                   311                  62,641
3        235    1,098,701                 358                  141,624
4        216    1,155,735                 368                  188,173

What we can see in the above test is that with the base code performance
degrades after 2 threads while the ProcSleep count increases enormously.

Compared to that, with the patch extending 10 blocks at a time, the ProcSleep
count is reduced to ~1/8 and we can see it scales steadily.

The ProcSleep test for the INSERT case where data doesn't fit in shared
buffers and we insert big records of 1024 bytes is currently running;
once I get the data I will post it.

Posting the rebased version and moving it to the next CF.

Open points:
1. After getting the lock, recheck the FSM in case some other backend has
already added extra blocks, and reuse them.
2. Is it a good idea to have a user-level parameter for extend_by_blocks, or
should we try some approach that internally identifies how many blocks are
needed and adds only that many? That would make it more flexible.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v4.patch application/x-patch 4.0 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-01 11:06:39
Message-ID: CAA4eK1LOnxz4Qa_DquqbanSPXscTJXrKexJii8h3gnD9z8UY-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 29, 2016 at 3:37 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

>
> On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
>
>>
>
> Test2: Identify that improvement in case of multiextend is becuase of
> avoiding context switch or some other factor, like reusing blocks b/w
> backend by putting in FSM..
>
> 1. Test by just extending multiple blocks and reuse in it's own backend
> (Don't put in FSM)
> Insert 1K record data don't fits in shared buffer 512MB Shared Buffer
>
>
>
Client Base Extend 800 block self use Extend 1000 Block
> 1 117 131 118
> 2 111 203 140
> 3 51 242 178
> 4 51 231 190
> 5 52 259 224
> 6 51 263 243
> 7 43 253 254
> 8 43 240 254
> 16 40 190 243
>
> We can see the same improvement in case of self using the blocks also, It
> shows that Sharing the blocks between the backend was not the WIN but
> avoiding context switch was the measure win.
>
>
One thing that is slightly unclear is whether there is any overhead due to
buffer eviction, especially when the buffer to be evicted is already dirty
and needs XLogFlush(). One reason why it might not hurt is that by the time
we try to evict the buffer, the corresponding WAL has already been flushed by
the WAL writer; another possibility is that even if it does happen during
buffer eviction, its impact is much smaller. Can we try to measure the number
of flush calls which happen during buffer eviction?

> 2. Tested the Number of ProcSleep during the Run.
> This is the simple script of copy 10000 record in one transaction of size
> 4 Bytes
>
>          BASE CODE                        PATCH (MULTI EXTEND)
> Client   TPS    ProcSleep count           TPS (extend by 10)   ProcSleep count
> 2        280    457,506                   311                  62,641
> 3        235    1,098,701                 358                  141,624
> 4        216    1,155,735                 368                  188,173
>
> What we can see in above test that, in Base code performance is degrading
> after 2 threads, while Proc Sleep count in increasing with huge amount.
>
> Compared to that in Patch, with extending 10 blocks at a time Proc Sleep
> reduce to ~1/8 and we can see it is constantly scaling.
>
> Proc Sleep test for Insert test when data don't fits in shared buffer and
> inserting big record of 1024 bytes, is currently running
> once I get the data will post the same.
>
>
Okay. However, if the performance data for the case when data doesn't fit
into shared buffers shows a similar trend, then it might be worth trying to
extend with respect to the load on the system. I mean we could batch the
extension requests (as we have done in ProcArrayGroupClearXid) and extend
accordingly; if that works out, the benefit would be that we don't need any
configurable knob for it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-02 05:01:32
Message-ID: CAFiTN-vwU9=1r=6Vc7+H8+2oZpfO-1bWMrYhsh9CDQd86jD_Dw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 1, 2016 at 4:36 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> One thing that is slightly unclear is that whether there is any overhead
> due to buffer eviction especially when the buffer to be evicted is already
> dirty and needs XLogFlush(). One reason why it might not hurt is that by
> the time we tries to evict the buffer, corresponding WAL is already flushed
> by WALWriter or other possibility could be that even if it is getting done
> during buffer eviction, the impact for same is much lesser. Can we try to
> measure the number of flush calls which happen during buffer eviction?
>
>
Good idea; I will do this test and post the results.

>
>
>> Proc Sleep test for Insert test when data don't fits in shared buffer and
>> inserting big record of 1024 bytes, is currently running
>> once I get the data will post the same.
>>
>>
> Okay. However, I wonder if the performance data for the case when data
> doesn't fit into shared buffer also shows similar trend, then it might be
> worth to try by doing extend w.r.t load in system. I mean to say we can
> batch the extension requests (as we have done in ProcArrayGroupClearXid)
> and extend accordingly, if that works out then the benefit could be that we
> don't need any configurable knob for the same.
>

1. One option can be, as you suggested, something like ProcArrayGroupClearXid
with some modification: if we wait for the requests and extend exactly with
respect to them, we may face the context-switch problem again, so maybe we
should extend by some multiple of the requests.
(But we need to decide whether to give a block directly to the requester, add
it to the FSM, or do both, i.e. give at least one block to the requester and
put some multiple of that in the FSM.)

2. Another option can be that we analyse the current load on extension and
then speed up or slow down the extension pace.
A simple algorithm could look like this:

If (GotBlockFromFSM())
    Success++;     /* we got the block from the FSM */
Else
    Failure++;     /* no block in the FSM; we need to extend ourselves */

Now while extending:

Speed up
--------
If (Failure - Success > Threshold)   /* Threshold can be some number, say 10 */
{
    /* Many failures mean the load is high, so increase ExtendByBlocks. */
    ExtendByBlocks += Failure - Success;
    /* Reset after this so that we measure the latest trend. */
    Failure = Success = 0;
}

Slow down
---------
/* It's possible that demand for blocks has dropped while ExtendByBlocks is
   still high, so analyse the statistics and slow down the extension pace. */
If (Success - Failure > Threshold)
{
    /* We can't reduce it by a big number: requests may be getting satisfied
       precisely because this is the right amount, so decrease the pace
       gradually and re-analyse the statistics next time. */
    ExtendByBlocks--;
    Failure = Success = 0;
}

Any suggestions are welcome.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-04 05:06:31
Message-ID: CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=WVuxpoA1vDDyST6KQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 2, 2016 at 10:31 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

> 1. One option can be as you suggested like ProcArrayGroupClearXid, With
> some modification, because when we wait for the request and extend w.r.t
> that, may be again we face the Context Switch problem, So may be we can
> extend in some multiple of the Request.
> (But we need to take care whether to give block directly to requester or
> add it into FSM or do both i.e. give at-least one block to requester and
> put some multiple in FSM)
>
>
> 2. Other option can be that we analyse the current Load on the extend and
> then speed up or slow down the extending speed.
> Simple algorithm can look like this
>

I have tried the group extend approach:

1. We convert the extension lock to a trylock; if we get the lock, we extend
by one block.
2. If we don't get the lock, we use the group-leader concept where only one
process extends for all. A slight change from ProcArrayGroupClearXid is that
besides satisfying the requesting backends, we add some extra blocks to the
FSM, say GroupSize*10.
3. So we cannot measure the exact load directly, but we still have a factor,
the group size, that tells us the contention, and we extend in a multiple of
that. A rough sketch of the leader-side logic follows.
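
Every helper name in this sketch is invented for illustration; it is not the
actual code in the attached patch:

/* Rough shape of the group-extend idea; all helper names are invented. */
static void
GroupExtendRelationSketch(Relation relation)
{
    if (TryLockRelationForExtension(relation))
    {
        RelationAddOneBlock(relation);     /* no contention: one block */
        UnlockRelationForExtension(relation, ExclusiveLock);
        return;
    }

    if (!JoinExtendGroup(relation))        /* somebody else became leader */
    {
        WaitForExtendLeader();             /* sleep until the leader wakes us */
        return;                            /* our block is in the FSM now */
    }

    /* We are the leader: extend for the whole group, plus a margin. */
    LockRelationForExtension(relation, ExclusiveLock);
    {
        int     nmembers = CountExtendGroupMembers(relation);

        /* one block per member, plus GroupSize*10 extras into the FSM */
        AddBlocksToFSM(relation, nmembers + nmembers * 10);
    }
    UnlockRelationForExtension(relation, ExclusiveLock);
    WakeExtendGroupMembers(relation);
}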

Performance Analysis:
---------------------
Performance scales with this approach. It is slightly lower than with the
previous patch, where we directly give the extend_by_blocks parameter and
always extend by multiple blocks, and I think that's expected: in the
group-extend case we add extra blocks only after contention happens on the
lock, whereas the former extended extra blocks optimistically.

Test1(COPY)
-----
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script

./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffer:8GB max_wal_size:10GB Storage:Magnetic Disk WAL:SSD
-----------------------------------------------------------------------------------------------
Client   Base   multi_extend by 20 pages   group_extend patch (groupsize*10)
1        105    157                        149
2        217    255                        282
4        210    494                        452
8        166    702                        629
16       145    563                        561
32       124    480                        566

Test2(INSERT)
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)
echo "insert into data select * from test_data;" >> insert_script

shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client   Base   Multi-extend by 1000   *group extend (group*10)   *group extend (group*100)
1        117    118                    125                        122
2        111    140                    161                        159
4        51     190                    141                        187
8        43     254                    148                        172
16       40     243                    150                        173

* (group*10) means that inside the code the group leader checks how many
members of the group want blocks, satisfies the request of every member, and
in addition puts extra blocks in the FSM; the number of extra blocks to
extend is group size * 10, where 10 is just some constant.

Summary:
--------
1. With the group-extend patch there is no configuration to say how many
blocks to extend; that is decided by the current load on the system
(contention on the extension lock).
2. With a small multiplier, i.e. 10, we get a fairly good improvement
compared to the base code, but when the load is high, e.g. with 1KB records,
increasing the multiplier gives better results.

* Note: this is currently a POC patch. It has only one group-extend list, so
it can handle group extension of only one relation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_group.patch application/x-patch 9.7 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-04 16:29:27
Message-ID: CA+TgmoYDJ1O5qMXBsA=rKdq1AkVTNBDciiB7B7Ayf4iS=NMrcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I have tried the group extend approach:
>
> 1. We convert the extension lock to a trylock; if we get the lock, we extend
> by one block.
> 2. If we don't get the lock, we use the group-leader concept where only one
> process extends for all. A slight change from ProcArrayGroupClearXid is that
> besides satisfying the requesting backends, we add some extra blocks to the
> FSM, say GroupSize*10.
> 3. So we cannot measure the exact load directly, but we still have a factor,
> the group size, that tells us the contention, and we extend in a multiple of
> that.

This approach seems good to me, and the performance results look very
positive. The nice thing about this is that there is not a
user-configurable knob; the system automatically determines when
larger extensions are needed, which will mean that real-world users
are much more likely to benefit from this. I don't think it matters
that this is a little faster or slower than an approach with a manual
knob; what matters is that it is a huge improvement over unpatched
master, and that it does not need a knob. The arbitrary constant of
10 is a little unsettling but I think we can live with it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-04 16:31:42
Message-ID: 8172.1457109102@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> This approach seems good to me, and the performance results look very
> positive. The nice thing about this is that there is not a
> user-configurable knob; the system automatically determines when
> larger extensions are needed, which will mean that real-world users
> are much more likely to benefit from this. I don't think it matters
> that this is a little faster or slower than an approach with a manual
> knob; what matters is that it is a huge improvement over unpatched
> master, and that it does not need a knob. The arbitrary constant of
> 10 is a little unsettling but I think we can live with it.

+1. "No knob" is a huge win.

regards, tom lane


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-05 04:49:37
Message-ID: CAA4eK1K0=ToYcYpdDVNobKy2QaSd8f3B63hgvmW+m2mZVJJBYA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 4, 2016 at 9:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > I have tried the group extend approach:
> >
> > 1. We convert the extension lock to a trylock; if we get the lock, we
> > extend by one block.
> > 2. If we don't get the lock, we use the group-leader concept where only
> > one process extends for all. A slight change from ProcArrayGroupClearXid
> > is that besides satisfying the requesting backends, we add some extra
> > blocks to the FSM, say GroupSize*10.
> > 3. So we cannot measure the exact load directly, but we still have a
> > factor, the group size, that tells us the contention, and we extend in a
> > multiple of that.
>
> This approach seems good to me, and the performance results look very
> positive. The nice thing about this is that there is not a
> user-configurable knob; the system automatically determines when
> larger extensions are needed, which will mean that real-world users
> are much more likely to benefit from this.
>

I think one thing which needs more thought about this approach is that we
need to maintain some number of slots so that group extends for different
relations can happen in parallel. Do we want to provide simultaneous
extension for 1, 2, 3, 4, 5 or more relations? I think providing it for
three or four relations should be okay: the more relations we want to
support, the bigger the PGPROC structure will become.

+GroupExtendRelation(PGPROC *proc, Relation relation, BulkInsertState bistate)
+{
+   volatile PROC_HDR *procglobal = ProcGlobal;
+   uint32      nextidx;
+   uint32      wakeidx;
+   int         extraWaits = -1;
+   BlockNumber targetBlock;
+   int         count = 0;
+
+   /* Add ourselves to the list of processes needing a group extend. */
+   proc->groupExtendMember = true;
..
..
+   /* We are the leader.  Acquire the lock on behalf of everyone. */
+   LockRelationForExtension(relation, ExclusiveLock);

To provide it for multiple relations, I think you need to advertise the
reloid for the relation in each proc and then get the relation descriptor for
the relation extension lock.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-07 15:04:39
Message-ID: CA+TgmoYYgyHBtn8efrd0A0aNH8ZV9icoh9b_RtmOHTdqKTThTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I think one thing which needs more thoughts about this approach is that we
> need to maintain some number of slots so that group extend for different
> relations can happen in parallel. Do we want to provide simultaneous
> extension for 1, 2, 3, 4, 5 or more number of relations? I think providing
> it for three or four relations should be okay as higher the number we want
> to provide, bigger the size of PGPROC structure will be.

Hmm. Can we drive this off of the heavyweight lock manager's idea of
how big the relation extension lock wait queue is, instead of adding
more stuff to PGPROC?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-08 09:27:17
Message-ID: CAA4eK1KF99HeU1+TU66pFkAxNPT9xpAFbOVF9u1CTEzkpmj_oQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 7, 2016 at 8:34 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > I think one thing which needs more thought about this approach is that we
> > need to maintain some number of slots so that group extends for different
> > relations can happen in parallel. Do we want to provide simultaneous
> > extension for 1, 2, 3, 4, 5 or more relations? I think providing it for
> > three or four relations should be okay: the more relations we want to
> > support, the bigger the PGPROC structure will become.
>
> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> how big the relation extension lock wait queue is, instead of adding
> more stuff to PGPROC?
>

One idea to make it work without adding additional stuff in PGPROC is that
after acquiring the relation extension lock, we check whether there is any
available block in the FSM; if we find one, we release the lock and proceed.
Otherwise we extend the relation by one block, then check the lock's wait
queue size or the number of lock requests (nRequested), extend the relation
further in proportion to the wait queue size, and then release the lock and
proceed. I think we could check the wait queue size even before extending
the relation by one block.
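
A sketch of that flow (the helper names here just follow the thread's
terminology and are illustrative, not committed APIs):

    LockRelationForExtension(relation, ExclusiveLock);

    /* Did some other backend already extend and publish blocks in the FSM? */
    targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
    if (targetBlock == InvalidBlockNumber)
    {
        /* nRequested-style count of backends queued behind this lock */
        int     waiters = RelationExtensionLockWaiterCount(relation);

        RelationAddOneBlock(relation);              /* our own block */
        if (waiters > 0)
            RelationAddExtraBlocks(relation, waiters * 10); /* extras -> FSM */
    }

    UnlockRelationForExtension(relation, ExclusiveLock);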

The benefit of doing it with PGPROC is that there would be relatively fewer
LockAcquire calls than with the heavyweight-lock approach, which I think
should not matter much because we are planning to extend the relation in
proportion to the wait queue size (probably wait queue size * 10).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-08 13:53:09
Message-ID: CA+TgmoaOCAOePg+t05j0=3M7P5o5h62P41Tf6rRYxssf6o6Bxw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> Hmm. Can we drive this off of the heavyweight lock manager's idea of
>> how big the relation extension lock wait queue is, instead of adding
>> more stuff to PGPROC?
>
> One idea to make it work without adding additional stuff in PGPROC is that
> after acquiring relation extension lock, check if there is any available
> block in fsm, if it founds any block, then release the lock and proceed,
> else extend the relation by one block and then check lock's wait queue size
> or number of lock requests (nRequested) and extend the relation further in
> proportion to wait queue size and then release the lock and proceed. Here,
> I think we can check for wait queue size even before extending the relation
> by one block.
>
> The benefit of doing it with PGPROC is that there will be relatively less
> number LockAcquire calls as compare to heavyweight lock approach, which I
> think should not matter much because we are planing to extend the relation
> in proportion to wait queue size (probably wait queue size * 10).

I don't think switching relation extension from heavyweight locks to
lightweight locks is going to work. It would mean, for example, that
we lose the ability to service interrupts while extending a relation;
not to mention that we lose scalability if many relations are being
extended at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-08 15:04:08
Message-ID: CAA4eK1+EF2rhW7uN7sK=Q34xu33VneDjosqoa8o9sD1v0i2HJA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 8, 2016 at 7:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> >> how big the relation extension lock wait queue is, instead of adding
> >> more stuff to PGPROC?
> >
> > One idea to make it work without adding additional stuff in PGPROC is
> > that after acquiring the relation extension lock, we check whether there
> > is any available block in the FSM; if we find one, we release the lock
> > and proceed. Otherwise we extend the relation by one block, then check
> > the lock's wait queue size or the number of lock requests (nRequested),
> > extend the relation further in proportion to the wait queue size, and
> > then release the lock and proceed. I think we could check the wait queue
> > size even before extending the relation by one block.
> >
> > The benefit of doing it with PGPROC is that there would be relatively
> > fewer LockAcquire calls than with the heavyweight-lock approach, which I
> > think should not matter much because we are planning to extend the
> > relation in proportion to the wait queue size (probably wait queue size
> > * 10).
>
> I don't think switching relation extension from heavyweight locks to
> lightweight locks is going to work.
>

Sorry, but I am not suggesting changing it to lightweight locks. I am just
suggesting how to make batching work with heavyweight locks, as you asked.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-08 16:20:54
Message-ID: CAFiTN-s051yTDaRbNXmE3SvtON0EQxciEHmn_hojqrJAQ3rdyg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 8, 2016 at 8:34 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> > >> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> > >> how big the relation extension lock wait queue is, instead of adding
> > >> more stuff to PGPROC?
> > >
> > > One idea to make it work without adding additional stuff in PGPROC is
> > > that after acquiring the relation extension lock, we check whether
> > > there is any available block in the FSM; if we find one, we release
> > > the lock and proceed. Otherwise we extend the relation by one block,
> > > then check the lock's wait queue size or the number of lock requests
> > > (nRequested), extend the relation further in proportion to the wait
> > > queue size, and then release the lock and proceed. I think we could
> > > check the wait queue size even before extending the relation by one
> > > block.
>

I have come up with this patch.

If this approach looks fine then I will prepare the final patch (more
comments, indentation, and some code improvements) and do some long-run
testing (the current results are from 5-minute runs).

The idea is the same as what Robert and Amit suggested upthread.

/*
 * First we try the lock, and if we get it we just extend by one block.
 * This gives two benefits:
 * 1. Single-thread performance is not impacted by checking the lock waiters.
 * 2. Checking the waiters in the else part gives time for more waiters to
 *    collect, so we get a better estimate of the contention.
 */
if (TryRelExtLock())
{
    extend one block;
}
else
{
    RelExtLock();
    if (recheck the FSM: somebody has already added blocks for me)
        don't extend any block, just reuse them;
    else
    {
        we have to extend blocks;
        get the waiters:  waiters = lock->nRequested;
        add extra blocks to the FSM:  extraBlocks = waiters * 20;
    }
}

Result looks like this
---------------------------

Test1 (COPY)
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffer:8GB max_wal_size:10GB Storage:Magnetic Disk WAL:SSD
-----------------------------------------------------------------------------------------------
Client   Base   multi_extend by 20 pages   lock_waiter patch (waiter*20)
1        105    157                        148
2        217    255                        252
4        210    494                        442
8        166    702                        645
16       145    563                        773
32       124    480                        957

Note: at 1 thread there should not be any improvement, so that is probably
run-to-run variance.

Test2 (INSERT)
--------------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)"
echo "insert into data select * from test_data;" >> insert_script

shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client   Base   Multi-extend by 1000   lock_waiter patch (waiter*20)
1        117    118                    117
2        111    140                    132
4        51     190                    134
8        43     254                    148
16       40     243                    206
32       -      -                      264

* (waiter*20) -> the first process that gets the lock finds the number of
lock waiters and adds waiter*20 extra blocks.

In the next run I will go beyond 32 clients as well; as we can see, even at
32 clients it is still increasing, so it's clear that when it sees more
contention it adapts to it and adds more blocks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v5.patch application/octet-stream 8.0 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-08 20:09:55
Message-ID: CA+TgmoZyPuUnyLjt4dDpsEiz0=b+q05K8PFUOencqSSGhu=x+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 8, 2016 at 11:20 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I have come up with this patch..
>
> If this approach looks fine then I will prepare final patch (more comments,
> indentation, and improve some code) and do some long run testing (current
> results are 5 mins run).
>
> Idea is same what Robert and Amit suggested up thread.

So this seems promising, but I think the code needs some work.

LockWaiterCount() bravely accesses a shared memory data structure that
is mutable with no locking at all. That might actually be safe for
our purposes, but I think it would be better to take the partition
lock in shared mode if it doesn't cost too much performance. If
that's too expensive, then it should at least have a comment
explaining (1) that it is doing this without the lock and (2) why
that's safe (sketch: the LOCK can't go away because we hold it, and
nRequested could change but we don't mind a stale value, and a 32-bit
read won't be torn).
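
For concreteness, the shared-mode variant could look roughly like this (a
sketch against the lock.c internals, not the code in the patch):

int
LockWaiterCount(const LOCKTAG *locktag)
{
    uint32      hashcode = LockTagHashCode(locktag);
    LWLock     *partitionLock = LockHashPartitionLock(hashcode);
    LOCK       *lock;
    bool        found;
    int         waiters = 0;

    LWLockAcquire(partitionLock, LW_SHARED);
    lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
                                                (const void *) locktag,
                                                hashcode,
                                                HASH_FIND,
                                                &found);
    if (found)
        waiters = lock->nRequested;
    LWLockRelease(partitionLock);

    return waiters;
}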

A few of the other functions in here also lack comments, and perhaps
should have them.

The changes to RelationGetBufferForTuple need a visit from the
refactoring police. Look at the way you are calling
RelationAddOneBlock. The logic looks about like this:

if (needLock)
{
    if (trylock relation for extension)
        RelationAddOneBlock();
    else
    {
        lock relation for extension;
        if (use fsm)
        {
            complicated;
        }
        else
            RelationAddOneBlock();
    }
}
else
    RelationAddOneBlock();

So basically you want to do the RelationAddOneBlock() thing if
!needLock || !use_fsm || can't trylock. See if you can rearrange the
code so that there's only one fallthrough call to
RelationAddOneBlock() instead of three separate ones.
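
In other words, something shaped like this (just a sketch of the control
flow, in the same informal notation as above):

if (needLock)
{
    if (!use fsm)
        lock relation for extension;
    else if (!trylock relation for extension)
    {
        lock relation for extension;
        complicated;     /* recheck FSM, maybe add extra blocks */
    }
}

RelationAddOneBlock();   /* the single fallthrough call */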

Also, consider moving the code that adds multiple blocks at a time to
its own function instead of including it all in line.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-10 01:53:12
Message-ID: CAFiTN-vHYrYZ4HC+pmniJ4T=gNkJcPprDJT1Fnjc3QyXBmEHww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 9, 2016 at 1:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> LockWaiterCount() bravely accesses a shared memory data structure that
> is mutable with no locking at all. That might actually be safe for
> our purposes, but I think it would be better to take the partition
> lock in shared mode if it doesn't cost too much performance. If
> that's too expensive, then it should at least have a comment
> explaining (1) that it is doing this without the lock and (2) why
> that's safe (sketch: the LOCK can't go away because we hold it, and
> nRequested could change but we don't mind a stale value, and a 32-bit
> read won't be torn).
>

With the lock taken, performance is equally good, so I have added the lock.

>
> A few of the other functions in here also lack comments, and perhaps
> should have them.
>
> The changes to RelationGetBufferForTuple need a visit from the
> refactoring police. Look at the way you are calling
> RelationAddOneBlock. The logic looks about like this:
>
> if (needLock)
> {
>     if (trylock relation for extension)
>         RelationAddOneBlock();
>     else
>     {
>         lock relation for extension;
>         if (use fsm)
>         {
>             complicated;
>         }
>         else
>             RelationAddOneBlock();
>     }
> }
> else
>     RelationAddOneBlock();
>
> So basically you want to do the RelationAddOneBlock() thing if
> !needLock || !use_fsm || can't trylock. See if you can rearrange the
> code so that there's only one fallthrough call to
> RelationAddOneBlock() instead of three separate ones.
>

Actually, in every case we need one block, so I have refactored it and
RelationAddOneBlock() is now outside of any condition.

>
> Also, consider moving the code that adds multiple blocks at a time to
> its own function instead of including it all in line.
>

Done

Attaching the latest patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v6.patch text/x-patch 9.3 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-10 02:25:01
Message-ID: 56E0DAFD.7020401@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/03/16 02:53, Dilip Kumar wrote:
>
> Attaching a latest patch.
>

Hmm, why did you remove the comment above the call to
UnlockRelationForExtension? It still seems relevant, maybe with some
minor modification?

Also there is a bit of whitespace mess inside the conditional lock block.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-10 08:57:43
Message-ID: CAFiTN-vd-Yr-1TJei24qG2pa61TuRQj5MGaBUOjTnU=jMzKW_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 10, 2016 at 7:55 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:

Thanks for the comments.

> Hmm, why did you remove the comment above the call to
> UnlockRelationForExtension?

I lost this comment while refactoring. Fixed it.

> It still seems relevant, maybe with some minor modification?
>
> Also there is a bit of whitespace mess inside the conditional lock block.

Fixed

I got the results of the 10-minute runs, so posting them.
Note: base code results are copied from upthread.

Results for a 10-minute run of COPY of 10000 records of size 4 bytes; script
and configuration are the same as used upthread
--------------------------------------------------------------------------------------------

Client   Base   Patch
1        105    111
2        217    246
4        210    428
8        166    653
16       145    808
32       124    988
64       ---    974

Results for a 10-minute run of INSERT of 1000 records of size 1024 bytes
(data doesn't fit in shared buffers)
--------------------------------------------------------------------------------------------------

Client   Base   Patch
1        117    120
2        111    126
4        51     130
8        43     147
16       40     209
32       ---    254
64       ---    205

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v7.patch text/x-patch 9.2 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-10 18:34:35
Message-ID: 56E1BE3B.7010909@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/03/16 09:57, Dilip Kumar wrote:
>
> On Thu, Mar 10, 2016 at 7:55 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>
> Thanks for the comments..
>
> Hmm, why did you remove the comment above the call to
> UnlockRelationForExtension?
>
> While re factoring I lose this comment.. Fixed it
>
> It still seems relevant, maybe with some minor modification?
>
> Also there is a bit of whitespace mess inside the conditional lock
> block.
>
> Fixed
>
> I got the result of 10 mins run so posting it..
> Note: Base code results are copied from up thread...
>
> Results For 10 Mins run of COPY 10000 records of size 4 bytes script and
> configuration are same as used in up thread
> --------------------------------------------------------------------------------------------
>
> Client Base Patch
> 1 105 111
> 2 217 246
> 4 210 428
> 8 166 653
> 16 145 808
> 32 124 988
> 64 --- 974
>
>
> Results For 10 Mins run of INSERT 1000 records of size 1024 bytes(data
> don't fits in shared buffer)
> --------------------------------------------------------------------------------------------------
>
> Client Base Patch
> 1 117 120
> 2 111 126
> 4 51 130
> 8 43 147
> 16 40 209
> 32 --- 254
> 64 --- 205
>

Those look good, and the patch looks good in general now. I am a bit scared
by the lockWaiters * 20, as it can result in relatively big extensions in
rare corner cases, for example when a lot of backends were waiting for a
lock on the relation and suddenly all try to extend it. I wonder if we
should clamp it to something sane (although what's sane today might be small
in a couple of years).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-11 01:44:21
Message-ID: CAFiTN-tpsAAprJViK6y32GcnaJZcowciPQ4b9rxWbBQQcLriGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:

Thanks for looking..

Those look good. The patch looks good in general now. I am bit scared by
> the lockWaiters * 20 as it can result in relatively big changes in rare
> corner cases when for example a lot of backends were waiting for lock on
> relation and suddenly all try to extend it. I wonder if we should clamp it
> to something sane (although what's sane today might be small in couple of
> years).

But in such a condition, when all are waiting on the lock, only one waiter at
a time will get the lock, and it can easily count the lock waiters and extend
by a multiple of that. We also have the check that when any waiter gets the
lock, it first checks the FSM to see whether somebody else has already added
blocks.

And the other waiters will not get the lock until the first waiter has
extended all the blocks and released the lock.

One possible case is that as soon as we extend the blocks, new requesters
find them directly in the FSM and never come for the lock, while an old
waiter, after getting the lock, finds nothing in the FSM. But IMHO in such
cases it is also good that the other waiter extends more blocks (because this
can happen when the request flow is very high).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-11 01:54:22
Message-ID: 56E2254E.5090202@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/03/16 02:44, Dilip Kumar wrote:
>
> On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>
> Thanks for looking..
>
> Those look good. The patch looks good in general now. I am bit
> scared by the lockWaiters * 20 as it can result in relatively big
> changes in rare corner cases when for example a lot of backends were
> waiting for lock on relation and suddenly all try to extend it. I
> wonder if we should clamp it to something sane (although what's sane
> today might be small in couple of years).
>
>
> But in such condition when all are waiting on lock, then at a time only
> one waiter will get the lock and that will easily count the lock waiter
> and extend in multiple of that. And we also have the check that if any
> waiter get the lock it will first check in FSM that anybody else have
> added one block or not..
>

I am not talking about extension locks. The lock queue can be long because
there is concurrent DDL, for example, and then once the DDL finishes,
suddenly 100 connections that tried to insert into the table will try to get
the extension lock, and this will add 2000 new pages when many fewer were
actually needed. I guess that's fine, as it's a corner case and it's only
16MB even in such an extreme case.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-11 21:29:39
Message-ID: CA+TgmoY0BqcYJLUuBkoFE9S_ztR=12FpXmPZ-p9WnFE8AQuT3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 10, 2016 at 8:54 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
> I am not talking about extension locks, the lock queue can be long because
> there is concurrent DDL for example and then once DDL finishes suddenly 100
> connections that tried to insert into table will try to get extension lock
> and this will add 2000 new pages when much fewer was actually needed. I
> guess that's fine as it's corner case and it's only 16MB even in such
> extreme case.

I don't really understand this part about concurrent DDL. If there
were concurrent DDL going on, presumably other backends would be
blocked on the relation lock, not the relation extension lock - and it
doesn't seem likely that you'd often have a huge pile-up of inserters
waiting on concurrent DDL. But I guess it could happen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-11 23:14:30
Message-ID: 56E35156.70603@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/03/16 22:29, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 8:54 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>> I am not talking about extension locks, the lock queue can be long because
>> there is concurrent DDL for example and then once DDL finishes suddenly 100
>> connections that tried to insert into table will try to get extension lock
>> and this will add 2000 new pages when much fewer was actually needed. I
>> guess that's fine as it's corner case and it's only 16MB even in such
>> extreme case.
>
> I don't really understand this part about concurrent DDL. If there
> were concurrent DDL going on, presumably other backends would be
> blocked on the relation lock, not the relation extension lock - and it
> doesn't seem likely that you'd often have a huge pile-up of inserters
> waiting on concurrent DDL. But I guess it could happen.
>

Yeah, I was thinking about the latter part, and as I said it's a very rare
case, but I did see something similar a couple of times in the wild. It's
not an objection against committing this patch though; in fact I think it
can be committed as is.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-12 00:01:07
Message-ID: 56E35C43.5030604@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/11/16 5:14 PM, Petr Jelinek wrote:
>> I don't really understand this part about concurrent DDL. If there
>> were concurrent DDL going on, presumably other backends would be
>> blocked on the relation lock, not the relation extension lock - and it
>> doesn't seem likely that you'd often have a huge pile-up of inserters
>> waiting on concurrent DDL. But I guess it could happen.
>>
>
> Yeah I was thinking about the latter part and as I said it's very rare
> case, but I did see something similar couple of times in the wild. It's
> not objection against committing this patch though, in fact I think it
> can be committed as is.

FWIW, this is definitely a real possibility in any shop that has very
high downtime costs and high transaction rates.

I also think some kind of clamp is a good idea. It's not that uncommon
to run max_connections significantly higher than 100, so the extension
could be way larger than 16MB. In those cases this patch could actually
make things far worse as everyone backs up waiting on the OS to extend
many MB when all you actually needed were a couple dozen more pages.

BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
demand for extension that means you're running lots of small DML
operations, not really big ones. I'd think that would make *1 more
appropriate.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-12 02:46:53
Message-ID: CAFiTN-twn5YLRxWBJ2ZTki1vn4OP8_oc8DUJUUaib_cnaDyfVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:

> FWIW, this is definitely a real possibility in any shop that has very high
> downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that uncommon to
> run max_connections significantly higher than 100, so the extension could
> be way larger than 16MB. In those cases this patch could actually make
> things far worse as everyone backs up waiting on the OS to extend many MB
> when all you actually needed were a couple dozen more pages.
>

I agree, we can have some max limit on the number of extra pages. What do
others think?
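
Something as simple as a clamp at the point where the extra-block count is
computed would do; a sketch (the cap name and value here are made up):

    /* cap the burst no matter how long the wait queue has grown */
    extraBlocks = Min(lockWaiters * 20, MAX_EXTRA_BLOCKS);  /* e.g. 512 pages */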

> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
> demand for extension that means you're running lots of small DML
> operations, not really big ones. I'd think that would make *1 more
> appropriate.
>

*1 will not solve this problem. Here the main problem was that many processes
sleep and wake up on the extension lock, and that was causing the bottleneck.
If we do *1, this will satisfy only the current requesters which have already
waited on the lock, but our goal is to keep backends from requesting this
lock at all.

The idea of finding the requesters is to get statistics on this lock (the
load on the lock) and extend in a multiple of the load, so that this
situation is avoided for a long time; when it happens again, we again extend
in a multiple of the load.

How did 20 come about?
I tested with multiple client loads, 1..64, with load sizes from 4-byte
records to 1KB records, COPY/INSERT, and found that 20 works best.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-12 03:07:59
Message-ID: CAA4eK1JURc596BxMM6p5Tx_7DRv-6t42CAiEX1zPYXVBW7i_Rw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 12, 2016 at 8:16 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
>>
>> FWIW, this is definitely a real possibility in any shop that has very
>> high downtime costs and high transaction rates.
>>
>> I also think some kind of clamp is a good idea. It's not that uncommon
>> to run max_connections significantly higher than 100, so the extension
>> could be way larger than 16MB. In those cases this patch could actually
>> make things far worse as everyone backs up waiting on the OS to extend
>> many MB when all you actually needed were a couple dozen more pages.
>
> I agree, we can have some max limit on the number of extra pages. What do
> others think?
>
>> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
>> demand for extension that means you're running lots of small DML
>> operations, not really big ones. I'd think that would make *1 more
>> appropriate.
>
> *1 will not solve this problem. Here the main problem was that many
> processes sleep and wake up on the extension lock, and that was causing
> the bottleneck. If we do *1, this will satisfy only the current requesters
> which have already waited on the lock, but our goal is to keep backends
> from requesting this lock at all.
>
> The idea of finding the requesters is to get statistics on this lock (the
> load on the lock) and extend in a multiple of the load, so that this
> situation is avoided for a long time; when it happens again, we again
> extend in a multiple of the load.
>
> How did 20 come about?
> I tested with multiple client loads, 1..64, with load sizes from 4-byte
> records to 1KB records, COPY/INSERT, and found that 20 works best.
>

Can you post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier
you have tried, so that it is clear that 20 is best?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-12 03:56:58
Message-ID: 56E3938A.3040306@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/03/16 01:01, Jim Nasby wrote:
> On 3/11/16 5:14 PM, Petr Jelinek wrote:
>>> I don't really understand this part about concurrent DDL. If there
>>> were concurrent DDL going on, presumably other backends would be
>>> blocked on the relation lock, not the relation extension lock - and it
>>> doesn't seem likely that you'd often have a huge pile-up of inserters
>>> waiting on concurrent DDL. But I guess it could happen.
>>>
>>
>> Yeah, I was thinking about the latter part, and as I said it's a very
>> rare case, but I did see something similar a couple of times in the wild.
>> It's not an objection against committing this patch though; in fact I
>> think it can be committed as is.
>
> FWIW, this is definitely a real possibility in any shop that has very
> high downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that uncommon
> to run max_connections significantly higher than 100, so the extension
> could be way larger than 16MB. In those cases this patch could actually
> make things far worse as everyone backs up waiting on the OS to extend
> many MB when all you actually needed were a couple dozen more pages.
>

Well, yeah, I've seen 10k, but not everything will write to the same table;
I wanted to stay in the realm of something that has a realistic chance of
happening.

> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
> demand for extension that means you're running lots of small DML
> operations, not really big ones. I'd think that would make *1 more
> appropriate.

The benchmarks I've seen showed that you want at least *10, and *20 was better.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-12 03:57:45
Message-ID: 56E393B9.1070906@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/03/16 03:46, Dilip Kumar wrote:
>
> On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com
> <mailto:Jim(dot)Nasby(at)bluetreble(dot)com>> wrote:
>
> FWIW, this is definitely a real possibility in any shop that has
> very high downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that
> uncommon to run max_connections significantly higher than 100, so
> the extension could be way larger than 16MB. In those cases this
> patch could actually make things far worse as everyone backs up
> waiting on the OS to extend many MB when all you actually needed
> were a couple dozen more pages.
>
>
> I agree. We can have some max limit on the number of extra pages. What do
> others think?
>

Well, that's what I meant by clamping originally. I don't know what a good
value is, though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-12 06:24:10
Message-ID: CAFiTN-vFRHTe_1ayhK01QJTX_oKyuj2xhbw__7NYnnhDT2815Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 12, 2016 at 8:37 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> Can you post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier
> you have tried, so that it is clear that 20 is best?

I tried with 1, 10, 20 and 50.

1. With *1 it was almost the same as the base code.

2. With *10 the data matches my previous group-extend patch data posted
upthread:
http://www.postgresql.org/message-id/CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=WVuxpoA1vDDyST6KQ@mail.gmail.com

3. Beyond *20, I did not see any extra performance benefit with *50 compared
to *20 (I tested *50 only with 4-byte COPY; I did not test other data sizes
with *50).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-13 23:32:38
Message-ID: 56E5F896.6040808@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/11/16 9:57 PM, Petr Jelinek wrote:
>> I also think some kind of clamp is a good idea. It's not that
>> uncommon to run max_connections significantly higher than 100, so
>> the extension could be way larger than 16MB. In those cases this
>> patch could actually make things far worse as everyone backs up
>> waiting on the OS to extend many MB when all you actually needed
>> were a couple dozen more pages.
>>
>>
>> I agree. We can have some max limit on the number of extra pages. What do
>> others think?
>>
>
> Well, that's what I meant by clamping originally. I don't know what a
> good value is, though.

Well, 16MB is 2K pages, which is what you'd get if 100 connections were
all blocked and we're doing 20 pages per waiter. That seems like a
really extreme scenario, so maybe 4MB is a good compromise. That's
unlikely to be hit in most cases, unlikely to put a ton of stress on IO,
even with magnetic media (assuming the whole 4MB is queued to write in
one shot...). 4MB would still reduce the number of locks by 500x.
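
To make the arithmetic concrete, here is a small standalone check (this
assumes the default 8KB page size; it is an illustration, not code from the
patch):

    #include <stdio.h>

    int
    main(void)
    {
        const int page_kb = 8;          /* default BLCKSZ: 8KB pages */

        /* 100 blocked connections at 20 pages per waiter */
        int pages = 100 * 20;
        printf("%d pages = %d KB (roughly 16MB)\n", pages, pages * page_kb);

        /* the proposed clamp: 4MB worth of 8KB pages */
        printf("4MB clamp = %d pages\n", (4 * 1024) / page_kb);
        return 0;
    }
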
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-14 02:29:51
Message-ID: CAFiTN-sZ=XP1=+OSNwVVMDQxw=ue6=51c-0XF8h1Ftr9HdzEcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:

> Well, 16MB is 2K pages, which is what you'd get if 100 connections were
> all blocked and we're doing 20 pages per waiter. That seems like a really
> extreme scenario, so maybe 4MB is a good compromise. That's unlikely to be
> hit in most cases, unlikely to put a ton of stress on IO, even with
> magnetic media (assuming the whole 4MB is queued to write in one shot...).
> 4MB would still reduce the number of locks by 500x.

In my performance results given upthread, we get maximum performance at 32
clients, which means we are extending at most 32*20 = 640 pages at a time. So
with a 4MB limit (max 512 pages) the results should look similar. We need to
decide whether 4MB is a good limit; should I change it?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-14 02:56:38
Message-ID: 56E62866.8040306@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14/03/16 03:29, Dilip Kumar wrote:
>
> On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com
> <mailto:Jim(dot)Nasby(at)bluetreble(dot)com>> wrote:
>
> Well, 16MB is 2K pages, which is what you'd get if 100 connections
> were all blocked and we're doing 20 pages per waiter. That seems
> like a really extreme scenario, so maybe 4MB is a good compromise.
> That's unlikely to be hit in most cases, unlikely to put a ton of
> stress on IO, even with magnetic media (assuming the whole 4MB is
> queued to write in one shot...). 4MB would still reduce the number
> of locks by 500x.
>
>
> In my performance results given upthread, we get maximum performance at 32
> clients, which means we are extending at most 32*20 = 640 pages at a time.
> So with a 4MB limit (max 512 pages) the results should look similar. We
> need to decide whether 4MB is a good limit; should I change it?
>
>

Well, any value we choose will be very arbitrary. If we look at it from the
point of view of the maximum absolute disk space we allocate for a relation
at once, the 4MB limit would represent a 2.5 orders of magnitude change. That
sounds like enough for one release cycle; I think we can further tune it if
the need arises in the next one. (With my love for round numbers I would have
suggested 8MB, as that's 3 orders of magnitude, but I am fine with 4MB as
well.)

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-17 03:42:28
Message-ID: CAFiTN-v8NQ10RucWPg0cwQ2qhLptLsL7PUTwxk0CXVwLX8Ht9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 14, 2016 at 8:26 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:

> Well any value we choose will be very arbitrary. If we look at it from the
> point of maximum absolute disk space we allocate for relation at once,
> the 4MB limit would represent 2.5 orders of magnitude change. That sounds
> like enough for one release cycle, I think we can further tune it if the
> need arises in next one. (with my love for round numbers I would have
> suggested 8MB as that's 3 orders of magnitude, but I am fine with 4MB as
> well)
>

I have modified the patch; it now puts a max limit on the extra pages:
512 pages (4MB) is the max limit.

I have also measured the performance, and it looks equally good.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v8.patch text/x-patch 9.4 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-17 08:01:26
Message-ID: 56EA6456.1030206@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17/03/16 04:42, Dilip Kumar wrote:
>
> On Mon, Mar 14, 2016 at 8:26 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com
> <mailto:petr(at)2ndquadrant(dot)com>> wrote:
>
> Well any value we choose will be very arbitrary. If we look at it
> from the point of maximum absolute disk space we allocate for
> relation at once, the 4MB limit would represent 2.5 orders of
> magnitude change. That sounds like enough for one release cycle, I
> think we can further tune it if the need arises in next one. (with
> my love for round numbers I would have suggested 8MB as that's 3
> orders of magnitude, but I am fine with 4MB as well)
>
>
> I have modified the patch, this contains the max limit on extra pages,
> 512(4MB) pages is the max limit.
>
> I have measured the performance also and that looks equally good.
>

Great.

Just a small notational thing; maybe this would be simpler?:
extraBlocks = Min(512, lockWaiters * 20);
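
To illustrate where that expression sits, a minimal self-contained sketch of
the bulk-extend path (the helper names are stand-ins made up for illustration,
not the patch's actual functions):

    #include <stdio.h>

    #define Min(x, y)   ((x) < (y) ? (x) : (y))

    /* stand-ins for the real lock-waiter count and extend+FSM-record calls */
    static int  lock_waiter_count(void) { return 32; }
    static void extend_one_block_and_record_fsm(void) { /* smgrextend etc. */ }

    static void
    add_extra_blocks(void)
    {
        int lockWaiters = lock_waiter_count();
        int extraBlocks = Min(512, lockWaiters * 20);   /* clamp at 4MB */

        while (extraBlocks-- > 0)
            extend_one_block_and_record_fsm();
    }

    int
    main(void)
    {
        add_extra_blocks();     /* 32 waiters -> Min(512, 640) = 512 blocks */
        printf("Min(512, 32 * 20) = %d\n", Min(512, 32 * 20));
        return 0;
    }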

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-18 09:08:18
Message-ID: CAFiTN-sUs8mQ=zzdReDaUpaxzGS7MabWf3YaN=np6XJtusWx+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:

> Great.
>
> Just small notational thing, maybe this would be simpler?:
> extraBlocks = Min(512, lockWaiters * 20);
>

Done, new patch attached.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v9.patch text/x-patch 9.4 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-21 12:10:07
Message-ID: CAA4eK1+hb3xfk+_JKnz-d2Wyt-j3=gka=Jz6kptqx2re8CZYDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 18, 2016 at 2:38 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
>
> On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com>
wrote:
>>
>> Great.
>>
>> Just small notational thing, maybe this would be simpler?:
>> extraBlocks = Min(512, lockWaiters * 20);
>
>
> Done, new patch attached.
>

Review comments:

1.
/*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState
bistate)

Shall we mention in the comments that this API returns a locked buffer and
that it's the caller's responsibility to unlock it?

2.
+ /* To avoid the cases when there are huge number of lock waiters, and
+  * extend file size by big amount at a time, put some limit on the

The first line of a multi-line comment should not contain anything.

3.
+ extraBlocks = Min(512, lockWaiters * 20);

I think we should explain in the comments the reason for choosing 20 in the
above calculation.

4. Some time back [1], you agreed to do some analysis of the overhead that
XLogFlush can cause during buffer eviction, but I don't see the results of
that. Are you planning to complete it?

5.
+ if (LOCKACQUIRE_OK
+     != RelationExtensionLockConditional(relation, ExclusiveLock))

I think the coding style is to keep the constant on the right side of the
condition; did you see any other place in the code that writes the check this
way?

6.
- /*
-  * Now acquire lock on the new page.
+ /* For all case we need to add at least one block to satisfy our
+  * own request.
   */

Same problem as in point 2.

7.
@@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
if (needLock)

UnlockRelationForExtension(relation, ExclusiveLock);

+
+

Spurious newline addition.

8.
+int
+RelationExtensionLockWaiter(Relation relation)

How about naming this function RelationExtensionLockWaiterCount?

9.
+ /* Other waiter has extended the block for us*/

Provide an extra space before ending the comment.

10.
+ if (use_fsm)
+ {
+     if (lastValidBlock != InvalidBlockNumber)
+     {
+         targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                     lastValidBlock,
+                                                     pageFreeSpace,
+                                                     len + saveFreeSpace);
+     }

Are you using RecordAndGetPageWithFreeSpace() instead of
GetPageWithFreeSpace() to get a page close to the previous target page? If
so, do you see enough benefit from that to compensate for the additional
write operation which the Record* API might cause by additionally dirtying
the buffer?

11.
+ {
+     /*
+      * First try to get the lock in no-wait mode, if succeed extend one
+      * block, else get the lock in normal mode and after we get the lock
+      * extend some extra blocks, extra blocks will be added to satisfy
+      * request of other waiters and avoid future contention. Here instead
+      * of directly taking lock we try no-wait mode, this is to handle the
+      * case, when there is no contention - it should not find the lock
+      * waiter and execute extra instructions.
+      */
+     if (LOCKACQUIRE_OK
+         != RelationExtensionLockConditional(relation, ExclusiveLock))
+     {
+         LockRelationForExtension(relation, ExclusiveLock);
+
+         if (use_fsm)
+         {
+             if (lastValidBlock != InvalidBlockNumber)
+             {
+                 targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                             lastValidBlock,
+                                                             pageFreeSpace,
+                                                             len + saveFreeSpace);
+             }
+
+             /* Other waiter has extended the block for us*/
+             if (targetBlock != InvalidBlockNumber)
+             {
+                 UnlockRelationForExtension(relation, ExclusiveLock);
+                 goto loop;
+             }
+
+             RelationAddExtraBlocks(relation, bistate);
+         }
+     }
+ }

- /*
-  * Now acquire lock on the new page.
+ /* For all case we need to add at least one block to satisfy our
+  * own request.
   */
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = RelationAddOneBlock(relation, otherBuffer, bistate);

Won't this cause one extra block to be added after a backend extends the
relation by multiple blocks? What is the need for that?

12. I think it would be good to run the pgbench read-write tests once to
ensure that this doesn't introduce any new regression.

[1] -
http://www.postgresql.org/message-id/CAA4eK1LOnxz4Qa_DquqbanSPXscTJXrKexJii8h3gnD9z8UY-A@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-22 09:15:12
Message-ID: CAFiTN-s406zUjaCK_btZoQKsEPSgE8K8OPzzCZEC2jG8xBUsxQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 21, 2016 at 8:10 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> Review comments:
>
>
Thanks for the review. Please find my responses inline.

> 1.
> /*
> + * RelationAddOneBlock
> + *
> + * Extend relation by one block and lock the buffer
> + */
> +static Buffer
> +RelationAddOneBlock(Relation relation, Buffer otherBuffer,
> BulkInsertState bistate)
>
> Shall we mention in comments that this API returns locked buffer and it's
> the responsibility of caller to unlock it.
>

Fixed

> 2.
> + /* To avoid the cases when there are huge number of lock waiters, and
> +  * extend file size by big amount at a time, put some limit on the
>
> The first line of a multi-line comment should not contain anything.
>

Fixed

>
> 3.
> + extraBlocks = Min(512, lockWaiters * 20);
>
> I think we should explain in comments about the reason of choosing 20 in
> above calculation.
>

Fixed. Please check whether the explanation is enough or whether we need to
add something more.

>
> 4. Sometime back [1], you agreed on doing some analysis for the overhead
> that XLogFlush can cause during buffer eviction, but I don't see the
> results of same, are you planing to complete the same?
>

OK, I will test this.

>
> 5.
> + if (LOCKACQUIRE_OK
> + != RelationExtensionLockConditional(relation,
> ExclusiveLock))
>
> I think the coding style is to keep constant on right side of condition,
> did you see any other place in code which uses the check in a similar way?
>

Fixed.

I'm not sure about any other place. (This is what I used to follow: keeping
the constant on the left side guards against the case where, instead of ==,
we mistakenly write =; with a constant on the left that is a compile error
rather than a silent assignment.)

But in PG style the constant should be on the right, so I will take care of it.
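
For anyone unfamiliar with that convention, a tiny standalone example (not
from the patch) of the pitfall the constant-on-the-left style guards against:

    #include <stdio.h>

    int
    main(void)
    {
        int x = 0;

        if (x = 5)      /* typo for "==": assigns 5 to x, always true */
            printf("always reached, x is now %d\n", x);

        /*
         * With the constant on the left the same typo fails to compile:
         *
         *     if (5 = x)      error: lvalue required
         */
        return 0;
    }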

>
> 6.
> - /*
> - * Now acquire lock on the new page.
> + /* For all case we need to add at least one block to satisfy
> our
> + * own request.
> */
>
> Same problem as in point 2.
>

Fixed.

>
> 7.
> @@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
> if (needLock)
>
> UnlockRelationForExtension(relation, ExclusiveLock);
>
> +
> +
>
> Spurious newline addition.
>

Fixed

>
> 8.
> +int
> +RelationExtensionLockWaiter(Relation relation)
>
> How about naming this function as RelationExtensionLockWaiterCount?
>
>
Done

> 9.
> + /* Other waiter has extended the block for us*/
>
> Provide an extra space before ending the comment.
>

Fixed

>
> 10.
> + if (use_fsm)
> + {
> +     if (lastValidBlock != InvalidBlockNumber)
> +     {
> +         targetBlock = RecordAndGetPageWithFreeSpace(relation,
> +                                                     lastValidBlock,
> +                                                     pageFreeSpace,
> +                                                     len + saveFreeSpace);
> +     }
>
> Are you using RecordAndGetPageWithFreeSpace() instead of
> GetPageWithFreeSpace() to get the page close to the previous target page?
> If yes, then do you see enough benefit of the same that it can compensate
> the additional write operation which Record* API might cause due to
> additional dirtying of buffer?
>

Here we call RecordAndGetPageWithFreeSpace instead of GetPageWithFreeSpace
because another backend that got the lock might have added extra blocks to
the FSM, and it's possible that the FSM tree has not been updated all the way
up yet; in that case we will not find the page by searching from the top with
GetPageWithFreeSpace, so we need to search the leaf page directly using our
last valid target block.

I have explained the same in the comments...
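
As a toy model of that situation (illustration only, not the real FSM code):
the upper FSM level still holds a stale maximum, so a top-down search misses
pages that a leaf-level search from the last known block finds.

    #include <stdio.h>

    #define LEAF_PAGES 8

    int
    main(void)
    {
        unsigned char root_max = 0;     /* stale upper-level entry */
        unsigned char leaf[LEAF_PAGES] = {0};
        int last_known = 3;             /* our last valid target block */
        int i;

        /* another backend recorded space in the leaves, but the new
         * maxima were never bubbled up to the upper level */
        for (i = 4; i < LEAF_PAGES; i++)
            leaf[i] = 255;

        /* a top-down search consults the stale upper level and gives up */
        if (root_max == 0)
            printf("top-down search: nothing found\n");

        /* searching the leaves from the last known block still succeeds */
        for (i = last_known; i < LEAF_PAGES; i++)
            if (leaf[i] > 0)
            {
                printf("leaf search from block %d: found block %d\n",
                       last_known, i);
                break;
            }
        return 0;
    }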

> 11.
> + {
> + /*
> + * First try to get the lock in no-wait mode, if succeed extend one
> +
> * block, else get the lock in normal mode and after we get the lock
> - LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> + buffer =
> RelationAddOneBlock(relation, otherBuffer, bistate);
>
> Won't this cause one extra block addition after backend extends the
> relation for multiple blocks, what is the need of same?
>

This block is for this backend; the extra extension is for future requests
and is already added to the FSM. I could have folded this count into the
extra blocks in RelationAddExtraBlocks, but in that case I would need an
extra "if" to reserve one buffer for this backend and then return that
specific buffer to the caller, and the caller would also need to distinguish
between a backend that only wants to add one block and one that got its block
along with the extra blocks.

I think the code is simpler this way: every backend that comes down here adds
one block for its own use, and all the other functionality and logic sits
above, i.e. whether to take the lock and whether to add extra blocks.

>
> 12. I think it is good to once test pgbench read-write tests to ensure
> that this doesn't introduce any new regression.
>

I will test this and post the results.

Latest patch attached.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v10.patch application/octet-stream 10.3 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-22 17:12:53
Message-ID: 56F17D15.1050704@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 22/03/16 10:15, Dilip Kumar wrote:
>
> On Mon, Mar 21, 2016 at 8:10 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com
> <mailto:amit(dot)kapila16(at)gmail(dot)com>> wrote:
> 11.
> + {
> +     /*
> +      * First try to get the lock in no-wait mode, if succeed extend one
> +      * block, else get the lock in normal mode and after we get the lock
> - LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> + buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
>
> Won't this cause one extra block addition after backend extends the
> relation for multiple blocks, what is the need of same?
>
>
> This block is for this backend; the extra extension is for future requests
> and is already added to the FSM. I could have folded this count into the
> extra blocks in RelationAddExtraBlocks, but in that case I would need an
> extra "if" to reserve one buffer for this backend and then return that
> specific buffer to the caller, and the caller would also need to
> distinguish between a backend that only wants to add one block and one that
> got its block along with the extra blocks.
>
> I think the code is simpler this way: every backend that comes down here
> adds one block for its own use, and all the other functionality and logic
> sits above, i.e. whether to take the lock and whether to add extra blocks.
>
>

I also think the code simplicity makes this worth it.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-23 18:39:59
Message-ID: CA+TgmoYos5vSppA0ym5B+b_parkDBC=5+-Dwcit9WrgBGvTRww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
> I also think the code simplicity makes this worth it.

Agreed. I went over this patch and did a cleanup pass today. I
discovered that the LockWaiterCount() function was broken if you tried
to use it on a lock that you didn't hold or a lock that you held in any
mode other than exclusive, so I tried to fix that. I rewrote a lot of
the comments and tightened some other things up. The result is attached.

I'm baffled by the same code Amit asked about upthread, even though
there's now a comment:

+            /*
+             * Here we are calling RecordAndGetPageWithFreeSpace
+             * instead of GetPageWithFreeSpace, because other backend
+             * who have got the lock might have added extra blocks in
+             * the FSM and its possible that FSM tree might not have
+             * been updated so far and we will not get the page by
+             * searching from top using GetPageWithFreeSpace, so we
+             * need to search the leaf page directly using our last
+             * valid target block.
+             *
+             * XXX. I don't understand what is happening here. -RMH
+             */

I've read this over several times and looked at
RecordAndGetPageWithFreeSpace() and I'm still confused. First of all,
if the lock was acquired by some other backend which did
RelationAddExtraBlocks(), it *will* have updated the FSM - that's the
whole point. Second, if the other backend extended the relation in
some other manner and did not extend the FSM, how does calling
RecordAndGetPageWithFreeSpace help? As far as I can see,
GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
searching the FSM, so if one is stymied the other will be too. What
am I missing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
multi_extend_v11.patch text/x-diff 8.5 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-23 18:52:24
Message-ID: 56F2E5E8.6050406@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 23/03/16 19:39, Robert Haas wrote:
> On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>> I also think the code simplicity makes this worth it.
>
> Agreed. I went over this patch and did a cleanup pass today. I
> discovered that the LockWaiterCount() function was broken if you tried
> to use it on a lock that you didn't hold or a lock that you held in any
> mode other than exclusive, so I tried to fix that. I rewrote a lot of
> the comments and tightened some other things up. The result is attached.
>
> I'm baffled by the same code Amit asked about upthread, even though
> there's now a comment:
>
> +            /*
> +             * Here we are calling RecordAndGetPageWithFreeSpace
> +             * instead of GetPageWithFreeSpace, because other backend
> +             * who have got the lock might have added extra blocks in
> +             * the FSM and its possible that FSM tree might not have
> +             * been updated so far and we will not get the page by
> +             * searching from top using GetPageWithFreeSpace, so we
> +             * need to search the leaf page directly using our last
> +             * valid target block.
> +             *
> +             * XXX. I don't understand what is happening here. -RMH
> +             */
>
> I've read this over several times and looked at
> RecordAndGetPageWithFreeSpace() and I'm still confused. First of all,
> if the lock was acquired by some other backend which did
> RelationAddExtraBlocks(), it *will* have updated the FSM - that's the
> whole point.

That's a good point; maybe this coding is a bit too defensive.

> Second, if the other backend extended the relation in
> some other manner and did not extend the FSM, how does calling
> RecordAndGetPageWithFreeSpace help? As far as I can see,
> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
> searching the FSM, so if one is stymied the other will be too. What
> am I missing?
>

RecordAndGetPageWithFreeSpace will extend the FSM, as it calls
fsm_set_and_search, which in turn calls fsm_readbuf with extend = true.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-23 19:01:44
Message-ID: CA+TgmoYRP5-4YTMf1x1pNWVQrYW4_XUqGF1CsShU3_p0xg_GjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>> Second, if the other backend extended the relation in
>> some other manner and did not extend the FSM, how does calling
>> RecordAndGetPageWithFreeSpace help? As far as I can see,
>> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
>> searching the FSM, so if one is stymied the other will be too. What
>> am I missing?
>>
>
> The RecordAndGetPageWithFreeSpace will extend FSM as it calls
> fsm_set_and_search which in turn calls fsm_readbuf with extend = true.

So how does that help? If I'm reading this right, the new block will
be all zeroes which means no space available on any of those pages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-23 19:19:08
Message-ID: 56F2EC2C.3000804@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 23/03/16 20:01, Robert Haas wrote:
> On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>>> Second, if the other backend extended the relation in
>>> some other manner and did not extend the FSM, how does calling
>>> RecordAndGetPageWithFreeSpace help? As far as I can see,
>>> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
>>> searching the FSM, so if one is stymied the other will be too. What
>>> am I missing?
>>>
>>
>> The RecordAndGetPageWithFreeSpace will extend FSM as it calls
>> fsm_set_and_search which in turn calls fsm_readbuf with extend = true.
>
> So how does that help? If I'm reading this right, the new block will
> be all zeroes which means no space available on any of those pages.
>

I am a bit confused as to what exactly you are saying, but what will
happen is that we get back to the while loop and try again, so eventually we
should either find a block with enough free space or add a new one (not sure
whether this would actually ever happen in practice, though; in a heavily
concurrent workload where the FSM was not correctly extended during relation
extension, we might just loop here forever).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-23 19:33:38
Message-ID: 56F2EF92.6020303@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 23/03/16 20:19, Petr Jelinek wrote:
> On 23/03/16 20:01, Robert Haas wrote:
>> On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com>
>> wrote:
>>>> Second, if the other backend extended the relation in
>>>> some other manner and did not extend the FSM, how does calling
>>>> RecordAndGetPageWithFreeSpace help? As far as I can see,
>>>> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
>>>> searching the FSM, so if one is stymied the other will be too. What
>>>> am I missing?
>>>>
>>>
>>> The RecordAndGetPageWithFreeSpace will extend FSM as it calls
>>> fsm_set_and_search which in turn calls fsm_readbuf with extend = true.
>>
>> So how does that help? If I'm reading this right, the new block will
>> be all zeroes which means no space available on any of those pages.
>>
>
> I am a bit confused as to what exactly you are saying, but what will
> happen is that we get back to the while loop and try again, so eventually
> we should either find a block with enough free space or add a new one (not
> sure whether this would actually ever happen in practice, though; in a
> heavily concurrent workload where the FSM was not correctly extended
> during relation extension, we might just loop here forever).
>

Btw thinking about it some more, ISTM that not finding the block and
just doing the extension if the FSM wasn't extended correctly previously
is probably cleaner behavior than what we do now. The reasoning for that
opinion is that if the FSM wasn't extended, we'll fix it by doing
relation extension since we know we do both in this code path and also
if we could not find page before we'll most likely not find one even on
retry and if there was page added at the end by extension that we might
reuse partially here then there is no harm in adding new one anyway as
the whole point of this patch is that it does a bigger extension than
strictly necessary so insisting on page reuse for something that seems
like only theoretical possibility that does not even exist in current
code does not seem right.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-23 19:43:12
Message-ID: CA+Tgmoap2vBpXp29LOXMiiqSd0pwkcOdmwnsoS_H_dymn9J=Cw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2016 at 3:33 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
> Btw thinking about it some more, ISTM that not finding the block and just
> doing the extension if the FSM wasn't extended correctly previously is
> probably cleaner behavior than what we do now. The reasoning for that
> opinion is that if the FSM wasn't extended, we'll fix it by doing relation
> extension since we know we do both in this code path and also if we could
> not find page before we'll most likely not find one even on retry and if
> there was page added at the end by extension that we might reuse partially
> here then there is no harm in adding new one anyway as the whole point of
> this patch is that it does a bigger extension than strictly necessary so
> insisting on page reuse for something that seems like only theoretical
> possibility that does not even exist in current code does not seem right.

I'm not sure I completely follow this. The fact that the last
sentence is 9 lines long may be related. :-)

I think it's pretty clearly important to re-check the FSM after
acquiring the extension lock. Otherwise, imagine that 25 backends
arrive at the exact same time. The first one gets the lock and
extends the relation 500 pages; the next one, 480, and so on. In
total, they extend the relation by 6500 pages, which is a bit rich.
Rechecking the FSM after acquiring the lock prevents that from
happening, and that's a very good thing. We'll do one 500-page
extension and that's it.
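
The arithmetic is easy to verify with a throwaway snippet (illustrative
only):

    #include <stdio.h>

    int
    main(void)
    {
        int total = 0;
        int waiters;

        /* without the recheck, each arriving backend extends by 20 pages
         * per waiter it still sees: 25*20, then 24*20, ... down to 1*20 */
        for (waiters = 25; waiters >= 1; waiters--)
            total += waiters * 20;

        printf("total pages extended: %d\n", total);    /* prints 6500 */
        return 0;
    }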

However, I don't think using RecordAndGetPageWithFreeSpace rather than
GetPageWithFreeSpace is appropriate. We've already recorded free
space on that page, and recording it again is a bad idea. It's quite
possible that by the time we get the lock our old value is totally
inaccurate. If there's some advantage to searching in the more
targeted way that RecordAndGetPageWithFreeSpace does over
GetPageWithFreeSpace then we need a new API into the freespace stuff
that does the more targeted search without updating anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-23 20:01:05
Message-ID: 56F2F601.1030400@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 23/03/16 20:43, Robert Haas wrote:
> On Wed, Mar 23, 2016 at 3:33 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>> Btw thinking about it some more, ISTM that not finding the block and just
>> doing the extension if the FSM wasn't extended correctly previously is
>> probably cleaner behavior than what we do now. The reasoning for that
>> opinion is that if the FSM wasn't extended, we'll fix it by doing relation
>> extension since we know we do both in this code path and also if we could
>> not find page before we'll most likely not find one even on retry and if
>> there was page added at the end by extension that we might reuse partially
>> here then there is no harm in adding new one anyway as the whole point of
>> this patch is that it does a bigger extension than strictly necessary so
>> insisting on page reuse for something that seems like only theoretical
>> possibility that does not even exist in current code does not seem right.
>
> I'm not sure I completely follow this. The fact that the last
> sentence is 9 lines long may be related. :-)
>

I tend to do that sometimes :)

> I think it's pretty clearly important to re-check the FSM after
> acquiring the extension lock. Otherwise, imagine that 25 backends
> arrive at the exact same time. The first one gets the lock and
> extends the relation 500 pages; the next one, 480, and so on. In
> total, they extend the relation by 6500 pages, which is a bit rich.
> Rechecking the FSM after acquiring the lock prevents that from
> happening, and that's a very good thing. We'll do one 500-page
> extension and that's it.
>

Right, but that would only happen if all the backends did it using different
code which does not do the FSM extension, because the current code does
extend the FSM, and the point of using RecordAndGetPageWithFreeSpace seems to
be "just in case" somebody is doing the extension differently (at least I
don't see any other reason). So basically I am not saying we shouldn't do the
search, but that I agree GetPageWithFreeSpace should be enough, as the worst
that can happen is that we overextend the relation in case some theoretical
code from somewhere else also extended the relation without extending the FSM
(AFAICS).

But maybe Dilip had some other reason for using
RecordAndGetPageWithFreeSpace that is not documented.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-24 01:43:56
Message-ID: CAA4eK1J_U+DD+i-AA9v9xOeZww7ud7-szJ_t7KZcQVFhXd6UgQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 12:09 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com>
wrote:
>
> I've read this over several times and looked at
> RecordAndGetPageWithFreeSpace() and I'm still confused. First of all,
> if the lock was acquired by some other backend which did
> RelationAddExtraBlocks(), it *will* have updated the FSM - that's the
> whole point.
>

It doesn't update the FSM all the way up to the root in some cases, as per
the comments on top of RecordPageWithFreeSpace, and the code as well.

>
> Second, if the other backend extended the relation in
> some other manner and did not extend the FSM, how does calling
> RecordAndGetPageWithFreeSpace help? As far as I can see,
> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
> searching the FSM, so if one is stymied the other will be too. What
> am I missing?
>

RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
it, rather than from the top, so even if RecordPageWithFreeSpace() doesn't
update up to the root, it will be able to find the newly added page. I agree
with what you said in another mail: we should introduce a new API to do a
more targeted search for such cases.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-24 02:44:38
Message-ID: CA+TgmoaQ9zv-NUMhNur3K=ie_EOJ+U5K2vfW3tcy3WH7HNMfug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
> it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
> update till root, it will be able to search the newly added page. I agree
> with whatever you have said in another mail that we should introduce a new
> API to do a more targeted search for such cases.

OK, let's do that, then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-24 06:04:22
Message-ID: CAFiTN-vNE4JyYYozbhGyKs7Q5yhzj6th+TvP6TTmyE=1fmQTag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed
> to
> > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
> > update till root, it will be able to search the newly added page. I
> agree
> > with whatever you have said in another mail that we should introduce a
> new
> > API to do a more targeted search for such cases.
>
> OK, let's do that, then.

OK, I have added a new API which just finds a free block, starting the
search from the last given page.

1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I don't
like this name, but I could not come up with anything better; please suggest
one.

Also, the body of GetPageWithFreeSpaceUsingOldPage looks very similar to
RecordAndGetPageWithFreeSpace; I tried to merge the two, but that would
require passing an extra parameter to the function.

2. I also had to write one more function, fsm_search_from_addr, instead of
using fsm_set_and_search, so that we can find a block without updating the
other slot.

I have run a performance test just to confirm the results, and performance
is the same as before, with both COPY and INSERT.

3. I have also run the pgbench read-write test that Amit suggested upthread.
No regression or improvement with the pgbench workload:

Clients    Base     Patch
1           899       914
8          5397      5413
32        18170     18052
64        29850     29941

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v12.patch application/octet-stream 11.6 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-24 08:18:51
Message-ID: 56F3A2EB.2080208@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 24/03/16 07:04, Dilip Kumar wrote:
>
> On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com
> <mailto:robertmhaas(at)gmail(dot)com>> wrote:
>
> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com <mailto:amit(dot)kapila16(at)gmail(dot)com>> wrote:
> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
> > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
> > update till root, it will be able to search the newly added page. I agree
> > with whatever you have said in another mail that we should introduce a new
> > API to do a more targeted search for such cases.
>
> OK, let's do that, then.
>
>
> OK, I have added a new API which just finds a free block, starting the
> search from the last given page.
>
> 1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I don't
> like this name, but I could not come up with anything better; please
> suggest one.
>

GetNearestPageWithFreeSpace? (Although I'm not sure that's an accurate
description; maybe "Nearby" would be better.)

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-24 10:13:37
Message-ID: CAA4eK1J_WF75baG9-eZCa1eKnHGsse_6aP4HTCbfGp2QH4V01A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 1:48 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>
> On 24/03/16 07:04, Dilip Kumar wrote:
>>
>>
>> On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com
>> <mailto:robertmhaas(at)gmail(dot)com>> wrote:
>>
>> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila
>> <amit(dot)kapila16(at)gmail(dot)com <mailto:amit(dot)kapila16(at)gmail(dot)com>> wrote:
>> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage
passed to
>> > it, rather than from top, so even if RecordPageWithFreeSpace()
doesn't
>> > update till root, it will be able to search the newly added page.
I agree
>> > with whatever you have said in another mail that we should
introduce a new
>> > API to do a more targeted search for such cases.
>>
>> OK, let's do that, then.
>>
>>
>> Ok, I have added new API which just find the free block from and start
>> search from last given page.
>>
>> 1. I have named the new function as GetPageWithFreeSpaceUsingOldPage, I
>> don't like this name, but I could not come up with any better, Please
>> suggest one.
>>
>

1.
+GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
+ Size spaceNeeded)
{
..
+ /*
+ * If fsm_set_and_search found a suitable new block, return that.
+ * Otherwise, search as usual.
+ */
..
}

In the above comment, you are referring to the wrong function.

2.
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+ Buffer buf;
+ int newslot = -1;
+
+ buf = fsm_readbuf(rel, addr, true);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (minValue != 0)
+ {
+ /* Search while we still hold the lock */
+ newslot = fsm_search_avail(buf, minValue,
+ addr.level == FSM_BOTTOM_LEVEL,
+ false);

In this new API, I don't understand why we need the minValue != 0 check;
if the user of the API doesn't want to search for space > 0, then what is the
point of calling this API at all? I think this API should use an Assert for
minValue != 0, unless you see a reason not to.

>
> GetNearestPageWithFreeSpace? (Although I'm not sure that's an accurate
description; maybe "Nearby" would be better.)
>

Better than what is used in the patch.

Yet another possibility could be to call it GetPageWithFreeSpaceExtended and
call it from GetPageWithFreeSpace with a value of InvalidBlockNumber for
oldPage.
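
A sketch of that wrapper pattern (the types and signatures here are
simplified stand-ins, not the actual freespace.c declarations):

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;
    typedef size_t Size;

    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    /* extended search: honors oldPage when the caller has one (stub body) */
    static BlockNumber
    GetPageWithFreeSpaceExtended(BlockNumber oldPage, Size spaceNeeded)
    {
        if (oldPage != InvalidBlockNumber)
        {
            /* a targeted search starting at oldPage's FSM leaf goes here */
        }
        /* otherwise (or on failure) fall back to the usual top-down search */
        (void) spaceNeeded;
        return InvalidBlockNumber;      /* nothing found in this stub */
    }

    /* the existing entry point becomes a thin wrapper */
    static BlockNumber
    GetPageWithFreeSpace(Size spaceNeeded)
    {
        return GetPageWithFreeSpaceExtended(InvalidBlockNumber, spaceNeeded);
    }

    int
    main(void)
    {
        if (GetPageWithFreeSpace(100) == InvalidBlockNumber)
            printf("wrapper delegated with oldPage = InvalidBlockNumber\n");
        return 0;
    }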

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-24 11:17:17
Message-ID: CAFiTN-sH1PmdEwn=Xut0ne6Ux+fczb5jmDMpFSUhxP8jUjM9rg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 6:13 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

>
> 1.
> +GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
> + Size spaceNeeded)
> {
> ..
> + /*
> + * If fsm_set_and_search found a suitable new block, return that.
> + * Otherwise, search as usual.
> + */
> ..
> }
>
> In the above comment, you are referring wrong function.
>

Fixed

>
> 2.
> +static int
> +fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
> +{
> + Buffer buf;
> + int newslot = -1;
> +
> + buf = fsm_readbuf(rel, addr, true);
> + LockBuffer(buf, BUFFER_LOCK_SHARE);
> +
> + if (minValue != 0)
> + {
> + /* Search while we still hold the lock */
> + newslot = fsm_search_avail(buf, minValue,
> + addr.level == FSM_BOTTOM_LEVEL,
> + false);
>
> In this new API, I don't understand why we need minValue != 0 check,
> basically if user of API doesn't want to search for space > 0, then what is
> the need of calling this API? I think this API should use Assert for
> minValue!=0 unless you see reason for not doing so.
>
>
Agreed, it should be an Assert.

> >
> > GetNearestPageWithFreeSpace? (although not sure that's accurate
> description, maybe Nearby would be better)
> >
>
> Better than what is used in patch.
>
> Yet another possibility could be to call it as
> GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with
> value of oldPage as InvalidBlockNumber.
>

Yes, I like this. Changed accordingly.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v13.patch application/octet-stream 11.9 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-24 21:34:51
Message-ID: CA+TgmoZotDBrYAPXjnFcy0AG0fqDFUFWNBvAVKk_MYN1+6Vx9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 7:17 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> Yet another possibility could be to call it
>> GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with
>> the value of oldPage as InvalidBlockNumber.
>
> Yes, I like this. Changed it accordingly.

After thinking about this some more, I don't think this is the right
approach. I finally understand what's going on here:
RecordPageWithFreeSpace updates the FSM lazily, only adjusting the
leaves and not the upper levels. It relies on VACUUM to update the
upper levels. This seems like it might be a bad policy in general,
because VACUUM on a very large relation may be quite infrequent, and
you could lose track of a lot of space for a long time, leading to a
lot of extra bloat. However, it's a particularly bad policy for bulk
relation extension, because you're stuffing a large number of totally
free pages in there in a way that doesn't make them particularly easy
for anybody else to discover. There are two ways we can fail here:

1. Callers who use GetPageWithFreeSpace() rather than
GetPageWithFreeSpaceExtended() will fail to find the new pages if the
upper map levels haven't been updated by VACUUM.

2. Even callers who use GetPageWithFreeSpaceExtended() may fail to find
the new pages. This can happen in two separate ways, namely (a) the
lastValidBlock saved by RelationGetBufferForTuple() can be in the
middle of the relation someplace rather than near the end, or (b) the
bulk-extension performed by some other backend can have overflowed
onto some new FSM page that won't be searched even though a relatively
plausible lastValidBlock was passed.

It seems to me that since we're adding a whole bunch of empty pages at
once, it's worth the effort to update the upper levels of the FSM.
This isn't a case of discovering a single page with an extra few bytes
of storage available due to a HOT prune or something - this is a case
of putting at least 20 and plausibly hundreds of extra pages into the
FSM. The extra effort to update the upper FSM pages is trivial by
comparison with the cost of extending the relation by many blocks.

So, I suggest adding a new function FreeSpaceMapBulkExtend(BlockNumber
first_block, BlockNumber last_block) which sets all the FSM entries
for pages between first_block and last_block to 255 and then bubbles
that up to the higher levels of the tree and all the way to the root.
Have the bulk extend code use that instead of repeatedly calling
RecordPageWithFreeSpace. That should actually be much more efficient,
because it can call fsm_readbuf(), LockBuffer(), and
UnlockReleaseBuffer() just once per FSM page instead of once per FSM
page *per byte modified*. Maybe that makes no difference in practice,
but it can't hurt.
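
A rough sketch of that, using the existing freespace.c internals
(fsm_get_location, fsm_readbuf, fsm_set_avail); the propagation to the
upper levels is only indicated, and none of this is final code:

void
FreeSpaceMapBulkExtend(Relation rel, BlockNumber first_block,
                       BlockNumber last_block)
{
    BlockNumber blkno = first_block;

    while (blkno <= last_block)
    {
        uint16      slot;
        FSMAddress  addr = fsm_get_location(blkno, &slot);
        Buffer      buf = fsm_readbuf(rel, addr, true);

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        /* Set every slot on this FSM page that maps to a new block. */
        while (blkno <= last_block && slot < SlotsPerFSMPage)
        {
            fsm_set_avail(BufferGetPage(buf), slot, 255);
            slot++;
            blkno++;
        }

        MarkBufferDirtyHint(buf, false);
        UnlockReleaseBuffer(buf);

        /* ... then bubble the new values up through the parent levels. */
    }
}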

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-25 17:05:00
Message-ID: CAFiTN-t5mDimOkBQpbewBjgKPN8XFzJAgC7OzBLJqQ35UebfEg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> 1. Callers who use GetPageWithFreeSpace() rather than
> GetPageWithFreeSpaceExtended() will fail to find the new pages if the
> upper map levels haven't been updated by VACUUM.
>
> 2. Even callers who use GetPageWithFreeSpaceExtended() may fail to find
> the new pages. This can happen in two separate ways, namely (a) the
>

Yeah, that's the issue: if the extended pages spill to the next FSM page,
the other waiters will not find those pages, and one by one all the waiters
will end up adding extra pages. For example, if there are ~30 waiters then
total blocks extended = (25*(25+1)/2) * 20 =~ 6500 pages.

This is not the case every time, but it will happen whenever the heap
blocks roll over to a new FSM page.

- An FSM page can hold entries for 4096 heap blocks, so after every 8th
extension (assuming 512 blocks are extended at a time), it will extend
~6500 pages.

- Any new request to RelationGetBufferForTuple will be able to find those
pages, because by that time the backend which is extending the pages will
have set the new block using RelationSetTargetBlock.
(There are still chances that some blocks are left completely unused until
vacuum comes.)

I have changed the patch as per the suggestion (just a POC, because the
performance numbers are not that great).

Below is the performance comparison of base, the previous patch (v13), and
the latest patch (v14).

Performance of patch v14 is significantly lower than v13, I guess mainly
for the reasons below:
1. As per the above calculation, v13 extends ~6500 blocks (after every 8th
extension), and that's why it's performing well.

2. In v13, as soon as we extend a block we add it to the FSM, so it is
immediately available to new requesters. (In this patch also I tried adding
the blocks one by one to the FSM and updating the FSM tree up to the root
after all pages were added, but saw no significant improvement.)

3. fsm_update_recursive doesn't seem like a problem to me. Does it?

Copy 10000 tuples, of 4 bytes each.
---------------------------------------------
Client   base   patch v13   patch v14
1        118    147         126
2        217    276         269
4        210    421         347
8        166    630         375
16       145    813         415
32       124    985         451
64       --     974         455

Insert 1000 tuples of 1K size each.

Client   base   patch v13   patch v14
1        117    124         119
2        111    126         119
4        51     128         124
8        43     149         131
16       40     217         120
32       --     263         115
64       --     248         109

Note: I think the one-client numbers may just be run-to-run variance.

Does anyone see a problem in updating the FSM tree this way? I have
debugged it and seen that we are able to get the pages properly from the
tree, and the same is visible in the performance numbers of v14 compared
to base.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v14_poc.patch text/x-diff 11.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-26 02:37:00
Message-ID: CA+TgmoYy8cYhAnY2i4KDqJeZWXE7pAR4OROfKAz05-AK=PEe6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 25, 2016 at 1:05 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> 1. Callers who use GetPageWithFreeSpace() rather than
>> GetPageWithFreeSpaceExtended() will fail to find the new pages if the
>> upper map levels haven't been updated by VACUUM.
>>
>> 2. Even callers who use GetPageWithFreeSpaceExtended() may fail to find
>> the new pages. This can happen in two separate ways, namely (a) the
>
> Yeah, that's the issue: if the extended pages spill to the next FSM page,
> the other waiters will not find those pages, and one by one all the
> waiters will end up adding extra pages. For example, if there are ~30
> waiters then total blocks extended = (25*(25+1)/2) * 20 =~ 6500 pages.
>
> This is not the case every time, but it will happen whenever the heap
> blocks roll over to a new FSM page.

I think we need to start testing these patches not only in terms of
how *fast* they are but how *large* the relation ends up being when
we're done. A patch that inserts the rows slower but the final
relation is smaller may be better overall. Can you retest v13, v14,
and master, and post not only the timings but the relation size
afterwards? And maybe post the exact script you are using?

> Performance of patch v14 is significantly lower than v13, I guess mainly
> for the reasons below:
> 1. As per the above calculation, v13 extends ~6500 blocks (after every
> 8th extension), and that's why it's performing well.

That should be completely unnecessary, though. I mean, if the problem
is that it's expensive to repeatedly acquire and release the relation
extension lock, then bulk-extending even 100 blocks at a time should
be enough to fix that, because you've reduced the number of times that
has to be done by 99%. There's no way we should need to extend by
thousands of blocks to get good performance.

Maybe something like this would help:

if (needLock)
{
    if (!use_fsm)
        LockRelationForExtension(relation, ExclusiveLock);
    else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
    {
        BlockNumber last_blkno = RelationGetNumberOfBlocks(relation);

        targetBlock = GetPageWithFreeSpaceExtended(relation,
                                                   last_blkno,
                                                   len + saveFreeSpace);
        if (targetBlock != InvalidBlockNumber)
            goto loop;

        LockRelationForExtension(relation, ExclusiveLock);
        targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
        if (targetBlock != InvalidBlockNumber)
        {
            UnlockRelationForExtension(relation, ExclusiveLock);
            goto loop;
        }
        RelationAddExtraBlocks(relation, bistate);
    }
}

I think this is better than what you had before with lastValidBlock,
because we're actually interested in searching the free space map at
the *very end* of the relation, not wherever the last target block
happens to have been.

We could go further still and have GetPageWithFreeSpace() always
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-26 09:48:44
Message-ID: CAFiTN-tkX6gs-jL8VrPxg6OG9VUAKnObUq7r7pWQqASzdF5OwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2016 at 8:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I think we need to start testing these patches not only in terms of
> how *fast* they are but how *large* the relation ends up being when
> we're done. A patch that inserts the rows slower but the final
> relation is smaller may be better overall. Can you retest v13, v14,
> and master, and post not only the timings but the relation size
> afterwards? And maybe post the exact script you are using?
>

I have tested both size and performance; the scripts are attached to this mail.

COPY 1-10 byte tuples from 32 clients
                Base        V13          V14
                --------    ---------    ---------
TPS             123         874          446
No. Of Tuples   148270000   1049980000   536370000
Relpages        656089      4652593      2485482

INSERT 1028-byte tuples from 16 clients
                Base        V13          V14
                --------    ---------    ---------
TPS             42          211          120
No. Of Tuples   5149000     25343000     14524000
Rel Pages       735577      3620765      2140612

As per the above results, if we compare the number of tuples with the
respective relpages, neither v13 nor v14 ends up with a large fraction of
extra unused pages.

As per my calculation, for INSERT (1028-byte tuples) each page contains 7
tuples, so the number of pages required is:
Base: 5149000/7 = 735571 (from relpages we can see 6 pages are extra)
V13 : 25343000/7 = 3620428 (from relpages we can see ~300 pages are extra)
V14 : 14524000/7 = 2074857 (from relpages we can see ~70000 pages are extra)

So V14 has the maximum number of extra pages. I expected V13 to have the
most unused pages, but it's v14; I tested this over multiple runs and v13
is always the winner. I also tested with other client counts such as 8 and
32, and v13 always has only ~60-300 extra pages out of a total of ~2-4
million pages.

Attached files:
-------------------
test_size_ins.sh --> automated script to run the insert test and calculate
tuples and relpages.
test_size_copy.sh --> automated script to run the copy test and calculate
tuples and relpages.
copy_script --> copy pgbench script used by test_size_copy.sh
insert_script --> insert pgbench script used by test_size_ins.sh

> Maybe something like this would help:
>
> if (needLock)
> {
>     if (!use_fsm)
>         LockRelationForExtension(relation, ExclusiveLock);
>     else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
>     {
>         BlockNumber last_blkno = RelationGetNumberOfBlocks(relation);
>
>         targetBlock = GetPageWithFreeSpaceExtended(relation,
>                                                    last_blkno,
>                                                    len + saveFreeSpace);
>         if (targetBlock != InvalidBlockNumber)
>             goto loop;
>
>         LockRelationForExtension(relation, ExclusiveLock);
>         targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
>         if (targetBlock != InvalidBlockNumber)
>         {
>             UnlockRelationForExtension(relation, ExclusiveLock);
>             goto loop;
>         }
>         RelationAddExtraBlocks(relation, bistate);
>     }
> }
>
> I think this is better than what you had before with lastValidBlock,
> because we're actually interested in searching the free space map at
> the *very end* of the relation, not wherever the last target block
> happens to have been.
>
> We could go further still and have GetPageWithFreeSpace() always
> search the last, say, two pages of the FSM in all cases. But that
> might be expensive. The extra call to RelationGetNumberOfBlocks seems
> cheap enough here because the alternative is to wait for a contended
> heavyweight lock.
>

I will try the test with this also and post the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
copy_script application/octet-stream 50 bytes
insert_script application/octet-stream 41 bytes
test_size_copy.sh application/x-sh 1.1 KB
test_size_ins.sh application/x-sh 1.4 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-27 11:52:58
Message-ID: CAFiTN-sJQsKVvexO-ek6pkaxN3OqwD0L_LnjFLXV5HFde4J5+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2016 at 3:18 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

>> search the last, say, two pages of the FSM in all cases. But that
>> might be expensive. The extra call to RelationGetNumberOfBlocks seems
>> cheap enough here because the alternative is to wait for a contended
>> heavyweight lock.
>>
>
> I will try the test with this also and post the results.
>

I have changed v14 as per this suggestion (v14_1) and the results are the
same as v14.

I have measured the relation size again, this time directly using the
relation-size function, so the results are easier to interpret.

Relation Size
-----------------
INSERT: 16000 transactions from 32 clients

           Base        v13         v14_1
           ---------   ---------   --------
TPS        37          255         112
Rel Size   17GB        17GB        18GB

COPY: 32000 transactions from 32 clients

           Base        v13         v14_1
           ---------   ---------   ---------
TPS        121         823         427
Rel Size   11GB        11GB        11GB

Scripts are attached to the mail:
----------------------------------------
test_size_ins.sh --> run the insert test and calculate relation size.
test_size_copy.sh --> run the copy test and calculate relation size.
copy_script --> copy pgbench script used by test_size_copy.sh
insert_script --> insert pgbench script used by test_size_ins.sh
multi_extend_v14_poc_v1.patch --> modified v14 patch.

I also tried modifying v14 from a different angle.

One approach is like below:
-------------------------
In AddExtraBlock
{
    add the pages to the FSM one by one, like v13 does;
    then update the full FSM tree up to the root
}

Results:
----------
1. With this, performance is a little lower than v14, but the problem of
the extra relation size is solved.
2. From this we can conclude that the extra relation size in v14 arises
because, while the pages are being extended, they are not immediately
available to others, and in the end some of the pages are left unused.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
test_size_copy.sh application/x-sh 1.3 KB
test_size_ins.sh application/x-sh 1.5 KB
copy_script application/octet-stream 50 bytes
insert_script application/octet-stream 41 bytes
multi_extend_v14_poc_v1.patch application/octet-stream 14.2 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-27 12:00:30
Message-ID: CAFiTN-uqEYo=jSqNMwTDg0tm8FrKB3rnAWoczg0vkB_mxeKH3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2016 at 3:18 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

>> We could go further still and have GetPageWithFreeSpace() always
>> search the last, say, two pages of the FSM in all cases. But that
>> might be expensive. The extra call to RelationGetNumberOfBlocks seems
>> cheap enough here because the alternative is to wait for a contended
>> heavyweight lock.
>>
>
> I will try the test with this also and post the results.
>

**Something went wrong with the last mail; it seems to have become a
separate thread, so I am resending the same mail.**

I have changed v14 as per this suggestion (v14_1) and the results are the
same as v14.

I have measured the relation size again, this time directly using the
relation-size function, so the results are easier to interpret.

Relation Size
-----------------
INSERT: 16000 transactions from 32 clients

           Base        v13         v14_1
           ---------   ---------   --------
TPS        37          255         112
Rel Size   17GB        17GB        18GB

COPY: 32000 transactions from 32 clients

           Base        v13         v14_1
           ---------   ---------   ---------
TPS        121         823         427
Rel Size   11GB        11GB        11GB

Scripts are attached to the mail:
----------------------------------------
test_size_ins.sh --> run the insert test and calculate relation size.
test_size_copy.sh --> run the copy test and calculate relation size.
copy_script --> copy pgbench script used by test_size_copy.sh
insert_script --> insert pgbench script used by test_size_ins.sh
multi_extend_v14_poc_v1.patch --> modified v14 patch.

I also tried modifying v14 from a different angle.

One approach is like below:
-------------------------
In AddExtraBlock
{
    add the pages to the FSM one by one, like v13 does;
    then update the full FSM tree up to the root
}

Results:
----------
1. With this, performance is a little lower than v14, but the problem of
the extra relation size is solved.
2. From this we can conclude that the extra relation size in v14 arises
because, while the pages are being extended, they are not immediately
available to others, and in the end some of the pages are left unused.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
copy_script application/octet-stream 50 bytes
insert_script application/octet-stream 41 bytes
multi_extend_v14_poc_v1.patch application/octet-stream 14.2 KB
test_size_copy.sh application/x-sh 1.3 KB
test_size_ins.sh application/x-sh 1.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-27 20:25:33
Message-ID: CA+TgmoYnestWvampMas8rjFFruanB3Eyv6VADO2om2b3q6AXDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> Relation Size
> -----------------
> INSERT: 16000 transactions from 32 clients
>
>            Base        v13         v14_1
>            ---------   ---------   --------
> TPS        37          255         112
> Rel Size   17GB        17GB        18GB
>
> COPY: 32000 transactions from 32 clients
>
>            Base        v13         v14_1
>            ---------   ---------   ---------
> TPS        121         823         427
> Rel Size   11GB        11GB        11GB
>
>
> Scripts are attached to the mail:
> ----------------------------------------
> test_size_ins.sh --> run the insert test and calculate relation size.
> test_size_copy.sh --> run the copy test and calculate relation size.
> copy_script --> copy pgbench script used by test_size_copy.sh
> insert_script --> insert pgbench script used by test_size_ins.sh
> multi_extend_v14_poc_v1.patch --> modified v14 patch.
>
> I also tried modifying v14 from a different angle.
>
> One approach is like below:
> -------------------------
> In AddExtraBlock
> {
>     add the pages to the FSM one by one, like v13 does;
>     then update the full FSM tree up to the root
> }

Not following this. Did you attach this version?

> Results:
> ----------
> 1. With this, performance is a little lower than v14, but the problem of
> the extra relation size is solved.
> 2. From this we can conclude that the extra relation size in v14 arises
> because, while the pages are being extended, they are not immediately
> available to others, and in the end some of the pages are left unused.

I agree with that conclusion. I'm not quite sure where that leaves
us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.

Maybe we could do this - not sure if it's what you were suggesting above:

1. Add the pages one at a time, and do RecordPageWithFreeSpace after each one.
2. After inserting them all, go back and update the upper levels of
the FSM tree up to the root.
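
In code form, that two-step idea would look roughly like this (a sketch
only; UpdateUpperFsmLevels is a hypothetical helper, not an existing
function):

/* Step 1: make each new page individually visible in the FSM leaves. */
for (blkno = firstBlock; blkno <= lastBlock; blkno++)
    RecordPageWithFreeSpace(relation, blkno, freespace);

/* Step 2: one pass to recompute the upper FSM levels up to the root. */
UpdateUpperFsmLevels(relation, firstBlock, lastBlock);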

Another idea is:

If ConditionalLockRelationForExtension fails to get the lock
immediately, search the last *two* pages of the FSM for a free page.

Just brainstorming here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-28 01:51:02
Message-ID: CAFiTN-v5OyKKnG5ExVh0bJiOnzAe-Cai1BA_69zZ7aPg9BNzrg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 28, 2016 at 1:55 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> > One approach is like below:
> > -------------------------
> > In AddExtraBlock
> > {
> >     add the pages to the FSM one by one, like v13 does;
> >     then update the full FSM tree up to the root
> > }
>
> Not following this. Did you attach this version?
>

No, I did not attach it. I tried this during a rough experiment and did
not produce a patch; I will send it.

> > Results:
> > ----------
> > 1. With this, performance is a little lower than v14, but the problem
> > of the extra relation size is solved.
> > 2. From this we can conclude that the extra relation size in v14 arises
> > because, while the pages are being extended, they are not immediately
> > available to others, and in the end some of the pages are left unused.
>
> I agree with that conclusion. I'm not quite sure where that leaves
> us, though. We can go back to v13, but why isn't that producing extra
> pages? It seems like it should: whenever a bulk extend rolls over to
> a new FSM page, concurrent backends will search either the old or the
> new one but not both.
>
> Maybe we could do this - not sure if it's what you were suggesting above:
>
> 1. Add the pages one at a time, and do RecordPageWithFreeSpace after each
> one.
> 2. After inserting them all, go back and update the upper levels of
> the FSM tree up to the root.
>

Yes, the same; that is what I wanted to explain above.

> Another idea is:
>
> If ConditionalLockRelationForExtension fails to get the lock
> immediately, search the last *two* pages of the FSM for a free page.
>
> Just brainstorming here.

I think this is the better option. Since we will search the last two pages
of the FSM tree, there is no need to update the upper levels of the FSM
tree. Right?

I will test and post the result with this option.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-28 05:30:43
Message-ID: CAA4eK1KFp78eSt+HV-20KH2JC_8+w7w2hr5T7uy1brZYqsb5wg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 28, 2016 at 1:55 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
>
> > Results:
> > ----------
> > 1. With this, performance is a little lower than v14, but the problem
> > of the extra relation size is solved.
> > 2. From this we can conclude that the extra relation size in v14 arises
> > because, while the pages are being extended, they are not immediately
> > available to others, and in the end some of the pages are left unused.
>
> I agree with that conclusion. I'm not quite sure where that leaves
> us, though. We can go back to v13, but why isn't that producing extra
> pages? It seems like it should: whenever a bulk extend rolls over to
> a new FSM page, concurrent backends will search either the old or the
> new one but not both.
>

I have not debugged the flow, but by looking at v13 code, it looks like it
will search both old and new. In
function GetPageWithFreeSpaceExtended()->fsm_search_from_addr()->fsm_search_avail(),
the basic idea of search is: Start the search from the target slot. At
every step, move one
node to the right, then climb up to the parent. Stop when we reach a node
with enough free space (as we must, since the root has enough space).
So shouldn't it be able to find the new FSM page where the bulk extend
rolls over?
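
For reference, the core of that search loop in fsm_search_avail()
(fsmpage.c) looks roughly like this (condensed; bounds handling and the
wrap-around details are omitted):

nodeno = target_node;
while (nodeno > 0)
{
    if (fsmpage->fp_nodes[nodeno] >= minvalue)
        break;

    /*
     * Move one node to the right, wrapping around on the same level if
     * necessary, then climb up to that node's parent.
     */
    nodeno = parentof(rightneighbor(nodeno));
}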

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-28 06:11:42
Message-ID: CAFiTN-uk34U3UiGFh7smyuVsr5MRFWYopu_roASK0ux7qMuWaA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 28, 2016 at 11:00 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> I have not debugged the flow, but by looking at v13 code, it looks like it
> will search both old and new. In
> function GetPageWithFreeSpaceExtended()->fsm_search_from_addr()->fsm_search_avail(),
> the basic idea of search is: Start the search from the target slot. At
> every step, move one
> node to the right, then climb up to the parent. Stop when we reach a node
> with enough free space (as we must, since the root has enough space).
> So shouldn't it be able to find the new FSM page where the bulk extend
> rolls over?
>

This is actually a multi-level tree, where each FSM page contains one slot
tree. So fsm_search_avail() searches only the slot tree inside one FSM
page, but here we want to go on to the next FSM page.
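
In other words, stepping to the next leaf FSM page has to happen a level
above fsm_search_avail(), e.g. something like this hypothetical follow-up
(FSMAddress carries a level and a logical page number):

if (newslot == -1 && addr.level == FSM_BOTTOM_LEVEL)
{
    FSMAddress next = addr;

    next.logpageno++;       /* try the adjacent leaf FSM page */
    newslot = fsm_search_from_addr(rel, next, minValue);
}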

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-28 09:32:54
Message-ID: CAFiTN-vYhexxra6U_E_Kcio56jrzJJxVji7K91o3feuZUN7i-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 28, 2016 at 7:21 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

>> I agree with that conclusion. I'm not quite sure where that leaves
>> us, though. We can go back to v13, but why isn't that producing extra
>> pages? It seems like it should: whenever a bulk extend rolls over to
>> a new FSM page, concurrent backends will search either the old or the
>> new one but not both.
>>
>
Our open question was why V13 is not producing extra pages. I tried to
print some logs and debug. It seems to me that when blocks are spilling to
the next FSM page, the backends that are waiting on the lock will not find
those blocks, because they are searching the old FSM page. But the backend
which is extending the pages will set RelationSetTargetBlock to the latest
blocks, and that will make the new FSM page available for search by new
requesters.

1. So this is why v13 (in normal cases*1) does not produce unused pages.
2. But it will produce extra pages (which will be consumed by new
requesters), because all the waiters come one by one and each extends 512
pages.

*1: Above I mentioned the normal case; there are still cases where V13 can
leave unused pages. For example: one by one, each waiter gets the lock and
extends pages, but none of them gets down to RelationSetTargetBlock, so the
new pages are not yet visible to new requesters; eventually the blocks
spill to a third FSM page, then the backends go down one by one and call
RelationSetTargetBlock, and suppose the last one sets it to a block on the
third FSM page; in that case, the pages on the second FSM page are left
unused.

>> Maybe we could do this - not sure if it's what you were suggesting above:
>>
>> 1. Add the pages one at a time, and do RecordPageWithFreeSpace after
>> each one.
>> 2. After inserting them all, go back and update the upper levels of
>> the FSM tree up to the root.
>>
> I think this is the better option. Since we will search the last two
> pages of the FSM tree, there is no need to update the upper levels of the
> FSM tree. Right?
>
> I will test and post the results with this option.
>

I have created this patch and the results are below.

* All test scripts are the same as those attached upthread.

1. Relation size: no change; it's the same as base and v13.

2. INSERT of 1000 tuples of 1028 bytes each: performance
-----------------------------------------------------------
Client   base   v13   v15
1        117    124   122
2        111    126   123
4        51     128   125
8        43     149   135
16       40     217   147
32       35     263   141

3. COPY of 10000 tuples: performance
----------------------------------------------
Client   base   v13   v15
1        118    147   155
2        217    276   273
4        210    421   457
8        166    630   643
16       145    813   595
32       124    985   598

Conclusion:
---------------
1. I think v15 solves the problem that exists in v13; performance is
significantly higher than base, and the relation size is also stable. So
IMHO v15 is the winner over the other solutions. What do others think?

2. There is no point in concluding that v13 is better than v15, because
v13 has the bug of sometimes extending more pages than expected, in an
uncontrolled way, and that can also be the reason v13 performs better.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v15.patch application/octet-stream 12.9 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-28 12:46:08
Message-ID: CAFiTN-tkYR3S_DdF2U9gQbNi7zUEJdQtgcOVHadBcTeVS+d9_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 28, 2016 at 3:02 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

> 1. Relation size: no change; it's the same as base and v13.
>
> 2. INSERT of 1000 tuples of 1028 bytes each: performance
> -----------------------------------------------------------
> Client   base   v13   v15
> 1        117    124   122
> 2        111    126   123
> 4        51     128   125
> 8        43     149   135
> 16       40     217   147
> 32       35     263   141
>
> 3. COPY of 10000 tuples: performance
> ----------------------------------------------
> Client   base   v13   v15
> 1        118    147   155
> 2        217    276   273
> 4        210    421   457
> 8        166    630   643
> 16       145    813   595
> 32       124    985   598
>
> Conclusion:
> ---------------
> 1. I think v15 solves the problem that exists in v13; performance is
> significantly higher than base, and the relation size is also stable. So
> IMHO v15 is the winner over the other solutions. What do others think?
>
> 2. There is no point in concluding that v13 is better than v15, because
> v13 has the bug of sometimes extending more pages than expected, in an
> uncontrolled way, and that can also be the reason v13 performs better.
>

Found one problem with v15, so sending a new version.
In v15 I was taking prev_blkno as the targetBlock; instead it should be the
last block of the relation at that time. Attaching the new patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v16.patch application/octet-stream 12.6 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-28 21:51:34
Message-ID: 56F9A766.3000504@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 28/03/16 14:46, Dilip Kumar wrote:
>
> Conclusion:
> ---------------
> 1. I think v15 solves the problem that exists in v13; performance is
> significantly higher than base, and the relation size is also stable. So
> IMHO v15 is the winner over the other solutions. What do others think?
>

It seems so. Do you have the ability to reasonably test with 64 clients?
I am mostly wondering whether we see the performance going further down or
just plateauing.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-29 01:56:35
Message-ID: CA+TgmoayhKhjFDAbnxFJV67Pitz-tS+LHCkd67DKZ4ckZvLfvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 27, 2016 at 9:51 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I think this is the better option. Since we will search the last two
> pages of the FSM tree, there is no need to update the upper levels of the
> FSM tree. Right?

Well, it's less important in that case, but I think it's still worth
doing. Some people are going to do just plain GetPageWithFreeSpace().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-29 01:59:58
Message-ID: CA+TgmoYnp7w_OCN=nHEytm6-6aMr4oXhsozC6J6czS3MzrN0GQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 28, 2016 at 8:46 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> Found one problem with v15, so sending a new version.
> In v15 I was taking prev_blkno as the targetBlock; instead it should be
> the last block of the relation at that time. Attaching the new patch.

     BlockNumber targetBlock,
-                otherBlock;
+                otherBlock,
+                prev_blkno = RelationGetNumberOfBlocks(relation);

Absolutely not. There is no way it's acceptable to introduce an
unconditional call to RelationGetNumberOfBlocks() into every call to
RelationGetBufferForTuple().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-29 04:38:55
Message-ID: CAA4eK1JWu5Xhmnpe2hOXw7WbHtBW6BacoiOMR106i2+Zv4xf8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 29, 2016 at 3:21 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:

> On 28/03/16 14:46, Dilip Kumar wrote:
>
>>
>> Conclusion:
>> ---------------
>> 1. I think v15 solves the problem that exists in v13; performance is
>> significantly higher than base, and the relation size is also stable.
>> So IMHO v15 is the winner over the other solutions. What do others
>> think?
>>
>>
> It seems so. Do you have the ability to reasonably test with 64 clients?
> I am mostly wondering whether we see the performance going further down
> or just plateauing.
>
>
Yes, that makes sense. One more point: if the reason for v13 giving better
performance is the extra blocks (which we believe can, in certain cases,
leak until the time VACUUM updates the FSM tree), do you think it makes
sense to test once with the lockWaiters * 20 limit increased to maybe
lockWaiters * 25 or lockWaiters * 30?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-29 08:39:13
Message-ID: CAFiTN-sxgazkW_xmXZ8OE838DV-_ELCni291sCenTDKfzXYiYg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 29, 2016 at 7:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Well, it's less important in that case, but I think it's still worth
> doing. Some people are going to do just plain GetPageWithFreeSpace().
>

I am attaching new version v17.

It works like this:

In RelationAddExtraBlocks
{
    -- add the blocks to the FSM one by one
    -- update the FSM tree all the way to the root
}

In RelationGetBufferForTuple
-- same as v14: search the FSM tree from the root (GetPageWithFreeSpace)

Summary:
--------------
1. By adding blocks to the FSM tree one by one, it solves the problem of
unused blocks in v14.
2. It updates the FSM tree all the way up to the root, so anybody searching
from the root can find the blocks.
3. It also searches for blocks from the root, so it doesn't have the
problem v15 has (exactly identifying which two FSM pages to search).
4. It solves the performance problem of v14 with some optimizations in the
logic for updating the FSM tree up to the root.
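
Put together, the control flow is roughly like this (a sketch only;
RelationExtensionLockWaiterCount() and UpdateFreeSpaceMap() are the helpers
this patch series introduces, and the details here may differ from the
actual patch):

static void
RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
{
    BlockNumber firstBlock = InvalidBlockNumber;
    BlockNumber blockNum = InvalidBlockNumber;
    Size        freespace = 0;
    int         extraBlocks;

    /* Extend in proportion to contention, capped at 512 blocks (4MB). */
    extraBlocks = Min(512, RelationExtensionLockWaiterCount(relation) * 20);

    while (extraBlocks-- > 0)
    {
        Buffer  buffer = ReadBufferBI(relation, P_NEW, bistate);
        Page    page = BufferGetPage(buffer);

        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        PageInit(page, BufferGetPageSize(buffer), 0);
        MarkBufferDirty(buffer);
        blockNum = BufferGetBlockNumber(buffer);
        freespace = PageGetHeapFreeSpace(page);
        UnlockReleaseBuffer(buffer);

        if (firstBlock == InvalidBlockNumber)
            firstBlock = blockNum;

        /* Make the page visible in the FSM leaf right away. */
        RecordPageWithFreeSpace(relation, blockNum, freespace);
    }

    /* One pass to propagate the new values up to the FSM root. */
    UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);
}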

Performance Data:
--------------------------
INSERT (same test as above):
Client   base   v17
1        117    120
2        111    123
4        51     124
8        43     135
16       40     145
32       35     144
64       --     140

COPY (same test as above):
Client   base   v17
1        118    117
2        217    220
4        210    379
8        166    574
16       145    594
32       124    599
64       --     609

Notes:
---------
If I change the block-search strategy in this patch, performance remains
almost the same, whether I:
1. search in two blocks, as v15 or v17 does, or
2. search first using the target block and, if that fails, search from the
top.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v17.patch application/octet-stream 12.0 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-29 17:29:48
Message-ID: CAFiTN-sMDXwMzZSwuEw=+EmVomngo3eH+zsDVwExur5hdDteUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 29, 2016 at 2:09 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

>
> Attaching new version v18

- Some cleanup work on v17.
- Improved the UpdateFreeSpaceMap function.
- Performance and space utilization are the same as v17.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v18.patch application/octet-stream 11.6 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-30 01:49:25
Message-ID: CA+Tgmobwq3R=g9Sy0zJqw8b+dMvz3HcuSpzmaZQFmExOwteHjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 29, 2016 at 1:29 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> Attaching new version v18
>
> - Some cleanup work on v17.
> - Improved the UpdateFreeSpaceMap function.
> - Performance and space utilization are the same as v17.

Looks better. Here's a v19 that I hacked on a bit.

Unfortunately, one compiler I tried this with had a pretty legitimate complaint:

hio.c: In function ‘RelationGetBufferForTuple’:
hio.c:231:20: error: ‘freespace’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:185:7: note: ‘freespace’ was declared here
hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:181:14: note: ‘blockNum’ was declared here

There's nothing whatsoever to prevent RelationExtensionLockWaiterCount
from returning 0.

It's also rather ugly that the call to UpdateFreeSpaceMap() assumes
that the last value returned by PageGetHeapFreeSpace() is as good as
any other, but maybe we can just install a comment explaining that
point; there's not an obviously better approach that I can see.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
multi_extend_v19.patch text/x-diff 11.8 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-30 02:21:30
Message-ID: CAFiTN-t5ELpH6g_8QhAc3VeG_3BhbFkuqUuGqzZYV=-qDrqLuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 30, 2016 at 7:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

Thanks for the review and the improved comments.

> hio.c: In function ‘RelationGetBufferForTuple’:
> hio.c:231:20: error: ‘freespace’ may be used uninitialized in this
> function [-Werror=uninitialized]
> hio.c:185:7: note: ‘freespace’ was declared here
> hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this
> function [-Werror=uninitialized]
> hio.c:181:14: note: ‘blockNum’ was declared here
>

I have fixed those in v20.

>
> There's nothing whatsoever to prevent RelationExtensionLockWaiterCount
> from returning 0.
>
> It's also rather ugly that the call to UpdateFreeSpaceMap() assumes
> that the last value returned by PageGetHeapFreeSpace() is as good as
> any other, but maybe we can just install a comment explaining that
> point; there's not an obviously better approach that I can see.
>

Added comments:

+    if (lockWaiters)
+        /*
+         * Here we use the same freespace value for all the blocks. That
+         * is OK, because they are all newly added blocks and have the
+         * same amount of free space. And even if some block that we just
+         * added to the FreeSpaceMap above has already been used by some
+         * backend, so that its free space is no longer the same, it does
+         * no harm: the actual free space will be recalculated by the
+         * user after getting the page.
+         */
+        UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);

Does this look good?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v20.patch application/octet-stream 12.2 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-30 02:54:27
Message-ID: CAFiTN-tnLe=Nr2SerQBv1UWEAr698nCviVPqBYoY3HmWSgy3pg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 30, 2016 at 7:51 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

> +    if (lockWaiters)
> +        /*
> +         * Here we use the same freespace value for all the blocks. That
> +         * is OK, because they are all newly added blocks and have the
> +         * same amount of free space. And even if some block that we
> +         * just added to the FreeSpaceMap above has already been used by
> +         * some backend, so that its free space is no longer the same,
> +         * it does no harm: the actual free space will be recalculated
> +         * by the user after getting the page.
> +         */
> +        UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);
>
> Does this look good?

Done in a better way:

+    lockWaiters = RelationExtensionLockWaiterCount(relation);
+
+    if (lockWaiters == 0)
+        return;
+

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
multi_extend_v21.patch application/octet-stream 12.2 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-31 04:59:16
Message-ID: CAFiTN-t=wrVREMJp=h_6NGXrbKT89XP+aoAtjELH3TFwKQL3LA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> Yes, that makes sense. One more point is that if the reason for v13
> giving better performance is extra blocks (which we believe in certain
> cases can leak till the time Vacuum updates the FSM tree), do you think it
> makes sense to once test by increasing lockWaiters * 20 limit to may
> be lockWaiters * 25 or lockWaiters * 30.

I tested COPY of 10000 records, increasing the number of blocks, just to
find out why we are not as good as v13. With
extraBlocks = Min(lockWaiters * 40, 2048) I got the results below.

COPY 10000
--------------------
Client   Patch (extraBlocks = Min(lockWaiters * 40, 2048))
16       752
32       708

This proves that the main reason for v13 being better is that it adds extra
blocks without control. Though v13 is still better than these results, I
think we can get there too by changing the multiplier and the max limit.

But I think we are OK with the max size as 4MB (512 blocks), right?

Does this test make sense?

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-31 08:01:25
Message-ID: CAA4eK1+pR_hEm4OEJ5DbcF75xtLmPb8A8cWZUPqy6ADTQMNiRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 10:29 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:

>
> On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
>> Yes, that makes sense. One more point is that if the reason for v13
>> giving better performance is extra blocks (which we believe in certain
>> cases can leak till the time Vacuum updates the FSM tree), do you think it
>> makes sense to once test by increasing lockWaiters * 20 limit to may
>> be lockWaiters * 25 or lockWaiters * 30.
>
>
> I tested COPY 10000 record, by increasing the number of blocks just to
> find out why we are not as good as V13
> with extraBlocks = Min( lockWaiters * 40, 2048) and got below results..
>
> COPY 10000
> --------------------
> Client   Patch (extraBlocks = Min(lockWaiters * 40, 2048))
> 16       752
> 32       708
>
> This proves that main reason of v13 being better is its adding extra
> blocks without control.
> though v13 is better than these results, I think we can get that also by
> changing multiplier and max limit .
>
> But I think we are ok with the max size as 4MB (512 blocks) right?.
>
>
This shows that there is a performance increase of ~26% (599 to 752) at 16
clients if we raise the limit on the max extension size from 4MB to 16MB,
which is a good boost, but I am not sure it is worth extending the relation
in cases where the newly added pages won't get used. I think it should be
okay to go with 4MB as the limit for now; then, if during beta testing or
in the future there are use cases where tuning this max limit helps, we can
come back to it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-31 10:58:39
Message-ID: CA+TgmoZSGfs_p87s8bjZXUa2LQN856VX0Mg-3kkpHuZ_kCAzCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 12:59 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>> Yes, that makes sense. One more point is that if the reason for v13
>> giving better performance is extra blocks (which we believe in certain cases
>> can leak till the time Vacuum updates the FSM tree), do you think it makes
>> sense to once test by increasing lockWaiters * 20 limit to may be
>> lockWaiters * 25 or lockWaiters * 30.
>
>
> I tested COPY 10000 record, by increasing the number of blocks just to find
> out why we are not as good as V13
> with extraBlocks = Min( lockWaiters * 40, 2048) and got below results..
>
> COPY 10000
> --------------------
> Client   Patch (extraBlocks = Min(lockWaiters * 40, 2048))
> 16       752
> 32       708
>
> This proves that main reason of v13 being better is its adding extra blocks
> without control.
> though v13 is better than these results, I think we can get that also by
> changing multiplier and max limit .
>
> But I think we are ok with the max size as 4MB (512 blocks) right?

Yeah, kind of. But obviously if we could make the limit smaller
without hurting performance, that would be better.

Per my note yesterday about performance degradation with parallel
COPY, I wasn't able to demonstrate that this patch gives a consistent
performance benefit on hydra - the best result I got was speeding up a
9.5 minute load to 8 minutes where linear scalability would have been
2 minutes. And I found cases where it was actually slower with the
patch. Now maybe hydra is just a crap machine, but I'm feeling
nervous.

What machines did you use to test this? Have you tested really large
data loads, like many MB or even GB of data?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-31 11:14:02
Message-ID: CAA4eK1K3EqABhtkpddyccwsV3X8RfN9xMm1EPNnYBqH+wf3Sig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 4:28 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Thu, Mar 31, 2016 at 12:59 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
> > On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> >> Yes, that makes sense. One more point: if the reason for v13
> >> giving better performance is the extra blocks (which we believe in
> >> certain cases can leak until the time Vacuum updates the FSM tree),
> >> do you think it makes sense to test once more, increasing the
> >> lockWaiters * 20 limit to maybe lockWaiters * 25 or lockWaiters * 30?
> >
> > I tested COPY of 10000 records, increasing the number of blocks just to
> > find out why we are not as good as v13. With extraBlocks =
> > Min(lockWaiters * 40, 2048) I got the results below:
> >
> > COPY 10000
> > ----------
> > Client     Patch (extraBlocks = Min(lockWaiters * 40, 2048))
> > ------     -------------------------------------------------
> > 16         752
> > 32         708
> >
> > This proves that the main reason v13 is better is that it adds extra
> > blocks without any control. Though v13 is still better than these
> > results, I think we can match it by changing the multiplier and the
> > max limit.
> >
> > But I think we are OK with the max size as 4MB (512 blocks), right?
>
> Yeah, kind of. But obviously if we could make the limit smaller
> without hurting performance, that would be better.
>
> Per my note yesterday about performance degradation with parallel
> COPY, I wasn't able to demonstrate that this patch gives a consistent
> performance benefit on hydra - the best result I got was speeding up a
> 9.5 minute load to 8 minutes where linear scalability would have been
> 2 minutes.

Is this test for unlogged tables? As far as I understand, this patch will
show a benefit if data and WAL are on separate disks or if you test with
unlogged tables; otherwise, WAL contention outweighs the benefit of this
patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-03-31 13:03:45
Message-ID: CAFiTN-s0p79Oho4XQx6iv_U02BXiNNDhUjurPo9kaV_JUst5OA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 4:28 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Yeah, kind of. But obviously if we could make the limit smaller
> without hurting performance, that would be better.
>
> Per my note yesterday about performance degradation with parallel
> COPY, I wasn't able to demonstrate that this patch gives a consistent
> performance benefit on hydra - the best result I got was speeding up a
> 9.5 minute load to 8 minutes where linear scalability would have been
> 2 minutes. And I found cases where it was actually slower with the
> patch. Now maybe hydra is just a crap machine, but I'm feeling
> nervous.
>

I see the performance gain when the complete data is on SSD, when the data
is on MD and the WAL is on SSD, or with unlogged tables.

> What machines did you use to test this? Have you tested really large
> data loads, like many MB or even GB of data?
>

With the INSERT script, the data size after a 2-minute run is 18GB, and I am
running for 5-10 minutes, which means at least 85GB of data.
(It inserts 1000 1KB tuples in each transaction.)

With the COPY script, the data size after a 2-minute run is 23GB, and I am
running for 5-10 minutes, which means at least 100GB of data.
(It inserts 10000 tuples in each transaction; tuple size is 1 byte to 5 bytes.)
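
(As a rough cross-check of scale from those figures: 18GB in 2 minutes works
out to 18GB / 120s, i.e. roughly 150MB/s of aggregate insert throughput, and
23GB in 2 minutes to roughly 190MB/s for COPY.)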

Machine details
---------------
I tested on an 8-socket NUMA machine with 64 cores:
[dilip(dot)kumar(at)cthulhu ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 1064.000
BogoMIPS: 4266.62

If you need any more information, please let me know.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-04-05 17:02:00
Message-ID: CA+Tgmob7xED4AhoqLspSOF0wCMYEomgHfuVdzNJnwWVoE_c60g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 9:03 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> If you need any more information, please let me know.

I repeated the testing described in
http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com
on a MacBook Pro (OS X 10.8.5, 2.4 GHz Intel Core i7, 8GB, early 2013)
and got the following results. Note that I did not adjust
*_flush_delay in this test because that's always 0, apparently, on
MacOS.

master, unlogged tables, 1 copy: 0m18.928s, 0m20.276s, 0m18.040s
patched, unlogged tables, 1 copy: 0m20.499s, 0m20.879s, 0m18.912s
master, unlogged tables, 4 parallel copies: 0m57.301s, 0m58.045s, 0m57.556s
patched, unlogged tables, 4 parallel copies: 0m47.994s, 0m45.586s, 0m44.440s

master, logged tables, 1 copy: 0m29.353s, 0m29.693s, 0m31.840s
patched, logged tables, 1 copy: 0m30.837s, 0m31.567s, 0m36.843s
master, logged tables, 4 parallel copies: 1m45.691s, 1m53.085s, 1m35.674s
patched, logged tables, 4 parallel copies: 1m21.137s, 1m20.678s, 1m22.419s

So the first thing here is that the patch seems to be a clear win in
this test. For a single copy, it seems to be pretty much a wash.
When running 4 copies in parallel, it is about 20-25% faster with both
logged and unlogged tables. The second thing that is interesting is
that we are getting super-linear scalability even without the patch:
if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but
it really takes 60 unpatched or 45 patched. If 1 copy takes 30
seconds, you might expect 4 to take 120 seconds, but it really takes
105 unpatched or 80 patched. So we're not actually I/O constrained on
this test, I think, perhaps because this machine has an SSD.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-04-05 17:05:41
Message-ID: CAM3SWZSFaj4ru8aFDwOv-piFMSqBtsK9tspmLP11UXoexumDVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 5, 2016 at 10:02 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> So the first thing here is that the patch seems to be a clear win in
> this test. For a single copy, it seems to be pretty much a wash.
> When running 4 copies in parallel, it is about 20-25% faster with both
> logged and unlogged tables. The second thing that is interesting is
> that we are getting super-linear scalability even without the patch:
> if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but
> it really takes 60 unpatched or 45 patched. If 1 copy takes 30
> seconds, you might expect 4 to take 120 seconds, but it really takes
> 105 unpatched or 80 patched. So we're not actually I/O constrained on
> this test, I think, perhaps because this machine has an SSD.

It's not unusual for COPY to not be I/O constrained, I believe.

--
Peter Geoghegan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-04-08 06:08:29
Message-ID: CA+TgmoaHWeromNxV7Spaej9EMXY3p1ONvU3RWnOPCM-R=OinOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 5, 2016 at 1:05 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Tue, Apr 5, 2016 at 10:02 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> So the first thing here is that the patch seems to be a clear win in
>> this test. For a single copy, it seems to be pretty much a wash.
>> When running 4 copies in parallel, it is about 20-25% faster with both
>> logged and unlogged tables. The second thing that is interesting is
>> that we are getting super-linear scalability even without the patch:
>> if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but
>> it really takes 60 unpatched or 45 patched. If 1 copy takes 30
>> seconds, you might expect 4 to take 120 seconds, but it really takes
>> 105 unpatched or 80 patched. So we're not actually I/O constrained on
>> this test, I think, perhaps because this machine has an SSD.
>
> It's not unusual for COPY to not be I/O constrained, I believe.

Yeah. I've committed the patch now, with some cosmetic cleanup.
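
For the archives, the committed logic in hio.c's RelationAddExtraBlocks()
ends up roughly like the sketch below (trimmed and paraphrased, not the
verbatim source - consult the tree for the authoritative version):

    static void
    RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
    {
        BlockNumber blockNum = InvalidBlockNumber;
        BlockNumber firstBlock = InvalidBlockNumber;
        Size        freespace = 0;
        Buffer      buffer;
        Page        page;
        int         extraBlocks;
        int         lockWaiters;

        /* Use the length of the lock wait queue to judge how much to extend. */
        lockWaiters = RelationExtensionLockWaiterCount(relation);
        if (lockWaiters <= 0)
            return;

        /* 20 blocks per waiter, capped at 512 blocks (4MB of 8kB pages). */
        extraBlocks = Min(512, lockWaiters * 20);

        while (extraBlocks-- >= 0)
        {
            /* Extend the relation by one page and initialize that page. */
            buffer = ReadBufferBI(relation, P_NEW, bistate);
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
            page = BufferGetPage(buffer);
            PageInit(page, BufferGetPageSize(buffer), 0);
            freespace = PageGetHeapFreeSpace(page);
            MarkBufferDirty(buffer);
            blockNum = BufferGetBlockNumber(buffer);
            UnlockReleaseBuffer(buffer);

            /* Remember the first block added, for the bulk FSM update below. */
            if (firstBlock == InvalidBlockNumber)
                firstBlock = blockNum;

            /*
             * Update the bottom level of the FSM immediately, so that other
             * concurrently inserting backends can find these pages without
             * queueing on the extension lock themselves.
             */
            RecordPageWithFreeSpace(relation, blockNum, freespace);
        }

        /*
         * Updating the upper FSM levels for every block would be too
         * expensive, so do it once at the end for the whole extended range.
         */
        UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);
    }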

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2016-04-08 06:15:14
Message-ID: CAFiTN-sdFXWbAMmjF2Q_0ya48HRS9Kpkfi4QAf8CE=a_spuiog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Apr 8, 2016 at 11:38 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Yeah. I've committed the patch now, with some cosmetic cleanup.
>

Thanks Robert !!!

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com