Re: Speed up Clog Access by increasing CLOG buffers

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-01 04:49:19
Message-ID: CAA4eK1+8=X9mSNeVeHg_NqMsOR-XKsjuqrYzQf=iCsdh3U4EOA@mail.gmail.com
Lists: pgsql-hackers

After reducing ProcArrayLock contention in commit
0e141c0fbb211bdd23783afa731e3eef95c9ad7a, the other lock
which seems to be contentious in read-write transactions is
CLogControlLock. In my investigation, I found that the contention
is mainly due to two reasons. First, while writing the transaction
status in CLOG (TransactionIdSetPageStatus()), we acquire
CLogControlLock in Exclusive mode, which contends with every other
transaction that tries to access the CLOG to check transaction status;
to reduce this, a patch [1] has already been proposed by Simon. Second,
when the CLOG page is not found in the CLOG buffers, CLogControlLock
must be acquired in Exclusive mode, which again contends with the
shared lockers that try to access transaction status.

Increasing the number of CLOG buffers to 64 helps in reducing the
contention due to the second reason. Experiments revealed that
increasing CLOG buffers only helps once the contention around
ProcArrayLock is reduced.
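
For reference, the number of CLOG buffers is derived from shared_buffers
by a Min/Max formula in clog.c (CLOGShmemBuffers()); the patch
essentially raises the cap in that formula from 32 to 64. A small
stand-alone sketch of the arithmetic (a paraphrase, not the actual
PostgreSQL code; the exact divisor is recalled from memory and NBuffers
is shared_buffers expressed in 8kB pages):

#include <stdio.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))
#define Max(x, y) ((x) > (y) ? (x) : (y))

static long
clog_buffers(long cap, long nbuffers)
{
    /* few buffers for tiny shared_buffers, otherwise the cap */
    return Min(cap, Max(4, nbuffers / 512));
}

int
main(void)
{
    long nbuffers = (8L * 1024 * 1024 * 1024) / 8192;  /* shared_buffers = 8GB */

    printf("cap 32 -> %ld CLOG buffers\n", clog_buffers(32, nbuffers)); /* 32 */
    printf("cap 64 -> %ld CLOG buffers\n", clog_buffers(64, nbuffers)); /* 64 */
    return 0;
}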

Performance Data
-----------------------------
RAM - 500GB
8 sockets, 64 cores (128 hardware threads with hyperthreading)

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b

HEAD - commit 0e141c0f
Patch-1 - increase_clog_bufs_v1

Client Count   1     8     16     32     64     128    256
HEAD           911   5695  9886   18028  27851  28654  25714
Patch-1        954   5568  9898   18450  29313  31108  28213

This data shows an increase of ~5% at 64 clients and 8~10% at higher
client counts, with no degradation at lower client counts. There is
some fluctuation at the 8-client count, but I attribute that to
run-to-run variation; if anybody has doubts, I can re-verify the data
at lower client counts.

Now if we try to further increase the number of CLOG buffers to 128,
no improvement is seen.

I have also verified that this improvement can be seen only after the
contention around ProcArrayLock is reduced. Below is the data with the
commit before the ProcArrayLock reduction patch. The setup and test
are the same as for the previous test.

HEAD - commit 253de7e1
Patch-1 - increase_clog_bufs_v1

Client Count   128    256
HEAD           16657  10512
Patch-1        16694  10477

I think the benefit of this patch would be more significant along
with the other patch to reduce CLogControlLock contention [1]
(I have not tested both patches together as there are still a few
issues left in the other patch); however, it has its own independent
value, so it can be considered separately.

Thoughts?

[1] -
http://www.postgresql.org/message-id/CANP8+j+imQfHxkChFyfnXDyi6k-arAzRV+ZG-V_OFxEtJjOL2Q@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
increase_clog_bufs_v1.patch application/octet-stream 2.4 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-03 11:41:37
Message-ID: 20150903114137.GE27649@awork2.anarazel.de
Lists: pgsql-hackers

On 2015-09-01 10:19:19 +0530, Amit Kapila wrote:
> pgbench setup
> ------------------------
> scale factor - 300
> Data is on magnetic disk and WAL on ssd.
> pgbench -M prepared tpc-b
>
> HEAD - commit 0e141c0f
> Patch-1 - increase_clog_bufs_v1
>
> Client Count   1     8     16     32     64     128    256
> HEAD           911   5695  9886   18028  27851  28654  25714
> Patch-1        954   5568  9898   18450  29313  31108  28213
>
>
> This data shows that there is an increase of ~5% at 64 client-count
> and 8~10% at more higher clients without degradation at lower client-
> count. In above data, there is some fluctuation seen at 8-client-count,
> but I attribute that to run-to-run variation, however if anybody has doubts
> I can again re-verify the data at lower client counts.

> Now if we try to further increase the number of CLOG buffers to 128,
> no improvement is seen.
>
> I have also verified that this improvement can be seen only after the
> contention around ProcArrayLock is reduced. Below is the data with
> Commit before the ProcArrayLock reduction patch. Setup and test
> is same as mentioned for previous test.

The buffer replacement algorithm for clog is rather stupid - I do wonder
where the cutoff is that it hurts.

Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?

There are two reasons I'd like to see that: first, I'd like to avoid a
regression; second, I'd like to avoid having to bump the maximum number
of buffers by small amounts after every hardware generation...

> /*
> * Number of shared CLOG buffers.
> *
> - * Testing during the PostgreSQL 9.2 development cycle revealed that on a
> + * Testing during the PostgreSQL 9.6 development cycle revealed that on a
> * large multi-processor system, it was possible to have more CLOG page
> - * requests in flight at one time than the number of CLOG buffers which existed
> - * at that time, which was hardcoded to 8. Further testing revealed that
> - * performance dropped off with more than 32 CLOG buffers, possibly because
> - * the linear buffer search algorithm doesn't scale well.
> + * requests in flight at one time than the number of CLOG buffers which
> + * existed at that time, which was 32 assuming there are enough shared_buffers.
> + * Further testing revealed that either performance stayed same or dropped off
> + * with more than 64 CLOG buffers, possibly because the linear buffer search
> + * algorithm doesn't scale well or some other locking bottlenecks in the
> + * system mask the improvement.
> *
> - * Unconditionally increasing the number of CLOG buffers to 32 did not seem
> + * Unconditionally increasing the number of CLOG buffers to 64 did not seem
> * like a good idea, because it would increase the minimum amount of shared
> * memory required to start, which could be a problem for people running very
> * small configurations. The following formula seems to represent a reasonable
> * compromise: people with very low values for shared_buffers will get fewer
> - * CLOG buffers as well, and everyone else will get 32.
> + * CLOG buffers as well, and everyone else will get 64.
> *
> * It is likely that some further work will be needed here in future releases;
> * for example, on a 64-core server, the maximum number of CLOG requests that
> * can be simultaneously in flight will be even larger. But that will
> * apparently require more than just changing the formula, so for now we take
> - * the easy way out.
> + * the easy way out. It could also happen that after removing other locking
> + * bottlenecks, further increase in CLOG buffers can help, but that's not the
> + * case now.
> */

I think the comment should be more drastically rephrased to not
reference individual versions and numbers.

Greetings,

Andres Freund


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-07 13:34:10
Message-ID: 20150907133410.GO2912@alvherre.pgsql
Lists: pgsql-hackers

Andres Freund wrote:

> The buffer replacement algorithm for clog is rather stupid - I do wonder
> where the cutoff is that it hurts.
>
> Could you perhaps try to create a testcase where xids are accessed that
> are so far apart on average that they're unlikely to be in memory? And
> then test that across a number of client counts?
>
> There's two reasons that I'd like to see that: First I'd like to avoid
> regression, second I'd like to avoid having to bump the maximum number
> of buffers by small buffers after every hardware generation...

I wonder if it would make sense to explore an idea that has been floated
for years now -- to have pg_clog pages be allocated as part of shared
buffers rather than have their own separate pool. That way, no separate
hardcoded allocation limit is needed. It's probably pretty tricky to
implement, though :-(

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-07 16:49:08
Message-ID: 20150907164908.GB5084@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2015-09-07 10:34:10 -0300, Alvaro Herrera wrote:
> I wonder if it would make sense to explore an idea that has been floated
> for years now -- to have pg_clog pages be allocated as part of shared
> buffers rather than have their own separate pool. That way, no separate
> hardcoded allocation limit is needed. It's probably pretty tricky to
> implement, though :-(

I still think that'd be a good plan, especially as it'd also let us use
a lot of related infrastructure. I doubt we could just use the standard
cache replacement mechanism though - it's not particularly efficient
either... I also have my doubts that a hash table lookup at every clog
lookup is going to be ok performancewise.

The biggest problem will probably be that the buffer manager is pretty
directly tied to relations, and breaking up that bond won't be all that
easy. My guess is that the easiest way to at least explore this is to
define pg_clog/... as their own tablespaces (akin to pg_global) and
treat the files therein as plain relations.

Greetings,

Andres Freund


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-07 18:56:39
Message-ID: 20150907185639.GQ2912@alvherre.pgsql
Lists: pgsql-hackers

Andres Freund wrote:

> On 2015-09-07 10:34:10 -0300, Alvaro Herrera wrote:
> > I wonder if it would make sense to explore an idea that has been floated
> > for years now -- to have pg_clog pages be allocated as part of shared
> > buffers rather than have their own separate pool. That way, no separate
> > hardcoded allocation limit is needed. It's probably pretty tricky to
> > implement, though :-(
>
> I still think that'd be a good plan, especially as it'd also let us use
> a lot of related infrastructure. I doubt we could just use the standard
> cache replacement mechanism though - it's not particularly efficient
> either... I also have my doubts that a hash table lookup at every clog
> lookup is going to be ok performancewise.

Yeah. I guess we'd have to mark buffers as unusable for regular pages
("somehow"), and have a separate lookup mechanism. As I said, it is
likely to be tricky.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-08 11:50:01
Message-ID: CAA4eK1KpXReQcFL-qKw6T7buYqQAmAEPwYgwCzmRtS+9J4dq0Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 7, 2015 at 7:04 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
wrote:
>
> Andres Freund wrote:
>
> > The buffer replacement algorithm for clog is rather stupid - I do wonder
> > where the cutoff is that it hurts.
> >
> > Could you perhaps try to create a testcase where xids are accessed that
> > are so far apart on average that they're unlikely to be in memory?
> >

Yes, I am working on it. What I have in mind is to create a table with
a large number of rows (say 50000000) and have each row with a
different transaction id. Each transaction should then try to update
rows that are at least 1048576 (the number of transactions whose status
can be held in 32 CLOG buffers) apart, so that each update tries to
access a CLOG page that is not in memory. Let me know if you can think
of any better or simpler way.

> > There's two reasons that I'd like to see that: First I'd like to avoid
> > regression, second I'd like to avoid having to bump the maximum number
> > of buffers by small buffers after every hardware generation...
>
> I wonder if it would make sense to explore an idea that has been floated
> for years now -- to have pg_clog pages be allocated as part of shared
> buffers rather than have their own separate pool.
>

There could be some benefits to it, but I think we would still have to
acquire an Exclusive lock while committing a transaction or while
extending the CLOG, which are also major sources of contention in this
area. I think the benefits of moving it to shared_buffers could be that
the upper limit on the number of pages that can be retained in memory
could be increased, and even if we have to replace a page, the
responsibility to flush it could be delegated to checkpoint. So yes,
there could be benefits with this idea, but I am not sure they are
worth investigating. One thing we could try, if you think it is
beneficial, is to just skip the fsync during the write of CLOG pages
and, if that turns out to be beneficial, then think of pushing it to
checkpoint (something similar to what Andres has mentioned on a nearby
thread).

Yet another way could be to have a configuration variable for clog
buffers (Clog_Buffers).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-08 18:30:19
Message-ID: CA+TgmoaDRa2ioA=Udku9bgJOAff743nb+phLLVyY46rmHeFa5A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 7, 2015 at 9:34 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Andres Freund wrote:
>> The buffer replacement algorithm for clog is rather stupid - I do wonder
>> where the cutoff is that it hurts.
>>
>> Could you perhaps try to create a testcase where xids are accessed that
>> are so far apart on average that they're unlikely to be in memory? And
>> then test that across a number of client counts?
>>
>> There's two reasons that I'd like to see that: First I'd like to avoid
>> regression, second I'd like to avoid having to bump the maximum number
>> of buffers by small buffers after every hardware generation...
>
> I wonder if it would make sense to explore an idea that has been floated
> for years now -- to have pg_clog pages be allocated as part of shared
> buffers rather than have their own separate pool. That way, no separate
> hardcoded allocation limit is needed. It's probably pretty tricky to
> implement, though :-(

Yeah, I looked at that once and threw my hands up in despair pretty
quickly. I also considered another idea that looked simpler: instead
of giving every SLRU its own pool of pages, have one pool of pages for
all of them, separate from shared buffers but common to all SLRUs.
That looked easier, but still not easy.

I've also considered trying to replace the entire SLRU system with new
code and throwing away what exists today. The locking mode is just
really strange compared to what we do elsewhere. That, too, does not
look all that easy. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-11 14:31:18
Message-ID: CAA4eK1LMMGNQ439BUm0LcS3p0sb8S9kc-cUGU_ThNqMwA8_Tug@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2015-09-01 10:19:19 +0530, Amit Kapila wrote:
> > pgbench setup
> > ------------------------
> > scale factor - 300
> > Data is on magnetic disk and WAL on ssd.
> > pgbench -M prepared tpc-b
> >
> > HEAD - commit 0e141c0f
> > Patch-1 - increase_clog_bufs_v1
> >
>
> The buffer replacement algorithm for clog is rather stupid - I do wonder
> where the cutoff is that it hurts.
>
> Could you perhaps try to create a testcase where xids are accessed that
> are so far apart on average that they're unlikely to be in memory? And
> then test that across a number of client counts?
>

Okay, I have tried one such test, but what I could come up with is that
on average every 100th access is a disk access; I then tested it with
different numbers of CLOG buffers and client counts. Below is the
result:

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB
autovacuum=off

HEAD - commit 49124613
Patch-1 - Clog Buffers - 64
Patch-2 - Clog Buffers - 128

Client Count   1     8     64     128
HEAD           1395  8336  37866  34463
Patch-1        1615  8180  37799  35315
Patch-2        1409  8219  37068  34729

So there is not much difference in the test results with different
values for CLOG buffers, probably because I/O has dominated the test;
it shows that increasing the CLOG buffers won't regress the current
behaviour even though there are a lot more accesses of transaction
status outside the CLOG buffers.

Now about the test: create a table with a large number of rows (say
11617457; I have tried to create a larger one, but it was taking too
much time, more than a day) and have each row with a different
transaction id. Each transaction should then update rows that are at
least 1048576 (the number of transactions whose status can be held in
32 CLOG buffers) apart, so that ideally each update tries to access a
CLOG page that is not in memory; however, as the value to update is
selected randomly, this leads to roughly every 100th access being a
disk access.

Test
-------
1. The attached file clog_prep.sh creates and populates the required
table and creates the function used to access the CLOG pages. You
might want to update no_of_rows based on the number of rows you want
to create in the table.
2. The attached file access_clog_disk.sql is used to execute the
function with random values. You might want to update the nrows
variable based on the rows created in the previous step.
3. Use pgbench as follows with different client counts:
./pgbench -c 4 -j 4 -n -M prepared -f "access_clog_disk.sql" -T 300 postgres
4. To ensure that the CLOG access function always accesses the same
data during each run, the test copies the data_directory created by
step 1 before each run.

I have checked, by adding some instrumentation, that approximately
every 100th access is a disk access; the attached patch
clog_info-v1.patch adds the necessary instrumentation to the code.

As an example, pgbench test yields below results:
./pgbench -c 4 -j 4 -n -M prepared -f "access_clog_disk.sql" -T 180 postgres

LOG: trans_status(3169396)
LOG: trans_status_disk(29546)
LOG: trans_status(3054952)
LOG: trans_status_disk(28291)
LOG: trans_status(3131242)
LOG: trans_status_disk(28989)
LOG: trans_status(3155449)
LOG: trans_status_disk(29347)

Here 'trans_status' is the number of times the process went to access
the CLOG status and 'trans_status_disk' is the number of times it went
to disk to access a CLOG page.

>
> > /*
> > * Number of shared CLOG buffers.
> > *
>
>
>
> I think the comment should be more drastically rephrased to not
> reference individual versions and numbers.
>

The updated comments and the patch containing them
(increase_clog_bufs_v2.patch) are attached.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
clog_prep.sh application/x-sh 2.6 KB
access_clog_disk.sql application/octet-stream 154 bytes
clog_info-v1.patch application/octet-stream 1.9 KB
increase_clog_bufs_v2.patch application/octet-stream 2.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-11 15:51:36
Message-ID: CA+TgmoYE4kj=fRNwPPL6+Qm-oD-JYX+RnxFjVaGGgOjT1aj70Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > Could you perhaps try to create a testcase where xids are accessed that
> > are so far apart on average that they're unlikely to be in memory? And
> > then test that across a number of client counts?
> >
>
> Now about the test, create a table with large number of rows (say 11617457,
> I have tried to create larger, but it was taking too much time (more than a day))
> and have each row with different transaction id. Now each transaction should
> update rows that are at least 1048576 (number of transactions whose status can
> be held in 32 CLog buffers) distance apart, that way ideally for each update it will
> try to access Clog page that is not in-memory, however as the value to update
> is getting selected randomly and that leads to every 100th access as disk access.

What about just running a regular pgbench test, but hacking the
XID-assignment code so that we increment the XID counter by 100 each
time instead of 1?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-12 03:01:51
Message-ID: CAA4eK1JxL0zfqNxX=a-bRyNbCfXeL9Pq8v5oeoPb8Z_u2sjL+Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 11, 2015 at 9:21 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > > Could you perhaps try to create a testcase where xids are accessed
that
> > > are so far apart on average that they're unlikely to be in memory? And
> > > then test that across a number of client counts?
> > >
> >
> > Now about the test, create a table with large number of rows (say
11617457,
> > I have tried to create larger, but it was taking too much time (more
than a day))
> > and have each row with different transaction id. Now each transaction
should
> > update rows that are at least 1048576 (number of transactions whose
status can
> > be held in 32 CLog buffers) distance apart, that way ideally for each
update it will
> > try to access Clog page that is not in-memory, however as the value to
update
> > is getting selected randomly and that leads to every 100th access as
disk access.
>
> What about just running a regular pgbench test, but hacking the
> XID-assignment code so that we increment the XID counter by 100 each
> time instead of 1?
>

If I am not wrong, we need a difference of 1048576 transactions for
each record to make each CLOG access a disk access, so if we increment
the XID counter by 100, then probably every 10000th transaction (or a
multiple of 10000) would go for disk access.

The number 1048576 is derived by below calc:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

Transaction difference required for each transaction to go for disk access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.
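
Spelled out as a tiny stand-alone check (assuming the default BLCKSZ of
8192):

#include <stdio.h>

#define BLCKSZ 8192
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)    /* 32768 */

int
main(void)
{
    int num_clog_buffers = 32;

    /* XIDs this far apart can never all be covered by the resident buffers */
    printf("xid distance = %d\n", CLOG_XACTS_PER_PAGE * num_clog_buffers);
    /* prints: xid distance = 1048576 */
    return 0;
}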

I think reducing it to every 100th transaction-status access being a
disk access is sufficient to prove that there is no regression with
the patch for the scenario Andres asked about, or do you think it is
not?

Another possibility here could be to try commenting out the fsync in
the CLOG path to see how much it impacts the performance of this test
and then of the pgbench test. I am not sure there will be any impact,
because even if every 100th transaction goes to disk, that is still
less than the WAL fsync which we have to perform for each transaction.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-14 11:23:27
Message-ID: CA+TgmobJCGdyZkdbD8cSMh7NfPvEbo2_-BL_TVGapH_xX7YtiA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 11, 2015 at 11:01 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> If I am not wrong we need 1048576 number of transactions difference
> for each record to make each CLOG access a disk access, so if we
> increment XID counter by 100, then probably every 10000th (or multiplier
> of 10000) transaction would go for disk access.
>
> The number 1048576 is derived by below calc:
> #define CLOG_XACTS_PER_BYTE 4
> #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
>
> Transaction difference required for each transaction to go for disk access:
> CLOG_XACTS_PER_PAGE * num_clog_buffers.
>
> I think reducing to every 100th access for transaction status as disk access
> is sufficient to prove that there is no regression with the patch for the
> screnario
> asked by Andres or do you think it is not?

I have no idea. I was just suggesting that hacking the server somehow
might be an easier way of creating the scenario Andres was interested
in than the process you described. But feel free to ignore me, I
haven't taken much time to think about this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-18 17:38:59
Message-ID: 55FC4C33.1050903@redhat.com
Lists: pgsql-hackers

On 09/11/2015 10:31 AM, Amit Kapila wrote:
> Updated comments and the patch (increate_clog_bufs_v2.patch)
> containing the same is attached.
>

I have done various runs on an Intel Xeon 28C/56T w/ 256GB mem and 2 x
RAID10 SSD (data + xlog) with Min(64,).

Kept the shared_buffers=64GB and effective_cache_size=160GB settings
across all runs, but did runs with both synchronous_commit on and off
and different scale factors for pgbench.

The results fluctuate within -2 to +2% for all client counts,
depending on the latency average.

So no real conclusion from here, other than that the patch doesn't
help/hurt performance on this setup; it likely depends on further
CLogControlLock-related changes to see a real benefit.

Best regards,
Jesper


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-19 03:11:44
Message-ID: CAA4eK1LKw4zR6Mb5EQWsmRCyhxpeUakRrDyB1Gth6nw4ktw3iw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 18, 2015 at 11:08 PM, Jesper Pedersen <
jesper(dot)pedersen(at)redhat(dot)com> wrote:

> On 09/11/2015 10:31 AM, Amit Kapila wrote:
>
>> Updated comments and the patch (increate_clog_bufs_v2.patch)
>> containing the same is attached.
>>
>>
> I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
> RAID10 SSD (data + xlog) with Min(64,).
>
>
The benefit with this patch could be seen at somewhat higher client
counts, as you can see in my initial mail; can you please try once
with client count > 64?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-21 01:04:28
Message-ID: CAM3SWZRWJp05QvwPtuDL4xmixuYcBeq2ChVZpgwJZdZ_ncGDYQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Increasing CLOG buffers to 64 helps in reducing the contention due to second
> reason. Experiments revealed that increasing CLOG buffers only helps
> once the contention around ProcArrayLock is reduced.

There has been a lot of research on bitmap compression, more or less
for the benefit of bitmap index access methods.

Simple techniques like run length encoding are effective for some
things. If the need to map the bitmap into memory to access the status
of transactions is a concern, there has been work done on that, too.
Byte-aligned bitmap compression is a technique that might offer a good
trade-off between compressing the clog and decompression overhead -- I
think that there basically is no decompression overhead, because set
operations can be performed on the "compressed" representation
directly. There are other techniques, too.

Something to consider. There could be multiple benefits to compressing
clog, even beyond simply avoiding managing clog buffers.

--
Peter Geoghegan


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-09-21 12:55:45
Message-ID: 55FFFE51.1060308@redhat.com
Lists: pgsql-hackers

On 09/18/2015 11:11 PM, Amit Kapila wrote:
>> I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
>> RAID10 SSD (data + xlog) with Min(64,).
>>
>>
> The benefit with this patch could be seen at somewhat higher
> client-count as you can see in my initial mail, can you please
> once try with client count > 64?
>

Client counts were from 1 to 80.

I did do one run with Min(128,) like you, but didn't see any difference
in the result compared to Min(64,), so I focused instead on the
sync_commit on/off testing case.

Best regards,
Jesper


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-10-05 01:04:17
Message-ID: CAMkU=1yLzEBi3w-zsAMzyYvDs-FM1p_AiUu9=0d67u0fULWgqw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Fri, Sep 11, 2015 at 9:21 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> >
> > On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > > > Could you perhaps try to create a testcase where xids are accessed
> that
> > > > are so far apart on average that they're unlikely to be in memory?
> And
> > > > then test that across a number of client counts?
> > > >
> > >
> > > Now about the test, create a table with large number of rows (say
> 11617457,
> > > I have tried to create larger, but it was taking too much time (more
> than a day))
> > > and have each row with different transaction id. Now each transaction
> should
> > > update rows that are at least 1048576 (number of transactions whose
> status can
> > > be held in 32 CLog buffers) distance apart, that way ideally for each
> update it will
> > > try to access Clog page that is not in-memory, however as the value to
> update
> > > is getting selected randomly and that leads to every 100th access as
> disk access.
> >
> > What about just running a regular pgbench test, but hacking the
> > XID-assignment code so that we increment the XID counter by 100 each
> > time instead of 1?
> >
>
> If I am not wrong we need 1048576 number of transactions difference
> for each record to make each CLOG access a disk access, so if we
> increment XID counter by 100, then probably every 10000th (or multiplier
> of 10000) transaction would go for disk access.
>
> The number 1048576 is derived by below calc:
> #define CLOG_XACTS_PER_BYTE 4
> #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
>

> Transaction difference required for each transaction to go for disk access:
> CLOG_XACTS_PER_PAGE * num_clog_buffers.
>

That guarantees that every xid occupies its own 32-contiguous-page
chunk of clog.

But clog pages are not pulled in and out in 32-page chunks, but in
one-page chunks. So you would only need a difference of 32,768 to get
every real transaction to live on its own clog page, which means every
lookup of a different real transaction would have to do a page
replacement. (I think your references to disk access here are
misleading. Isn't the issue here the contention on the lock that
controls the page replacement, not the actual IO?)

I've attached a patch that allows you to set the guc "JJ_xid", which
makes it burn the given number of xids every time a new one is asked
for. (The
patch introduces lots of other stuff as well, but I didn't feel like
ripping the irrelevant parts out--if you don't set any of the other gucs it
introduces from their defaults, they shouldn't cause you trouble.) I think
there are other tools around that do the same thing, but this is the one I
know about. It is easy to drive the system into wrap-around shutdown with
this, so lowering autovacuum_vacuum_cost_delay is a good idea.

Actually I haven't attached it, because then the commitfest app would
list it as the patch needing review; instead I've put it here:
https://drive.google.com/file/d/0Bzqrh1SO9FcERV9EUThtT3pacmM/view?usp=sharing

> I think reducing to every 100th access for transaction status as disk access
> is sufficient to prove that there is no regression with the patch for the
> screnario
> asked by Andres or do you think it is not?
>
> Now another possibility here could be that we try by commenting out fsync
> in CLOG path to see how much it impact the performance of this test and
> then for pgbench test. I am not sure there will be any impact because even
> every 100th transaction goes to disk access that is still less as compare
> WAL fsync which we have to perform for each transaction.
>

You mentioned that your clog is not on ssd, but surely at this scale of
hardware, the hdd the clog is on has a bbu in front of it, no?

But I thought Andres' concern was not about fsync, but about the fact that
the SLRU does linear scans (repeatedly) of the buffers while holding the
control lock? At some point, scanning more and more buffers under the lock
is going to cause more contention than scanning fewer buffers and just
evicting a page will.
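
For concreteness, a minimal stand-alone sketch of that kind of linear
buffer search (not the actual slru.c code; all names are hypothetical):
every lookup walks all slots while the caller holds the control lock,
so the per-lookup cost under the lock grows with the number of buffers.

#include <stdio.h>

#define NUM_BUFFERS 32
#define PAGE_EMPTY  (-1)

static int buffer_page[NUM_BUFFERS];   /* which page each slot holds */
static int buffer_lru[NUM_BUFFERS];    /* larger value = more recently used */
static int lru_clock = 0;

/* Return the slot holding 'page', evicting the LRU slot on a miss. */
static int
select_buffer(int page)
{
    int best_slot = 0;

    for (int slot = 0; slot < NUM_BUFFERS; slot++)
    {
        if (buffer_page[slot] == page)
        {
            buffer_lru[slot] = ++lru_clock;     /* hit: just touch it */
            return slot;
        }
        if (buffer_lru[slot] < buffer_lru[best_slot])
            best_slot = slot;                   /* remember the LRU slot */
    }

    /* miss: evict the LRU slot (the real code would do I/O here) */
    buffer_page[best_slot] = page;
    buffer_lru[best_slot] = ++lru_clock;
    return best_slot;
}

int
main(void)
{
    for (int i = 0; i < NUM_BUFFERS; i++)
        buffer_page[i] = PAGE_EMPTY;

    printf("page 7 -> slot %d\n", select_buffer(7));
    printf("page 7 -> slot %d (hit)\n", select_buffer(7));
    return 0;
}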

Cheers,

Jeff


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-10-05 03:27:00
Message-ID: CAA4eK1Jh0BLZZycQWU8whLr4Drmw=o_iENDeHw8tJUGrP320bw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 5, 2015 at 6:34 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>>
>>
>> If I am not wrong we need 1048576 number of transactions difference
>> for each record to make each CLOG access a disk access, so if we
>> increment XID counter by 100, then probably every 10000th (or multiplier
>> of 10000) transaction would go for disk access.
>>
>> The number 1048576 is derived by below calc:
>> #define CLOG_XACTS_PER_BYTE 4
>> #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
>>
>
>> Transaction difference required for each transaction to go for disk
>> access:
>> CLOG_XACTS_PER_PAGE * num_clog_buffers.
>>
>
>
> That guarantees that every xid occupies its own 32-contiguous-pages chunk
> of clog.
>
> But clog pages are not pulled in and out in 32-page chunks, but one page
> chunks. So you would only need 32,768 differences to get every real
> transaction to live on its own clog page, which means every look up of a
> different real transaction would have to do a page replacement.
>

Agreed, but that doesn't affect the result of the test done above.

> (I think your references to disk access here are misleading. Isn't the
> issue here the contention on the lock that controls the page replacement,
> not the actual IO?)
>
>
The point is that if no I/O is needed, then all the read accesses for
transaction status will just use Shared locks; however, if there is an
I/O, then an Exclusive lock is needed.

> I've attached a patch that allows you set the guc "JJ_xid",which makes it
> burn the given number of xids every time one new one is asked for. (The
> patch introduces lots of other stuff as well, but I didn't feel like
> ripping the irrelevant parts out--if you don't set any of the other gucs it
> introduces from their defaults, they shouldn't cause you trouble.) I think
> there are other tools around that do the same thing, but this is the one I
> know about. It is easy to drive the system into wrap-around shutdown with
> this, so lowering autovacuum_vacuum_cost_delay is a good idea.
>
> Actually I haven't attached it, because then the commitfest app will list
> it as the patch needing review, instead I've put it here
> https://drive.google.com/file/d/0Bzqrh1SO9FcERV9EUThtT3pacmM/view?usp=sharing
>
>
Thanks, I think probably this could also be used for testing.

> I think reducing to every 100th access for transaction status as disk
>> access
>> is sufficient to prove that there is no regression with the patch for the
>> screnario
>> asked by Andres or do you think it is not?
>>
>> Now another possibility here could be that we try by commenting out fsync
>> in CLOG path to see how much it impact the performance of this test and
>> then for pgbench test. I am not sure there will be any impact because
>> even
>> every 100th transaction goes to disk access that is still less as compare
>> WAL fsync which we have to perform for each transaction.
>>
>
> You mentioned that your clog is not on ssd, but surely at this scale of
> hardware, the hdd the clog is on has a bbu in front of it, no?
>
>
Yes.

> But I thought Andres' concern was not about fsync, but about the fact that
> the SLRU does linear scans (repeatedly) of the buffers while holding the
> control lock? At some point, scanning more and more buffers under the lock
> is going to cause more contention than scanning fewer buffers and just
> evicting a page will.
>
>

Yes, at some point that could matter, but I could not see the impact
at 64 or 128 CLOG buffers.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 06:50:36
Message-ID: CAA4eK1L_snxM_JcrzEstNq9P66++F4kKFce=1r5+D1vzPofdtg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 21, 2015 at 6:25 PM, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com
> wrote:

> On 09/18/2015 11:11 PM, Amit Kapila wrote:
>
>> I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
>>> RAID10 SSD (data + xlog) with Min(64,).
>>>
>>>
>>> The benefit with this patch could be seen at somewhat higher
>> client-count as you can see in my initial mail, can you please
>> once try with client count > 64?
>>
>>
> Client count were from 1 to 80.
>
> I did do one run with Min(128,) like you, but didn't see any difference in
> the result compared to Min(64,), so focused instead in the sync_commit
> on/off testing case.
>

I think the main focus for tests in this area would be at higher
client counts. At what scale factors have you taken the data, and what
other non-default settings have you used? By the way, have you tried
dropping and recreating the database and restarting the server after
each run? Can you share the exact steps you have used to perform the
tests? I am not sure why it is not showing the benefit in your
testing; maybe the benefit shows up only on a somewhat higher-end
machine, or it could be that some of the settings used for the test
are not the same as mine, or the way of testing the read-write
workload of pgbench is different.

In any case, I went ahead and tried further reducing the
CLogControlLock contention by grouping the transaction status updates.
The basic idea is the same as is used to reduce the ProcArrayLock
contention [1], which is to allow one of the procs to become the
leader and update the transaction status for the other active
transactions in the system. This has helped to reduce the contention
around CLogControlLock. The attached patch group_update_clog_v1.patch
implements this idea.
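
To make the leader/follower idea concrete, here is a minimal
stand-alone sketch of the technique using plain pthreads and C11
atomics (hypothetical names, not the actual group_update_clog_v1.patch
code): each backend pushes its request onto a lock-free pending list,
and the first one to arrive becomes the leader, takes the exclusive
lock once, and applies every queued status update.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NPROCS 8

typedef struct Proc
{
    int             xid;        /* transaction whose status to set */
    int             status;     /* desired status value */
    atomic_bool     done;       /* set by the leader once applied */
    struct Proc    *next;       /* link in the pending-updates list */
} Proc;

static _Atomic(Proc *) pending = NULL;                 /* head of pending list */
static pthread_mutex_t clog_lock = PTHREAD_MUTEX_INITIALIZER;
static int clog_status[NPROCS];                        /* stand-in for a CLOG page */

static void
group_set_status(Proc *me)
{
    Proc *head = atomic_load(&pending);

    atomic_store(&me->done, false);

    /* push ourselves onto the pending list */
    do
        me->next = head;
    while (!atomic_compare_exchange_weak(&pending, &head, me));

    if (head != NULL)
    {
        /* not the leader: wait until the leader has applied our update */
        while (!atomic_load(&me->done))
            ;                       /* the real patch sleeps on a semaphore */
        return;
    }

    /* we are the leader: acquire the lock once for the whole group */
    pthread_mutex_lock(&clog_lock);
    Proc *p = atomic_exchange(&pending, (Proc *) NULL);
    while (p != NULL)
    {
        Proc *next = p->next;

        clog_status[p->xid] = p->status;   /* apply the queued update */
        atomic_store(&p->done, true);      /* release the waiting backend */
        p = next;
    }
    pthread_mutex_unlock(&clog_lock);
}

static void *
backend(void *arg)
{
    Proc me = { .xid = (int) (long) arg, .status = 1 };

    group_set_status(&me);
    return NULL;
}

int
main(void)
{
    pthread_t th[NPROCS];

    for (long i = 0; i < NPROCS; i++)
        pthread_create(&th[i], NULL, backend, (void *) i);
    for (int i = 0; i < NPROCS; i++)
        pthread_join(th[i], NULL);

    for (int i = 0; i < NPROCS; i++)
        printf("xid %d -> status %d\n", i, clog_status[i]);
    return 0;
}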

I have taken performance data with this patch to see the impact at
various scale factors. All the data is for cases where the data fits
in shared buffers and is taken against commit 5c90a2ff on a server
with the configuration and non-default postgresql.conf settings below.

Performance Data
-----------------------------
RAM - 500GB
8 sockets, 64 cores (128 hardware threads with hyperthreading)

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Refer attached files for performance data.

sc_300_perf.png - This data indicates that at scale_factor 300, there
is a gain of ~15% at higher client counts, without degradation at
lower client counts.
different_sc_perf.png - At various scale factors, there is a gain of
~15% to 41% at higher client counts, and in some cases we see a gain
of ~5% at a moderate client count (64) as well.
perf_write_clogcontrollock_data_v1.ods - Detailed performance data at
various client counts and scale factors.

Feel free to ask for more details if the data in the attached files is
not clear.

Below is the LWLock_Stats information with and without patch:

Stats Data
---------
A. scale_factor = 300; shared_buffers=32GB; client_connections - 128

HEAD - 5c90a2ff
----------------
CLogControlLock Data
------------------------
PID 94100 lwlock main 11: shacq 678672 exacq 326477 blk 204427 spindelay
8532 dequeue self 93192
PID 94129 lwlock main 11: shacq 757047 exacq 363176 blk 207840 spindelay
8866 dequeue self 96601
PID 94115 lwlock main 11: shacq 721632 exacq 345967 blk 207665 spindelay
8595 dequeue self 96185
PID 94011 lwlock main 11: shacq 501900 exacq 241346 blk 173295 spindelay
7882 dequeue self 78134
PID 94087 lwlock main 11: shacq 653701 exacq 314311 blk 201733 spindelay
8419 dequeue self 92190

After Patch group_update_clog_v1
----------------
CLogControlLock Data
------------------------
PID 100205 lwlock main 11: shacq 836897 exacq 176007 blk 116328 spindelay
1206 dequeue self 54485
PID 100034 lwlock main 11: shacq 437610 exacq 91419 blk 77523 spindelay 994
dequeue self 35419
PID 100175 lwlock main 11: shacq 748948 exacq 158970 blk 114027 spindelay
1277 dequeue self 53486
PID 100162 lwlock main 11: shacq 717262 exacq 152807 blk 115268 spindelay
1227 dequeue self 51643
PID 100214 lwlock main 11: shacq 856044 exacq 180422 blk 113695 spindelay
1202 dequeue self 54435

The above data indicates that contention due to CLogControlLock is
reduced by around 50% with this patch.

The reasons for the remaining contention could be:

1. Readers of clog data (checking transaction status) can take
CLogControlLock in Exclusive mode when reading a page from disk; this
can contend with other readers (shared lockers of CLogControlLock) and
with the exclusive locker which updates transaction status. One of the
ways to mitigate this contention is to increase the number of CLOG
buffers, for which a patch has already been posted on this thread.

2. Readers of clog data (checking transaction status) take
CLogControlLock in Shared mode, which can contend with the exclusive
locker (the group leader) which updates transaction status. I have
tried to reduce the amount of work done by the group leader, by
allowing the group leader to read the CLOG page just once for all the
transactions in the group which updated the same CLOG page (an idea
similar to what we currently use for updating the status of
transactions having a sub-transaction tree), but that hasn't given any
further performance boost, so I left it out.

I think we can use some other ways as well to reduce the contention
around CLogControlLock by doing somewhat major surgery around SLRU,
like using buffer pools similar to shared buffers, but this idea gives
us a moderate improvement without much impact on the existing
mechanism.

Thoughts?

[1] -
http://www.postgresql.org/message-id/CAA4eK1JbX4FzPHigNt0JSaz30a85BPJV+ewhk+wg_o-T6xufEA@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v1.patch application/octet-stream 12.5 KB
sc_300_perf.png image/png 66.2 KB
different_sc_perf.png image/png 82.0 KB
perf_write_clogcontrollock_data_v1.ods application/vnd.oasis.opendocument.spreadsheet 24.8 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 08:02:29
Message-ID: CAA4eK1KtyBowAQNUf9WfaOKeG3wqc46mt6FQZHROf8Jd+AM-aQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 21, 2015 at 6:34 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>
> On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > Increasing CLOG buffers to 64 helps in reducing the contention due to
second
> > reason. Experiments revealed that increasing CLOG buffers only helps
> > once the contention around ProcArrayLock is reduced.
>
> There has been a lot of research on bitmap compression, more or less
> for the benefit of bitmap index access methods.
>
> Simple techniques like run length encoding are effective for some
> things. If the need to map the bitmap into memory to access the status
> of transactions is a concern, there has been work done on that, too.
> Byte-aligned bitmap compression is a technique that might offer a good
> trade-off between compression clog, and decompression overhead -- I
> think that there basically is no decompression overhead, because set
> operations can be performed on the "compressed" representation
> directly. There are other techniques, too.
>

I could see benefits of doing compression for CLOG, but I think it
won't be straightforward. Apart from handling compression and
decompression, the current code relies on the transaction id to find
the clog page, which will not work after compression, or we would need
to change that mapping to make it work. Also, I think it could avoid
the increase of clog buffers, which can help readers, but it won't
help much with the contention around clog updates for transaction
status.
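
To illustrate the direct mapping referred to above, here is a small
stand-alone sketch of the XID-to-CLOG-location arithmetic (macro names
following clog.c, BLCKSZ assumed to be 8192); a compression scheme
would have to replace this fixed arithmetic with some kind of lookup:

#include <stdio.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define BLCKSZ 8192
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

#define TransactionIdToPage(xid)    ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid)    (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
#define TransactionIdToBIndex(xid)  ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)

int
main(void)
{
    TransactionId xid = 123456789;

    /* the page, byte, and bit position holding this xid's status */
    printf("xid %u -> page %u, byte %u, bit index %u\n",
           xid,
           TransactionIdToPage(xid),
           TransactionIdToByte(xid),
           TransactionIdToBIndex(xid));
    return 0;
}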

Overall this idea sounds promising, but I think the work involved is more
than the benefit I am expecting for the current optimization we are
discussing.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 08:06:47
Message-ID: CAA4eK1+9AcXMBAnWi5ntVwmQfd3pR4m7UMJtoEd1Lb6zPemgHQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 17, 2015 at 1:32 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> On Mon, Sep 21, 2015 at 6:34 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> >
> > On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > > Increasing CLOG buffers to 64 helps in reducing the contention due to
second
> > > reason. Experiments revealed that increasing CLOG buffers only helps
> > > once the contention around ProcArrayLock is reduced.
> >
>
> Overall this idea sounds promising, but I think the work involved is more
> than the benefit I am expecting for the current optimization we are
> discussing.
>

Sorry, I think the last line is slightly confusing; let me try to
write it again:

Overall this idea sounds promising, but I think the work involved is more
than the benefit expected from the current optimization we are
discussing.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 09:15:35
Message-ID: CANP8+jKGcCypXg6cTsAx=vOze81wHrEmEUPu9qWj8BfwvB9Thw@mail.gmail.com
Lists: pgsql-hackers

On 17 November 2015 at 06:50, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> In anycase, I went ahead and tried further reducing the CLogControlLock
> contention by grouping the transaction status updates. The basic idea
> is same as is used to reduce the ProcArrayLock contention [1] which is to
> allow one of the proc to become leader and update the transaction status
> for
> other active transactions in system. This has helped to reduce the
> contention
> around CLOGControlLock.
>

Sounds good. The technique has proved effective with proc array and makes
sense to use here also.

> Attached patch group_update_clog_v1.patch
> implements this idea.
>

I don't think we should be doing this only for transactions that don't
have subtransactions. We are trying to speed up real cases, not just
benchmarks.

So +1 for the concept; the patch is going in the right direction,
though let's do the full press-up.

The above data indicates that contention due to CLogControlLock is
> reduced by around 50% with this patch.
>
> The reasons for remaining contention could be:
>
> 1. Readers of clog data (checking transaction status data) can take
> Exclusive CLOGControlLock when reading the page from disk, this can
> contend with other Readers (shared lockers of CLogControlLock) and with
> exclusive locker which updates transaction status. One of the ways to
> mitigate this contention is to increase the number of CLOG buffers for
> which
> patch has been already posted on this thread.
>
> 2. Readers of clog data (checking transaction status data) takes shared
> CLOGControlLock which can contend with exclusive locker (Group leader)
> which
> updates transaction status. I have tried to reduce the amount of work done
> by group leader, by allowing group leader to just read the Clog page once
> for all the transactions in the group which updated the same CLOG page
> (idea similar to what we currently we use for updating the status of
> transactions
> having sub-transaction tree), but that hasn't given any further
> performance boost,
> so I left it.
>
> I think we can use some other ways as well to reduce the contention around
> CLOGControlLock by doing somewhat major surgery around SLRU like using
> buffer pools similar to shared buffers, but this idea gives us moderate
> improvement without much impact on exiting mechanism.
>

My earlier patch to reduce contention by changing required lock level is
still valid here. Increasing the number of buffers doesn't do enough to
remove that.

I'm working on a patch to use a fast-update area like we use for GIN. If a
page is not available when we want to record commit, just store it in a
hash table, when not in crash recovery. I'm experimenting with writing WAL
for any xids earlier than last checkpoint, though we could also trickle
writes and/or flush them in batches at checkpoint time - your code would
help there.

The hash table can also be used for lookups. My thinking is that most reads
of older xids are caused by long running transactions, so they cause a page
fault at commit and then other page faults later when people read them back
in. The hash table works for both kinds of page fault.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 11:27:20
Message-ID: CAA4eK1LJmK=CfHf3xu+fPHMPzs6GYXON_Ni7LvncP7LddWaXWQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 17, 2015 at 2:45 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 17 November 2015 at 06:50, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
>
>> In anycase, I went ahead and tried further reducing the CLogControlLock
>> contention by grouping the transaction status updates. The basic idea
>> is same as is used to reduce the ProcArrayLock contention [1] which is to
>> allow one of the proc to become leader and update the transaction status
>> for
>> other active transactions in system. This has helped to reduce the
>> contention
>> around CLOGControlLock.
>>
>
> Sounds good. The technique has proved effective with proc array and makes
> sense to use here also.
>
>
>> Attached patch group_update_clog_v1.patch
>> implements this idea.
>>
>
> I don't think we should be doing this only for transactions that don't
> have subtransactions.
>

The reason for not applying this optimization to subtransactions is that
each backend needs to advertise the information the group leader requires
for updating the transaction status, and if we want to do it for
subtransactions, then all the subtransaction ids need to be advertised as
well. The tricky part is that the number of subtransactions whose status
needs to be updated is dynamic, so reserving memory for them would be
difficult. However, we could reserve some space in Proc like we do for
XidCache (the cache of subtransaction ids) and then use that to advertise
that many Xids at a time, or just allow this optimization when the number
of subtransactions is less than or equal to the size of this new XidCache.
I am not sure it is a good idea to use the existing XidCache for this
purpose, in which case we would need separate space in PGProc. I
don't see allocating space for 64 or so subxids as a problem, however
doing it for a bigger number could be a cause for concern.
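
(To make the memory trade-off concrete, the kind of reservation being
discussed would look roughly like the sketch below. The struct, the field
names and the 64-entry bound are hypothetical, not taken from any patch.)

/* Hypothetical per-backend advertisement area for a grouped CLOG update,
 * sized like the existing subxid cache.  Names and bound are invented. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define CLOG_GROUP_MAX_ADVERTISED_SUBXIDS 64   /* hypothetical bound */

typedef struct ClogGroupAdvert
{
    TransactionId mainXid;     /* top-level xid whose status is to be set */
    int           nsubxids;    /* number of valid entries in subxids[] */
    TransactionId subxids[CLOG_GROUP_MAX_ADVERTISED_SUBXIDS];
    int           status;      /* requested commit/abort status */
} ClogGroupAdvert;

int main(void)
{
    /* The concern above: this cost is paid once per backend in shmem. */
    printf("per-backend advertisement space: %zu bytes\n",
           sizeof(ClogGroupAdvert));
    return 0;
}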

> We are trying to speed up real cases, not just benchmarks.
>
> So +1 for the concept, patch is going in right direction though lets do
> the full press-up.
>
>
I have mentioned above the reason for not doing it for subtransactions; if
you think it is viable to reserve space in shared memory for this purpose,
then I can include the optimization for subtransactions as well.

> The above data indicates that contention due to CLogControlLock is
>> reduced by around 50% with this patch.
>>
>> The reasons for remaining contention could be:
>>
>> 1. Readers of clog data (checking transaction status data) can take
>> Exclusive CLOGControlLock when reading the page from disk, this can
>> contend with other Readers (shared lockers of CLogControlLock) and with
>> exclusive locker which updates transaction status. One of the ways to
>> mitigate this contention is to increase the number of CLOG buffers for
>> which
>> patch has been already posted on this thread.
>>
>> 2. Readers of clog data (checking transaction status data) takes shared
>> CLOGControlLock which can contend with exclusive locker (Group leader)
>> which
>> updates transaction status. I have tried to reduce the amount of work
>> done
>> by group leader, by allowing group leader to just read the Clog page once
>> for all the transactions in the group which updated the same CLOG page
>> (idea similar to what we currently we use for updating the status of
>> transactions
>> having sub-transaction tree), but that hasn't given any further
>> performance boost,
>> so I left it.
>>
>> I think we can use some other ways as well to reduce the contention around
>> CLOGControlLock by doing somewhat major surgery around SLRU like using
>> buffer pools similar to shared buffers, but this idea gives us moderate
>> improvement without much impact on exiting mechanism.
>>
>
> My earlier patch to reduce contention by changing required lock level is
> still valid here. Increasing the number of buffers doesn't do enough to
> remove that.
>
>
I understand that increasing the number of buffers alone is not
enough; that's why I have tried this group-leader idea. However,
if we do something along the lines of what you have described below
(handling page faults), it could avoid the need for increasing the buffers.

> I'm working on a patch to use a fast-update area like we use for GIN. If a
> page is not available when we want to record commit, just store it in a
> hash table, when not in crash recovery. I'm experimenting with writing WAL
> for any xids earlier than last checkpoint, though we could also trickle
> writes and/or flush them in batches at checkpoint time - your code would
> help there.
>
> The hash table can also be used for lookups. My thinking is that most
> reads of older xids are caused by long running transactions, so they cause
> a page fault at commit and then other page faults later when people read
> them back in. The hash table works for both kinds of page fault.
>
>

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 11:34:17
Message-ID: CANP8+j+q4+ZP0JExgvDRPBpW6cDjb15nvjBR3iWmyoHH--3dCg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 November 2015 at 11:27, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> Attached patch group_update_clog_v1.patch
>>> implements this idea.
>>>
>>
>> I don't think we should be doing this only for transactions that don't
>> have subtransactions.
>>
>
> The reason for not doing this optimization for subtransactions is that we
> need to advertise the information that Group leader needs for updating
> the transaction status and if we want to do it for sub transactions, then
> all the subtransaction id's needs to be advertised. Now here the tricky
> part is that number of subtransactions for which the status needs to
> be updated is dynamic, so reserving memory for it would be difficult.
> However, we can reserve some space in Proc like we do for XidCache
> (cache of sub transaction ids) and then use that to advertise that many
> Xid's at-a-time or just allow this optimization if number of
> subtransactions
> is lesser than or equal to the size of this new XidCache. I am not sure
> if it is good idea to use the existing XidCache for this purpose in which
> case we need to have a separate space in PGProc for this purpose. I
> don't see allocating space for 64 or so subxid's as a problem, however
> doing it for bigger number could be cause of concern.
>
>
>> We are trying to speed up real cases, not just benchmarks.
>>
>> So +1 for the concept, patch is going in right direction though lets do
>> the full press-up.
>>
>>
> I have mentioned above the reason for not doing it for sub transactions, if
> you think it is viable to reserve space in shared memory for this purpose,
> then
> I can include the optimization for subtransactions as well.
>

The number of subxids is unbounded, so as you say, reserving shmem isn't
viable.

I'm interested in real world cases, so allocating 65 xids per process isn't
needed, but what we can say is that the optimization shouldn't break down
abruptly in the presence of a small/reasonable number of subtransactions.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 11:48:20
Message-ID: CAA4eK1+pgGLNuumZo6swNZGd1_=Sfve0fuT58JQ-KpYKF4064A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 17 November 2015 at 11:27, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> We are trying to speed up real cases, not just benchmarks.
>>>
>>> So +1 for the concept, patch is going in right direction though lets do
>>> the full press-up.
>>>
>>>
>> I have mentioned above the reason for not doing it for sub transactions,
>> if
>> you think it is viable to reserve space in shared memory for this
>> purpose, then
>> I can include the optimization for subtransactions as well.
>>
>
> The number of subxids is unbounded, so as you say, reserving shmem isn't
> viable.
>
> I'm interested in real world cases, so allocating 65 xids per process
> isn't needed, but we can say is that the optimization shouldn't break down
> abruptly in the presence of a small/reasonable number of subtransactions.
>
>
I think in that case what we can do is: if the total number of
subtransactions is less than or equal to 64 (we can find that out from the
overflowed flag in PGXact), then apply this optimization, else use
the existing flow to update the transaction status. I think for that we
don't even need to reserve any additional memory. Does that sound
sensible to you?
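
(In pseudo-C, the decision being proposed amounts to something like the
standalone sketch below; the struct only models the relevant PGXact fields
and nothing here is the actual patch.)

/* Standalone sketch: use the grouped CLOG update only when the backend's
 * subxid cache did not overflow, i.e. every subxid is known.  The struct
 * and function names are invented for illustration. */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    bool overflowed;   /* did the subxid cache overflow (more than 64)? */
    int  nxids;        /* number of cached subtransaction ids */
} FakePGXact;

static bool use_group_clog_update(const FakePGXact *pgxact)
{
    return !pgxact->overflowed;   /* else take the existing update path */
}

int main(void)
{
    FakePGXact a = { false, 3 };
    FakePGXact b = { true, 64 };

    printf("a: %s\n", use_group_clog_update(&a) ? "group update" : "existing path");
    printf("b: %s\n", use_group_clog_update(&b) ? "group update" : "existing path");
    return 0;
}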

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 12:20:16
Message-ID: CAA4eK1LjL6OrCKBubTtwPZgcYdZJ6ytW=wsGPv+w2JLjo6Zxjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 17, 2015 at 5:18 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs <simon(at)2ndquadrant(dot)com>
> wrote:
>
>> On 17 November 2015 at 11:27, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>>
>> We are trying to speed up real cases, not just benchmarks.
>>>>
>>>> So +1 for the concept, patch is going in right direction though lets do
>>>> the full press-up.
>>>>
>>>>
>>> I have mentioned above the reason for not doing it for sub transactions,
>>> if
>>> you think it is viable to reserve space in shared memory for this
>>> purpose, then
>>> I can include the optimization for subtransactions as well.
>>>
>>
>> The number of subxids is unbounded, so as you say, reserving shmem isn't
>> viable.
>>
>> I'm interested in real world cases, so allocating 65 xids per process
>> isn't needed, but we can say is that the optimization shouldn't break down
>> abruptly in the presence of a small/reasonable number of subtransactions.
>>
>>
> I think in that case what we can do is if the total number of
> sub transactions is lesser than equal to 64 (we can find that by
> overflowed flag in PGXact) , then apply this optimisation, else use
> the existing flow to update the transaction status. I think for that we
> don't even need to reserve any additional memory.
>

I think this won't work as it is, because the subxids in XidCache could be
on different pages, in which case we would either need an additional flag
in the XidCache array or a separate array to indicate which subxids
we want to update the status for. I don't see any better way to do this
optimization for subtransactions; do you have something else in
mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 13:00:54
Message-ID: CANP8+j+hZ8ekC++eMGq33+MFiNrPNwc-GnBNfRdRjktDPd+G0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 November 2015 at 11:48, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs <simon(at)2ndquadrant(dot)com>
> wrote:
>
>> On 17 November 2015 at 11:27, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>>
>> We are trying to speed up real cases, not just benchmarks.
>>>>
>>>> So +1 for the concept, patch is going in right direction though lets do
>>>> the full press-up.
>>>>
>>>>
>>> I have mentioned above the reason for not doing it for sub transactions,
>>> if
>>> you think it is viable to reserve space in shared memory for this
>>> purpose, then
>>> I can include the optimization for subtransactions as well.
>>>
>>
>> The number of subxids is unbounded, so as you say, reserving shmem isn't
>> viable.
>>
>> I'm interested in real world cases, so allocating 65 xids per process
>> isn't needed, but we can say is that the optimization shouldn't break down
>> abruptly in the presence of a small/reasonable number of subtransactions.
>>
>>
> I think in that case what we can do is if the total number of
> sub transactions is lesser than equal to 64 (we can find that by
> overflowed flag in PGXact) , then apply this optimisation, else use
> the existing flow to update the transaction status. I think for that we
> don't even need to reserve any additional memory. Does that sound
> sensible to you?
>

I understand you to mean that the leader should look backwards through the
queue collecting xids while !(PGXACT->overflowed)

No additional shmem is required

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-17 13:41:28
Message-ID: CAA4eK1Ksgc=-5oPq67KoZ0z1B8v+gRJ8UNWJGtsqou=WuFUdWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 17 November 2015 at 11:48, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
>> On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs <simon(at)2ndquadrant(dot)com>
>> wrote:
>>
>>> On 17 November 2015 at 11:27, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>>> wrote:
>>>
>>> We are trying to speed up real cases, not just benchmarks.
>>>>>
>>>>> So +1 for the concept, patch is going in right direction though lets
>>>>> do the full press-up.
>>>>>
>>>>>
>>>> I have mentioned above the reason for not doing it for sub
>>>> transactions, if
>>>> you think it is viable to reserve space in shared memory for this
>>>> purpose, then
>>>> I can include the optimization for subtransactions as well.
>>>>
>>>
>>> The number of subxids is unbounded, so as you say, reserving shmem isn't
>>> viable.
>>>
>>> I'm interested in real world cases, so allocating 65 xids per process
>>> isn't needed, but we can say is that the optimization shouldn't break down
>>> abruptly in the presence of a small/reasonable number of subtransactions.
>>>
>>>
>> I think in that case what we can do is if the total number of
>> sub transactions is lesser than equal to 64 (we can find that by
>> overflowed flag in PGXact) , then apply this optimisation, else use
>> the existing flow to update the transaction status. I think for that we
>> don't even need to reserve any additional memory. Does that sound
>> sensible to you?
>>
>
> I understand you to mean that the leader should look backwards through the
> queue collecting xids while !(PGXACT->overflowed)
>
>
Yes, that is what the above idea is, but the problem is that the leader
won't be able to collect the subxids of the member procs (from each member
proc's XidCache), as it doesn't have the information about which of those
subxids need to be updated as part of the current transaction status update
(for subtransactions on different clog pages, we update the status in
multiple phases). I think the above idea could only be used
if all the subtransactions are on the same page, which we can identify in
TransactionIdSetPageStatus(). Though it seems acceptable to apply this
optimization only when the number of subtransactions is less than 65 and
all of them are on the same page, it would still be better if we could
apply it generically for all cases where the number of subtransactions is
small (say 32 or 64). Does this explanation clarify the problem with the
above idea to handle subtransactions?
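
(For clarity, here is a simplified, standalone model of the "all on the
same page" test I am referring to. CLOG_XACTS_PER_PAGE matches the 32768
xids-per-page figure mentioned elsewhere in this thread, but the helper
names are invented and this is not the actual TransactionIdSetPageStatus()
code.)

/* Simplified model of checking whether a top-level xid and all of its
 * subxids fall on the same CLOG page.  Helper names are invented. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define CLOG_XACTS_PER_PAGE 32768

static int xid_to_clog_page(TransactionId xid)
{
    return (int) (xid / CLOG_XACTS_PER_PAGE);
}

static bool all_on_same_page(TransactionId xid,
                             const TransactionId *subxids, int nsub)
{
    int page = xid_to_clog_page(xid);

    for (int i = 0; i < nsub; i++)
        if (xid_to_clog_page(subxids[i]) != page)
            return false;
    return true;
}

int main(void)
{
    TransactionId subs1[] = { 1001, 1002 };
    TransactionId subs2[] = { 1001, 40000 };    /* crosses a page boundary */

    printf("%d\n", all_on_same_page(1000, subs1, 2));   /* prints 1 */
    printf("%d\n", all_on_same_page(1000, subs2, 2));   /* prints 0 */
    return 0;
}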

> No additional shmem is required
>
>
If we want to do it for all cases where the number of subtransactions
is small, then we need extra memory.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-27 07:32:15
Message-ID: CAA4eK1+_66KZxf8PA4xU21z7rhGz7==n7ytb-eVJWdhJDF+s+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 17 November 2015 at 11:48, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
>>
>> I think in that case what we can do is if the total number of
>> sub transactions is lesser than equal to 64 (we can find that by
>> overflowed flag in PGXact) , then apply this optimisation, else use
>> the existing flow to update the transaction status. I think for that we
>> don't even need to reserve any additional memory. Does that sound
>> sensible to you?
>>
>
> I understand you to mean that the leader should look backwards through the
> queue collecting xids while !(PGXACT->overflowed)
>
> No additional shmem is required
>
>
Okay, as discussed I have handled the case of sub-transactions without
additional shmem in the attached patch. Apart from that, I have tried
to apply this optimization to prepared transactions as well, but the
dummy proc used for such transactions doesn't have a semaphore like
backend procs do, so it is not possible to use such a proc in the group
status update, as each group member needs to wait on its semaphore. It
would not be difficult to add support for that case if we are okay with
creating an additional semaphore for each such dummy proc, which I was
not sure about, so I have left it for now.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v2.patch application/octet-stream 13.6 KB

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-28 20:17:25
Message-ID: CAMkU=1yk-ad3AkfQd8uWPFDYQR941v+uNWLtJEjjr5nA1D95AA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 26, 2015 at 11:32 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>
>> On 17 November 2015 at 11:48, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>>
>>>
>>> I think in that case what we can do is if the total number of
>>> sub transactions is lesser than equal to 64 (we can find that by
>>> overflowed flag in PGXact) , then apply this optimisation, else use
>>> the existing flow to update the transaction status. I think for that we
>>> don't even need to reserve any additional memory. Does that sound
>>> sensible to you?
>>
>>
>> I understand you to mean that the leader should look backwards through the
>> queue collecting xids while !(PGXACT->overflowed)
>>
>> No additional shmem is required
>>
>
> Okay, as discussed I have handled the case of sub-transactions without
> additional shmem in the attached patch. Apart from that, I have tried
> to apply this optimization for Prepared transactions as well, but as
> the dummy proc used for such transactions doesn't have semaphore like
> backend proc's, so it is not possible to use such a proc in group status
> updation as each group member needs to wait on semaphore. It is not tad
> difficult to add the support for that case if we are okay with creating
> additional
> semaphore for each such dummy proc which I was not sure, so I have left
> it for now.

Is this proposal instead of, or in addition to, the original thread
topic of increasing clog buffers to 64?

Thanks,

Jeff


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-11-29 11:59:56
Message-ID: CAA4eK1JrRFXTw_ozFc5q4vv6t2OL=tpktxznk-Q4uFkYrL6aUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 29, 2015 at 1:47 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> On Thu, Nov 26, 2015 at 11:32 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs <simon(at)2ndquadrant(dot)com>
wrote:
> >>
> >> On 17 November 2015 at 11:48, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> >>>
> >>>
> >>> I think in that case what we can do is if the total number of
> >>> sub transactions is lesser than equal to 64 (we can find that by
> >>> overflowed flag in PGXact) , then apply this optimisation, else use
> >>> the existing flow to update the transaction status. I think for that
we
> >>> don't even need to reserve any additional memory. Does that sound
> >>> sensible to you?
> >>
> >>
> >> I understand you to mean that the leader should look backwards through
the
> >> queue collecting xids while !(PGXACT->overflowed)
> >>
> >> No additional shmem is required
> >>
> >
> > Okay, as discussed I have handled the case of sub-transactions without
> > additional shmem in the attached patch. Apart from that, I have tried
> > to apply this optimization for Prepared transactions as well, but as
> > the dummy proc used for such transactions doesn't have semaphore like
> > backend proc's, so it is not possible to use such a proc in group status
> > updation as each group member needs to wait on semaphore. It is not tad
> > difficult to add the support for that case if we are okay with creating
> > additional
> > semaphore for each such dummy proc which I was not sure, so I have left
> > it for now.
>
> Is this proposal instead of, or in addition to, the original thread
> topic of increasing clog buffers to 64?
>

This is in addition to increasing the clog buffers to 64, but with this
patch (group clog update), the effect of increasing the clog buffers will
be smaller.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-02 15:29:11
Message-ID: CA+TgmoahCx6XgprR=p5==cF0g9uhSHsJxVdWdUEHN9H2Mv0gkw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 27, 2015 at 2:32 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Okay, as discussed I have handled the case of sub-transactions without
> additional shmem in the attached patch. Apart from that, I have tried
> to apply this optimization for Prepared transactions as well, but as
> the dummy proc used for such transactions doesn't have semaphore like
> backend proc's, so it is not possible to use such a proc in group status
> updation as each group member needs to wait on semaphore. It is not tad
> difficult to add the support for that case if we are okay with creating
> additional
> semaphore for each such dummy proc which I was not sure, so I have left
> it for now.

"updation" is not a word. "acquirations" is not a word. "penality"
is spelled wrong.

I think the approach this patch takes is pretty darned strange, and
almost certainly not what we want. What you're doing here is putting
every outstanding CLOG-update request into a linked list, and then the
leader goes and does all of those CLOG updates. But there's no
guarantee that the pages that need to be updated are even present in a
CLOG buffer. If it turns out that all of the batched CLOG updates are
part of resident pages, then this is going to work great, just like
the similar ProcArrayLock optimization. But if the pages are not
resident, then you will get WORSE concurrency and SLOWER performance
than the status quo. The leader will sit there and read every page
that is needed, and to do that it will repeatedly release and
reacquire CLogControlLock (inside SimpleLruReadPage). If you didn't
have a leader, the reads of all those pages could happen at the same
time, but with this design, they get serialized. That's not good.

My idea for how this could possibly work is that you could have a list
of waiting backends for each SLRU buffer page. Pages with waiting
backends can't be evicted without performing the updates for which
backends are waiting. Updates to non-resident pages just work as they
do now. When a backend acquires CLogControlLock to perform updates to
a given page, it also performs all other pending updates to that page
and releases those waiters. When a backend acquires CLogControlLock
to evict a page, it must perform any pending updates and write the
page before completing the eviction.
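
(To make that bookkeeping concrete, here is a toy, standalone model of the
rule: each buffer carries a list of pending status updates that must be
drained before the page is updated, written out or evicted. The structures
and names below are invented for illustration only.)

/* Toy model of per-SLRU-buffer pending updates.  Invented names; not the
 * actual SLRU code. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define MAX_PENDING 4

typedef struct
{
    int           pageno;                  /* CLOG page held by this buffer */
    int           npending;                /* queued updates for this page */
    TransactionId pendingXid[MAX_PENDING];
    int           pendingStatus[MAX_PENDING];
} FakeSlruBuffer;

/* Apply and drain all queued updates.  Must run whenever a backend updates
 * the page anyway, and before the page is written out or evicted. */
static void apply_pending(FakeSlruBuffer *buf)
{
    for (int i = 0; i < buf->npending; i++)
        printf("page %d: set xid %u -> status %d\n",
               buf->pageno, (unsigned) buf->pendingXid[i],
               buf->pendingStatus[i]);
    buf->npending = 0;
}

static void evict(FakeSlruBuffer *buf)
{
    apply_pending(buf);        /* no eviction with unapplied updates queued */
    printf("page %d written and evicted\n", buf->pageno);
}

int main(void)
{
    FakeSlruBuffer b = { 7, 2, { 100, 101 }, { 1, 1 } };

    evict(&b);
    return 0;
}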

I agree with Simon that it's probably a good idea for this
optimization to handle cases where a backend has a non-overflowed list
of subtransactions. That seems doable. Handling the case where the
subxid list has overflowed seems unimportant; it should happen rarely
and is therefore not performance-critical. Also, handling the case
where the XIDs are spread over multiple pages seems far too
complicated to be worth the effort of trying to fit into a "fast
path". Optimizing the case where there are 1+ XIDs that need to be
updated but all on the same page should cover well over 90% of commits
on real systems, very possibly over 99%. That should be plenty good
enough to get whatever contention-reduction benefit is possible here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-03 06:48:45
Message-ID: CAA4eK1+VD=f0EPd_FoGX60ygWDguHDxu8DsrPDr_DS2yy4r88w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 2, 2015 at 8:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>
> I think the approach this patch takes is pretty darned strange, and
> almost certainly not what we want. What you're doing here is putting
> every outstanding CLOG-update request into a linked list, and then the
> leader goes and does all of those CLOG updates. But there's no
> guarantee that the pages that need to be updated are even present in a
> CLOG buffer. If it turns out that all of the batched CLOG updates are
> part of resident pages, then this is going to work great, just like
> the similar ProcArrayLock optimization. But if the pages are not
> resident, then you will get WORSE concurrency and SLOWER performance
> than the status quo. The leader will sit there and read every page
> that is needed, and to do that it will repeatedly release and
> reacquire CLogControlLock (inside SimpleLruReadPage). If you didn't
> have a leader, the reads of all those pages could happen at the same
> time, but with this design, they get serialized. That's not good.
>

I think the way to address this is to not add a backend to the group list
if it doesn't intend to update the same page as the group leader. For
transactions to be on different pages, they have to be 32768 transaction
ids apart, and I don't see much possibility of that happening for concurrent
transactions that are going to be grouped.
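
(As a toy, single-threaded sketch of that rule: a backend pushes itself
onto the pending group only if the current group head is updating the same
CLOG page, otherwise it updates the page on its own. The list push mimics
the atomic technique used for the ProcArray group clear, but every name
below is invented and this is not the patch.)

/* Toy model: join the CLOG update group only for a matching page. */
#include <stdatomic.h>
#include <stdio.h>

#define INVALID_PROCNO (-1)
#define NPROCS 8

typedef struct
{
    int clogPage;      /* CLOG page this backend wants to update */
    int nextMember;    /* next proc index in the group list */
} FakeProc;

static FakeProc procs[NPROCS];
static _Atomic int groupFirst = INVALID_PROCNO;   /* head of the group list */

/* Returns 1 if we pushed ourselves onto the group list, 0 if we should
 * update the CLOG page ourselves because the group is for another page. */
static int try_join_group(int myproc, int mypage)
{
    int head = atomic_load(&groupFirst);

    procs[myproc].clogPage = mypage;
    for (;;)
    {
        if (head != INVALID_PROCNO && procs[head].clogPage != mypage)
            return 0;                         /* group is for another page */

        procs[myproc].nextMember = head;
        if (atomic_compare_exchange_weak(&groupFirst, &head, myproc))
            return 1;                         /* joined (or started) a group */
        /* CAS failed: head was reloaded with the new list head; retry */
    }
}

int main(void)
{
    printf("%d\n", try_join_group(0, 5));   /* starts a group for page 5 */
    printf("%d\n", try_join_group(1, 5));   /* joins the same group */
    printf("%d\n", try_join_group(2, 6));   /* different page: goes alone */
    return 0;
}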

> My idea for how this could possibly work is that you could have a list
> of waiting backends for each SLRU buffer page.
>

Won't this mean that we first need to ensure that the page exists in one of
the buffers, and only once the page is in an SLRU buffer can we form the
list and ensure that it is processed before eviction?
If my understanding is right, then for this to work we would probably need
to acquire CLogControlLock in Shared mode, in addition to acquiring it
in Exclusive mode for updating the status on the page and performing the
pending updates for other backends.

>
> I agree with Simon that it's probably a good idea for this
> optimization to handle cases where a backend has a non-overflowed list
> of subtransactions. That seems doable.
>

Agreed, and I have already handled it in the last version of the patch
I posted.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-08 19:32:20
Message-ID: CA+TgmoaoBJ1s98OjOwTgH7XV5zJaHCRTvvg+tP1yJHk4z=reyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 3, 2015 at 1:48 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I think the way to address is don't add backend to Group list if it is
> not intended to update the same page as Group leader. For transactions
> to be on different pages, they have to be 32768 transactionid's far apart
> and I don't see much possibility of that happening for concurrent
> transactions that are going to be grouped.

That might work.

>> My idea for how this could possibly work is that you could have a list
>> of waiting backends for each SLRU buffer page.
>
> Won't this mean that first we need to ensure that page exists in one of
> the buffers and once we have page in SLRU buffer, we can form the
> list and ensure that before eviction, the list must be processed?
> If my understanding is right, then for this to work we need to probably
> acquire CLogControlLock in Shared mode in addition to acquiring it
> in Exclusive mode for updating the status on page and performing
> pending updates for other backends.

Hmm, that wouldn't be good. You're right: this is a problem with my
idea. We can try what you suggested above and see how that works. We
could also have two or more slots for groups - if a backend doesn't
get the lock, it joins the existing group for the same page, or else
creates a new group if any slot is unused. I think it might be
advantageous to have at least two groups because otherwise things
might slow down when some transactions are rolling over to a new page
while others are still in flight for the previous page. Perhaps we
should try it both ways and benchmark.
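
(A toy sketch of the multiple-slot idea - join the slot already working on
your page, else claim a free slot, else fall back to a solo update. All
names are invented, and this is a simplified, single-threaded illustration.)

/* Toy model of choosing among multiple CLOG update group slots. */
#include <stdio.h>

#define NUM_CLOG_GROUPS 2
#define NO_PAGE (-1)

static int groupSlotPage[NUM_CLOG_GROUPS] = { NO_PAGE, NO_PAGE };

/* Returns the slot index to join, or -1 to update alone. */
static int choose_group_slot(int mypage)
{
    for (int i = 0; i < NUM_CLOG_GROUPS; i++)
        if (groupSlotPage[i] == mypage)
            return i;                       /* join the group for my page */

    for (int i = 0; i < NUM_CLOG_GROUPS; i++)
        if (groupSlotPage[i] == NO_PAGE)
        {
            groupSlotPage[i] = mypage;      /* claim a free slot */
            return i;
        }

    return -1;                              /* all slots busy: solo update */
}

int main(void)
{
    printf("%d\n", choose_group_slot(10));  /* claims slot 0 */
    printf("%d\n", choose_group_slot(11));  /* claims slot 1 */
    printf("%d\n", choose_group_slot(10));  /* joins slot 0 */
    printf("%d\n", choose_group_slot(12));  /* -1: falls back to solo path */
    return 0;
}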

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-12 13:03:10
Message-ID: CAA4eK1+SoW3FBrdZV+3m34uCByK3DMPy_9QQs34yvN8spByzyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 9, 2015 at 1:02 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Dec 3, 2015 at 1:48 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > I think the way to address is don't add backend to Group list if it is
> > not intended to update the same page as Group leader. For transactions
> > to be on different pages, they have to be 32768 transactionid's far
apart
> > and I don't see much possibility of that happening for concurrent
> > transactions that are going to be grouped.
>
> That might work.
>

Okay, attached patch group_update_clog_v3.patch implements the above.

> >> My idea for how this could possibly work is that you could have a list
> >> of waiting backends for each SLRU buffer page.
> >
> > Won't this mean that first we need to ensure that page exists in one of
> > the buffers and once we have page in SLRU buffer, we can form the
> > list and ensure that before eviction, the list must be processed?
> > If my understanding is right, then for this to work we need to probably
> > acquire CLogControlLock in Shared mode in addition to acquiring it
> > in Exclusive mode for updating the status on page and performing
> > pending updates for other backends.
>
> Hmm, that wouldn't be good. You're right: this is a problem with my
> idea. We can try what you suggested above and see how that works. We
> could also have two or more slots for groups - if a backend doesn't
> get the lock, it joins the existing group for the same page, or else
> creates a new group if any slot is unused.
>

I have implemented this idea as well in the attached patch
group_slots_update_clog_v3.patch

> I think it might be
> advantageous to have at least two groups because otherwise things
> might slow down when some transactions are rolling over to a new page
> while others are still in flight for the previous page. Perhaps we
> should try it both ways and benchmark.
>

Sure, I can do the benchmarks with both patches, but before that it would
be helpful if you could check whether group_slots_update_clog_v3.patch is
in line with what you have in mind.

Note - I have used the attached patch transactionid_burner_v1.patch
(extracted from Jeff's patch upthread) to verify the behaviour for
transactions that fall on different page boundaries.

Attachment Content-Type Size
group_update_clog_v3.patch application/octet-stream 14.3 KB
group_slots_update_clog_v3.patch application/octet-stream 15.6 KB
transactionid_burner_v1.patch application/octet-stream 2.3 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-16 18:31:05
Message-ID: CA+TgmoanL8NwuZTnFxDeM5w+HLy1zuHZrtTMF99uM68ySxd2ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 12, 2015 at 8:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

>> I think it might be
>> advantageous to have at least two groups because otherwise things
>> might slow down when some transactions are rolling over to a new page
>> while others are still in flight for the previous page. Perhaps we
>> should try it both ways and benchmark.
>>
>
> Sure, I can do the benchmarks with both the patches, but before that
> if you can once check whether group_slots_update_clog_v3.patch is inline
> with what you have in mind then it will be helpful.

Benchmarking sounds good. This looks broadly like what I was thinking
about, although I'm not very sure you've got all the details right.

Some random comments:

- TransactionGroupUpdateXidStatus could do just as well without
add_proc_to_group. You could just say if (group_no >= NUM_GROUPS)
break; instead. Also, I think you could combine the two if statements
inside the loop. if (nextidx != INVALID_PGPROCNO &&
ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
something like that.

- memberXid and memberXidstatus are terrible names. Member of what?
That's going to be clear as mud to the next person looking at the
definitiono f PGPROC. And the capitalization of memberXidstatus isn't
even consistent. Nor is nextupdateXidStatusElem. Please do give some
thought to the names you pick for variables and structure members.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-18 06:16:01
Message-ID: CAA4eK1+Ebf4W7D74_1NfLZYEpUoXUKzxTjqxdankW0xMhqkLYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 17, 2015 at 12:01 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sat, Dec 12, 2015 at 8:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
> >> I think it might be
> >> advantageous to have at least two groups because otherwise things
> >> might slow down when some transactions are rolling over to a new page
> >> while others are still in flight for the previous page. Perhaps we
> >> should try it both ways and benchmark.
> >>
> >
> > Sure, I can do the benchmarks with both the patches, but before that
> > if you can once check whether group_slots_update_clog_v3.patch is inline
> > with what you have in mind then it will be helpful.
>
> Benchmarking sounds good. This looks broadly like what I was thinking
> about, although I'm not very sure you've got all the details right.
>
>
Unfortunately, I didn't have access to the high-end Intel m/c on which I
took the performance data last time, so I ran the tests on a Power-8 m/c
where the I/O sub-system is not that good. As a result, the write
performance data at a lower scale factor like 300 is reasonably good, while
at a higher scale factor (>= 1000) the test is mainly I/O bound, so there
is not much difference with or without the patch.

Performance Data
-----------------------------
M/c configuration:
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Attached files show the performance data with both the patches at
scale factor 300 and 1000.

Read Patch-1 and Patch-2 in graphs as below:

Patch-1 - group_update_clog_v3.patch
Patch-2 - group_slots_update_clog_v3.patch

Observations
----------------------
1. At scale factor 300, there is a gain of 11% at 128 clients and
27% at 256 clients with Patch-1. At 4 clients, the performance with the
patch is 0.6% lower (which might be run-to-run variation, or there could
be a small regression, but I think it is too small to be bothered about).

2. At scale factor 1000, there is no visible difference; at lower client
counts there is a <1% regression, which could be due to the I/O-bound
nature of the test.

3. On these runs, Patch-2 is almost always worse than Patch-1, but
the difference between them is not significant.

> Some random comments:
>
> - TransactionGroupUpdateXidStatus could do just as well without
> add_proc_to_group. You could just say if (group_no >= NUM_GROUPS)
> break; instead. Also, I think you could combine the two if statements
> inside the loop. if (nextidx != INVALID_PGPROCNO &&
> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
> something like that.
>
> - memberXid and memberXidstatus are terrible names. Member of what?
>

How about changing them to clogGroupMemberXid and
clogGroupMemberXidStatus?

> That's going to be clear as mud to the next person looking at the
> definitiono f PGPROC.

I understand that you don't like the naming convention, but using
such harsh language could sometimes hurt others.

> And the capitalization of memberXidstatus isn't
> even consistent. Nor is nextupdateXidStatusElem. Please do give some
> thought to the names you pick for variables and structure members.
>
>
Got it, I will do so.

Let me know what you think about whether we need to proceed with the slots
approach and gather some more performance data.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
image/png 59.3 KB
image/png 61.9 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-18 16:28:41
Message-ID: CA+TgmobysOsEajkhvMdEGFMH4shcAneGTUNXnPd_O-_=f8ucSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> 1. At scale factor 300, there is gain of 11% at 128-client count and
> 27% at 256 client count with Patch-1. At 4 clients, the performance with
> Patch is 0.6% less (which might be a run-to-run variation or there could
> be a small regression, but I think it is too less to be bothered about)
>
> 2. At scale factor 1000, there is no visible difference and there is some
> at lower client count there is a <1% regression which could be due to
> I/O bound nature of test.
>
> 3. On these runs, Patch-2 is mostly always worse than Patch-1, but
> the difference between them is not significant.

Hmm, that's interesting. So the slots don't help. I was concerned
that with only a single slot, you might have things moving quickly
until you hit the point where you switch over to the next clog
segment, and then you get a bad stall. It sounds like that either
doesn't happen in practice, or more likely it does happen but the
extra slot doesn't eliminate the stall because there's I/O at that
point. Either way, it sounds like we can forget the slots idea for
now.

>> Some random comments:
>>
>> - TransactionGroupUpdateXidStatus could do just as well without
>> add_proc_to_group. You could just say if (group_no >= NUM_GROUPS)
>> break; instead. Also, I think you could combine the two if statements
>> inside the loop. if (nextidx != INVALID_PGPROCNO &&
>> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
>> something like that.
>>
>> - memberXid and memberXidstatus are terrible names. Member of what?
>
> How about changing them to clogGroupMemberXid and
> clogGroupMemberXidStatus?

What we've currently got for group XID clearing for the ProcArray is
clearXid, nextClearXidElem, and backendLatestXid. We should try to
make these things consistent. Maybe rename those to
procArrayGroupMember, procArrayGroupNext, procArrayGroupXid and then
start all of these identifiers with clogGroup as you propose.

>> That's going to be clear as mud to the next person looking at the
>> definitiono f PGPROC.
>
> I understand that you don't like the naming convention, but using
> such harsh language could sometimes hurt others.

Sorry. If I am slightly frustrated here I think it is because this
same point has been raised about three times now, by me and also by
Andres, just with respect to this particular technique, and also on
other patches. But you are right - that is no excuse for being rude.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-21 06:27:25
Message-ID: CAA4eK1JCzEksmYYquxHrVP2-B_shsydg80TH3DOVy8u=bgZNwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> >> Some random comments:
> >>
> >> - TransactionGroupUpdateXidStatus could do just as well without
> >> add_proc_to_group. You could just say if (group_no >= NUM_GROUPS)
> >> break; instead. Also, I think you could combine the two if statements
> >> inside the loop. if (nextidx != INVALID_PGPROCNO &&
> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
> >> something like that.
> >>

Changed as per suggestion.

> >> - memberXid and memberXidstatus are terrible names. Member of what?
> >
> > How about changing them to clogGroupMemberXid and
> > clogGroupMemberXidStatus?
>
> What we've currently got for group XID clearing for the ProcArray is
> clearXid, nextClearXidElem, and backendLatestXid. We should try to
> make these things consistent. Maybe rename those to
> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
>

Here procArrayGroupXid sounds like an Xid at the group level; how about
procArrayGroupMemberXid?
Please find the patch with the renamed variables for PGProc
(rename_pgproc_variables_v1.patch) attached to this mail.

> and then
> start all of these identifiers with clogGroup as you propose.
>

I have changed them accordingly in the attached patch
(group_update_clog_v4.patch) and addressed other comments given by
you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
rename_pgproc_variables_v1.patch application/octet-stream 5.1 KB
group_update_clog_v4.patch application/octet-stream 14.3 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-22 17:13:32
Message-ID: CA+Tgmoab1AOcR6Fwqe2cYZQ7LeyzyL+DaWE=t06Ya7_Ho5C2hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 21, 2015 at 1:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>>
>> >> Some random comments:
>> >>
>> >> - TransactionGroupUpdateXidStatus could do just as well without
>> >> add_proc_to_group. You could just say if (group_no >= NUM_GROUPS)
>> >> break; instead. Also, I think you could combine the two if statements
>> >> inside the loop. if (nextidx != INVALID_PGPROCNO &&
>> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
>> >> something like that.
>> >>
>
> Changed as per suggestion.
>
>> >> - memberXid and memberXidstatus are terrible names. Member of what?
>> >
>> > How about changing them to clogGroupMemberXid and
>> > clogGroupMemberXidStatus?
>>
>> What we've currently got for group XID clearing for the ProcArray is
>> clearXid, nextClearXidElem, and backendLatestXid. We should try to
>> make these things consistent. Maybe rename those to
>> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
>>
>
> Here procArrayGroupXid sounds like Xid at group level, how about
> procArrayGroupMemberXid?
> Find the patch with renamed variables for PGProc
> (rename_pgproc_variables_v1.patch) attached with mail.

I sort of hate to make these member names any longer, but I wonder if
we should make it procArrayGroupClearXid etc. Otherwise it might be
confused with some other type of grouping of PGPROCs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-23 09:16:21
Message-ID: CAA4eK1+5VvzD_u-Y504p-7SKdOUsiKikd8tfhVyziDQk93fwOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 22, 2015 at 10:43 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Dec 21, 2015 at 1:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:
> >>
> >> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> >> wrote:
> >>
> >> >> Some random comments:
> >> >>
> >> >> - TransactionGroupUpdateXidStatus could do just as well without
> >> >> add_proc_to_group. You could just say if (group_no >= NUM_GROUPS)
> >> >> break; instead. Also, I think you could combine the two if
statements
> >> >> inside the loop. if (nextidx != INVALID_PGPROCNO &&
> >> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
> >> >> something like that.
> >> >>
> >
> > Changed as per suggestion.
> >
> >> >> - memberXid and memberXidstatus are terrible names. Member of what?
> >> >
> >> > How about changing them to clogGroupMemberXid and
> >> > clogGroupMemberXidStatus?
> >>
> >> What we've currently got for group XID clearing for the ProcArray is
> >> clearXid, nextClearXidElem, and backendLatestXid. We should try to
> >> make these things consistent. Maybe rename those to
> >> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
> >>
> >
> > Here procArrayGroupXid sounds like Xid at group level, how about
> > procArrayGroupMemberXid?
> > Find the patch with renamed variables for PGProc
> > (rename_pgproc_variables_v1.patch) attached with mail.
>
> I sort of hate to make these member names any longer, but I wonder if
> we should make it procArrayGroupClearXid etc.

If we go by this suggestion, then the names will look like:

PGProc
{
..
bool             procArrayGroupClearXid;
pg_atomic_uint32 procArrayGroupNextClearXid;
TransactionId    procArrayGroupLatestXid;
..
}

PROC_HDR
{
..
pg_atomic_uint32 procArrayGroupFirstClearXid;
..
}

I think the names I sent in the last patch were better. It seems to me it
is better to add some comments above the variable names, so that anybody
referring to them can understand them better; I have added such comments
in the attached patch rename_pgproc_variables_v2.patch to explain the same.

> Otherwise it might be
> confused with some other time of grouping of PGPROCs.
>

Won't the procArray prefix distinguish it from other types of grouping?

About the CLogControlLock patch: yesterday I noticed that the SLRU code
can return an error in some cases, which could lead to a hang for group
members, as once the group leader errors out there is no one to wake
them up. However, on looking further into the code, I found that
this path (TransactionIdSetPageStatus()) is always called inside a critical
section (RecordTransactionCommit() ensures the same), so any
ERROR in this path will be converted to a PANIC, which doesn't require
any wakeup mechanism for the group members. In any case, if you
find any other way which can lead to an error (not being converted to
PANIC), I have already handled the error case in the attached patch
(group_update_clog_error_handling_v4.patch), and if you also don't
find any such case, then the previous patch stands good.
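
(For anyone wondering why no wakeup mechanism is needed on this path: an
ERROR raised inside a critical section is promoted to PANIC, so stranded
group members are moot. Below is a toy, standalone illustration of that
promotion rule only; it is not PostgreSQL's actual elog machinery.)

/* Toy illustration: errors inside a critical section become PANIC. */
#include <stdio.h>
#include <stdlib.h>

static int CritSectionCount = 0;

static void start_crit_section(void) { CritSectionCount++; }
static void end_crit_section(void)   { CritSectionCount--; }

static void toy_report_error(const char *msg)
{
    if (CritSectionCount > 0)
    {
        fprintf(stderr, "PANIC: %s\n", msg);  /* promoted: server goes down */
        abort();
    }
    fprintf(stderr, "ERROR: %s\n", msg);      /* ordinary recoverable error */
}

int main(void)
{
    toy_report_error("outside a critical section");  /* just an ERROR */

    start_crit_section();
    toy_report_error("inside a critical section");   /* becomes PANIC */
    end_crit_section();                               /* never reached */
    return 0;
}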

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
rename_pgproc_variables_v2.patch application/octet-stream 5.3 KB
group_update_clog_error_handling_v4.patch application/octet-stream 16.2 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-24 02:50:47
Message-ID: CAB7nPqQR0oe1SJKUPiyoRThO8-KQ-+eXOxnb3WCVrPc97wbO1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 23, 2015 at 6:16 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> blah.

autovacuum log: Moved to next CF as thread is really active.
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2015-12-25 01:06:56
Message-ID: CA+TgmoajjobHxBdORn_N9giVKdEDihz6_KsBXpbZjMY7RCo+=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Dec 22, 2015 at 10:43 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> On Mon, Dec 21, 2015 at 1:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>> > On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> > wrote:
>> >>
>> >> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> >> wrote:
>> >>
>> >> >> Some random comments:
>> >> >>
>> >> >> - TransactionGroupUpdateXidStatus could do just as well without
>> >> >> add_proc_to_group. You could just say if (group_no >= NUM_GROUPS)
>> >> >> break; instead. Also, I think you could combine the two if
>> >> >> statements
>> >> >> inside the loop. if (nextidx != INVALID_PGPROCNO &&
>> >> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
>> >> >> something like that.
>> >> >>
>> >
>> > Changed as per suggestion.
>> >
>> >> >> - memberXid and memberXidstatus are terrible names. Member of what?
>> >> >
>> >> > How about changing them to clogGroupMemberXid and
>> >> > clogGroupMemberXidStatus?
>> >>
>> >> What we've currently got for group XID clearing for the ProcArray is
>> >> clearXid, nextClearXidElem, and backendLatestXid. We should try to
>> >> make these things consistent. Maybe rename those to
>> >> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
>> >>
>> >
>> > Here procArrayGroupXid sounds like Xid at group level, how about
>> > procArrayGroupMemberXid?
>> > Find the patch with renamed variables for PGProc
>> > (rename_pgproc_variables_v1.patch) attached with mail.
>>
>> I sort of hate to make these member names any longer, but I wonder if
>> we should make it procArrayGroupClearXid etc.
>
> If we go by this suggestion, then the name will look like:
> PGProc
> {
> ..
> bool procArrayGroupClearXid, pg_atomic_uint32 procArrayGroupNextClearXid,
> TransactionId procArrayGroupLatestXid;
> ..
>
> PROC_HDR
> {
> ..
> pg_atomic_uint32 procArrayGroupFirstClearXid;
> ..
> }
>
> I think whatever I sent in last patch were better. It seems to me it is
> better to add some comments before variable names, so that anybody
> referring them can understand better and I have added comments in
> attached patch rename_pgproc_variables_v2.patch to explain the same.

Well, I don't know. Anybody else have an opinion?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-01-07 05:24:18
Message-ID: CAA4eK1K2RefbYcUAGuKAy4LLhGBV6a9DfnWnhp17++JEzVseqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 25, 2015 at 6:36 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >> >
> >> > Here procArrayGroupXid sounds like Xid at group level, how about
> >> > procArrayGroupMemberXid?
> >> > Find the patch with renamed variables for PGProc
> >> > (rename_pgproc_variables_v1.patch) attached with mail.
> >>
> >> I sort of hate to make these member names any longer, but I wonder if
> >> we should make it procArrayGroupClearXid etc.
> >
> > If we go by this suggestion, then the name will look like:
> > PGProc
> > {
> > ..
> > bool procArrayGroupClearXid, pg_atomic_uint32 procArrayGroupNextClearXid,
> > TransactionId procArrayGroupLatestXid;
> > ..
> >
> > PROC_HDR
> > {
> > ..
> > pg_atomic_uint32 procArrayGroupFirstClearXid;
> > ..
> > }
> >
> > I think whatever I sent in last patch were better. It seems to me it is
> > better to add some comments before variable names, so that anybody
> > referring them can understand better and I have added comments in
> > attached patch rename_pgproc_variables_v2.patch to explain the same.
>
> Well, I don't know. Anybody else have an opinion?
>
>
It seems that either people don't have any opinion on this matter or they
are okay with either of the naming conventions being discussed. I think
specifying Member after procArrayGroup helps distinguish which
variables are specific to the whole group and which are specific to a
particular member. I think that will be helpful for other places as well
if we use this technique to improve performance. Let me know what
you think.

I have verified that the previous patches still apply cleanly and pass
make check-world. To avoid confusion, I am attaching the latest
patches with this mail.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
rename_pgproc_variables_v2.patch application/octet-stream 5.3 KB
group_update_clog_v4.patch application/octet-stream 14.3 KB

From: Thom Brown <thom(at)linux(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-09 13:56:07
Message-ID: CAA-aLv440-oCQmZD1E5cTRPUo6Ec1zeE1vYzwRHqpceRwrM7eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 7 January 2016 at 05:24, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Dec 25, 2015 at 6:36 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>> >> >
>> >> > Here procArrayGroupXid sounds like Xid at group level, how about
>> >> > procArrayGroupMemberXid?
>> >> > Find the patch with renamed variables for PGProc
>> >> > (rename_pgproc_variables_v1.patch) attached with mail.
>> >>
>> >> I sort of hate to make these member names any longer, but I wonder if
>> >> we should make it procArrayGroupClearXid etc.
>> >
>> > If we go by this suggestion, then the name will look like:
>> > PGProc
>> > {
>> > ..
>> > bool procArrayGroupClearXid, pg_atomic_uint32
>> > procArrayGroupNextClearXid,
>> > TransactionId procArrayGroupLatestXid;
>> > ..
>> >
>> > PROC_HDR
>> > {
>> > ..
>> > pg_atomic_uint32 procArrayGroupFirstClearXid;
>> > ..
>> > }
>> >
>> > I think whatever I sent in last patch were better. It seems to me it is
>> > better to add some comments before variable names, so that anybody
>> > referring them can understand better and I have added comments in
>> > attached patch rename_pgproc_variables_v2.patch to explain the same.
>>
>> Well, I don't know. Anybody else have an opinion?
>>
>
> It seems that either people don't have any opinion on this matter or they
> are okay with either of the naming conventions being discussed. I think
> specifying Member after procArrayGroup can help distinguishing which
> variables are specific to the whole group and which are specific to a
> particular member. I think that will be helpful for other places as well
> if we use this technique to improve performance. Let me know what
> you think about the same.
>
> I have verified that previous patches can be applied cleanly and passes
> make check-world. To avoid confusion, I am attaching the latest
> patches with this mail.

Patches still apply 1 month later.

I don't really have an opinion on the variable naming. I guess they
only need making longer if there's going to be some confusion about
what they're for, but I'm guessing it's not a blocker here.

Thom


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-10 04:14:53
Message-ID: CAA4eK1JY4M+P9VT6VJkuNzO2CYYnc2eT08qRQ6cRYU0eUy2mVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 9, 2016 at 7:26 PM, Thom Brown <thom(at)linux(dot)com> wrote:
>
> On 7 January 2016 at 05:24, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Fri, Dec 25, 2015 at 6:36 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:
> >>
> >> On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> >> wrote:
> >> >> >
> >> >> > Here procArrayGroupXid sounds like Xid at group level, how about
> >> >> > procArrayGroupMemberXid?
> >> >> > Find the patch with renamed variables for PGProc
> >> >> > (rename_pgproc_variables_v1.patch) attached with mail.
> >> >>
> >> >> I sort of hate to make these member names any longer, but I wonder if
> >> >> we should make it procArrayGroupClearXid etc.
> >> >
> >> > If we go by this suggestion, then the name will look like:
> >> > PGProc
> >> > {
> >> > ..
> >> > bool procArrayGroupClearXid, pg_atomic_uint32
> >> > procArrayGroupNextClearXid,
> >> > TransactionId procArrayGroupLatestXid;
> >> > ..
> >> >
> >> > PROC_HDR
> >> > {
> >> > ..
> >> > pg_atomic_uint32 procArrayGroupFirstClearXid;
> >> > ..
> >> > }
> >> >
> >> > I think whatever I sent in last patch were better. It seems to me it is
> >> > better to add some comments before variable names, so that anybody
> >> > referring them can understand better and I have added comments in
> >> > attached patch rename_pgproc_variables_v2.patch to explain the same.
> >>
> >> Well, I don't know. Anybody else have an opinion?
> >>
> >
> > It seems that either people don't have any opinion on this matter or they
> > are okay with either of the naming conventions being discussed. I think
> > specifying Member after procArrayGroup can help distinguishing which
> > variables are specific to the whole group and which are specific to a
> > particular member. I think that will be helpful for other places as well
> > if we use this technique to improve performance. Let me know what
> > you think about the same.
> >
> > I have verified that previous patches can be applied cleanly and passes
> > make check-world. To avoid confusion, I am attaching the latest
> > patches with this mail.
>
> Patches still apply 1 month later.
>

Thanks for verification!

>
> I don't really have an opinion on the variable naming. I guess they
> only need making longer if there's going to be some confusion about
> what they're for,

Makes sense; that is the reason why I have added a few comments
as well, but I am not sure if you are suggesting something else.

> but I'm guessing it's not a blocker here.
>

I also think so, but I am not sure what else is required here. The basic
idea of rename_pgproc_variables_v2.patch is to rename a few variables
in existing similar code, so that the main patch (group_update_clog) can
adopt that naming convention if required; other than that, I have handled
all the review comments raised in this thread (mainly by Simon and Robert).

Is there anything I can do to move this forward?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-11 14:04:54
Message-ID: CA+Tgmoa0w1gKTPSq7js4HN-KJjzF3ESBZR2xqLYYOL-+nPZAkg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 9, 2016 at 11:14 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> Patches still apply 1 month later.
>
> Thanks for verification!
>
>>
>> I don't really have an opinion on the variable naming. I guess they
>> only need making longer if there's going to be some confusion about
>> what they're for,
>
> makes sense, that is the reason why I have added few comments
> as well, but not sure if you are suggesting something else.
>
>> but I'm guessing it's not a blocker here.
>>
>
> I also think so, but not sure what else is required here. The basic
> idea of this rename_pgproc_variables_v2.patch is to rename
> few variables in existing similar code, so that the main patch
> group_update_clog can adapt those naming convention if required,
> other than that I have handled all review comments raised in this
> thread (mainly by Simon and Robert).
>
> Is there anything, I can do to move this forward?

Well, looking at this again, I think I'm OK to go with your names.
That doesn't seem like the thing to hold up the patch for. So I'll go
ahead and push the renaming patch now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-11 14:32:52
Message-ID: CA+Tgmoa-OdD3CAAd5v8RwxYay-PvGVBwu8pcn4SBrEH_qsyYdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 11, 2016 at 9:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Is there anything, I can do to move this forward?
>
> Well, looking at this again, I think I'm OK to go with your names.
> That doesn't seem like the thing to hold up the patch for. So I'll go
> ahead and push the renaming patch now.

On the substantive part of the patch, this doesn't look safe:

+ /*
+ * Add ourselves to the list of processes needing a group XID status
+ * update.
+ */
+ proc->clogGroupMember = true;
+ proc->clogGroupMemberXid = xid;
+ proc->clogGroupMemberXidStatus = status;
+ proc->clogGroupMemberPage = pageno;
+ proc->clogGroupMemberLsn = lsn;
+ while (true)
+ {
+ nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
+
+ /*
+ * Add the proc to list if the clog page where we need to update the
+ * current transaction status is same as group leader's clog page.
+ */
+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+ return false;

DANGER HERE!

+ pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
+
+ if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
+ &nextidx,
+ (uint32) proc->pgprocno))
+ break;
+ }

There is a potential ABA problem here. Suppose that this code
executes in one process as far as the line that says DANGER HERE.
Then, the group leader wakes up, performs all of the CLOG
modifications, performs another write transaction, and again becomes
the group leader, but for a different member page. Then, the original
process that went to sleep at DANGER HERE wakes up. At this point,
the pg_atomic_compare_exchange_u32 will succeed and we'll have
processes with different pages in the list, contrary to the intention
of the code.

This kind of thing can be really subtle and difficult to fix. The
problem might not happen even with a very large amount of testing, and
then might happen in the real world on some other hardware or on
really rare occasions. In general, compare-and-swap loops need to be
really really simple with minimal dependencies on other data, ideally
none. It's very hard to make anything else work.
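To make the window concrete, here is a minimal standalone sketch of the shape
of the problem - nothing PostgreSQL-specific, just the pattern of a CAS loop
whose bail-out decision depends on data read before the exchange (all names
here are illustrative):

#include <stdatomic.h>
#include <stdbool.h>

#define NPROCS        128
#define INVALID_INDEX ((unsigned) -1)

struct member
{
    int       page;          /* page this member wants updated */
    _Atomic unsigned next;   /* index of the next member in the list */
};

static struct member   members[NPROCS];
static _Atomic unsigned list_head = INVALID_INDEX;

/* Returns true if 'me' was linked into the group list. */
static bool
join_group(unsigned me)
{
    for (;;)
    {
        unsigned head = atomic_load(&list_head);

        /* WINDOW: we inspect the presumptive leader's page here ... */
        if (head != INVALID_INDEX && members[head].page != members[me].page)
            return false;

        members[me].next = head;

        /*
         * ... but by the time this CAS runs, the old list may have been
         * consumed and a new list (for a different page) may have grown
         * back to the very same head index, so the CAS succeeds even
         * though the check above no longer describes reality.  That is
         * the ABA hazard being described.
         */
        if (atomic_compare_exchange_weak(&list_head, &head, me))
            return true;
    }
}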

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-12 05:55:45
Message-ID: CAA4eK1JwMRcpDMYE72dAByVTM0Z95zaU-1-UiqW5X+-wwopd7A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 11, 2016 at 8:02 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On the substantive part of the patch, this doesn't look safe:
>
> + /*
> + * Add ourselves to the list of processes needing a group XID status
> + * update.
> + */
> + proc->clogGroupMember = true;
> + proc->clogGroupMemberXid = xid;
> + proc->clogGroupMemberXidStatus = status;
> + proc->clogGroupMemberPage = pageno;
> + proc->clogGroupMemberLsn = lsn;
> + while (true)
> + {
> + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> +
> + /*
> + * Add the proc to list if the clog page where we need to update the
> + * current transaction status is same as group leader's clog page.
> + */
> + if (nextidx != INVALID_PGPROCNO &&
> + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> + return false;
>
> DANGER HERE!
>
> + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> +
> + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> + &nextidx,
> + (uint32) proc->pgprocno))
> + break;
> + }
>
> There is a potential ABA problem here. Suppose that this code
> executes in one process as far as the line that says DANGER HERE.
> Then, the group leader wakes up, performs all of the CLOG
> modifications, performs another write transaction, and again becomes
> the group leader, but for a different member page. Then, the original
> process that went to sleep at DANGER HERE wakes up. At this point,
> the pg_atomic_compare_exchange_u32 will succeed and we'll have
> processes with different pages in the list, contrary to the intention
> of the code.
>

Very good catch. I think if we want to address this, we can detect
the non-group-leader transactions that try to update a different
CLOG page (different from the group leader's) after acquiring
CLogControlLock, and then mark these transactions such that after
waking they need to perform the CLOG update via the normal path.
Now this can decrease the latency of such transactions, but I
think there will be only very few transactions, if any, which
can face this condition, because most of the concurrent transactions
should be on the same page; otherwise the idea of multiple slots we
tried upthread would have shown benefits.
Another idea could be that we update the comments indicating the
possibility of multiple CLOG-page updates in the same group, on the
basis that such cases will be rare and, even if they happen, they
won't affect the transaction status update.
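To sketch that first idea a little more concretely (illustrative only; the
deferred flag below is hypothetical and not part of the attached patch):

#include <stdbool.h>

#define NPROCS        128
#define INVALID_INDEX ((unsigned) -1)

struct member
{
    int      page;       /* clog page this member wants updated */
    unsigned next;       /* index of the next member in the list */
    bool     deferred;   /* set by the leader: redo via the normal path */
};

static struct member members[NPROCS];

/*
 * Called by the group leader after it has acquired the control lock
 * exclusively and detached the whole list, with 'head' as its first entry.
 */
static void
leader_service_group(unsigned head, int leader_page)
{
    unsigned idx = head;

    while (idx != INVALID_INDEX)
    {
        struct member *m = &members[idx];

        if (m->page == leader_page)
        {
            /* update the CLOG entry for this member's xid on leader_page */
        }
        else
        {
            /* different page: skip it and let it retry on its own */
            m->deferred = true;
        }

        idx = m->next;
        /* wake the member here (a semaphore in the real code) */
    }
}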

Do you have anything else in mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-13 04:40:19
Message-ID: CA+TgmoZq0VfRmH6qt+tYaAhLQt0amRqkHTowjx0h0hfmT0BFPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 12, 2016 at 12:55 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Very Good Catch. I think if we want to address this we can detect
> the non-group leader transactions that tries to update the different
> CLOG page (different from group-leader) after acquiring
> CLogControlLock and then mark these transactions such that
> after waking they need to perform CLOG update via normal path.
> Now this can decrease the latency of such transactions, but I

I think you mean "increase".

> think there will be only very few transactions if at-all there which
> can face this condition, because most of the concurrent transactions
> should be on same page, otherwise the idea of multiple-slots we
> have tried upthread would have shown benefits.
> Another idea could be that we update the comments indicating the
> possibility of multiple Clog-page updates in same group on the basis
> that such cases will be less and even if it happens, it won't effect the
> transaction status update.

I think either of those approaches could work, as long as the
logic is correct and the comments are clear. The important thing is
that the code had better do something safe if this situation ever
occurs, and the comments had better be clear that this is a possible
situation so that someone modifying the code in the future doesn't
think it's impossible, rely on it not happening, and consequently
introduce a very-low-probability bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-21 04:57:25
Message-ID: CAA4eK1KUVPxBcGTdOuKyvf5p1sQ0HeUbSMbTxtQc=P65OxiZog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Feb 13, 2016 at 10:10 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Feb 12, 2016 at 12:55 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > Very Good Catch. I think if we want to address this we can detect
> > the non-group leader transactions that tries to update the different
> > CLOG page (different from group-leader) after acquiring
> > CLogControlLock and then mark these transactions such that
> > after waking they need to perform CLOG update via normal path.
> > Now this can decrease the latency of such transactions, but I
>
> I think you mean "increase".
>

Yes.

> > think there will be only very few transactions if at-all there which
> > can face this condition, because most of the concurrent transactions
> > should be on same page, otherwise the idea of multiple-slots we
> > have tried upthread would have shown benefits.
> > Another idea could be that we update the comments indicating the
> > possibility of multiple Clog-page updates in same group on the basis
> > that such cases will be less and even if it happens, it won't effect the
> > transaction status update.
>
> I think either approach of those approaches could work, as long as the
> logic is correct and the comments are clear. The important thing is
> that the code had better do something safe if this situation ever
> occurs, and the comments had better be clear that this is a possible
> situation so that someone modifying the code in the future doesn't
> think it's impossible, rely on it not happening, and consequently
> introduce a very-low-probability bug.
>

Okay, I have updated the comments to indicate such a possibility and the
possible improvement if we ever face such a situation. I have also
once again verified that even if a group contains transaction statuses for
multiple pages, it works fine.

Performance data with attached patch is as below.

M/c configuration
-----------------------------
RAM - 500GB
8 sockets, 64 cores(Hyperthreaded128 threads total)

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Client_Count/Patch_Ver 1 64 128 256
HEAD(481725c0) 963 28145 28593 26447
Patch-1 938 28152 31703 29402

We can see a 10~11% performance improvement, as observed
previously. You might see a 0.02% performance difference with the
patch as a regression, but that is just run-to-run variation.

Note - To take this performance data, I had to revert commit
ac1d7945, which is a known issue in HEAD as reported here [1].

[1] -
http://www.postgresql.org/message-id/CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v5.patch application/octet-stream 15.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-21 06:32:31
Message-ID: CA+TgmoYVcNRJs3D2_Nk_ykUfxSVyp+WFqgkwYE+9FPUKVNrwGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

>
> Client_Count/Patch_Ver 1 64 128 256
> HEAD(481725c0) 963 28145 28593 26447
> Patch-1 938 28152 31703 29402
>
>
> We can see 10~11% performance improvement as observed
> previously. You might see 0.02% performance difference with
> patch as regression, but that is just a run-to-run variation.
>

Don't the single-client numbers show about a 3% regression? Surely not
0.02%.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-21 06:54:18
Message-ID: CAA4eK1J70Aj6Pvt31LYMD8ptMv7+9WFN0cQLq1LGhCYPFM9njw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
>>
>> Client_Count/Patch_Ver 1 64 128 256
>> HEAD(481725c0) 963 28145 28593 26447
>> Patch-1 938 28152 31703 29402
>>
>>
>> We can see 10~11% performance improvement as observed
>> previously. You might see 0.02% performance difference with
>> patch as regression, but that is just a run-to-run variation.
>>
>
> Don't the single-client numbers show about a 3% regresssion? Surely not
> 0.02%.
>

Sorry, you are right, it is ~2.66%, but in read-write pgbench tests I
have seen such fluctuations. Also, the patch doesn't change the
single-client case. However, if you still feel that the patch could have
an impact, I can re-run the single-client case once again with different
orderings, i.e. first HEAD and then the patch, and vice versa.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-21 08:58:16
Message-ID: CA+Tgmobf5PFpoi3w4zwf5DS0DzwYYpWzHDwwHmQw5Xto1dKyOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 21, 2016 at 12:24 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
>
>> On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>>
>>>
>>> Client_Count/Patch_Ver 1 64 128 256
>>> HEAD(481725c0) 963 28145 28593 26447
>>> Patch-1 938 28152 31703 29402
>>>
>>>
>>> We can see 10~11% performance improvement as observed
>>> previously. You might see 0.02% performance difference with
>>> patch as regression, but that is just a run-to-run variation.
>>>
>>
>> Don't the single-client numbers show about a 3% regresssion? Surely not
>> 0.02%.
>>
>
>
> Sorry, you are right, it is ~2.66%, but in read-write pgbench tests, I
> could see such fluctuation. Also patch doesn't change single-client
> case. However, if you still feel that there could be impact by patch,
> I can re-run the single client case once again with different combinations
> like first with HEAD and then patch and vice versa.
>

Are these results from a single run, or median-of-three?

I mean, my basic feeling is that I would not accept a 2-3% regression in
the single client case to get a 10% speedup in the case where we have 128
clients. A lot of people will not have 128 clients; quite a few will have
a single session, or just a few. Sometimes just making the code more
complex can hurt performance in subtle ways, e.g. by making it fit into the
L1 instruction cache less well. If the numbers you have here are accurate,
I'd vote to reject the patch.

Note that we already have apparently regressed single-client performance
noticeably between 9.0 and 9.5:

http://bonesmoses.org/2016/01/08/pg-phriday-how-far-weve-come/

I bet that wasn't a single patch but a series of patches which made things
more complex to improve concurrency behavior, but in the process each one
made the single-client case a tiny bit slower. In the end, that adds up.
I think we need to put some effort into figuring out if there is a way we
can get some of that single-client performance (and ideally more) back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-21 14:15:39
Message-ID: CAA4eK1+dic2HZiqMf4o0O4acUP88R-5-A-9kXU9uz3COOW2WiA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 21, 2016 at 2:28 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sun, Feb 21, 2016 at 12:24 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
>> On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>>
>>> On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>>> wrote:
>>>
>>>>
>>>> Client_Count/Patch_Ver 1 64 128 256
>>>> HEAD(481725c0) 963 28145 28593 26447
>>>> Patch-1 938 28152 31703 29402
>>>>
>>>>
>>>> We can see 10~11% performance improvement as observed
>>>> previously. You might see 0.02% performance difference with
>>>> patch as regression, but that is just a run-to-run variation.
>>>>
>>>
>>> Don't the single-client numbers show about a 3% regresssion? Surely not
>>> 0.02%.
>>>
>>
>>
>> Sorry, you are right, it is ~2.66%, but in read-write pgbench tests, I
>> could see such fluctuation. Also patch doesn't change single-client
>> case. However, if you still feel that there could be impact by patch,
>> I can re-run the single client case once again with different combinations
>> like first with HEAD and then patch and vice versa.
>>
>
> Are these results from a single run, or median-of-three?
>
>
This was median-of-three, but the highest TPS with the patch is 1119
and with HEAD it is 969, which shows a gain at a single client count.
Sometimes I see such differences; it could be that autovacuum gets
triggered in some situations, which leads to such variations.
However, if I try 2-3 times, the difference generally disappears
unless there is some real regression, or if I just switch off autovacuum
and do a manual vacuum after each run. This time, I haven't run the
tests multiple times.

> I mean, my basic feeling is that I would not accept a 2-3% regression in
> the single client case to get a 10% speedup in the case where we have 128
> clients.
>

I understand your point. To verify whether it is run-to-run
variation or an actual regression, I will re-run these tests on a single
client multiple times and post the results.

> A lot of people will not have 128 clients; quite a few will have a
> single session, or just a few. Sometimes just making the code more complex
> can hurt performance in subtle ways, e.g. by making it fit into the L1
> instruction cache less well. If the numbers you have here are accurate,
> I'd vote to reject the patch.
>
>
One point to note is that this patch, along with the first patch I
posted in this thread to increase CLOG buffers, can significantly
reduce contention on CLogControlLock. OTOH, I think introducing a
regression at a single client is also not a sane thing to do, so let's
first try to find out whether there is actually any regression and, if
there is, whether we can mitigate it by writing the code with somewhat
fewer instructions or in a slightly different way; then we can decide
whether to reject the patch or not. Does that sound reasonable to you?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-23 13:36:46
Message-ID: CA+TgmobTdYgh24XNHMoGansYn=zCH5WzAg58jAeNA8WEVUbrbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 21, 2016 at 7:45 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> I mean, my basic feeling is that I would not accept a 2-3% regression in
>> the single client case to get a 10% speedup in the case where we have 128
>> clients.
>>
>
> I understand your point. I think to verify whether it is run-to-run
> variation or an actual regression, I will re-run these tests on single
> client multiple times and post the result.
>

Perhaps you could also try it on a couple of different machines (e.g.
MacBook Pro and a couple of different large servers).

>
> A lot of people will not have 128 clients; quite a few will have a
>> single session, or just a few. Sometimes just making the code more complex
>> can hurt performance in subtle ways, e.g. by making it fit into the L1
>> instruction cache less well. If the numbers you have here are accurate,
>> I'd vote to reject the patch.
>>
> One point to note is that this patch along with first patch which I
> posted in this thread to increase clog buffers can make significant
> reduction in contention on CLogControlLock. OTOH, I think introducing
> regression at single-client is also not a sane thing to do, so lets
> first try to find if there is actually any regression and if it is, can
> we mitigate it by writing code with somewhat fewer instructions or
> in a slightly different way and then we can decide whether it is good
> to reject the patch or not. Does that sound reasonable to you?
>

Yes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-27 04:33:22
Message-ID: CAA4eK1L4iV-2qe7AyMVsb+nz7SiX8JvCO+CqhXwaiXgm3CaBUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 23, 2016 at 7:06 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sun, Feb 21, 2016 at 7:45 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
>> I mean, my basic feeling is that I would not accept a 2-3% regression in
>>> the single client case to get a 10% speedup in the case where we have 128
>>> clients.
>>>
>>
>>
When I tried running pgbench first with the patch and then with HEAD, I
see a 1.2% performance increase with the patch. TPS with the patch is 976
and with HEAD it is 964. For the three 30-minute TPS runs, refer to "Patch –
group_clog_update_v5" and, before that, "HEAD – Commit 481725c0"
in perf_write_clogcontrollock_data_v6.ods attached with this mail.

Nonetheless, I have observed that the new check below, added by the
patch, can affect single-client performance. So I have changed it such
that the new check is done only when there is actually a need for a group
update, which means when multiple clients try to update the clog at a time.

+ if (!InRecovery &&
+ all_trans_same_page &&
+ nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
+ !IsGXactActive())

> I understand your point. I think to verify whether it is run-to-run
>> variation or an actual regression, I will re-run these tests on single
>> client multiple times and post the result.
>>
>
> Perhaps you could also try it on a couple of different machines (e.g.
> MacBook Pro and a couple of different large servers).
>

Okay, I have tried the latest patch (group_update_clog_v6.patch) on 2
different big servers and then on a MacBook Pro. The detailed data for the
various runs can be found in the attached document
perf_write_clogcontrollock_data_v6.ods. I have taken the performance data
for higher client counts with a somewhat larger scale factor (1000), and
the median of the same is as below:

M/c configuration
-----------------------------
RAM - 500GB
8 sockets, 64 cores(Hyperthreaded128 threads total)

Non-default parameters
------------------------------------
max_connections = 1000
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Client_Count/Patch_ver 1 8 64 128 256
HEAD 871 5090 17760 17616 13907
PATCH 900 5110 18331 20277 19263

Here, we can see that there is a gain of ~15% to ~38% at higher client
counts.

The attached document (perf_write_clogcontrollock_data_v6.ods) contains
data, mainly focussing on single-client performance. The data is for
multiple runs on different machines, so I thought it is better to present it
in the form of a document rather than dumping everything in e-mail. Do let
me know if there is any confusion in understanding/interpreting the data.

Thanks to Dilip Kumar for helping me in conducting the tests of this patch
on a MacBook Pro.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v6.patch application/octet-stream 15.5 KB
perf_write_clogcontrollock_data_v6.ods application/vnd.oasis.opendocument.spreadsheet 19.0 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-27 04:37:52
Message-ID: CAA4eK1KZ1HVOjy39i3FRS9MsbR6GgH8D7cqyoYJxyb2PJt05Qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
>
> Here, we can see that there is a gain of ~15% to ~38% at higher client
> count.
>
> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
> data, mainly focussing on single client performance. The data is for
> multiple runs on different machines, so I thought it is better to present
> in form of document rather than dumping everything in e-mail. Do let me
> know if there is any confusion in understanding/interpreting the data.
>
>
Forgot to mention that all these tests have been done by
reverting commit-ac1d794.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-02-29 17:40:35
Message-ID: CA+TgmoYYr9ddcVE3PytjuZBmJ8L8P00=oXYS9Rjz+AOmtRp86g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>>
>> Here, we can see that there is a gain of ~15% to ~38% at higher client
>> count.
>>
>> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
>> data, mainly focussing on single client performance. The data is for
>> multiple runs on different machines, so I thought it is better to present in
>> form of document rather than dumping everything in e-mail. Do let me know
>> if there is any confusion in understanding/interpreting the data.
>
> Forgot to mention that all these tests have been done by reverting
> commit-ac1d794.

OK, that seems better. But I have a question: if we don't really need
to make this optimization apply only when everything is on the same
page, then why even try? If we didn't try, we wouldn't need the
all_trans_same_page flag, which would reduce the amount of code
change. Would that hurt anything? Taking it even further, we could
remove the check from TransactionGroupUpdateXidStatus too. I'd be
curious to know whether that set of changes would improve performance
or regress it. Or maybe it does nothing, in which case perhaps
simpler is better.

All things being equal, it's probably better if transactions from
different pages getting into the list together is something that is more
or less expected rather than a once-in-a-blue-moon scenario - that way, if
any bugs exist, we'll find them. The downside is that we could increase
latency for the leader that way - doing other work on the same page
shouldn't hurt much, but a different page is a bigger hit. But that hit
might be trivial enough not to be worth worrying about.

+ /*
+ * Now that we've released the lock, go back and wake everybody up. We
+ * don't do this under the lock so as to keep lock hold times to a
+ * minimum. The system calls we need to perform to wake other processes
+ * up are probably much slower than the simple memory writes we did while
+ * holding the lock.
+ */

This comment was true in the place that you cut-and-pasted it from,
but it's not true here, since we potentially need to read from disk.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-01 04:13:34
Message-ID: CAA4eK1Lc7gQUMDaP4uGgAXQrdp1iv=-2O=NsmUqHZsu3MaVHng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 29, 2016 at 11:10 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> >>
> >> Here, we can see that there is a gain of ~15% to ~38% at higher client
> >> count.
> >>
> >> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
> >> data, mainly focussing on single client performance. The data is for
> >> multiple runs on different machines, so I thought it is better to present in
> >> form of document rather than dumping everything in e-mail. Do let me know
> >> if there is any confusion in understanding/interpreting the data.
> >
> > Forgot to mention that all these tests have been done by reverting
> > commit-ac1d794.
>
> OK, that seems better. But I have a question: if we don't really need
> to make this optimization apply only when everything is on the same
> page, then why even try? If we didn't try, we wouldn't need the
> all_trans_same_page flag, which would reduce the amount of code
> change.

I am not sure I understood your question; do you want to know why, in the
first place, transactions spanning more than one page call the function
TransactionIdSetPageStatus()? If we want to avoid attempting the
transaction status update when they are on different pages, then I think
we need some major changes in TransactionIdSetTreeStatus().

> Would that hurt anything? Taking it even further, we could
> remove the check from TransactionGroupUpdateXidStatus too. I'd be
> curious to know whether that set of changes would improve performance
> or regress it. Or maybe it does nothing, in which case perhaps
> simpler is better.
>
> All things being equal, it's probably better if the cases where
> transactions from different pages get into the list together is
> something that is more or less expected rather than a
> once-in-a-blue-moon scenario - that way, if any bugs exist, we'll find
> them. The downside of that is that we could increase latency for the
> leader that way - doing other work on the same page shouldn't hurt
> much but different pages is a bigger hit. But that hit might be
> trivial enough not to be worth worrying about.
>

In my tests, the same-page check in TransactionGroupUpdateXidStatus()
doesn't impact performance in any way, and I think the reason is that it
rarely happens that a group contains multiple pages, and even when it does,
there is hardly any impact. So I will remove that check, and I think that
is what you also want for now.

> + /*
> + * Now that we've released the lock, go back and wake everybody up. We
> + * don't do this under the lock so as to keep lock hold times to a
> + * minimum. The system calls we need to perform to wake other processes
> + * up are probably much slower than the simple memory writes we did while
> + * holding the lock.
> + */
>
> This comment was true in the place that you cut-and-pasted it from,
> but it's not true here, since we potentially need to read from disk.
>

Okay, will change.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-01 15:47:42
Message-ID: CAA4eK1L5oeXFXW+ME=Na0oLrbo2mptsFAvHNX-49qYF+ZOLfVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 29, 2016 at 11:10 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> >>
> >> Here, we can see that there is a gain of ~15% to ~38% at higher client
> >> count.
> >>
> >> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
> >> data, mainly focussing on single client performance. The data is for
> >> multiple runs on different machines, so I thought it is better to present in
> >> form of document rather than dumping everything in e-mail. Do let me know
> >> if there is any confusion in understanding/interpreting the data.
> >
> > Forgot to mention that all these tests have been done by reverting
> > commit-ac1d794.
>
> OK, that seems better. But I have a question: if we don't really need
> to make this optimization apply only when everything is on the same
> page, then why even try?
>

This is to handle the case where sub-transactions belonging to a transaction
are on different pages. The reason is that currently I am using the XidCache
stored in each proc to pass the sub-transaction information to
TransactionIdSetPageStatusInternal(); if we allow sub-transactions from
different pages, then I need to extract from that cache the subxids that
belong to the page whose status we are trying to update. That would add a
few more cycles to the code path under the exclusive lock without any clear
benefit, which is why I have not implemented it. I have explained the same
in the code comments as well:

This optimization is only applicable if the transaction and
+ * all child sub-transactions belong to same page which we presume to be the
+ * most common case, we might be able to apply this when they are not on same
+ * page, but that needs us to map sub-transactions in proc's XidCache based
+ * on pageno for which each time Group leader needs to set the transaction
+ * status and that can lead to some performance penalty as well because it
+ * needs to be done after acquiring CLogControlLock, so let's leave that
+ * case for now.
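Just to illustrate the extra work that comment alludes to, the leader would
need to do something along these lines for every member, and it would have
to do it while still holding CLogControlLock (a simplified standalone
sketch, not code from the patch):

#include <stdint.h>

typedef uint32_t TransactionId;

#define CLOG_XACTS_PER_PAGE 32768    /* 8 kB page, 2 status bits per xact */
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)

/*
 * Copy into 'out' only those cached subxids that live on 'pageno';
 * returns how many were copied.
 */
static int
subxids_on_page(const TransactionId *subxids, int nsubxids,
                int pageno, TransactionId *out)
{
    int n = 0;

    for (int i = 0; i < nsubxids; i++)
    {
        if (TransactionIdToPage(subxids[i]) == (TransactionId) pageno)
            out[n++] = subxids[i];
    }
    return n;
}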

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: David Steele <david(at)pgmasters(dot)net>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-14 18:30:05
Message-ID: 56E7032D.6060908@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/26/16 11:37 PM, Amit Kapila wrote:

> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com
>
> Here, we can see that there is a gain of ~15% to ~38% at higher
> client count.
>
> The attached document (perf_write_clogcontrollock_data_v6.ods)
> contains data, mainly focussing on single client performance. The
> data is for multiple runs on different machines, so I thought it is
> better to present in form of document rather than dumping everything
> in e-mail. Do let me know if there is any confusion in
> understanding/interpreting the data.
>
> Forgot to mention that all these tests have been done by
> reverting commit-ac1d794.

This patch no longer applies cleanly:

$ git apply ../other/group_update_clog_v6.patch
error: patch failed: src/backend/storage/lmgr/proc.c:404
error: src/backend/storage/lmgr/proc.c: patch does not apply
error: patch failed: src/include/storage/proc.h:152
error: src/include/storage/proc.h: patch does not apply

It's not clear to me whether Robert has completed a review of this code
or whether it still needs to be reviewed more comprehensively.

Other than a comment that needs to be fixed, it seems that all questions
have been answered by Amit.

Is this "ready for committer" or still in need of further review?

--
-David
david(at)pgmasters(dot)net


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-15 05:17:12
Message-ID: CAA4eK1LGgOyn9OpiK8W3PfrXqfHsvTi0hy0y00bu50YfE_X+MA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 15, 2016 at 12:00 AM, David Steele <david(at)pgmasters(dot)net> wrote:
>
> On 2/26/16 11:37 PM, Amit Kapila wrote:
>
>> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com
>>
>> Here, we can see that there is a gain of ~15% to ~38% at higher
>> client count.
>>
>> The attached document (perf_write_clogcontrollock_data_v6.ods)
>> contains data, mainly focussing on single client performance. The
>> data is for multiple runs on different machines, so I thought it is
>> better to present in form of document rather than dumping everything
>> in e-mail. Do let me know if there is any confusion in
>> understanding/interpreting the data.
>>
>> Forgot to mention that all these tests have been done by
>> reverting commit-ac1d794.
>
>
> This patch no longer applies cleanly:
>
> $ git apply ../other/group_update_clog_v6.patch
> error: patch failed: src/backend/storage/lmgr/proc.c:404
> error: src/backend/storage/lmgr/proc.c: patch does not apply
> error: patch failed: src/include/storage/proc.h:152
> error: src/include/storage/proc.h: patch does not apply
>

For me it works with patch -p1 < <path_of_patch>, but anyhow I have
updated the patch based on a recent commit. Can you please check the latest
patch and see if it applies cleanly for you now?

>
> It's not clear to me whether Robert has completed a review of this code
> or it still needs to be reviewed more comprehensively.
>
> Other than a comment that needs to be fixed it seems that all questions
> have been answered by Amit.
>

I have updated the comments and changed the name of one variable from
"all_trans_same_page" to "all_xact_same_page", as pointed out offlist by
Alvaro.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v7.patch application/octet-stream 15.5 KB

From: David Steele <david(at)pgmasters(dot)net>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-15 14:24:19
Message-ID: 56E81B13.4000505@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/15/16 1:17 AM, Amit Kapila wrote:

> On Tue, Mar 15, 2016 at 12:00 AM, David Steele <david(at)pgmasters(dot)net
>
>> This patch no longer applies cleanly:
>>
>> $ git apply ../other/group_update_clog_v6.patch
>> error: patch failed: src/backend/storage/lmgr/proc.c:404
>> error: src/backend/storage/lmgr/proc.c: patch does not apply
>> error: patch failed: src/include/storage/proc.h:152
>> error: src/include/storage/proc.h: patch does not apply
>
> For me, with patch -p1 < <path_of_patch> it works, but any how I have
> updated the patch based on recent commit. Can you please check the
> latest patch and see if it applies cleanly for you now.

Yes, it now applies cleanly (101fd93).

--
-David
david(at)pgmasters(dot)net


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-15 14:25:49
Message-ID: CAA4eK1JtdsbYrcb-fqkGWiFgqp5D9B=1Fj6pcLWLHM_yaOuvYA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 15, 2016 at 7:54 PM, David Steele <david(at)pgmasters(dot)net> wrote:
>
> On 3/15/16 1:17 AM, Amit Kapila wrote:
>
> > On Tue, Mar 15, 2016 at 12:00 AM, David Steele <david(at)pgmasters(dot)net
> >
> >> This patch no longer applies cleanly:
> >>
> >> $ git apply ../other/group_update_clog_v6.patch
> >> error: patch failed: src/backend/storage/lmgr/proc.c:404
> >> error: src/backend/storage/lmgr/proc.c: patch does not apply
> >> error: patch failed: src/include/storage/proc.h:152
> >> error: src/include/storage/proc.h: patch does not apply
> >
> > For me, with patch -p1 < <path_of_patch> it works, but any how I have
> > updated the patch based on recent commit. Can you please check the
> > latest patch and see if it applies cleanly for you now.
>
> Yes, it now applies cleanly (101fd93).
>

Thanks for verification.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-15 15:11:53
Message-ID: 20160315151153.GA283827@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Steele wrote:

> This patch no longer applies cleanly:
>
> $ git apply ../other/group_update_clog_v6.patch

Normally "git apply -3" gives good results in these cases -- it applies
the 3-way merge algorithm just as if you had applied the patch to the
revision it was built on and later git-merged with the latest head.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-16 18:27:34
Message-ID: 56E9A596.2000607@redhat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/15/2016 01:17 AM, Amit Kapila wrote:
> I have updated the comments and changed the name of one of a variable from
> "all_trans_same_page" to "all_xact_same_page" as pointed out offlist by
> Alvaro.
>
>

I have done a run, and don't see any regressions.

Intel Xeon 28C/56T @ 2GHz w/ 256GB + 2 x RAID10 (data + xlog) SSD.

I can provide perf / flamegraph profiles if needed.

Thanks for working on this !

Best regards,
Jesper

Attachment Content-Type Size
image/png 32.3 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-17 03:39:36
Message-ID: CAA4eK1KALqw1cDELpgd1rf6vZoVoWY9td+un54hgLpyH8LfK5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 16, 2016 at 11:57 PM, Jesper Pedersen <
jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
> On 03/15/2016 01:17 AM, Amit Kapila wrote:
>>
>> I have updated the comments and changed the name of one of a variable
from
>> "all_trans_same_page" to "all_xact_same_page" as pointed out offlist by
>> Alvaro.
>>
>>
>
> I have done a run, and don't see any regressions.
>

Can you provide the details of the test, like whether this is a pgbench
read-write test and, if possible, the steps for the test execution?

I wonder if you can do the test with unlogged tables (if you are using
pgbench, then I think you need to change the CREATE TABLE commands to use
the UNLOGGED option).

>
> Intel Xeon 28C/56T @ 2GHz w/ 256GB + 2 x RAID10 (data + xlog) SSD.
>

Can you provide the CPU information (perhaps by using lscpu)?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-22 07:03:07
Message-ID: CAFiTN-vwD_J+OAL9MJNWi63NBgLWXQQCh3X2jG5a9LOdQ2GZBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 17, 2016 at 11:39 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

I have reviewed the patch; here are some review comments. I will continue
to review.

1.

+
+ /*
+ * Add the proc to list, if the clog page where we need to update the
+ * current transaction status is same as group leader's clog page.
+ */
+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+ return false;

Should we clear all these structure variables that we set above in case we
end up not adding ourselves to the group? I can see it will not cause any
problem even if we don't clear them; if we don't want to clear them, I think
we can add a comment mentioning the same.

+ proc->clogGroupMember = true;
+ proc->clogGroupMemberXid = xid;
+ proc->clogGroupMemberXidStatus = status;
+ proc->clogGroupMemberPage = pageno;
+ proc->clogGroupMemberLsn = lsn;

2.

Here we are updating our own proc; I think we don't need an atomic
operation here, since we are not yet added to the list.

+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage !=
proc->clogGroupMemberPage)
+ return false;
+
+ pg_atomic_write_u32(&proc->clogGroupNext, nextidx);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-22 10:52:21
Message-ID: 20160322105221.GD3790@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
> * Record the final state of transaction entries in the commit log for
> * all entries on a single page. Atomic only on this page.
> *
> + * Group the status update for transactions. This improves the efficiency
> + * of the transaction status update by reducing the number of lock
> + * acquisitions required for it. To achieve the group transaction status
> + * update, we need to populate the transaction status related information
> + * in shared memory and doing it for overflowed sub-transactions would need
> + * a big chunk of shared memory, so we are not doing this optimization for
> + * such cases. This optimization is only applicable if the transaction and
> + * all child sub-transactions belong to same page which we presume to be the
> + * most common case, we might be able to apply this when they are not on same
> + * page, but that needs us to map sub-transactions in proc's XidCache based
> + * on pageno for which each time a group leader needs to set the transaction
> + * status and that can lead to some performance penalty as well because it
> + * needs to be done after acquiring CLogControlLock, so let's leave that
> + * case for now. We don't do this optimization for prepared transactions
> + * as the dummy proc associated with such transactions doesn't have a
> + * semaphore associated with it and the same is required for group status
> + * update. We choose not to create a semaphore for dummy procs for this
> + * purpose as the advantage of using this optimization for prepared transactions
> + * is not clear.
> + *

I think you should try to break up some of the sentences, one of them
spans 7 lines.

I'm actually rather unconvinced that it's all that common that all
subtransactions are on one page. If you have concurrency - otherwise
there'd be not much point in this patch - they'll usually be heavily
interleaved, no? You can argue that you don't care about subxacts,
because they're more often used in less concurrent scenarios, but if
that's the argument, it should actually be made.

> * Otherwise API is same as TransactionIdSetTreeStatus()
> */
> static void
> TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> TransactionId *subxids, XidStatus status,
> - XLogRecPtr lsn, int pageno)
> + XLogRecPtr lsn, int pageno,
> + bool all_xact_same_page)
> +{
> + /*
> + * If we can immediately acquire CLogControlLock, we update the status
> + * of our own XID and release the lock. If not, use group XID status
> + * update to improve efficiency and if still not able to update, then
> + * acquire CLogControlLock and update it.
> + */
> + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> + {
> + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> + LWLockRelease(CLogControlLock);
> + }
> + else if (!all_xact_same_page ||
> + nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> + IsGXactActive() ||
> + !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> + {
> + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> +
> + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> +
> + LWLockRelease(CLogControlLock);
> + }
> +}
>

This code is a bit arcane. I think it should be restructured to
a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
lock acquire first can be rather expensive.
b) I'd rather see an explicit fallback for the
!TransactionGroupUpdateXidStatus case, roughly as sketched below; the way it
is now is too hard to understand. It's also harder to add probes to detect
whether that
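
Untested sketch of what I mean (not the patch as posted; names taken from
the quoted hunk):

static void
TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
                           TransactionId *subxids, XidStatus status,
                           XLogRecPtr lsn, int pageno,
                           bool all_xact_same_page)
{
    /* a) cases that can't use the group mechanism go straight for the lock */
    if (!all_xact_same_page ||
        nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
        IsGXactActive())
    {
        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
        TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
                                           lsn, pageno);
        LWLockRelease(CLogControlLock);
    }
    /* fast path: lock is free, just update our own status */
    else if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
    {
        TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
                                           lsn, pageno);
        LWLockRelease(CLogControlLock);
    }
    /* b) explicit fallback when the group update could not be used */
    else if (!TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
    {
        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
        TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
                                           lsn, pageno);
        LWLockRelease(CLogControlLock);
    }
}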

> +
> +/*
> + * When we cannot immediately acquire CLogControlLock in exclusive mode at
> + * commit time, add ourselves to a list of processes that need their XIDs
> + * status update.

At this point my "ABA Problem" alarm goes off. If it's not an actual
danger, can you please document close by, why not?

> The first process to add itself to the list will acquire
> + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal
> + * on behalf of all group members. This avoids a great deal of contention
> + * around CLogControlLock when many processes are trying to commit at once,
> + * since the lock need not be repeatedly handed off from one committing
> + * process to the next.
> + *
> + * Returns true, if transaction status is updated in clog page, else return
> + * false.
> + */
> +static bool
> +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> + XLogRecPtr lsn, int pageno)
> +{
> + volatile PROC_HDR *procglobal = ProcGlobal;
> + PGPROC *proc = MyProc;
> + uint32 nextidx;
> + uint32 wakeidx;
> + int extraWaits = -1;
> +
> + /* We should definitely have an XID whose status needs to be updated. */
> + Assert(TransactionIdIsValid(xid));
> +
> + /*
> + * Add ourselves to the list of processes needing a group XID status
> + * update.
> + */
> + proc->clogGroupMember = true;
> + proc->clogGroupMemberXid = xid;
> + proc->clogGroupMemberXidStatus = status;
> + proc->clogGroupMemberPage = pageno;
> + proc->clogGroupMemberLsn = lsn;
> + while (true)
> + {
> + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> +
> + /*
> + * Add the proc to list, if the clog page where we need to update the
> + * current transaction status is same as group leader's clog page.
> + * There is a race condition here such that after doing the below
> + * check and before adding this proc's clog update to a group, if the
> + * group leader already finishes the group update for this page and
> + * becomes group leader of another group which updates different clog
> + * page, then it will lead to a situation where a single group can
> + * have different clog page updates. Now the chances of such a race
> + * condition are less and even if it happens, the only downside is
> + * that it could lead to serial access of clog pages from disk if
> + * those pages are not in memory. Tests doesn't indicate any
> + * performance hit due to different clog page updates in same group,
> + * however in future, if we want to improve the situation, then we can
> + * detect the non-group leader transactions that tries to update the
> + * different CLOG page after acquiring CLogControlLock and then mark
> + * these transactions such that after waking they need to perform CLOG
> + * update via normal path.
> + */

Needs a good portion of polishing.

> + if (nextidx != INVALID_PGPROCNO &&
> + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> + return false;

I think we're returning with clogGroupMember = true - that doesn't look
right.

> + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> +
> + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> + &nextidx,
> + (uint32) proc->pgprocno))
> + break;
> + }

So this indeed has ABA type problems. And you appear to be arguing above
that that's ok. Need to ponder that for a bit.

So, we enqueue ourselves as the *head* of the wait list, if there are
other waiters. Seems like it could lead to the first element after the
leader being delayed longer than the others.

FWIW, you can move the nextidx = part out of the loop;
pg_atomic_compare_exchange will update the nextidx value from memory, so
there's no need for another load afterwards.

> + /*
> + * If the list was not empty, the leader will update the status of our
> + * XID. It is impossible to have followers without a leader because the
> + * first process that has added itself to the list will always have
> + * nextidx as INVALID_PGPROCNO.
> + */
> + if (nextidx != INVALID_PGPROCNO)
> + {
> + /* Sleep until the leader updates our XID status. */
> + for (;;)
> + {
> + /* acts as a read barrier */
> + PGSemaphoreLock(&proc->sem);
> + if (!proc->clogGroupMember)
> + break;
> + extraWaits++;
> + }
> +
> + Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> +
> + /* Fix semaphore count for any absorbed wakeups */
> + while (extraWaits-- > 0)
> + PGSemaphoreUnlock(&proc->sem);
> + return true;
> + }
> +
> + /* We are the leader. Acquire the lock on behalf of everyone. */
> + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> +
> + /*
> + * Now that we've got the lock, clear the list of processes waiting for
> + * group XID status update, saving a pointer to the head of the list.
> + * Trying to pop elements one at a time could lead to an ABA problem.
> + */
> + while (true)
> + {
> + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> + &nextidx,
> + INVALID_PGPROCNO))
> + break;
> + }

Hm. It seems like you should simply use pg_atomic_exchange_u32(),
rather than compare_exchange?
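
I.e. (sketch):

    /*
     * Atomically detach the whole list; exchange returns the previous head,
     * so no retry loop is needed.
     */
    nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
                                     INVALID_PGPROCNO);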

> diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
> index c4fd9ef..120b9c0 100644
> --- a/src/backend/access/transam/twophase.c
> +++ b/src/backend/access/transam/twophase.c
> @@ -177,7 +177,7 @@ static TwoPhaseStateData *TwoPhaseState;
> /*
> * Global transaction entry currently locked by us, if any.
> */
> -static GlobalTransaction MyLockedGxact = NULL;
> +GlobalTransaction MyLockedGxact = NULL;

Hm, I'm doubtful it's worthwhile to expose this, just so we can use an
inline function, but whatever.

> +#include "access/clog.h"
> #include "access/xlogdefs.h"
> #include "lib/ilist.h"
> #include "storage/latch.h"
> @@ -154,6 +155,17 @@ struct PGPROC
>
> uint32 wait_event_info; /* proc's wait information */
>
> + /* Support for group transaction status update. */
> + bool clogGroupMember; /* true, if member of clog group */
> + pg_atomic_uint32 clogGroupNext; /* next clog group member */
> + TransactionId clogGroupMemberXid; /* transaction id of clog group member */
> + XidStatus clogGroupMemberXidStatus; /* transaction status of clog
> + * group member */
> + int clogGroupMemberPage; /* clog page corresponding to
> + * transaction id of clog group member */
> + XLogRecPtr clogGroupMemberLsn; /* WAL location of commit record for
> + * clog group member */
> +

Man, we're surely bloating PGPROC at a prodigious rate.

That's my first pass over the code itself.

Hm. Details aside, what concerns me most is that the whole group
mechanism, as implemented, only works as long as transactions only span
a short and regular amount of time. As soon as there's some variance in
transaction duration, the likelihood of building a group where all xids
are on one page diminishes. That likely works well in benchmarking, but
I'm afraid it's much less the case in the real world, where there's
network latency involved, and where applications actually contain
computations themselves.

If I understand correctly, without having followed the thread, the
reason you came up with this batching on a per-page level is to bound
the amount of effort spent by the leader; and thus bound the latency?

I think it's worthwhile to create a benchmark that does something like
BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
completely realistic values for network RTT + application computation),
the success rate of group updates shrinks noticeably.

Greetings,

Andres Freund


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-22 12:49:48
Message-ID: CAA4eK1+pWLPZGD6xYfJP=M6WHHzES-yyYSA4qpCA-jy0npKSUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId
*subxids,
> > * Record the final state of transaction entries in the commit log for
> > * all entries on a single page. Atomic only on this page.
> > *
> > + * Group the status update for transactions. This improves the
efficiency
> > + * of the transaction status update by reducing the number of lock
> > + * acquisitions required for it. To achieve the group transaction
status
> > + * update, we need to populate the transaction status related
information
> > + * in shared memory and doing it for overflowed sub-transactions would
need
> > + * a big chunk of shared memory, so we are not doing this optimization
for
> > + * such cases. This optimization is only applicable if the transaction
and
> > + * all child sub-transactions belong to same page which we presume to
be the
> > + * most common case, we might be able to apply this when they are not
on same
> > + * page, but that needs us to map sub-transactions in proc's XidCache
based
> > + * on pageno for which each time a group leader needs to set the
transaction
> > + * status and that can lead to some performance penalty as well
because it
> > + * needs to be done after acquiring CLogControlLock, so let's leave
that
> > + * case for now. We don't do this optimization for prepared
transactions
> > + * as the dummy proc associated with such transactions doesn't have a
> > + * semaphore associated with it and the same is required for group
status
> > + * update. We choose not to create a semaphore for dummy procs for
this
> > + * purpose as the advantage of using this optimization for prepared
transactions
> > + * is not clear.
> > + *
>
> I think you should try to break up some of the sentences, one of them
> spans 7 lines.
>

Okay, I will try to do so in next version.

> I'm actually rather unconvinced that it's all that common that all
> subtransactions are on one page. If you have concurrency - otherwise
> there'd be not much point in this patch - they'll usually be heavily
> interleaved, no? You can argue that you don't care about subxacts,
> because they're more often used in less concurrent scenarios, but if
> that's the argument, it should actually be made.
>

Note that we are doing it only when a transaction has less than or equal to
64 subtransactions. Now, I am not denying that there will be cases where the
subtransactions don't all fall on the same page, but I think the chances of
such transactions participating in group mode will be low, and this patch is
mainly targeting scalability for short transactions.

>
> > * Otherwise API is same as TransactionIdSetTreeStatus()
> > */
> > static void
> > TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> > TransactionId
*subxids, XidStatus status,
> > - XLogRecPtr lsn, int
pageno)
> > + XLogRecPtr lsn, int
pageno,
> > + bool
all_xact_same_page)
> > +{
> > + /*
> > + * If we can immediately acquire CLogControlLock, we update the
status
> > + * of our own XID and release the lock. If not, use group XID
status
> > + * update to improve efficiency and if still not able to update,
then
> > + * acquire CLogControlLock and update it.
> > + */
> > + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> > + {
> > + TransactionIdSetPageStatusInternal(xid, nsubxids,
subxids, status, lsn, pageno);
> > + LWLockRelease(CLogControlLock);
> > + }
> > + else if (!all_xact_same_page ||
> > + nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> > + IsGXactActive() ||
> > + !TransactionGroupUpdateXidStatus(xid, status,
lsn, pageno))
> > + {
> > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > + TransactionIdSetPageStatusInternal(xid, nsubxids,
subxids, status, lsn, pageno);
> > +
> > + LWLockRelease(CLogControlLock);
> > + }
> > +}
> >
>
> This code is a bit arcane. I think it should be restructured to
> a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
> PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
> lock acquire first can be rather expensive.

The previous version (v5 - [1]) had the code that way, but that adds a few
extra instructions for the single-client case, and I was seeing a minor
performance regression for that case, which is why it has been changed to
the current code.

> b) I'd rather see an explicit fallback for the
> !TransactionGroupUpdateXidStatus case, this way it's too hard to
> understand. It's also harder to add probes to detect whether that
>

Considering the above reply to (a), do you want to see it done as a separate
else if branch in the patch?

>
> > +
> > +/*
> > + * When we cannot immediately acquire CLogControlLock in exclusive
mode at
> > + * commit time, add ourselves to a list of processes that need their
XIDs
> > + * status update.
>
> At this point my "ABA Problem" alarm goes off. If it's not an actual
> danger, can you please document close by, why not?
>

Why this won't lead to an ABA problem is explained in the comments below.
Refer to:

+ /*
+ * Now that we've got the lock, clear the list of processes waiting for
+ * group XID status update, saving a pointer to the head of the list.
+ * Trying to pop elements one at a time could lead to an ABA problem.
+ */

>
> > The first process to add itself to the list will acquire
> > + * CLogControlLock in exclusive mode and perform
TransactionIdSetPageStatusInternal
> > + * on behalf of all group members. This avoids a great deal of
contention
> > + * around CLogControlLock when many processes are trying to commit at
once,
> > + * since the lock need not be repeatedly handed off from one committing
> > + * process to the next.
> > + *
> > + * Returns true, if transaction status is updated in clog page, else
return
> > + * false.
> > + */
> > +static bool
> > +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> > +
XLogRecPtr lsn, int pageno)
> > +{
> > + volatile PROC_HDR *procglobal = ProcGlobal;
> > + PGPROC *proc = MyProc;
> > + uint32 nextidx;
> > + uint32 wakeidx;
> > + int extraWaits = -1;
> > +
> > + /* We should definitely have an XID whose status needs to be
updated. */
> > + Assert(TransactionIdIsValid(xid));
> > +
> > + /*
> > + * Add ourselves to the list of processes needing a group XID
status
> > + * update.
> > + */
> > + proc->clogGroupMember = true;
> > + proc->clogGroupMemberXid = xid;
> > + proc->clogGroupMemberXidStatus = status;
> > + proc->clogGroupMemberPage = pageno;
> > + proc->clogGroupMemberLsn = lsn;
> > + while (true)
> > + {
> > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +
> > + /*
> > + * Add the proc to list, if the clog page where we need
to update the
> > + * current transaction status is same as group leader's
clog page.
> > + * There is a race condition here such that after doing
the below
> > + * check and before adding this proc's clog update to a
group, if the
> > + * group leader already finishes the group update for
this page and
> > + * becomes group leader of another group which updates
different clog
> > + * page, then it will lead to a situation where a single
group can
> > + * have different clog page updates. Now the chances of
such a race
> > + * condition are less and even if it happens, the only
downside is
> > + * that it could lead to serial access of clog pages from
disk if
> > + * those pages are not in memory. Tests doesn't indicate
any
> > + * performance hit due to different clog page updates in
same group,
> > + * however in future, if we want to improve the
situation, then we can
> > + * detect the non-group leader transactions that tries to
update the
> > + * different CLOG page after acquiring CLogControlLock
and then mark
> > + * these transactions such that after waking they need to
perform CLOG
> > + * update via normal path.
> > + */
>
> Needs a good portion of polishing.
>
>
> > + if (nextidx != INVALID_PGPROCNO &&
> > + ProcGlobal->allProcs[nextidx].clogGroupMemberPage
!= proc->clogGroupMemberPage)
> > + return false;
>
> I think we're returning with clogGroupMember = true - that doesn't look
> right.
>

I think it won't create a problem, but surely it is not good to return with
it still set to true; I will change this in the next version of the patch.

>
> > + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> > +
> > + if
(pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +
&nextidx,
> > +
(uint32) proc->pgprocno))
> > + break;
> > + }
>
> So this indeed has ABA type problems. And you appear to be arguing above
> that that's ok. Need to ponder that for a bit.
>
> So, we enqueue ourselves as the *head* of the wait list, if there's
> other waiters. Seems like it could lead to the first element after the
> leader to be delayed longer than the others.
>

It will not matter, because we wake the queued processes only once we
are done with the XID status update.
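
For reference, the wake-up path in the patch looks roughly like the sketch
below (modeled on ProcArrayGroupClearXid(); see the posted patch for the
exact code):

    /* Walk the list and wake everybody up, only after the update is done. */
    while (wakeidx != INVALID_PGPROCNO)
    {
        PGPROC     *proc = &ProcGlobal->allProcs[wakeidx];

        wakeidx = pg_atomic_read_u32(&proc->clogGroupNext);
        pg_atomic_write_u32(&proc->clogGroupNext, INVALID_PGPROCNO);

        /* ensure all earlier writes are visible before the follower continues */
        pg_write_barrier();

        proc->clogGroupMember = false;

        if (proc != MyProc)
            PGSemaphoreUnlock(&proc->sem);
    }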

>
> FWIW, You can move the nextidx = part of out the loop,
> pgatomic_compare_exchange will update the nextidx value from memory; no
> need for another load afterwards.
>

Not sure if I understood which statement you are referring to here (are you
referring to the atomic read operation?) or how we can save the load
operation.

>
> > + /*
> > + * If the list was not empty, the leader will update the status
of our
> > + * XID. It is impossible to have followers without a leader
because the
> > + * first process that has added itself to the list will always
have
> > + * nextidx as INVALID_PGPROCNO.
> > + */
> > + if (nextidx != INVALID_PGPROCNO)
> > + {
> > + /* Sleep until the leader updates our XID status. */
> > + for (;;)
> > + {
> > + /* acts as a read barrier */
> > + PGSemaphoreLock(&proc->sem);
> > + if (!proc->clogGroupMember)
> > + break;
> > + extraWaits++;
> > + }
> > +
> > + Assert(pg_atomic_read_u32(&proc->clogGroupNext) ==
INVALID_PGPROCNO);
> > +
> > + /* Fix semaphore count for any absorbed wakeups */
> > + while (extraWaits-- > 0)
> > + PGSemaphoreUnlock(&proc->sem);
> > + return true;
> > + }
> > +
> > + /* We are the leader. Acquire the lock on behalf of everyone. */
> > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > + /*
> > + * Now that we've got the lock, clear the list of processes
waiting for
> > + * group XID status update, saving a pointer to the head of the
list.
> > + * Trying to pop elements one at a time could lead to an ABA
problem.
> > + */
> > + while (true)
> > + {
> > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > + if
(pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +
&nextidx,
> > +
INVALID_PGPROCNO))
> > + break;
> > + }
>
> Hm. It seems like you should should simply use pg_atomic_exchange_u32(),
> rather than compare_exchange?
>

We need to remember the head of the list to wake up the processes, which is
why I think the above loop is required.

>
> > diff --git a/src/backend/access/transam/twophase.c
b/src/backend/access/transam/twophase.c
> > index c4fd9ef..120b9c0 100644
> > --- a/src/backend/access/transam/twophase.c
> > +++ b/src/backend/access/transam/twophase.c
> > @@ -177,7 +177,7 @@ static TwoPhaseStateData *TwoPhaseState;
> > /*
> > * Global transaction entry currently locked by us, if any.
> > */
> > -static GlobalTransaction MyLockedGxact = NULL;
> > +GlobalTransaction MyLockedGxact = NULL;
>
> Hm, I'm doubtful it's worthwhile to expose this, just so we can use an
> inline function, but whatever.
>

I have done it this way considering this to be a hot path, to save an
additional function call, but I can change it if you think so.

>
> > +#include "access/clog.h"
> > #include "access/xlogdefs.h"
> > #include "lib/ilist.h"
> > #include "storage/latch.h"
> > @@ -154,6 +155,17 @@ struct PGPROC
> >
> > uint32 wait_event_info; /* proc's wait
information */
> >
> > + /* Support for group transaction status update. */
> > + bool clogGroupMember; /* true, if member of
clog group */
> > + pg_atomic_uint32 clogGroupNext; /* next clog group member
*/
> > + TransactionId clogGroupMemberXid; /* transaction id of clog
group member */
> > + XidStatus clogGroupMemberXidStatus; /*
transaction status of clog
> > +
* group member */
> > + int clogGroupMemberPage; /* clog page
corresponding to
> > +
* transaction id of clog group member */
> > + XLogRecPtr clogGroupMemberLsn; /* WAL location
of commit record for
> > +
* clog group member */
> > +
>
> Man, we're surely bloating PGPROC at a prodigious rate.
>
>
> That's my first pass over the code itself.
>
>
> Hm. Details aside, what concerns me most is that the whole group
> mechanism, as implemented, only works als long as transactions only span
> a short and regular amount of time.
>

Yes, that's the main case targeted by this patch, and I think there are many
such cases in OLTP workloads where there are very short transactions.

>
> If I understand correctly, without having followed the thread, the
> reason you came up with this batching on a per-page level is to bound
> the amount of effort spent by the leader; and thus bound the latency?
>

This is mainly to handle the case where multiple pages are not in memory and
the leader needs to perform the I/O serially. Refer to mail [2] for the point
raised by Robert.

> I think it's worthwhile to create a benchmark that does something like
> BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> completely realistic values for network RTT + application computation),
> the success rate of group updates shrinks noticeably.
>

I think it will happen that way, but what do we want to see with that
benchmark? I think the results will show that for such a workload there is
either no benefit, or much less benefit than for short transactions.

[1] -
http://www.postgresql.org/message-id/CAA4eK1KUVPxBcGTdOuKyvf5p1sQ0HeUbSMbTxtQc=P65OxiZog@mail.gmail.com
[2] -
http://www.postgresql.org/message-id/CA+TgmoahCx6XgprR=p5==cF0g9uhSHsJxVdWdUEHN9H2Mv0gkw@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-22 12:59:57
Message-ID: 20160322125957.GH3790@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-03-22 18:19:48 +0530, Amit Kapila wrote:
> > I'm actually rather unconvinced that it's all that common that all
> > subtransactions are on one page. If you have concurrency - otherwise
> > there'd be not much point in this patch - they'll usually be heavily
> > interleaved, no? You can argue that you don't care about subxacts,
> > because they're more often used in less concurrent scenarios, but if
> > that's the argument, it should actually be made.
> >
>
> Note, that we are doing it only when a transaction has less than equal to
> 64 sub transactions.

So?

> > This code is a bit arcane. I think it should be restructured to
> > a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
> > PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
> > lock acquire first can be rather expensive.
>
> The previous version (v5 - [1]) has code that way, but that adds few extra
> instructions for single client case and I was seeing minor performance
> regression for single client case due to which it has been changed as per
> current code.

I don't believe that changing conditions here is likely to cause a
measurable regression.

> > So, we enqueue ourselves as the *head* of the wait list, if there's
> > other waiters. Seems like it could lead to the first element after the
> > leader to be delayed longer than the others.
> >
>
> It will not matter because we are waking the queued process only once we
> are done with xid status update.

If there are only N cores, process N+1 won't be run immediately. But yea,
it's probably not large.

> > FWIW, You can move the nextidx = part of out the loop,
> > pgatomic_compare_exchange will update the nextidx value from memory; no
> > need for another load afterwards.
> >
>
> Not sure, if I understood which statement you are referring here (are you
> referring to atomic read operation) and how can we save the load operation?

Yes, to the atomic read. And we can save it in the loop, because
compare_exchange returns the current value if it fails.
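
I.e. roughly (untested sketch):

    nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);

    while (true)
    {
        /* page-match check from the patch elided here */
        pg_atomic_write_u32(&proc->clogGroupNext, nextidx);

        /*
         * On failure the CAS writes the value it found in clogGroupFirst
         * back into nextidx, so there's no separate read at the top of the
         * next iteration.
         */
        if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
                                           &nextidx,
                                           (uint32) proc->pgprocno))
            break;
    }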

> > > + * Now that we've got the lock, clear the list of processes
> waiting for
> > > + * group XID status update, saving a pointer to the head of the
> list.
> > > + * Trying to pop elements one at a time could lead to an ABA
> problem.
> > > + */
> > > + while (true)
> > > + {
> > > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > > + if
> (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > > +
> &nextidx,
> > > +
> INVALID_PGPROCNO))
> > > + break;
> > > + }
> >
> > Hm. It seems like you should should simply use pg_atomic_exchange_u32(),
> > rather than compare_exchange?
> >
>
> We need to remember the head of list to wake up the processes due to which
> I think above loop is required.

exchange returns the old value? There's no need for a compare here.

> > I think it's worthwhile to create a benchmark that does something like
> > BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> > INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> > completely realistic values for network RTT + application computation),
> > the success rate of group updates shrinks noticeably.
> >
>
> I think it will happen that way, but what do we want to see with that
> benchmark? I think the results will be that for such a workload either
> there is no benefit or will be very less as compare to short transactions.

Because we want our performance improvements to matter in reality, not
just in unrealistic benchmarks where the benchmarking tool is running on
the same machine as the database and uses unix sockets. That's not
actually an all that realistic workload.

Andres


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-22 14:36:13
Message-ID: CAA4eK1K9Z5L8rBu6nAk55oSTE4iC_aw=d+MyqpEy4UL6=aTG8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 22, 2016 at 6:29 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2016-03-22 18:19:48 +0530, Amit Kapila wrote:
> > > I'm actually rather unconvinced that it's all that common that all
> > > subtransactions are on one page. If you have concurrency - otherwise
> > > there'd be not much point in this patch - they'll usually be heavily
> > > interleaved, no? You can argue that you don't care about subxacts,
> > > because they're more often used in less concurrent scenarios, but if
> > > that's the argument, it should actually be made.
> > >
> >
> > Note, that we are doing it only when a transaction has less than equal
to
> > 64 sub transactions.
>
> So?
>

They should fall on one page, unless they are heavily interleaved, as you
pointed out. I think whether subtransactions are present or not, this
patch won't help for bigger transactions.

I will address your other review comments and send an updated patch.

Thanks for the review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-22 14:40:28
Message-ID: CA+TgmoaF57QK13vci5=f2BzS6TTO8hhHv+ByFFPM29yLoojXCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 22, 2016 at 6:52 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I'm actually rather unconvinced that it's all that common that all
> subtransactions are on one page. If you have concurrency - otherwise
> there'd be not much point in this patch - they'll usually be heavily
> interleaved, no? You can argue that you don't care about subxacts,
> because they're more often used in less concurrent scenarios, but if
> that's the argument, it should actually be made.

But a single clog page holds a lot of transactions - I think it's
~32k. If you have 100 backends running, and each one allocates an XID
in turn, and then each allocates a sub-XID in turn, and then they all
commit, and then you repeat this pattern, >99% of transactions will be
on a single CLOG page. And that is a pretty pathological case.
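
(For reference, the arithmetic: clog stores two status bits per transaction,
so with the default 8kB block size one page covers 8192 * 4 = 32768 xids, per
the defines in clog.c:)

#define CLOG_BITS_PER_XACT  2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)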

It's true that if you have many short-running transactions interleaved
with occasional long-running transactions, and the latter use
subxacts, the optimization might fail to apply to the long-running
subxacts fairly often. But who cares? Those are, by definition, a
small percentage of the overall transaction stream.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-22 15:11:12
Message-ID: 20160322151112.GL3790@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-03-22 10:40:28 -0400, Robert Haas wrote:
> On Tue, Mar 22, 2016 at 6:52 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I'm actually rather unconvinced that it's all that common that all
> > subtransactions are on one page. If you have concurrency - otherwise
> > there'd be not much point in this patch - they'll usually be heavily
> > interleaved, no? You can argue that you don't care about subxacts,
> > because they're more often used in less concurrent scenarios, but if
> > that's the argument, it should actually be made.
>
> But a single clog page holds a lot of transactions - I think it's
> ~32k.

At 30-40k TPS that's not actually all that much.

> If you have 100 backends running, and each one allocates an XID
> in turn, and then each allocates a sub-XID in turn, and then they all
> commit, and then you repeat this pattern, >99% of transactions will be
> on a single CLOG page. And that is a pretty pathological case.

I think it's much more likely that some backends will immediately
allocate and others won't for a short while.

> It's true that if you have many short-running transactions interleaved
> with occasional long-running transactions, and the latter use
> subxacts, the optimization might fail to apply to the long-running
> subxacts fairly often. But who cares? Those are, by definition, a
> small percentage of the overall transaction stream.

Leaving subtransactions aside, I think the problem is that if you're
having slightly longer running transactions on a regular basis (and I'm
thinking 100-200ms, very common on OLTP systems due to network and
client processing), the effectiveness of the batching will be greatly
reduced.

I'll play around with the updated patch Amit promised, and see how high
the batching rate is over time, depending on the type of transaction
processed.

Andres


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-23 03:21:19
Message-ID: 56F20BAF.10803@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/22/16 9:36 AM, Amit Kapila wrote:
> > > Note, that we are doing it only when a transaction has less than
> equal to
> > > 64 sub transactions.
> >
> > So?
> >
>
> They should fall on one page, unless they are heavily interleaved as
> pointed by you. I think either subtransactions are present or not, this
> patch won't help for bigger transactions.

FWIW, the use case that comes to mind here is the "upsert" example in
the docs. AFAIK that's going to create a subtransaction every time it's
called, regardless of whether it performs actual DML. I've used that in
places that would probably have moderately high concurrency, and I
suspect I'm not alone in that.

That said, it wouldn't surprise me if plpgsql overhead swamps an effect
this patch has, so perhaps it's a moot point.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-23 06:56:35
Message-ID: CAA4eK1+8gQTyGSZLe1Rb7jeM1Beh4FqA4VNjtpZcmvwizDQ0hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId
*subxids,
> > * Record the final state of transaction entries in the commit log for
> > * all entries on a single page. Atomic only on this page.
> > *
> > + * Group the status update for transactions. This improves the
efficiency
> > + * of the transaction status update by reducing the number of lock
> > + * acquisitions required for it. To achieve the group transaction
status
> > + * update, we need to populate the transaction status related
information
> > + * in shared memory and doing it for overflowed sub-transactions would
need
> > + * a big chunk of shared memory, so we are not doing this optimization
for
> > + * such cases. This optimization is only applicable if the transaction
and
> > + * all child sub-transactions belong to same page which we presume to
be the
> > + * most common case, we might be able to apply this when they are not
on same
> > + * page, but that needs us to map sub-transactions in proc's XidCache
based
> > + * on pageno for which each time a group leader needs to set the
transaction
> > + * status and that can lead to some performance penalty as well
because it
> > + * needs to be done after acquiring CLogControlLock, so let's leave
that
> > + * case for now. We don't do this optimization for prepared
transactions
> > + * as the dummy proc associated with such transactions doesn't have a
> > + * semaphore associated with it and the same is required for group
status
> > + * update. We choose not to create a semaphore for dummy procs for
this
> > + * purpose as the advantage of using this optimization for prepared
transactions
> > + * is not clear.
> > + *
>
> I think you should try to break up some of the sentences, one of them
> spans 7 lines.
>

Okay, I have simplified the sentences in the comment.

>
>
> > * Otherwise API is same as TransactionIdSetTreeStatus()
> > */
> > static void
> > TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> > TransactionId
*subxids, XidStatus status,
> > - XLogRecPtr lsn, int
pageno)
> > + XLogRecPtr lsn, int
pageno,
> > + bool
all_xact_same_page)
> > +{
> > + /*
> > + * If we can immediately acquire CLogControlLock, we update the
status
> > + * of our own XID and release the lock. If not, use group XID
status
> > + * update to improve efficiency and if still not able to update,
then
> > + * acquire CLogControlLock and update it.
> > + */
> > + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> > + {
> > + TransactionIdSetPageStatusInternal(xid, nsubxids,
subxids, status, lsn, pageno);
> > + LWLockRelease(CLogControlLock);
> > + }
> > + else if (!all_xact_same_page ||
> > + nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> > + IsGXactActive() ||
> > + !TransactionGroupUpdateXidStatus(xid, status,
lsn, pageno))
> > + {
> > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > + TransactionIdSetPageStatusInternal(xid, nsubxids,
subxids, status, lsn, pageno);
> > +
> > + LWLockRelease(CLogControlLock);
> > + }
> > +}
> >
>
> This code is a bit arcane. I think it should be restructured to
> a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
> PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
> lock acquire first can be rather expensive.
> b) I'd rather see an explicit fallback for the
> !TransactionGroupUpdateXidStatus case, this way it's too hard to
> understand. It's also harder to add probes to detect whether that
>

Changed.

>
>
> > The first process to add itself to the list will acquire
> > + * CLogControlLock in exclusive mode and perform
TransactionIdSetPageStatusInternal
> > + * on behalf of all group members. This avoids a great deal of
contention
> > + * around CLogControlLock when many processes are trying to commit at
once,
> > + * since the lock need not be repeatedly handed off from one committing
> > + * process to the next.
> > + *
> > + * Returns true, if transaction status is updated in clog page, else
return
> > + * false.
> > + */
> > +static bool
> > +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> > +
XLogRecPtr lsn, int pageno)
> > +{
> > + volatile PROC_HDR *procglobal = ProcGlobal;
> > + PGPROC *proc = MyProc;
> > + uint32 nextidx;
> > + uint32 wakeidx;
> > + int extraWaits = -1;
> > +
> > + /* We should definitely have an XID whose status needs to be
updated. */
> > + Assert(TransactionIdIsValid(xid));
> > +
> > + /*
> > + * Add ourselves to the list of processes needing a group XID
status
> > + * update.
> > + */
> > + proc->clogGroupMember = true;
> > + proc->clogGroupMemberXid = xid;
> > + proc->clogGroupMemberXidStatus = status;
> > + proc->clogGroupMemberPage = pageno;
> > + proc->clogGroupMemberLsn = lsn;
> > + while (true)
> > + {
> > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +
> > + /*
> > + * Add the proc to list, if the clog page where we need
to update the
> > + * current transaction status is same as group leader's
clog page.
> > + * There is a race condition here such that after doing
the below
> > + * check and before adding this proc's clog update to a
group, if the
> > + * group leader already finishes the group update for
this page and
> > + * becomes group leader of another group which updates
different clog
> > + * page, then it will lead to a situation where a single
group can
> > + * have different clog page updates. Now the chances of
such a race
> > + * condition are less and even if it happens, the only
downside is
> > + * that it could lead to serial access of clog pages from
disk if
> > + * those pages are not in memory. Tests doesn't indicate
any
> > + * performance hit due to different clog page updates in
same group,
> > + * however in future, if we want to improve the
situation, then we can
> > + * detect the non-group leader transactions that tries to
update the
> > + * different CLOG page after acquiring CLogControlLock
and then mark
> > + * these transactions such that after waking they need to
perform CLOG
> > + * update via normal path.
> > + */
>
> Needs a good portion of polishing.
>

Okay, I have tried to simplify the comment as well.

>
> > + if (nextidx != INVALID_PGPROCNO &&
> > + ProcGlobal->allProcs[nextidx].clogGroupMemberPage
!= proc->clogGroupMemberPage)
> > + return false;
>
> I think we're returning with clogGroupMember = true - that doesn't look
> right.
>

Changed as per suggestion.

>
> > + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> > +
> > + if
(pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +
&nextidx,
> > +
(uint32) proc->pgprocno))
> > + break;
> > + }
>
> So this indeed has ABA type problems. And you appear to be arguing above
> that that's ok. Need to ponder that for a bit.
>
> So, we enqueue ourselves as the *head* of the wait list, if there's
> other waiters. Seems like it could lead to the first element after the
> leader to be delayed longer than the others.
>
>
> FWIW, You can move the nextidx = part of out the loop,
> pgatomic_compare_exchange will update the nextidx value from memory; no
> need for another load afterwards.
>

Changed as per suggestion.

>
> > + /*
> > + * If the list was not empty, the leader will update the status
of our
> > + * XID. It is impossible to have followers without a leader
because the
> > + * first process that has added itself to the list will always
have
> > + * nextidx as INVALID_PGPROCNO.
> > + */
> > + if (nextidx != INVALID_PGPROCNO)
> > + {
> > + /* Sleep until the leader updates our XID status. */
> > + for (;;)
> > + {
> > + /* acts as a read barrier */
> > + PGSemaphoreLock(&proc->sem);
> > + if (!proc->clogGroupMember)
> > + break;
> > + extraWaits++;
> > + }
> > +
> > + Assert(pg_atomic_read_u32(&proc->clogGroupNext) ==
INVALID_PGPROCNO);
> > +
> > + /* Fix semaphore count for any absorbed wakeups */
> > + while (extraWaits-- > 0)
> > + PGSemaphoreUnlock(&proc->sem);
> > + return true;
> > + }
> > +
> > + /* We are the leader. Acquire the lock on behalf of everyone. */
> > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > + /*
> > + * Now that we've got the lock, clear the list of processes
waiting for
> > + * group XID status update, saving a pointer to the head of the
list.
> > + * Trying to pop elements one at a time could lead to an ABA
problem.
> > + */
> > + while (true)
> > + {
> > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > + if
(pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +
&nextidx,
> > +
INVALID_PGPROCNO))
> > + break;
> > + }
>
> Hm. It seems like you should should simply use pg_atomic_exchange_u32(),
> rather than compare_exchange?
>

Changed as per suggestion.

>
> I think it's worthwhile to create a benchmark that does something like
> BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> completely realistic values for network RTT + application computation),
> the success rate of group updates shrinks noticeably.
>

Will do some tests based on above test and share results.

The attached patch contains all the changes suggested by you. Let me know if
I have missed anything or if you want it done differently.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v8.patch application/octet-stream 15.6 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-23 07:03:22
Message-ID: CAA4eK1LqO6wzAZ3ik7Pej__DFTgA-0GTQcXiFwnNRtQ+LmBttA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2016 at 12:26 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> >
> > I think it's worthwhile to create a benchmark that does something like
> > BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> > INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> > completely realistic values for network RTT + application computation),
> > the success rate of group updates shrinks noticeably.
> >
>
> Will do some tests based on above test and share results.
>

Forgot to mention that the effect of the patch is more visible with unlogged
tables, so I will do the test with those, and request that you use the same
if you are also planning to perform some tests.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-23 07:48:16
Message-ID: CAA4eK1JeGYfUoQi+tNmJrFyvr6G0Jmb=1gH2hNKqptcnNxdx_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 22, 2016 at 12:33 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
>
> On Thu, Mar 17, 2016 at 11:39 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> I have reviewed the patch.. here are some review comments, I will
continue to review..
>
> 1.
>
> +
> + /*
> + * Add the proc to list, if the clog page where we need to update the
>
> + */
> + if (nextidx != INVALID_PGPROCNO &&
> + ProcGlobal->allProcs[nextidx].clogGroupMemberPage !=
proc->clogGroupMemberPage)
> + return false;
>
> Should we clear all these structure variable what we set above in case we
are not adding our self to group, I can see it will not have any problem
even if we don't clear them,
> I think if we don't want to clear we can add some comment mentioning the
same.
>

I have updated the patch to just clear clogGroupMember as that is what is
done when we wake the processes.
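
i.e. something along these lines (sketch; the updated patch has the exact
change):

        if (nextidx != INVALID_PGPROCNO &&
            ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
        {
            /* don't remain marked as a group member if we fall back */
            proc->clogGroupMember = false;
            return false;
        }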

>
> 2.
>
> Here we are updating in our own proc, I think we don't need atomic
operation here, we are not yet added to the list.
>
> + if (nextidx != INVALID_PGPROCNO &&
> + ProcGlobal->allProcs[nextidx].clogGroupMemberPage !=
proc->clogGroupMemberPage)
> + return false;
> +
> + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
>
>

We won't be able to assign nextidx to clogGroupNext directly, because
clogGroupNext is of type pg_atomic_uint32.
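
To illustrate (sketch):

    /*
     * clogGroupNext is a pg_atomic_uint32, so a plain store such as
     *     proc->clogGroupNext = nextidx;
     * won't compile; the accessor has to be used even for a store to our
     * own PGPROC:
     */
    pg_atomic_write_u32(&proc->clogGroupNext, nextidx);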

Thanks for the review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-23 20:43:41
Message-ID: 20160323204341.GB4686@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-03-23 12:33:22 +0530, Amit Kapila wrote:
> On Wed, Mar 23, 2016 at 12:26 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > >
> > > I think it's worthwhile to create a benchmark that does something like
> > > BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> > > INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> > > completely realistic values for network RTT + application computation),
> > > the success rate of group updates shrinks noticeably.
> > >
> >
> > Will do some tests based on above test and share results.
> >
>
> Forgot to mention that the effect of patch is better visible with unlogged
> tables, so will do the test with those and request you to use same if you
> yourself is also planning to perform some tests.

I'm playing around with SELECT txid_current(); right now - that should
be about the most specific load for setting clog bits.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-24 00:10:55
Message-ID: 20160324001055.GD4686@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-03-23 21:43:41 +0100, Andres Freund wrote:
> I'm playing around with SELECT txid_current(); right now - that should
> be about the most specific load for setting clog bits.

Or so I thought.

In my testing that showed just about zero performance difference between
the patch and master. And more surprisingly, profiling showed very
little contention on the control lock. Hacking
TransactionIdSetPageStatus() to return without doing anything actually
showed only minor performance benefits.

[there's also the fact that txid_current() indirectly acquires two
lwlocks twice, which showed up more prominently than the control lock, but
that I could easily hack around by adding an xid_current().]

Similar with an INSERT only workload. And a small scale pgbench.

Looking through the thread showed that the positive results you'd posted
were all with relatively big scale factors. Which made me think. Running
a bigger pgbench showed that most of the interesting (i.e. long) lock waits
were via both TransactionIdSetPageStatus() *and* TransactionIdGetStatus().

So I think what happens is that once you have a big enough table, the
UPDATEs that standard pgbench does start to often hit *old* xids (in
unhinted rows). Thus old pages have to be read in, potentially displacing
slru content needed very shortly after.

Have you, in your evaluation of the performance of this patch, done
profiles over time? I.e. are the performance benefits there
immediately, or only after a significant amount of test time? Comparing
TPS over time, for both patched/unpatched, looks relevant.

Even after changing to scale 500, the performance benefits on this
older 2-socket machine were minor, even though contention on the
ClogControlLock was the second most severe (after ProcArrayLock).

Afaics that squares with Jesper's result, which basically also didn't
show a difference either way?

I'm afraid that this patch might be putting a bandaid on some of the
absolutely worst cases, without actually addressing the core
problem. Simon's patch in [1] seems to come closer to addressing that
(though I don't believe it's safe without doing every status
manipulation atomically, as individual status bits are smaller than 4
bytes). Now it's possible to argue that the bandaid might slow the
bleeding to a survivable level, but I have to admit I'm doubtful.

Here's the stats for a -s 500 run btw:
Performance counter stats for 'system wide':
18,747 probe_postgres:TransactionIdSetTreeStatus
68,884 probe_postgres:TransactionIdGetStatus
9,718 probe_postgres:PGSemaphoreLock
(the PGSemaphoreLock is over 50% ProcArrayLock, followed by ~15%
SimpleLruReadPage_ReadOnly)

My suspicion is that a better approach for now would be to take Simon's
patch, but add a (per-page?) 'ClogModificationLock', to avoid the need
to do something fancier in TransactionIdSetStatusBit().

Andres

[1]: http://archives.postgresql.org/message-id/CANP8%2Bj%2BimQfHxkChFyfnXDyi6k-arAzRV%2BZG-V_OFxEtJjOL2Q%40mail.gmail.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-24 00:14:25
Message-ID: 20160324001425.GE4686@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2016-03-24 01:10:55 +0100, Andres Freund wrote:
> I'm afraid that this patch might be putting bandaid on some of the
> absolutely worst cases, without actually addressing the core
> problem. Simon's patch in [1] seems to come closer addressing that
> (which I don't believe it's safe without going doing every status
> manipulation atomically, as individual status bits are smaller than 4
> bytes). Now it's possibly to argue that the bandaid might slow the
> bleeding to a survivable level, but I have to admit I'm doubtful.
>
> Here's the stats for a -s 500 run btw:
> Performance counter stats for 'system wide':
> 18,747 probe_postgres:TransactionIdSetTreeStatus
> 68,884 probe_postgres:TransactionIdGetStatus
> 9,718 probe_postgres:PGSemaphoreLock
> (the PGSemaphoreLock is over 50% ProcArrayLock, followed by ~15%
> SimpleLruReadPage_ReadOnly)
>
>
> My suspicion is that a better approach for now would be to take Simon's
> patch, but add a (per-page?) 'ClogModificationLock'; to avoid the need
> of doing something fancier in TransactionIdSetStatusBit().
>
> Andres
>
> [1]: http://archives.postgresql.org/message-id/CANP8%2Bj%2BimQfHxkChFyfnXDyi6k-arAzRV%2BZG-V_OFxEtJjOL2Q%40mail.gmail.com

Simon, would you mind if I took your patch for a spin like roughly
suggested above?

Andres


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-24 02:38:02
Message-ID: CAA4eK1JZ-V0R5PiY4gU17kq1fPcpDDeGcaUW+-OBDkUT6q5nwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2016-03-23 21:43:41 +0100, Andres Freund wrote:
> > I'm playing around with SELECT txid_current(); right now - that should
> > be about the most specific load for setting clog bits.
>
> Or so I thought.
>
> In my testing that showed just about zero performance difference between
> the patch and master. And more surprisingly, profiling showed very
> little contention on the control lock. Hacking
> TransactionIdSetPageStatus() to return without doing anything, actually
> only showed minor performance benefits.
>
> [there's also the fact that txid_current() indirectly acquires two
> lwlock twice, which showed up more prominently than control lock, but
> that I could easily hack around by adding a xid_current().]
>
> Similar with an INSERT only workload. And a small scale pgbench.
>
>
> Looking through the thread showed that the positive results you'd posted
> all were with relativey big scale factors.
>

I have seen smaller benefits at scale factor 300 and somewhat larger
benefits at scale factor 1000. Also, Mithun has done similar testing with
unlogged tables and those results [1] also look good.

>
> Which made me think. Running
> a bigger pgbench showed that most the interesting (i.e. long) lock waits
> were both via TransactionIdSetPageStatus *and* TransactionIdGetStatus().
>

Yes, this is the same as what I have observed as well.

>
> So I think what happens is that once you have a big enough table, the
> UPDATEs standard pgbench does start to often hit *old* xids (in unhinted
> rows). Thus old pages have to be read in, potentially displacing slru
> content needed very shortly after.
>
>
> Have you, in your evaluation of the performance of this patch, done
> profiles over time? I.e. whether the performance benefits are the
> immediately, or only after a significant amount of test time? Comparing
> TPS over time, for both patched/unpatched looks relevant.
>

I have mainly done it with half-hour read-write tests. What do you want to
observe via shorter tests? Sometimes they give inconsistent data for
read-write workloads.

>
> Even after changing to scale 500, the performance benefits on this,
> older 2 socket, machine were minor; even though contention on the
> ClogControlLock was the second most severe (after ProcArrayLock).
>

I have tried this patch mainly on an 8-socket machine with 300 and 1000
scale factors. I am hoping that you have tried this test on unlogged
tables; also, at what client count have you seen these results?

> Afaics that squares with Jesper's result, which basically also didn't
> show a difference either way?
>

One difference is that I think Jesper has done his testing with
synchronous_commit off, whereas my tests were with synchronous_commit
on.

>
> I'm afraid that this patch might be putting bandaid on some of the
> absolutely worst cases, without actually addressing the core
> problem. Simon's patch in [1] seems to come closer addressing that
> (which I don't believe it's safe without going doing every status
> manipulation atomically, as individual status bits are smaller than 4
> bytes). Now it's possibly to argue that the bandaid might slow the
> bleeding to a survivable level, but I have to admit I'm doubtful.
>
> Here's the stats for a -s 500 run btw:
> Performance counter stats for 'system wide':
> 18,747 probe_postgres:TransactionIdSetTreeStatus
> 68,884 probe_postgres:TransactionIdGetStatus
> 9,718 probe_postgres:PGSemaphoreLock
> (the PGSemaphoreLock is over 50% ProcArrayLock, followed by ~15%
> SimpleLruReadPage_ReadOnly)
>
>
> My suspicion is that a better approach for now would be to take Simon's
> patch, but add a (per-page?) 'ClogModificationLock'; to avoid the need
> of doing something fancier in TransactionIdSetStatusBit().
>

I think we can try that as well, and if you see better results with that
approach, then we can use it instead of the current patch.

[1] -
http://www.postgresql.org/message-id/CAD__OujrdwQdJdoVHahQLDg-6ivu6iBCi9iJ1qPu6AtUQpL4UQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-24 04:39:16
Message-ID: CAA4eK1LEYt+ZWpuVVpEBnMuT3dDrm_H71CPpDPdJwj=ymP_rgw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2 socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
>

One more point I wanted to make here is that I think the benefit will
show up mainly when contention on CLogControlLock is more than, or close
to, that on ProcArrayLock; otherwise, even if the patch reduces contention
(which you can see via LWLock stats), performance doesn't increase. From
Mithun's LWLock data [1], it seems that at 88 clients in his test the
contention on CLOGControlLock becomes higher than on ProcArrayLock, and
that is the point where it starts showing a noticeable performance gain.
I have explained this point some more on that thread [2]. Is it possible
for you to test a similar situation (e.g. a client count greater than the
number of cores) and look at the behaviour w.r.t. lock contention and TPS?

[1] -
http://www.postgresql.org/message-id/CAD__Ouh6PahJ+q1mzzjDzLo4v9GjyufegtJNAyXc0_Lfh-4coQ@mail.gmail.com
[2] -
http://www.postgresql.org/message-id/CAA4eK1LBOQ4e3Ycge+Fe0euzVu89CqGTuGNeajOienxJR0AEKA@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-24 13:16:31
Message-ID: CAA4eK1KoGTUTWH=X3yqWAEqfHt0mKrBCMynY_sEoE4fEzPAfgg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Have you, in your evaluation of the performance of this patch, done
> > profiles over time? I.e. whether the performance benefits are the
> > immediately, or only after a significant amount of test time? Comparing
> > TPS over time, for both patched/unpatched looks relevant.
> >
>
> I have mainly done it with half-hour read-write tests. What do you want
to observe via smaller tests, sometimes it gives inconsistent data for
read-write tests?
>

I have done some tests on both Intel and POWER machines (configurations
are mentioned at the end of this mail) to see the results at different
time intervals, and the patch consistently shows greater than 50%
improvement on the POWER machine at 128 clients and greater than 29%
improvement on the Intel machine at 88 clients.

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

pgbench setup
------------------------
scale factor - 300
used *unlogged* tables : pgbench -i --unlogged-tables -s 300 ..
pgbench -M prepared tpc-b

Results on Intel m/c
--------------------------------
client-count - 88

Time (minutes) Base Patch %
5 39978 51858 29.71
10 38169 52195 36.74
20 36992 52173 41.03
30 37042 52149 40.78

Results on power m/c
-----------------------------------
Client-count - 128

Time (minutes) Base Patch %
5 42479 65655 54.55
10 41876 66050 57.72
20 38099 65200 71.13
30 37838 61908 63.61
>
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2 socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
>
> I have tried this patch on mainly 8 socket machine with 300 & 1000 scale
factor. I am hoping that you have tried this test on unlogged tables and
by the way at what client count, you have seen these results.
>

Do you think the reason we don't see an increase in performance in your
tests is the machine difference (sockets/CPU cores) or the client count?

Intel m/c config (lscpu)
-------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 1064.000
BogoMIPS: 4266.62
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K
NUMA node0 CPU(s): 0,65-71,96-103
NUMA node1 CPU(s): 72-79,104-111
NUMA node2 CPU(s): 80-87,112-119
NUMA node3 CPU(s): 88-95,120-127
NUMA node4 CPU(s): 1-8,33-40
NUMA node5 CPU(s): 9-16,41-48
NUMA node6 CPU(s): 17-24,49-56
NUMA node7 CPU(s): 25-32,57-64

Power m/c config (lscpu)
-------------------------------------
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 4
Model: IBM,8286-42A
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-26 04:05:54
Message-ID: CAA4eK1+zeq4u0AQ9jwSQ5MkmeeP9QsvjSBLzBcUWQHQFtxTjNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2 socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
>
> I have tried this patch on mainly 8 socket machine with 300 & 1000 scale
factor. I am hoping that you have tried this test on unlogged tables and
by the way at what client count, you have seen these results.
>
> > Afaics that squares with Jesper's result, which basically also didn't
> > show a difference either way?
> >
>
> One difference was that I think Jesper has done testing with
synchronous_commit mode as off whereas my tests were with synchronous
commit mode on.
>

Looking again at the results posted by Jesper [1] and Mithun [2], I have
one more observation: in Jesper's tests, HEAD doesn't dip even at higher
client counts (>75), whereas Mithun's results indicate that HEAD dips at
high client counts (>64), and that is where the patch is helping. There is
certainly some difference in the test environments, e.g. Jesper has tested
on a 2-socket machine whereas mine and Mithun's tests were done on 4- or
8-socket machines. So I think the difference in TPS due to reduced
contention on CLogControlLock is mainly visible on high-socket machines.

Can anybody with access to a machine with 4 or more sockets help test
this patch with --unlogged-tables?

[1] - http://www.postgresql.org/message-id/56E9A596.2000607@redhat.com
[2] -
http://www.postgresql.org/message-id/CAD__OujrdwQdJdoVHahQLDg-6ivu6iBCi9iJ1qPu6AtUQpL4UQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-28 17:20:49
Message-ID: CAA4eK1+-=18HOrdqtLXqOMwZDbC_15WTyHiFruz7BvVArZPaAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
>
> Updated comments and the patch (increate_clog_bufs_v2.patch)
> containing the same is attached.
>

Andres mentioned to me in an off-list discussion that he thinks we should
first try to fix the clog buffers problem, as he sees in his tests that
clog buffer replacement is one of the bottlenecks. He also suggested a
test to see whether the increase in buffers could lead to a regression.
The basic idea of the test is to ensure that every CLOG access is a disk
access. Based on his suggestion, I have written a SQL statement that
forces every CLOG access to be a disk access; the query used is as below:

WITH ins AS (INSERT INTO test_clog_access VALUES (default) RETURNING c1)
SELECT * FROM test_clog_access
 WHERE c1 = (SELECT c1 FROM ins) - 32768 * :client_id;
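
(For context, a sketch of what the prep step might create; the actual
access_clog_prep.sql attached to this mail may differ. The idea is that c1
comes from a sequence, so stepping back by 32768 * :client_id from the
just-inserted value selects a row written roughly that many transactions
earlier, and with 32768 transaction statuses per CLOG page its visibility
check lands on a CLOG page that has most likely been evicted already.)

CREATE TABLE test_clog_access (c1 bigserial PRIMARY KEY);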

Test Results
---------------------
HEAD - commit d12e5bb7 Clog Buffers - 32
Patch-1 - Clog Buffers - 64
Patch-2 - Clog Buffers - 128

Patch_Ver/Client_Count 1 64
HEAD 12677 57470
Patch-1 12305 58079
Patch-2 12761 58637

The above data is the median of three 10-minute runs. It indicates that
there is no substantial dip from increasing the clog buffers.

The test scripts used are attached to this mail. In perf_clog_access.sh,
you need to change the data_directory path for your machine; you might
also want to change the binary name if you build postgres binaries with
different names.

Andres, is this test in line with what you have in mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
access_clog_prep.sql application/octet-stream 207 bytes
access_clog.sql application/octet-stream 157 bytes
perf_clog_access.sh application/x-sh 1.9 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-30 23:09:14
Message-ID: 20160330230914.GH13305@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> >
> > Updated comments and the patch (increate_clog_bufs_v2.patch)
> > containing the same is attached.
> >
>
> Andres mentioned to me in off-list discussion, that he thinks we should
> first try to fix the clog buffers problem as he sees in his tests that clog
> buffer replacement is one of the bottlenecks. He also suggested me a test
> to see if the increase in buffers could lead to regression. The basic idea
> of test was to ensure every access on clog access to be a disk one. Based
> on his suggestion, I have written a SQL statement which will allow every
> access of CLOG to be a disk access and the query used for same is as below:
> With ins AS (INSERT INTO test_clog_access values(default) RETURNING c1)
> Select * from test_clog_access where c1 = (Select c1 from ins) - 32768 *
> :client_id;
>
> Test Results
> ---------------------
> HEAD - commit d12e5bb7 Clog Buffers - 32
> Patch-1 - Clog Buffers - 64
> Patch-2 - Clog Buffers - 128
>
>
> Patch_Ver/Client_Count 1 64
> HEAD 12677 57470
> Patch-1 12305 58079
> Patch-2 12761 58637
>
> Above data is a median of 3 10-min runs. Above data indicates that there
> is no substantial dip in increasing clog buffers.
>
> Test scripts used in testing are attached with this mail. In
> perf_clog_access.sh, you need to change data_directory path as per your
> m/c, also you might want to change the binary name, if you want to create
> postgres binaries with different names.
>
> Andres, Is this test inline with what you have in mind?

Yes. That looks good. My testing shows that increasing the number of
buffers can both increase throughput and reduce latency variance. The
former is a smaller effect with one of the discussed patches applied; the
latter seems to actually increase in scale (with increased throughput).

I've attached patches to:
0001: Increase the max number of clog buffers
0002: Implement 64bit atomics fallback and optimize read/write
0003: Edited version of Simon's clog scalability patch

WRT 0003 - still clearly WIP - I've:
- made group_lsn pg_atomic_u64*, to allow for tear-free reads
- split content from IO lock
- made SimpleLruReadPage_optShared always return with only share lock
held
- Implement a different, experimental, concurrency model for
SetStatusBit using cmpxchg. A define USE_CONTENT_LOCK controls which
variant is used.

I've tested this and saw it outperform Amit's approach, especially so
when using a read/write mix, rather than only reads. I saw over a 30%
increase on a large EC2 instance with -btpcb-like@1 -bselect-only@3. But
that's in a virtualized environment, not very good for reproducibility.

Amit, could you run benchmarks on your bigger hardware? Both with
USE_CONTENT_LOCK commented out and in?

I think we should go for 1) and 2) unconditionally, and then evaluate
whether to go with your approach or 3) from above. If the latter, we have
to do some cleanup :)

Greetings,

Andres Freund

Attachment Content-Type Size
0001-Improve-64bit-atomics-support.patch text/x-patch 10.4 KB
0002-Increase-max-number-of-buffers-in-clog-SLRU-to-128.patch text/x-patch 826 bytes
0003-Use-a-much-more-granular-locking-model-for-the-clog-.patch text/x-patch 17.7 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-31 09:37:22
Message-ID: CAA4eK1L2KO18G5-ajNitztnmn1G_Ex1N4oUHhF0Xx3QrY57Ufw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> > >
>
> Amit, could you run benchmarks on your bigger hardware? Both with
> USE_CONTENT_LOCK commented out and in?
>

Yes.

> I think we should go for 1) and 2) unconditionally.

Yes, that makes sense. On a 20-minute read-write pgbench
--unlogged-tables benchmark, I see that with HEAD the TPS is 36241 and
with the increase-clog-buffers patch it is 69340 at 128 clients (a very
good performance boost), which indicates that we should go ahead with
patches 1) and 2).

0002-Increase-max-number-of-buffers-in-clog-SLRU-to-128:

 Size
 CLOGShmemBuffers(void)
 {
-    return Min(32, Max(4, NBuffers / 512));
+    return Min(128, Max(4, NBuffers / 512));
 }
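
(As a worked example of that formula, assuming the shared_buffers = 8GB
setting used in the benchmarks here and the default 8kB block size:
NBuffers = 8GB / 8kB = 1048576, so NBuffers / 512 = 2048, which the new
cap then clamps to 128. The same arithmetic as a quick query:)

SELECT least(128, greatest(4, (8::bigint * 1024 * 1024 * 1024 / 8192) / 512)) AS clog_buffers;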

I think we should update the comment on top of this function. I have
changed it as per my previous patch and attached the modified patch to
this mail; see if that makes sense.

0001-Improve-64bit-atomics-support

+#if 0
+#ifndef PG_HAVE_ATOMIC_READ_U64
+#define PG_HAVE_ATOMIC_READ_U64
+static inline uint64

What is the purpose of the above #if 0? Other than that, the patch looks
good to me.

> And then evaluate
> whether to go with your, or 3) from above. If the latter, we've to do
> some cleanup :)
>

Yes, that makes sense to me, so let's go with 1) and 2) first.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
increase_clog_bufs_v3.patch application/octet-stream 2.1 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-31 10:18:05
Message-ID: 20160331101804.GD23562@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > > wrote:
> > > >
> >
> > Amit, could you run benchmarks on your bigger hardware? Both with
> > USE_CONTENT_LOCK commented out and in?
> >
>
> Yes.

Cool.

> > I think we should go for 1) and 2) unconditionally.

> Yes, that makes sense. On 20 min read-write pgbench --unlogged-tables
> benchmark, I see that with HEAD Tps is 36241 and with increase the clog
> buffers patch, Tps is 69340 at 128 client count (very good performance
> boost) which indicates that we should go ahead with 1) and 2) patches.

Especially considering the line count... I do wonder about going crazy
and increasing to 256 immediately. It otherwise seems likely that we'll
have the same issue in a year. Could you perhaps run your test
against that as well?

> I think we should change comments on top of this function.

Yes, definitely.

> 0001-Improve-64bit-atomics-support
>
> +#if 0
> +#ifndef PG_HAVE_ATOMIC_READ_U64
> +#define PG_HAVE_ATOMIC_READ_U64
> +static inline uint64
>
> What the purpose of above #if 0? Other than that patch looks good to me.

I think I was investigating something. Other than that obviously there's
no point. Sorry for that.

Greetings,

Andres Freund


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-31 12:22:12
Message-ID: CAA4eK1LDeD2xJvitmX4mnx4ap9uaTAoJVUHBT0Wa3xhc6mP3Pw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> > On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres(at)anarazel(dot)de>
wrote:
> > >
> > > On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > > > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <
amit(dot)kapila16(at)gmail(dot)com>
> > > > wrote:
> > > > >
> > >
> > > Amit, could you run benchmarks on your bigger hardware? Both with
> > > USE_CONTENT_LOCK commented out and in?
> > >
> >
> > Yes.
>
> Cool.
>
>
> > > I think we should go for 1) and 2) unconditionally.
>
> > Yes, that makes sense. On 20 min read-write pgbench --unlogged-tables
> > benchmark, I see that with HEAD Tps is 36241 and with increase the clog
> > buffers patch, Tps is 69340 at 128 client count (very good performance
> > boost) which indicates that we should go ahead with 1) and 2) patches.
>
> Especially considering the line count... I do wonder about going crazy
> and increasing to 256 immediately. It otherwise seems likely that we'll
> have the the same issue in a year. Could you perhaps run your test
> against that as well?
>

Unfortunately, it dipped to 65005 TPS with 256 clog bufs. So I think 128
is the appropriate number.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-31 12:24:55
Message-ID: 20160331122455.c65s4spjlwiy6ind@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-03-31 17:52:12 +0530, Amit Kapila wrote:
> On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> > > On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres(at)anarazel(dot)de>
> wrote:
> > > >
> > > > On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > > > > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <
> amit(dot)kapila16(at)gmail(dot)com>
> > > > > wrote:
> > > > > >
> > > >
> > > > Amit, could you run benchmarks on your bigger hardware? Both with
> > > > USE_CONTENT_LOCK commented out and in?
> > > >
> > >
> > > Yes.
> >
> > Cool.
> >
> >
> > > > I think we should go for 1) and 2) unconditionally.
> >
> > > Yes, that makes sense. On 20 min read-write pgbench --unlogged-tables
> > > benchmark, I see that with HEAD Tps is 36241 and with increase the clog
> > > buffers patch, Tps is 69340 at 128 client count (very good performance
> > > boost) which indicates that we should go ahead with 1) and 2) patches.
> >
> > Especially considering the line count... I do wonder about going crazy
> > and increasing to 256 immediately. It otherwise seems likely that we'll
> > have the the same issue in a year. Could you perhaps run your test
> > against that as well?
> >
>
> Unfortunately, it dipped to 65005 with 256 clog bufs. So I think 128 is
> appropriate number.

Ah, interesting. Then let's go with that.


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-31 21:13:46
Message-ID: 56FD930A.1030005@redhat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 03/30/2016 07:09 PM, Andres Freund wrote:
> Yes. That looks good. My testing shows that increasing the number of
> buffers can increase both throughput and reduce latency variance. The
> former is a smaller effect with one of the discussed patches applied,
> the latter seems to actually increase in scale (with increased
> throughput).
>
>
> I've attached patches to:
> 0001: Increase the max number of clog buffers
> 0002: Implement 64bit atomics fallback and optimize read/write
> 0003: Edited version of Simon's clog scalability patch
>
> WRT 0003 - still clearly WIP - I've:
> - made group_lsn pg_atomic_u64*, to allow for tear-free reads
> - split content from IO lock
> - made SimpleLruReadPage_optShared always return with only share lock
> held
> - Implement a different, experimental, concurrency model for
> SetStatusBit using cmpxchg. A define USE_CONTENT_LOCK controls which
> bit is used.
>
> I've tested this and saw this outperform Amit's approach. Especially so
> when using a read/write mix, rather then only reads. I saw over 30%
> increase on a large EC2 instance with -btpcb-like(at)1 -bselect-only(at)3(dot) But
> that's in a virtualized environment, not very good for reproducability.
>
> Amit, could you run benchmarks on your bigger hardware? Both with
> USE_CONTENT_LOCK commented out and in?
>
> I think we should go for 1) and 2) unconditionally. And then evaluate
> whether to go with your, or 3) from above. If the latter, we've to do
> some cleanup :)
>

I have been testing Amit's patch in various setups and workloads, with
up to 400 connections on a 2 x Xeon E5-2683 (28C/56T @ 2 GHz), and am
not seeing an improvement, but no regression either.

Testing with 0001 and 0002 does show up to a 5% improvement when using an
HDD for data + WAL, and about 1% when using 2 x RAID10 SSDs, with
unlogged tables.

I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.

Thanks for your work on this!

Best regards,
Jesper


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>,pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-31 22:21:12
Message-ID: 64AFABAF-8854-4877-BB79-16A845C62782@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:

>I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.

Yes please. I think the lock variant is realistic, the lockless one isn't.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-01 20:25:51
Message-ID: 56FED94F.5050807@redhat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 03/31/2016 06:21 PM, Andres Freund wrote:
> On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
>> I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.
>
> Yes please. I think the lock variant is realistic, the lockless did isn't.
>

I have done a run with -M prepared on unlogged tables, running 10 minutes
per data point, up to 300 connections, using data + WAL on HDD.

I'm not seeing a difference between with and without USE_CONTENT_LOCK --
all points are within +/- 0.5%.

Let me know if there are other tests I can perform.

Best regards,
Jesper


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>,pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-01 20:39:23
Message-ID: 2F91B844-D053-4FC0-A43A-50659DB95719@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On April 1, 2016 10:25:51 PM GMT+02:00, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>Hi,
>
>On 03/31/2016 06:21 PM, Andres Freund wrote:
>> On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen
><jesper(dot)pedersen(at)redhat(dot)com> wrote:
>>
>>> I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.
>>
>> Yes please. I think the lock variant is realistic, the lockless did
>isn't.
>>
>
>I have done a run with -M prepared on unlogged running 10min per data
>point, up to 300 connections. Using data + wal on HDD.
>
>I'm not seeing a difference between with and without USE_CONTENT_LOCK
>--
>all points are within +/- 0.5%.
>
>Let me know if there are other tests I can perform

How does either compare to just 0002 applied?

Thanks!
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-02 11:55:50
Message-ID: CAA4eK1Kxcv8aj1GfWWcU2aByiKT4-DBh_STdXoccVBVBqVbL5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> > On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres(at)anarazel(dot)de>
wrote:
> > >
> > > On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > > > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <
amit(dot)kapila16(at)gmail(dot)com>
> > > > wrote:
> > > > >
> > >
> > > Amit, could you run benchmarks on your bigger hardware? Both with
> > > USE_CONTENT_LOCK commented out and in?
> > >
> >
> > Yes.
>
> Cool.
>

Here is the performance data (the configuration of the machine used for
this test is mentioned at the end of this mail):

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

median of 3, 20-min pgbench tpc-b results for --unlogged-tables

Client Count/No. Of Runs (tps) 2 64 128
HEAD+clog_buf_128 4930 66754 68818
group_clog_v8 5753 69002 78843
content_lock 5668 70134 70501
nocontent_lock 4787 69531 70663

I am not exactly sure why the content-lock patch (USE_CONTENT_LOCK defined
in 0003-Use-a-much-more-granular-locking-model-for-the-clog-) or the
no-content-lock patch (USE_CONTENT_LOCK not defined) gives poorer
performance at 128 clients; maybe it is due to some bug in the patch, or
due to the reason mentioned by Robert [1] (usage of two locks instead of
one). On running it many times with the content-lock and no-content-lock
patches, it sometimes gives 80~81K TPS at 128 clients, which is
approximately 3% higher than the group_clog_v8 patch; this indicates that
the group clog approach is able to address most of the remaining
contention (after increasing the clog buffers) around CLOGControlLock.
There is one small regression observed with the no-content-lock patch at a
lower client count (2), which might be due to run-to-run variation, or
maybe due to the increased number of instructions from atomic ops; this
needs to be investigated if we want to follow the no-content-lock approach.

Note that I have not posted TPS numbers for HEAD, as I have already shown
above that increasing the clog bufs raises TPS from ~36K to ~68K at 128
clients.

M/c details
-----------------
Power m/c config (lscpu)
-------------------------------------
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 4
Model: IBM,8286-42A
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191

[1] -
http://www.postgresql.org/message-id/CA+TgmoYjpNKdHDFUtJLAMna-O5LGuTDnanHFAOT5=hN_VAuW2Q@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-04 15:25:56
Message-ID: 57028784.6050107@redhat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/01/2016 04:39 PM, Andres Freund wrote:
> On April 1, 2016 10:25:51 PM GMT+02:00, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>> Hi,
>>
>> On 03/31/2016 06:21 PM, Andres Freund wrote:
>>> On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen
>> <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>>>
>>>> I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.
>>>
>>> Yes please. I think the lock variant is realistic, the lockless did
>> isn't.
>>>
>>
>> I have done a run with -M prepared on unlogged running 10min per data
>> point, up to 300 connections. Using data + wal on HDD.
>>
>> I'm not seeing a difference between with and without USE_CONTENT_LOCK
>> --
>> all points are within +/- 0.5%.
>>
>> Let me know if there are other tests I can perform
>
> How do either compare to just 0002 applied?
>

0001 + 0002 compared to 0001 + 0002 + 0003 (either way) was pretty much
the same, +/- 0.5%, on the HDD run.

Best regards,
Jesper


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-05 03:32:18
Message-ID: CAA4eK1LtcAOoiSZBLfLs4Ny+g6sUxjxwayN5ACU0DHhDD=2tfQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 4, 2016 at 8:55 PM, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
wrote:

> On 04/01/2016 04:39 PM, Andres Freund wrote:
>
>> On April 1, 2016 10:25:51 PM GMT+02:00, Jesper Pedersen <
>> jesper(dot)pedersen(at)redhat(dot)com> wrote:
>>
>>> Hi,
>>>
>>> On 03/31/2016 06:21 PM, Andres Freund wrote:
>>>
>>>> On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen
>>>>
>>> <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>>>
>>>>
>>>> I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.
>>>>>
>>>>
>>>> Yes please. I think the lock variant is realistic, the lockless did
>>>>
>>> isn't.
>>>
>>>>
>>>>
>>> I have done a run with -M prepared on unlogged running 10min per data
>>> point, up to 300 connections. Using data + wal on HDD.
>>>
>>> I'm not seeing a difference between with and without USE_CONTENT_LOCK
>>> --
>>> all points are within +/- 0.5%.
>>>
>>> Let me know if there are other tests I can perform
>>>
>>
>> How do either compare to just 0002 applied?
>>
>>
> 0001 + 0002 compared to 0001 + 0002 + 0003 (either way) were pretty much
> the same +/- 0.5% on the HDD run.
>
>
I think the main reason there is no significant gain in your tests is
that, on the machine where you are testing, the contention on
CLOGControlLock is not high enough for any reduction in it to help. To
me, the gain is visible on some high-end machines, like those with 4 or
more sockets. So I think these results should be taken as an indication
that there is no regression in the tests you performed.

Thanks for doing all the tests for these patches.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-07 03:44:00
Message-ID: CAA4eK1J12fSGAmFSeq0wdUgqD+4Ue43rZDr=ZEMbySMgxfGJKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Here is the performance data (configuration of machine used to perform
> this test is mentioned at end of mail):
>
> Non-default parameters
> ------------------------------------
> max_connections = 300
> shared_buffers=8GB
> min_wal_size=10GB
> max_wal_size=15GB
> checkpoint_timeout =35min
> maintenance_work_mem = 1GB
> checkpoint_completion_target = 0.9
> wal_buffers = 256MB
>
> median of 3, 20-min pgbench tpc-b results for --unlogged-tables
>

I have run exactly the same test on an Intel x86 machine and the results
are as below:

Client Count/Patch_ver (tps) 2 128 256
HEAD – Commit 2143f5e1 2832 35001 26756
clog_buf_128 2909 50685 40998
clog_buf_128 +group_update_clog_v8 2981 53043 50779
clog_buf_128 +content_lock 2843 56261 54059
clog_buf_128 +nocontent_lock 2630 56554 54429

On this machine I don't see any run-to-run variation; however, the trend
of the results seems somewhat similar to the POWER machine. Clearly the
first patch, increasing the clog bufs to 128, shows up to a 50%
performance improvement at 256 clients. We can also observe that the
group clog patch gives a ~24% gain on top of the increase-clog-bufs patch
at 256 clients. Both the content-lock and no-content-lock patches show
similar performance gains, and their performance is 6~7% better than the
group clog patch. Also, as on the POWER machine, the no-content-lock
patch seems to show some regression at a lower client count (2 clients in
this case).

Based on the above results, increasing the clog bufs to 128 is a clear
winner, and I think we might not want to proceed with the no-content-lock
patch, as it shows some regression and is no better than the content-lock
patch. Now, I think we need to decide between the group clog patch and
the content-lock patch. It seems to me that the difference between the
two is not high (6~7%), and I think that when sub-transactions are
involved (sub-transactions on the same page as the main transaction) the
group clog patch should give better performance, as the content lock
itself will then start becoming a bottleneck. We could probably address
that case for the content-lock approach by using a grouping technique on
the content lock or something similar, but I am not sure that is worth
the effort. Also, I see some variation in the performance data with the
content-lock patch on the POWER machine, but again that might be
attributed to machine characteristics. So, I think we can proceed with
either the group clog patch or the content-lock patch, and if we want to
proceed with the content-lock approach, then we need to do some more work
on it.

Note: for both the content-lock and no-content-lock runs, I applied the
0001-Improve-64bit-atomics-support patch.

m/c config (lscpu)
---------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 1064.000
BogoMIPS: 4266.62
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K
NUMA node0 CPU(s): 0,65-71,96-103
NUMA node1 CPU(s): 72-79,104-111
NUMA node2 CPU(s): 80-87,112-119
NUMA node3 CPU(s): 88-95,120-127
NUMA node4 CPU(s): 1-8,33-40
NUMA node5 CPU(s): 9-16,41-48
NUMA node6 CPU(s): 17-24,49-56
NUMA node7 CPU(s): 25-32,57-64

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-07 04:46:16
Message-ID: 20160407044616.omi7tjqa2k763ppd@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2016-04-07 09:14:00 +0530, Amit Kapila wrote:
> On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I have ran exactly same test on intel x86 m/c and the results are as below:

Thanks for running these tests!

> Client Count/Patch_ver (tps) 2 128 256
> HEAD – Commit 2143f5e1 2832 35001 26756
> clog_buf_128 2909 50685 40998
> clog_buf_128 +group_update_clog_v8 2981 53043 50779
> clog_buf_128 +content_lock 2843 56261 54059
> clog_buf_128 +nocontent_lock 2630 56554 54429

Interesting.

Could you perhaps also run a test with -btpcb-like@1 -bselect-only@3?
That better represents real-world loads, and it's where I saw Simon's
approach outshining yours considerably...

Greetings,

Andres Freund


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-07 13:10:14
Message-ID: CAA4eK1KHxLaqqC9_8e5KwEsQiSB9eiDsrg4tjTqRmjrmuK=+Yg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 7, 2016 at 10:16 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2016-04-07 09:14:00 +0530, Amit Kapila wrote:
> > On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > I have ran exactly same test on intel x86 m/c and the results are as
below:
>
> Thanks for running these tests!
>
> > Client Count/Patch_ver (tps) 2 128 256
> > HEAD – Commit 2143f5e1 2832 35001 26756
> > clog_buf_128 2909 50685 40998
> > clog_buf_128 +group_update_clog_v8 2981 53043 50779
> > clog_buf_128 +content_lock 2843 56261 54059
> > clog_buf_128 +nocontent_lock 2630 56554 54429
>
> Interesting.
>
> could you perhaps also run a test with -btpcb-like(at)1 -bselect-only(at)3?
>

This is the data with -b tpcb-like@1, with a 20-minute run for each
version, and I see results quite similar to the data posted in the
previous e-mail.

Client Count/Patch_ver (tps) 256
clog_buf_128 40617
clog_buf_128 +group_clog_v8 51137
clog_buf_128 +content_lock 54188

For -b select-only@3, I have done a quick test for each version and the
number is the same, 62K~63K, for all versions. Why do you think this will
improve a select-only workload?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-07 13:18:04
Message-ID: 20160407131804.j3kn6xpaoedcxzrr@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-04-07 18:40:14 +0530, Amit Kapila wrote:
> On Thu, Apr 7, 2016 at 10:16 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > On 2016-04-07 09:14:00 +0530, Amit Kapila wrote:
> > > On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > > I have ran exactly same test on intel x86 m/c and the results are as
> below:
> >
> > Thanks for running these tests!
> >
> > > Client Count/Patch_ver (tps) 2 128 256
> > > HEAD – Commit 2143f5e1 2832 35001 26756
> > > clog_buf_128 2909 50685 40998
> > > clog_buf_128 +group_update_clog_v8 2981 53043 50779
> > > clog_buf_128 +content_lock 2843 56261 54059
> > > clog_buf_128 +nocontent_lock 2630 56554 54429
> >
> > Interesting.
> >
> > could you perhaps also run a test with -btpcb-like(at)1 -bselect-only(at)3?

> This is the data with -b tpcb-like(at)1 with 20-min run for each version and I
> could see almost similar results as the data posted in previous e-mail.
>
> Client Count/Patch_ver (tps) 256
> clog_buf_128 40617
> clog_buf_128 +group_clog_v8 51137
> clog_buf_128 +content_lock 54188
>
> For -b select-only(at)3, I have done quicktest for each version and number is
> same 62K~63K for all version, why do you think this will improve
> select-only workload?

What I was looking for was pgbench with both -btpcb-like@1 and
-bselect-only@3 specified, i.e. a mixed read/write test. In my
measurements that's where Simon's approach shines (not surprising if you
look at the way it works), and it's of immense practical importance -
most workloads are mixed.

Regards,

Andres


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-07 15:40:51
Message-ID: CAA4eK1LY5A-Ni1jXvdwcFQCd6CMpLcCmUvwdeApYc2Kh1P0hyw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 7, 2016 at 6:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2016-04-07 18:40:14 +0530, Amit Kapila wrote:
> > This is the data with -b tpcb-like(at)1 with 20-min run for each version
and I
> > could see almost similar results as the data posted in previous e-mail.
> >
> > Client Count/Patch_ver (tps) 256
> > clog_buf_128 40617
> > clog_buf_128 +group_clog_v8 51137
> > clog_buf_128 +content_lock 54188
> >
> > For -b select-only(at)3, I have done quicktest for each version and
number is
> > same 62K~63K for all version, why do you think this will improve
> > select-only workload?
>
> What I was looking for was pgbench with both -btpcb-like(at)1
> -bselect-only(at)3 specified; i.e. a mixed read/write test.
>

Okay, I can take the performance data again, but on what basis are we
ignoring the variation of results on the POWER machine? Prior to this, I
have not seen such variation for read-write tests.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-08 07:37:05
Message-ID: CAA4eK1+6oBO4gCyTWGbHwPtS+DSGHU0q347yZhsF8nN+5MadoQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 7, 2016 at 6:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2016-04-07 18:40:14 +0530, Amit Kapila wrote:
>
> > This is the data with -b tpcb-like(at)1 with 20-min run for each version
> and I
> > could see almost similar results as the data posted in previous e-mail.
> >
> > Client Count/Patch_ver (tps) 256
> > clog_buf_128 40617
> > clog_buf_128 +group_clog_v8 51137
> > clog_buf_128 +content_lock 54188
> >
> > For -b select-only(at)3, I have done quicktest for each version and
> number is
> > same 62K~63K for all version, why do you think this will improve
> > select-only workload?
>
> What I was looking for was pgbench with both -btpcb-like(at)1
> -bselect-only(at)3 specified; i.e. a mixed read/write test.

I have taken the data in the suggested way, and the performance seems to
be neutral for both patches. Detailed data for all the runs of the three
versions is attached.

Median of 3 20-minutes run.

Client Count/Patch_ver (tps) 256
clog_buf_128 110630
clog_buf_128 +group_clog_v8 111575
clog_buf_128 +content_lock 96581

Now, from the above data it appears that the content-lock patch has some
regression, but if you look at the detailed data attached to this mail,
its highest TPS is close to the other patches, though still on the lower
side.

> In my
> measurement that's where Simon's approach shines (not surprising if you
> look at the way it works), and it's of immense practical importance -
> most workloads are mixed.
>
>
I have tried the above tests twice, but didn't notice any gain with the
content-lock approach.

I think by now we have done many tests with both approaches, and we find
that in some cases the content-lock approach is slightly better, in most
cases it is neutral, and in some cases it is worse than the group clog
approach. I feel we should go with the group clog approach now, as it has
been tested and reviewed multiple times, and if in the future we find that
the other approach gives a substantial gain, we can always change it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
test_results_list_300_8GB_group_clog_v8.txt text/plain 2.5 KB
test_results_list_300_8GB_clog_bufs_128.txt text/plain 2.5 KB
test_results_list_300_8GB_content_lock_v1.txt text/plain 2.5 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-08 15:30:54
Message-ID: 20160408153054.hgubabeycowlu3tc@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> I think we should change comments on top of this function. I have changed
> the comments as per my previous patch and attached the modified patch with
> this mail, see if that makes sense.

I've applied this patch.

Regards,

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-08 15:32:28
Message-ID: 20160408153228.j36cknuy3dalxo5h@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-04-08 13:07:05 +0530, Amit Kapila wrote:
> I think by now, we have done many tests with both approaches and we find
> that in some cases, it is slightly better and in most cases it is neutral
> and in some cases it is worse than group clog approach. I feel we should
> go with group clog approach now as that has been tested and reviewed
> multiple times and in future if we find that other approach is giving
> substantial gain, then we can anyway change it.

I think that's a discussion for the 9.7 cycle unfortunately. I've now
pushed the #clog-buffers patch; that's going to help the worst cases.

Greetings,

Andres Freund


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-04-08 16:25:50
Message-ID: CAA4eK1+6URLZYM+xjnvc027+mqQztWc8VDm0udY2X+bdC2HP+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Apr 8, 2016 at 9:00 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> > I think we should change comments on top of this function. I have
> > changed the comments as per my previous patch and attached the
> > modified patch with this mail, see if that makes sense.
>
> I've applied this patch.
>

Thanks!

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-04 21:48:55
Message-ID: 8bab43dd-bc76-0c41-950e-8101269dbf4b@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

This thread started a year ago, different people contributed various
patches, some of which already got committed. Can someone please post a
summary of this thread, so that it's a bit more clear what needs
review/testing, what are the main open questions and so on?

I'm interested in doing some tests on the hardware I have available, but
I'm not willing to spend my time untangling the discussion.

thanks

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-05 04:03:03
Message-ID: CAA4eK1JJOWO6u99gUMMQWELXbci428vRBbJnxg=V04NuZwBCSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> This thread started a year ago, different people contributed various
> patches, some of which already got committed. Can someone please post a
> summary of this thread, so that it's a bit more clear what needs
> review/testing, what are the main open questions and so on?
>

Okay, let me try to summarize this thread. It started off as an attempt
to ameliorate CLOGControlLock contention with a patch to increase the
clog buffers to 128 (which got committed in 9.6). A second patch was
then developed to use group mode to further reduce CLOGControlLock
contention; its latest version is upthread [1] (I have checked that it
still applies). Andres then suggested comparing the group lock mode
approach with an alternative, more granular locking model, for which he
has posted patches upthread [2]. There are three patches on that link;
the ones of interest are 0001-Improve-64bit-atomics-support and
0003-Use-a-much-more-granular-locking-model-for-the-clog-. The second
of those no longer applies, so I have rebased it and attached it to
this mail. In the more granular locking approach, you can comment out
USE_CONTENT_LOCK to make it use atomic operations instead (I could not
compile it with USE_CONTENT_LOCK disabled on my Windows box; you can
try commenting it out and see whether it works for you). So, in short,
we have to compare three approaches here.

1) Group mode to reduce CLOGControlLock contention
2) Use granular locking model
3) Use atomic operations

For approach-1, you can use patch [1]. For approach-2, you can use the
0001-Improve-64bit-atomics-support patch [2] plus the patch attached to
this mail. For approach-3, use the same combination, but with
USE_CONTENT_LOCK commented out. If the third doesn't work for you, then
for now we can compare approach-1 and approach-2.
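
To make the difference between approach-2 and approach-3 easier to
picture, here is a minimal, self-contained C sketch of what such a
USE_CONTENT_LOCK switch boils down to. Everything in it (the packed
status byte, the names, the pthread mutex standing in for an LWLock) is
an illustrative simplification of mine, not code taken from Andres's
patch:

/*
 * Hypothetical sketch: the same status update done either under a
 * small dedicated lock (approach-2) or with a compare-and-swap loop
 * on the word holding the status (approach-3).
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define USE_CONTENT_LOCK            /* comment out for the atomic path */

/* Four 2-bit transaction statuses packed into one "clog byte". */
static atomic_uint clog_byte;
static pthread_mutex_t content_lock = PTHREAD_MUTEX_INITIALIZER;

static void
set_status(unsigned slot, unsigned status)
{
    unsigned    shift = slot * 2;
    unsigned    mask = 3u << shift;

#ifdef USE_CONTENT_LOCK
    /* Approach-2: a fine-grained lock protecting just this byte. */
    pthread_mutex_lock(&content_lock);
    unsigned    old = atomic_load(&clog_byte);

    atomic_store(&clog_byte, (old & ~mask) | (status << shift));
    pthread_mutex_unlock(&content_lock);
#else
    /* Approach-3: lock-free update via compare-and-swap. */
    unsigned    old = atomic_load(&clog_byte);
    unsigned    new;

    do
    {
        new = (old & ~mask) | (status << shift);
    } while (!atomic_compare_exchange_weak(&clog_byte, &old, new));
#endif
}

int
main(void)
{
    set_status(0, 1);               /* e.g. mark slot 0 committed */
    set_status(1, 2);               /* and slot 1 aborted */
    printf("clog byte = 0x%02x\n", atomic_load(&clog_byte));
    return 0;
}

Either variant avoids taking a CLOG-wide exclusive lock for a plain
status write, which is the whole point of approaches 2 and 3; the group
mode approach attacks the same contention differently, by keeping the
single lock but batching the writers.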

I have done some testing of these patches with a read-write pgbench
workload and didn't find a big difference. An interesting test case
could be to use a few sub-transactions (maybe 4-8) per transaction, as
with that we can see more contention on CLOGControlLock.

A few points to note for performance testing: one should use
--unlogged-tables, otherwise WAL writing and WALWriteLock contention
mask the impact of this patch. The impact of this patch is visible at
higher client counts (say 64~128).

> I'm interested in doing some tests on the hardware I have available, but
> I'm not willing spending my time untangling the discussion.
>

Thanks for showing interest; let me know if something is still unclear
or if you need more information to proceed.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2B8gQTyGSZLe1Rb7jeM1Beh4FqA4VNjtpZcmvwizDQ0hw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/20160330230914.GH13305%40awork2.anarazel.de

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
use-granular-locking-v2.patch application/octet-stream 17.0 KB

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-05 08:30:24
Message-ID: CABOikdM_N7i38j5ErwVxP=siD2nhvsWMN=GA43Li+dj_vDnP_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Hi,
>
> This thread started a year ago, different people contributed various
> patches, some of which already got committed. Can someone please post a
> summary of this thread, so that it's a bit more clear what needs
> review/testing, what are the main open questions and so on?
>
> I'm interested in doing some tests on the hardware I have available, but
> I'm not willing spending my time untangling the discussion.
>
>
I signed up for reviewing this patch. But as Amit explained later, there
are two different, independent implementations that solve the problem.
Since Tomas has volunteered to do some benchmarking, I guess I should
wait for the results, because they might influence which approach we
choose.

Does that sound correct? Or do we already know which implementation is
more likely to be pursued? In that case I can start reviewing that
patch.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-05 09:27:19
Message-ID: CAA4eK1JcyjZsDoaZ_RJn8Y_patJ8YzudMqDPwTgPH0xPSFyCuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 5, 2016 at 2:00 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
>
> On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
>>
>> Hi,
>>
>> This thread started a year ago, different people contributed various
>> patches, some of which already got committed. Can someone please post a
>> summary of this thread, so that it's a bit more clear what needs
>> review/testing, what are the main open questions and so on?
>>
>> I'm interested in doing some tests on the hardware I have available, but
>> I'm not willing spending my time untangling the discussion.
>>
>
> I signed up for reviewing this patch. But as Amit explained later, there are
> two different and independent implementations to solve the problem. Since
> Tomas has volunteered to do some benchmarking, I guess I should wait for the
> results because that might influence which approach we choose.
>
> Does that sound correct?
>

Sounds correct to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-05 18:04:58
Message-ID: 195ecd3b-0085-fe6a-762b-cd5dc3321e8c@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/05/2016 06:03 AM, Amit Kapila wrote:
> On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> This thread started a year ago, different people contributed various
>> patches, some of which already got committed. Can someone please post a
>> summary of this thread, so that it's a bit more clear what needs
>> review/testing, what are the main open questions and so on?
>>
>
> Okay, let me try to summarize this thread. This thread started off to
> ameliorate the CLOGControlLock contention with a patch to increase the
> clog buffers to 128 (which got committed in 9.6). Then the second
> patch was developed to use Group mode to further reduce the
> CLOGControlLock contention, latest version of which is upthread [1] (I
> have checked that version still gets applied). Then Andres suggested
> to compare the Group lock mode approach with an alternative (more
> granular) locking model approach for which he has posted patches
> upthread [2]. There are three patches on that link, the patches of
> interest are 0001-Improve-64bit-atomics-support and
> 0003-Use-a-much-more-granular-locking-model-for-the-clog-. I have
> checked that second one of those doesn't get applied, so I have
> rebased it and attached it with this mail. In the more granular
> locking approach, actually, you can comment USE_CONTENT_LOCK to make
> it use atomic operations (I could not compile it by disabling
> USE_CONTENT_LOCK on my windows box, you can try by commenting that as
> well, if it works for you). So, in short we have to compare three
> approaches here.
>
> 1) Group mode to reduce CLOGControlLock contention
> 2) Use granular locking model
> 3) Use atomic operations
>
> For approach-1, you can use patch [1]. For approach-2, you can use
> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
> with this mail. For approach-3, you can use
> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
> with this mail by commenting USE_CONTENT_LOCK. If the third doesn't
> work for you then for now we can compare approach-1 and approach-2.
>

OK, I can compile all three cases - but only with gcc 4.7 or newer.
Sadly the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6,
and my attempts to update to a newer version have been unsuccessful so
far.

> I have done some testing of these patches for read-write pgbench
> workload and doesn't find big difference. Now the interesting test
> case could be to use few sub-transactions (may be 4-8) for each
> transaction as with that we can see more contention for
> CLOGControlLock.

Understood. So a bunch of inserts/updates interleaved by savepoints?

I presume you started looking into this based on a real-world
performance issue, right? Would that be a good test case?

>
> Few points to note for performance testing, one should use --unlogged
> tables, else the WAL writing and WALWriteLock contention masks the
> impact of this patch. The impact of this patch is visible at
> higher-client counts (say at 64~128).
>

Even on good hardware (say, PCIe SSD storage that can do thousands of
fsyncs per second)? Does it then make sense to try optimizing this if
the effect can only be observed without the WAL overhead (so almost
never in practice)?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-06 02:49:19
Message-ID: CAA4eK1Ky+8Okznto0Xtd_oPzfk6kd1A=BZQ-BP2kV04fj4uoqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>
> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>> So, in short we have to compare three
>> approaches here.
>>
>> 1) Group mode to reduce CLOGControlLock contention
>> 2) Use granular locking model
>> 3) Use atomic operations
>>
>> For approach-1, you can use patch [1]. For approach-2, you can use
>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>> with this mail. For approach-3, you can use
>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>> with this mail by commenting USE_CONTENT_LOCK. If the third doesn't
>> work for you then for now we can compare approach-1 and approach-2.
>>
>
> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
> attempts to update to a newer version were unsuccessful so far.
>

So which of the patches are you able to compile on the 4-socket machine?
I think it is better to measure the performance on the bigger machine.

>> I have done some testing of these patches for read-write pgbench
>> workload and doesn't find big difference. Now the interesting test
>> case could be to use few sub-transactions (may be 4-8) for each
>> transaction as with that we can see more contention for
>> CLOGControlLock.
>
> Understood. So a bunch of inserts/updates interleaved by savepoints?
>

Yes.

> I presume you started looking into this based on a real-world
> performance issue, right? Would that be a good test case?
>

I had started looking into it based on LWLOCK_STATS data for a
read-write workload (pgbench tpc-b). I think that is representative of
many real-world read-write workloads.

>>
>> Few points to note for performance testing, one should use --unlogged
>> tables, else the WAL writing and WALWriteLock contention masks the
>> impact of this patch. The impact of this patch is visible at
>> higher-client counts (say at 64~128).
>>
>
> Even on good hardware (say, PCIe SSD storage that can do thousands of
> fsyncs per second)?

Not sure, because it could be masked by WALWriteLock contention.

> Does it then make sense to try optimizing this if
> the effect can only be observed without the WAL overhead (so almost
> never in practice)?
>

It is not that there is no improvement with WAL overhead (one can
observe it via LWLOCK_STATS, apart from TPS), but it is clearly visible
with unlogged tables. The situation is not that simple: say we do
nothing about the remaining contention for CLOGControlLock; then, when
we try to reduce the contention around other locks like WALWriteLock or
maybe ProcArrayLock, there is a chance that the contention will shift to
CLOGControlLock. So the basic idea is that to get the big benefits, we
need to eliminate contention around each of the locks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-06 19:38:35
Message-ID: 800c71e5-13a6-887b-250d-0ab8706533e3@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/06/2016 04:49 AM, Amit Kapila wrote:
> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>
>> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>>> So, in short we have to compare three
>>> approaches here.
>>>
>>> 1) Group mode to reduce CLOGControlLock contention
>>> 2) Use granular locking model
>>> 3) Use atomic operations
>>>
>>> For approach-1, you can use patch [1]. For approach-2, you can use
>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>> with this mail. For approach-3, you can use
>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>> with this mail by commenting USE_CONTENT_LOCK. If the third doesn't
>>> work for you then for now we can compare approach-1 and approach-2.
>>>
>>
>> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
>> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
>> attempts to update to a newer version were unsuccessful so far.
>>
>
> So which all patches your are able to compile on 4-socket m/c? I
> think it is better to measure the performance on bigger machine.

Oh, sorry - I forgot to mention that only the last test (with
USE_CONTENT_LOCK commented out) fails to compile, because the functions
for atomics were added in gcc-4.7.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-07 11:13:08
Message-ID: CAA4eK1K+2g7nVkQBC_HsKUKYvACh_sAMb6Q19dg2c_6-DQyRfw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 7, 2016 at 1:08 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/06/2016 04:49 AM, Amit Kapila wrote:
>> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>>
>>> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>>>> So, in short we have to compare three
>>>> approaches here.
>>>>
>>>> 1) Group mode to reduce CLOGControlLock contention
>>>> 2) Use granular locking model
>>>> 3) Use atomic operations
>>>>
>>>> For approach-1, you can use patch [1]. For approach-2, you can use
>>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>>> with this mail. For approach-3, you can use
>>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>>> with this mail by commenting USE_CONTENT_LOCK. If the third doesn't
>>>> work for you then for now we can compare approach-1 and approach-2.
>>>>
>>>
>>> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
>>> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
>>> attempts to update to a newer version were unsuccessful so far.
>>>
>>
>> So which all patches your are able to compile on 4-socket m/c? I
>> think it is better to measure the performance on bigger machine.
>
> Oh, sorry - I forgot to mention that only the last test (with
> USE_CONTENT_LOCK commented out) fails to compile, because the functions
> for atomics were added in gcc-4.7.
>

No issues, in that case we can leave the last test for now and do it later.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-07 12:14:03
Message-ID: 5b4e50ff-02a7-2838-10e6-da758637338f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/07/2016 01:13 PM, Amit Kapila wrote:
> On Wed, Sep 7, 2016 at 1:08 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 09/06/2016 04:49 AM, Amit Kapila wrote:
>>> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>
>>>>
>>>> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>>>>> So, in short we have to compare three
>>>>> approaches here.
>>>>>
>>>>> 1) Group mode to reduce CLOGControlLock contention
>>>>> 2) Use granular locking model
>>>>> 3) Use atomic operations
>>>>>
>>>>> For approach-1, you can use patch [1]. For approach-2, you can use
>>>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>>>> with this mail. For approach-3, you can use
>>>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>>>> with this mail by commenting USE_CONTENT_LOCK. If the third doesn't
>>>>> work for you then for now we can compare approach-1 and approach-2.
>>>>>
>>>>
>>>> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
>>>> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
>>>> attempts to update to a newer version were unsuccessful so far.
>>>>
>>>
>>> So which all patches your are able to compile on 4-socket m/c? I
>>> think it is better to measure the performance on bigger machine.
>>
>> Oh, sorry - I forgot to mention that only the last test (with
>> USE_CONTENT_LOCK commented out) fails to compile, because the functions
>> for atomics were added in gcc-4.7.
>>
>
> No issues, in that case we can leave the last test for now and do it later.
>

FWIW I've managed to compile a new GCC on the system (all I had to do
was to actually read the damn manual), so I'm ready to do the test once
I get a bit of time.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-14 04:55:16
Message-ID: CAFiTN-u3=XUi7z8dTOgxZ98E7gL1tzL=q9Yd=CwWCtTtS6pOZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 5, 2016 at 9:33 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> USE_CONTENT_LOCK on my windows box, you can try by commenting that as
> well, if it works for you). So, in short we have to compare three
> approaches here.
>
> 1) Group mode to reduce CLOGControlLock contention
> 2) Use granular locking model
> 3) Use atomic operations

I have tested performance with approach 1 and approach 2.

1. Transaction (script.sql): I have used the transaction below to run my
benchmark. We can argue that this may not be an ideal workload, but I
tested it to put more load on ClogControlLock at commit time.

-----------
\set aid random (1,30000000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
END;
-----------

2. Results
./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
scale factor: 300
Clients   head(tps)   grouplock(tps)   granular(tps)
-------   ---------   --------------   -------------
128       29367       39326            37421
180       29777       37810            36469
256       28523       37418            35882

grouplock --> 1) Group mode to reduce CLOGControlLock contention
granular --> 2) Use granular locking model

I will test with 3rd approach also, whenever I get time.

3. Summary:
1. Compared to head, we are gaining almost ~30% performance at higher
client counts (128 and beyond).
2. Group lock is ~5% better than granular lock.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-14 07:38:19
Message-ID: CAFiTN-uoKz31HcmPTrAaZVXbvTHvO5CUNKTRdJ-fY_7-uAnwRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 14, 2016 at 10:25 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I have tested performance with approach 1 and approach 2.
>
> 1. Transaction (script.sql): I have used below transaction to run my
> bench mark, We can argue that this may not be an ideal workload, but I
> tested this to put more load on ClogControlLock during commit
> transaction.
>
> -----------
> \set aid random (1,30000000)
> \set tid random (1,3000)
>
> BEGIN;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> SAVEPOINT s1;
> SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
> SAVEPOINT s2;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> END;
> -----------
>
> 2. Results
> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
> scale factor: 300
> Clients head(tps) grouplock(tps) granular(tps)
> ------- --------- ---------- -------
> 128 29367 39326 37421
> 180 29777 37810 36469
> 256 28523 37418 35882
>
>
> grouplock --> 1) Group mode to reduce CLOGControlLock contention
> granular --> 2) Use granular locking model
>
> I will test with 3rd approach also, whenever I get time.
>
> 3. Summary:
> 1. I can see on head we are gaining almost ~30 % performance at higher
> client count (128 and beyond).
> 2. group lock is ~5% better compared to granular lock.

Forgot to mention that this test is on unlogged tables.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-14 15:29:48
Message-ID: CA+TgmoYO1wNi7dRwNxGUupmMPhOkae-pmU38ndC1s6FDQ-USJg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> 2. Results
> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
> scale factor: 300
> Clients head(tps) grouplock(tps) granular(tps)
> ------- --------- ---------- -------
> 128 29367 39326 37421
> 180 29777 37810 36469
> 256 28523 37418 35882
>
>
> grouplock --> 1) Group mode to reduce CLOGControlLock contention
> granular --> 2) Use granular locking model
>
> I will test with 3rd approach also, whenever I get time.
>
> 3. Summary:
> 1. I can see on head we are gaining almost ~30 % performance at higher
> client count (128 and beyond).
> 2. group lock is ~5% better compared to granular lock.

Sure, but you're testing at *really* high client counts here. Almost
nobody is going to benefit from a 5% improvement at 256 clients. You
need to test 64 clients and 32 clients and 16 clients and 8 clients
and see what happens there. Those cases are a lot more likely than
these stratospheric client counts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-14 16:04:55
Message-ID: CAFiTN-t-VKZTXUdOX_L_X4Nw6bXOX=Fbmm2Oq=PmD4KqCufHBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Sure, but you're testing at *really* high client counts here. Almost
> nobody is going to benefit from a 5% improvement at 256 clients.

I agree with your point, but we need to consider one more thing here:
compared to head, we are gaining ~30% with both approaches.

So for comparing these two patches we can consider:

A. Other workloads (one could be as below):
-> load on CLogControlLock at commit (exclusive mode) + load on
CLogControlLock for transaction-status reads (shared mode).
I think we can mix savepoints and updates.

B. Simplicity of the patch (if both perform almost equally in all
practical scenarios).

C. Based on the algorithm, whichever seems the winner.

I will try to test these patches with other workloads...

> You
> need to test 64 clients and 32 clients and 16 clients and 8 clients
> and see what happens there. Those cases are a lot more likely than
> these stratospheric client counts.

I tested with 64 clients as well:
1. Compared to head, we are gaining ~15% with both patches.
2. But group lock vs. granular lock is almost the same.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-17 01:24:33
Message-ID: 3bb2699f-fe51-7419-b42b-9ad5bfd0d506@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/14/2016 06:04 PM, Dilip Kumar wrote:
> On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Sure, but you're testing at *really* high client counts here. Almost
>> nobody is going to benefit from a 5% improvement at 256 clients.
>
> I agree with your point, but here we need to consider one more thing,
> that on head we are gaining ~30% with both the approaches.
>
> So for comparing these two patches we can consider..
>
> A. Other workloads (one can be as below)
> -> Load on CLogControlLock at commit (exclusive mode) + Load on
> CLogControlLock at Transaction status (shared mode).
> I think we can mix (savepoint + updates)
>
> B. Simplicity of the patch (if both are performing almost equal in all
> practical scenarios).
>
> C. Bases on algorithm whichever seems winner.
>
> I will try to test these patches with other workloads...
>
>> You
>> need to test 64 clients and 32 clients and 16 clients and 8 clients
>> and see what happens there. Those cases are a lot more likely than
>> these stratospheric client counts.
>
> I tested with 64 clients as well..
> 1. On head we are gaining ~15% with both the patches.
> 2. But group lock vs granular lock is almost same.
>

I've been doing some testing too, but I haven't managed to measure any
significant difference between master and any of the patches. Not sure
why; I've repeated the test from scratch to make sure I haven't done
anything stupid, but I got the same results (which is one of the main
reasons why the testing took me so long).

Attached is an archive with a script running the benchmark (including
SQL scripts generating the data and custom transaction for pgbench), and
results in a CSV format.

The benchmark is fairly simple - for each case (master + 3 different
patches) we do 10 runs, 5 minutes each, for 32, 64, 128 and 192 clients
(the machine has 32 physical cores).

The transaction is using a single unlogged table initialized like this:

create unlogged table t(id int, val int);
insert into t select i, i from generate_series(1,100000) s(i);
vacuum t;
create index on t(id);

(I've also run it with 100M rows, called "large" in the results), and
pgbench is running this transaction:

\set id random(1, 100000)

BEGIN;
UPDATE t SET val = val + 1 WHERE id = :id;
SAVEPOINT s1;
UPDATE t SET val = val + 1 WHERE id = :id;
SAVEPOINT s2;
UPDATE t SET val = val + 1 WHERE id = :id;
SAVEPOINT s3;
UPDATE t SET val = val + 1 WHERE id = :id;
SAVEPOINT s4;
UPDATE t SET val = val + 1 WHERE id = :id;
SAVEPOINT s5;
UPDATE t SET val = val + 1 WHERE id = :id;
SAVEPOINT s6;
UPDATE t SET val = val + 1 WHERE id = :id;
SAVEPOINT s7;
UPDATE t SET val = val + 1 WHERE id = :id;
SAVEPOINT s8;
COMMIT;

So 8 simple UPDATEs interleaved by savepoints. The benchmark was running
on a machine with 256GB of RAM, 32 cores (4x E5-4620) and a fairly large
SSD array. I'd done some basic tuning on the system, most importantly:

effective_io_concurrency = 32
work_mem = 512MB
maintenance_work_mem = 512MB
max_connections = 300
checkpoint_completion_target = 0.9
checkpoint_timeout = 3600
max_wal_size = 128GB
min_wal_size = 16GB
shared_buffers = 16GB

Although most of those changes probably do not matter much for unlogged
tables (I planned to see how this affects regular tables, but as I see
no difference for unlogged ones, I haven't done that yet).

So the question is why Dilip sees a +30% improvement, while my results
are almost exactly the same as master. Looking at Dilip's benchmark, I
see he only ran the test for 10 seconds, and I'm not sure how many runs
he did, what warmup he used, etc. Dilip, can you provide additional
info?

I'll ask someone else to redo the benchmark after the weekend to make
sure it's not actually some stupid mistake of mine.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
clog.tgz application/x-compressed-tar 4.3 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-17 03:23:38
Message-ID: CAA4eK1+bCz3vo+jm7zWVRp6SDxmErOKNWAfkid-cm2a9MDNNSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/14/2016 06:04 PM, Dilip Kumar wrote:
>>
>> On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>>>
>>> Sure, but you're testing at *really* high client counts here. Almost
>>> nobody is going to benefit from a 5% improvement at 256 clients.
>>
>>
>> I agree with your point, but here we need to consider one more thing,
>> that on head we are gaining ~30% with both the approaches.
>>
>> So for comparing these two patches we can consider..
>>
>> A. Other workloads (one can be as below)
>> -> Load on CLogControlLock at commit (exclusive mode) + Load on
>> CLogControlLock at Transaction status (shared mode).
>> I think we can mix (savepoint + updates)
>>
>> B. Simplicity of the patch (if both are performing almost equal in all
>> practical scenarios).
>>
>> C. Bases on algorithm whichever seems winner.
>>
>> I will try to test these patches with other workloads...
>>
>>> You
>>> need to test 64 clients and 32 clients and 16 clients and 8 clients
>>> and see what happens there. Those cases are a lot more likely than
>>> these stratospheric client counts.
>>
>>
>> I tested with 64 clients as well..
>> 1. On head we are gaining ~15% with both the patches.
>> 2. But group lock vs granular lock is almost same.
>>
>
>
> The transaction is using a single unlogged table initialized like this:
>
> create unlogged table t(id int, val int);
> insert into t select i, i from generate_series(1,100000) s(i);
> vacuum t;
> create index on t(id);
>
> (I've also ran it with 100M rows, called "large" in the results), and
> pgbench is running this transaction:
>
> \set id random(1, 100000)
>
> BEGIN;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s1;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s2;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s3;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s4;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s5;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s6;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s7;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s8;
> COMMIT;
>
> So 8 simple UPDATEs interleaved by savepoints.
>

The difference between these and the tests performed by Dilip is that he
uses fewer savepoints. If you want to try it again, could you do it once
with either no savepoints or 1~2 savepoints? The other thing you could
try is the same test Dilip has done (with and without 2 savepoints).

> The benchmark was running on
> a machine with 256GB of RAM, 32 cores (4x E5-4620) and a fairly large SSD
> array. I'd done some basic tuning on the system, most importantly:
>
> effective_io_concurrency = 32
> work_mem = 512MB
> maintenance_work_mem = 512MB
> max_connections = 300
> checkpoint_completion_target = 0.9
> checkpoint_timeout = 3600
> max_wal_size = 128GB
> min_wal_size = 16GB
> shared_buffers = 16GB
>
> Although most of the changes probably does not matter much for unlogged
> tables (I planned to see how this affects regular tables, but as I see no
> difference for unlogged ones, I haven't done that yet).
>

You are right. Unless we see a benefit with unlogged tables, there is no
point in doing it for regular tables.

> So the question is why Dilip sees +30% improvement, while my results are
> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
> Dilip, can you provide additional info?
>
> I'll ask someone else to redo the benchmark after the weekend to make sure
> it's not actually some stupid mistake of mine.
>

I think there is not much point in repeating the tests you have done;
rather, it would be better to try the tests done by Dilip in your
environment and see the results.

Thanks for doing the tests.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-17 03:42:20
Message-ID: a214524a-064c-f253-eb1c-75fbc88a87ac@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/17/2016 05:23 AM, Amit Kapila wrote:
> On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 09/14/2016 06:04 PM, Dilip Kumar wrote:
>>>
...
>>
>> (I've also ran it with 100M rows, called "large" in the results), and
>> pgbench is running this transaction:
>>
>> \set id random(1, 100000)
>>
>> BEGIN;
>> UPDATE t SET val = val + 1 WHERE id = :id;
>> SAVEPOINT s1;
>> UPDATE t SET val = val + 1 WHERE id = :id;
>> SAVEPOINT s2;
>> UPDATE t SET val = val + 1 WHERE id = :id;
>> SAVEPOINT s3;
>> UPDATE t SET val = val + 1 WHERE id = :id;
>> SAVEPOINT s4;
>> UPDATE t SET val = val + 1 WHERE id = :id;
>> SAVEPOINT s5;
>> UPDATE t SET val = val + 1 WHERE id = :id;
>> SAVEPOINT s6;
>> UPDATE t SET val = val + 1 WHERE id = :id;
>> SAVEPOINT s7;
>> UPDATE t SET val = val + 1 WHERE id = :id;
>> SAVEPOINT s8;
>> COMMIT;
>>
>> So 8 simple UPDATEs interleaved by savepoints.
>>
>
> The difference between these and tests performed by Dilip is that he
> has lesser savepoints. I think if you want to try it again, then can
> you once do it with either no savepoint or 1~2 savepoints. The other
> thing you could try out is the same test as Dilip has done (with and
> without 2 savepoints).
>

I don't follow. My understanding is the patches should make savepoints
cheaper - so why would using fewer savepoints increase the effect of the
patches?

FWIW I've already done a quick test with 2 savepoints, no difference. I
can do a full test of course.

>> The benchmark was running on
>> a machine with 256GB of RAM, 32 cores (4x E5-4620) and a fairly large SSD
>> array. I'd done some basic tuning on the system, most importantly:
>>
>> effective_io_concurrency = 32
>> work_mem = 512MB
>> maintenance_work_mem = 512MB
>> max_connections = 300
>> checkpoint_completion_target = 0.9
>> checkpoint_timeout = 3600
>> max_wal_size = 128GB
>> min_wal_size = 16GB
>> shared_buffers = 16GB
>>
>> Although most of the changes probably does not matter much for unlogged
>> tables (I planned to see how this affects regular tables, but as I see no
>> difference for unlogged ones, I haven't done that yet).
>>
>
> You are right. Unless, we don't see the benefit with unlogged tables,
> there is no point in doing it for regular tables.
>
>> So the question is why Dilip sees +30% improvement, while my results are
>> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
>> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
>> Dilip, can you provide additional info?
>>
>> I'll ask someone else to redo the benchmark after the weekend to make sure
>> it's not actually some stupid mistake of mine.
>>
>
> I think there is not much point in repeating the tests you have
> done, rather it is better if we can try again the tests done by Dilip
> in your environment to see the results.
>

I'm OK with running Dilip's tests, but I'm not sure why there's not much
point in running the tests I've done. Or perhaps I'd like to understand
why "my tests" show no improvement whatsoever first - after all, they're
not that different from Dilip's.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-17 03:47:31
Message-ID: 0733fb9c-c3dc-fe30-1ad1-8c3ddf05c92d@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/14/2016 05:29 PM, Robert Haas wrote:
> On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> 2. Results
>> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
>> scale factor: 300
>> Clients head(tps) grouplock(tps) granular(tps)
>> ------- --------- ---------- -------
>> 128 29367 39326 37421
>> 180 29777 37810 36469
>> 256 28523 37418 35882
>>
>>
>> grouplock --> 1) Group mode to reduce CLOGControlLock contention
>> granular --> 2) Use granular locking model
>>
>> I will test with 3rd approach also, whenever I get time.
>>
>> 3. Summary:
>> 1. I can see on head we are gaining almost ~30 % performance at higher
>> client count (128 and beyond).
>> 2. group lock is ~5% better compared to granular lock.
>
> Sure, but you're testing at *really* high client counts here. Almost
> nobody is going to benefit from a 5% improvement at 256 clients. You
> need to test 64 clients and 32 clients and 16 clients and 8 clients
> and see what happens there. Those cases are a lot more likely than
> these stratospheric client counts.
>

Right. My impression from the discussion so far is that the patches only
improve performance with very many concurrent clients - but as Robert
points out, almost no one is running with 256 active clients, unless
they have 128 cores or so. At least not if they value latency more than
throughput.

So while it's nice to improve throughput in those cases, it's a bit like
a tree falling in the forest without anyone around.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-17 04:48:16
Message-ID: CAA4eK1Lz=JFE00=JDNhGaFfy9FNEYYWA7tzn5SsoO=RoNSj1sw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 17, 2016 at 9:12 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/17/2016 05:23 AM, Amit Kapila wrote:
>>
>> The difference between these and tests performed by Dilip is that he
>> has lesser savepoints. I think if you want to try it again, then can
>> you once do it with either no savepoint or 1~2 savepoints. The other
>> thing you could try out is the same test as Dilip has done (with and
>> without 2 savepoints).
>>
>
> I don't follow. My understanding is the patches should make savepoints
> cheaper - so why would using fewer savepoints increase the effect of the
> patches?
>

Oh no, the purpose of the patch is not to make savepoints cheaper (I
know I earlier suggested checking with a few savepoints, but that was
not intended to mean that this patch makes savepoints cheaper; rather,
it might show the difference between the approaches - sorry if that was
not clearly stated earlier). The purpose of this patch (or patches) is
to make commits cheaper, in particular updating the status in CLOG. Let
me briefly explain the CLOG contention and what these patches try to
accomplish. As of head, when we try to update the status in CLOG
(TransactionIdSetPageStatus), we take CLOGControlLock in EXCLUSIVE mode
to read the appropriate CLOG page (most of the time it will be in
memory, so that is cheap) and then update the transaction status in it.
We take CLOGControlLock in SHARED mode while reading the transaction
status (if the required clog page is in memory; otherwise the lock is
upgraded to EXCLUSIVE), which happens when we access a tuple whose hint
bits are not set.

So, we have two different types of contention around CLOGControlLock:
(a) all the transactions that try to commit at the same time have to do
so almost serially, and (b) readers of transaction status contend with
writers.

Now, with the patch that went into 9.6 (increasing the clog buffers),
the second type of contention is mostly reduced, as most of the required
pages are in memory; we are hoping that this patch will help in reducing
the first type (a) of contention as well.
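
For anyone trying to picture what the group mode patch does about
contention (a), here is a minimal, self-contained C sketch of the
general leader/follower idea. Every name, the toy clog array, and the
busy-wait are simplifications of mine, not code from the actual patch.
Instead of every backend taking the exclusive lock to record its own
commit status, backends push their requests onto a pending list and the
first one becomes the leader, applying the whole batch under a single
lock acquisition, so N committing backends pay for one exclusive
acquisition instead of N:

/*
 * Hypothetical sketch of grouped status updates (not patch code).
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCLIENTS 8

typedef struct Request
{
    int         xid;            /* transaction whose status to set */
    int         status;         /* new status value */
    struct Request *next;       /* next pending request */
    atomic_int  done;           /* set by the leader once applied */
} Request;

static _Atomic(Request *) pending;  /* head of the pending list */
static pthread_mutex_t clog_lock = PTHREAD_MUTEX_INITIALIZER;
static int  clog[NCLIENTS];         /* stand-in for a CLOG page */

static void
set_status_grouped(Request *req)
{
    Request    *head = atomic_load(&pending);

    /* Push our request onto the pending list (lock-free). */
    do
    {
        req->next = head;
    } while (!atomic_compare_exchange_weak(&pending, &head, req));

    if (head != NULL)
    {
        /* Someone else is (or will become) the leader; wait for them. */
        while (!atomic_load(&req->done))
            ;                   /* a real patch would sleep, not spin */
        return;
    }

    /* We are the leader: grab the batch and apply it under one lock. */
    Request    *batch = atomic_exchange(&pending, NULL);

    pthread_mutex_lock(&clog_lock);
    for (Request *r = batch; r != NULL; r = r->next)
        clog[r->xid] = r->status;
    pthread_mutex_unlock(&clog_lock);

    /* Wake the followers (read next before marking done). */
    for (Request *r = batch; r != NULL;)
    {
        Request    *next = r->next;

        atomic_store(&r->done, 1);
        r = next;
    }
}

static void *
worker(void *arg)
{
    Request     req = {.xid = (int) (long) arg, .status = 1, .next = NULL};

    atomic_init(&req.done, 0);
    set_status_grouped(&req);
    return NULL;
}

int
main(void)
{
    pthread_t   th[NCLIENTS];

    for (long i = 0; i < NCLIENTS; i++)
        pthread_create(&th[i], NULL, worker, (void *) i);
    for (int i = 0; i < NCLIENTS; i++)
        pthread_join(th[i], NULL);
    for (int i = 0; i < NCLIENTS; i++)
        printf("xid %d -> status %d\n", i, clog[i]);
    return 0;
}

A real implementation would of course integrate with the existing lock
and wakeup machinery instead of a mutex and a spin loop; the sketch only
shows the shape of the optimization.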

>>
>
> I'm OK with running Dilip's tests, but I'm not sure why there's not much
> point in running the tests I've done. Or perhaps I'd like to understand why
> "my tests" show no improvement whatsoever first - after all, they're not
> that different from Dilip's.
>

The test Dilip is doing ("SELECT ... FOR UPDATE") is mainly aimed at the
first type (a) of contention, as it doesn't change the hint bits, so
mostly it should not need to read the transaction status when accessing
the tuple. The tests you are doing, on the other hand, are mainly
focussed on the second type (b) of contention.

One point we have to keep in mind here is that this contention is
visible on machines with more sockets. Last time Jesper also tried these
patches but didn't find much difference in his environment, and on
further analysis (IIRC) we found that the reason was that the contention
was simply not visible there.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-17 05:05:55
Message-ID: CAA4eK1J85_243PcJbG-AJoRq3wwZcc4xvMJEQ605Yz+1TXJpow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/14/2016 05:29 PM, Robert Haas wrote:
>>
>> On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
>> wrote:
>>>
>>> 2. Results
>>> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f
>>> script.sql
>>> scale factor: 300
>>> Clients head(tps) grouplock(tps) granular(tps)
>>> ------- --------- ---------- -------
>>> 128 29367 39326 37421
>>> 180 29777 37810 36469
>>> 256 28523 37418 35882
>>>
>>>
>>> grouplock --> 1) Group mode to reduce CLOGControlLock contention
>>> granular --> 2) Use granular locking model
>>>
>>> I will test with 3rd approach also, whenever I get time.
>>>
>>> 3. Summary:
>>> 1. I can see on head we are gaining almost ~30 % performance at higher
>>> client count (128 and beyond).
>>> 2. group lock is ~5% better compared to granular lock.
>>
>>
>> Sure, but you're testing at *really* high client counts here. Almost
>> nobody is going to benefit from a 5% improvement at 256 clients. You
>> need to test 64 clients and 32 clients and 16 clients and 8 clients
>> and see what happens there. Those cases are a lot more likely than
>> these stratospheric client counts.
>>
>
> Right. My impression from the discussion so far is that the patches only
> improve performance with very many concurrent clients - but as Robert points
> out, almost no one is running with 256 active clients, unless they have 128
> cores or so. At least not if they value latency more than throughput.
>

See, I am also not in favor of going with any of these patches if they
don't help reduce contention. However, I think it is important to
understand under what kind of workload and in which environment they
show a benefit or a regression, whichever is applicable. Just FYI, a
couple of days back one of EDB's partners, who was doing performance
tests using HammerDB (again an OLTP-focussed workload) on 9.5-based
code, found that CLogControlLock had significantly high contention.
They were using synchronous_commit=off in their settings. Now, it is
quite possible that with the improvements done in 9.6 the contention
they are seeing will be eliminated, but we have yet to figure that out.
I shared this information with you to point out that this seems to be a
real problem, and we should try to work on it unless we can convince
ourselves that it is not.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-17 05:56:55
Message-ID: CAFiTN-sAPN5yUeDfOAf0WZYKH-Mj+2Wa7mqx-ejM7CXVL4pNNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Although most of the changes probably does not matter much for unlogged
> tables (I planned to see how this affects regular tables, but as I see no
> difference for unlogged ones, I haven't done that yet).
>
> So the question is why Dilip sees +30% improvement, while my results are
> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
> Dilip, can you provide additional info?

Actually, I ran the test for 10 minutes.

Sorry for the confusion (I copy-pasted my script and manually replaced
the variable, and made a mistake).

My script is like this:

scale_factor=300
shared_bufs=8GB
time_for_reading=600

./postgres -c shared_buffers=8GB -c checkpoint_timeout=40min -c
max_wal_size=20GB -c max_connections=300 -c maintenance_work_mem=1GB&
./pgbench -i -s $scale_factor --unlogged-tables postgres
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared
postgres -f ../../script.sql >> test_results.txt

I am taking the median of three readings.

With the script below, I can repeat my results every time (15% gain over
head at 64 clients and 30% gain over head at 128+ clients).

I will repeat my test with 8, 16 and 32 clients and post the results
soon.

> \set aid random (1,30000000)
> \set tid random (1,3000)
>
> BEGIN;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> SAVEPOINT s1;
> SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
> SAVEPOINT s2;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> END;
> -----------

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-17 17:55:44
Message-ID: 3dfb27e5-d3d3-20bb-2b35-3a0e2747625f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/17/2016 07:05 AM, Amit Kapila wrote:
> On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 09/14/2016 05:29 PM, Robert Haas wrote:
...
>>> Sure, but you're testing at *really* high client counts here.
>>> Almost nobody is going to benefit from a 5% improvement at 256
>>> clients. You need to test 64 clients and 32 clients and 16
>>> clients and 8 clients and see what happens there. Those cases are
>>> a lot more likely than these stratospheric client counts.
>>>
>>
>> Right. My impression from the discussion so far is that the patches
>> only improve performance with very many concurrent clients - but as
>> Robert points out, almost no one is running with 256 active
>> clients, unless they have 128 cores or so. At least not if they
>> value latency more than throughput.
>>
>
> See, I am also not in favor of going with any of these patches, if
> they doesn't help in reduction of contention. However, I think it is
> important to understand, under what kind of workload and which
> environment it can show the benefit or regression whichever is
> applicable.

Sure. Which is why I initially asked what type of workload I should be
testing, and then did the testing with multiple savepoints, as that's
what you suggested. But apparently that's not a workload that could
benefit from this patch, so I'm a bit confused.

> Just FYI, couple of days back one of EDB's partner who was doing the
> performance tests by using HammerDB (which is again OLTP focussed
> workload) on 9.5 based code has found that CLogControlLock has the
> significantly high contention. They were using synchronous_commit=off
> in their settings. Now, it is quite possible that with improvements
> done in 9.6, the contention they are seeing will be eliminated, but
> we have yet to figure that out. I just shared this information to you
> with the intention that this seems to be a real problem and we should
> try to work on it unless we are able to convince ourselves that this
> is not a problem.
>

So, can we approach the problem from this direction instead? That is,
instead of looking for workloads that might benefit from the patches,
look at real-world examples of CLOG lock contention and then evaluate
the impact on those?

Extracting the workload from benchmarks probably is not ideal, but it's
still better than constructing the workload on our own to fit the patch.

FWIW I'll do a simple pgbench test - first with synchronous_commit=on
and then with synchronous_commit=off. Probably the workloads we should
have started with anyway, I guess.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-18 04:08:53
Message-ID: CAA4eK1LOk=6omxO15fhhpaG4iOzdzW+9oMr=fh6nCeD37JvB6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 17, 2016 at 11:25 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/17/2016 07:05 AM, Amit Kapila wrote:
>>
>> On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>> On 09/14/2016 05:29 PM, Robert Haas wrote:
>
> ...
>>>>
>>>> Sure, but you're testing at *really* high client counts here.
>>>> Almost nobody is going to benefit from a 5% improvement at 256
>>>> clients. You need to test 64 clients and 32 clients and 16
>>>> clients and 8 clients and see what happens there. Those cases are
>>>> a lot more likely than these stratospheric client counts.
>>>>
>>>
>>> Right. My impression from the discussion so far is that the patches
>>> only improve performance with very many concurrent clients - but as
>>> Robert points out, almost no one is running with 256 active
>>> clients, unless they have 128 cores or so. At least not if they
>>> value latency more than throughput.
>>>
>>
>> See, I am also not in favor of going with any of these patches, if
>> they doesn't help in reduction of contention. However, I think it is
>> important to understand, under what kind of workload and which
>> environment it can show the benefit or regression whichever is
>> applicable.
>
>
> Sure. Which is why I initially asked what type of workload should I be
> testing, and then done the testing with multiple savepoints as that's what
> you suggested. But apparently that's not a workload that could benefit from
> this patch, so I'm a bit confused.
>
>> Just FYI, couple of days back one of EDB's partner who was doing the
>> performance tests by using HammerDB (which is again OLTP focussed
>> workload) on 9.5 based code has found that CLogControlLock has the
>> significantly high contention. They were using synchronous_commit=off
>> in their settings. Now, it is quite possible that with improvements
>> done in 9.6, the contention they are seeing will be eliminated, but
>> we have yet to figure that out. I just shared this information to you
>> with the intention that this seems to be a real problem and we should
>> try to work on it unless we are able to convince ourselves that this
>> is not a problem.
>>
>
> So, can we approach the problem from this direction instead? That is,
> instead of looking for workloads that might benefit from the patches, look
> at real-world examples of CLOG lock contention and then evaluate the impact
> on those?
>

Sure, we can go that way as well, but I thought that instead of testing
with a new benchmark kit (HammerDB), it would be better to first start
with some simple statements.

> Extracting the workload from benchmarks probably is not ideal, but it's
> still better than constructing the workload on our own to fit the patch.
>
> FWIW I'll do a simple pgbench test - first with synchronous_commit=on and
> then with synchronous_commit=off. Probably the workloads we should have
> started with anyway, I guess.
>

Here, the synchronous_commit = off case could be interesting. Do you see
any problem with first trying a workload where Dilip is seeing a
benefit? I am not suggesting we should not do any other testing, but
let's first try to reproduce the performance gain seen in Dilip's
tests.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-18 21:11:58
Message-ID: 9e877db7-4dc2-88c1-67ae-034ad1a5cafe@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/18/2016 06:08 AM, Amit Kapila wrote:
> On Sat, Sep 17, 2016 at 11:25 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 09/17/2016 07:05 AM, Amit Kapila wrote:
>>>
>>> On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>
>>>> On 09/14/2016 05:29 PM, Robert Haas wrote:
>>
>> ...
>>>>>
>>>>> Sure, but you're testing at *really* high client counts here.
>>>>> Almost nobody is going to benefit from a 5% improvement at 256
>>>>> clients. You need to test 64 clients and 32 clients and 16
>>>>> clients and 8 clients and see what happens there. Those cases are
>>>>> a lot more likely than these stratospheric client counts.
>>>>>
>>>>
>>>> Right. My impression from the discussion so far is that the patches
>>>> only improve performance with very many concurrent clients - but as
>>>> Robert points out, almost no one is running with 256 active
>>>> clients, unless they have 128 cores or so. At least not if they
>>>> value latency more than throughput.
>>>>
>>>
>>> See, I am also not in favor of going with any of these patches, if
>>> they doesn't help in reduction of contention. However, I think it is
>>> important to understand, under what kind of workload and which
>>> environment it can show the benefit or regression whichever is
>>> applicable.
>>
>>
>> Sure. Which is why I initially asked what type of workload should I be
>> testing, and then done the testing with multiple savepoints as that's what
>> you suggested. But apparently that's not a workload that could benefit from
>> this patch, so I'm a bit confused.
>>
>>> Just FYI, couple of days back one of EDB's partner who was doing the
>>> performance tests by using HammerDB (which is again OLTP focussed
>>> workload) on 9.5 based code has found that CLogControlLock has the
>>> significantly high contention. They were using synchronous_commit=off
>>> in their settings. Now, it is quite possible that with improvements
>>> done in 9.6, the contention they are seeing will be eliminated, but
>>> we have yet to figure that out. I just shared this information to you
>>> with the intention that this seems to be a real problem and we should
>>> try to work on it unless we are able to convince ourselves that this
>>> is not a problem.
>>>
>>
>> So, can we approach the problem from this direction instead? That is,
>> instead of looking for workloads that might benefit from the patches, look
>> at real-world examples of CLOG lock contention and then evaluate the impact
>> on those?
>>
>
> Sure, we can go that way as well, but I thought instead of testing
> with a new benchmark kit (HammerDB), it is better to first get with
> some simple statements.
>

IMHO in the ideal case the first message in this thread would provide a
test case, demonstrating the effect of the patch. Then we wouldn't have
the issue of looking for a good workload two years later.

But now that I look at the first post, I see it apparently used a plain
tpc-b pgbench (with synchronous_commit=on) to show the benefits, which
is the workload I'm running right now (results sometime tomorrow).

That workload clearly uses no savepoints at all, so I'm wondering why
you suggested using several of them - I know you said it's to show
differences between the approaches, but why should that matter to any of
the patches (and if it matters, why did I get almost no differences in
the benchmarks)?

Pardon my ignorance, CLOG is not my area of expertise ...

>> Extracting the workload from benchmarks probably is not ideal, but
>> it's still better than constructing the workload on our own to fit
>> the patch.
>>
>> FWIW I'll do a simple pgbench test - first with
>> synchronous_commit=on and then with synchronous_commit=off.
>> Probably the workloads we should have started with anyway, I
>> guess.
>>
>
> Here, synchronous_commit = off case could be interesting. Do you see
> any problem with first trying a workload where Dilip is seeing
> benefit? I am not suggesting we should not do any other testing, but
> just first lets try to reproduce the performance gain which is seen
> in Dilip's tests.
>

I plan to run Dilip's workload once the current benchmarks complete.

regard

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-19 04:20:49
Message-ID: CAFiTN-u-XEzhd=hNGW586fmQwdTy6Qy6_SXe09tNB=gBcVzZ_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 19, 2016 at 2:41 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> But now that I look at the first post, I see it apparently used a plain
> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
> the workload I'm running right now (results sometime tomorrow).

Good option; we can test plain TPC-B as well.

I have some more results - I have got the results for "update with no
savepoints".

Below is my script:

\set aid random (1,30000000)
\set tid random (1,3000)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
END;
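
(A script like this would be driven with something along these lines; the
file name, client count, duration and database name are placeholders:)

pgbench -M prepared -f update_no_sp.sql -j 8 -c 128 -T 600 postgres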

Results (median of three 10-minute runs):

Clients Head GroupLock
16 21452 21589
32 42422 42688
64 42460 52590 ~ 23%
128 22683 56825 ~150%
256 18748 54867

With this workload I observed that the gain is bigger than with my
previous workload (select for update with 2 savepoints).

Just to confirm whether the gain we are seeing is because of the removal
of CLog lock contention or something else, I ran 128 clients with perf
for 5 minutes; below are my results.

I can see that after applying the group lock patch, LWLockAcquire drops
from 28% to just 4%, and almost all of that is due to the CLog lock.
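
(A perf invocation along these lines should produce a comparable
call-graph profile; the exact options here are illustrative, not
necessarily the ones used for the numbers below:)

# sample all CPUs with call graphs for ~5 minutes while the benchmark runs
perf record -a -g -o perf.data -- sleep 300
# then summarize the recorded samples
perf report -g -i perf.data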

On Head:
------------
- 28.45% 0.24% postgres postgres [.] LWLockAcquire
- LWLockAcquire
+ 53.49% TransactionIdSetPageStatus
+ 40.83% SimpleLruReadPage_ReadOnly
+ 1.16% BufferAlloc
+ 0.92% GetSnapshotData
+ 0.89% GetNewTransactionId
+ 0.72% LockBuffer
+ 0.70% ProcArrayGroupClearXid

After Group Lock Patch:
-------------------------------
- 4.47% 0.26% postgres postgres [.] LWLockAcquire
- LWLockAcquire
+ 27.11% GetSnapshotData
+ 21.57% GetNewTransactionId
+ 11.44% SimpleLruReadPage_ReadOnly
+ 10.13% BufferAlloc
+ 7.24% ProcArrayGroupClearXid
+ 4.74% LockBuffer
+ 4.08% LockAcquireExtended
+ 2.91% TransactionGroupUpdateXidStatus
+ 2.71% LockReleaseAll
+ 1.90% WALInsertLockAcquire
+ 0.94% LockRelease
+ 0.91% VirtualXactLockTableInsert
+ 0.90% VirtualXactLockTableCleanup
+ 0.72% MultiXactIdSetOldestMember
+ 0.66% LockRefindAndRelease

Next I will test "update with 2 savepoints" and "select for update with
no savepoints".
I will also test the granular lock and atomic lock patches in the next run.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-19 19:10:58
Message-ID: CA+Tgmoan8OxOfxCqiYr6T_Rc9qPFPBxWukBzi1xbGez1UpXx3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Sep 18, 2016 at 5:11 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> IMHO in the ideal case the first message in this thread would provide a test
> case, demonstrating the effect of the patch. Then we wouldn't have the issue
> of looking for a good workload two years later.
>
> But now that I look at the first post, I see it apparently used a plain
> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
> the workload I'm running right now (results sometime tomorrow).
>
> That workload clearly uses no savepoints at all, so I'm wondering why you
> suggested to use several of them - I know you said that it's to show
> differences between the approaches, but why should that matter to any of the
> patches (and if it matters, why I got almost no differences in the
> benchmarks)?
>
> Pardon my ignorance, CLOG is not my area of expertise ...

It's possible that the effect of this patch depends on the number of
sockets. EDB test machine cthulhu has 8 sockets, and power2 has 4
sockets. I assume Dilip's tests were run on one of those two,
although he doesn't seem to have mentioned which one. Your system is
probably 2 or 4 sockets, which might make a difference. Results might
also depend on CPU architecture; power2 is, unsurprisingly, a POWER
system, whereas I assume you are testing x86. Maybe somebody who has
access should test on hydra.pg.osuosl.org, which is a community POWER
resource. (Send me a private email if you are a known community
member who wants access for benchmarking purposes.)

Personally, I find the results so far posted on this thread thoroughly
unimpressive. I acknowledge that Dilip's results appear to show that
in a best-case scenario these patches produce a rather large gain.
However, that gain seems to happen in a completely contrived scenario:
astronomical client counts, unlogged tables, and a test script that
maximizes pressure on CLogControlLock. If you have to work that hard
to find a big win, and tests under more reasonable conditions show no
benefit, it's not clear to me that it's really worth the time we're
all spending benchmarking and reviewing this, or the risk of bugs, or
the damage to the SLRU abstraction layer. I think there's a very good
chance that we're better off moving on to projects that have a better
chance of helping in the real world.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-19 19:12:47
Message-ID: 20160919191247.srwhnbpryhsbi3me@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-09-19 15:10:58 -0400, Robert Haas wrote:
> Personally, I find the results so far posted on this thread thoroughly
> unimpressive. I acknowledge that Dilip's results appear to show that
> in a best-case scenario these patches produce a rather large gain.
> However, that gain seems to happen in a completely contrived scenario:
> astronomical client counts, unlogged tables, and a test script that
> maximizes pressure on CLogControlLock. If you have to work that hard
> to find a big win, and tests under more reasonable conditions show no
> benefit, it's not clear to me that it's really worth the time we're
> all spending benchmarking and reviewing this, or the risk of bugs, or
> the damage to the SLRU abstraction layer. I think there's a very good
> chance that we're better off moving on to projects that have a better
> chance of helping in the real world.

+1


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-20 03:07:15
Message-ID: CAA4eK1LWCbd8JZ3FV7KDZ-igf928Jq01Au_n7=S1=9E6vgeVSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 20, 2016 at 12:40 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Sep 18, 2016 at 5:11 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> IMHO in the ideal case the first message in this thread would provide a test
>> case, demonstrating the effect of the patch. Then we wouldn't have the issue
>> of looking for a good workload two years later.
>>
>> But now that I look at the first post, I see it apparently used a plain
>> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
>> the workload I'm running right now (results sometime tomorrow).
>>
>> That workload clearly uses no savepoints at all, so I'm wondering why you
>> suggested to use several of them - I know you said that it's to show
>> differences between the approaches, but why should that matter to any of the
>> patches (and if it matters, why I got almost no differences in the
>> benchmarks)?
>>
>> Pardon my ignorance, CLOG is not my area of expertise ...
>
> It's possible that the effect of this patch depends on the number of
> sockets. EDB test machine cthulhu as 8 sockets, and power2 has 4
> sockets. I assume Dilip's tests were run on one of those two,
>

I think it is the former (the 8-socket machine).

> although he doesn't seem to have mentioned which one. Your system is
> probably 2 or 4 sockets, which might make a difference. Results might
> also depend on CPU architecture; power2 is, unsurprisingly, a POWER
> system, whereas I assume you are testing x86. Maybe somebody who has
> access should test on hydra.pg.osuosl.org, which is a community POWER
> resource. (Send me a private email if you are a known community
> member who wants access for benchmarking purposes.)
>
> Personally, I find the results so far posted on this thread thoroughly
> unimpressive. I acknowledge that Dilip's results appear to show that
> in a best-case scenario these patches produce a rather large gain.
> However, that gain seems to happen in a completely contrived scenario:
> astronomical client counts, unlogged tables, and a test script that
> maximizes pressure on CLogControlLock.
>

You are right that the scenario is somewhat contrived, but I think he
hasn't posted the results for simple-update or tpc-b kind of scenarios
for pgbench, so we can't conclude that those won't show a benefit. I
think we can see benefits with synchronous_commit=off as well, though
maybe not as large as with unlogged tables. The other thing to keep in
mind is that reducing contention on one lock (in this case
CLOGControlLock) also gives benefits when we reduce contention on
other locks (like ProcArrayLock, WALWriteLock, ..). Last time we
verified this effect with Andres's patch (cache the snapshot), which
reduces the remaining contention on ProcArrayLock. We saw that
individually that patch gives some benefit, but by also removing the
contention on CLOGControlLock with the patches (increasing the clog
buffers and the grouping stuff; each one helps) discussed in this
thread, it gives a much bigger benefit.

Your point about the high client count is valid, and I am sure it won't
give a noticeable benefit at lower client counts, as the CLOGControlLock
contention starts to matter only at high client counts. I am not sure it
is a good idea to reject a patch which helps stabilise performance
(helps avoid falling off the cliff) once the number of processes exceeds
the number of cores (or hardware threads).

> If you have to work that hard
> to find a big win, and tests under more reasonable conditions show no
> benefit, it's not clear to me that it's really worth the time we're
> all spending benchmarking and reviewing this, or the risk of bugs, or
> the damage to the SLRU abstraction layer.

I agree with you that unless it shows a benefit in somewhat more usual
scenarios, we should not accept it. So shouldn't we wait for the results
of other workloads like simple-update or tpc-b on bigger machines
before reaching a conclusion?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-20 03:45:07
Message-ID: CAFiTN-v2mO31VA7OSQ5-kp3e408diE0Y0=i_tmjcW54Bgu5uPQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 20, 2016 at 8:37 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I think it is former (8 socket machine).

I confirm this is the 8-socket machine (cthulhu).

> You point related to high-client count is valid and I am sure it won't
> give noticeable benefit at lower client-count as the the
> CLOGControlLock contention starts impacting only at high-client count.
> I am not sure if it is good idea to reject a patch which helps in
> stabilising the performance (helps in falling off the cliff) when the
> processes increases the number of cores (or hardware threads)
>
>> If you have to work that hard
>> to find a big win, and tests under more reasonable conditions show no
>> benefit, it's not clear to me that it's really worth the time we're
>> all spending benchmarking and reviewing this, or the risk of bugs, or
>> the damage to the SLRU abstraction layer.
>
> I agree with you unless it shows benefit on somewhat more usual
> scenario's, we should not accept it. So shouldn't we wait for results
> of other workloads like simple-update or tpc-b on bigger machines
> before reaching to conclusion?

+1

My tests are underway; I will post the results soon.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-20 22:18:55
Message-ID: a87bfbfb-6511-b559-bab6-5966b7aabb8e@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 09/19/2016 09:10 PM, Robert Haas wrote:
>
> It's possible that the effect of this patch depends on the number of
> sockets. EDB test machine cthulhu as 8 sockets, and power2 has 4
> sockets. I assume Dilip's tests were run on one of those two,
> although he doesn't seem to have mentioned which one. Your system is
> probably 2 or 4 sockets, which might make a difference. Results
> might also depend on CPU architecture; power2 is, unsurprisingly, a
> POWER system, whereas I assume you are testing x86. Maybe somebody
> who has access should test on hydra.pg.osuosl.org, which is a
> community POWER resource. (Send me a private email if you are a known
> community member who wants access for benchmarking purposes.)
>

Yes, I'm using x86 machines:

1) large but slightly old
- 4 sockets, e5-4620 (so a bit old CPU, 32 cores in total)
- kernel 3.2.80

2) smaller but fresh
- 2 sockets, e5-2620 v4 (newest type of Xeons, 16 cores in total)
- kernel 4.8.0

> Personally, I find the results so far posted on this thread
> thoroughly unimpressive. I acknowledge that Dilip's results appear
> to show that in a best-case scenario these patches produce a rather
> large gain. However, that gain seems to happen in a completely
> contrived scenario: astronomical client counts, unlogged tables, and
> a test script that maximizes pressure on CLogControlLock. If you
> have to work that hard to find a big win, and tests under more
> reasonable conditions show no benefit, it's not clear to me that it's
> really worth the time we're all spending benchmarking and reviewing
> this, or the risk of bugs, or the damage to the SLRU abstraction
> layer. I think there's a very good chance that we're better off
> moving on to projects that have a better chance of helping in the
> real world.

I'm posting results from two types of workloads - traditional r/w
pgbench and Dilip's transaction. With synchronous_commit on/off.

Full results (including script driving the benchmark) are available
here, if needed:

https://bitbucket.org/tvondra/group-clog-benchmark/src

It'd be good if someone could try reproduce this on a comparable
machine, to rule out my stupidity.

2 x e5-2620 v4 (16 cores, 32 with HT)
=====================================

On the "smaller" machine the results look like this - I have only tested
up to 64 clients, as higher values seem rather uninteresting on a
machine with only 16 physical cores.

These are averages of 5 runs, where the min/max for each group are
within ~5% in most cases (see the "spread" sheet). The "e5-2620" sheet
also shows the numbers as % compared to master.

dilip / sync=off 1 4 8 16 32 64
----------------------------------------------------------------------
master 4756 17672 35542 57303 74596 82138
granular-locking 4745 17728 35078 56105 72983 77858
no-content-lock 4646 17650 34887 55794 73273 79000
group-update 4582 17757 35383 56974 74387 81794

dilip / sync=on 1 4 8 16 32 64
----------------------------------------------------------------------
master 4819 17583 35636 57437 74620 82036
granular-locking 4568 17816 35122 56168 73192 78462
no-content-lock 4540 17662 34747 55560 73508 79320
group-update 4495 17612 35474 57095 74409 81874

pgbench / sync=off 1 4 8 16 32 64
----------------------------------------------------------------------
master 3791 14368 27806 43369 54472 62956
granular-locking 3822 14462 27597 43173 56391 64669
no-content-lock 3725 14212 27471 43041 55431 63589
group-update 3895 14453 27574 43405 56783 62406

pgbench / sync=on 1 4 8 16 32 64
----------------------------------------------------------------------
master 3907 14289 27802 43717 56902 62916
granular-locking 3770 14503 27636 44107 55205 63903
no-content-lock 3772 14111 27388 43054 56424 64386
group-update 3844 14334 27452 43621 55896 62498

There's pretty much no improvement at all - most of the results are
within 1-2% of master, in both directions. Hardly a win.

Actually, with 1 client there seems to be ~5% regression, but it might
also be noise and verifying it would require further testing.

4 x e5-4620 v1 (32 cores, 64 with HT)
=====================================

These are averages of 10 runs, and there are a few strange things here.

Firstly, for Dilip's workload the results get much (much) worse between
64 and 128 clients, for some reason. I suspect this might be due to
the fairly old kernel (3.2.80), so I'll reboot the machine with a 4.5.x kernel
and try again.

Secondly, the min/max differences get much larger than the ~5% on the
smaller machine - with 128 clients, the (max-min)/average is often
>100%. See the "spread" or "spread2" sheets in the attached file.

But for some reason this only affects Dilip's workload, and apparently
the patches make it measurably worse (master is ~75%, patches ~120%). If
you look at the tps for individual runs, there are usually 9 runs with
almost the same performance, and then one or two much faster runs.
Again, the pgbench workload seems not to have this issue.

I have no idea what's causing this - it might be related to the kernel,
but I'm not sure why it should affect the patches differently. Let's see
how the new kernel affects this.

dilip / sync=off 16 32 64 128 192
--------------------------------------------------------------
master 26198 37901 37211 14441 8315
granular-locking 25829 38395 40626 14299 8160
no-content-lock 25872 38994 41053 14058 8169
group-update 26503 38911 42993 19474 8325

dilip / sync=on 16 32 64 128 192
--------------------------------------------------------------
master 26138 37790 38492 13653 8337
granular-locking 25661 38586 40692 14535 8311
no-content-lock 25653 39059 41169 14370 8373
group-update 26472 39170 42126 18923 8366

pgbench / sync=off 16 32 64 128 192
--------------------------------------------------------------
master 23001 35762 41202 31789 8005
granular-locking 23218 36130 42535 45850 8701
no-content-lock 23322 36553 42772 47394 8204
group-update 23129 36177 41788 46419 8163

pgbench / sync=on 16 32 64 128 192
--------------------------------------------------------------
master 22904 36077 41295 35574 8297
granular-locking 23323 36254 42446 43909 8959
no-content-lock 23304 36670 42606 48440 8813
group-update 23127 36696 41859 46693 8345

So there is some improvement due to the patches for 128 clients (+30% in
some cases), but it's rather useless as 64 clients either give you
comparable performance (pgbench workload) or a way better one (Dilip's
workload).

Also, pretty much no difference between synchronous_commit on/off,
probably thanks to running on unlogged tables.

I'll repeat the test on the 4-socket machine with a newer kernel, but
that's probably the last benchmark I'll do for this patch for now. I
agree with Robert that the cases the patch is supposed to improve are a
bit contrived because of the very high client counts.

IMHO to continue with the patch (or even with testing it), we really
need a credible / practical example of a real-world workload that
benefits from the patches. The closest we have to that is Amit's
mention that someone hit the commit lock when running HammerDB, but we
have absolutely no idea what parameters they were using, except that
they were running with synchronous_commit=off. Pgbench shows no such
improvements (at least for me), at least with reasonable parameters.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
results.ods application/vnd.oasis.opendocument.spreadsheet 93.2 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-21 03:17:52
Message-ID: CAFiTN-v5hm1EO4cLXYmpppYdNQk+n4N-O1m++3U9f0Ga1gBzRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 20, 2016 at 9:15 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> +1
>
> My test are under run, I will post it soon..

I have some more results now....

8-socket machine
10-minute runs (median of 3 runs)
synchronous_commit = off
scale factor = 300
shared_buffers = 8GB

test1: Simple update (pgbench)

Clients Head GroupLock
32 45702 45402
64 46974 51627
128 35056 55362

test2: TPC-B (pgbench)

Clients Head GroupLock
32 27969 27765
64 33140 34786
128 21555 38848

Summary:
--------------
At 32 clients there is no gain; I think at this workload the CLog lock is not a problem.
At 64 clients we can see a ~10% gain with simple update and ~5% with TPC-B.
At 128 clients we can see a > 50% gain.

Currently I have tested with synchronous_commit=off; later I can try
with it on. I can also test at 80 clients - I think we will see some
significant gain at that client count as well, but I haven't tested it
yet.
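
(For clarity, the two tests above map to pgbench invocations roughly like
these; the client count and duration are placeholders:)

# test1: built-in simple-update script (-N skips the branches/tellers updates)
pgbench -M prepared -N -j 8 -c 128 -T 600
# test2: default tpc-b-like script
pgbench -M prepared -j 8 -c 128 -T 600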

Given the above results, what do we think? Should we continue our testing?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-21 06:04:48
Message-ID: CAA4eK1K7Jh1GxQeS+9-ZsadZpz+DfiCXVTjqk+X00aCV6gyP0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 21, 2016 at 3:48 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> I have no idea what's causing this - it might be related to the kernel, but
> I'm not sure why it should affect the patches differently. Let's see how the
> new kernel affects this.
>
> dilip / sync=off 16 32 64 128 192
> --------------------------------------------------------------
> master 26198 37901 37211 14441 8315
> granular-locking 25829 38395 40626 14299 8160
> no-content-lock 25872 38994 41053 14058 8169
> group-update 26503 38911 42993 19474 8325
>
> dilip / sync=on 16 32 64 128 192
> --------------------------------------------------------------
> master 26138 37790 38492 13653 8337
> granular-locking 25661 38586 40692 14535 8311
> no-content-lock 25653 39059 41169 14370 8373
> group-update 26472 39170 42126 18923 8366
>
> pgbench / sync=off 16 32 64 128 192
> --------------------------------------------------------------
> master 23001 35762 41202 31789 8005
> granular-locking 23218 36130 42535 45850 8701
> no-content-lock 23322 36553 42772 47394 8204
> group-update 23129 36177 41788 46419 8163
>
> pgbench / sync=on 16 32 64 128 192
> --------------------------------------------------------------
> master 22904 36077 41295 35574 8297
> granular-locking 23323 36254 42446 43909 8959
> no-content-lock 23304 36670 42606 48440 8813
> group-update 23127 36696 41859 46693 8345
>
>
> So there is some improvement due to the patches for 128 clients (+30% in
> some cases), but it's rather useless as 64 clients either give you
> comparable performance (pgbench workload) or way better one (Dilip's
> workload).
>

I think these results are somewhat similar to what Dilip has reported.
Here, if you look at both cases, the performance improvement is seen
when the client count is greater than the number of cores (including
HT). As far as I know, the machine on which Dilip is running the tests
also has 64 hardware threads. The point here is that the CLOGControlLock
contention is noticeable only at that client count, so it is not the
fault of the patch that it does not improve things at lower client
counts. I guess that we will see a performance improvement between 64
and 128 clients as well.

> Also, pretty much no difference between synchronous_commit on/off, probably
> thanks to running on unlogged tables.
>

Yeah.

> I'll repeat the test on the 4-socket machine with a newer kernel, but that's
> probably the last benchmark I'll do for this patch for now.
>

Okay, but I think it is better to see the results between 64 and 128
clients, and maybe at greater than 128 clients, because it is clear that
the patch won't improve performance below that.

> I agree with
> Robert that the cases the patch is supposed to improve are a bit contrived
> because of the very high client counts.
>

No issues; I have already explained in yesterday's mail and this one why
I think it is important to reduce the remaining CLOGControlLock
contention. If none of you is convinced, then I think we have no choice
but to drop this patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-22 23:44:30
Message-ID: 26b69fb2-fa4d-530c-7783-1cb9d952c4e5@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/21/2016 08:04 AM, Amit Kapila wrote:
> On Wed, Sep 21, 2016 at 3:48 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
...
>
>> I'll repeat the test on the 4-socket machine with a newer kernel,
>> but that's probably the last benchmark I'll do for this patch for
>> now.
>>

Attached are results from benchmarks running on kernel 4.5 (instead of
the old 3.2.80). I've only done synchronous_commit=on, and I've added a
few client counts (mostly at the lower end). The data has been pushed
to the git repository, see

git push --set-upstream origin master

The summary looks like this (showing both the 3.2.80 and 4.5.5 results):

1) Dilip's workload

3.2.80 16 32 64 128 192
-------------------------------------------------------------------
master 26138 37790 38492 13653 8337
granular-locking 25661 38586 40692 14535 8311
no-content-lock 25653 39059 41169 14370 8373
group-update 26472 39170 42126 18923 8366

4.5.5 1 8 16 32 64 128 192
-------------------------------------------------------------------
granular-locking 4050 23048 27969 32076 34874 36555 37710
no-content-lock 4025 23166 28430 33032 35214 37576 39191
group-update 4002 23037 28008 32492 35161 36836 38850
master 3968 22883 27437 32217 34823 36668 38073

2) pgbench

3.2.80 16 32 64 128 192
-------------------------------------------------------------------
master 22904 36077 41295 35574 8297
granular-locking 23323 36254 42446 43909 8959
no-content-lock 23304 36670 42606 48440 8813
group-update 23127 36696 41859 46693 8345

4.5.5 1 8 16 32 64 128 192
-------------------------------------------------------------------
granular-locking 3116 19235 27388 29150 31905 34105 36359
no-content-lock 3206 19071 27492 29178 32009 34140 36321
group-update 3195 19104 26888 29236 32140 33953 35901
master 3136 18650 26249 28731 31515 33328 35243

The 4.5 kernel clearly changed the results significantly:

(a) Compared to the results from 3.2.80 kernel, some numbers improved,
some got worse. For example, on 3.2.80 pgbench did ~23k tps with 16
clients, on 4.5.5 it does 27k tps. With 64 clients the performance
dropped from 41k tps to ~34k (on master).

(b) The drop above 64 clients is gone - on 3.2.80 it dropped very
quickly to only ~8k with 192 clients. On 4.5 the tps actually continues
to increase, and we get ~35k with 192 clients.

(c) Although it's not visible in the results, 4.5.5 almost perfectly
eliminated the fluctuations in the results. For example, where 3.2.80
produced these results (10 runs with the same parameters):

12118 11610 27939 11771 18065
12152 14375 10983 13614 11077

we get this on 4.5.5

37354 37650 37371 37190 37233
38498 37166 36862 37928 38509

Notice how much more even the 4.5.5 results are, compared to 3.2.80.

(d) There's no sign of any benefit from any of the patches (it was only
helpful >= 128 clients, but that's where the tps actually dropped on
3.2.80 - apparently 4.5.5 fixes that and the benefit is gone).

It's a bit annoying that after upgrading from 3.2.80 to 4.5.5, the
performance with 32 and 64 clients dropped quite noticeably (by more
than 10%). I believe that might be a kernel regression, but perhaps it's
a price for improved scalability for higher client counts.

This of course raises the question of what kernel version is running on the
machine used by Dilip (i.e. cthulhu)? Although it's a Power machine, so
I'm not sure how much the kernel matters on it.

I'll ask someone else with access to this particular machine to repeat
the tests, as I have a nagging suspicion that I've missed something
important when compiling / running the benchmarks. I'll also retry the
benchmarks on 3.2.80 to see if I get the same numbers.

>
> Okay, but I think it is better to see the results between 64~128
> client count and may be greater than128 client counts, because it is
> clear that patch won't improve performance below that.
>

There are results for 64, 128 and 192 clients. Why should we care about
numbers in between? How likely (and useful) would it be to get
improvement with 96 clients, but no improvement for 64 or 128 clients?

>>
>> I agree with Robert that the cases the patch is supposed to
>> improve are a bit contrived because of the very high client
>> counts.
>>
>
> No issues, I have already explained why I think it is important to
> reduce the remaining CLOGControlLock contention in yesterday's and
> this mail. If none of you is convinced, then I think we have no
> choice but to drop this patch.
>

I agree it's useful to reduce lock contention in general, but
considering the last set of benchmarks shows no benefit with recent
kernel, I think we really need a better understanding of what's going
on, what workloads / systems it's supposed to improve, etc.

I don't dare to suggest rejecting the patch, but I don't see how we
could commit any of the patches at this point. So perhaps "returned with
feedback" and resubmitting in the next CF (along with analysis of
improved workloads) would be appropriate.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
results.ods application/vnd.oasis.opendocument.spreadsheet 58.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 01:20:23
Message-ID: CA+Tgmoad9PaEQXJN=ZYJCgVj7Lob8pJhgFr-QYHot_yxu7jBng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I don't dare to suggest rejecting the patch, but I don't see how we could
> commit any of the patches at this point. So perhaps "returned with feedback"
> and resubmitting in the next CF (along with analysis of improved workloads)
> would be appropriate.

I think it would be useful to have some kind of theoretical analysis
of how much time we're spending waiting for various locks. So, for
example, suppose we do one run of these tests with various client counts
- say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select
wait_event from pg_stat_activity" once per second throughout the test.
Then we see how many times we get each wait event, including NULL (no
wait event). Now, from this, we can compute the approximate
percentage of time we're spending waiting on CLogControlLock and every
other lock, too, as well as the percentage of time we're not waiting
for any lock. That, it seems to me, would give us a pretty clear idea
what the maximum benefit we could hope for from reducing contention on
any given lock might be.
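
Something as simple as the sketch below, run alongside each test, would
do (the run length and the way the samples are aggregated are just one
way to slice it):

# sample wait events once per second during a run, then count them
for i in $(seq 1 300); do
    psql -At -c "SELECT coalesce(wait_event, 'NULL') FROM pg_stat_activity WHERE pid <> pg_backend_pid()" >> wait_events.log
    sleep 1
done
sort wait_events.log | uniq -c | sort -rn

Dividing each count by the total number of samples then gives the
approximate percentage of time spent waiting on each lock (or not
waiting at all).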

Now, we could also try that experiment with various patches. If we
can show that some patch reduces CLogControlLock contention without
increasing TPS, they might still be worth committing for that reason.
Otherwise, you could have a chicken-and-egg problem. If reducing
contention on A doesn't help TPS because of lock B and vice versa,
then does that mean we can never commit any patch to reduce contention
on either lock? Hopefully not. But I agree with you that there's
certainly not enough evidence to commit any of these patches now. To
my mind, these numbers aren't convincing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 01:47:19
Message-ID: 15e6e88e-3f8c-ce4b-0782-c279511815ea@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/23/2016 03:20 AM, Robert Haas wrote:
> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I don't dare to suggest rejecting the patch, but I don't see how
>> we could commit any of the patches at this point. So perhaps
>> "returned with feedback" and resubmitting in the next CF (along
>> with analysis of improved workloads) would be appropriate.
>
> I think it would be useful to have some kind of theoretical analysis
> of how much time we're spending waiting for various locks. So, for
> example, suppose we one run of these tests with various client
> counts - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run
> "select wait_event from pg_stat_activity" once per second throughout
> the test. Then we see how many times we get each wait event,
> including NULL (no wait event). Now, from this, we can compute the
> approximate percentage of time we're spending waiting on
> CLogControlLock and every other lock, too, as well as the percentage
> of time we're not waiting for lock. That, it seems to me, would give
> us a pretty clear idea what the maximum benefit we could hope for
> from reducing contention on any given lock might be.
>

Yeah, I think that might be a good way to analyze the locks in general,
not just for these patches. A 24h run with per-second samples should
give us about 86400 samples (well, multiplied by the number of clients),
which is probably good enough.

We also have LWLOCK_STATS, which might be interesting too, but I'm not
sure how much it affects the behavior (and AFAIK it also only dumps the
data to the server log).
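
(LWLOCK_STATS is a compile-time switch, so using it means a rebuild - a
rough sketch, assuming a from-source build; the install prefix is just a
placeholder. The per-lock counters are then written to the server log
when each backend exits:)

./configure --prefix=$HOME/pg-lwlock-stats CPPFLAGS="-DLWLOCK_STATS"
make -j8 && make install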

>
> Now, we could also try that experiment with various patches. If we
> can show that some patch reduces CLogControlLock contention without
> increasing TPS, they might still be worth committing for that
> reason. Otherwise, you could have a chicken-and-egg problem. If
> reducing contention on A doesn't help TPS because of lock B and
> visca-versa, then does that mean we can never commit any patch to
> reduce contention on either lock? Hopefully not. But I agree with you
> that there's certainly not enough evidence to commit any of these
> patches now. To my mind, these numbers aren't convincing.
>

Yes, the chicken-and-egg problem is why the tests were done with
unlogged tables (to work around the WAL lock).
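
(For the record, "unlogged tables" here simply means initializing pgbench
with its unlogged-tables option, e.g.:)

pgbench -i -s 300 --unlogged-tables    # scale factor is a placeholder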

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 02:59:49
Message-ID: CAA4eK1Kt2nbzvL9ecXHd8Cb6M4sQHx-q12bYjdaoaH4V+QXs4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 23, 2016 at 7:17 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/23/2016 03:20 AM, Robert Haas wrote:
>>
>> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>> I don't dare to suggest rejecting the patch, but I don't see how
>>> we could commit any of the patches at this point. So perhaps
>>> "returned with feedback" and resubmitting in the next CF (along
>>> with analysis of improved workloads) would be appropriate.
>>
>>
>> I think it would be useful to have some kind of theoretical analysis
>> of how much time we're spending waiting for various locks. So, for
>> example, suppose we one run of these tests with various client
>> counts - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run
>> "select wait_event from pg_stat_activity" once per second throughout
>> the test. Then we see how many times we get each wait event,
>> including NULL (no wait event). Now, from this, we can compute the
>> approximate percentage of time we're spending waiting on
>> CLogControlLock and every other lock, too, as well as the percentage
>> of time we're not waiting for lock. That, it seems to me, would give
>> us a pretty clear idea what the maximum benefit we could hope for
>> from reducing contention on any given lock might be.
>>
>
> Yeah, I think that might be a good way to analyze the locks in general, not
> just got these patches. 24h run with per-second samples should give us about
> 86400 samples (well, multiplied by number of clients), which is probably
> good enough.
>
> We also have LWLOCK_STATS, that might be interesting too, but I'm not sure
> how much it affects the behavior (and AFAIK it also only dumps the data to
> the server log).
>

Right, I think LWLOCK_STATS gives us the count of how many times we
have blocked due to a particular lock, like below, where *blk* gives
that number.

PID 164692 lwlock main 11: shacq 2734189 exacq 146304 blk 73808
spindelay 73 dequeue self 57241
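
(If we go this route, summing the blk column per lock across all backends
is simple enough; a rough sketch, assuming the server log keeps the line
format shown above and is captured in postmaster.log:)

# total 'blk' per lwlock, summed over all backend-exit lines in the log
grep ' lwlock ' postmaster.log |
awk '{ for (i = 1; i < NF; i++) {
         if ($i == "lwlock") name = $(i+1) " " $(i+2);
         if ($i == "blk")    sum[name] += $(i+1);
       } }
     END { for (lock in sum) print sum[lock], lock }' |
sort -rn | head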

I think doing some experiments with both techniques can help us take a
call on these patches.

Do we want these experiments on different kernel versions, are we okay
with the current version on cthulhu (3.10), or do we want to consider
only the results with the latest kernel?

>>
>>
>> Now, we could also try that experiment with various patches. If we
>> can show that some patch reduces CLogControlLock contention without
>> increasing TPS, they might still be worth committing for that
>> reason. Otherwise, you could have a chicken-and-egg problem. If
>> reducing contention on A doesn't help TPS because of lock B and
>> visca-versa, then does that mean we can never commit any patch to
>> reduce contention on either lock? Hopefully not. But I agree with you
>> that there's certainly not enough evidence to commit any of these
>> patches now. To my mind, these numbers aren't convincing.
>>
>
> Yes, the chicken-and-egg problem is why the tests were done with unlogged
> tables (to work around the WAL lock).
>

Yeah, but I suspect there was still an impact due to ProcArrayLock.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 03:10:37
Message-ID: CAA4eK1Kshqxa1birZxocNEWJROaiasUNycL+43b8JTTq+O2Vog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>
>
> (c) Although it's not visible in the results, 4.5.5 almost perfectly
> eliminated the fluctuations in the results. For example when 3.2.80 produced
> this results (10 runs with the same parameters):
>
> 12118 11610 27939 11771 18065
> 12152 14375 10983 13614 11077
>
> we get this on 4.5.5
>
> 37354 37650 37371 37190 37233
> 38498 37166 36862 37928 38509
>
> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>

How long was each run? Generally, I do half-hour runs to get stable results.

> (d) There's no sign of any benefit from any of the patches (it was only
> helpful >= 128 clients, but that's where the tps actually dropped on 3.2.80
> - apparently 4.5.5 fixes that and the benefit is gone).
>
> It's a bit annoying that after upgrading from 3.2.80 to 4.5.5, the
> performance with 32 and 64 clients dropped quite noticeably (by more than
> 10%). I believe that might be a kernel regression, but perhaps it's a price
> for improved scalability for higher client counts.
>
> It of course begs the question what kernel version is running on the machine
> used by Dilip (i.e. cthulhu)? Although it's a Power machine, so I'm not sure
> how much the kernel matters on it.
>

cthulhu is an x86 machine and the kernel version is 3.10. Seeing the
above results, I think the kernel version does matter, but does that
mean we should ignore the benefits we are seeing on a somewhat older
kernel version? I think the right answer here is to do some experiments
which can show the actual contention, as suggested by Robert and you.

> I'll ask someone else with access to this particular machine to repeat the
> tests, as I have a nagging suspicion that I've missed something important
> when compiling / running the benchmarks. I'll also retry the benchmarks on
> 3.2.80 to see if I get the same numbers.
>
>>
>> Okay, but I think it is better to see the results between 64~128
>> client count and may be greater than128 client counts, because it is
>> clear that patch won't improve performance below that.
>>
>
> There are results for 64, 128 and 192 clients. Why should we care about
> numbers in between? How likely (and useful) would it be to get improvement
> with 96 clients, but no improvement for 64 or 128 clients?
>

The only point was to see where we start seeing the improvement; saying
that the TPS has improved from >=72 clients is different from saying
that it has improved from >=128.

>> No issues, I have already explained why I think it is important to
>> reduce the remaining CLOGControlLock contention in yesterday's and
>> this mail. If none of you is convinced, then I think we have no
>> choice but to drop this patch.
>>
>
> I agree it's useful to reduce lock contention in general, but considering
> the last set of benchmarks shows no benefit with recent kernel, I think we
> really need a better understanding of what's going on, what workloads /
> systems it's supposed to improve, etc.
>
> I don't dare to suggest rejecting the patch, but I don't see how we could
> commit any of the patches at this point. So perhaps "returned with feedback"
> and resubmitting in the next CF (along with analysis of improved workloads)
> would be appropriate.
>

Agreed with your conclusion; I have changed the status of the patch in the CF accordingly.

Many thanks for doing the tests.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 12:35:48
Message-ID: cac99b14-f5d7-8fa4-b327-b383c2f5069e@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/23/2016 05:10 AM, Amit Kapila wrote:
> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>>
>>
>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>> eliminated the fluctuations in the results. For example when 3.2.80 produced
>> this results (10 runs with the same parameters):
>>
>> 12118 11610 27939 11771 18065
>> 12152 14375 10983 13614 11077
>>
>> we get this on 4.5.5
>>
>> 37354 37650 37371 37190 37233
>> 38498 37166 36862 37928 38509
>>
>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>
>
> how long each run was? Generally, I do half-hour run to get stable results.
>

10 x 5-minute runs for each client count. The full shell script driving
the benchmark is here: http://bit.ly/2doY6ID and in short it looks like
this:

for r in `seq 1 $runs`; do
    for c in 1 8 16 32 64 128 192; do
        psql -c checkpoint
        pgbench -j 8 -c $c ...
    done
done

>>
>> It of course begs the question what kernel version is running on
>> the machine used by Dilip (i.e. cthulhu)? Although it's a Power
>> machine, so I'm not sure how much the kernel matters on it.
>>
>
> cthulhu is a x86 m/c and the kernel version is 3.10. Seeing, the
> above results I think kernel version do matter, but does that mean
> we ignore the benefits we are seeing on somewhat older kernel
> version. I think right answer here is to do some experiments which
> can show the actual contention as suggested by Robert and you.
>

Yes, I think it'd be useful to test a new kernel version. Perhaps try
4.5.x so that we can compare it to my results. Maybe even try using my
shell script.

>>
>> There are results for 64, 128 and 192 clients. Why should we care
>> about numbers in between? How likely (and useful) would it be to
>> get improvement with 96 clients, but no improvement for 64 or 128
>> clients?
>>
>
> The only point to take was to see from where we have started seeing
> improvement, saying that the TPS has improved from >=72 client count
> is different from saying that it has improved from >=128.
>

I find the exact client count rather uninteresting - it's going to be
quite dependent on hardware, workload etc.

>>
>> I don't dare to suggest rejecting the patch, but I don't see how
>> we could commit any of the patches at this point. So perhaps
>> "returned with feedback" and resubmitting in the next CF (along
>> with analysis of improved workloads) would be appropriate.
>>
>
> Agreed with your conclusion and changed the status of patch in CF
> accordingly.
>

+1

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 12:46:45
Message-ID: 781ac43c-da69-f5c7-a828-7b995691d4cc@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>...
> The 4.5 kernel clearly changed the results significantly:
>
...
>
> (c) Although it's not visible in the results, 4.5.5 almost perfectly
> eliminated the fluctuations in the results. For example when 3.2.80
> produced this results (10 runs with the same parameters):
>
> 12118 11610 27939 11771 18065
> 12152 14375 10983 13614 11077
>
> we get this on 4.5.5
>
> 37354 37650 37371 37190 37233
> 38498 37166 36862 37928 38509
>
> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>

The more I think about these random spikes in pgbench performance on
3.2.80, the more I find them intriguing. Let me show you another example
(from Dilip's workload and group-update patch on 64 clients).

This is on 3.2.80:

44175 34619 51944 38384 49066
37004 47242 36296 46353 36180

and on 4.5.5 it looks like this:

34400 35559 35436 34890 34626
35233 35756 34876 35347 35486

So the 4.5.5 results are much more even, but overall clearly below
3.2.80. How does 3.2.80 manage to do ~50k tps in some of the runs?
Clearly we randomly do something right, but what is it and why doesn't
it happen on the new kernel? And how could we do it every time?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 12:59:21
Message-ID: CABOikdMVnz9HvzAaK06tWOijs+JTY7d4X36m2kqz8TrswtKBhA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On 09/23/2016 05:10 AM, Amit Kapila wrote:
>
>> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>>
>>>>
>>>>
>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>> eliminated the fluctuations in the results. For example when 3.2.80
>>> produced
>>> this results (10 runs with the same parameters):
>>>
>>> 12118 11610 27939 11771 18065
>>> 12152 14375 10983 13614 11077
>>>
>>> we get this on 4.5.5
>>>
>>> 37354 37650 37371 37190 37233
>>> 38498 37166 36862 37928 38509
>>>
>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>
>>>
>> how long each run was? Generally, I do half-hour run to get stable
>> results.
>>
>>
> 10 x 5-minute runs for each client count. The full shell script driving
> the benchmark is here: http://bit.ly/2doY6ID and in short it looks like
> this:
>
> for r in `seq 1 $runs`; do
> for c in 1 8 16 32 64 128 192; do
> psql -c checkpoint
> pgbench -j 8 -c $c ...
> done
> done

I see a couple of problems with the tests:

1. You're running regular pgbench, which also updates the small tables. At
scale 300 and higher client counts, there is going to be heavy contention on
the pgbench_branches table. Why not test with pgbench -N? As far as this
patch is concerned, we are only interested in seeing contention on
ClogControlLock. In fact, how about a test which only consumes an XID, but
does not do any write activity at all? A completely artificial workload, but
good enough to tell us if and how much the patch helps in the best case. We
can probably do that with a simple txid_current() call, right? (See the
sketch after this list.)

2. Each subsequent pgbench run will bloat the tables. Now that may not be
such a big deal given that you're checkpointing between each run. But it
still makes results somewhat hard to compare. If a vacuum kicks in, that
may have some impact too. Given the scale factor you're testing, why not
just start fresh every time?
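
To illustrate the XID-only idea (just a sketch - the file name, database
name and duration are placeholders, not a prescription): a custom pgbench
script that does nothing except consume an XID via txid_current(), plus the
-N variant for the less artificial case:

  -- xid-only.sql: each transaction only consumes an XID, no table writes
  SELECT txid_current();

  # artificial XID-only workload (reusing $c from the loop quoted above)
  pgbench -n -M prepared -j 8 -c $c -T 300 -f xid-only.sql bench
  # simple-update workload that skips pgbench_branches/pgbench_tellers
  pgbench -n -N -M prepared -j 8 -c $c -T 300 bench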

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 13:07:24
Message-ID: CAA4eK1K4HEsy819bkDxA3GxGBRsBvu9MmuGh3Q_CxUho29FG4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>
>> ...
>> The 4.5 kernel clearly changed the results significantly:
>>
> ...
>>
>>
>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>> eliminated the fluctuations in the results. For example when 3.2.80
>> produced this results (10 runs with the same parameters):
>>
>> 12118 11610 27939 11771 18065
>> 12152 14375 10983 13614 11077
>>
>> we get this on 4.5.5
>>
>> 37354 37650 37371 37190 37233
>> 38498 37166 36862 37928 38509
>>
>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>
>
> The more I think about these random spikes in pgbench performance on 3.2.80,
> the more I find them intriguing. Let me show you another example (from
> Dilip's workload and group-update patch on 64 clients).
>
> This is on 3.2.80:
>
> 44175 34619 51944 38384 49066
> 37004 47242 36296 46353 36180
>
> and on 4.5.5 it looks like this:
>
> 34400 35559 35436 34890 34626
> 35233 35756 34876 35347 35486
>
> So the 4.5.5 results are much more even, but overall clearly below 3.2.80.
> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
> randomly do something right, but what is it and why doesn't it happen on the
> new kernel? And how could we do it every time?
>

As far as I can see, you are using default values of min_wal_size,
max_wal_size and the checkpoint-related parameters. Have you changed the
default shared_buffers setting? That can have a bigger impact. Using
default values for the mentioned parameters can lead to checkpoints in
between your runs. Also, I think read-write runs should be run for 15
minutes instead of 5 to get consistent data. For Dilip's workload, where
he is using only Select ... For Update, I think it is okay, but otherwise
you need to drop and re-create the database between each run, otherwise
data bloat could impact the readings.

I think in general the impact should be the same for both kernels
because you are using the same parameters, but if you use appropriate
parameters, then you should be able to get consistent results on 3.2.80.
I have also seen variation in read-write tests, but the variation you are
showing is really a matter of concern, because it will be difficult to
rely on the final data.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 13:15:33
Message-ID: CAA4eK1J=0YeAWHWEdiPj9tkgEbnz9vbCZ5Q+-6TCrxJub5LL=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 23, 2016 at 6:29 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
>>
>> On 09/23/2016 05:10 AM, Amit Kapila wrote:
>>>
>>> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>
>>>> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>>>>
>>>>>
>>>>
>>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>>> eliminated the fluctuations in the results. For example when 3.2.80
>>>> produced
>>>> this results (10 runs with the same parameters):
>>>>
>>>> 12118 11610 27939 11771 18065
>>>> 12152 14375 10983 13614 11077
>>>>
>>>> we get this on 4.5.5
>>>>
>>>> 37354 37650 37371 37190 37233
>>>> 38498 37166 36862 37928 38509
>>>>
>>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>>
>>>
>>> how long each run was? Generally, I do half-hour run to get stable
>>> results.
>>>
>>
>> 10 x 5-minute runs for each client count. The full shell script driving
>> the benchmark is here: http://bit.ly/2doY6ID and in short it looks like
>> this:
>>
>> for r in `seq 1 $runs`; do
>> for c in 1 8 16 32 64 128 192; do
>> psql -c checkpoint
>> pgbench -j 8 -c $c ...
>> done
>> done
>
>
>
> I see couple of problems with the tests:
>
> 1. You're running regular pgbench, which also updates the small tables. At
> scale 300 and higher clients, there is going to heavy contention on the
> pgbench_branches table. Why not test with pgbench -N? As far as this patch
> is concerned, we are only interested in seeing contention on
> ClogControlLock. In fact, how about a test which only consumes an XID, but
> does not do any write activity at all? Complete artificial workload, but
> good enough to tell us if and how much the patch helps in the best case. We
> can probably do that with a simple txid_current() call, right?
>

Right, that is why Dilip used Select .. for Update in his initial tests.
I think using txid_current will generate a lot of contention on
XidGenLock, which will mask the contention around CLOGControlLock; in
fact, we have tried that.
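
For reference, such a Select .. for Update test can be as simple as the
following pgbench script (a rough sketch assuming pgbench 9.6 expression
syntax and the standard pgbench_accounts table - not necessarily the exact
script Dilip used; the file and database names are placeholders). The row
lock forces an XID to be assigned and later CLOG lookups, without the bloat
of real updates:

  \set aid random(1, 100000 * :scale)
  BEGIN;
  SELECT abalance FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
  END;

  pgbench -n -M prepared -j 8 -c 64 -T 600 -f select-update.sql bench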

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 13:20:26
Message-ID: CAA4eK1LO5=M=EWopDL694vzxwx7xuEaRVQvu0wetQ4GNf1naPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 23, 2016 at 6:50 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I don't dare to suggest rejecting the patch, but I don't see how we could
>> commit any of the patches at this point. So perhaps "returned with feedback"
>> and resubmitting in the next CF (along with analysis of improved workloads)
>> would be appropriate.
>
> I think it would be useful to have some kind of theoretical analysis
> of how much time we're spending waiting for various locks. So, for
> example, suppose we one run of these tests with various client counts
> - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select
> wait_event from pg_stat_activity" once per second throughout the test.
> Then we see how many times we get each wait event, including NULL (no
> wait event). Now, from this, we can compute the approximate
> percentage of time we're spending waiting on CLogControlLock and every
> other lock, too, as well as the percentage of time we're not waiting
> for lock. That, it seems to me, would give us a pretty clear idea
> what the maximum benefit we could hope for from reducing contention on
> any given lock might be.
>

As mentioned earlier, such an analysis makes sense; however, rereading
this thread today, I noticed that Dilip has already posted some analysis
of lock contention upthread [1]. It is clear that the patch has reduced
LWLock contention from ~28% to ~4% (where the major contributor was
TransactionIdSetPageStatus, which has dropped from ~53% to ~3%). Isn't
that in line with what you are looking for?

[1] - https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 14:52:40
Message-ID: 5da94f12-8141-2f2f-016a-09a8e37bdd30@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/23/2016 03:07 PM, Amit Kapila wrote:
> On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>>
>>> ...
>>> The 4.5 kernel clearly changed the results significantly:
>>>
>> ...
>>>
>>>
>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>> eliminated the fluctuations in the results. For example when 3.2.80
>>> produced this results (10 runs with the same parameters):
>>>
>>> 12118 11610 27939 11771 18065
>>> 12152 14375 10983 13614 11077
>>>
>>> we get this on 4.5.5
>>>
>>> 37354 37650 37371 37190 37233
>>> 38498 37166 36862 37928 38509
>>>
>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>
>>
>> The more I think about these random spikes in pgbench performance on 3.2.80,
>> the more I find them intriguing. Let me show you another example (from
>> Dilip's workload and group-update patch on 64 clients).
>>
>> This is on 3.2.80:
>>
>> 44175 34619 51944 38384 49066
>> 37004 47242 36296 46353 36180
>>
>> and on 4.5.5 it looks like this:
>>
>> 34400 35559 35436 34890 34626
>> 35233 35756 34876 35347 35486
>>
>> So the 4.5.5 results are much more even, but overall clearly below 3.2.80.
>> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
>> randomly do something right, but what is it and why doesn't it happen on the
>> new kernel? And how could we do it every time?
>>
>
> As far as I can see you are using default values of min_wal_size,
> max_wal_size, checkpoint related params, have you changed default
> shared_buffer settings, because that can have a bigger impact.

Huh? Where do you see me using default values? There is a settings.log
with a dump of pg_settings data, and the modified values are:

checkpoint_completion_target = 0.9
checkpoint_timeout = 3600
effective_io_concurrency = 32
log_autovacuum_min_duration = 100
log_checkpoints = on
log_line_prefix = %m
log_timezone = UTC
maintenance_work_mem = 524288
max_connections = 300
max_wal_size = 8192
min_wal_size = 1024
shared_buffers = 2097152
synchronous_commit = on
work_mem = 524288

(ignoring some irrelevant stuff like locales, timezone etc.).

> Using default values of mentioned parameters can lead to checkpoints in
> between your runs.

So I'm using 16GB shared buffers (so with scale 300 everything fits into
shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint
timeout 1h etc. So no, there are no checkpoints during the 5-minute
runs, only those triggered explicitly before each run.
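
For anyone decoding the raw pg_settings values above: shared_buffers is in
8kB pages, min/max_wal_size in 16MB segments and checkpoint_timeout in
seconds, so the conversion is simply

  echo $((2097152 * 8 / 1024 / 1024))   # shared_buffers     -> 16 GB
  echo $((1024 * 16 / 1024))            # min_wal_size       -> 16 GB
  echo $((8192 * 16 / 1024))            # max_wal_size       -> 128 GB
  echo $((3600 / 60))                   # checkpoint_timeout -> 60 min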

> Also, I think instead of 5 mins, read-write runs should be run for 15
> mins to get consistent data.

Where does the inconsistency come from? Lack of warmup? Considering how
uniform the results from the 10 runs are (at least on 4.5.5), I claim
this is not an issue.

> For Dilip's workload where he is using only Select ... For Update, i
> think it is okay, but otherwise you need to drop and re-create the
> database between each run, otherwise data bloat could impact the
> readings.

And why should it affect 3.2.80 and 4.5.5 differently?

>
> I think in general, the impact should be same for both the kernels
> because you are using same parameters, but I think if use
> appropriate parameters, then you can get consistent results for
> 3.2.80. I have also seen variation in read-write tests, but the
> variation you are showing is really a matter of concern, because it
> will be difficult to rely on final data.
>

Both kernels use exactly the same parameters (fairly tuned, IMHO).

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-23 14:59:44
Message-ID: 94192968-1bb2-d409-190e-99915b2bcdb5@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/23/2016 02:59 PM, Pavan Deolasee wrote:
>
>
> On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>
> On 09/23/2016 05:10 AM, Amit Kapila wrote:
>
> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com
> <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>
> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>
>
>
> (c) Although it's not visible in the results, 4.5.5 almost
> perfectly
> eliminated the fluctuations in the results. For example when
> 3.2.80 produced
> this results (10 runs with the same parameters):
>
> 12118 11610 27939 11771 18065
> 12152 14375 10983 13614 11077
>
> we get this on 4.5.5
>
> 37354 37650 37371 37190 37233
> 38498 37166 36862 37928 38509
>
> Notice how much more even the 4.5.5 results are, compared to
> 3.2.80.
>
>
> how long each run was? Generally, I do half-hour run to get
> stable results.
>
>
> 10 x 5-minute runs for each client count. The full shell script
> driving the benchmark is here: http://bit.ly/2doY6ID and in short it
> looks like this:
>
> for r in `seq 1 $runs`; do
> for c in 1 8 16 32 64 128 192; do
> psql -c checkpoint
> pgbench -j 8 -c $c ...
> done
> done
>
>
>
> I see couple of problems with the tests:
>
> 1. You're running regular pgbench, which also updates the small
> tables. At scale 300 and higher clients, there is going to heavy
> contention on the pgbench_branches table. Why not test with pgbench
> -N?

Sure, I can do a bunch of tests with pgbench -N. Good point.

But notice that I've also done the testing with Dilip's workload, and
the results are pretty much the same.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-24 04:06:34
Message-ID: CAA4eK1J1sJchNAsbbhKP1DSRubcZLtQuiTjWpxdF0rc9+QoXvg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 09/23/2016 03:07 PM, Amit Kapila wrote:
>>
>> On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>>>
>>>>
>>>> ...
>>>> The 4.5 kernel clearly changed the results significantly:
>>>>
>>> ...
>>>>
>>>>
>>>>
>>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>>> eliminated the fluctuations in the results. For example when 3.2.80
>>>> produced this results (10 runs with the same parameters):
>>>>
>>>> 12118 11610 27939 11771 18065
>>>> 12152 14375 10983 13614 11077
>>>>
>>>> we get this on 4.5.5
>>>>
>>>> 37354 37650 37371 37190 37233
>>>> 38498 37166 36862 37928 38509
>>>>
>>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>>
>>>
>>> The more I think about these random spikes in pgbench performance on
>>> 3.2.80,
>>> the more I find them intriguing. Let me show you another example (from
>>> Dilip's workload and group-update patch on 64 clients).
>>>
>>> This is on 3.2.80:
>>>
>>> 44175 34619 51944 38384 49066
>>> 37004 47242 36296 46353 36180
>>>
>>> and on 4.5.5 it looks like this:
>>>
>>> 34400 35559 35436 34890 34626
>>> 35233 35756 34876 35347 35486
>>>
>>> So the 4.5.5 results are much more even, but overall clearly below
>>> 3.2.80.
>>> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
>>> randomly do something right, but what is it and why doesn't it happen on
>>> the
>>> new kernel? And how could we do it every time?
>>>
>>
>> As far as I can see you are using default values of min_wal_size,
>> max_wal_size, checkpoint related params, have you changed default
>> shared_buffer settings, because that can have a bigger impact.
>
>
> Huh? Where do you see me using default values?
>

I was referring to one of your scripts at http://bit.ly/2doY6ID. I
hadn't noticed that you had changed the default values in
postgresql.conf.

> There are settings.log with a
> dump of pg_settings data, and the modified values are
>
> checkpoint_completion_target = 0.9
> checkpoint_timeout = 3600
> effective_io_concurrency = 32
> log_autovacuum_min_duration = 100
> log_checkpoints = on
> log_line_prefix = %m
> log_timezone = UTC
> maintenance_work_mem = 524288
> max_connections = 300
> max_wal_size = 8192
> min_wal_size = 1024
> shared_buffers = 2097152
> synchronous_commit = on
> work_mem = 524288
>
> (ignoring some irrelevant stuff like locales, timezone etc.).
>
>> Using default values of mentioned parameters can lead to checkpoints in
>> between your runs.
>
>
> So I'm using 16GB shared buffers (so with scale 300 everything fits into
> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint timeout
> 1h etc. So no, there are no checkpoints during the 5-minute runs, only those
> triggered explicitly before each run.
>

Thanks for the clarification. Do you think we should try some different
settings for the *_flush_after parameters, as those can help in reducing
spikes in writes?
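
For example, something along these lines in postgresql.conf (the values
here are purely illustrative, not recommendations):

  checkpoint_flush_after = 256kB
  bgwriter_flush_after   = 512kB
  backend_flush_after    = 256kB
  wal_writer_flush_after = 1MB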

>> Also, I think instead of 5 mins, read-write runs should be run for 15
>> mins to get consistent data.
>
>
> Where does the inconsistency come from?

That's what I am also curious to know.

> Lack of warmup?

Can't say, but at least we should try to rule out the possibilities.
I think one way to rule them out is to do slightly longer runs for Dilip's
test cases, and for pgbench we might need to drop and re-create the
database after each reading.

> Considering how
> uniform the results from the 10 runs are (at least on 4.5.5), I claim this
> is not an issue.
>

It is quite possible that it is some kernel regression which might be
fixed in a later version. For instance, we do most tests on cthulhu,
which has a 3.10 kernel, and we generally get consistent results.
I am not sure a later kernel version, say 4.5.5, is a net win,
because there is a considerable dip in performance in that version,
though it produces quite stable results.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-24 18:28:57
Message-ID: e78b4f32-f24e-f282-1f46-b66d39d9ca9a@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/24/2016 06:06 AM, Amit Kapila wrote:
> On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> ...
>>
>> So I'm using 16GB shared buffers (so with scale 300 everything fits into
>> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint timeout
>> 1h etc. So no, there are no checkpoints during the 5-minute runs, only those
>> triggered explicitly before each run.
>>
>
> Thanks for clarification. Do you think we should try some different
> settings *_flush_after parameters as those can help in reducing spikes
> in writes?
>

I don't see why those settings would matter. The tests are on unlogged
tables, so there's almost no WAL traffic and checkpoints (triggered
explicitly before each run) look like this:

checkpoint complete: wrote 17 buffers (0.0%); 0 transaction log file(s)
added, 0 removed, 13 recycled; write=0.062 s, sync=0.006 s, total=0.092
s; sync files=10, longest=0.004 s, average=0.000 s; distance=309223 kB,
estimate=363742 kB

So I don't see how tuning the flushing would change anything, as we're
not doing any writes.

Moreover, the machine has a bunch of SSD drives (16 or 24, I don't
remember at the moment), behind a RAID controller with 2GB of write
cache on it.

>>> Also, I think instead of 5 mins, read-write runs should be run for 15
>>> mins to get consistent data.
>>
>>
>> Where does the inconsistency come from?
>
> Thats what I am also curious to know.
>
>> Lack of warmup?
>
> Can't say, but at least we should try to rule out the possibilities.
> I think one way to rule out is to do slightly longer runs for
> Dilip's test cases and for pgbench we might need to drop and
> re-create database after each reading.
>

My point is that it's unlikely to be due to insufficient warmup, because
the inconsistencies appear randomly - generally you get a bunch of slow
runs, one significantly faster one, then slow ones again.

I believe the runs to be sufficiently long. I don't see why recreating
the database would be useful - the whole point is to get the database
and shared buffers into a stable state, and then do measurements on it.

I don't think bloat is a major factor here - I'm collecting some
additional statistics during this run, including pg_database_size, and I
can see the size oscillates between 4.8GB and 5.4GB. That's pretty
negligible, I believe.

I'll let the current set of benchmarks complete - it's running on 4.5.5
now, I'll do tests on 3.2.80 too.

Then we can re-evaluate if longer runs are needed.

>> Considering how uniform the results from the 10 runs are (at least
>> on 4.5.5), I claim this is not an issue.
>>
>
> It is quite possible that it is some kernel regression which might
> be fixed in later version. Like we are doing most tests in cthulhu
> which has 3.10 version of kernel and we generally get consistent
> results. I am not sure if later version of kernel say 4.5.5 is a net
> win, because there is a considerable difference (dip) of performance
> in that version, though it produces quite stable results.
>

Well, the thing is - the 4.5.5 behavior is much nicer in general. I'll
always prefer lower but more consistent performance (in most cases). In
any case, we're stuck with whatever kernel version the people are using,
and they're likely to use the newer ones.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-26 17:16:31
Message-ID: f9a8572b-5f27-6666-0f44-e845480d989e@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/24/2016 08:28 PM, Tomas Vondra wrote:
> On 09/24/2016 06:06 AM, Amit Kapila wrote:
>> On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> ...
>>>
>>> So I'm using 16GB shared buffers (so with scale 300 everything fits into
>>> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint
>>> timeout
>>> 1h etc. So no, there are no checkpoints during the 5-minute runs,
>>> only those
>>> triggered explicitly before each run.
>>>
>>
>> Thanks for clarification. Do you think we should try some different
>> settings *_flush_after parameters as those can help in reducing spikes
>> in writes?
>>
>
> I don't see why that settings would matter. The tests are on unlogged
> tables, so there's almost no WAL traffic and checkpoints (triggered
> explicitly before each run) look like this:
>
> checkpoint complete: wrote 17 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 13 recycled; write=0.062 s, sync=0.006 s, total=0.092
> s; sync files=10, longest=0.004 s, average=0.000 s; distance=309223 kB,
> estimate=363742 kB
>
> So I don't see how tuning the flushing would change anything, as we're
> not doing any writes.
>
> Moreover, the machine has a bunch of SSD drives (16 or 24, I don't
> remember at the moment), behind a RAID controller with 2GB of write
> cache on it.
>
>>>> Also, I think instead of 5 mins, read-write runs should be run for 15
>>>> mins to get consistent data.
>>>
>>>
>>> Where does the inconsistency come from?
>>
>> Thats what I am also curious to know.
>>
>>> Lack of warmup?
>>
>> Can't say, but at least we should try to rule out the possibilities.
>> I think one way to rule out is to do slightly longer runs for
>> Dilip's test cases and for pgbench we might need to drop and
>> re-create database after each reading.
>>
>
> My point is that it's unlikely to be due to insufficient warmup, because
> the inconsistencies appear randomly - generally you get a bunch of slow
> runs, one significantly faster one, then slow ones again.
>
> I believe the runs to be sufficiently long. I don't see why recreating
> the database would be useful - the whole point is to get the database
> and shared buffers into a stable state, and then do measurements on it.
>
> I don't think bloat is a major factor here - I'm collecting some
> additional statistics during this run, including pg_database_size, and I
> can see the size oscillates between 4.8GB and 5.4GB. That's pretty
> negligible, I believe.
>
> I'll let the current set of benchmarks complete - it's running on 4.5.5
> now, I'll do tests on 3.2.80 too.
>
> Then we can re-evaluate if longer runs are needed.
>
>>> Considering how uniform the results from the 10 runs are (at least
>>> on 4.5.5), I claim this is not an issue.
>>>
>>
>> It is quite possible that it is some kernel regression which might
>> be fixed in later version. Like we are doing most tests in cthulhu
>> which has 3.10 version of kernel and we generally get consistent
>> results. I am not sure if later version of kernel say 4.5.5 is a net
>> win, because there is a considerable difference (dip) of performance
>> in that version, though it produces quite stable results.
>>
>
> Well, the thing is - the 4.5.5 behavior is much nicer in general. I'll
> always prefer lower but more consistent performance (in most cases). In
> any case, we're stuck with whatever kernel version the people are using,
> and they're likely to use the newer ones.
>

So, I have the pgbench results from 3.2.80 and 4.5.5, and in general I
think they match the previous results quite closely, so it wasn't just
a fluke before.

The full results, including systat data and various database statistics
(pg_stat_* sampled every second) are available here:

https://bitbucket.org/tvondra/group-clog-kernels

Attached are the per-run results. The averages (over the 10 runs, 5
minutes each) look like this:

3.2.80 1 8 16 32 64 128 192
--------------------------------------------------------------------
granular-locking 1567 12146 26341 44188 43263 49590 15042
no-content-lock 1567 12180 25549 43787 43675 51800 16831
group-update 1550 12018 26121 44451 42734 51455 15504
master 1566 12057 25457 42299 42513 42562 10462

4.5.5 1 8 16 32 64 128 192
--------------------------------------------------------------------
granular-locking 3018 19031 27394 29222 32032 34249 36191
no-content-lock 2988 18871 27384 29260 32120 34456 36216
group-update 2960 18848 26870 29025 32078 34259 35900
master 2984 18917 26430 29065 32119 33924 35897

That is:

(1) 3.2.80 performs a bit better than before, particularly for 128
and 256 clients - I'm not sure whether that's thanks to the reboots or
something else.

(2) 4.5.5 performs measurably worse for >= 32 clients (by ~30%). That's
a pretty significant regression, on a fairly common workload.

(3) The patches somewhat help on 3.2.80, with 128 clients or more.

(4) There's no measurable improvement on 4.5.5.

As for the warmup and the possible impact of database bloat etc., attached
are two charts illustrating how the tps and database size look over the
whole benchmark on 4.5.5 (~1440 minutes). Clearly, the behavior is very stable
- the database size oscillates around 5GB (which easily fits into
shared_buffers), and the tps is very stable over the 10 runs. If the
warmup (or run duration) was insufficient, there'd be visible behavior
changes during the benchmark. So I believe the parameters are appropriate.

I've realized there actually is a 3.10.101 kernel available on the
machine, so I'll repeat the pgbench run on it too - perhaps that'll give
us some comparison to cthulhu, which is also running a 3.10 kernel.

Then I'll run Dilip's workload on those three kernels (so far only the
simple pgbench was measured).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
db-size.png image/png 75.0 KB
result.ods application/vnd.oasis.opendocument.spreadsheet 46.0 KB
image/png 63.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-26 17:31:41
Message-ID: CA+TgmoZO7dubdAcrV9m=SigrVUDXifqt4E-uYuqm5RnxMprcuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 23, 2016 at 9:20 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Sep 23, 2016 at 6:50 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> I don't dare to suggest rejecting the patch, but I don't see how we could
>>> commit any of the patches at this point. So perhaps "returned with feedback"
>>> and resubmitting in the next CF (along with analysis of improved workloads)
>>> would be appropriate.
>>
>> I think it would be useful to have some kind of theoretical analysis
>> of how much time we're spending waiting for various locks. So, for
>> example, suppose we one run of these tests with various client counts
>> - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select
>> wait_event from pg_stat_activity" once per second throughout the test.
>> Then we see how many times we get each wait event, including NULL (no
>> wait event). Now, from this, we can compute the approximate
>> percentage of time we're spending waiting on CLogControlLock and every
>> other lock, too, as well as the percentage of time we're not waiting
>> for lock. That, it seems to me, would give us a pretty clear idea
>> what the maximum benefit we could hope for from reducing contention on
>> any given lock might be.
>>
> As mentioned earlier, such an activity makes sense, however today,
> again reading this thread, I noticed that Dilip has already posted
> some analysis of lock contention upthread [1]. It is clear that patch
> has reduced LWLock contention from ~28% to ~4% (where the major
> contributor was TransactionIdSetPageStatus which has reduced from ~53%
> to ~3%). Isn't it inline with what you are looking for?

Hmm, yes. But it's a little hard to interpret what that means; I
think the test I proposed in the quoted material above would provide
clearer data.
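
Concretely, I'm thinking of something as simple as the following running
alongside each test (a rough sketch; the one-second interval and 300-sample
duration are arbitrary):

  # sample wait events once per second, then count how often each event
  # (or 'no wait', i.e. NULL) was observed across all backends
  for i in $(seq 1 300); do
      psql -A -t -c "SELECT coalesce(wait_event, 'no wait') \
          FROM pg_stat_activity WHERE pid <> pg_backend_pid()"
      sleep 1
  done | sort | uniq -c | sort -rn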

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-26 18:48:55
Message-ID: 43198bfb-391b-75da-0517-31e56c4d11f4@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/26/2016 07:16 PM, Tomas Vondra wrote:
>
> The averages (over the 10 runs, 5 minute each) look like this:
>
> 3.2.80 1 8 16 32 64 128 192
> --------------------------------------------------------------------
> granular-locking 1567 12146 26341 44188 43263 49590 15042
> no-content-lock 1567 12180 25549 43787 43675 51800 16831
> group-update 1550 12018 26121 44451 42734 51455 15504
> master 1566 12057 25457 42299 42513 42562 10462
>
> 4.5.5 1 8 16 32 64 128 192
> --------------------------------------------------------------------
> granular-locking 3018 19031 27394 29222 32032 34249 36191
> no-content-lock 2988 18871 27384 29260 32120 34456 36216
> group-update 2960 18848 26870 29025 32078 34259 35900
> master 2984 18917 26430 29065 32119 33924 35897
>
> That is:
>
> (1) The 3.2.80 performs a bit better than before, particularly for 128
> and 256 clients - I'm not sure if it's thanks to the reboots or so.
>
> (2) 4.5.5 performs measurably worse for >= 32 clients (by ~30%). That's
> a pretty significant regression, on a fairly common workload.
>

FWIW, now that I think about this, the regression is roughly in line
with my findings presented in my recent blog post:

http://blog.2ndquadrant.com/postgresql-vs-kernel-versions/

Those numbers were collected on a much smaller machine (2/4 cores only),
which might be why the difference observed on the 32-core machine is much
more significant.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-27 10:39:56
Message-ID: CAFiTN-tr_=25EQUFezKNRk=4N-V+D6WMxo7HWs9BMaNx7S3y6w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 21, 2016 at 8:47 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> Summary:
> --------------
> At 32 clients no gain, I think at this workload Clog Lock is not a problem.
> At 64 Clients we can see ~10% gain with simple update and ~5% with TPCB.
> At 128 Clients we can see > 50% gain.
>
> Currently I have tested with synchronous commit=off, later I can try
> with on. I can also test at 80 client, I think we will see some
> significant gain at this client count also, but as of now I haven't
> yet tested.
>
> With above results, what we think ? should we continue our testing ?

I have done further testing with the TPC-B workload to see the impact on
the performance gain of increasing the scale factor.

Again, at 32 clients there is no gain, but at 64 clients the gain is 12%
and at 128 clients it's 75%. This shows that the improvement with the
group lock is better at a higher scale factor (at scale factor 300 the
gain was 5% at 64 clients and 50% at 128 clients).

8-socket machine (kernel 3.10)
10 min runs (median of 3 runs)
synchronous_commit = off
scale factor = 1000
shared_buffers = 40GB

Test results:
----------------

client          head    group lock
32             27496         27178
64             31275         35205
128            20656         34490

LWLOCK_STATS approx. block count on ClogControl Lock ("lwlock main 11")
--------------------------------------------------------------------------------------------------------
client          head    group lock
32             80000         60000
64            150000        100000
128           140000         70000

Note: These are approximate block counts; I have the detailed LWLOCK_STATS
output, in case someone wants to look into it.

LWLOCK_STATS shows that the ClogControlLock block count is reduced by 25%
at 32 clients, 33% at 64 clients and 50% at 128 clients.
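
(In case someone wants to reproduce these numbers: LWLOCK_STATS is a
compile-time option, so the build looks roughly like this - the paths and
flags below are only illustrative - and each backend dumps its per-lock
counters, including the blk counts above, to the server log (stderr) when
it exits.)

  # rebuild PostgreSQL with LWLOCK_STATS enabled
  ./configure CPPFLAGS="-DLWLOCK_STATS" --prefix=$HOME/pg-lwstats
  make -j8 && make install
  # after a run, pull out the ClogControlLock lines ("lwlock main 11")
  grep "lwlock main 11" path/to/server.log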

Conclusion:
1. I think both the LWLOCK_STATS data and the performance data show that we
get a significant reduction in ClogControlLock contention with the patch.
2. They also show that, although we are not seeing any performance gain at
32 clients, there is still a contention reduction with the patch.

I am planning to do some more tests with a higher scale factor (3000 or more).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-27 21:15:19
Message-ID: d7a7e096-daac-9207-8eae-50f450203312@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/26/2016 08:48 PM, Tomas Vondra wrote:
> On 09/26/2016 07:16 PM, Tomas Vondra wrote:
>>
>> The averages (over the 10 runs, 5 minute each) look like this:
>>
>> 3.2.80 1 8 16 32 64 128 192
>> --------------------------------------------------------------------
>> granular-locking 1567 12146 26341 44188 43263 49590 15042
>> no-content-lock 1567 12180 25549 43787 43675 51800 16831
>> group-update 1550 12018 26121 44451 42734 51455 15504
>> master 1566 12057 25457 42299 42513 42562 10462
>>
>> 4.5.5 1 8 16 32 64 128 192
>> --------------------------------------------------------------------
>> granular-locking 3018 19031 27394 29222 32032 34249 36191
>> no-content-lock 2988 18871 27384 29260 32120 34456 36216
>> group-update 2960 18848 26870 29025 32078 34259 35900
>> master 2984 18917 26430 29065 32119 33924 35897
>>

So, I got the results from 3.10.101 (only the pgbench data), and it
looks like this:

3.10.101 1 8 16 32 64 128 192
--------------------------------------------------------------------
granular-locking 2582 18492 33416 49583 53759 53572 51295
no-content-lock 2580 18666 33860 49976 54382 54012 51549
group-update 2635 18877 33806 49525 54787 54117 51718
master 2630 18783 33630 49451 54104 53199 50497

So 3.10.101 performs even better than 3.2.80 (and much better than
4.5.5), and there's no sign of any of the patches making a difference.

It also seems there's a major regression in the kernel, somewhere
between 3.10 and 4.5. With 64 clients, 3.10 does ~54k transactions,
while 4.5 does only ~32k - that's a helluva difference.

I wonder if this might be due to running the benchmark on unlogged
tables (and thus not waiting for WAL), but I don't see why that should
result in such a drop on a new kernel.

In any case, this seems like an issue unrelated to the patch, so I'll
post further data into a new thread instead of hijacking this one.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-28 15:39:21
Message-ID: CA+TgmoZ+2OmyUfcNOPsFwsBXJdBcKKPnLaFJ3DJyYLop0u_OLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 27, 2016 at 5:15 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> So, I got the results from 3.10.101 (only the pgbench data), and it looks
> like this:
>
> 3.10.101 1 8 16 32 64 128 192
> --------------------------------------------------------------------
> granular-locking 2582 18492 33416 49583 53759 53572 51295
> no-content-lock 2580 18666 33860 49976 54382 54012 51549
> group-update 2635 18877 33806 49525 54787 54117 51718
> master 2630 18783 33630 49451 54104 53199 50497
>
> So 3.10.101 performs even better tnan 3.2.80 (and much better than 4.5.5),
> and there's no sign any of the patches making a difference.

I'm sure that you mentioned this upthread somewhere, but I can't
immediately find it. What scale factor are you testing here?

It strikes me that the larger the scale factor, the more
CLogControlLock contention we expect to have. We'll pretty much do
one CLOG access per update, and the more rows there are, the more
chance there is that the next update hits an "old" row that hasn't
been updated in a long time. So a larger scale factor also increases
the number of active CLOG pages and, presumably therefore, the amount
of CLOG paging activity.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-28 22:45:15
Message-ID: e654de76-0aaa-2873-e105-b6a358e59894@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/28/2016 05:39 PM, Robert Haas wrote:
> On Tue, Sep 27, 2016 at 5:15 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> So, I got the results from 3.10.101 (only the pgbench data), and it looks
>> like this:
>>
>> 3.10.101 1 8 16 32 64 128 192
>> --------------------------------------------------------------------
>> granular-locking 2582 18492 33416 49583 53759 53572 51295
>> no-content-lock 2580 18666 33860 49976 54382 54012 51549
>> group-update 2635 18877 33806 49525 54787 54117 51718
>> master 2630 18783 33630 49451 54104 53199 50497
>>
>> So 3.10.101 performs even better tnan 3.2.80 (and much better than 4.5.5),
>> and there's no sign any of the patches making a difference.
>
> I'm sure that you mentioned this upthread somewhere, but I can't
> immediately find it. What scale factor are you testing here?
>

300, the same scale factor as Dilip.

>
> It strikes me that the larger the scale factor, the more
> CLogControlLock contention we expect to have. We'll pretty much do
> one CLOG access per update, and the more rows there are, the more
> chance there is that the next update hits an "old" row that hasn't
> been updated in a long time. So a larger scale factor also
> increases the number of active CLOG pages and, presumably therefore,
> the amount of CLOG paging activity.
>

So, is 300 too little? I don't think so, because Dilip saw some benefit
from that. Or what scale factor do we think is needed to reproduce the
benefit? My machine has 256GB of RAM, so I can easily go up to 15000 and
still keep everything in RAM. But is it worth it?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-28 23:59:57
Message-ID: CA+TgmoaKi1GDLpBTBBv+jQ0uizu4nev_7iU_9z1i76gNd9Mo8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 28, 2016 at 6:45 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> So, is 300 too little? I don't think so, because Dilip saw some benefit from
> that. Or what scale factor do we think is needed to reproduce the benefit?
> My machine has 256GB of ram, so I can easily go up to 15000 and still keep
> everything in RAM. But is it worth it?

Dunno. But it might be worth a test or two at, say, 5000, just to see
if that makes any difference.

I feel like we must be missing something here. If Dilip is seeing
huge speedups and you're seeing nothing, something is different, and
we don't know what it is. Even if the test case is artificial, it
ought to be the same when one of you runs it as when the other runs
it. Right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-29 01:10:30
Message-ID: d8dd08a9-1352-a34a-833b-6864780a4c53@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/29/2016 01:59 AM, Robert Haas wrote:
> On Wed, Sep 28, 2016 at 6:45 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> So, is 300 too little? I don't think so, because Dilip saw some benefit from
>> that. Or what scale factor do we think is needed to reproduce the benefit?
>> My machine has 256GB of ram, so I can easily go up to 15000 and still keep
>> everything in RAM. But is it worth it?
>
> Dunno. But it might be worth a test or two at, say, 5000, just to
> see if that makes any difference.
>

OK, I have some benchmarks to run on that machine, but I'll do a few
tests with scale 5000 - probably sometime next week. I don't think the
delay matters very much, as it's clear the patch will end up with RwF in
this CF round.

> I feel like we must be missing something here. If Dilip is seeing
> huge speedups and you're seeing nothing, something is different, and
> we don't know what it is. Even if the test case is artificial, it
> ought to be the same when one of you runs it as when the other runs
> it. Right?
>

Yes, definitely - we're missing something important, I think. One
difference is that Dilip is using longer runs, but I don't think that's
a problem (as I demonstrated how stable the results are).

I wonder what CPU model Dilip is using - I know it's x86, but not which
generation it is. I'm using an E5-4620 v1 Xeon; perhaps Dilip is using a
newer model and it makes a difference (although that seems unlikely).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-29 07:26:00
Message-ID: CAFiTN-vE2n6YbkEfJ2JRD+4Z8b+Kv6u07PCCVVdC3Sn26F0EFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 29, 2016 at 6:40 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Yes, definitely - we're missing something important, I think. One difference
> is that Dilip is using longer runs, but I don't think that's a problem (as I
> demonstrated how stable the results are).
>
> I wonder what CPU model is Dilip using - I know it's x86, but not which
> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
> model and it makes a difference (although that seems unlikely).

I am using "Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz "

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-29 07:54:24
Message-ID: CAA4eK1K_m1YpzQF8w8pqfLXX8p+o81QEmBbiSxsxj42j-uXV6A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 29, 2016 at 12:56 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Thu, Sep 29, 2016 at 6:40 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Yes, definitely - we're missing something important, I think. One difference
>> is that Dilip is using longer runs, but I don't think that's a problem (as I
>> demonstrated how stable the results are).
>>
>> I wonder what CPU model is Dilip using - I know it's x86, but not which
>> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
>> model and it makes a difference (although that seems unlikely).
>
> I am using "Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz "
>

Another difference is that the machine on which Dilip is running the tests
has 8 sockets.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-29 13:47:09
Message-ID: CA+TgmobcD95v4QeD76qWcwTGE7P_31KVxBQGyS6yLh=dubvSfw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 28, 2016 at 9:10 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I feel like we must be missing something here. If Dilip is seeing
>> huge speedups and you're seeing nothing, something is different, and
>> we don't know what it is. Even if the test case is artificial, it
>> ought to be the same when one of you runs it as when the other runs
>> it. Right?
>>
> Yes, definitely - we're missing something important, I think. One difference
> is that Dilip is using longer runs, but I don't think that's a problem (as I
> demonstrated how stable the results are).

It's not impossible that the longer runs could matter - performance
isn't necessarily stable across time during a pgbench test, and the
longer the run the more CLOG pages it will fill.

> I wonder what CPU model is Dilip using - I know it's x86, but not which
> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
> model and it makes a difference (although that seems unlikely).

The fact that he's using an 8-socket machine seems more likely to
matter than the CPU generation, which isn't much different. Maybe
Dilip should try this on a 2-socket machine and see if he sees the
same kinds of results.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-29 14:14:44
Message-ID: 3f169562-4544-7b2b-9d25-b058da029ffb@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/29/2016 03:47 PM, Robert Haas wrote:
> On Wed, Sep 28, 2016 at 9:10 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> I feel like we must be missing something here. If Dilip is seeing
>>> huge speedups and you're seeing nothing, something is different, and
>>> we don't know what it is. Even if the test case is artificial, it
>>> ought to be the same when one of you runs it as when the other runs
>>> it. Right?
>>>
>> Yes, definitely - we're missing something important, I think. One difference
>> is that Dilip is using longer runs, but I don't think that's a problem (as I
>> demonstrated how stable the results are).
>
> It's not impossible that the longer runs could matter - performance
> isn't necessarily stable across time during a pgbench test, and the
> longer the run the more CLOG pages it will fill.
>

Sure, but I'm not doing just a single pgbench run. I do a sequence of
pgbench runs, with different client counts, with ~6h of total runtime.
There's a checkpoint in between the runs, but as those benchmarks are on
unlogged tables, that flushes only very few buffers.

Also, the clog SLRU has 128 pages, which is ~1MB of clog data, i.e. ~4M
transactions. On some kernels (3.10 and 3.12) I can get >50k tps with 64
clients or more, which means we fill the 128 pages in less than 80 seconds.
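
Just to spell that arithmetic out (a quick sanity check, assuming the
standard 8kB clog pages and 2 bits of status per transaction, i.e. 4
transactions per byte):

# transactions covered by 128 clog pages: 128 pages * 8192 bytes * 4 xacts/byte
echo $(( 128 * 8192 * 4 ))           # -> 4194304
# seconds to fill those pages at ~50k tps (less at higher rates)
echo $(( 128 * 8192 * 4 / 50000 ))   # -> ~83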

So half-way through the run only 50% of the clog pages fit into the SLRU,
and we have a data set with 30M tuples, with uniform random access - so
it seems rather unlikely we'll hit a transaction that's still in the SLRU.

But sure, I can do a run with larger data set to verify this.

>> I wonder what CPU model is Dilip using - I know it's x86, but not which
>> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
>> model and it makes a difference (although that seems unlikely).
>
> The fact that he's using an 8-socket machine seems more likely to
> matter than the CPU generation, which isn't much different. Maybe
> Dilip should try this on a 2-socket machine and see if he sees the
> same kinds of results.
>

Maybe. I wouldn't expect a major difference between 4 and 8 sockets, but
I may be wrong.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-29 14:35:08
Message-ID: CA+TgmoY-sXi9u=Bg9U6sPBU3ec9W51bQRn-O6LUniE9zq04j9A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 29, 2016 at 10:14 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> It's not impossible that the longer runs could matter - performance
>> isn't necessarily stable across time during a pgbench test, and the
>> longer the run the more CLOG pages it will fill.
>
> Sure, but I'm not doing just a single pgbench run. I do a sequence of
> pgbench runs, with different client counts, with ~6h of total runtime.
> There's a checkpoint in between the runs, but as those benchmarks are on
> unlogged tables, that flushes only very few buffers.
>
> Also, the clog SLRU has 128 pages, which is ~1MB of clog data, i.e. ~4M
> transactions. On some kernels (3.10 and 3.12) I can get >50k tps with 64
> clients or more, which means we fill the 128 pages in less than 80 seconds.
>
> So half-way through the run only 50% of clog pages fits into the SLRU, and
> we have a data set with 30M tuples, with uniform random access - so it seems
> rather unlikely we'll get transaction that's still in the SLRU.
>
> But sure, I can do a run with larger data set to verify this.

OK, another theory: Dilip is, I believe, reinitializing for each run,
and you are not. Maybe somehow the effect Dilip is seeing only
happens with a newly-initialized set of pgbench tables. For example,
maybe the patches cause a huge improvement when all rows have the same
XID, but the effect fades rapidly once the XIDs spread out...
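
(By "reinitializing" I mean dropping and repopulating the pgbench tables
before every run, e.g. something along these lines - scale and database
name are just placeholders:)

# recreate the pgbench tables from scratch before each run
pgbench -i -s 300 --unlogged-tables postgres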

I'm not saying any of what I'm throwing out here is worth the
electrons upon which it is printed, just that there has to be some
explanation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-30 04:26:56
Message-ID: CAFiTN-ufKuY4JkNJV1LREmbuaZD7LS=fNxAjkgKGjKja6ekFpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 29, 2016 at 8:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> OK, another theory: Dilip is, I believe, reinitializing for each run,
> and you are not.

Yes, I am reinitializing for each run.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-05 06:35:19
Message-ID: 3cc206fa-7d36-f020-3856-12ed405e2535@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

After collecting a lot more results from multiple kernel versions, I can
confirm that I see a significant improvement with 128 and 192 clients,
roughly by 30%:

64 128 192
------------------------------------------------
master 62482 43181 50985
granular-locking 61701 59611 47483
no-content-lock 62650 59819 47895
group-update 63702 64758 62596

But I only see this with Dilip's workload, and only with pre-4.3.0
kernels (the results above are from kernel 3.19).

With 4.5.5, results for the same benchmark look like this:

64 128 192
------------------------------------------------
master 35693 39822 42151
granular-locking 35370 39409 41353
no-content-lock 36201 39848 42407
group-update 35697 39893 42667

That seems like a fairly bad regression in the kernel, although I have not
identified the feature/commit causing it (and it's also possible the
issue lies somewhere else, of course).

With regular pgbench, I see no improvement on any kernel version. For
example on 3.19 the results look like this:

64 128 192
------------------------------------------------
master 54661 61014 59484
granular-locking 55904 62481 60711
no-content-lock 56182 62442 61234
group-update 55019 61587 60485

I haven't done much more testing (e.g. with -N to eliminate collisions
on branches) yet; let's see if it changes anything.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-05 08:03:38
Message-ID: CAA4eK1JiOOKN73yC3rHqF_F+yKEFfn15Gh9iitT-=7Q5BuCs=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 5, 2016 at 12:05 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> After collecting a lot more results from multiple kernel versions, I can
> confirm that I see a significant improvement with 128 and 192 clients,
> roughly by 30%:
>
> 64 128 192
> ------------------------------------------------
> master 62482 43181 50985
> granular-locking 61701 59611 47483
> no-content-lock 62650 59819 47895
> group-update 63702 64758 62596
>
> But I only see this with Dilip's workload, and only with pre-4.3.0 kernels
> (the results above are from kernel 3.19).
>

That appears positive.

> With 4.5.5, results for the same benchmark look like this:
>
> 64 128 192
> ------------------------------------------------
> master 35693 39822 42151
> granular-locking 35370 39409 41353
> no-content-lock 36201 39848 42407
> group-update 35697 39893 42667
>
> That seems like a fairly bad regression in kernel, although I have not
> identified the feature/commit causing it (and it's also possible the issue
> lies somewhere else, of course).
>
> With regular pgbench, I see no improvement on any kernel version. For
> example on 3.19 the results look like this:
>
> 64 128 192
> ------------------------------------------------
> master 54661 61014 59484
> granular-locking 55904 62481 60711
> no-content-lock 56182 62442 61234
> group-update 55019 61587 60485
>

Are the above results with synchronous_commit=off?

> I haven't done much more testing (e.g. with -N to eliminate collisions on
> branches) yet, let's see if it changes anything.
>

Yeah, let us see how it behaves with -N. Also, I think we could try
a higher scale factor?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-07 09:32:53
Message-ID: 65f47a46-f1b2-aee1-d56d-b67de49e3a3b@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/05/2016 10:03 AM, Amit Kapila wrote:
> On Wed, Oct 5, 2016 at 12:05 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> After collecting a lot more results from multiple kernel versions, I can
>> confirm that I see a significant improvement with 128 and 192 clients,
>> roughly by 30%:
>>
>> 64 128 192
>> ------------------------------------------------
>> master 62482 43181 50985
>> granular-locking 61701 59611 47483
>> no-content-lock 62650 59819 47895
>> group-update 63702 64758 62596
>>
>> But I only see this with Dilip's workload, and only with pre-4.3.0 kernels
>> (the results above are from kernel 3.19).
>>
>
> That appears positive.
>

I got access to a large machine with 72/144 cores (thanks to Oleg and
Alexander from Postgres Professional), and I'm running the tests on that
machine too.

Results from Dilip's workload (with scale 300, unlogged tables) look
like this:

32 64 128 192 224 256 288
master 104943 128579 72167 100967 66631 97088 63767
granular-locking 103415 141689 83780 120480 71847 115201 67240
group-update 105343 144322 92229 130149 81247 126629 76638
no-content-lock 103153 140568 80101 119185 70004 115386 66199

So there's some 20-30% improvement for >= 128 clients.

But what I find much more intriguing is the zig-zag behavior. I mean, 64
clients give ~130k tps, 128 clients only give ~70k but 192 clients jump
up to >100k tps again, etc.

FWIW I don't see any such behavior on pgbench, and all those tests were
done on the same cluster.

>> With 4.5.5, results for the same benchmark look like this:
>>
>> 64 128 192
>> ------------------------------------------------
>> master 35693 39822 42151
>> granular-locking 35370 39409 41353
>> no-content-lock 36201 39848 42407
>> group-update 35697 39893 42667
>>
>> That seems like a fairly bad regression in kernel, although I have not
>> identified the feature/commit causing it (and it's also possible the issue
>> lies somewhere else, of course).
>>
>> With regular pgbench, I see no improvement on any kernel version. For
>> example on 3.19 the results look like this:
>>
>> 64 128 192
>> ------------------------------------------------
>> master 54661 61014 59484
>> granular-locking 55904 62481 60711
>> no-content-lock 56182 62442 61234
>> group-update 55019 61587 60485
>>
>
> Are the above results with synchronous_commit=off?
>

No, but I can do that.

>> I haven't done much more testing (e.g. with -N to eliminate
>> collisions on branches) yet, let's see if it changes anything.
>>
>
> Yeah, let us see how it behaves with -N. Also, I think we could try
> at higher scale factor?
>

Yes, I plan to do that. In total, I plan to test combinations of:

(a) Dilip's workload and pgbench (regular and -N)
(b) logged and unlogged tables
(c) scale 300 and scale 3000 (both fit into RAM)
(d) sync_commit=on/off
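
(Roughly speaking, the test driver just loops over those dimensions;
run_combination below is a hypothetical placeholder for the actual
per-combination benchmark script, it's only a sketch of the shape of
the matrix:)

for workload in dilip pgbench pgbench-N; do
  for tables in logged unlogged; do
    for scale in 300 3000; do
      for sync_commit in on off; do
        run_combination "$workload" "$tables" "$scale" "$sync_commit"
      done
    done
  done
done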

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-08 05:47:38
Message-ID: CAA4eK1JPVwPW0X8Ss+Rz+VQcPTYxCMGQuHEHfOcCTOtGqE_=ZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 7, 2016 at 3:02 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> I got access to a large machine with 72/144 cores (thanks to Oleg and
> Alexander from Postgres Professional), and I'm running the tests on that
> machine too.
>
> Results from Dilip's workload (with scale 300, unlogged tables) look like
> this:
>
> 32 64 128 192 224 256 288
> master 104943 128579 72167 100967 66631 97088 63767
> granular-locking 103415 141689 83780 120480 71847 115201 67240
> group-update 105343 144322 92229 130149 81247 126629 76638
> no-content-lock 103153 140568 80101 119185 70004 115386 66199
>
> So there's some 20-30% improvement for >= 128 clients.
>

So here we see a performance improvement starting at 64 clients, which is
somewhat similar to what Dilip saw in his tests.

> But what I find much more intriguing is the zig-zag behavior. I mean, 64
> clients give ~130k tps, 128 clients only give ~70k but 192 clients jump up
> to >100k tps again, etc.
>

No clear answer.

> FWIW I don't see any such behavior on pgbench, and all those tests were done
> on the same cluster.
>
>>> With 4.5.5, results for the same benchmark look like this:
>>>
>>> 64 128 192
>>> ------------------------------------------------
>>> master 35693 39822 42151
>>> granular-locking 35370 39409 41353
>>> no-content-lock 36201 39848 42407
>>> group-update 35697 39893 42667
>>>
>>> That seems like a fairly bad regression in kernel, although I have not
>>> identified the feature/commit causing it (and it's also possible the
>>> issue
>>> lies somewhere else, of course).
>>>
>>> With regular pgbench, I see no improvement on any kernel version. For
>>> example on 3.19 the results look like this:
>>>
>>> 64 128 192
>>> ------------------------------------------------
>>> master 54661 61014 59484
>>> granular-locking 55904 62481 60711
>>> no-content-lock 56182 62442 61234
>>> group-update 55019 61587 60485
>>>
>>
>> Are the above results with synchronous_commit=off?
>>
>
> No, but I can do that.
>
>>> I haven't done much more testing (e.g. with -N to eliminate
>>> collisions on branches) yet, let's see if it changes anything.
>>>
>>
>> Yeah, let us see how it behaves with -N. Also, I think we could try
>> at higher scale factor?
>>
>
> Yes, I plan to do that. In total, I plan to test combinations of:
>
> (a) Dilip's workload and pgbench (regular and -N)
> (b) logged and unlogged tables
> (c) scale 300 and scale 3000 (both fits into RAM)
> (d) sync_commit=on/off
>

sounds sensible.

Thanks for doing the tests.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-09 20:47:21
Message-ID: 09275f4d-f78d-dd69-22de-57c784dc410e@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2016 07:47 AM, Amit Kapila wrote:
> On Fri, Oct 7, 2016 at 3:02 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> ...
>
>> In total, I plan to test combinations of:
>>
>> (a) Dilip's workload and pgbench (regular and -N)
>> (b) logged and unlogged tables
>> (c) scale 300 and scale 3000 (both fits into RAM)
>> (d) sync_commit=on/off
>>
>
> sounds sensible.
>
> Thanks for doing the tests.
>

FWIW I've started those tests on the big machine provided by Oleg and
Alexander; the estimate to complete all the benchmarks is 9 days. The
results will be pushed to

https://bitbucket.org/tvondra/hp05-results/src

after testing each combination (every ~9 hours). Inspired by Robert's
wait event post a few days ago, I've added wait event sampling so that
we can perform similar analysis. (Neat idea!)

While messing with the kernel on the other machine I've managed to
misconfigure it to the extent that it's not accessible anymore. I'll
start similar benchmarks once I find someone with console access who can
fix the boot.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-12 10:21:37
Message-ID: CAFiTN-uQ+Jbd31cXvRbj48Ba6TqDUDpLKSPnsUCCYRju0Y0U8Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 10, 2016 at 2:17 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> after testing each combination (every ~9 hours). Inspired by Robert's wait
> event post a few days ago, I've added wait event sampling so that we can
> perform similar analysis. (Neat idea!)

I have done a wait event test for head vs. the group lock patch.
I have used a script similar to what Robert mentioned in the thread below:

https://www.postgresql.org/message-id/CA+Tgmoav9Q5v5ZGT3+wP_1tQjT6TGYXrwrDcTRrWimC+ZY7RRA@mail.gmail.com
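
(Roughly, the sampling boils down to something like the sketch below -
the exact script is the one from the linked thread; the file name and
sampling interval here are just placeholders:)

# sample wait events once per second for the duration of the run
for i in $(seq 1 1800); do
    psql -t -c "SELECT wait_event_type, wait_event FROM pg_stat_activity" >> wait_events.txt
    sleep 1
done
# aggregate the samples into the per-event counts shown below
sort wait_events.txt | uniq -c | sort -rn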

Test details and Results:
--------------------------------
Machine: POWER, 4-socket (machine details are attached in a file).

30-minute pgbench runs with the following configuration:
max_connections = 200
shared_buffers = 8GB
maintenance_work_mem = 4GB
synchronous_commit = off
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p]
max_wal_size = 40GB
log_checkpoints = on

Test1: unlogged table, 192 clients
---------------------------------------------
On Head:
tps = 44898.862257 (including connections establishing)
tps = 44899.761934 (excluding connections establishing)

262092 LWLockNamed | CLogControlLock
224396 |
114510 Lock | transactionid
42908 Client | ClientRead
20610 Lock | tuple
13700 LWLockTranche | buffer_content
3637
2562 LWLockNamed | XidGenLock
2359 LWLockNamed | ProcArrayLock
1037 Lock | extend
948 LWLockTranche | lock_manager
46 LWLockTranche | wal_insert
12 BufferPin | BufferPin
4 LWLockTranche | buffer_mapping

With Patch:

tps = 77846.622956 (including connections establishing)
tps = 77848.234046 (excluding connections establishing)

101832 Lock | transactionid
91358 Client | ClientRead
16691 LWLockNamed | XidGenLock
12467 Lock | tuple
6007 LWLockNamed | CLogControlLock
3640
3531 LWLockNamed | ProcArrayLock
3390 LWLockTranche | lock_manager
2683 Lock | extend
1112 LWLockTranche | buffer_content
72 LWLockTranche | wal_insert
8 LWLockTranche | buffer_mapping
2 LWLockTranche | proc
2 BufferPin | BufferPin

Test2: unlogged table, 96 clients
------------------------------------------
On head:
tps = 58632.065563 (including connections establishing)
tps = 58632.767384 (excluding connections establishing)
77039 LWLockNamed | CLogControlLock
39712 Client | ClientRead
18358 Lock | transactionid
4238 LWLockNamed | XidGenLock
3638
3518 LWLockTranche | buffer_content
2717 LWLockNamed | ProcArrayLock
1410 Lock | tuple
792 Lock | extend
182 LWLockTranche | lock_manager
30 LWLockTranche | wal_insert
3 LWLockTranche | buffer_mapping
1 Tuples only is on.
1 BufferPin | BufferPin

With Patch:
tps = 75204.166640 (including connections establishing)
tps = 75204.922105 (excluding connections establishing)
[dilip(dot)kumar(at)power2 bin]$ cat out_300_96_ul.txt
261917 |
53407 Client | ClientRead
14994 Lock | transactionid
5258 LWLockNamed | XidGenLock
3660
3604 LWLockNamed | ProcArrayLock
2096 LWLockNamed | CLogControlLock
1102 Lock | tuple
823 Lock | extend
481 LWLockTranche | buffer_content
372 LWLockTranche | lock_manager
192 Lock | relation
65 LWLockTranche | wal_insert
6 LWLockTranche | buffer_mapping
1 Tuples only is on.
1 LWLockTranche | proc

Test3: unlogged table, 64 clients
------------------------------------------
On Head:

tps = 66231.203018 (including connections establishing)
tps = 66231.664990 (excluding connections establishing)

43446 Client | ClientRead
6992 LWLockNamed | CLogControlLock
4685 Lock | transactionid
3650
3381 LWLockNamed | ProcArrayLock
810 LWLockNamed | XidGenLock
734 Lock | extend
439 LWLockTranche | buffer_content
247 Lock | tuple
136 LWLockTranche | lock_manager
64 Lock | relation
24 LWLockTranche | wal_insert
2 LWLockTranche | buffer_mapping
1 Tuples only is on.

With Patch:
tps = 67294.042602 (including connections establishing)
tps = 67294.532650 (excluding connections establishing)

28186 Client | ClientRead
3655
1172 LWLockNamed | ProcArrayLock
619 Lock | transactionid
289 LWLockNamed | CLogControlLock
237 Lock | extend
81 LWLockTranche | buffer_content
48 LWLockNamed | XidGenLock
28 LWLockTranche | lock_manager
23 Lock | tuple
6 LWLockTranche | wal_insert

Test4: unlogged table, 32 clients

Head:
tps = 52320.190549 (including connections establishing)
tps = 52320.442694 (excluding connections establishing)

28564 Client | ClientRead
3663
1320 LWLockNamed | ProcArrayLock
742 Lock | transactionid
534 LWLockNamed | CLogControlLock
255 Lock | extend
108 LWLockNamed | XidGenLock
81 LWLockTranche | buffer_content
44 LWLockTranche | lock_manager
29 Lock | tuple
6 LWLockTranche | wal_insert
1 Tuples only is on.
1 LWLockTranche | buffer_mapping

With Patch:
tps = 47505.582315 (including connections establishing)
tps = 47505.773351 (excluding connections establishing)

28186 Client | ClientRead
3655
1172 LWLockNamed | ProcArrayLock
619 Lock | transactionid
289 LWLockNamed | CLogControlLock
237 Lock | extend
81 LWLockTranche | buffer_content
48 LWLockNamed | XidGenLock
28 LWLockTranche | lock_manager
23 Lock | tuple
6 LWLockTranche | wal_insert

I think that at higher client counts (96 onwards) the contention on
CLogControlLock is clearly visible, and it is completely solved with
the group lock patch.

At lower client counts (32, 64) the contention on CLogControlLock is
not significant, hence we cannot see any gain with the group lock patch
(though we can see that the contention on CLogControlLock is somewhat
reduced at 64 clients).

Note: I have taken only one set of readings here, and at 32 clients my
reading shows some regression with the group lock patch, which may be
run-to-run variance (I never saw this regression earlier; I can confirm
again with multiple runs).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
benchmark_machine_info.txt text/plain 607 bytes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-12 18:55:26
Message-ID: CA+TgmobjO01bJ2caVWOzU=PuoKfx1X6epGokhHD8fc9hVwkhKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I think at higher client count from client count 96 onwards contention
> on CLogControlLock is clearly visible and which is completely solved
> with group lock patch.
>
> And at lower client count 32,64 contention on CLogControlLock is not
> significant hence we can not see any gain with group lock patch.
> (though we can see some contention on CLogControlLock is reduced at 64
> clients.)

I agree with these conclusions. I had a chance to talk with Andres
this morning at Postgres Vision and based on that conversation I'd
like to suggest a couple of additional tests:

1. Repeat this test on x86. In particular, I think you should test on
the EnterpriseDB server cthulhu, which is an 8-socket x86 server.

2. Repeat this test with a mixed read-write workload, like -b
tpcb-like(at)1 -b select-only(at)9
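
For instance, something along these lines ("(at)" above stands for "@";
client count, duration, and database name are just placeholders):

pgbench -M prepared -c 192 -j 192 -T 1800 -b tpcb-like@1 -b select-only@9 postgres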

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-13 02:23:24
Message-ID: 4aafed4c-8002-9a9c-0d15-73d19ab7e317@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/2016 08:55 PM, Robert Haas wrote:
> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> I think at higher client count from client count 96 onwards contention
>> on CLogControlLock is clearly visible and which is completely solved
>> with group lock patch.
>>
>> And at lower client count 32,64 contention on CLogControlLock is not
>> significant hence we can not see any gain with group lock patch.
>> (though we can see some contention on CLogControlLock is reduced at 64
>> clients.)
>
> I agree with these conclusions. I had a chance to talk with Andres
> this morning at Postgres Vision and based on that conversation I'd
> like to suggest a couple of additional tests:
>
> 1. Repeat this test on x86. In particular, I think you should test on
> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> 2. Repeat this test with a mixed read-write workload, like -b
> tpcb-like(at)1 -b select-only(at)9
>

FWIW, I'm already running similar benchmarks on an x86 machine with 72
cores (144 with HT). It's "just" a 4-socket system, but the results I
got so far seem quite interesting. The tooling and results (pushed
incrementally) are available here:

https://bitbucket.org/tvondra/hp05-results/overview

The tooling is completely automated, and it also collects various stats,
like for example the wait events. So perhaps we could simply run it on
cthulhu and get comparable results, and also more thorough data sets than
just snippets posted to the list?

There's also a bunch of reports for the 5 already completed runs

- dilip-300-logged-sync
- dilip-300-unlogged-sync
- pgbench-300-logged-sync-skip
- pgbench-300-unlogged-sync-noskip
- pgbench-300-unlogged-sync-skip

The name identifies the workload type, scale and whether the tables are
wal-logged (for pgbench the "skip" means "-N" while "noskip" does
regular pgbench).
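
In other words, the two pgbench variants are roughly the following
(client count, duration and database name being placeholders):

# "noskip": regular tpcb-like pgbench run
pgbench -M prepared -c 72 -j 72 -T 300 postgres
# "skip": the same run with -N, i.e. skipping the updates to the
# pgbench_tellers and pgbench_branches tables
pgbench -M prepared -N -c 72 -j 72 -T 300 postgres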

For example the "reports/wait-events-count-patches.txt" compares the
wait even stats with different patches applied (and master):

https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default

and average tps (from 3 runs, 5 minutes each):

https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default

There are certainly interesting bits. For example, while the "logged"
case is dominated by WALWriteLock for most client counts, for large
client counts that's no longer true.

Consider for example dilip-300-logged-sync results with 216 clients:

wait_event | master | gran_lock | no_cont_lock | group_upd
--------------------+---------+-----------+--------------+-----------
CLogControlLock | 624566 | 474261 | 458599 | 225338
WALWriteLock | 431106 | 623142 | 619596 | 699224
| 331542 | 358220 | 371393 | 537076
buffer_content | 261308 | 134764 | 138664 | 102057
ClientRead | 59826 | 100883 | 103609 | 118379
transactionid | 26966 | 23155 | 23815 | 31700
ProcArrayLock | 3967 | 3852 | 4070 | 4576
wal_insert | 3948 | 10430 | 9513 | 12079
clog | 1710 | 4006 | 2443 | 925
XidGenLock | 1689 | 3785 | 4229 | 3539
tuple | 965 | 617 | 655 | 840
lock_manager | 300 | 571 | 619 | 802
WALBufMappingLock | 168 | 140 | 158 | 147
SubtransControlLock | 60 | 115 | 124 | 105

Clearly, CLOG is an issue here, and it's (slightly) improved by all the
patches (group_update performing the best). And with 288 clients (which
is 2x the number of virtual cores in the machine, so not entirely crazy)
you get this:

wait_event | master | gran_lock | no_cont_lock | group_upd
--------------------+---------+-----------+--------------+-----------
CLogControlLock | 901670 | 736822 | 728823 | 398111
buffer_content | 492637 | 318129 | 319251 | 270416
WALWriteLock | 414371 | 593804 | 589809 | 656613
| 380344 | 452936 | 470178 | 745790
ClientRead | 60261 | 111367 | 111391 | 126151
transactionid | 43627 | 34585 | 35464 | 48679
wal_insert | 5423 | 29323 | 25898 | 30191
ProcArrayLock | 4379 | 3918 | 4006 | 4582
clog | 2952 | 9135 | 5304 | 2514
XidGenLock | 2182 | 9488 | 8894 | 8595
tuple | 2176 | 1288 | 1409 | 1821
lock_manager | 323 | 797 | 827 | 1006
WALBufMappingLock | 124 | 124 | 146 | 206
SubtransControlLock | 85 | 146 | 170 | 120

So even buffer_content gets ahead of the WALWriteLock. I wonder whether
this might be because of only having 128 buffers for clog pages, causing
contention on this system (surely, systems with 144 cores were not that
common when the 128 limit was introduced).

So the patch has a positive impact even with WAL, as illustrated by the
tps improvements (for large client counts):

clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
36 | 39725 | 39627 | 41203 | 39763
72 | 70533 | 65795 | 65602 | 66195
108 | 81664 | 87415 | 86896 | 87199
144 | 68950 | 98054 | 98266 | 102834
180 | 105741 | 109827 | 109201 | 113911
216 | 62789 | 92193 | 90586 | 98995
252 | 94243 | 102368 | 100663 | 107515
288 | 57895 | 83608 | 82556 | 91738

I find the tps fluctuation intriguing, and I'd like to see that fixed
before committing any of the patches.

For pgbench-300-logged-sync-skip (the other WAL-logging test already
completed), the CLOG contention is also reduced significantly, but the
tps did not improve as significantly.

For the unlogged case (dilip-300-unlogged-sync), the results are
fairly similar - CLogControlLock and buffer_content dominating the wait
event profiles (WALWriteLock is missing, of course), and the average tps
fluctuates in almost exactly the same way.

Interestingly, there is no such fluctuation for the pgbench tests. For
example for pgbench-300-unlogged-sync-skip (i.e. pgbench -N) the result
is this:

clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
36 | 147265 | 148663 | 148985 | 146559
72 | 162645 | 209070 | 207841 | 204588
108 | 135785 | 219982 | 218111 | 217588
144 | 113979 | 228683 | 228953 | 226934
180 | 96930 | 230161 | 230316 | 227156
216 | 89068 | 224241 | 226524 | 225805
252 | 78203 | 222507 | 225636 | 224810
288 | 63999 | 204524 | 225469 | 220098

That's a fairly significant improvement, and the behavior is very
smooth. Sadly, with WAL logging (pgbench-300-logged-sync-skip) the tps
drops back to master, mostly thanks to WALWriteLock.

Another interesting aspect of the patches is the impact on variability of
results - for example looking at dilip-300-unlogged-sync, the overall
average tps (for the three runs combined) and for each of the three runs
looks like this:

clients | avg_tps | tps_1 | tps_2 | tps_3
---------+---------+-----------+-----------+-----------
36 | 117332 | 115042 | 116125 | 120841
72 | 90917 | 72451 | 119915 | 80319
108 | 96070 | 106105 | 73606 | 108580
144 | 81422 | 71094 | 102109 | 71063
180 | 88537 | 98871 | 67756 | 99021
216 | 75962 | 65584 | 96365 | 66010
252 | 59941 | 57771 | 64756 | 57289
288 | 80851 | 93005 | 56454 | 93313

Notice the variability between the runs - the difference between min and
max is often more than 40%. Now compare it to results with the
"group-update" patch applied:

clients | avg_tps | tps_1 | tps_2 | tps_3
---------+---------+-----------+-----------+-----------
36 | 116273 | 117031 | 116005 | 115786
72 | 145273 | 147166 | 144839 | 143821
108 | 89892 | 89957 | 89585 | 90133
144 | 130176 | 130310 | 130565 | 129655
180 | 81944 | 81927 | 81951 | 81953
216 | 124415 | 124367 | 123228 | 125651
252 | 76723 | 76467 | 77266 | 76436
288 | 120072 | 121205 | 119731 | 119283

In this case there's pretty much no cross-run variability, the
differences are usually within 2%, so basically random noise. (There's
of course the variability depending on client count, but that was
already mentioned).

There's certainly much more interesting stuff in the results, but I
don't have time for more thorough analysis now - I only intended to do
some "quick benchmarking" on the patch, and I've already spent days on
this, and I have other things to do.

I'll take care of collecting data for the remaining cases on this
machine (and possibly running the same tests on the other one, if I
manage to get access to it again). But I'll leave further analysis of
the collected data up to the patch authors, or some volunteers.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-15 07:43:03
Message-ID: CAA4eK1J9VxJUnpOiQDf0O=Z87QUMbw=uGcQr4EaGbHSCibx9yA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 13, 2016 at 7:53 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 10/12/2016 08:55 PM, Robert Haas wrote:
>> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>>> I think at higher client count from client count 96 onwards contention
>>> on CLogControlLock is clearly visible and which is completely solved
>>> with group lock patch.
>>>
>>> And at lower client count 32,64 contention on CLogControlLock is not
>>> significant hence we can not see any gain with group lock patch.
>>> (though we can see some contention on CLogControlLock is reduced at 64
>>> clients.)
>>
>> I agree with these conclusions. I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86. In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>>
>> 2. Repeat this test with a mixed read-write workload, like -b
>> tpcb-like(at)1 -b select-only(at)9
>>
>
> FWIW, I'm already running similar benchmarks on an x86 machine with 72
> cores (144 with HT). It's "just" a 4-socket system, but the results I
> got so far seem quite interesting. The tooling and results (pushed
> incrementally) are available here:
>
> https://bitbucket.org/tvondra/hp05-results/overview
>
> The tooling is completely automated, and it also collects various stats,
> like for example the wait event. So perhaps we could simply run it on
> ctulhu and get comparable results, and also more thorough data sets than
> just snippets posted to the list?
>
> There's also a bunch of reports for the 5 already completed runs
>
> - dilip-300-logged-sync
> - dilip-300-unlogged-sync
> - pgbench-300-logged-sync-skip
> - pgbench-300-unlogged-sync-noskip
> - pgbench-300-unlogged-sync-skip
>
> The name identifies the workload type, scale and whether the tables are
> wal-logged (for pgbench the "skip" means "-N" while "noskip" does
> regular pgbench).
>
> For example the "reports/wait-events-count-patches.txt" compares the
> wait even stats with different patches applied (and master):
>
> https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default
>
> and average tps (from 3 runs, 5 minutes each):
>
> https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default
>
> There are certainly interesting bits. For example while the "logged"
> case is dominated y WALWriteLock for most client counts, for large
> client counts that's no longer true.
>
> Consider for example dilip-300-logged-sync results with 216 clients:
>
> wait_event | master | gran_lock | no_cont_lock | group_upd
> --------------------+---------+-----------+--------------+-----------
> CLogControlLock | 624566 | 474261 | 458599 | 225338
> WALWriteLock | 431106 | 623142 | 619596 | 699224
> | 331542 | 358220 | 371393 | 537076
> buffer_content | 261308 | 134764 | 138664 | 102057
> ClientRead | 59826 | 100883 | 103609 | 118379
> transactionid | 26966 | 23155 | 23815 | 31700
> ProcArrayLock | 3967 | 3852 | 4070 | 4576
> wal_insert | 3948 | 10430 | 9513 | 12079
> clog | 1710 | 4006 | 2443 | 925
> XidGenLock | 1689 | 3785 | 4229 | 3539
> tuple | 965 | 617 | 655 | 840
> lock_manager | 300 | 571 | 619 | 802
> WALBufMappingLock | 168 | 140 | 158 | 147
> SubtransControlLock | 60 | 115 | 124 | 105
>
> Clearly, CLOG is an issue here, and it's (slightly) improved by all the
> patches (group_update performing the best). And with 288 clients (which
> is 2x the number of virtual cores in the machine, so not entirely crazy)
> you get this:
>
> wait_event | master | gran_lock | no_cont_lock | group_upd
> --------------------+---------+-----------+--------------+-----------
> CLogControlLock | 901670 | 736822 | 728823 | 398111
> buffer_content | 492637 | 318129 | 319251 | 270416
> WALWriteLock | 414371 | 593804 | 589809 | 656613
> | 380344 | 452936 | 470178 | 745790
> ClientRead | 60261 | 111367 | 111391 | 126151
> transactionid | 43627 | 34585 | 35464 | 48679
> wal_insert | 5423 | 29323 | 25898 | 30191
> ProcArrayLock | 4379 | 3918 | 4006 | 4582
> clog | 2952 | 9135 | 5304 | 2514
> XidGenLock | 2182 | 9488 | 8894 | 8595
> tuple | 2176 | 1288 | 1409 | 1821
> lock_manager | 323 | 797 | 827 | 1006
> WALBufMappingLock | 124 | 124 | 146 | 206
> SubtransControlLock | 85 | 146 | 170 | 120
>
> So even buffer_content gets ahead of the WALWriteLock. I wonder whether
> this might be because of only having 128 buffers for clog pages, causing
> contention on this system (surely, systems with 144 cores were not that
> common when the 128 limit was introduced).
>

Not sure, but I have checked that if we increase clog buffers beyond
128, it causes a dip in performance on read-write workloads in some
cases. Apart from that, from the above results it is quite clear that
the patches help in significantly reducing the CLogControlLock
contention, with the group-update patch consistently better, probably
because this workload is more contended on writing the transaction
status.

> So the patch has positive impact even with WAL, as illustrated by tps
> improvements (for large client counts):
>
> clients | master | gran_locking | no_content_lock | group_update
> ---------+--------+--------------+-----------------+--------------
> 36 | 39725 | 39627 | 41203 | 39763
> 72 | 70533 | 65795 | 65602 | 66195
> 108 | 81664 | 87415 | 86896 | 87199
> 144 | 68950 | 98054 | 98266 | 102834
> 180 | 105741 | 109827 | 109201 | 113911
> 216 | 62789 | 92193 | 90586 | 98995
> 252 | 94243 | 102368 | 100663 | 107515
> 288 | 57895 | 83608 | 82556 | 91738
>
> I find the tps fluctuation intriguing, and I'd like to see that fixed
> before committing any of the patches.
>

I have checked the wait event results where there is more fluctuation:

          test           | clients | wait_event_type |   wait_event    | master | granular_locking | no_content_lock | group_update
-------------------------+---------+-----------------+-----------------+--------+------------------+-----------------+--------------
 dilip-300-unlogged-sync |     108 | LWLockNamed     | CLogControlLock | 343526 |           502127 |          479937 |       301381
 dilip-300-unlogged-sync |     180 | LWLockNamed     | CLogControlLock | 557639 |           835567 |          795403 |       512707

So, if I read the above results correctly, they show that group-update
has helped slightly to reduce the contention. One probable reason could
be that on such a workload we need to update the clog status on
different clog pages more frequently, and may also need to perform disk
page reads for clog pages, so the benefit of grouping will certainly be
less. This is because page read requests get serialized and only the
leader backend performs all such requests. Robert pointed out a
somewhat similar case upthread [1], and I had modified the patch to use
multiple slots (groups) for group transaction status updates [2], but
we didn't pursue it because it didn't show any benefit on the pgbench
workload. However, maybe it can show some benefit here; if we can make
the above results reproducible and you think the above theory sounds
reasonable, then I can again modify the patch based on that idea.

Now, the story with the granular_locking and no_content_lock patches
seems to be worse, because they seem to be increasing the contention on
CLogControlLock rather than reducing it. I think one probable reason
this could happen for both approaches is that they frequently need to
release the CLogControlLock acquired in Shared mode and reacquire it in
Exclusive mode when the clog page to modify is not in a buffer (an
update to a different clog page than the one currently in the buffer),
and then once again they need to release CLogControlLock to read the
clog page from disk and acquire it again in Exclusive mode. This
frequent release-acquire of CLogControlLock in different modes could
lead to a significant increase in contention. It is slightly worse for
the granular_locking patch as it needs one additional lock
(buffer_content_lock) in Exclusive mode after acquiring
CLogControlLock. Offhand, I could not see a way to reduce the
contention with the granular_locking and no_content_lock patches.

So, the crux is that we are seeing more variability in some of the
results because of frequent accesses to different clog pages, which is
not so easy to predict, but I think it is quite possible at ~100,000 tps.

>
> There's certainly much more interesting stuff in the results, but I
> don't have time for more thorough analysis now - I only intended to do
> some "quick benchmarking" on the patch, and I've already spent days on
> this, and I have other things to do.
>

Thanks a ton for doing such detailed testing.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoahCx6XgprR%3Dp5%3D%3DcF0g9uhSHsJxVdWdUEHN9H2Mv0gkw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2BSoW3FBrdZV%2B3m34uCByK3DMPy_9QQs34yvN8spByzyA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-20 07:36:00
Message-ID: CAFiTN-taV4iVkPHrxg=YCicKjBS6=QZm_cM4hbS_2q2ryLhUUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I agree with these conclusions. I had a chance to talk with Andres
> this morning at Postgres Vision and based on that conversation I'd
> like to suggest a couple of additional tests:
>
> 1. Repeat this test on x86. In particular, I think you should test on
> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.

I have done my test on cthulhu. The basic difference is that on POWER
we saw ClogControlLock on top at 96 and more clients with the 300 scale
factor, but on cthulhu at the 300 scale factor the transactionid lock
is always on top. So I repeated my test with the 1000 scale factor as
well on cthulhu.

All configuration is the same as in my last test.

Test with 1000 scale factor
-------------------------------------

Test1: number of clients: 192

Head:
tps = 21206.108856 (including connections establishing)
tps = 21206.245441 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 1000_192_ul.txt
310489 LWLockNamed | CLogControlLock
296152 |
35537 Lock | transactionid
15821 LWLockTranche | buffer_mapping
10342 LWLockTranche | buffer_content
8427 LWLockTranche | clog
3961
3165 Lock | extend
2861 Lock | tuple
2781 LWLockNamed | ProcArrayLock
1104 LWLockNamed | XidGenLock
745 LWLockTranche | lock_manager
371 LWLockNamed | CheckpointerCommLock
70 LWLockTranche | wal_insert
5 BufferPin | BufferPin
3 LWLockTranche | proc

Patch:
tps = 28725.038933 (including connections establishing)
tps = 28725.367102 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 1000_192_ul.txt
540061 |
57810 LWLockNamed | CLogControlLock
36264 LWLockTranche | buffer_mapping
29976 Lock | transactionid
4770 Lock | extend
4735 LWLockTranche | clog
4479 LWLockNamed | ProcArrayLock
4006
3955 LWLockTranche | buffer_content
2505 LWLockTranche | lock_manager
2179 Lock | tuple
1977 LWLockNamed | XidGenLock
905 LWLockNamed | CheckpointerCommLock
222 LWLockTranche | wal_insert
8 LWLockTranche | proc

Test2: number of clients: 96

Head:
tps = 25447.861572 (including connections establishing)
tps = 25448.012739 (excluding connections establishing)
261611 |
69604 LWLockNamed | CLogControlLock
6119 Lock | transactionid
4008
2874 LWLockTranche | buffer_mapping
2578 LWLockTranche | buffer_content
2355 LWLockNamed | ProcArrayLock
1245 Lock | extend
1168 LWLockTranche | clog
232 Lock | tuple
217 LWLockNamed | CheckpointerCommLock
160 LWLockNamed | XidGenLock
158 LWLockTranche | lock_manager
78 LWLockTranche | wal_insert
5 BufferPin | BufferPin

Patch:
tps = 32708.368938 (including connections establishing)
tps = 32708.765989 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 1000_96_ul.txt
326601 |
7471 LWLockNamed | CLogControlLock
5387 Lock | transactionid
4018
3331 LWLockTranche | buffer_mapping
3144 LWLockNamed | ProcArrayLock
1372 Lock | extend
722 LWLockTranche | buffer_content
393 LWLockNamed | XidGenLock
237 LWLockTranche | lock_manager
234 Lock | tuple
194 LWLockTranche | clog
96 Lock | relation
88 LWLockTranche | wal_insert
34 LWLockNamed | CheckpointerCommLock

Test3: number of clients: 64

Head:

tps = 28264.194438 (including connections establishing)
tps = 28264.336270 (excluding connections establishing)

218264 |
10314 LWLockNamed | CLogControlLock
4019
2067 Lock | transactionid
1950 LWLockTranche | buffer_mapping
1879 LWLockNamed | ProcArrayLock
592 Lock | extend
565 LWLockTranche | buffer_content
222 LWLockNamed | XidGenLock
143 LWLockTranche | clog
131 LWLockNamed | CheckpointerCommLock
63 LWLockTranche | lock_manager
52 Lock | tuple
35 LWLockTranche | wal_insert

Patch:
tps = 27906.376194 (including connections establishing)
tps = 27906.531392 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 1000_64_ul.txt
228108 |
4039
2294 Lock | transactionid
2116 LWLockTranche | buffer_mapping
1757 LWLockNamed | ProcArrayLock
1553 LWLockNamed | CLogControlLock
800 Lock | extend
403 LWLockTranche | buffer_content
92 LWLockNamed | XidGenLock
74 LWLockTranche | lock_manager
42 Lock | tuple
35 LWLockTranche | wal_insert
34 LWLockTranche | clog
14 LWLockNamed | CheckpointerCommLock

Test4: number of clients: 32

Head:
tps = 27587.999912 (including connections establishing)
tps = 27588.119611 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 1000_32_ul.txt
117762 |
4031
614 LWLockNamed | ProcArrayLock
379 LWLockNamed | CLogControlLock
344 Lock | transactionid
183 Lock | extend
102 LWLockTranche | buffer_mapping
71 LWLockTranche | buffer_content
39 LWLockNamed | XidGenLock
25 LWLockTranche | lock_manager
3 LWLockTranche | wal_insert
3 LWLockTranche | clog
2 LWLockNamed | CheckpointerCommLock
2 Lock | tuple

Patch:
tps = 28291.428848 (including connections establishing)
tps = 28291.586435 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 1000_32_ul.txt
116596 |
4041
757 LWLockNamed | ProcArrayLock
407 LWLockNamed | CLogControlLock
358 Lock | transactionid
183 Lock | extend
142 LWLockTranche | buffer_mapping
77 LWLockTranche | buffer_content
68 LWLockNamed | XidGenLock
35 LWLockTranche | lock_manager
15 LWLockTranche | wal_insert
7 LWLockTranche | clog
7 Lock | tuple
4 LWLockNamed | CheckpointerCommLock
1 Tuples only is on.

Summary:
- At 96 clients and more we can see ClogControlLock at the top.
- With the patch, contention on ClogControlLock is reduced significantly.
I think this behaviour is the same as what we saw on POWER.

With the 300 scale factor:
- Contention on ClogControlLock is significant only at 192 clients
(the transactionid lock is still on top), and it is completely removed
with the group lock patch.

For the 300 scale factor, I am posting data only at the 192 client count
(if anyone is interested in other data, I can post it).

Head:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 65930726
latency average: 5.242 ms
tps = 36621.827041 (including connections establishing)
tps = 36622.064081 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 300_192_ul.txt
437848 |
118966 Lock | transactionid
88869 LWLockNamed | CLogControlLock
18558 Lock | tuple
6183 LWLockTranche | buffer_content
5664 LWLockTranche | lock_manager
3995 LWLockNamed | ProcArrayLock
3646
1748 Lock | extend
1635 LWLockNamed | XidGenLock
401 LWLockTranche | wal_insert
33 BufferPin | BufferPin
5 LWLockTranche | proc
3 LWLockTranche | buffer_mapping

Patch:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 82616270
latency average: 4.183 ms
tps = 45894.737813 (including connections establishing)
tps = 45894.995634 (excluding connections establishing)
120372 Lock | transactionid
16346 Lock | tuple
7489 LWLockTranche | lock_manager
4514 LWLockNamed | ProcArrayLock
3632
3310 LWLockNamed | CLogControlLock
2287 LWLockNamed | XidGenLock
2271 Lock | extend
709 LWLockTranche | buffer_content
490 LWLockTranche | wal_insert
30 BufferPin | BufferPin
10 LWLockTranche | proc
6 LWLockTranche | buffer_mapping

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-20 15:33:21
Message-ID: acf46406-cf6d-41fe-4118-4e1c960b4790@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/20/2016 09:36 AM, Dilip Kumar wrote:
> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I agree with these conclusions. I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86. In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> I have done my test on cthulhu, basic difference is that In POWER we
> saw ClogControlLock on top at 96 and more client with 300 scale
> factor. But, on cthulhu at 300 scale factor transactionid lock is
> always on top. So I repeated my test with 1000 scale factor as well on
> cthulhu.
>
> All configuration is same as my last test.
>
> Test with 1000 scale factor
> -------------------------------------
>
> Test1: number of clients: 192
>
> Head:
> tps = 21206.108856 (including connections establishing)
> tps = 21206.245441 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_192_ul.txt
> 310489 LWLockNamed | CLogControlLock
> 296152 |
> 35537 Lock | transactionid
> 15821 LWLockTranche | buffer_mapping
> 10342 LWLockTranche | buffer_content
> 8427 LWLockTranche | clog
> 3961
> 3165 Lock | extend
> 2861 Lock | tuple
> 2781 LWLockNamed | ProcArrayLock
> 1104 LWLockNamed | XidGenLock
> 745 LWLockTranche | lock_manager
> 371 LWLockNamed | CheckpointerCommLock
> 70 LWLockTranche | wal_insert
> 5 BufferPin | BufferPin
> 3 LWLockTranche | proc
>
> Patch:
> tps = 28725.038933 (including connections establishing)
> tps = 28725.367102 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_192_ul.txt
> 540061 |
> 57810 LWLockNamed | CLogControlLock
> 36264 LWLockTranche | buffer_mapping
> 29976 Lock | transactionid
> 4770 Lock | extend
> 4735 LWLockTranche | clog
> 4479 LWLockNamed | ProcArrayLock
> 4006
> 3955 LWLockTranche | buffer_content
> 2505 LWLockTranche | lock_manager
> 2179 Lock | tuple
> 1977 LWLockNamed | XidGenLock
> 905 LWLockNamed | CheckpointerCommLock
> 222 LWLockTranche | wal_insert
> 8 LWLockTranche | proc
>
> Test2: number of clients: 96
>
> Head:
> tps = 25447.861572 (including connections establishing)
> tps = 25448.012739 (excluding connections establishing)
> 261611 |
> 69604 LWLockNamed | CLogControlLock
> 6119 Lock | transactionid
> 4008
> 2874 LWLockTranche | buffer_mapping
> 2578 LWLockTranche | buffer_content
> 2355 LWLockNamed | ProcArrayLock
> 1245 Lock | extend
> 1168 LWLockTranche | clog
> 232 Lock | tuple
> 217 LWLockNamed | CheckpointerCommLock
> 160 LWLockNamed | XidGenLock
> 158 LWLockTranche | lock_manager
> 78 LWLockTranche | wal_insert
> 5 BufferPin | BufferPin
>
> Patch:
> tps = 32708.368938 (including connections establishing)
> tps = 32708.765989 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_96_ul.txt
> 326601 |
> 7471 LWLockNamed | CLogControlLock
> 5387 Lock | transactionid
> 4018
> 3331 LWLockTranche | buffer_mapping
> 3144 LWLockNamed | ProcArrayLock
> 1372 Lock | extend
> 722 LWLockTranche | buffer_content
> 393 LWLockNamed | XidGenLock
> 237 LWLockTranche | lock_manager
> 234 Lock | tuple
> 194 LWLockTranche | clog
> 96 Lock | relation
> 88 LWLockTranche | wal_insert
> 34 LWLockNamed | CheckpointerCommLock
>
> Test3: number of clients: 64
>
> Head:
>
> tps = 28264.194438 (including connections establishing)
> tps = 28264.336270 (excluding connections establishing)
>
> 218264 |
> 10314 LWLockNamed | CLogControlLock
> 4019
> 2067 Lock | transactionid
> 1950 LWLockTranche | buffer_mapping
> 1879 LWLockNamed | ProcArrayLock
> 592 Lock | extend
> 565 LWLockTranche | buffer_content
> 222 LWLockNamed | XidGenLock
> 143 LWLockTranche | clog
> 131 LWLockNamed | CheckpointerCommLock
> 63 LWLockTranche | lock_manager
> 52 Lock | tuple
> 35 LWLockTranche | wal_insert
>
> Patch:
> tps = 27906.376194 (including connections establishing)
> tps = 27906.531392 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_64_ul.txt
> 228108 |
> 4039
> 2294 Lock | transactionid
> 2116 LWLockTranche | buffer_mapping
> 1757 LWLockNamed | ProcArrayLock
> 1553 LWLockNamed | CLogControlLock
> 800 Lock | extend
> 403 LWLockTranche | buffer_content
> 92 LWLockNamed | XidGenLock
> 74 LWLockTranche | lock_manager
> 42 Lock | tuple
> 35 LWLockTranche | wal_insert
> 34 LWLockTranche | clog
> 14 LWLockNamed | CheckpointerCommLock
>
> Test4: number of clients: 32
>
> Head:
> tps = 27587.999912 (including connections establishing)
> tps = 27588.119611 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_32_ul.txt
> 117762 |
> 4031
> 614 LWLockNamed | ProcArrayLock
> 379 LWLockNamed | CLogControlLock
> 344 Lock | transactionid
> 183 Lock | extend
> 102 LWLockTranche | buffer_mapping
> 71 LWLockTranche | buffer_content
> 39 LWLockNamed | XidGenLock
> 25 LWLockTranche | lock_manager
> 3 LWLockTranche | wal_insert
> 3 LWLockTranche | clog
> 2 LWLockNamed | CheckpointerCommLock
> 2 Lock | tuple
>
> Patch:
> tps = 28291.428848 (including connections establishing)
> tps = 28291.586435 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_32_ul.txt
> 116596 |
> 4041
> 757 LWLockNamed | ProcArrayLock
> 407 LWLockNamed | CLogControlLock
> 358 Lock | transactionid
> 183 Lock | extend
> 142 LWLockTranche | buffer_mapping
> 77 LWLockTranche | buffer_content
> 68 LWLockNamed | XidGenLock
> 35 LWLockTranche | lock_manager
> 15 LWLockTranche | wal_insert
> 7 LWLockTranche | clog
> 7 Lock | tuple
> 4 LWLockNamed | CheckpointerCommLock
> 1 Tuples only is on.
>
> Summary:
> - At 96 and more clients count we can see ClogControlLock at the top.
> - With patch contention on ClogControlLock is reduced significantly.
> I think these behaviours are same as we saw on power.
>
> With 300 scale factor:
> - Contention on ClogControlLock is significant only at 192 client
> (still transaction id lock is on top), Which is completely removed
> with group lock patch.
>
> For 300 scale factor, I am posting data only at 192 client count (If
> anyone interested in other data I can post).
>

In the results you've posted on 10/12, you've mentioned a regression
with 32 clients, where you got 52k tps on master but only 48k tps with
the patch (so ~10% difference). I have no idea what scale was used for
those tests, and I see no such regression in the current results (but
you only report results for some of the client counts).

Also, which of the proposed patches have you been testing?

Can you collect and share a more complete set of data, perhaps based on
the scripts I use to do tests on the large machine with 36/72 cores,
available at https://bitbucket.org/tvondra/hp05-results ?

I've taken some time to build simple web-based reports from the
results collected so far (also included in the git repository), and
pushed them here:

http://tvondra.bitbucket.org

For each of the completed runs, there's a report comparing tps for
different client counts with master and the three patches (average tps,
median and stddev), and it's possible to download a more thorough text
report with wait event stats, comparison of individual runs etc.

If you want to cooperate on this, I'm available - i.e. I can help you
get the tooling running, customize it etc.

Regarding the results collected on the "big machine" so far, I do have a
few observations:

pgbench / scale 300 (fits into 16GB shared buffers)
---------------------------------------------------
* in general, those results seem fine

* the results generally fall into 3 categories (I'll show results for
"pgbench -N" but regular pgbench behaves similarly):

(a) logged, sync_commit=on - no impact
http://tvondra.bitbucket.org/#pgbench-300-logged-sync-skip

(b) logged, sync_commit=off - improvement
http://tvondra.bitbucket.org/#pgbench-300-logged-async-skip

The throughput gets improved by ~20% with 72 clients, and then it
levels off (but does not drop, unlike on master). With high client
counts the difference is up to 300%, but people who care about
throughput won't run with such client counts anyway.

And not only does this improve throughput, it also significantly
reduces variability of the performance (i.e. measure throughput
each second and compute STDDEV of that - a sketch of doing exactly
that follows this list). You can imagine this as a much "smoother"
chart of tps over time.

(c) unlogged, sync_commit=* - improvement
http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

This is actually quite similar to (b).
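
To make the variability measurement concrete, here is a minimal shell
sketch of computing the per-second tps and its STDDEV (client count and
duration are example values, not necessarily the ones used for these runs):

    # run pgbench with per-second progress reports, keeping the output
    pgbench -N -M prepared -c 72 -j 72 -T 300 -P 1 postgres 2>&1 | tee run.log

    # mean and standard deviation of the per-second tps figures
    grep '^progress:' run.log | awk '{ s += $4; ss += $4 * $4; n++ }
        END { m = s / n; printf "mean %.1f stddev %.1f\n", m, sqrt(ss / n - m * m) }'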

dilip / scale 300 (fits into 16GB shared buffers)
-------------------------------------------------

* those results seem less OK

* I haven't found any significant regressions (in the sense of
significant performance drop compared to master), but the behavior in
some cases seems fairly strange (and it's repeatable)

* consider for example these results:

http://tvondra.bitbucket.org/#dilip-300-unlogged-async
http://tvondra.bitbucket.org/#dilip-300-logged-async

* the saw-like pattern is rather suspicious, and I don't think I've seen
anything like that before - I guess there's some feedback loop and we
better find it before committing any of the patches, because this is
something I don't want to see on any production machine (and I bet
neither do you)

* After looking into wait event details in the full text report at

http://tvondra.bitbucket.org/by-test/dilip-300-unlogged-async.txt

(section "wait events for dilip-300-unlogged-async (runs combined)")

I see that for pg-9.6-group-update, the summary of the statistics for
72, 108 and 144 clients (low - high - low) looks like this:

clients | wait_event_type | wait_event | wait_count | wait_pct
---------+-----------------+-----------------+------------+----------
72 | | | 374845 | 62.87
72 | Client | ClientRead | 136320 | 22.86
72 | LWLockNamed | CLogControlLock | 52804 | 8.86
72 | LWLockTranche | buffer_content | 15337 | 2.57
72 | LWLockNamed | XidGenLock | 7352 | 1.23
72 | LWLockNamed | ProcArrayLock | 6630 | 1.11

108 | | | 407179 | 46.01
108 | LWLockNamed | CLogControlLock | 300452 | 33.95
108 | LWLockTranche | buffer_content | 87597 | 9.90
108 | Client | ClientRead | 80901 | 9.14
108 | LWLockNamed | ProcArrayLock | 3290 | 0.37

144 | | | 623057 | 53.44
144 | LWLockNamed | CLogControlLock | 175072 | 15.02
144 | Client | ClientRead | 163451 | 14.02
144 | LWLockTranche | buffer_content | 147963 | 12.69
144 | LWLockNamed | XidGenLock | 38361 | 3.29
144 | Lock | transactionid | 8821 | 0.76

That is, there's a sudden jump in CLogControlLock from 22% to 33% and
then back to 15% (and for 180 clients it jumps back to ~35%). That's
pretty strange, and all the patches behave exactly the same.

scale 3000 (45GB), shared_buffers=16GB
---------------------------------------

For the small scale, the whole data set fits into 16GB shared buffers,
so there were pretty much no writes except for WAL and CLOG. For scale
3000 that's no longer true - the backends will compete for buffers and
will constantly write dirty buffers to page cache.

I hadn't realized this initially and the kernel was using the default
vm.dirty_* limits, i.e. 10% and 20%. As the machine has 3TB of RAM, this
resulted in rather excessive thresholds (or "insane" if you want), so the
kernel regularly accumulated up to ~15GB of dirty data and then wrote it
out in a very short period of time. Even though the machine has fairly
powerful storage (4GB write cache on the controller, 10 x 12Gbps SAS SSDs),
this led to pretty bad latency spikes / drops in throughput.

I've only done two runs with this configuration before realizing what's
happening; the results are illustrated here:

* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-high-dirty-bytes
*
http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-high-dirty-bytes

I'm not sure how important those results are (if throughput and smooth
behavior matter, tuning the kernel thresholds is a must), but what I
find interesting is that while the patches manage to improve throughput
by 10-20%, they also (quite significantly) increase variability of the
results (jitter in the tps over time). It's particularly visible on the
pgbench results. I'm not sure that's a good tradeoff.

After fixing the kernel page cache thresholds (by setting
background_bytes to 256MB to perform smooth write-out - see the sketch
after this list), the effect differs depending on the workload:

(a) dilip
http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync

- eliminates any impact of all the patches

(b) pgbench (-N)
http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

- By far the most severe regression observed during the testing.
With 36 clients the throughput drops by ~40%, which I think is
pretty bad. Also the results are much more variable with the
patches (compared to master).
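
The kernel tuning mentioned above amounts to roughly the following (the
256MB value for dirty_background_bytes is the one actually used; the
dirty_bytes value here is only an illustration):

    # switch from ratio-based to byte-based dirty-page limits, so the kernel
    # starts background write-out early instead of accumulating ~15GB bursts
    sysctl -w vm.dirty_background_bytes=268435456   # 256MB
    sysctl -w vm.dirty_bytes=2147483648             # 2GB hard limit (illustrative)

(Setting the *_bytes variants automatically disables the corresponding
*_ratio settings.)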

scale 3000 (45GB), shared_buffers=64GB
---------------------------------------

I've also done some tests with increased shared buffers, so that even
the large data set fits into them. Again, the results slightly depend on
the workload:

(a) dilip

* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-64
* http://tvondra.bitbucket.org/#dilip-3000-unlogged-async-64

Pretty much no impact on throughput or variability. Unlike on the
small data set, the patches don't even eliminate the performance
drop above 72 clients - the performance closely matches master.

(b) pgbench

* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-64
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-noskip-64

There's a small benefit (~20% on the same client count), and the
performance drop only happens after 72 clients. The patches also
significantly increase variability of the results, particularly for
large client counts.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-20 15:45:06
Message-ID: CA+TgmoYq5vrfd3fJORXzLXUx6y2TYFEyPPfWEQ7vHp9DRMrcsQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 20, 2016 at 3:36 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I agree with these conclusions. I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86. In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> I have done my test on cthulhu, basic difference is that In POWER we
> saw ClogControlLock on top at 96 and more client with 300 scale
> factor. But, on cthulhu at 300 scale factor transactionid lock is
> always on top. So I repeated my test with 1000 scale factor as well on
> cthulhu.

So the upshot appears to be that this problem is a lot worse on power2
than cthulhu, which suggests that this is architecture-dependent. I
guess it could also be kernel-dependent, but it doesn't seem likely,
because:

power2: Red Hat Enterprise Linux Server release 7.1 (Maipo),
3.10.0-229.14.1.ael7b.ppc64le
cthulhu: CentOS Linux release 7.2.1511 (Core), 3.10.0-229.7.2.el7.x86_64

So here's my theory. The whole reason why Tomas is having difficulty
seeing any big effect from these patches is because he's testing on
x86. When Dilip tests on x86, he doesn't see a big effect either,
regardless of workload. But when Dilip tests on POWER, which I think
is where he's mostly been testing, he sees a huge effect, because for
some reason POWER has major problems with this lock that don't exist
on x86.

If that's so, then we ought to be able to reproduce the big gains on
hydra, a community POWER server. In fact, I think I'll go run a quick
test over there right now...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-20 17:59:13
Message-ID: CA+TgmobJBv0qYEMazPEqsit4zkk_ECvafYdu8X=jAnVei0yaYg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 20, 2016 at 11:45 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Oct 20, 2016 at 3:36 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> I agree with these conclusions. I had a chance to talk with Andres
>>> this morning at Postgres Vision and based on that conversation I'd
>>> like to suggest a couple of additional tests:
>>>
>>> 1. Repeat this test on x86. In particular, I think you should test on
>>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>>
>> I have done my test on cthulhu, basic difference is that In POWER we
>> saw ClogControlLock on top at 96 and more client with 300 scale
>> factor. But, on cthulhu at 300 scale factor transactionid lock is
>> always on top. So I repeated my test with 1000 scale factor as well on
>> cthulhu.
>
> So the upshot appears to be that this problem is a lot worse on power2
> than cthulhu, which suggests that this is architecture-dependent. I
> guess it could also be kernel-dependent, but it doesn't seem likely,
> because:
>
> power2: Red Hat Enterprise Linux Server release 7.1 (Maipo),
> 3.10.0-229.14.1.ael7b.ppc64le
> cthulhu: CentOS Linux release 7.2.1511 (Core), 3.10.0-229.7.2.el7.x86_64
>
> So here's my theory. The whole reason why Tomas is having difficulty
> seeing any big effect from these patches is because he's testing on
> x86. When Dilip tests on x86, he doesn't see a big effect either,
> regardless of workload. But when Dilip tests on POWER, which I think
> is where he's mostly been testing, he sees a huge effect, because for
> some reason POWER has major problems with this lock that don't exist
> on x86.
>
> If that's so, then we ought to be able to reproduce the big gains on
> hydra, a community POWER server. In fact, I think I'll go run a quick
> test over there right now...

And ... nope. I ran a 30-minute pgbench test on unpatched master
using unlogged tables at scale factor 300 with 64 clients and got
these results:

14 LWLockTranche | wal_insert
36 LWLockTranche | lock_manager
45 LWLockTranche | buffer_content
223 Lock | tuple
527 LWLockNamed | CLogControlLock
921 Lock | extend
1195 LWLockNamed | XidGenLock
1248 LWLockNamed | ProcArrayLock
3349 Lock | transactionid
85957 Client | ClientRead
135935 |

I then started a run at 96 clients which I accidentally killed shortly
before it was scheduled to finish, but the results are not much
different; there is no hint of the runaway CLogControlLock contention
that Dilip sees on power2.
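
For anyone who wants to reproduce this, the run was roughly of this
shape (scale factor, client count and the 30-minute duration are as
described above; the remaining flags, including prepared mode, are
illustrative):

    # initialize with unlogged tables at scale 300, then run for 30 minutes
    pgbench -i --unlogged-tables -s 300 postgres
    pgbench -M prepared -c 64 -j 64 -T 1800 postgres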

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-20 20:04:58
Message-ID: ecb99330-cdcc-dd53-983f-03f787c01fa4@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/20/2016 07:59 PM, Robert Haas wrote:
> On Thu, Oct 20, 2016 at 11:45 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Oct 20, 2016 at 3:36 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>>> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> ...
>>
>> So here's my theory. The whole reason why Tomas is having difficulty
>> seeing any big effect from these patches is because he's testing on
>> x86. When Dilip tests on x86, he doesn't see a big effect either,
>> regardless of workload. But when Dilip tests on POWER, which I think
>> is where he's mostly been testing, he sees a huge effect, because for
>> some reason POWER has major problems with this lock that don't exist
>> on x86.
>>
>> If that's so, then we ought to be able to reproduce the big gains on
>> hydra, a community POWER server. In fact, I think I'll go run a quick
>> test over there right now...
>
> And ... nope. I ran a 30-minute pgbench test on unpatched master
> using unlogged tables at scale factor 300 with 64 clients and got
> these results:
>
> 14 LWLockTranche | wal_insert
> 36 LWLockTranche | lock_manager
> 45 LWLockTranche | buffer_content
> 223 Lock | tuple
> 527 LWLockNamed | CLogControlLock
> 921 Lock | extend
> 1195 LWLockNamed | XidGenLock
> 1248 LWLockNamed | ProcArrayLock
> 3349 Lock | transactionid
> 85957 Client | ClientRead
> 135935 |
>
> I then started a run at 96 clients which I accidentally killed shortly
> before it was scheduled to finish, but the results are not much
> different; there is no hint of the runaway CLogControlLock contention
> that Dilip sees on power2.
>

What shared_buffer size were you using? I assume the data set fit into
shared buffers, right?

FWIW as I explained in the lengthy post earlier today, I can actually
reproduce the significant CLogControlLock contention (and the patches do
reduce it), even on x86_64.

For example consider these two tests:

* http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
* http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

However, it seems I can also reproduce fairly bad regressions, like for
example this case with data set exceeding shared_buffers:

* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-21 01:01:02
Message-ID: CA+TgmoYyjH+pn6upFGSdCZR-z59Oo0Hjd_b2hLjGLe2nG+_WJw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I then started a run at 96 clients which I accidentally killed shortly
>> before it was scheduled to finish, but the results are not much
>> different; there is no hint of the runaway CLogControlLock contention
>> that Dilip sees on power2.
>>
> What shared_buffer size were you using? I assume the data set fit into
> shared buffers, right?

8GB.

> FWIW as I explained in the lengthy post earlier today, I can actually
> reproduce the significant CLogControlLock contention (and the patches do
> reduce it), even on x86_64.

/me goes back, rereads post. Sorry, I didn't look at this carefully
the first time.

> For example consider these two tests:
>
> * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
> * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>
> However, it seems I can also reproduce fairly bad regressions, like for
> example this case with data set exceeding shared_buffers:
>
> * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

I'm not sure how seriously we should take the regressions. I mean,
what I see there is that CLogControlLock contention goes down by about
50% -- which is the point of the patch -- and WALWriteLock contention
goes up dramatically -- which sucks, but can't really be blamed on the
patch except in the indirect sense that a backend can't spend much
time waiting for A if it's already spending all of its time waiting
for B. It would be nice to know why it happened, but we shouldn't
allow CLogControlLock to act as an admission control facility for
WALWriteLock (I think).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-21 02:27:41
Message-ID: CAFiTN-tCqDucbikxue0x0dwNOAa6yzjMxJNgXGQm90jVVRSj0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> In the results you've posted on 10/12, you've mentioned a regression with 32
> clients, where you got 52k tps on master but only 48k tps with the patch (so
> ~10% difference). I have no idea what scale was used for those tests,

That test was with scale factor 300 on POWER 4 socket machine. I think
I need to repeat this test with multiple reading to confirm it was
regression or run to run variation. I will do that soon and post the
results.

> and I
> see no such regression in the current results (but you only report results
> for some of the client counts).

This test is on the X86 8-socket machine. At 1000 scale factor I have
given readings for all client counts (32, 64, 96, 192), but at 300 scale
factor I posted only 192 clients because on this machine I did not see
much load on ClogControlLock at 300 scale factor.
>
> Also, which of the proposed patches have you been testing?
I tested with the GroupLock patch.

> Can you collect and share a more complete set of data, perhaps based on the
> scripts I use to do tests on the large machine with 36/72 cores, available
> at https://bitbucket.org/tvondra/hp05-results ?

I think from my last run I did not share data for the X86 8-socket
machine at 300 scale factor with 32, 64 and 96 clients. I already have
those data, so I am sharing them here. (Please let me know if you want
to see some other client count; for that I will need to run another test.)

Head:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 77233356
latency average: 0.746 ms
tps = 42907.363243 (including connections establishing)
tps = 42907.546190 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 300_32_ul.txt
111757 |
3666
1289 LWLockNamed | ProcArrayLock
1142 Lock | transactionid
318 LWLockNamed | CLogControlLock
299 Lock | extend
109 LWLockNamed | XidGenLock
70 LWLockTranche | buffer_content
35 Lock | tuple
29 LWLockTranche | lock_manager
14 LWLockTranche | wal_insert
1 Tuples only is on.
1 LWLockNamed | CheckpointerCommLock

Group Lock Patch:

scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 77544028
latency average: 0.743 ms
tps = 43079.783906 (including connections establishing)
tps = 43079.960331 (excluding connections establishing
112209 |
3718
1402 LWLockNamed | ProcArrayLock
1070 Lock | transactionid
245 LWLockNamed | CLogControlLock
188 Lock | extend
80 LWLockNamed | XidGenLock
76 LWLockTranche | buffer_content
39 LWLockTranche | lock_manager
31 Lock | tuple
7 LWLockTranche | wal_insert
1 Tuples only is on.
1 LWLockTranche | buffer_mapping

Head:
number of clients: 64
number of threads: 64
duration: 1800 s
number of transactions actually processed: 76211698
latency average: 1.512 ms
tps = 42339.731054 (including connections establishing)
tps = 42339.930464 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 300_64_ul.txt
215734 |
5106 Lock | transactionid
3754 LWLockNamed | ProcArrayLock
3669
3267 LWLockNamed | CLogControlLock
661 Lock | extend
339 LWLockNamed | XidGenLock
310 Lock | tuple
289 LWLockTranche | buffer_content
205 LWLockTranche | lock_manager
50 LWLockTranche | wal_insert
2 LWLockTranche | buffer_mapping
1 Tuples only is on.
1 LWLockTranche | proc

GroupLock patch:
scaling factor: 300
query mode: prepared
number of clients: 64
number of threads: 64
duration: 1800 s
number of transactions actually processed: 76629309
latency average: 1.503 ms
tps = 42571.704635 (including connections establishing)
tps = 42571.905157 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 300_64_ul.txt
217840 |
5197 Lock | transactionid
3744 LWLockNamed | ProcArrayLock
3663
966 Lock | extend
849 LWLockNamed | CLogControlLock
372 Lock | tuple
305 LWLockNamed | XidGenLock
199 LWLockTranche | buffer_content
184 LWLockTranche | lock_manager
35 LWLockTranche | wal_insert
1 Tuples only is on.
1 LWLockTranche | proc
1 LWLockTranche | buffer_mapping

Head:
scaling factor: 300
query mode: prepared
number of clients: 96
number of threads: 96
duration: 1800 s
number of transactions actually processed: 77663593
latency average: 2.225 ms
tps = 43145.624864 (including connections establishing)
tps = 43145.838167 (excluding connections establishing)

302317 |
18836 Lock | transactionid
12912 LWLockNamed | CLogControlLock
4120 LWLockNamed | ProcArrayLock
3662
1700 Lock | tuple
1305 Lock | extend
1030 LWLockTranche | buffer_content
828 LWLockTranche | lock_manager
730 LWLockNamed | XidGenLock
107 LWLockTranche | wal_insert
4 LWLockTranche | buffer_mapping
1 Tuples only is on.
1 LWLockTranche | proc
1 BufferPin | BufferPin

Group Lock Patch:
scaling factor: 300
query mode: prepared
number of clients: 96
number of threads: 96
duration: 1800 s
number of transactions actually processed: 61608756
latency average: 2.805 ms
tps = 44385.885080 (including connections establishing)
tps = 44386.297364 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 300_96_ul.txt
237842 |
14379 Lock | transactionid
3335 LWLockNamed | ProcArrayLock
2850
1374 LWLockNamed | CLogControlLock
1200 Lock | tuple
992 Lock | extend
717 LWLockNamed | XidGenLock
625 LWLockTranche | lock_manager
259 LWLockTranche | buffer_content
105 LWLockTranche | wal_insert
4 LWLockTranche | buffer_mapping
2 LWLockTranche | proc

Head:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 65930726
latency average: 5.242 ms
tps = 36621.827041 (including connections establishing)
tps = 36622.064081 (excluding connections establishing)
[dilip(dot)kumar(at)cthulhu bin]$ cat 300_192_ul.txt
437848 |
118966 Lock | transactionid
88869 LWLockNamed | CLogControlLock
18558 Lock | tuple
6183 LWLockTranche | buffer_content
5664 LWLockTranche | lock_manager
3995 LWLockNamed | ProcArrayLock
3646
1748 Lock | extend
1635 LWLockNamed | XidGenLock
401 LWLockTranche | wal_insert
33 BufferPin | BufferPin
5 LWLockTranche | proc
3 LWLockTranche | buffer_mapping

GroupLock Patch:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 82616270
latency average: 4.183 ms
tps = 45894.737813 (including connections establishing)
tps = 45894.995634 (excluding connections establishing)
120372 Lock | transactionid
16346 Lock | tuple
7489 LWLockTranche | lock_manager
4514 LWLockNamed | ProcArrayLock
3632
3310 LWLockNamed | CLogControlLock
2287 LWLockNamed | XidGenLock
2271 Lock | extend
709 LWLockTranche | buffer_content
490 LWLockTranche | wal_insert
30 BufferPin | BufferPin
10 LWLockTranche | proc
6 LWLockTranche | buffer_mapping

Summary: On the X86 8-socket machine at 300 scale factor, I did not
observe significant waits on ClogControlLock up to 96 clients. However,
at 192 clients we can see significant waits on ClogControlLock, though
still not as bad as we see on POWER.
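
For reference, a minimal sketch of how wait-event counts in the format
shown above can be collected (one possible method; the actual scripts
may differ):

    # sample pg_stat_activity for the duration of the run ...
    for i in $(seq 1 3600); do
        psql -At -F ' | ' -c "SELECT wait_event_type, wait_event FROM pg_stat_activity" postgres
        sleep 0.5
    done > wait_events.raw

    # ... then aggregate into "count  wait_event_type | wait_event" lines
    sort wait_events.raw | uniq -c | sort -rn > 300_192_ul.txt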

>
> I've taken some time to build a simple web-based reports from the results
> collected so far (also included in the git repository), and pushed them
> here:
>
> http://tvondra.bitbucket.org
>
> For each of the completed runs, there's a report comparing tps for different
> client counts with master and the three patches (average tps, median and
> stddev), and it's possible to download a more thorough text report with wait
> event stats, comparison of individual runs etc.

I saw your report; I think presenting it this way gives a very clear idea.
>
> If you want to cooperate on this, I'm available - i.e. I can help you get
> the tooling running, customize it etc.

That will be really helpful; then next time I can also present my
reports in the same format.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-21 02:44:05
Message-ID: CAFiTN-tvJTGCvR=5gRUorX66j5ALjnZwfu1vmAR9g7i1_Vf3aQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 20, 2016 at 9:15 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> So here's my theory. The whole reason why Tomas is having difficulty
> seeing any big effect from these patches is because he's testing on
> x86. When Dilip tests on x86, he doesn't see a big effect either,
> regardless of workload. But when Dilip tests on POWER, which I think
> is where he's mostly been testing, he sees a huge effect, because for
> some reason POWER has major problems with this lock that don't exist
> on x86.

Right, because on POWER we can see big contention on ClogControlLock
with 300 scale factor, even at 96 clients, but on X86 with 300
scale factor there is almost no contention on ClogControlLock.

However, at 1000 scale factor we can see significant contention on
ClogControlLock on the X86 machine.

I want to test on POWER with 1000 scale factor to see whether the
contention on ClogControlLock becomes much worse.

I will run this test and post the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-21 06:13:14
Message-ID: CAA4eK1JBbYKUWXzzrrcRnPoChB_Tu2-fYt4aW41ADDfETwTVhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> I then started a run at 96 clients which I accidentally killed shortly
>>> before it was scheduled to finish, but the results are not much
>>> different; there is no hint of the runaway CLogControlLock contention
>>> that Dilip sees on power2.
>>>
>> What shared_buffer size were you using? I assume the data set fit into
>> shared buffers, right?
>
> 8GB.
>
>> FWIW as I explained in the lengthy post earlier today, I can actually
>> reproduce the significant CLogControlLock contention (and the patches do
>> reduce it), even on x86_64.
>
> /me goes back, rereads post. Sorry, I didn't look at this carefully
> the first time.
>
>> For example consider these two tests:
>>
>> * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
>> * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>>
>> However, it seems I can also reproduce fairly bad regressions, like for
>> example this case with data set exceeding shared_buffers:
>>
>> * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
>
> I'm not sure how seriously we should take the regressions. I mean,
> what I see there is that CLogControlLock contention goes down by about
> 50% -- which is the point of the patch -- and WALWriteLock contention
> goes up dramatically -- which sucks, but can't really be blamed on the
> patch except in the indirect sense that a backend can't spend much
> time waiting for A if it's already spending all of its time waiting
> for B.
>

Right, I think it's not only WALWriteLock; contention on other locks
also goes up, as you can see in the table below. I think there is not
much we can do about that with this patch. One thing which is unclear
is why the unlogged tests are showing WALWriteLock at all.

 test                            | clients | wait_event_type | wait_event      | master | granular_locking | no_content_lock | group_update
---------------------------------+---------+-----------------+-----------------+--------+------------------+-----------------+--------------
 pgbench-3000-unlogged-sync-skip |      72 | LWLockNamed     | CLogControlLock | 217012 |            37326 |           32288 |        12040
 pgbench-3000-unlogged-sync-skip |      72 | LWLockNamed     | WALWriteLock    |  13188 |           104183 |          123359 |       103267
 pgbench-3000-unlogged-sync-skip |      72 | LWLockTranche   | buffer_content  |  10532 |            65880 |           57007 |        86176
 pgbench-3000-unlogged-sync-skip |      72 | LWLockTranche   | wal_insert      |   9280 |            85917 |          109472 |        99609
 pgbench-3000-unlogged-sync-skip |      72 | LWLockTranche   | clog            |   4623 |            25692 |           10422 |        11755

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-21 07:37:59
Message-ID: 730786b4-0eaf-4c33-b7da-e018cf35c208@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/21/2016 08:13 AM, Amit Kapila wrote:
> On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>> I then started a run at 96 clients which I accidentally killed shortly
>>>> before it was scheduled to finish, but the results are not much
>>>> different; there is no hint of the runaway CLogControlLock contention
>>>> that Dilip sees on power2.
>>>>
>>> What shared_buffer size were you using? I assume the data set fit into
>>> shared buffers, right?
>>
>> 8GB.
>>
>>> FWIW as I explained in the lengthy post earlier today, I can actually
>>> reproduce the significant CLogControlLock contention (and the patches do
>>> reduce it), even on x86_64.
>>
>> /me goes back, rereads post. Sorry, I didn't look at this carefully
>> the first time.
>>
>>> For example consider these two tests:
>>>
>>> * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
>>> * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>>>
>>> However, it seems I can also reproduce fairly bad regressions, like for
>>> example this case with data set exceeding shared_buffers:
>>>
>>> * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
>>
>> I'm not sure how seriously we should take the regressions. I mean,
>> what I see there is that CLogControlLock contention goes down by about
>> 50% -- which is the point of the patch -- and WALWriteLock contention
>> goes up dramatically -- which sucks, but can't really be blamed on the
>> patch except in the indirect sense that a backend can't spend much
>> time waiting for A if it's already spending all of its time waiting
>> for B.
>>
>
> Right, I think not only WALWriteLock, but contention on other locks
> also goes up as you can see in below table. I think there is nothing
> much we can do for that with this patch. One thing which is unclear
> is why on unlogged tests it is showing WALWriteLock?
>

Well, although we don't write the table data to the WAL, we still need
to write commits and other stuff, right? And on scale 3000 (which
exceeds the 16GB shared buffers in this case), there's a continuous
stream of dirty pages (not to WAL, but evicted from shared buffers), so
iostat looks like this:

time tps wr_sec/s avgrq-sz avgqu-sz await %util
08:48:21 81654 1367483 16.75 127264.60 1294.80 97.41
08:48:31 41514 697516 16.80 103271.11 3015.01 97.64
08:48:41 78892 1359779 17.24 97308.42 928.36 96.76
08:48:51 58735 978475 16.66 92303.00 1472.82 95.92
08:49:01 62441 1068605 17.11 78482.71 1615.56 95.57
08:49:11 55571 945365 17.01 113672.62 1923.37 98.07
08:49:21 69016 1161586 16.83 87055.66 1363.05 95.53
08:49:31 54552 913461 16.74 98695.87 1761.30 97.84

That's ~500-600 MB/s of continuous writes. I'm sure the storage could
handle more than this (will do some testing after the tests complete),
but surely the WAL has to compete for bandwidth (it's on the same volume
/ devices). Another thing is that we only have 8 WAL insert locks, and
maybe that leads to contention with such high client counts.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-21 11:29:15
Message-ID: CAA4eK1L3iq8CQztz9SfG-5iJo2PLxHOV0jnWCspA7cFvoqJ6gQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 21, 2016 at 1:07 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 10/21/2016 08:13 AM, Amit Kapila wrote:
>>
>> On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>>>
>>> On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>>
>>>>> I then started a run at 96 clients which I accidentally killed shortly
>>>>> before it was scheduled to finish, but the results are not much
>>>>> different; there is no hint of the runaway CLogControlLock contention
>>>>> that Dilip sees on power2.
>>>>>
>>>> What shared_buffer size were you using? I assume the data set fit into
>>>> shared buffers, right?
>>>
>>>
>>> 8GB.
>>>
>>>> FWIW as I explained in the lengthy post earlier today, I can actually
>>>> reproduce the significant CLogControlLock contention (and the patches do
>>>> reduce it), even on x86_64.
>>>
>>>
>>> /me goes back, rereads post. Sorry, I didn't look at this carefully
>>> the first time.
>>>
>>>> For example consider these two tests:
>>>>
>>>> * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
>>>> * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>>>>
>>>> However, it seems I can also reproduce fairly bad regressions, like for
>>>> example this case with data set exceeding shared_buffers:
>>>>
>>>> * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
>>>
>>>
>>> I'm not sure how seriously we should take the regressions. I mean,
>>> what I see there is that CLogControlLock contention goes down by about
>>> 50% -- which is the point of the patch -- and WALWriteLock contention
>>> goes up dramatically -- which sucks, but can't really be blamed on the
>>> patch except in the indirect sense that a backend can't spend much
>>> time waiting for A if it's already spending all of its time waiting
>>> for B.
>>>
>>
>> Right, I think not only WALWriteLock, but contention on other locks
>> also goes up as you can see in below table. I think there is nothing
>> much we can do for that with this patch. One thing which is unclear
>> is why on unlogged tests it is showing WALWriteLock?
>>
>
> Well, although we don't write the table data to the WAL, we still need to
> write commits and other stuff, right?
>

We do need to write the commit record, but do we need to flush it
immediately to WAL for unlogged tables? It seems we let the WAL writer
do that; see the logic in RecordTransactionCommit.

> And on scale 3000 (which exceeds the
> 16GB shared buffers in this case), there's a continuous stream of dirty
> pages (not to WAL, but evicted from shared buffers), so iostat looks like
> this:
>
> time tps wr_sec/s avgrq-sz avgqu-sz await %util
> 08:48:21 81654 1367483 16.75 127264.60 1294.80 97.41
> 08:48:31 41514 697516 16.80 103271.11 3015.01 97.64
> 08:48:41 78892 1359779 17.24 97308.42 928.36 96.76
> 08:48:51 58735 978475 16.66 92303.00 1472.82 95.92
> 08:49:01 62441 1068605 17.11 78482.71 1615.56 95.57
> 08:49:11 55571 945365 17.01 113672.62 1923.37 98.07
> 08:49:21 69016 1161586 16.83 87055.66 1363.05 95.53
> 08:49:31 54552 913461 16.74 98695.87 1761.30 97.84
>
> That's ~500-600 MB/s of continuous writes. I'm sure the storage could handle
> more than this (will do some testing after the tests complete), but surely
> the WAL has to compete for bandwidth (it's on the same volume / devices).
> Another thing is that we only have 8 WAL insert locks, and maybe that leads
> to contention with such high client counts.
>

Yeah, quite possible, but I don't think increasing that would benefit
in general, because while writing WAL we need to take all the
wal_insert locks. In any case, I think that is a separate problem to
study.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-24 09:18:29
Message-ID: CAFiTN-t15PjFTFQH0fBfM5jVSv6rfm5J_8vj=RGuKeFpdLgSoQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> In the results you've posted on 10/12, you've mentioned a regression with 32
>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>> ~10% difference). I have no idea what scale was used for those tests,
>
> That test was with scale factor 300 on POWER 4 socket machine. I think
> I need to repeat this test with multiple reading to confirm it was
> regression or run to run variation. I will do that soon and post the
> results.

As promised, I have rerun my test (3 times), and I did not see any regression.
The medians of the 3 runs on head and with the group lock patch are the same.
However, I am posting the results of all three runs.

I think in my earlier reading we saw ~48K TPS with the patch, but over
multiple runs we get that reading with both head and the patch.
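
A minimal sketch of this kind of repeated run (file names and the
median-extraction pipeline are illustrative, not the exact commands):

    # three identical 30-minute runs at 32 clients
    for run in 1 2 3; do
        pgbench -M prepared -c 32 -j 32 -T 1800 postgres > run_${run}.log
    done

    # median of the three tps values (excluding connection establishment)
    grep -h 'excluding connections' run_*.log | awk '{print $3}' | sort -n | sed -n '2p'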

Head:
--------
run1:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 87784836
latency average = 0.656 ms
tps = 48769.327513 (including connections establishing)
tps = 48769.543276 (excluding connections establishing)

run2:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 91240374
latency average = 0.631 ms
tps = 50689.069717 (including connections establishing)
tps = 50689.263505 (excluding connections establishing)

run3:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 90966003
latency average = 0.633 ms
tps = 50536.639303 (including connections establishing)
tps = 50536.836924 (excluding connections establishing)

With group lock patch:
------------------------------
run1:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 87316264
latency average = 0.660 ms
tps = 48509.008040 (including connections establishing)
tps = 48509.194978 (excluding connections establishing)

run2:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 91950412
latency average = 0.626 ms
tps = 51083.507790 (including connections establishing)
tps = 51083.704489 (excluding connections establishing)

run3:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 90378462
latency average = 0.637 ms
tps = 50210.225983 (including connections establishing)
tps = 50210.405401 (excluding connections establishing)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-25 04:10:09
Message-ID: CAA4eK1LyR2A+m=RBSZ6rcPEwJ=rVi1ADPSndXHZdjn56yqO6Vg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 24, 2016 at 2:48 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>> In the results you've posted on 10/12, you've mentioned a regression with 32
>>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>>> ~10% difference). I have no idea what scale was used for those tests,
>>
>> That test was with scale factor 300 on POWER 4 socket machine. I think
>> I need to repeat this test with multiple reading to confirm it was
>> regression or run to run variation. I will do that soon and post the
>> results.
>
> As promised, I have rerun my test (3 times), and I did not see any regression.
>

Thanks Tomas and Dilip for doing detailed performance tests for this
patch. I would like to summarise the performance testing results.

1. With update intensive workload, we are seeing gains from 23%~192%
at client count >=64 with group_update patch [1].
2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
gains from 12% to ~70% at client count >=64 [2]. Tests are done on
8-socket intel m/c.
3. With pgbench workload (both simple-update and tpc-b at 300 scale
factor), we are seeing gain 10% to > 50% at client count >=64 [3].
Tests are done on 8-socket intel m/c.
4. To see why the patch only helps at higher client count, we have
done wait event testing for various workloads [4], [5] and the results
indicate that at lower clients, the waits are mostly due to
transactionid or clientread. At client-counts where contention due to
CLOGControlLock is significant, this patch helps a lot to reduce that
contention. These tests are done on 8-socket intel m/c and
4-socket power m/c.
5. With pgbench workload (unlogged tables), we are seeing gains from
15% to > 300% at client count >=72 [6].

There are many more tests done for the proposed patches where the gains
are either on similar lines as above or neutral. We do see
regressions in some cases.

1. When data doesn't fit in shared buffers, there is a regression at
some client counts [7], but on analysis it has been found that it is
mainly due to the shift in contention from CLOGControlLock to
WALWriteLock and/or other locks.
2. We do see in some cases that the granular_locking and no_content_lock
patches have shown a significant increase in contention on
CLOGControlLock. I have already shared my analysis for the same upthread
[8].

Attached is the latest group update clog patch.

In the last commitfest, the patch was returned with feedback to evaluate
the cases where it can show a win, and I think the above results indicate
that the patch has significant benefit on various workloads. What I
think is pending at this stage is that either a committer or the
reviewers of this patch need to provide feedback on my analysis
[8] for the cases where the patches are not showing a win.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAFiTN-tr_%3D25EQUFezKNRk%3D4N-V%2BD6WMxo7HWs9BMaNx7S3y6w%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-v5hm1EO4cLXYmpppYdNQk%2Bn4N-O1m%2B%2B3U9f0Ga1gBzRQ%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAFiTN-taV4iVkPHrxg%3DYCicKjBS6%3DQZm_cM4hbS_2q2ryLhUUw%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/CAFiTN-uQ%2BJbd31cXvRbj48Ba6TqDUDpLKSPnsUCCYRju0Y0U8Q%40mail.gmail.com
[6] - http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
[7] - http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
[8] - https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v9.patch application/octet-stream 15.6 KB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-26 22:45:11
Message-ID: 4a52a34f-57fa-7bcf-d34c-c15db40f0361@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/25/2016 06:10 AM, Amit Kapila wrote:
> On Mon, Oct 24, 2016 at 2:48 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>>> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>>> In the results you've posted on 10/12, you've mentioned a regression with 32
>>>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>>>> ~10% difference). I have no idea what scale was used for those tests,
>>>
>>> That test was with scale factor 300 on POWER 4 socket machine. I think
>>> I need to repeat this test with multiple reading to confirm it was
>>> regression or run to run variation. I will do that soon and post the
>>> results.
>>
>> As promised, I have rerun my test (3 times), and I did not see any regression.
>>
>
> Thanks Tomas and Dilip for doing detailed performance tests for this
> patch. I would like to summarise the performance testing results.
>
> 1. With update intensive workload, we are seeing gains from 23%~192%
> at client count >=64 with group_update patch [1].
> 2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
> gains from 12% to ~70% at client count >=64 [2]. Tests are done on
> 8-socket intel m/c.
> 3. With pgbench workload (both simple-update and tpc-b at 300 scale
> factor), we are seeing gain 10% to > 50% at client count >=64 [3].
> Tests are done on 8-socket intel m/c.
> 4. To see why the patch only helps at higher client count, we have
> done wait event testing for various workloads [4], [5] and the results
> indicate that at lower clients, the waits are mostly due to
> transactionid or clientread. At client-counts where contention due to
> CLOGControlLock is significant, this patch helps a lot to reduce that
> contention. These tests are done on 8-socket intel m/c and
> 4-socket power m/c.
> 5. With pgbench workload (unlogged tables), we are seeing gains from
> 15% to > 300% at client count >=72 [6].
>

It's not entirely clear which of the above tests were done on unlogged
tables, and I don't see that in the referenced e-mails. That would be an
interesting thing to mention in the summary, I think.

> There are many more tests done for the proposed patches where the gains
> are either on similar lines as above or neutral. We do see
> regressions in some cases.
>
> 1. When data doesn't fit in shared buffers, there is a regression at
> some client counts [7], but on analysis it has been found that it is
> mainly due to the shift in contention from CLOGControlLock to
> WALWriteLock and/or other locks.

The question is why shifting the lock contention to WALWriteLock should
cause such a significant performance drop, particularly when the test was
done on unlogged tables. Or, if that's the case, how that makes the
performance drop less problematic / acceptable.

FWIW I plan to run the same test with logged tables - if it shows
similar regression, I'll be much more worried, because that's a fairly
typical scenario (logged tables, data set > shared buffers), and we
surely can't just go and break that.

> 2. We do see in some cases that the granular_locking and no_content_lock
> patches have shown a significant increase in contention on
> CLOGControlLock. I have already shared my analysis for the same upthread
> [8].

I do agree that in some cases this significantly reduces contention on
CLogControlLock. I do however think that currently the performance gains
are limited almost exclusively to unlogged tables, and some
logged+async cases.

On logged tables it usually looks like this (i.e. modest increase for
high client counts at the expense of significantly higher variability):

http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64

or like this (i.e. only partial recovery for the drop above 36 clients):

http://tvondra.bitbucket.org/#pgbench-3000-logged-async-skip-64

And of course, there are cases like this:

http://tvondra.bitbucket.org/#dilip-300-logged-async

I'd really like to understand why the patched results behave so
differently depending on the client count.

>
> Attached is the latest group update clog patch.
>

How is that different from the previous versions?

>
> In the last commitfest, the patch was returned with feedback to evaluate
> the cases where it can show a win, and I think the above results indicate
> that the patch has significant benefit on various workloads. What I
> think is pending at this stage is that either a committer or the
> reviewers of this patch need to provide feedback on my analysis
> [8] for the cases where the patches are not showing a win.
>
> Thoughts?
>

I do agree the patch(es) significantly reduce CLogControlLock contention,
although with WAL logging enabled (which is what matters for most
production deployments) it pretty much only shifts the contention to a
different lock (so the immediate performance benefit is 0).

Which raises the question of why to commit this patch now, before we
have a patch addressing the WAL locks. I realize this is a chicken-and-egg
problem, but my worry is that the increased WALWriteLock contention will
cause regressions in current workloads.

BTW I've run some tests with the number of clog buffers increased to
512, and the results seem fairly positive. Compare for example these two
results:

http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip-clog-512

The first one is with the default 128 buffers, the other one is with 512
buffers. The impact on master is pretty obvious - for 72 clients the tps
jumps from 160k to 197k, and for higher client counts it gives us about
+50k tps (typically increase from ~80k to ~130k tps). And the tps
variability is significantly reduced.

For the other workload, the results are less convincing though:

http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
http://tvondra.bitbucket.org/#dilip-300-unlogged-sync-clog-512

Interesting that the master adopts the zig-zag pattern, but shifted.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-27 11:44:56
Message-ID: CAA4eK1KTbNbZSDo=6k4YgaJh_FM20zJCKu2Yt0bxaFMv9QcSXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 10/25/2016 06:10 AM, Amit Kapila wrote:
>>
>> On Mon, Oct 24, 2016 at 2:48 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
>> wrote:
>>>
>>> On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
>>> wrote:
>>>>
>>>> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
>>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>
>>>>> In the results you've posted on 10/12, you've mentioned a regression
>>>>> with 32
>>>>> clients, where you got 52k tps on master but only 48k tps with the
>>>>> patch (so
>>>>> ~10% difference). I have no idea what scale was used for those tests,
>>>>
>>>>
>>>> That test was with scale factor 300 on POWER 4 socket machine. I think
>>>> I need to repeat this test with multiple reading to confirm it was
>>>> regression or run to run variation. I will do that soon and post the
>>>> results.
>>>
>>>
>>> As promised, I have rerun my test (3 times), and I did not see any
>>> regression.
>>>
>>
>> Thanks Tomas and Dilip for doing detailed performance tests for this
>> patch. I would like to summarise the performance testing results.
>>
>> 1. With update intensive workload, we are seeing gains from 23%~192%
>> at client count >=64 with group_update patch [1].
>> 2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
>> gains from 12% to ~70% at client count >=64 [2]. Tests are done on
>> 8-socket intel m/c.
>> 3. With pgbench workload (both simple-update and tpc-b at 300 scale
>> factor), we are seeing gain 10% to > 50% at client count >=64 [3].
>> Tests are done on 8-socket intel m/c.
>> 4. To see why the patch only helps at higher client count, we have
>> done wait event testing for various workloads [4], [5] and the results
>> indicate that at lower clients, the waits are mostly due to
>> transactionid or clientread. At client-counts where contention due to
>> CLOGControlLock is significant, this patch helps a lot to reduce that
>> contention. These tests are done on on 8-socket intel m/c and
>> 4-socket power m/c
>> 5. With pgbench workload (unlogged tables), we are seeing gains from
>> 15% to > 300% at client count >=72 [6].
>>
>
> It's not entirely clear which of the above tests were done on unlogged
> tables, and I don't see that in the referenced e-mails. That would be an
> interesting thing to mention in the summary, I think.
>

One thing is clear: all the results are with either
synchronous_commit=off or unlogged tables. I think Dilip can answer
better which of those are on unlogged tables and which on
synchronous_commit=off.

>> There are many more tests done for the proposed patches where gains
>> are either or similar lines as above or are neutral. We do see
>> regression in some cases.
>>
>> 1. When data doesn't fit in shared buffers, there is regression at
>> some client counts [7], but on analysis it has been found that it is
>> mainly due to the shift in contention from CLOGControlLock to
>> WALWriteLock and or other locks.
>
>
> The questions is why shifting the lock contention to WALWriteLock should
> cause such significant performance drop, particularly when the test was done
> on unlogged tables. Or, if that's the case, how it makes the performance
> drop less problematic / acceptable.
>

Whenever contention shifts to another lock, there is a chance that it
shows up as a performance dip in some cases, and I have seen that
previously as well. The theory behind that could be like this: say you
have two locks L1 and L2, and there are 100 processes contending on L1
and 50 on L2. Now say you reduce contention on L1 such that it leads to
120 processes contending on L2; the increased contention on L2 can then
slow down the overall throughput of all processes.
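
To make that more concrete, below is a minimal standalone sketch (plain
pthreads, not PostgreSQL code) of the same idea: every "transaction"
briefly takes L1 and then L2. The thread count and sleep durations are
made-up numbers; the point is that shrinking the work done under L1
(L1_WORK_US) mostly moves the waiters over to L2 rather than improving
the total throughput, because L2 remains the serial bottleneck.

/*
 * Toy model of contention shifting between two locks.  All numbers are
 * hypothetical; build with something like: gcc -O2 -pthread lockshift.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS   120          /* hypothetical client count */
#define L1_WORK_US 20           /* try lowering this: tps barely changes */
#define L2_WORK_US 50           /* L2 stays the bottleneck */

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;
static atomic_long committed;
static volatile int stop;

static void *worker(void *arg)
{
    (void) arg;
    while (!stop)
    {
        pthread_mutex_lock(&L1);        /* think: CLogControlLock */
        usleep(L1_WORK_US);
        pthread_mutex_unlock(&L1);

        pthread_mutex_lock(&L2);        /* think: WALWriteLock */
        usleep(L2_WORK_US);
        pthread_mutex_unlock(&L2);

        atomic_fetch_add(&committed, 1);
    }
    return NULL;
}

int main(void)
{
    pthread_t   tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    sleep(5);
    stop = 1;
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("%ld transactions in 5 seconds\n", (long) atomic_load(&committed));
    return 0;
}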

> FWIW I plan to run the same test with logged tables - if it shows similar
> regression, I'll be much more worried, because that's a fairly typical
> scenario (logged tables, data set > shared buffers), and we surely can't
> just go and break that.
>

Sure, please do those tests.

>> 2. We do see in some cases that granular_locking and no_content_lock
>> patches has shown significant increase in contention on
>> CLOGControlLock. I have already shared my analysis for same upthread
>> [8].
>
>
> I do agree that some cases this significantly reduces contention on the
> CLogControlLock. I do however think that currently the performance gains are
> limited almost exclusively to cases on unlogged tables, and some
> logged+async cases.
>

Right, because the contention is mainly visible for those workloads.

> On logged tables it usually looks like this (i.e. modest increase for high
> client counts at the expense of significantly higher variability):
>
> http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>

What variability are you referring to in those results?

> or like this (i.e. only partial recovery for the drop above 36 clients):
>
> http://tvondra.bitbucket.org/#pgbench-3000-logged-async-skip-64
>
> And of course, there are cases like this:
>
> http://tvondra.bitbucket.org/#dilip-300-logged-async
>
> I'd really like to understand why the patched results behave that
> differently depending on client count.
>

I have already explained this upthread [1]. Refer to the text after the
line "I have checked the wait event results where there is more
fluctuation:"

>>
>> Attached is the latest group update clog patch.
>>
>
> How is that different from the previous versions?
>

The previous patch was showing some hunks when you tried to apply it. I
thought it might be better to rebase so that it can be applied cleanly;
otherwise there is no change in the code.

>>
>>
>> In last commit fest, the patch was returned with feedback to evaluate
>> the cases where it can show win and I think above results indicates
>> that the patch has significant benefit on various workloads. What I
>> think is pending at this stage is the either one of the committer or
>> the reviewers of this patch needs to provide feedback on my analysis
>> [8] for the cases where patches are not showing win.
>>
>> Thoughts?
>>
>
> I do agree the patch(es) significantly reduce CLogControlLock, although with
> WAL logging enabled (which is what matters for most production deployments)
> it pretty much only shifts the contention to a different lock (so the
> immediate performance benefit is 0).
>

Yeah, but I think there are use cases where users can use
synchronous_commit=off.

> Which raises the question why to commit this patch now, before we have a
> patch addressing the WAL locks. I realize this is a chicken-egg problem, but
> my worry is that the increased WALWriteLock contention will cause
> regressions in current workloads.
>

I think if we use that theory, we won't be able to make progress in
terms of reducing lock contention. I think we have previously
committed code in such situations. For example, while reducing
contention in the buffer management area
(d72731a70450b5e7084991b9caa15cb58a2820df), I noticed such
behaviour and reported my analysis [2] as well (in the mail [2], you
can see there is a performance improvement at 1000 scale factor and a
dip at 5000 scale factor). Later on, when the contention on dynahash
spinlocks got alleviated (44ca4022f3f9297bab5cbffdd97973dbba1879ed),
the results were much better. If we had not reduced the contention in
buffer management, the benefits from the dynahash improvements wouldn't
have been much in those workloads (if you want, I can find and share the
results of the dynahash improvements).

> BTW I've ran some tests with the number of clog buffers increases to 512,
> and it seems like a fairly positive. Compare for example these two results:
>
> http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
> http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip-clog-512
>
> The first one is with the default 128 buffers, the other one is with 512
> buffers. The impact on master is pretty obvious - for 72 clients the tps
> jumps from 160k to 197k, and for higher client counts it gives us about +50k
> tps (typically increase from ~80k to ~130k tps). And the tps variability is
> significantly reduced.
>

Interesting, because the last time I did such testing by increasing the
clog buffers, it didn't show any improvement; rather, if I remember
correctly, it showed some regression. I am not sure what the best way to
handle this is; maybe we can make the number of clog buffers a GUC
variable.
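
Just to illustrate what I mean, here is a toy standalone sketch (not a
patch): the GUC name clog_buffers is hypothetical, and the auto-tuning
formula is the current Min(128, Max(4, NBuffers / 512)) cap, written
from memory.

/*
 * Toy sketch of making the number of clog buffers configurable.
 * "clog_buffers" is a hypothetical GUC (0 = auto-tune, as today).
 */
#include <stdio.h>

#define Min(x, y)  ((x) < (y) ? (x) : (y))
#define Max(x, y)  ((x) > (y) ? (x) : (y))

static int NBuffers = 1048576;  /* e.g. shared_buffers = 8GB with 8kB pages */
static int clog_buffers = 0;    /* hypothetical GUC, 0 = auto-tune */

static int
CLOGShmemBuffersSketch(void)
{
    if (clog_buffers > 0)
        return clog_buffers;                    /* explicit setting wins */
    return Min(128, Max(4, NBuffers / 512));    /* current auto-tuned cap */
}

int main(void)
{
    printf("auto-tuned: %d buffers\n", CLOGShmemBuffersSketch());
    clog_buffers = 512;                         /* the value you tested */
    printf("overridden: %d buffers\n", CLOGShmemBuffersSketch());
    return 0;
}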

[1] - https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1JUPn1rV0ep5DR74skcv%2BRRK7i2inM1X01ajG%2BgCX-hMw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-27 12:23:39
Message-ID: CAFiTN-uFQJDNATkMt7=bJUSOPD+t7sGvTkYjX_3CChMiE0224g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 27, 2016 at 5:14 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> Thanks Tomas and Dilip for doing detailed performance tests for this
>>> patch. I would like to summarise the performance testing results.
>>>
>>> 1. With update intensive workload, we are seeing gains from 23%~192%
>>> at client count >=64 with group_update patch [1].

this is with unlogged table

>>> 2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
>>> gains from 12% to ~70% at client count >=64 [2]. Tests are done on
>>> 8-socket intel m/c.

this is with synchronous_commit=off

>>> 3. With pgbench workload (both simple-update and tpc-b at 300 scale
>>> factor), we are seeing gain 10% to > 50% at client count >=64 [3].
>>> Tests are done on 8-socket intel m/c.

this is with synchronous_commit=off

>>> 4. To see why the patch only helps at higher client count, we have
>>> done wait event testing for various workloads [4], [5] and the results
>>> indicate that at lower clients, the waits are mostly due to
>>> transactionid or clientread. At client-counts where contention due to
>>> CLOGControlLock is significant, this patch helps a lot to reduce that
>>> contention. These tests are done on on 8-socket intel m/c and
>>> 4-socket power m/c

these are both with synchronous_commit=off + unlogged tables

>>> 5. With pgbench workload (unlogged tables), we are seeing gains from
>>> 15% to > 300% at client count >=72 [6].
>>>
>>
>> It's not entirely clear which of the above tests were done on unlogged
>> tables, and I don't see that in the referenced e-mails. That would be an
>> interesting thing to mention in the summary, I think.
>>
>
> One thing is clear that all results are on either
> synchronous_commit=off or on unlogged tables. I think Dilip can
> answer better which of those are on unlogged and which on
> synchronous_commit=off.

I have mentioned this above under each of your test points.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-30 18:32:48
Message-ID: b3586234-6c80-5b64-1261-871e0e852bbb@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 10/27/2016 01:44 PM, Amit Kapila wrote:
> On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> FWIW I plan to run the same test with logged tables - if it shows similar
>> regression, I'll be much more worried, because that's a fairly typical
>> scenario (logged tables, data set > shared buffers), and we surely can't
>> just go and break that.
>>
>
> Sure, please do those tests.
>

OK, so I do have results for those tests - that is, scale 3000 with
shared_buffers=16GB (so continuously writing out dirty buffers). The
following reports show the results slightly differently - all three "tps
charts" next to each other, then the speedup charts and tables.

Overall, the results are surprisingly positive - look at these results
(all ending with "-retest"):

[1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest

[2]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest

[3]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest

All three show significant improvement, even with fairly low client
counts. For example with 72 clients, the tps improves by 20%, without
significantly affecting the variability of the results (measured as
stddev; more on this later).

It's however interesting that "no_content_lock" is almost exactly the
same as master, while the other two cases improve significantly.

The other interesting thing is that "pgbench -N" [3] shows no such
improvement, unlike regular pgbench and Dilip's workload. Not sure why,
though - I'd expect to see significant improvement in this case.

I have also repeated those tests with clog buffers increased to 512 (so
4x the current maximum of 128). I only have results for Dilip's workload
and "pgbench -N":

[4]
http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512

[5]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512

The results are somewhat surprising, I guess, because the effect is
wildly different for each workload.

For Dilip's workload, increasing clog buffers to 512 pretty much
eliminates all benefits of the patches. For example with 288 clients, the
group_update patch gives ~60k tps on 128 buffers [1] but only 42k tps on
512 buffers [4].

With "pgbench -N", the effect is exactly the opposite - while with 128
buffers there was pretty much no benefit from any of the patches [3],
with 512 buffers we suddenly get almost 2x the throughput, but only for
group_update and master (while the other two patches show no improvement
at all).

I don't have results for the regular pgbench ("noskip") with 512 buffers
yet, but I'm curious what that will show.

In general, however, I think the patches don't show any regression in
any of those workloads (at least not with 128 buffers). Based solely on
the results, I like group_update the most, because it performs as well
as master or significantly better.

>>> 2. We do see in some cases that granular_locking and
>>> no_content_lock patches has shown significant increase in
>>> contention on CLOGControlLock. I have already shared my analysis
>>> for same upthread [8].
>>

I've read that analysis, but I'm not sure I see how it explains the "zig
zag" behavior. I do understand that shifting the contention to some
other (already busy) lock may negatively impact throughput, or that the
group_update may result in updating multiple clog pages, but I don't
understand two things:

(1) Why this should result in the fluctuations we observe in some of the
cases. For example, why should we see 150k tps on 72 clients, then a drop
to 92k with 108 clients, then back to 130k on 144 clients, then 84k on
180 clients, etc. That seems fairly strange.

(2) Why this should affect all three patches, when only group_update has
to modify multiple clog pages.

For example consider this:

http://tvondra.bitbucket.org/index2.html#dilip-300-logged-async

For example looking at % of time spent on different locks with the
group_update patch, I see this (ignoring locks with ~1%):

event_type     wait_event          36   72  108  144  180  216  252  288
-------------------------------------------------------------------------
-              -                   60   63   45   53   38   50   33   48
Client         ClientRead          33   23    9   14    6   10    4    8
LWLockNamed    CLogControlLock      2    7   33   14   34   14   33   14
LWLockTranche  buffer_content       0    2    9   13   19   18   26   22

I don't see any sign of contention shifting to other locks, just
CLogControlLock fluctuating between 14% and 33% for some reason.
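
For reference, the percentages above come from sampling wait events
during the runs - roughly along the lines of the sketch below, which
polls pg_stat_activity once per second and counts the
(wait_event_type, wait_event) pairs. The connection string and the
sampling interval are placeholders; the actual collection scripts differ
in details, but the idea is the same.

/*
 * Rough wait-event sampler sketch (build with -lpq).  The conninfo
 * string and the 1-second interval are placeholders.
 */
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        PGresult   *res = PQexec(conn,
            "SELECT coalesce(wait_event_type, '-'), "
            "       coalesce(wait_event, '-'), count(*) "
            "FROM pg_stat_activity WHERE pid <> pg_backend_pid() "
            "GROUP BY 1, 2 ORDER BY 3 DESC");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            for (int i = 0; i < PQntuples(res); i++)
                printf("%s\t%s\t%s\n", PQgetvalue(res, i, 0),
                       PQgetvalue(res, i, 1), PQgetvalue(res, i, 2));

        PQclear(res);
        fflush(stdout);
        sleep(1);                       /* one sample per second */
    }

    PQfinish(conn);                     /* not reached */
    return 0;
}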

Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
some sort of CPU / OS scheduling artifact. For example, the system has
36 physical cores, 72 virtual ones (thanks to HT). I find it strange
that the "good" client counts are always multiples of 72, while the
"bad" ones fall in between.

72 = 72 * 1 (good)
108 = 72 * 1.5 (bad)
144 = 72 * 2 (good)
180 = 72 * 2.5 (bad)
216 = 72 * 3 (good)
252 = 72 * 3.5 (bad)
288 = 72 * 4 (good)

So maybe this has something to do with how OS schedules the tasks, or
maybe some internal heuristics in the CPU, or something like that.

>> On logged tables it usually looks like this (i.e. modest increase for high
>> client counts at the expense of significantly higher variability):
>>
>> http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>>
>
> What variability are you referring to in those results?
>

Good question. What I mean by "variability" is how stable the tps is
during the benchmark (when measured on per-second granularity). For
example, let's run a 10-second benchmark, measuring number of
transactions committed each second.

Then all those runs do 1000 tps on average:

run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000

I guess we agree those runs behave very differently, despite having the
same throughput. This is what STDDEV(tps), i.e. the third chart on the
reports, shows.
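
To make that concrete, here's a trivial standalone program computing the
average and stddev for exactly those three example runs (the reports do
essentially the same calculation on the real per-second samples):

/* avg / stddev of per-second tps samples; build with -lm */
#include <stdio.h>
#include <math.h>

static void summarize(const char *name, const double *tps, int n)
{
    double      sum = 0.0, sumsq = 0.0;

    for (int i = 0; i < n; i++)
    {
        sum += tps[i];
        sumsq += tps[i] * tps[i];
    }

    double      avg = sum / n;
    double      stddev = sqrt(sumsq / n - avg * avg);

    printf("%s: avg = %.0f tps, stddev = %.0f\n", name, avg, stddev);
}

int main(void)
{
    double run1[] = {1000, 1000, 1000, 1000, 1000,
                     1000, 1000, 1000, 1000, 1000};
    double run2[] = {500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500};
    double run3[] = {0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000};

    summarize("run 1", run1, 10);       /* stddev =    0 */
    summarize("run 2", run2, 10);       /* stddev =  500 */
    summarize("run 3", run3, 10);       /* stddev = 1000 */
    return 0;
}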

So for example this [6] shows that the patches give us higher throughput
with >= 180 clients, but we also pay for that with increased variability
of the results (i.e. the tps chart will have jitter):

[6]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-64

Of course, exchanging throughput, latency and variability is one of the
crucial trade-offs in transaction systems - at some point the resources
get saturated and higher throughput can only be achieved in exchange for
latency (e.g. by grouping requests). But still, we'd like to get stable
tps from the system, not something that gives us 2000 tps one second and
0 tps the next.

Of course, this is not perfect - it does not show whether there are
transactions with significantly higher latency, and so on. It'd be good
to also measure latency, but I haven't collected that info during the
runs so far.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 04:01:52
Message-ID: 956b695d-f6c5-d652-bf0c-cea95981547c@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/30/16 1:32 PM, Tomas Vondra wrote:
>
> Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
> some sort of CPU / OS scheduling artifact. For example, the system has
> 36 physical cores, 72 virtual ones (thanks to HT). I find it strange
> that the "good" client counts are always multiples of 72, while the
> "bad" ones fall in between.
>
> 72 = 72 * 1 (good)
> 108 = 72 * 1.5 (bad)
> 144 = 72 * 2 (good)
> 180 = 72 * 2.5 (bad)
> 216 = 72 * 3 (good)
> 252 = 72 * 3.5 (bad)
> 288 = 72 * 4 (good)
>
> So maybe this has something to do with how OS schedules the tasks, or
> maybe some internal heuristics in the CPU, or something like that.

It might be enlightening to run a series of tests that are 72*.1 or *.2
apart (say, 72, 79, 86, ..., 137, 144).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 13:24:58
Message-ID: f769dd83-6288-2c37-4958-b7ddad0bc974@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/31/2016 05:01 AM, Jim Nasby wrote:
> On 10/30/16 1:32 PM, Tomas Vondra wrote:
>>
>> Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
>> some sort of CPU / OS scheduling artifact. For example, the system has
>> 36 physical cores, 72 virtual ones (thanks to HT). I find it strange
>> that the "good" client counts are always multiples of 72, while the
>> "bad" ones fall in between.
>>
>> 72 = 72 * 1 (good)
>> 108 = 72 * 1.5 (bad)
>> 144 = 72 * 2 (good)
>> 180 = 72 * 2.5 (bad)
>> 216 = 72 * 3 (good)
>> 252 = 72 * 3.5 (bad)
>> 288 = 72 * 4 (good)
>>
>> So maybe this has something to do with how OS schedules the tasks, or
>> maybe some internal heuristics in the CPU, or something like that.
>
> It might be enlightening to run a series of tests that are 72*.1 or *.2
> apart (say, 72, 79, 86, ..., 137, 144).

Yeah, I've started a benchmark with a step of 6 clients

36 42 48 54 60 66 72 78 ... 252 258 264 270 276 282 288

instead of just

36 72 108 144 180 216 252 288

which did a test every 36 clients. To compensate for the 6x longer runs,
I'm only running tests for "group-update" and "master", so I should have
the results in ~36h.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 13:32:19
Message-ID: 5960ada5-98f5-dacf-903f-6e153aed76ce@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/30/2016 07:32 PM, Tomas Vondra wrote:
> Hi,
>
> On 10/27/2016 01:44 PM, Amit Kapila wrote:
>> On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>> FWIW I plan to run the same test with logged tables - if it shows
>>> similar
>>> regression, I'll be much more worried, because that's a fairly typical
>>> scenario (logged tables, data set > shared buffers), and we surely can't
>>> just go and break that.
>>>
>>
>> Sure, please do those tests.
>>
>
> OK, so I do have results for those tests - that is, scale 3000 with
> shared_buffers=16GB (so continuously writing out dirty buffers). The
> following reports show the results slightly differently - all three "tps
> charts" next to each other, then the speedup charts and tables.
>
> Overall, the results are surprisingly positive - look at these results
> (all ending with "-retest"):
>
> [1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest
>
> [2]
> http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest
>
>
> [3]
> http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest
>
>
> All three show significant improvement, even with fairly low client
> counts. For example with 72 clients, the tps improves 20%, without
> significantly affecting variability variability of the results( measured
> as stdddev, more on this later).
>
> It's however interesting that "no_content_lock" is almost exactly the
> same as master, while the other two cases improve significantly.
>
> The other interesting thing is that "pgbench -N" [3] shows no such
> improvement, unlike regular pgbench and Dilip's workload. Not sure why,
> though - I'd expect to see significant improvement in this case.
>
> I have also repeated those tests with clog buffers increased to 512 (so
> 4x the current maximum of 128). I only have results for Dilip's workload
> and "pgbench -N":
>
> [4]
> http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512
>
> [5]
> http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512
>
>
> The results are somewhat surprising, I guess, because the effect is
> wildly different for each workload.
>
> For Dilip's workload increasing clog buffers to 512 pretty much
> eliminates all benefits of the patches. For example with 288 client,
> the group_update patch gives ~60k tps on 128 buffers [1] but only 42k
> tps on 512 buffers [4].
>
> With "pgbench -N", the effect is exactly the opposite - while with
> 128 buffers there was pretty much no benefit from any of the patches
> [3], with 512 buffers we suddenly get almost 2x the throughput, but
> only for group_update and master (while the other two patches show no
> improvement at all).
>

The remaining benchmark with 512 clog buffers completed, and the impact
roughly matches Dilip's benchmark - that is, increasing the number of
clog buffers eliminates all positive impact of the patches observed on
128 buffers. Compare these two reports:

[a] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest

[b] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest-512

With 128 buffers the group_update and granular_locking patches achieve
up to 50k tps, while master and no_content_lock do ~30k tps. After
increasing the number of clog buffers, we get only ~30k tps in all cases.

I'm not sure what's causing this, whether we're hitting limits of the
simple LRU cache used for clog buffers, or something else. But maybe
there's something in the design of the clog buffers that makes them work
less efficiently with more buffers? I'm not sure whether that's something
we need to fix before eventually committing any of the patches.
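
One effect that might play a role here (a toy illustration below, not
the actual slru.c code): IIRC the SLRU code locates a cached page by
linearly scanning all buffer slots, while holding the control lock, so
the per-lookup cost grows with the number of buffers. A quick way to get
a feel for that:

/*
 * Toy illustration of a "find the page by scanning all slots" lookup,
 * which is roughly what the simple LRU does.  Slot counts and the
 * iteration count are arbitrary.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long lookups(int nslots, long iters)
{
    int        *page_number = malloc(nslots * sizeof(int));
    long        hits = 0;

    for (int i = 0; i < nslots; i++)
        page_number[i] = i;             /* pretend these pages are cached */

    for (long i = 0; i < iters; i++)
    {
        int         target = rand() % nslots;

        for (int slot = 0; slot < nslots; slot++)
            if (page_number[slot] == target)
            {
                hits++;                 /* found after ~nslots/2 probes */
                break;
            }
    }
    free(page_number);
    return hits;
}

int main(void)
{
    const long  iters = 2000000;
    int         sizes[] = {32, 128, 512};

    for (int i = 0; i < 3; i++)
    {
        clock_t     start = clock();
        long        h = lookups(sizes[i], iters);

        printf("%4d slots: %.2f s for %ld lookups (%ld hits)\n", sizes[i],
               (double) (clock() - start) / CLOCKS_PER_SEC, iters, h);
    }
    return 0;
}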

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 13:51:52
Message-ID: CAA4eK1Ksd6D0H9HPmMS3S7UpL2G8JMJ0kvRCDz=4=AqFn790sg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> On 10/27/2016 01:44 PM, Amit Kapila wrote:
>
> I've read that analysis, but I'm not sure I see how it explains the "zig
> zag" behavior. I do understand that shifting the contention to some other
> (already busy) lock may negatively impact throughput, or that the
> group_update may result in updating multiple clog pages, but I don't
> understand two things:
>
> (1) Why this should result in the fluctuations we observe in some of the
> cases. For example, why should we see 150k tps on, 72 clients, then drop to
> 92k with 108 clients, then back to 130k on 144 clients, then 84k on 180
> clients etc. That seems fairly strange.
>

I don't think hitting multiple clog pages has much to do with
client-count. However, we can wait to see your further detailed test
report.

> (2) Why this should affect all three patches, when only group_update has to
> modify multiple clog pages.
>

No, all three patches can be affected by multiple clog pages. Read the
second paragraph ("I think one of the probable reasons that could
happen for both the approaches") in the same e-mail [1]. It is basically
due to frequent release-and-reacquire of locks.

>
>
>>> On logged tables it usually looks like this (i.e. modest increase for
>>> high
>>> client counts at the expense of significantly higher variability):
>>>
>>> http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>>>
>>
>> What variability are you referring to in those results?
>
>>
>
> Good question. What I mean by "variability" is how stable the tps is during
> the benchmark (when measured on per-second granularity). For example, let's
> run a 10-second benchmark, measuring number of transactions committed each
> second.
>
> Then all those runs do 1000 tps on average:
>
> run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
> run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
> run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000
>

Generally, such behaviour is seen due to writes. Are WAL and DATA
on the same disk in your tests?

[1] - https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 13:55:59
Message-ID: CAA4eK1JBoStBJyb0gH=n6NszYNfezKm+Fo+uwphgY-0mtThxiw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 31, 2016 at 7:02 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> The remaining benchmark with 512 clog buffers completed, and the impact
> roughly matches Dilip's benchmark - that is, increasing the number of clog
> buffers eliminates all positive impact of the patches observed on 128
> buffers. Compare these two reports:
>
> [a] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest
>
> [b] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest-512
>
> With 128 buffers the group_update and granular_locking patches achieve up to
> 50k tps, while master and no_content_lock do ~30k tps. After increasing
> number of clog buffers, we get only ~30k in all cases.
>
> I'm not sure what's causing this, whether we're hitting limits of the simple
> LRU cache used for clog buffers, or something else.
>

I have also seen previously that increasing clog buffers to 256 can
impact performance negatively. So, probably here the gains due to the
group_update patch are negated by the impact of increasing clog
buffers. I am not sure it is a good idea to evaluate the impact of
increasing clog buffers along with this patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 14:28:54
Message-ID: 8efd9956-059a-78f3-32ff-f1e1a4dd09c8@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/31/2016 02:51 PM, Amit Kapila wrote:
> On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> On 10/27/2016 01:44 PM, Amit Kapila wrote:
>>
>> I've read that analysis, but I'm not sure I see how it explains the "zig
>> zag" behavior. I do understand that shifting the contention to some other
>> (already busy) lock may negatively impact throughput, or that the
>> group_update may result in updating multiple clog pages, but I don't
>> understand two things:
>>
>> (1) Why this should result in the fluctuations we observe in some of the
>> cases. For example, why should we see 150k tps on, 72 clients, then drop to
>> 92k with 108 clients, then back to 130k on 144 clients, then 84k on 180
>> clients etc. That seems fairly strange.
>>
>
> I don't think hitting multiple clog pages has much to do with
> client-count. However, we can wait to see your further detailed test
> report.
>
>> (2) Why this should affect all three patches, when only group_update has to
>> modify multiple clog pages.
>>
>
> No, all three patches can be affected due to multiple clog pages.
> Read second paragraph ("I think one of the probable reasons that could
> happen for both the approaches") in same e-mail [1]. It is basically
> due to frequent release-and-reacquire of locks.
>
>>
>>
>>>> On logged tables it usually looks like this (i.e. modest increase for
>>>> high
>>>> client counts at the expense of significantly higher variability):
>>>>
>>>> http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>>>>
>>>
>>> What variability are you referring to in those results?
>>
>>>
>>
>> Good question. What I mean by "variability" is how stable the tps is during
>> the benchmark (when measured on per-second granularity). For example, let's
>> run a 10-second benchmark, measuring number of transactions committed each
>> second.
>>
>> Then all those runs do 1000 tps on average:
>>
>> run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
>> run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
>> run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000
>>
>
> Generally, such behaviours are seen due to writes. Are WAL and DATA
> on same disk in your tests?
>

Yes, there's one RAID device on 10 SSDs, with 4GB of cache on the
controller. I've done some tests and it easily handles >1.5GB/s in
sequential writes, and >500MB/s in sustained random writes.

Also, let me point out that most of the tests were done so that the
whole data set fits into shared_buffers, and with no checkpoints during
the runs (so no writes to data files should really happen).

For example these tests were done on scale 3000 (45GB data set) with
64GB shared buffers:

[a]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-unlogged-sync-noskip-64

[b]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-async-noskip-64

and I could show similar cases with scale 300 on 16GB shared buffers.

In those cases, there's very little contention between WAL and the rest
of the database (in terms of I/O).

And moreover, this setup (single device for the whole cluster) is very
common, we can't just neglect it.

But my main point here really is that the trade-off in those cases may
not be really all that great, because you get the best performance at
36/72 clients, and then the tps drops and variability increases. At
least not right now, before tackling contention on the WAL lock (or
whatever lock becomes the bottleneck).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 19:43:56
Message-ID: CAA4eK1KC6uQHWhOMmkoACx1OJeKjcGxMU42WapjFVvN6FFuxJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 31, 2016 at 7:58 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 10/31/2016 02:51 PM, Amit Kapila wrote:
> And moreover, this setup (single device for the whole cluster) is very
> common, we can't just neglect it.
>
> But my main point here really is that the trade-off in those cases may not
> be really all that great, because you get the best performance at 36/72
> clients, and then the tps drops and variability increases. At least not
> right now, before tackling contention on the WAL lock (or whatever lock
> becomes the bottleneck).
>

Okay, but do the wait event results show an increase in contention on
some other locks for pgbench-3000-logged-sync-skip-64? Can you share the
wait events for the runs where there is a fluctuation?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 21:36:28
Message-ID: ca9836a6-3820-ce99-a2c8-853c91e4f896@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/31/2016 08:43 PM, Amit Kapila wrote:
> On Mon, Oct 31, 2016 at 7:58 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 10/31/2016 02:51 PM, Amit Kapila wrote:
>> And moreover, this setup (single device for the whole cluster) is very
>> common, we can't just neglect it.
>>
>> But my main point here really is that the trade-off in those cases may not
>> be really all that great, because you get the best performance at 36/72
>> clients, and then the tps drops and variability increases. At least not
>> right now, before tackling contention on the WAL lock (or whatever lock
>> becomes the bottleneck).
>>
>
> Okay, but does wait event results show increase in contention on some
> other locks for pgbench-3000-logged-sync-skip-64? Can you share wait
> events for the runs where there is a fluctuation?
>

Sure, I do have wait event stats, including a summary for different
client counts - see this:

http://tvondra.bitbucket.org/by-test/pgbench-3000-logged-sync-skip-64.txt

Looking only at the group_update patch for three interesting client
counts, it looks like this:

 wait_event_type | wait_event        |     108     144      180
-----------------+-------------------+-----------------------------
 LWLockNamed     | WALWriteLock      |  661284  847057  1006061
                 |                   |  126654  191506   265386
 Client          | ClientRead        |   37273   52791    64799
 LWLockTranche   | wal_insert        |   28394   51893    79932
 LWLockNamed     | CLogControlLock   |    7766   14913    23138
 LWLockNamed     | WALBufMappingLock |    3615    3739     3803
 LWLockNamed     | ProcArrayLock     |     913    1776     2685
 Lock            | extend            |     909    2082     2228
 LWLockNamed     | XidGenLock        |     301     349      675
 LWLockTranche   | clog              |     173     331      607
 LWLockTranche   | buffer_content    |     163     468      737
 LWLockTranche   | lock_manager      |      88     140      145

Compared to master, this shows a significant reduction of contention on
CLogControlLock (which on master has 20k, 83k and 200k samples), with the
contention moving to WALWriteLock.

But perhaps you're asking about variability during the benchmark? I
suppose that could be extracted from the collected data, but I haven't
done that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-31 21:48:01
Message-ID: 5275bf49-545e-e189-48c5-17b5defc45a2@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/31/2016 02:24 PM, Tomas Vondra wrote:
> On 10/31/2016 05:01 AM, Jim Nasby wrote:
>> On 10/30/16 1:32 PM, Tomas Vondra wrote:
>>>
>>> Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
>>> some sort of CPU / OS scheduling artifact. For example, the system has
>>> 36 physical cores, 72 virtual ones (thanks to HT). I find it strange
>>> that the "good" client counts are always multiples of 72, while the
>>> "bad" ones fall in between.
>>>
>>> 72 = 72 * 1 (good)
>>> 108 = 72 * 1.5 (bad)
>>> 144 = 72 * 2 (good)
>>> 180 = 72 * 2.5 (bad)
>>> 216 = 72 * 3 (good)
>>> 252 = 72 * 3.5 (bad)
>>> 288 = 72 * 4 (good)
>>>
>>> So maybe this has something to do with how OS schedules the tasks, or
>>> maybe some internal heuristics in the CPU, or something like that.
>>
>> It might be enlightening to run a series of tests that are 72*.1 or *.2
>> apart (say, 72, 79, 86, ..., 137, 144).
>
> Yeah, I've started a benchmark with client a step of 6 clients
>
> 36 42 48 54 60 66 72 78 ... 252 258 264 270 276 282 288
>
> instead of just
>
> 36 72 108 144 180 216 252 288
>
> which did a test every 36 clients. To compensate for the 6x longer runs,
> I'm only running tests for "group-update" and "master", so I should have
> the results in ~36h.
>

So I've been curious and looked at results of the runs executed so far,
and for the group_update patch it looks like this:

clients tps
-----------------
36 117663
42 139791
48 129331
54 144970
60 124174
66 137227
72 146064
78 100267
84 141538
90 96607
96 139290
102 93976
108 136421
114 91848
120 133563
126 89801
132 132607
138 87912
144 129688
150 87221
156 129608
162 85403
168 130193
174 83863
180 129337
186 81968
192 128571
198 82053
204 128020
210 80768
216 124153
222 80493
228 125503
234 78950
240 125670
246 78418
252 123532
258 77623
264 124366
270 76726
276 119054
282 76960
288 121819

So, similar saw-like behavior, perfectly periodic. But the really
strange thing is the peaks/valleys don't match those observed before!

That is, during the previous runs, 72, 144, 216 and 288 were "good"
while 108, 180 and 252 were "bad". But in these runs, all of those client
counts are "good" ...

Honestly, I have no idea what to think about this ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-11-01 19:13:16
Message-ID: CA+TgmobmDLM1dzFvN_tJoBunarMpz+cehx1cVfv-QMKk4rZLPQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 31, 2016 at 5:48 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Honestly, I have no idea what to think about this ...

I think a lot of the details here depend on OS scheduler behavior.
For example, here's one of the first scalability graphs I ever did:

http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html

It's a nice advertisement for fast-path locking, but look at the funny
shape of the red and green lines between 1 and 32 cores. The curve is
oddly bowl-shaped. As the post discusses, we actually dip WAY under
linear scalability in the 8-20 core range and then shoot up like a
rocket afterwards so that at 32 cores we actually achieve super-linear
scalability. You can't blame this on anything except Linux. Someone
shared BSD graphs (I forget which flavor) with me privately and they
don't exhibit this poor behavior. (They had different poor behaviors
instead - performance collapsed at high client counts. That was a
long time ago so it's probably fixed now.)

This is why I think it's fundamentally wrong to look at this patch and
say "well, contention goes down, and in some cases that makes
performance go up, but because in other cases it decreases performance
or increases variability we shouldn't commit it". If we took that
approach, we wouldn't have fast-path locking today, because the early
versions of fast-path locking could exhibit *major* regressions
precisely because of contention shifting to other locks, specifically
SInvalReadLock and msgNumLock. (cf. commit
b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4). If we say that because the
contention on those other locks can get worse as a result of
contention on this lock being reduced, or even worse, if we try to
take responsibility for what effect reducing lock contention might
have on the operating system scheduler discipline (which will
certainly differ from system to system and version to version), we're
never going to get anywhere, because there's almost always going to be
some way that reducing contention in one place can bite you someplace
else.

I also believe it's pretty normal for patches that remove lock
contention to increase variability. If you run an auto race where
every car has a speed governor installed that limits it to 80 kph,
there will be much less variability in the finish times than if you
remove the governor, but that's a stupid way to run a race. You won't
get much innovation around increasing the top speed of the cars under
those circumstances, either. Nobody ever bothered optimizing the
contention around msgNumLock before fast-path locking happened,
because the heavyweight lock manager burdened the system so heavily
that you couldn't generate enough contention on it to matter.
Similarly, we're not going to get much traction around optimizing the
other locks to which contention would shift if we applied this patch
unless we apply it. This is not theoretical: EnterpriseDB staff have
already done work on trying to optimize WALWriteLock, but it's hard to
get a benefit. The more of the other contention we eliminate, the
easier it will be to see whether a proposed change to WALWriteLock
helps. Of course, we'll also be more at the mercy of operating system
scheduler discipline, but that's not all a bad thing either. The
Linux kernel guys have been known to run PostgreSQL to see whether
proposed changes help or hurt, but they're not going to try those
tests after applying patches that we rejected because they expose us
to existing Linux shortcomings.

I don't want to be perceived as advocating too forcefully for a patch
that was, after all, written by a colleague. However, I sincerely
believe it's a mistake to say that a patch which reduces lock
contention must show a tangible win or at least no loss on every piece
of hardware, on every kernel, at every client count with no increase
in variability in any configuration. Very few (if any) patches are
going to be able to meet that bar, and if we make that the bar, people
aren't going to write patches to reduce lock contention in PostgreSQL.
For that to be worth doing, you have to be able to get the patch
committed in finite time. We've spent an entire release cycle
dithering over this patch. Several alternative patches have been
written that are not any better (and the people who wrote those
patches don't seem especially interested in doing further work on them
anyway). There is increasing evidence that the patch is effective at
solving the problem it claims to solve, and that any downsides are
just the result of poor lock-scaling behavior elsewhere which we could
be working on fixing if we weren't still spending time on this. Is
that really not good enough?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-11-02 03:31:16
Message-ID: ca714e28-81a2-440c-bd92-79fe1fa0fc15@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/01/2016 08:13 PM, Robert Haas wrote:
> On Mon, Oct 31, 2016 at 5:48 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Honestly, I have no idea what to think about this ...
>
> I think a lot of the details here depend on OS scheduler behavior.
> For example, here's one of the first scalability graphs I ever did:
>
> http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html
>
> It's a nice advertisement for fast-path locking, but look at the funny
> shape of the red and green lines between 1 and 32 cores. The curve is
> oddly bowl-shaped. As the post discusses, we actually dip WAY under
> linear scalability in the 8-20 core range and then shoot up like a
> rocket afterwards so that at 32 cores we actually achieve super-linear
> scalability. You can't blame this on anything except Linux. Someone
> shared BSD graphs (I forget which flavor) with me privately and they
> don't exhibit this poor behavior. (They had different poor behaviors
> instead - performance collapsed at high client counts. That was a
> long time ago so it's probably fixed now.)
>
> This is why I think it's fundamentally wrong to look at this patch and
> say "well, contention goes down, and in some cases that makes
> performance go up, but because in other cases it decreases performance
> or increases variability we shouldn't commit it". If we took that
> approach, we wouldn't have fast-path locking today, because the early
> versions of fast-path locking could exhibit *major* regressions
> precisely because of contention shifting to other locks, specifically
> SInvalReadLock and msgNumLock. (cf. commit
> b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4). If we say that because the
> contention on those other locks can get worse as a result of
> contention on this lock being reduced, or even worse, if we try to
> take responsibility for what effect reducing lock contention might
> have on the operating system scheduler discipline (which will
> certainly differ from system to system and version to version), we're
> never going to get anywhere, because there's almost always going to be
> some way that reducing contention in one place can bite you someplace
> else.
>

I don't think I've suggested not committing any of the clog patches (or
other patches in general) because shifting the contention somewhere else
might cause regressions. At the end of the last CF I've however stated
that we need to better understand the impact on various workloads, and I
think Amit agreed with that conclusion.

We have that understanding now, I believe - also thanks to your idea of
sampling wait events data.

You're right we can't fix all the contention points in one patch, and
that shifting the contention may cause regressions. But we should at
least understand what workloads might be impacted, how serious the
regressions may get etc. Which is why all the testing was done.

> I also believe it's pretty normal for patches that remove lock
> contention to increase variability. If you run an auto race where
> every car has a speed governor installed that limits it to 80 kph,
> there will be much less variability in the finish times than if you
> remove the governor, but that's a stupid way to run a race. You won't
> get much innovation around increasing the top speed of the cars under
> those circumstances, either. Nobody ever bothered optimizing the
> contention around msgNumLock before fast-path locking happened,
> because the heavyweight lock manager burdened the system so heavily
> that you couldn't generate enough contention on it to matter.
> Similarly, we're not going to get much traction around optimizing the
> other locks to which contention would shift if we applied this patch
> unless we apply it. This is not theoretical: EnterpriseDB staff have
> already done work on trying to optimize WALWriteLock, but it's hard to
> get a benefit. The more contention other contention we eliminate, the
> easier it will be to see whether a proposed change to WALWriteLock
> helps.

Sure, I understand that. My main worry was that people will get worse
performance with the next major version than what they get now (assuming
we don't manage to address the other contention points). Which is
difficult to explain to users & customers, no matter how reasonable it
seems to us.

The difference is that both the fast-path locks and msgNumLock went into
9.2, so that end users probably never saw that regression. But we don't
know if that happens for clog and WAL.

Perhaps you have a working patch addressing the WAL contention, so that
we could see how that changes the results?

> Of course, we'll also be more at the mercy of operating system
> scheduler discipline, but that's not all a bad thing either. The
> Linux kernel guys have been known to run PostgreSQL to see whether
> proposed changes help or hurt, but they're not going to try those
> tests after applying patches that we rejected because they expose us
> to existing Linux shortcomings.
>

I might be wrong, but I doubt the kernel guys are running a particularly
wide set of tests, so how likely is it that they will notice issues with
specific workloads? Wouldn't it be great if we could tell them there's a
bug and provide a workload that reproduces it?

I don't see how "it's a Linux issue" makes it someone else's problem.
The kernel guys can't really test everything (and are not obliged to).
It's up to us to do more testing in this area, and report issues to the
kernel guys (which is not happening as much as it should).

>
> I don't want to be perceived as advocating too forcefully for a
> patch that was, after all, written by a colleague. However, I
> sincerely believe it's a mistake to say that a patch which reduces
> lock contention must show a tangible win or at least no loss on every
> piece of hardware, on every kernel, at every client count with no
> increase in variability in any configuration.
>

I don't think anyone suggested that.

>
> Very few (if any) patches are going to be able to meet that bar, and
> if we make that the bar, people aren't going to write patches to
> reduce lock contention in PostgreSQL. For that to be worth doing, you
> have to be able to get the patch committed in finite time. We've
> spent an entire release cycle dithering over this patch. Several
> alternative patches have been written that are not any better (and
> the people who wrote those patches don't seem especially interested
> in doing further work on them anyway). There is increasing evidence
> that the patch is effective at solving the problem it claims to
> solve, and that any downsides are just the result of poor
> lock-scaling behavior elsewhere which we could be working on fixing
> if we weren't still spending time on this. Is that really not good
> enough?
>

Except that a few days ago, after getting results from the last round of
tests, I've stated that we haven't really found any regressions that
would matter, and that group_update seems to be performing the best (and
actually significantly improves results for some of the tests). I
haven't done any code review, though.

The one remaining thing is the strange zig-zag behavior, but that might
easily be due to scheduling in the kernel, or something else. I don't
consider it a blocker for any of the patches, though.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-11-02 16:52:45
Message-ID: CAA4eK1JjatUZu0+HCi=5VM1q-hFgN_OhegPAwEUJqxf-7pESbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 2, 2016 at 9:01 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 11/01/2016 08:13 PM, Robert Haas wrote:
>>
>> On Mon, Oct 31, 2016 at 5:48 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>
> The one remaining thing is the strange zig-zag behavior, but that might
> easily be a due to scheduling in kernel, or something else. I don't consider
> it a blocker for any of the patches, though.
>

The only reason I could think of for that zig-zag behaviour is frequent
access to multiple clog pages, which could be due to the reasons below (a
minimal sketch of the same-page check involved in (a) and (b) follows the
list):

a. A transaction and its subtransactions (IIRC, Dilip's case has one
main transaction and two subtransactions) don't fit on the same page, in
which case the group_update optimization won't apply and I don't think
we can do anything about it.
b. Within the same group, multiple clog pages are being accessed. It is
not a likely scenario, but it can happen and we might be able to
improve things a bit if that is happening.
c. Transactions try to update different clog pages at the same time.
I think, as mentioned upthread, we can handle this by using slots and
allowing multiple groups to work together instead of a single group.
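
To make (a) and (b) concrete, here is a minimal sketch of the kind of
same-page check the group update depends on. TransactionIdToPage() and
CLOG_XACTS_PER_PAGE are the existing clog.c macros; the helper function
itself is hypothetical and only illustrates the idea, it is not taken
from the patch.

    /*
     * Hypothetical helper: returns true only when the top-level XID and
     * all of its subtransaction XIDs map to the same clog page, i.e. the
     * case the group_update optimization can handle.
     * TransactionIdToPage() is the clog.c macro xid / CLOG_XACTS_PER_PAGE.
     */
    static bool
    AllXidsOnSamePage(TransactionId xid, int nsubxids, TransactionId *subxids)
    {
        int     pageno = TransactionIdToPage(xid);
        int     i;

        for (i = 0; i < nsubxids; i++)
        {
            if (TransactionIdToPage(subxids[i]) != pageno)
                return false;   /* cases (a)/(b): more than one clog page */
        }
        return true;
    }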

To check whether there is any impact due to (a) or (b), I have added a
few log messages in the code (patch - group_update_clog_v9_log). The log
message will be either "all xacts are not on same page" or "Group
contains different pages".

Patch group_update_clog_v9_slots tries to address (c). So if there is
any problem due to (c), this patch should improve the situation.

Can you please try to run the test where you saw the zig-zag behaviour
with each of the two patches separately? If anything on the Postgres side
is responsible, you should either see one of the new log messages or see
the performance improve; OTOH, if we see the same behaviour, then I think
we can probably attribute it to scheduler activity and move on.
Also, one point to note here is that even where the performance dips in
that curve, it is equal to or better than HEAD.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v9_log.patch application/octet-stream 15.9 KB
group_update_clog_v9_slots.patch application/octet-stream 16.8 KB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-11-02 17:18:38
Message-ID: 04e39ac9-af28-0fda-8f72-7268197d281f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/02/2016 05:52 PM, Amit Kapila wrote:
> On Wed, Nov 2, 2016 at 9:01 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 11/01/2016 08:13 PM, Robert Haas wrote:
>>>
>>> On Mon, Oct 31, 2016 at 5:48 PM, Tomas Vondra
>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>
>>
>> The one remaining thing is the strange zig-zag behavior, but that might
>> easily be a due to scheduling in kernel, or something else. I don't consider
>> it a blocker for any of the patches, though.
>>
>
> The only reason I could think of for that zig-zag behaviour is
> frequent multiple clog page accesses and it could be due to below
> reasons:
>
> a. transaction and its subtransactions (IIRC, Dilip's case has one
> main transaction and two subtransactions) can't fit into same page, in
> which case the group_update optimization won't apply and I don't think
> we can do anything for it.
> b. In the same group, multiple clog pages are being accessed. It is
> not a likely scenario, but it can happen and we might be able to
> improve a bit if that is happening.
> c. The transactions at same time tries to update different clog page.
> I think as mentioned upthread we can handle it by using slots an
> allowing multiple groups to work together instead of a single group.
>
> To check if there is any impact due to (a) or (b), I have added few
> logs in code (patch - group_update_clog_v9_log). The log message
> could be "all xacts are not on same page" or "Group contains
> different pages".
>
> Patch group_update_clog_v9_slots tries to address (c). So if there
> is any problem due to (c), this patch should improve the situation.
>
> Can you please try to run the test where you saw zig-zag behaviour
> with both the patches separately? I think if there is anything due
> to postgres, then you can see either one of the new log message or
> performance will be improved, OTOH if we see same behaviour, then I
> think we can probably assume it due to scheduler activity and move
> on. Also one point to note here is that even when the performance is
> down in that curve, it is equal to or better than HEAD.
>

Will do.

Based on the results with more client counts (incrementing by 6 clients
instead of 36), I think this really looks like something unrelated to
any of the patches - kernel, CPU, or something already present in
current master.

The attached results show that:

(a) master shows the same zig-zag behavior - no idea why this wasn't
observed in the previous runs.

(b) group_update actually seems to improve the situation, because the
performance stays stable up to 72 clients, while on master the
fluctuation starts much earlier.

I'll redo the tests with a newer kernel - this was on 3.10.x, which is
what Red Hat 7.2 uses; I'll try 4.8.6. Then I'll try the patches
you submitted, if the 4.8.6 kernel does not help.

Overall, I'm convinced this issue is unrelated to the patches.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
image/png 40.0 KB
zig-zag.ods application/vnd.oasis.opendocument.spreadsheet 25.8 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-11-03 15:08:02
Message-ID: CA+TgmoYC_tSGgZHWajuC8kwu_ZPrttZew0OwnH6Fcrs+UigS+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 1, 2016 at 11:31 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I don't think I've suggested not committing any of the clog patches (or
> other patches in general) because shifting the contention somewhere else
> might cause regressions. At the end of the last CF I've however stated that
> we need to better understand the impact on various workloads, and I think
> Amit agreed with that conclusion.
>
> We have that understanding now, I believe - also thanks to your idea of
> sampling wait events data.
>
> You're right we can't fix all the contention points in one patch, and that
> shifting the contention may cause regressions. But we should at least
> understand what workloads might be impacted, how serious the regressions may
> get etc. Which is why all the testing was done.

OK.

> Sure, I understand that. My main worry was that people will get worse
> performance with the next major version that what they get now (assuming we
> don't manage to address the other contention points). Which is difficult to
> explain to users & customers, no matter how reasonable it seems to us.
>
> The difference is that both the fast-path locks and msgNumLock went into
> 9.2, so that end users probably never saw that regression. But we don't know
> if that happens for clog and WAL.
>
> Perhaps you have a working patch addressing the WAL contention, so that we
> could see how that changes the results?

I don't think we do, yet. Amit or Kuntal might know more. At some
level I think we're just hitting the limits of the hardware's ability
to lay bytes on a platter, and fine-tuning the locking may not help
much.

> I might be wrong, but I doubt the kernel guys are running particularly wide
> set of tests, so how likely is it they will notice issues with specific
> workloads? Wouldn't it be great if we could tell them there's a bug and
> provide a workload that reproduces it?
>
> I don't see how "it's a Linux issue" makes it someone else's problem. The
> kernel guys can't really test everything (and are not obliged to). It's up
> to us to do more testing in this area, and report issues to the kernel guys
> (which is not happening as much as it should).

I don't exactly disagree with any of that. I just want to find a
course of action that we can agree on and move forward. This has been
cooking for a long time, and I want to converge on some resolution.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-11-04 09:20:06
Message-ID: CAA4eK1JHBmkUmWK1vNsBD4qD+EhuAf3FkVDt0r0rxAJc9aTN+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 3, 2016 at 8:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 1, 2016 at 11:31 PM, Tomas Vondra
>> The difference is that both the fast-path locks and msgNumLock went into
>> 9.2, so that end users probably never saw that regression. But we don't know
>> if that happens for clog and WAL.
>>
>> Perhaps you have a working patch addressing the WAL contention, so that we
>> could see how that changes the results?
>
> I don't think we do, yet.
>

Right. At this stage, we are just evaluating ways (the basic idea is
to split the OS writes and the flush requests into separate locks) to
reduce it. It is difficult to speculate about results at this stage. I
think after spending some more time on it (probably a few weeks), we will
be in a position to share our findings.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-12-05 00:30:36
Message-ID: CAJrrPGeYh2_=AqZr9WCO86xV+JwY+L6bMWSxMSx5xwHJsaHwHA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 4, 2016 at 8:20 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Thu, Nov 3, 2016 at 8:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Tue, Nov 1, 2016 at 11:31 PM, Tomas Vondra
> >> The difference is that both the fast-path locks and msgNumLock went into
> >> 9.2, so that end users probably never saw that regression. But we don't
> know
> >> if that happens for clog and WAL.
> >>
> >> Perhaps you have a working patch addressing the WAL contention, so that
> we
> >> could see how that changes the results?
> >
> > I don't think we do, yet.
> >
>
> Right. At this stage, we are just evaluating the ways (basic idea is
> to split the OS writes and Flush requests in separate locks) to reduce
> it. It is difficult to speculate results at this stage. I think
> after spending some more time (probably few weeks), we will be in
> position to share our findings.
>
>
As per my understanding, the current state of the patch is that it is
waiting for performance results from the author.

Moved to the next CF with "waiting on author" status. Please feel free to
update the status if the current status differs from the actual patch
status.

Regards,
Hari Babu
Fujitsu Australia


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-12-05 02:14:05
Message-ID: CAA4eK1+n3tjnEOmoOuU66_Smd_+a=sMCyt3V=fEJxaq6V015PA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 5, 2016 at 6:00 AM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com> wrote:
>
>
> On Fri, Nov 4, 2016 at 8:20 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Thu, Nov 3, 2016 at 8:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> > On Tue, Nov 1, 2016 at 11:31 PM, Tomas Vondra
>> >> The difference is that both the fast-path locks and msgNumLock went
>> >> into
>> >> 9.2, so that end users probably never saw that regression. But we don't
>> >> know
>> >> if that happens for clog and WAL.
>> >>
>> >> Perhaps you have a working patch addressing the WAL contention, so that
>> >> we
>> >> could see how that changes the results?
>> >
>> > I don't think we do, yet.
>> >
>>
>> Right. At this stage, we are just evaluating the ways (basic idea is
>> to split the OS writes and Flush requests in separate locks) to reduce
>> it. It is difficult to speculate results at this stage. I think
>> after spending some more time (probably few weeks), we will be in
>> position to share our findings.
>>
>
> As per my understanding the current state of the patch is waiting for the
> performance results from author.
>

No, that is not true. You have quoted the wrong message; that
discussion was about WALWriteLock contention, not about the patch being
discussed in this thread. I have posted the latest set of patches
here [1]. Tomas is supposed to share the results of his tests. He
mentioned to me at PGConf Asia last week that he ran a few tests on a
Power box, so let us wait for him to share his findings.

> Moved to next CF with "waiting on author" status. Please feel free to
> update the status if the current status differs with the actual patch
> status.
>

I think we should keep the status as "Needs Review".

[1] - https://www.postgresql.org/message-id/CAA4eK1JjatUZu0%2BHCi%3D5VM1q-hFgN_OhegPAwEUJqxf-7pESbg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-12-05 02:38:26
Message-ID: CAJrrPGcO8XPNHNuOmrpFTtj8=5tWwgSBbmEfR53dFx_g3efYYg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 5, 2016 at 1:14 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Mon, Dec 5, 2016 at 6:00 AM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
> wrote:
>
> No, that is not true. You have quoted the wrong message, that
> discussion was about WALWriteLock contention not about the patch being
> discussed in this thread. I have posted the latest set of patches
> here [1]. Tomas is supposed to share the results of his tests. He
> mentioned to me in PGConf Asia last week that he ran few tests on
> Power Box, so let us wait for him to share his findings.
>
> > Moved to next CF with "waiting on author" status. Please feel free to
> > update the status if the current status differs with the actual patch
> > status.
> >
>
> I think we should keep the status as "Needs Review".
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1JjatUZu0%
> 2BHCi%3D5VM1q-hFgN_OhegPAwEUJqxf-7pESbg%40mail.gmail.com

Thanks for the update.
I have changed the status to "needs review" in the 2017-01 commitfest.

Regards,
Hari Babu
Fujitsu Australia


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-12-22 13:29:13
Message-ID: 84c22fbb-b9c4-a02f-384b-b4feb2c67193@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

> The attached results show that:
>
> (a) master shows the same zig-zag behavior - No idea why this wasn't
> observed on the previous runs.
>
> (b) group_update actually seems to improve the situation, because the
> performance keeps stable up to 72 clients, while on master the
> fluctuation starts way earlier.
>
> I'll redo the tests with a newer kernel - this was on 3.10.x which is
> what Red Hat 7.2 uses, I'll try on 4.8.6. Then I'll try with the patches
> you submitted, if the 4.8.6 kernel does not help.
>
> Overall, I'm convinced this issue is unrelated to the patches.

I've been unable to rerun the tests on this hardware with a newer
kernel, so nothing new on the x86 front.

But as discussed with Amit in Tokyo at pgconf.asia, I got access to a
Power8e machine (IBM 8247-22L to be precise). It's a much smaller
machine compared to the x86 one, though - it only has 24 cores in 2
sockets, 128GB of RAM and less powerful storage, for example.

I've repeated a subset of x86 tests and pushed them to

https://bitbucket.org/tvondra/power8-results-2

The new results are prefixed with "power-" and I've tried to put them
right next to the "same" x86 tests.

In all cases the patches significantly reduce the contention on
CLogControlLock, just like on x86, which is good and expected.

Otherwise the results are rather boring - no major regressions compared
to master, and all the patches perform almost exactly the same. Compare
for example this:

* http://tvondra.bitbucket.org/#dilip-300-unlogged-sync

* http://tvondra.bitbucket.org/#power-dilip-300-unlogged-sync

So the results seem much smoother compared to x86, and the performance
difference is roughly 3x, which matches the 24 vs. 72 cores.

For pgbench, the difference is much more significant, though:

* http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

* http://tvondra.bitbucket.org/#power-pgbench-300-unlogged-sync-skip

So, we're doing ~40k on Power8 but ~220k on x86 (which is ~6x more, so
double the per-core throughput). My first guess was that this is due to
the x86 machine having a better I/O subsystem, so I reran the tests with
the data directory on tmpfs, but that produced almost the same results.

Of course, this observation is unrelated to this patch.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-12-23 02:58:46
Message-ID: CAA4eK1KAysCmBaYGbFXyu0wGTZuMWeNgYBHNd3goNrfD0-G2CA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 22, 2016 at 6:59 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> But as discussed with Amit in Tokyo at pgconf.asia, I got access to a
> Power8e machine (IBM 8247-22L to be precise). It's a much smaller machine
> compared to the x86 one, though - it only has 24 cores in 2 sockets, 128GB
> of RAM and less powerful storage, for example.
>
> I've repeated a subset of x86 tests and pushed them to
>
> https://bitbucket.org/tvondra/power8-results-2
>
> The new results are prefixed with "power-" and I've tried to put them right
> next to the "same" x86 tests.
>
> In all cases the patches significantly reduce the contention on
> CLogControlLock, just like on x86. Which is good and expected.
>

The results look positive. Do you think we can conclude, based on all
the tests you and Dilip have done, that we can move forward with this
patch (in particular group-update), or do you still want to do more
tests? I am aware that in one of the tests we observed that
reducing contention on CLOGControlLock increased the contention on
WALWriteLock, but I feel we can leave that point as a note to the
committer and let them take the final call. From the code perspective,
Robert and Andres have already taken one pass of review and I have
addressed all their comments, so surely more review of the code can
help, but I think that is not a big deal considering the patch is
relatively small.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-12-27 23:13:12
Message-ID: 91d57161-d3ea-0cc2-6066-80713e4f90d7@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/23/2016 03:58 AM, Amit Kapila wrote:
> On Thu, Dec 22, 2016 at 6:59 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> But as discussed with Amit in Tokyo at pgconf.asia, I got access to a
>> Power8e machine (IBM 8247-22L to be precise). It's a much smaller machine
>> compared to the x86 one, though - it only has 24 cores in 2 sockets, 128GB
>> of RAM and less powerful storage, for example.
>>
>> I've repeated a subset of x86 tests and pushed them to
>>
>> https://bitbucket.org/tvondra/power8-results-2
>>
>> The new results are prefixed with "power-" and I've tried to put them right
>> next to the "same" x86 tests.
>>
>> In all cases the patches significantly reduce the contention on
>> CLogControlLock, just like on x86. Which is good and expected.
>>
>
> The results look positive. Do you think we can conclude based on all
> the tests you and Dilip have done, that we can move forward with this
> patch (in particular group-update) or do you still want to do more
> tests? I am aware that in one of the tests we have observed that
> reducing contention on CLOGControlLock has increased the contention on
> WALWriteLock, but I feel we can leave that point as a note to
> committer and let him take a final call. From the code perspective
> already Robert and Andres have taken one pass of review and I have
> addressed all their comments, so surely more review of code can help,
> but I think that is not a big deal considering patch size is
> relatively small.
>

Yes, I believe that is a reasonable conclusion. I've done a few
more tests on the Power machine with the data placed on a tmpfs filesystem
(to minimize the I/O overhead), but the results are the same.

I don't think more testing is needed at this point, at least not with the
synthetic test cases we've been using. The patch has already
received far more benchmarking than most other patches.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-12-29 05:11:25
Message-ID: CAFiTN-tTazysx8GvDFgNvCh17J9AbGM=XgD6PftkoWVCMUSsXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 23, 2016 at 8:28 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> The results look positive. Do you think we can conclude based on all
> the tests you and Dilip have done, that we can move forward with this
> patch (in particular group-update) or do you still want to do more
> tests? I am aware that in one of the tests we have observed that
> reducing contention on CLOGControlLock has increased the contention on
> WALWriteLock, but I feel we can leave that point as a note to
> committer and let him take a final call. From the code perspective
> already Robert and Andres have taken one pass of review and I have
> addressed all their comments, so surely more review of code can help,
> but I think that is not a big deal considering patch size is
> relatively small.

I have done one more pass of the review today. I have a few comments.

+ if (nextidx != INVALID_PGPROCNO)
+ {
+ /* Sleep until the leader updates our XID status. */
+ for (;;)
+ {
+ /* acts as a read barrier */
+ PGSemaphoreLock(&proc->sem);
+ if (!proc->clogGroupMember)
+ break;
+ extraWaits++;
+ }
+
+ Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
+
+ /* Fix semaphore count for any absorbed wakeups */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
+ return true;
+ }

1. extraWaits is used only locally in this block, so I guess we can
declare it inside this block.

2. It seems that we have missed one unlock in the case of absorbed
wakeups. You have initialised extraWaits to -1, and if there is one
extra wakeup then extraWaits will become 0 (meaning we have made one
extra call to PGSemaphoreLock, and it's our responsibility to fix that,
as the leader will unlock only once). But it appears that in such a case
we will not make any call to PGSemaphoreUnlock. Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-12-31 03:31:05
Message-ID: CAA4eK1J+67edo_Wnrfx8oJ+rWM_BAr+v6JqvQjKPdLOxR=0d5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 29, 2016 at 10:41 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have done one more pass of the review today. I have few comments.
>
> + if (nextidx != INVALID_PGPROCNO)
> + {
> + /* Sleep until the leader updates our XID status. */
> + for (;;)
> + {
> + /* acts as a read barrier */
> + PGSemaphoreLock(&proc->sem);
> + if (!proc->clogGroupMember)
> + break;
> + extraWaits++;
> + }
> +
> + Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> +
> + /* Fix semaphore count for any absorbed wakeups */
> + while (extraWaits-- > 0)
> + PGSemaphoreUnlock(&proc->sem);
> + return true;
> + }
>
> 1. extraWaits is used only locally in this block so I guess we can
> declare inside this block only.
>

Agreed and changed accordingly.

> 2. It seems that we have missed one unlock in case of absorbed
> wakeups. You have initialised extraWaits with -1 and if there is one
> extra wake up then extraWaits will become 0 (it means we have made one
> extra call to PGSemaphoreLock and it's our responsibility to fix it as
> the leader will Unlock only once). But it appear in such case we will
> not make any call to PGSemaphoreUnlock.
>

Good catch! I have fixed it by initialising extraWaits to 0. The
same issue exists in the group clear xid code, for which I will send a
patch separately.
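
For clarity, a simplified sketch of the corrected bookkeeping; it mirrors
the quoted hunk (keeping the pre-be7b2848 &proc->sem style used there)
rather than the final patch text:

    int     extraWaits = 0;     /* wakeups absorbed on behalf of others */

    /* Sleep until the leader updates our XID status. */
    for (;;)
    {
        PGSemaphoreLock(&proc->sem);    /* acts as a read barrier */
        if (!proc->clogGroupMember)
            break;              /* the leader has processed our status */
        extraWaits++;           /* this wakeup was meant for something else */
    }

    /* Return absorbed wakeups so the semaphore count stays balanced. */
    while (extraWaits-- > 0)
        PGSemaphoreUnlock(&proc->sem);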

Apart from the above, the patch needs to be adjusted for commit be7b2848,
which changed the definition of PGSemaphore.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v10.patch application/octet-stream 15.6 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-01-11 05:25:48
Message-ID: CAFiTN-sSp=UkeXezHnapuaLCNWmmAgzuYVEODbOwj5K2zbgWng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 31, 2016 at 9:01 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Agreed and changed accordingly.
>
>> 2. It seems that we have missed one unlock in case of absorbed
>> wakeups. You have initialised extraWaits with -1 and if there is one
>> extra wake up then extraWaits will become 0 (it means we have made one
>> extra call to PGSemaphoreLock and it's our responsibility to fix it as
>> the leader will Unlock only once). But it appear in such case we will
>> not make any call to PGSemaphoreUnlock.
>>
>
> Good catch! I have fixed it by initialising extraWaits to 0. This
> same issue exists from Group clear xid for which I will send a patch
> separately.
>
> Apart from above, the patch needs to be adjusted for commit be7b2848
> which has changed the definition of PGSemaphore.

I have reviewed the latest patch and I don't have any more comments.
So, if there is no objection from other reviewers, can I move it to
"Ready For Committer"?

I have performed one more test, at scale factor 3000, because
previously I had tested only up to scale factor 1000. The purpose of this
test is to check whether there is any regression at a higher scale
factor.

Machine: Intel 8 socket machine.
Scale Factor: 3000
Shared Buffer: 8GB
Test: Pgbench RW test.
Run: 30 mins, median of 3

Other modified GUCs:
-N 300 -c min_wal_size=15GB -c max_wal_size=20GB -c
checkpoint_timeout=900 -c maintenance_work_mem=1GB -c
checkpoint_completion_target=0.9

Summary:
- Did not observe any regression.
- The performance gain is in line with what we have observed in
other tests at lower scale factors.

Sync_Commit_Off:
client Head Patch

8 10065 10009
16 18487 18826
32 28167 28057
64 26655 28712
128 20152 24917
256 16740 22891

Sync_Commit_On:

Client Head Patch

8 5102 5110
16 8087 8282
32 12523 12548
64 14701 15112
128 14656 15238
256 13421 16424

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-01-17 06:09:38
Message-ID: CAFiTN-tZbtsbAx1Wrsmqd7Z8weTOTpCt9sqikj3wXapVEcw-eA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 11, 2017 at 10:55 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I have reviewed the latest patch and I don't have any more comments.
> So if there is no objection from other reviewers I can move it to
> "Ready For Committer"?

Seeing no objections, I have moved it to Ready For Committer.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-01-17 12:18:10
Message-ID: CAA4eK1L9WHYTJY8KUq_+yzVMp1ep1UqB+b3x3Kb6hTVEUWascA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 17, 2017 at 11:39 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Wed, Jan 11, 2017 at 10:55 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> I have reviewed the latest patch and I don't have any more comments.
>> So if there is no objection from other reviewers I can move it to
>> "Ready For Committer"?
>
> Seeing no objections, I have moved it to Ready For Committer.
>

Thanks for the review.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-02-01 04:35:51
Message-ID: CAB7nPqTYT8oWXxmudwFbLxK7rJWU73LcJz3J+nCfVGvKE31NAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 17, 2017 at 9:18 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Jan 17, 2017 at 11:39 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> On Wed, Jan 11, 2017 at 10:55 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>>> I have reviewed the latest patch and I don't have any more comments.
>>> So if there is no objection from other reviewers I can move it to
>>> "Ready For Committer"?
>>
>> Seeing no objections, I have moved it to Ready For Committer.
>>
>
> Thanks for the review.

Moved to CF 2017-03, the 8th commit fest of this patch.
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-09 22:49:42
Message-ID: CA+TgmobgWHcXDcChX2+BqJDk2dkPVF85ZrJFhUyHHQmw8diTpA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 31, 2017 at 11:35 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>> Thanks for the review.
>
> Moved to CF 2017-03, the 8th commit fest of this patch.

I think eight is enough. Committed with some cosmetic changes.

I think the turning point for this somewhat-troubled patch was when we
realized that, while results were somewhat mixed on whether it
improved performance, wait event monitoring showed that it definitely
reduced contention significantly. However, I just realized that in
both this case and in the case of group XID clearing, we weren't
advertising a wait event for the PGSemaphoreLock calls that are part
of the group locking machinery. I think we should fix that, because a
quick test shows that this can happen fairly often -- not, I think, as
often as we would have seen LWLock waits without these patches, but
often enough that you'll want to know. Patch attached.
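
A minimal sketch of the idea, assuming the single-argument
pgstat_report_wait_start()/pgstat_report_wait_end() interface and a wait
event name along the lines of WAIT_EVENT_CLOG_GROUP_UPDATE (the attached
patch may well differ in the details):

    /* Advertise that we are sleeping inside the clog group-update machinery. */
    pgstat_report_wait_start(WAIT_EVENT_CLOG_GROUP_UPDATE);
    for (;;)
    {
        PGSemaphoreLock(proc->sem);
        if (!proc->clogGroupMember)
            break;
        extraWaits++;
    }
    pgstat_report_wait_end();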

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
group-update-waits-v1.patch application/octet-stream 4.8 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 02:17:33
Message-ID: 9010.1489112253@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I think eight is enough. Committed with some cosmetic changes.

Buildfarm thinks eight wasn't enough.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01

regards, tom lane


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 02:28:33
Message-ID: CAA4eK1JHKow35+9yQ8JqL4P=L0H77qiq34LDBcoeRqJi=NnHjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 10, 2017 at 7:47 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think eight is enough. Committed with some cosmetic changes.
>
> Buildfarm thinks eight wasn't enough.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01
>

I will look into this. I don't have access to that machine, but it
looks to be a Power machine and I have access to a somewhat similar one.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 02:35:50
Message-ID: CA+TgmobrMF8ALx_7pGM+4G=i-o3NBf+FrB4bh6XHqUF7NuVgDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 9, 2017 at 9:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think eight is enough. Committed with some cosmetic changes.
>
> Buildfarm thinks eight wasn't enough.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01

At first I was confused about how you knew that this was the fault of
this patch, but this seems like a pretty good indicator:

TRAP: FailedAssertion("!(curval == 0 || (curval == 0x03 && status !=
0x00) || curval == status)", File: "clog.c", Line: 574)

I'm not sure whether it's related to this problem or not, but now that
I look at it, this (preexisting) comment looks like entirely wishful
thinking:

* If we update more than one xid on this page while it is being written
* out, we might find that some of the bits go to disk and others don't.
* If we are updating commits on the page with the top-level xid that
* could break atomicity, so we subcommit the subxids first before we mark
* the top-level commit.

The problem with that is the word "before". There are no memory
barriers here, so there's zero guarantee that other processes see the
writes in the order they're performed here. But it might be a stretch
to suppose that that would cause this symptom.

Maybe we should replace that Assert() with an elog() and dump out the
actual values.
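
Something along these lines, keeping the raw values from the existing
assertion (just a sketch of the suggestion, not a tested change):

    /* Report the unexpected state instead of just asserting. */
    if (!(curval == 0 ||
          (curval == 0x03 && status != 0x00) ||
          curval == status))
        elog(PANIC, "unexpected clog status transition: curval = %u, status = %u",
             (unsigned) curval, (unsigned) status);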

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 05:21:00
Message-ID: 20807.1489123260@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Mar 9, 2017 at 9:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Buildfarm thinks eight wasn't enough.
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01

> At first I was confused how you knew that this was the fault of this
> patch, but this seems like a pretty indicator:
> TRAP: FailedAssertion("!(curval == 0 || (curval == 0x03 && status !=
> 0x00) || curval == status)", File: "clog.c", Line: 574)

Yeah, that's what led me to blame the clog-group-update patch.

> I'm not sure whether it's related to this problem or not, but now that
> I look at it, this (preexisting) comment looks like entirely wishful
> thinking:
> * If we update more than one xid on this page while it is being written
> * out, we might find that some of the bits go to disk and others don't.
> * If we are updating commits on the page with the top-level xid that
> * could break atomicity, so we subcommit the subxids first before we mark
> * the top-level commit.

Maybe, but that comment dates to 2008 according to git, and clam has
been, er, happy as a clam up to now. My money is on a newly-introduced
memory-access-ordering bug.

Also, I see clam reported in green just now, so it's not 100%
reproducible :-(

regards, tom lane


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 06:13:40
Message-ID: CAA4eK1KAteYXb-KRY=tBRcM=D20o5UvgHePxpwRaBS7eqrkBaQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 10, 2017 at 10:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Mar 9, 2017 at 9:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Buildfarm thinks eight wasn't enough.
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01
>
>> At first I was confused how you knew that this was the fault of this
>> patch, but this seems like a pretty indicator:
>> TRAP: FailedAssertion("!(curval == 0 || (curval == 0x03 && status !=
>> 0x00) || curval == status)", File: "clog.c", Line: 574)
>
> Yeah, that's what led me to blame the clog-group-update patch.
>
>> I'm not sure whether it's related to this problem or not, but now that
>> I look at it, this (preexisting) comment looks like entirely wishful
>> thinking:
>> * If we update more than one xid on this page while it is being written
>> * out, we might find that some of the bits go to disk and others don't.
>> * If we are updating commits on the page with the top-level xid that
>> * could break atomicity, so we subcommit the subxids first before we mark
>> * the top-level commit.
>
> Maybe, but that comment dates to 2008 according to git, and clam has
> been, er, happy as a clam up to now. My money is on a newly-introduced
> memory-access-ordering bug.
>
> Also, I see clam reported in green just now, so it's not 100%
> reproducible :-(
>

Just to let you know that I think I have figured out the reason for the
failure. If we run the regression tests with the attached patch, they
fail consistently in the same way. The patch simply forces all
transaction status updates to go via the group clog update mechanism.
The root of the problem is that the patch relies on the XidCache in
PGPROC for subtransactions when it has not overflowed, which is okay for
commits, but not for Rollback and Rollback to Savepoint. For Rollback to
Savepoint, we pass only the particular (sub)transaction ids to abort, but
the group mechanism would abort all the sub-transactions in that top
transaction. I am still analysing the best way to fix this issue; I
think there could be multiple ways. One is that we can advertise the
fact that the status update for the transaction involves subtransactions
and then use the XidCache when actually processing the status update. A
second is to advertise all the subtransaction ids whose status needs to
be updated, but I am sure that is not at all efficient as it will consume
a lot of memory. A last resort could be to skip the group clog update
optimization when a transaction has sub-transactions.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
force_clog_group_commit_v1.patch application/octet-stream 943 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 06:21:28
Message-ID: 22895.1489126888@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> Just to let you know that I think I have figured out the reason of
> failure. If we run the regressions with attached patch, it will make
> the regression tests fail consistently in same way. The patch just
> makes all transaction status updates to go via group clog update
> mechanism.

This does *not* give me a warm fuzzy feeling that this patch was
ready to commit. Or even that it was tested to the claimed degree.

regards, tom lane


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 11:17:54
Message-ID: CAA4eK1K0YdjL0A7kXGxgVHsh7fpE35MRmzPu9x8jdKAYPmi0mg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 10, 2017 at 11:43 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Mar 10, 2017 at 10:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> Also, I see clam reported in green just now, so it's not 100%
>> reproducible :-(
>>
>
> Just to let you know that I think I have figured out the reason of
> failure. If we run the regressions with attached patch, it will make
> the regression tests fail consistently in same way. The patch just
> makes all transaction status updates to go via group clog update
> mechanism. Now, the reason of the problem is that the patch has
> relied on XidCache in PGPROC for subtransactions when they are not
> overflowed which is okay for Commits, but not for Rollback to
> Savepoint and Rollback. For Rollback to Savepoint, we just pass the
> particular (sub)-transaction id to abort, but group mechanism will
> abort all the sub-transactions in that top transaction to Rollback. I
> am still analysing what could be the best way to fix this issue. I
> think there could be multiple ways to fix this problem. One way is
> that we can advertise the fact that the status update for transaction
> involves subtransactions and then we can use xidcache for actually
> processing the status update. Second is advertise all the
> subtransaction ids for which status needs to be update, but I am sure
> that is not-at all efficient as that will cosume lot of memory. Last
> resort could be that we don't use group clog update optimization when
> transaction has sub-transactions.
>

On further analysis, I don't think the first way mentioned above can
work for Rollback to Savepoint, because it can pass just a subset of the
sub-transactions, in which case we can never identify them by looking at
the subxids in PGPROC unless we advertise all such subxids. The case I am
talking about is something like:

Begin;
Savepoint one;
Insert ...
Savepoint two;
Insert ...
Savepoint three;
Insert ...
Rollback to Savepoint two;

Now, for the Rollback to Savepoint two, we pass the transaction ids
corresponding to Savepoints three and two.

So, I think we can apply this optimization only for transactions that
commit, which will anyway be the most common case. Another alternative,
as mentioned above, is to apply the optimization only when there are no
subtransactions involved. The two attached patches implement these two
approaches (fix_clog_group_commit_opt_v1.patch allows the optimization
only for commits; fix_clog_group_commit_opt_v2.patch allows it for
transaction status updates that don't involve subxids). I think the
first approach is the better way to deal with this; let me know your
thoughts.
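
To make the first approach concrete, here is a hypothetical sketch of the
gating condition inside TransactionIdSetPageStatus(); the actual
fix_clog_group_commit_opt_v1.patch may express it differently, and
all_xact_same_page and TransactionGroupUpdateXidStatus are assumed names
for the same-page flag and the group-update entry point:

    /*
     * Use the group update only when the whole transaction is committing,
     * so the XidCache in PGPROC is guaranteed to match the xids whose
     * status we are asked to set.  Rollback (to Savepoint) falls through
     * to the regular exclusive-lock path.
     */
    if (status == TRANSACTION_STATUS_COMMITTED &&
        all_xact_same_page &&
        nsubxids <= PGPROC_MAX_CACHED_SUBXIDS)
    {
        if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
            return;             /* a group leader updated clog on our behalf */
    }

    /* otherwise acquire CLogControlLock ourselves, as before */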

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
fix_clog_group_commit_opt_v1.patch application/octet-stream 1.4 KB
fix_clog_group_commit_opt_v2.patch application/octet-stream 4.6 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 11:25:43
Message-ID: CAA4eK1+nSab4hXqTUrnXkrnjS_Z3Z4Aaa=MFzoU26jc17aiybQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 10, 2017 at 11:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
>> Just to let you know that I think I have figured out the reason of
>> failure. If we run the regressions with attached patch, it will make
>> the regression tests fail consistently in same way. The patch just
>> makes all transaction status updates to go via group clog update
>> mechanism.
>
> This does *not* give me a warm fuzzy feeling that this patch was
> ready to commit. Or even that it was tested to the claimed degree.
>

I think this is more of an implementation detail missed by me. We
have done quite a bit of performance/stress testing with different
numbers of savepoints, but this could have been caught only by having a
Rollback to Savepoint followed by a commit. I agree that we could have
devised some simple way (like the one I shared above) to run the wide
range of existing tests through this new mechanism earlier. This is a
lesson learned, and I will try to be more cautious about such things in
the future.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 20:40:48
Message-ID: CA+TgmoYn5se57DskytdyeYVua8_EFAAiOtHA2CDzqP+aB1+BUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 10, 2017 at 6:25 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Mar 10, 2017 at 11:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
>>> Just to let you know that I think I have figured out the reason of
>>> failure. If we run the regressions with attached patch, it will make
>>> the regression tests fail consistently in same way. The patch just
>>> makes all transaction status updates to go via group clog update
>>> mechanism.
>>
>> This does *not* give me a warm fuzzy feeling that this patch was
>> ready to commit. Or even that it was tested to the claimed degree.
>>
>
> I think this is more of an implementation detail missed by me. We
> have done quite some performance/stress testing with a different
> number of savepoints, but this could have been caught only by having
> Rollback to Savepoint followed by a commit. I agree that we could
> have devised some simple way (like the one I shared above) to test the
> wide range of tests with this new mechanism earlier. This is a
> learning from here and I will try to be more cautious about such
> things in future.

After some study, I don't feel confident that it's this simple. The
underlying issue here is that TransactionGroupUpdateXidStatus thinks
it can assume that proc->clogGroupMemberXid, pgxact->nxids, and
proc->subxids.xids match the values that were passed to
TransactionIdSetPageStatus, but that's not checked anywhere. For
example, I thought about adding these assertions:

    Assert(nsubxids == MyPgXact->nxids);
    Assert(memcmp(subxids, MyProc->subxids.xids,
                  nsubxids * sizeof(TransactionId)) == 0);

There's not even a comment in the patch anywhere that notes that we're
assuming this, let alone anything that checks that it's actually true,
which seems worrying.

One thing that seems off is that we have this new field
clogGroupMemberXid, which we use to determine the XID that is being
committed, but for the subxids we just assume that what is advertised
in the PGPROC will match in every case. Well, that seems a bit odd,
right? I mean, if the contents of the PGXACT are a valid way to figure
out the subxids that we need to worry about, then why not also use it
to get the toplevel XID?

Another point that's kind of bothering me is that this whole approach
now seems to me to be an abstraction violation. It relies on the set
of subxids for which we're setting status in clog matching the set of
subxids advertised in PGPROC. But actually there's a fair amount of
separation between those things. What's getting passed down to clog
is coming from xact.c's transaction state stack, which is completely
separate from the procarray. Now after going over the logic in some
detail, it does look to me that you're correct that in the case of a
toplevel commit they will always match, but in some sense that looks
accidental.

For example, look at this code from RecordTransactionAbort:

/*
* If we're aborting a subtransaction, we can immediately remove failed
* XIDs from PGPROC's cache of running child XIDs. We do that here for
* subxacts, because we already have the child XID array at hand. For
* main xacts, the equivalent happens just after this function returns.
*/
if (isSubXact)
XidCacheRemoveRunningXids(xid, nchildren, children, latestXid);

That code paints the removal of the aborted subxids from our PGPROC as
an optimization, not a requirement for correctness. And without this
patch, that's correct: the XIDs are advertised in PGPROC so that we
construct correct snapshots, but they only need to be present there
for so long as there is a possibility that those XIDs might in the
future commit. Once they've aborted, it's not *necessary* for them to
appear in PGPROC any more, but it doesn't hurt anything if they do.
However, with this patch, removing them from PGPROC becomes a hard
requirement, because otherwise the set of XIDs that are running
according to the transaction state stack and the set that are running
according to the PGPROC might be different. Yet, neither the original
patch nor your proposed fix patch updated any of the comments here.

One might wonder whether it's even wise to tie these things together
too closely. For example, you can imagine a future patch for
autonomous transactions stashing their XIDs in the subxids array.
That'd be fine for snapshot purposes, but it would break this.

Finally, I had an unexplained hang during the TAP tests while testing
out your fix patch. I haven't been able to reproduce that so it
might've just been an artifact of something stupid I did, or of some
unrelated bug, but I think it's best to back up and reconsider a bit
here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 21:27:07
Message-ID: CA+TgmoanXMjxnr6whEXMpc9Nts2L_sztDV5ck4pviGQ++TNP5A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 10, 2017 at 3:40 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Finally, I had an unexplained hang during the TAP tests while testing
> out your fix patch. I haven't been able to reproduce that so it
> might've just been an artifact of something stupid I did, or of some
> unrelated bug, but I think it's best to back up and reconsider a bit
> here.

I was able to reproduce this with the following patch:

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index bff42dc..0546425 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -268,9 +268,11 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
* has a race condition (see TransactionGroupUpdateXidStatus) but the
* worst thing that happens if we mess up is a small loss of efficiency;
* the intent is to avoid having the leader access pages it wouldn't
- * otherwise need to touch. Finally, we skip it for prepared transactions,
- * which don't have the semaphore we would need for this optimization,
- * and which are anyway probably not all that common.
+ * otherwise need to touch. We also skip it if the transaction status is
+ * other than commit, because for rollback and rollback to savepoint, the
+ * list of subxids won't be same as subxids array in PGPROC. Finally, we skip
+ * it for prepared transactions, which don't have the semaphore we would need
+ * for this optimization, and which are anyway probably not all that common.
*/
static void
TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
@@ -280,15 +282,20 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
{
if (all_xact_same_page &&
nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
+ status == TRANSACTION_STATUS_COMMITTED &&
!IsGXactActive())
{
+ Assert(nsubxids == MyPgXact->nxids);
+ Assert(memcmp(subxids, MyProc->subxids.xids,
+ nsubxids * sizeof(TransactionId)) == 0);
+
/*
* If we can immediately acquire CLogControlLock, we update the status
* of our own XID and release the lock. If not, try use group XID
* update. If that doesn't work out, fall back to waiting for the
* lock to perform an update for this transaction only.
*/
- if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
+ if (false && LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
{
TransactionIdSetPageStatusInternal(xid, nsubxids,
subxids, status, lsn, pageno);
LWLockRelease(CLogControlLock);

make check-world hung here:

t/009_twophase.pl ..........
1..13
ok 1 - Commit prepared transaction after restart
ok 2 - Rollback prepared transaction after restart

[rhaas pgsql]$ ps uxww | grep postgres
rhaas 72255 0.0 0.0 2447996 1684 s000 S+ 3:40PM 0:00.00
/Users/rhaas/pgsql/tmp_install/Users/rhaas/install/dev/bin/psql -XAtq
-d port=64230 host=/var/folders/y8/r2ycj_jj2vd65v71rmyddpr40000gn/T/ZVWy0JGbuw
dbname='postgres' -f - -v ON_ERROR_STOP=1
rhaas 72253 0.0 0.0 2478532 1548 ?? Ss 3:40PM 0:00.00
postgres: bgworker: logical replication launcher
rhaas 72252 0.0 0.0 2483132 740 ?? Ss 3:40PM 0:00.05
postgres: stats collector process
rhaas 72251 0.0 0.0 2486724 1952 ?? Ss 3:40PM 0:00.02
postgres: autovacuum launcher process
rhaas 72250 0.0 0.0 2477508 880 ?? Ss 3:40PM 0:00.03
postgres: wal writer process
rhaas 72249 0.0 0.0 2477508 972 ?? Ss 3:40PM 0:00.03
postgres: writer process
rhaas 72248 0.0 0.0 2477508 1252 ?? Ss 3:40PM 0:00.00
postgres: checkpointer process
rhaas 72246 0.0 0.0 2481604 5076 s000 S+ 3:40PM 0:00.03
/Users/rhaas/pgsql/tmp_install/Users/rhaas/install/dev/bin/postgres -D
/Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata
rhaas 72337 0.0 0.0 2433796 688 s002 S+ 4:14PM 0:00.00
grep postgres
rhaas 72256 0.0 0.0 2478920 2984 ?? Ss 3:40PM 0:00.00
postgres: rhaas postgres [local] COMMIT PREPARED waiting for 0/301D0D0

Backtrace of PID 72256:

#0 0x00007fff8ecc85c2 in poll ()
#1 0x00000001078eb727 in WaitEventSetWaitBlock [inlined] () at
/Users/rhaas/pgsql/src/backend/storage/ipc/latch.c:1118
#2 0x00000001078eb727 in WaitEventSetWait (set=0x7fab3c8366c8,
timeout=-1, occurred_events=0x7fff585e5410, nevents=1,
wait_event_info=<value temporarily unavailable, due to optimizations>)
at latch.c:949
#3 0x00000001078eb409 in WaitLatchOrSocket (latch=<value temporarily
unavailable, due to optimizations>, wakeEvents=<value temporarily
unavailable, due to optimizations>, sock=-1, timeout=<value
temporarily unavailable, due to optimizations>,
wait_event_info=134217741) at latch.c:349
#4 0x00000001078cf077 in SyncRepWaitForLSN (lsn=<value temporarily
unavailable, due to optimizations>, commit=<value temporarily
unavailable, due to optimizations>) at syncrep.c:284
#5 0x00000001076a2dab in FinishPreparedTransaction (gid=<value
temporarily unavailable, due to optimizations>, isCommit=1 '\001') at
twophase.c:2110
#6 0x0000000107919420 in standard_ProcessUtility (pstmt=<value
temporarily unavailable, due to optimizations>, queryString=<value
temporarily unavailable, due to optimizations>,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x7fab3c853cf8,
completionTag=<value temporarily unavailable, due to optimizations>)
at utility.c:452
#7 0x00000001079186f3 in PortalRunUtility (portal=0x7fab3c874a40,
pstmt=0x7fab3c853c00, isTopLevel=1 '\001', setHoldSnapshot=<value
temporarily unavailable, due to optimizations>, dest=0x7fab3c853cf8,
completionTag=0x7fab3c8366f8 "\n") at pquery.c:1165
#8 0x0000000107917cd6 in PortalRunMulti (portal=<value temporarily
unavailable, due to optimizations>, isTopLevel=1 '\001',
setHoldSnapshot=0 '\0', dest=0x7fab3c853cf8, altdest=0x7fab3c853cf8,
completionTag=<value temporarily unavailable, due to optimizations>)
at pquery.c:1315
#9 0x0000000107917634 in PortalRun (portal=0x7fab3c874a40,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x7fab3c853cf8,
altdest=0x7fab3c853cf8, completionTag=0x7fff585e5a30 "") at
pquery.c:788
#10 0x000000010791586b in PostgresMain (argc=<value temporarily
unavailable, due to optimizations>, argv=<value temporarily
unavailable, due to optimizations>, dbname=<value temporarily
unavailable, due to optimizations>, username=<value temporarily
unavailable, due to optimizations>) at postgres.c:1101
#11 0x0000000107897a68 in PostmasterMain (argc=<value temporarily
unavailable, due to optimizations>, argv=<value temporarily
unavailable, due to optimizations>) at postmaster.c:4317
#12 0x00000001078124cd in main (argc=<value temporarily unavailable,
due to optimizations>, argv=<value temporarily unavailable, due to
optimizations>) at main.c:228

debug_query_string is COMMIT PREPARED 'xact_009_1'

end of regress_log_009_twophase looks like this:

ok 2 - Rollback prepared transaction after restart
### Stopping node "master" using mode immediate
# Running: pg_ctl -D
/Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata
-m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID
### Starting node "master"
# Running: pg_ctl -D
/Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata
-l /Users/rhaas/pgsql/src/test/recovery/tmp_check/log/009_twophase_master.log
start
waiting for server to start.... done
server started
# Postmaster PID for node "master" is 72246

The smoking gun was in 009_twophase_slave.log:

TRAP: FailedAssertion("!(nsubxids == MyPgXact->nxids)", File:
"clog.c", Line: 288)

...and then the node shuts down, which is why this hangs forever.
(Also... what's up with it hanging forever instead of timing out or
failing or something?)

So evidently on a standby it is in fact possible for the procarray
contents not to match what got passed down to clog. Now you might say
"well, we shouldn't be using group update on a standby anyway", but
it's possible for a hot standby backend to hold a shared lock on
CLogControlLock, and then the startup process would be pushed into the
group-update path and - boom.

Anyway, this is surely fixable, but I think it's another piece of
evidence that the assumption that the transaction status stack will
match the procarray is fairly fragile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-10 22:19:17
Message-ID: 20170310221917.dijugbcqym35of7o@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:

> The smoking gun was in 009_twophase_slave.log:
>
> TRAP: FailedAssertion("!(nsubxids == MyPgXact->nxids)", File:
> "clog.c", Line: 288)
>
> ...and then the node shuts down, which is why this hangs forever.
> (Also... what's up with it hanging forever instead of timing out or
> failing or something?)

This bit me while messing with 2PC tests recently. I think it'd be
worth doing something about this, such as causing the test to die if we
request a server to (re)start and it doesn't start or it immediately
crashes. This doesn't solve the problem of a server crashing at a point
not immediately after start, though.

(It'd be very annoying to have to sprinkle the Perl test code with
"assert $server->islive", but perhaps we can add assertions of some kind
in PostgresNode itself).

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-11 00:39:14
Message-ID: CAA4eK1KxSVNcwSZgP_6ViCe1amSO_TwxhJbqBcC9xri9U1hjfA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 11, 2017 at 2:10 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 10, 2017 at 6:25 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> On Fri, Mar 10, 2017 at 11:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
>>>> Just to let you know that I think I have figured out the reason of
>>>> failure. If we run the regressions with attached patch, it will make
>>>> the regression tests fail consistently in same way. The patch just
>>>> makes all transaction status updates to go via group clog update
>>>> mechanism.
>>>
>>> This does *not* give me a warm fuzzy feeling that this patch was
>>> ready to commit. Or even that it was tested to the claimed degree.
>>>
>>
>> I think this is more of an implementation detail missed by me. We
>> have done quite some performance/stress testing with a different
>> number of savepoints, but this could have been caught only by having
>> Rollback to Savepoint followed by a commit. I agree that we could
>> have devised some simple way (like the one I shared above) to test the
>> wide range of tests with this new mechanism earlier. This is a
>> learning from here and I will try to be more cautious about such
>> things in future.
>
> After some study, I don't feel confident that it's this simple. The
> underlying issue here is that TransactionGroupUpdateXidStatus thinks
> it can assume that proc->clogGroupMemberXid, pgxact->nxids, and
> proc->subxids.xids match the values that were passed to
> TransactionIdSetPageStatus, but that's not checked anywhere. For
> example, I thought about adding these assertions:
>
> Assert(nsubxids == MyPgXact->nxids);
> Assert(memcmp(subxids, MyProc->subxids.xids,
> nsubxids * sizeof(TransactionId)) == 0);
>
> There's not even a comment in the patch anywhere that notes that we're
> assuming this, let alone anything that checks that it's actually true,
> which seems worrying.
>
> One thing that seems off is that we have this new field
> clogGroupMemberXid, which we use to determine the XID that is being
> committed, but for the subxids we think it's going to be true in every
> case. Well, that seems a bit odd, right? I mean, if the contents of
> the PGXACT are a valid way to figure out the subxids that we need to
> worry about, then why not also it to get the toplevel XID?
>
> Another point that's kind of bothering me is that this whole approach
> now seems to me to be an abstraction violation. It relies on the set
> of subxids for which we're setting status in clog matching the set of
> subxids advertised in PGPROC. But actually there's a fair amount of
> separation between those things. What's getting passed down to clog
> is coming from xact.c's transaction state stack, which is completely
> separate from the procarray. Now after going over the logic in some
> detail, it does look to me that you're correct that in the case of a
> toplevel commit they will always match, but in some sense that looks
> accidental.
>
> For example, look at this code from RecordTransactionAbort:
>
> /*
> * If we're aborting a subtransaction, we can immediately remove failed
> * XIDs from PGPROC's cache of running child XIDs. We do that here for
> * subxacts, because we already have the child XID array at hand. For
> * main xacts, the equivalent happens just after this function returns.
> */
> if (isSubXact)
> XidCacheRemoveRunningXids(xid, nchildren, children, latestXid);
>
> That code paints the removal of the aborted subxids from our PGPROC as
> an optimization, not a requirement for correctness. And without this
> patch, that's correct: the XIDs are advertised in PGPROC so that we
> construct correct snapshots, but they only need to be present there
> for so long as there is a possibility that those XIDs might in the
> future commit. Once they've aborted, it's not *necessary* for them to
> appear in PGPROC any more, but it doesn't hurt anything if they do.
> However, with this patch, removing them from PGPROC becomes a hard
> requirement, because otherwise the set of XIDs that are running
> according to the transaction state stack and the set that are running
> according to the PGPROC might be different. Yet, neither the original
> patch nor your proposed fix patch updated any of the comments here.
>

There is a comment in the existing code (proc.h) which states that it
will contain non-aborted subtransactions. I agree that having this
explicitly mentioned in the patch would have been much better.

/*
* Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds
* for non-aborted subtransactions of its current top transaction. These
* have to be treated as running XIDs by other backends.

> One might wonder whether it's even wise to tie these things together
> too closely. For example, you can imagine a future patch for
> autonomous transactions stashing their XIDs in the subxids array.
> That'd be fine for snapshot purposes, but it would break this.
>
> Finally, I had an unexplained hang during the TAP tests while testing
> out your fix patch. I haven't been able to reproduce that so it
> might've just been an artifact of something stupid I did, or of some
> unrelated bug, but I think it's best to back up and reconsider a bit
> here.
>

I agree that more analysis can help us to decide if we can use subxids
from PGPROC and, if so, under what conditions. Have you considered the
other patch I have posted to fix the issue, which is to do this
optimization only when subxids are not present? That patch removes the
dependency on subxids in PGPROC.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-12 02:41:57
Message-ID: CA+Tgmoa8c9W-UsFAC6f5cYH8e8OBEtTMaR6SphO+EduK9L+z1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 10, 2017 at 7:39 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I agree that more analysis can help us to decide if we can use subxids
> from PGPROC and if so under what conditions. Have you considered the
> another patch I have posted to fix the issue which is to do this
> optimization only when subxids are not present? In that patch, it
> will remove the dependency of relying on subxids in PGPROC.

Well, that's an option, but it narrows the scope of the optimization
quite a bit. I think Simon previously opposed handling only the
no-subxid cases (although I may be misremembering) and I'm not that
keen about it either.

I was wondering about doing an explicit test: if the XID being
committed matches the one in the PGPROC, and nsubxids matches, and the
actual list of XIDs matches, then apply the optimization. That could
replace the logic that you've proposed to exclude non-commit cases,
gxact cases, etc. and it seems fundamentally safer. But it might be a
more expensive test, too, so I'm not sure.
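
In code form, the check being described would be something like this
(untested sketch; whether MyPgXact->xid is the right field to compare
the toplevel XID against is an assumption here):

    if (TransactionIdIsValid(xid) &&
        TransactionIdEquals(xid, MyPgXact->xid) &&
        nsubxids == MyPgXact->nxids &&
        memcmp(subxids, MyProc->subxids.xids,
               nsubxids * sizeof(TransactionId)) == 0)
    {
        /*
         * What xact.c passed down matches what this backend advertises
         * in PGPROC/PGXACT, so attempting the group update looks safe.
         */
    }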

It would be nice to get some other opinions on how (and whether) to
proceed with this. I'm feeling really nervous about this right at the
moment, because it seems like everybody including me missed some
fairly critical points relating to the safety (or lack thereof) of
this patch, and I want to make sure that if it gets committed again,
we've really got everything nailed down tight.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-17 06:30:20
Message-ID: CAA4eK1Li2cU78LmOEn8gBQaTacn=1z+3ovNODPL2b6f5s8yR-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 12, 2017 at 8:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 10, 2017 at 7:39 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> I agree that more analysis can help us to decide if we can use subxids
>> from PGPROC and if so under what conditions. Have you considered the
>> another patch I have posted to fix the issue which is to do this
>> optimization only when subxids are not present? In that patch, it
>> will remove the dependency of relying on subxids in PGPROC.
>
> Well, that's an option, but it narrows the scope of the optimization
> quite a bit. I think Simon previously opposed handling only the
> no-subxid cases (although I may be misremembering) and I'm not that
> keen about it either.
>
> I was wondering about doing an explicit test: if the XID being
> committed matches the one in the PGPROC, and nsubxids matches, and the
> actual list of XIDs matches, then apply the optimization. That could
> replace the logic that you've proposed to exclude non-commit cases,
> gxact cases, etc. and it seems fundamentally safer. But it might be a
> more expensive test, too, so I'm not sure.
>

I think if the number of subxids is very small, let us say under 5 or
so, then such a check might not matter; otherwise it could be
expensive.

> It would be nice to get some other opinions on how (and whether) to
> proceed with this. I'm feeling really nervous about this right at the
> moment, because it seems like everybody including me missed some
> fairly critical points relating to the safety (or lack thereof) of
> this patch, and I want to make sure that if it gets committed again,
> we've really got everything nailed down tight.
>

I think the basic thing that is missing in the last patch was that we
can't apply this optimization during WAL replay, as during
recovery/hot standby the xids/subxids are tracked in KnownAssignedXids.
The same is mentioned in the header comments in procarray.c and in
GetSnapshotData (look at the else branch of the check if
(!snapshot->takenDuringRecovery)). As far as I can see, the patch had
considered that in the initial versions, but then the check got dropped
in one of the later revisions by mistake. The patch version-5 [1] has
the check for recovery, but during some code rearrangement, it got
dropped in version-6 [2]. Having said that, I think the improvement in
case there are subtransactions will be smaller, because having
subtransactions means more work under the LWLock and hence fewer
context switches. This optimization is all about reducing frequent
context switches, so I think even if we don't optimize the case for
subtransactions we are not leaving much on the table, and it will make
this optimization much safer. To substantiate this theory with data,
see the difference in performance when subtransactions are used [3] and
when they are not used [4].
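
For illustration, the kind of recovery guard that got lost is as simple
as the following (sketch only; use_group_update is just an illustrative
variable name, not what the patch actually uses):

    if (InRecovery)
    {
        /*
         * During WAL replay the startup process's own PGPROC does not
         * carry the transaction's subxids (they are tracked in
         * KnownAssignedXids), so skip the group update optimization.
         */
        use_group_update = false;
    }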

So we have four ways to proceed:
1. Have this optimization for subtransactions and make it safe with
some additional conditions, like a check for recovery and an explicit
check that the actual transaction ids match the ids stored in the proc.
2. Have this optimization only when there are no subtransactions. In
this case, we can have a very simple check for this optimization.
3. Drop this patch and idea.
4. Consider it for the next version.

I personally think the second way is okay for this release, as it looks
safe and gets us the maximum benefit we can achieve by this
optimization; we can then consider adding the optimization for
subtransactions (the first way) in a future version if we think it is
safe and gives us a benefit.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KUVPxBcGTdOuKyvf5p1sQ0HeUbSMbTxtQc%3DP65OxiZog%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1L4iV-2qe7AyMVsb%2Bnz7SiX8JvCO%2BCqhXwaiXgm3CaBUw%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-u3%3DXUi7z8dTOgxZ98E7gL1tzL%3Dq9Yd%3DCwWCtTtS6pOZw%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-20 02:57:42
Message-ID: CA+TgmoaXa0nxRaoCC4u5qLyA9EQeKMFJNBP5M8qbDrnthR6fVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 17, 2017 at 2:30 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> I was wondering about doing an explicit test: if the XID being
>> committed matches the one in the PGPROC, and nsubxids matches, and the
>> actual list of XIDs matches, then apply the optimization. That could
>> replace the logic that you've proposed to exclude non-commit cases,
>> gxact cases, etc. and it seems fundamentally safer. But it might be a
>> more expensive test, too, so I'm not sure.
>
> I think if the number of subxids is very small let us say under 5 or
> so, then such a check might not matter, but otherwise it could be
> expensive.

We could find out by testing it. We could also restrict the
optimization to cases with just a few subxids, because if you've got a
large number of subxids this optimization probably isn't buying much
anyway. We're trying to avoid grabbing CLogControlLock to do a very
small amount of work, but if you've got 10 or 20 subxids we're doing
as much work anyway as the group update optimization is attempting to
put into one batch.

> So we have four ways to proceed:
> 1. Have this optimization for subtransactions and make it safe by
> having some additional conditions like check for recovery, explicit
> check for if the actual transaction ids match with ids stored in proc.
> 2. Have this optimization when there are no subtransactions. In this
> case, we can have a very simple check for this optimization.
> 3. Drop this patch and idea.
> 4. Consider it for next version.
>
> I personally think second way is okay for this release as that looks
> safe and gets us the maximum benefit we can achieve by this
> optimization and then consider adding optimization for subtransactions
> (first way) in the future version if we think it is safe and gives us
> the benefit.
>
> Thoughts?

I don't like #2 very much. Restricting it to a relatively small
number of subtransactions - whatever we can show doesn't hurt
performance - seems OK, but restricting it to the
exactly-zero-subtransactions case seems poor.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-21 12:49:58
Message-ID: CAA4eK1JMJec6pUk6dRB_jxR-df1u31O_ATYGuRBnFJ_7Opm-AA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 20, 2017 at 8:27 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 17, 2017 at 2:30 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> I was wondering about doing an explicit test: if the XID being
>>> committed matches the one in the PGPROC, and nsubxids matches, and the
>>> actual list of XIDs matches, then apply the optimization. That could
>>> replace the logic that you've proposed to exclude non-commit cases,
>>> gxact cases, etc. and it seems fundamentally safer. But it might be a
>>> more expensive test, too, so I'm not sure.
>>
>> I think if the number of subxids is very small let us say under 5 or
>> so, then such a check might not matter, but otherwise it could be
>> expensive.
>
> We could find out by testing it. We could also restrict the
> optimization to cases with just a few subxids, because if you've got a
> large number of subxids this optimization probably isn't buying much
> anyway.
>

Yes, and I have modified the patch to compare xids and subxids for the
group update. In initial short tests (with a few client counts), it
seems that up to 3 savepoints we can win, and from 10 savepoints
onwards there is some regression, or at the very least there doesn't
appear to be any benefit. We need more tests to identify what the safe
number is, but I thought it better to share the patch now to see if we
agree on the changes, because if not, then the whole testing needs to
be repeated. Let me know what you think about the attached?
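
Roughly, the additional gating compared to the earlier approach is an
exact-match check plus a small cutoff on the number of subxids, along
these lines (illustrative sketch only; the constant name, its value,
and use_group_update are placeholders rather than what is in the
attached patch):

    /* Hypothetical cutoff; the right value is what needs benchmarking. */
    #define CLOG_GROUP_UPDATE_MAX_SUBXIDS 3

    use_group_update =
        TransactionIdEquals(xid, MyPgXact->xid) &&
        nsubxids <= CLOG_GROUP_UPDATE_MAX_SUBXIDS &&
        nsubxids == MyPgXact->nxids &&
        memcmp(subxids, MyProc->subxids.xids,
               nsubxids * sizeof(TransactionId)) == 0;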

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v11.patch application/octet-stream 14.2 KB

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-03-23 07:48:22
Message-ID: CAE9k0PkdmKwpdZG9FX_5pZafYCetS814a3WoXA2ng1hzjvWueg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi All,

I have tried to test 'group_update_clog_v11.1.patch', shared upthread by
Amit, on a high-end machine. I have tested the patch with various numbers
of savepoints in my test script. The machine details, along with the test
script and the test results, are shown below.

Machine details:
============
24 sockets, 192 CPU(s)
RAM - 500GB

test script:
========

\set aid random (1,30000000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s3;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s4;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s5;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
END;

Non-default parameters
==================
max_connections = 200
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
checkpoint_timeout=900
synchronous_commit=off

pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres -f
~/test_script.sql

where, time_for_reading = 10 mins

Test Results:
=========

With 3 savepoints
=============

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            50275        53704         6.82048732
64             62860        66561         5.887686923
8              18464        18752         1.559792028

With 5 savepoints
=============

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            46559        47715         2.482871196
64             52306        52082         -0.4282491492
8              12289        12852         4.581332899

With 7 savepoints
=============

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            41367        41500         0.3215123166
64             42996        41473         -3.542189971
8              9665         9657          -0.0827728919

With 10 savepoints
==============

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            34513        34597         0.24338655
64             32581        32035         -1.675823333
8              7293         7622          4.511175099

Conclusion:
As seen from the test results mentioned above, there is some performance
improvement with 3 SP(s); with 5 SP(s) the results with the patch are
slightly better than HEAD; with 7 and 10 SP(s) we do see a regression
with the patch. Therefore, I think the threshold value of 4 for the
number of subtransactions considered in the patch looks fine to me.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Tue, Mar 21, 2017 at 6:19 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Mon, Mar 20, 2017 at 8:27 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> > On Fri, Mar 17, 2017 at 2:30 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >>> I was wondering about doing an explicit test: if the XID being
> >>> committed matches the one in the PGPROC, and nsubxids matches, and the
> >>> actual list of XIDs matches, then apply the optimization. That could
> >>> replace the logic that you've proposed to exclude non-commit cases,
> >>> gxact cases, etc. and it seems fundamentally safer. But it might be a
> >>> more expensive test, too, so I'm not sure.
> >>
> >> I think if the number of subxids is very small let us say under 5 or
> >> so, then such a check might not matter, but otherwise it could be
> >> expensive.
> >
> > We could find out by testing it. We could also restrict the
> > optimization to cases with just a few subxids, because if you've got a
> > large number of subxids this optimization probably isn't buying much
> > anyway.
> >
>
> Yes, and I have modified the patch to compare xids and subxids for
> group update. In the initial short tests (with few client counts), it
> seems like till 3 savepoints we can win and 10 savepoints onwards
> there is some regression or at the very least there doesn't appear to
> be any benefit. We need more tests to identify what is the safe
> number, but I thought it is better to share the patch to see if we
> agree on the changes because if not, then the whole testing needs to
> be repeated. Let me know what do you think about attached?
>
>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-04-07 17:47:38
Message-ID: CA+TgmoYKqjN79SpLKWC8kq0gJti1V7MmbG5GZ+arkbGHrKnVow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 9, 2017 at 5:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> However, I just realized that in
> both this case and in the case of group XID clearing, we weren't
> advertising a wait event for the PGSemaphoreLock calls that are part
> of the group locking machinery. I think we should fix that, because a
> quick test shows that can happen fairly often -- not, I think, as
> often as we would have seen LWLock waits without these patches, but
> often enough that you'll want to know. Patch attached.

I've pushed the portion of this that relates to ProcArrayLock. (I
know this hasn't been discussed much, but there doesn't really seem to
be any reason for anybody to object, and looking at just the
LWLock/ProcArrayLock wait events gives a highly misleading answer.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-07-03 12:45:47
Message-ID: CAA4eK1JhimqZy7xFmbqjRxJno5GpjS-RrqXM2928xDJ58Sdfsg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 23, 2017 at 1:18 PM, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
wrote:
>
> *Conclusion:*
> As seen from the test results mentioned above, there is some performance
> improvement with 3 SP(s), with 5 SP(s) the results with patch is slightly
> better than HEAD, with 7 and 10 SP(s) we do see regression with patch.
> Therefore, I think the threshold value of 4 for number of subtransactions
> considered in the patch looks fine to me.
>
>
Thanks for the tests. Please find attached the rebased patch on HEAD. I
have run the latest pgindent on the patch. I have yet to add a wait event
for group lock waits in this patch, as was done by Robert in commit
d4116a771925379c33cf4c6634ca620ed08b551d for ProcArrayGroupUpdate.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v12.patch application/octet-stream 14.2 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-07-04 04:33:38
Message-ID: CAA4eK1KudxzgWhuywY_X=yeSAhJMT4DwCjroV5Ay60xaeB2Eew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 3, 2017 at 6:15 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, Mar 23, 2017 at 1:18 PM, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
> wrote:
>>
>> Conclusion:
>> As seen from the test results mentioned above, there is some performance
>> improvement with 3 SP(s), with 5 SP(s) the results with patch is slightly
>> better than HEAD, with 7 and 10 SP(s) we do see regression with patch.
>> Therefore, I think the threshold value of 4 for number of subtransactions
>> considered in the patch looks fine to me.
>>
>
> Thanks for the tests. Attached find the rebased patch on HEAD. I have ran
> latest pgindent on patch. I have yet to add wait event for group lock waits
> in this patch as is done by Robert in commit
> d4116a771925379c33cf4c6634ca620ed08b551d for ProcArrayGroupUpdate.
>

I have updated the patch to support wait events and moved it to the upcoming CF.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v13.patch application/octet-stream 16.3 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-08-29 21:13:25
Message-ID: CA+TgmoYsGdXbfmT5-YCnP=RbSuKzw9WHErDB66F4vJ_nDYn+Gg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 4, 2017 at 12:33 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I have updated the patch to support wait events and moved it to upcoming CF.

This patch doesn't apply any more, but I made it apply with a hammer
and then did a little benchmarking (scylla, EDB server, Intel Xeon
E5-2695 v3 @ 2.30GHz, 2 sockets, 14 cores/socket, 2 threads/core).
The results were not impressive. There's basically no clog contention
to remove, so the patch just doesn't really do anything. For example,
here's a wait event profile with master and using Ashutosh's test
script with 5 savepoints:

1 Lock | tuple
2 IO | SLRUSync
5 LWLock | wal_insert
5 LWLock | XidGenLock
9 IO | DataFileRead
12 LWLock | lock_manager
16 IO | SLRURead
20 LWLock | CLogControlLock
97 LWLock | buffer_content
216 Lock | transactionid
237 LWLock | ProcArrayLock
1238 IPC | ProcArrayGroupUpdate
2266 Client | ClientRead

This is just a 5-minute test; maybe things would change if we ran it
for longer, but if only 0.5% of the samples are blocked on
CLogControlLock without the patch, obviously the patch can't help
much. I did some other experiments too, but I won't bother
summarizing the results here because they're basically boring. I
guess I should have used a bigger machine.

Given that we've changed the approach here somewhat, I think we need
to validate that we're still seeing a substantial reduction in
CLogControlLock contention on big machines.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-08-30 07:24:03
Message-ID: CAA4eK1KQJbzQ2E9_hV5Ajjqk0Y4AmGzYhDd9=JnNzr7YgxgQ8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 30, 2017 at 2:43 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jul 4, 2017 at 12:33 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> I have updated the patch to support wait events and moved it to upcoming CF.
>
> This patch doesn't apply any more, but I made it apply with a hammer
> and then did a little benchmarking (scylla, EDB server, Intel Xeon
> E5-2695 v3 @ 2.30GHz, 2 sockets, 14 cores/socket, 2 threads/core).
> The results were not impressive. There's basically no clog contention
> to remove, so the patch just doesn't really do anything.
>

Yeah, in such a case the patch won't help.

> For example,
> here's a wait event profile with master and using Ashutosh's test
> script with 5 savepoints:
>
> 1 Lock | tuple
> 2 IO | SLRUSync
> 5 LWLock | wal_insert
> 5 LWLock | XidGenLock
> 9 IO | DataFileRead
> 12 LWLock | lock_manager
> 16 IO | SLRURead
> 20 LWLock | CLogControlLock
> 97 LWLock | buffer_content
> 216 Lock | transactionid
> 237 LWLock | ProcArrayLock
> 1238 IPC | ProcArrayGroupUpdate
> 2266 Client | ClientRead
>
> This is just a 5-minute test; maybe things would change if we ran it
> for longer, but if only 0.5% of the samples are blocked on
> CLogControlLock without the patch, obviously the patch can't help
> much. I did some other experiments too, but I won't bother
> summarizing the results here because they're basically boring. I
> guess I should have used a bigger machine.
>

That would have been better. In any case, will do the tests on some
higher end machine and will share the results.

> Given that we've changed the approach here somewhat, I think we need
> to validate that we're still seeing a substantial reduction in
> CLogControlLock contention on big machines.
>

Sure will do so. In the meantime, I have rebased the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
group_update_clog_v14.patch application/octet-stream 16.4 KB

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-09-01 14:03:19
Message-ID: CAFiTN-sEDY-AmemEdqBmROrqurCPqwAbG9sbkyhP1zB2CmieVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 30, 2017 at 12:54 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> That would have been better. In any case, will do the tests on some
> higher end machine and will share the results.
>
>> Given that we've changed the approach here somewhat, I think we need
>> to validate that we're still seeing a substantial reduction in
>> CLogControlLock contention on big machines.
>>
>
> Sure will do so. In the meantime, I have rebased the patch.

I have repeated some of the tests we have performed earlier.

Machine:
Intel 8-socket machine with 128 cores.

Configuration:

shared_buffers=8GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=300
maintenance_work_mem=4GB
synchronous_commit=off
checkpoint_completion_target=0.9

I have taken one reading for each test to measure the wait events.
The observation is the same: at higher client counts there is a
significant reduction in contention on CLogControlLock.

Benchmark: Pgbench simple_update, 30 mins run:

Head: (64 client) : (TPS 60720)
53808 Client | ClientRead
26147 IPC | ProcArrayGroupUpdate
7866 LWLock | CLogControlLock
3705 Activity | LogicalLauncherMain
3699 Activity | AutoVacuumMain
3353 LWLock | ProcArrayLock
3099 LWLock | wal_insert
2825 Activity | BgWriterMain
2688 Lock | extend
1436 Activity | WalWriterMain

Patch: (64 client) : (TPS 67207)
53235 Client | ClientRead
29470 IPC | ProcArrayGroupUpdate
4302 LWLock | wal_insert
3717 Activity | LogicalLauncherMain
3715 Activity | AutoVacuumMain
3463 LWLock | ProcArrayLock
3140 Lock | extend
2934 Activity | BgWriterMain
1434 Activity | WalWriterMain
1198 Activity | CheckpointerMain
1073 LWLock | XidGenLock
869 IPC | ClogGroupUpdate

Head:(72 Client): (TPS 57856)

55820 Client | ClientRead
34318 IPC | ProcArrayGroupUpdate
15392 LWLock | CLogControlLock
3708 Activity | LogicalLauncherMain
3705 Activity | AutoVacuumMain
3436 LWLock | ProcArrayLock

Patch:(72 Client): (TPS 65740)

60356 Client | ClientRead
38545 IPC | ProcArrayGroupUpdate
4573 LWLock | wal_insert
3708 Activity | LogicalLauncherMain
3705 Activity | AutoVacuumMain
3508 LWLock | ProcArrayLock
3492 Lock | extend
2903 Activity | BgWriterMain
1903 LWLock | XidGenLock
1383 Activity | WalWriterMain
1212 Activity | CheckpointerMain
1056 IPC | ClogGroupUpdate

Head:(96 Client): (TPS 52170)

62841 LWLock | CLogControlLock
56150 IPC | ProcArrayGroupUpdate
54761 Client | ClientRead
7037 LWLock | wal_insert
4077 Lock | extend
3727 Activity | LogicalLauncherMain
3727 Activity | AutoVacuumMain
3027 LWLock | ProcArrayLock

Patch:(96 Client): (TPS 67932)

87378 IPC | ProcArrayGroupUpdate
80201 Client | ClientRead
11511 LWLock | wal_insert
4102 Lock | extend
3971 LWLock | ProcArrayLock
3731 Activity | LogicalLauncherMain
3731 Activity | AutoVacuumMain
2948 Activity | BgWriterMain
1763 LWLock | XidGenLock
1736 IPC | ClogGroupUpdate

Head:(128 Client): (TPS 40820)

182569 LWLock | CLogControlLock
61484 IPC | ProcArrayGroupUpdate
37969 Client | ClientRead
5135 LWLock | wal_insert
3699 Activity | LogicalLauncherMain
3699 Activity | AutoVacuumMain

Patch:(128 Client): (TPS 67054)

174583 IPC | ProcArrayGroupUpdate
66084 Client | ClientRead
16738 LWLock | wal_insert
4993 IPC | ClogGroupUpdate
4893 LWLock | ProcArrayLock
4839 Lock | extend

Benchmark: select for update with 3 savepoints, 10 mins run

Script:
\set aid random (1,30000000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s3;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
END;

Head:(64 Client): (TPS 44577.1802)

53808 Client | ClientRead
26147 IPC | ProcArrayGroupUpdate
7866 LWLock | CLogControlLock
3705 Activity | LogicalLauncherMain
3699 Activity | AutoVacuumMain
3353 LWLock | ProcArrayLock
3099 LWLock | wal_insert

Patch:(64 Client): (TPS 46156.245)

53235 Client | ClientRead
29470 IPC | ProcArrayGroupUpdate
4302 LWLock | wal_insert
3717 Activity | LogicalLauncherMain
3715 Activity | AutoVacuumMain
3463 LWLock | ProcArrayLock
3140 Lock | extend
2934 Activity | BgWriterMain
1434 Activity | WalWriterMain
1198 Activity | CheckpointerMain
1073 LWLock | XidGenLock
869 IPC | ClogGroupUpdate

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-09-01 15:47:25
Message-ID: CA+TgmoZAoC+Ms_SAHwA=4AxaDnJiTVqmyOtmZ8ZQEZGFR8-zfQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 1, 2017 at 10:03 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> Sure will do so. In the meantime, I have rebased the patch.
>
> I have repeated some of the tests we have performed earlier.

OK, these tests seem to show that this is still working. Committed,
again. Let's hope this attempt goes better than the last one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2017-09-02 02:11:59
Message-ID: CAA4eK1Lv8n851=92u4G=FF03_8AxoO=-uTYDXX=GtQN0mciXbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 1, 2017 at 9:17 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 1, 2017 at 10:03 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>>> Sure will do so. In the meantime, I have rebased the patch.
>>
>> I have repeated some of the tests we have performed earlier.
>

Thanks for repeating the performance tests.

> OK, these tests seem to show that this is still working. Committed,
> again. Let's hope this attempt goes better than the last one.
>

Thanks for committing.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com