Re: CLOG contention, part 2

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: CLOG contention, part 2
Date: 2012-01-08 14:25:40
Message-ID: CA+U5nM+wH-PUbH9p7p5LX3RD0XOhEZ6bonCx7REASkkv_154tA@mail.gmail.com
Lists: pgsql-hackers

Recent results from Robert show clog contention is still an issue.

In various discussions Tom noted that pages prior to RecentXmin are
read-only, and that we might find a way to make use of that fact by providing
different mechanisms or resources.

I've taken that idea and used it to build a second Clog cache, known
as ClogHistory which allows access to the read-only tail of pages in
the clog. Once a page has been written to for the last time, it will
be accessed via the ClogHistory Slru in preference to the normal Clog
Slru. This separates historical accesses by readers from current write
access by committers. Historical access doesn't force dirty writes,
nor are commits made to wait when historical access occurs.

The patch is very simple because all the writes still continue through
the normal route, so it is suitable for 9.2.

I'm no longer working on the "clog partitioning" patch for this release.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
clog_history.v1.patch text/x-patch 6.6 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-12 11:14:51
Message-ID: CA+U5nMK4GwZEy8yv93v=U8dUmFcRMzvQWd_sf3s5LcgF_8nLfg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 8, 2012 at 2:25 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> I've taken that idea and used it to build a second Clog cache, known
> as ClogHistory which allows access to the read-only tail of pages in
> the clog. Once a page has been written to for the last time, it will
> be accessed via the ClogHistory Slru in preference to the normal Clog
> Slru. This separates historical accesses by readers from current write
> access by committers. Historical access doesn't force dirty writes,
> nor are commits made to wait when historical access occurs.

Why do we need this in 9.2?

We now have clog_buffers = 32 and we have write rates ~16,000 tps. At
those write rates we fill a clog buffer every 2 seconds, so the clog
cache completely churns every 64 seconds.
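
Spelling out that arithmetic (assuming the usual 8 kB CLOG pages with two
status bits per xid):

    32,768 xids per page    = 8,192 bytes * 4 xids per byte
    1,048,576 xids in cache = 32 buffers * 32,768 xids per page
    ~65 seconds to churn    = 1,048,576 xids / 16,000 tps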

If we wish to achieve those rates in the real world, any access to
data that was written by a transaction more than a minute ago will
cause clog cache page faults, leading to stalls in new transactions.

To avoid those problems we need
* background writing of the clog LRU (already posted as a separate patch)
* a way of separating access to historical data from the main commit
path (this patch)

And to evaluate such situations, we need a way to simulate data that
contains many transactions. 32 buffers can hold just over 1 million
transaction ids, so benchmarks against databases containing > 10
million separate transactions are recommended (remembering that this
is just 10 mins of data on high TPS systems). A pgbench patch is
provided separately to aid in the evaluation.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-20 13:37:48
Message-ID: CA+Tgmoa-+UThwitSYkn6tnpBCoHksaHPQqG=Z3oA9fg9HoXQvQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I've taken that idea and used it to build a second Clog cache, known
> as ClogHistory which allows access to the read-only tail of pages in
> the clog. Once a page has been written to for the last time, it will
> be accessed via the ClogHistory Slru in preference to the normal Clog
> Slru. This separates historical accesses by readers from current write
> access by committers. Historical access doesn't force dirty writes,
> nor are commits made to wait when historical access occurs.

This seems to need a rebase.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-20 14:44:21
Message-ID: CA+U5nMKESrL_XRuEaXvT_84iwVFmE9hfG7jXG8bUebkpAeYttg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I've taken that idea and used it to build a second Clog cache, known
>> as ClogHistory which allows access to the read-only tail of pages in
>> the clog. Once a page has been written to for the last time, it will
>> be accessed via the ClogHistory Slru in preference to the normal Clog
>> Slru. This separates historical accesses by readers from current write
>> access by committers. Historical access doesn't force dirty writes,
>> nor are commits made to wait when historical access occurs.
>
> This seems to need a rebase.

OT: It would save lots of time if we had 2 things for the CF app:

1. Emails that go to the appropriate people when status changes, e.g. when
someone sets "Waiting for Author" the author gets an email so they
know the reviewer is expecting something. Not knowing that wastes lots
of days, so if we want to do this in fewer days that seems like a great
place to start.

2. Something that automatically tests patches. If you submit a patch
we run up a blank VM and run patch applies on all patches. As soon as
we get a fail, an email goes to patch author. That way authors know as
soon as a recent commit invalidates something.

Those things have wasted time for me in the past, so they're
opportunities to improve the process, not must haves.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-20 15:08:23
Message-ID: CA+TgmobRvkSybGCOrsN8Q_aa9bW8WB2pN3AroS9E_7Xp4H9oBg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 9:44 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> I've taken that idea and used it to build a second Clog cache, known
>>> as ClogHistory which allows access to the read-only tail of pages in
>>> the clog. Once a page has been written to for the last time, it will
>>> be accessed via the ClogHistory Slru in preference to the normal Clog
>>> Slru. This separates historical accesses by readers from current write
>>> access by committers. Historical access doesn't force dirty writes,
>>> nor are commits made to wait when historical access occurs.
>>
>> This seems to need a rebase.
>
> OT: It would save lots of time if we had 2 things for the CF app:
>
> 1. Emails that go to appropriate people when status changes. e.g. when
> someone sets "Waiting for Author" the author gets an email so they
> know the reviewer is expecting something. No knowing that wastes lots
> of days, so if we want to do this in less days that seems like a great
> place to start.
>
> 2. Something that automatically tests patches. If you submit a patch
> we run up a blank VM and run patch applies on all patches. As soon as
> we get a fail, an email goes to patch author. That way authors know as
> soon as a recent commit invalidates something.
>
> Those things have wasted time for me in the past, so they're
> opportunities to improve the process, not must haves.

Yeah, I agree that that would be nice. I just haven't had time to
implement much of anything for the CF application in a long time. My
management has been very interested in the performance and scalability
stuff, so that's been my main focus for 9.2. I'm going to see if I
can carve out some time for this once the dust settles.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-20 15:16:42
Message-ID: CA+U5nMLqww8OYhq-OcgeD_c08zin1=aehMOGCVOnzem2P=tH6Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I've taken that idea and used it to build a second Clog cache, known
>> as ClogHistory which allows access to the read-only tail of pages in
>> the clog. Once a page has been written to for the last time, it will
>> be accessed via the ClogHistory Slru in preference to the normal Clog
>> Slru. This separates historical accesses by readers from current write
>> access by committers. Historical access doesn't force dirty writes,
>> nor are commits made to wait when historical access occurs.
>
> This seems to need a rebase.

Still applies and compiles cleanly for me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-20 15:32:43
Message-ID: CA+TgmobEfHZE1LjcSaZcs0q_trfRLysg4Svk5JDuhtwQFGsDPA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 10:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> I've taken that idea and used it to build a second Clog cache, known
>>> as ClogHistory which allows access to the read-only tail of pages in
>>> the clog. Once a page has been written to for the last time, it will
>>> be accessed via the ClogHistory Slru in preference to the normal Clog
>>> Slru. This separates historical accesses by readers from current write
>>> access by committers. Historical access doesn't force dirty writes,
>>> nor are commits made to wait when historical access occurs.
>>
>> This seems to need a rebase.
>
> Still applies and compiles cleanly for me.

D'oh. You're right. Looks like I accidentally tried to apply this to
the 9.1 sources. Sigh...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-20 15:38:24
Message-ID: CA+U5nMJ-LHUbdRrxeZ8WvKESeU+sh5ePMxZVwKP+Ocv-CVzejg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 3:32 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jan 20, 2012 at 10:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>> I've taken that idea and used it to build a second Clog cache, known
>>>> as ClogHistory which allows access to the read-only tail of pages in
>>>> the clog. Once a page has been written to for the last time, it will
>>>> be accessed via the ClogHistory Slru in preference to the normal Clog
>>>> Slru. This separates historical accesses by readers from current write
>>>> access by committers. Historical access doesn't force dirty writes,
>>>> nor are commits made to wait when historical access occurs.
>>>
>>> This seems to need a rebase.
>>
>> Still applies and compiles cleanly for me.
>
> D'oh.  You're right.  Looks like I accidentally tried to apply this to
> the 9.1 sources.  Sigh...

No worries. It's Friday.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-20 15:44:28
Message-ID: CA+Tgmoa_cdCuW+YP-wC=izdsF2BDZS-AZJqRsP1wq_i0ma+5Xw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 10:38 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, Jan 20, 2012 at 3:32 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Fri, Jan 20, 2012 at 10:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>> On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>>> I've taken that idea and used it to build a second Clog cache, known
>>>>> as ClogHistory which allows access to the read-only tail of pages in
>>>>> the clog. Once a page has been written to for the last time, it will
>>>>> be accessed via the ClogHistory Slru in preference to the normal Clog
>>>>> Slru. This separates historical accesses by readers from current write
>>>>> access by committers. Historical access doesn't force dirty writes,
>>>>> nor are commits made to wait when historical access occurs.
>>>>
>>>> This seems to need a rebase.
>>>
>>> Still applies and compiles cleanly for me.
>>
>> D'oh.  You're right.  Looks like I accidentally tried to apply this to
>> the 9.1 sources.  Sigh...
>
> No worries. It's Friday.

http://www.youtube.com/watch?v=kfVsfOSbJY0

Of course, I even ran git log to check that I had the latest
sources... but what I had, of course, was the latest 9.1 sources,
which still have recently-timestamped commits, and I didn't look
carefully enough. Sigh.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-21 13:57:45
Message-ID: CA+TgmoYdwNQJQ31wiWiZPodVCO-9CqyiOz3W_r8b549JsPfiXQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 10:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> D'oh.  You're right.  Looks like I accidentally tried to apply this to
>>> the 9.1 sources.  Sigh...
>>
>> No worries. It's Friday.

Server passed 'make check' with this patch, but when I tried to fire
it up for some test runs, it fell over with:

FATAL: no more LWLockIds available

I assume that it must be dependent on the config settings used. Here are mine:

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-21 15:31:21
Message-ID: CA+U5nMLm7F9M1q4zkRCs3Zxhsdqd-nBKd9mpXL2pYrzHawpT2w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 21, 2012 at 1:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jan 20, 2012 at 10:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>> D'oh.  You're right.  Looks like I accidentally tried to apply this to
>>>> the 9.1 sources.  Sigh...
>>>
>>> No worries. It's Friday.
>
> Server passed 'make check' with this patch, but when I tried to fire
> it up for some test runs, it fell over with:
>
> FATAL:  no more LWLockIds available
>
> I assume that it must be dependent on the config settings used.  Here are mine:
>
> shared_buffers = 8GB
> maintenance_work_mem = 1GB
> synchronous_commit = off
> checkpoint_segments = 300
> checkpoint_timeout = 15min
> checkpoint_completion_target = 0.9
> wal_writer_delay = 20ms

Yes, it was. Sorry about that. New version attached, retesting while
you read this.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
clog_history.v2.patch text/x-patch 7.1 KB

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-22 22:30:16
Message-ID: CAMkU=1w5UgzaAg_tNfA08grJYnMQ2XxUDWYbLkd8wvWWbZkZGA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 6:44 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> OT: It would save lots of time if we had 2 things for the CF app:
>
..
> 2. Something that automatically tests patches. If you submit a patch
> we run up a blank VM and run patch applies on all patches. As soon as
> we get a fail, an email goes to patch author. That way authors know as
> soon as a recent commit invalidates something.

Well, first the CF app would need to reliably be able to find the
actual patch. That is currently not a given.

Also, it seems that OID collisions are a dime a dozen, and I'm
starting to doubt that they are even worth reporting in the absence of
a more substantive review. And in the patches I've looked at, it
seems like the OID is not even cross-referenced anywhere else in the
patch, the cross-references are all based on symbolic names. I freely
admit I have no idea what I am talking about, but it seems like the
only purpose of OIDs is to create bit rot.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-27 22:05:41
Message-ID: CAMkU=1xmSBJxidW-m5kBAcWTBdvR87=rwLj7Ep6Vsnf-1+Q9bg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> Yes, it was. Sorry about that. New version attached, retesting while
> you read this.

In my hands I could never get this patch to do anything. The new
cache was never used.

I think that that was because RecentXminPageno never budged from -1.

I think that that, in turn, is because the comparison below can never
return true, because the comparison is casting both sides to uint, and
-1 cast to uint is very large

    /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
    if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
        ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

Also, I think the general approach is wrong. The only reason to have
these pages in shared memory is that we can control access to them to
prevent write/write and read/write corruption. Since these pages are
never written, they don't need to be in shared memory. Just read
each page into backend-local memory as it is needed, either
palloc/pfree each time or using a single reserved block for the
lifetime of the session. Let the kernel worry about caching them so
that the above mentioned reads are cheap.
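
Just to make that concrete, here is a minimal sketch of the sort of
backend-local lookup I mean (made-up names and constants, ignoring clog
segment-file boundaries and all error handling):

    #include <sys/types.h>
    #include <unistd.h>

    #define CLOG_PAGE_SIZE 8192
    #define XIDS_PER_BYTE  4                                /* 2 status bits per xid */
    #define XIDS_PER_PAGE  (CLOG_PAGE_SIZE * XIDS_PER_BYTE) /* 32768 */

    static int
    local_xid_status(int fd, unsigned int xid)
    {
        unsigned char page[CLOG_PAGE_SIZE];
        off_t         offset = (off_t) (xid / XIDS_PER_PAGE) * CLOG_PAGE_SIZE;

        /* read the whole 8kB page into backend-local memory */
        if (pread(fd, page, CLOG_PAGE_SIZE, offset) != CLOG_PAGE_SIZE)
            return -1;              /* caller falls back to the normal SLRU path */

        /* pick out the two status bits for this xid */
        return (page[(xid % XIDS_PER_PAGE) / XIDS_PER_BYTE]
                >> ((xid % XIDS_PER_BYTE) * 2)) & 0x03;
    }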

Cheers,

Jeff


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-27 23:16:57
Message-ID: CAHyXU0wukkdwBkUSFcFUeF_H+cpa_nKJz0d3=FZ9eXzuy2r=XQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> Also, I think the general approach is wrong.  The only reason to have
> these pages in shared memory is that we can control access to them to
> prevent write/write and read/write corruption.  Since these pages are
> never written, they don't need to be in shared memory.   Just read
> each page into backend-local memory as it is needed, either
> palloc/pfree each time or using a single reserved block for the
> lifetime of the session.  Let the kernel worry about caching them so
> that the above mentioned reads are cheap.

right -- exactly. but why stop at one page?

merlin


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-28 01:21:59
Message-ID: CAMkU=1x7eeuT1os8r1ccnua6+7T0fqaFmE_=s5DZ313u7tFfjw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 27, 2012 at 3:16 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> Also, I think the general approach is wrong.  The only reason to have
>> these pages in shared memory is that we can control access to them to
>> prevent write/write and read/write corruption.  Since these pages are
>> never written, they don't need to be in shared memory.   Just read
>> each page into backend-local memory as it is needed, either
>> palloc/pfree each time or using a single reserved block for the
>> lifetime of the session.  Let the kernel worry about caching them so
>> that the above mentioned reads are cheap.
>
> right -- exactly.  but why stop at one page?

If you have more than one, you need code to decide which one to evict
(just free) every time you need a new one. And every process needs to
be running this code, while the kernel is still going to need to make its
own decisions for the entire system. It seems simpler to just let the
kernel do the job for everyone. Are you worried that a read syscall
is going to be slow even when the data is presumably cached in the OS?

Cheers,

Jeff


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-28 13:52:52
Message-ID: CA+U5nMLv4ddV3JoK4hK7zdUb50WMiLJxZEkHmGskctnY-FMXfw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>
>> Yes, it was. Sorry about that. New version attached, retesting while
>> you read this.
>
> In my hands I could never get this patch to do anything.  The new
> cache was never used.
>
> I think that that was because RecentXminPageno never budged from -1.
>
> I think that that, in turn, is because the comparison below can never
> return true, because the comparison is casting both sides to uint, and
> -1 cast to uint is very large
>
>        /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
>        if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
>                 ClogCtl->shared->RecentXminPageno =
> TransactionIdToPage(RecentXmin);

Thanks, will look again.

> Also, I think the general approach is wrong.  The only reason to have
> these pages in shared memory is that we can control access to them to
> prevent write/write and read/write corruption.  Since these pages are
> never written, they don't need to be in shared memory.   Just read
> each page into backend-local memory as it is needed, either
> palloc/pfree each time or using a single reserved block for the
> lifetime of the session.  Let the kernel worry about caching them so
> that the above mentioned reads are cheap.

Will think on that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-29 18:59:20
Message-ID: CA+U5nM+Vefd3pz9_WugdTLpia97dwH==DCSW+iQQqZEk6S6rZA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 28, 2012 at 1:52 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>> Also, I think the general approach is wrong.  The only reason to have
>> these pages in shared memory is that we can control access to them to
>> prevent write/write and read/write corruption.  Since these pages are
>> never written, they don't need to be in shared memory.   Just read
>> each page into backend-local memory as it is needed, either
>> palloc/pfree each time or using a single reserved block for the
>> lifetime of the session.  Let the kernel worry about caching them so
>> that the above mentioned reads are cheap.
>
> Will think on that.

For me, there are arguments both ways as to whether it should be in
shared or local memory.

The one factor that makes the answer "shared" for me is that it's much
easier to reuse the existing SLRU code. We don't need to invent a new way
of caching/access etc. We just rewire what we already have. So
overall, the local/shared debate is much less important than the
robustness/code-reuse angle. That's what makes this patch fairly
simple.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-29 20:18:05
Message-ID: CA+U5nMLNw3fKg5SctSbzQHr+nm-2Zzp16XWmCkdy3OybjbNg3Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>
>> Yes, it was. Sorry about that. New version attached, retesting while
>> you read this.
>
> In my hands I could never get this patch to do anything.  The new
> cache was never used.
>
> I think that that was because RecentXminPageno never budged from -1.
>
> I think that that, in turn, is because the comparison below can never
> return true, because the comparison is casting both sides to uint, and
> -1 cast to uint is very large
>
>        /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
>        if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
>                 ClogCtl->shared->RecentXminPageno =
> TransactionIdToPage(RecentXmin);

Thanks for looking at the patch.

The patch works fine. RecentXminPageno does move forwards as it is
supposed to and there are no uints anywhere in that calculation.

The pageno only moves forwards every 32,000 transactions, so I'm
guessing that your testing didn't go on for long enough to show it
working correctly.

As regards effectiveness, you need to execute more than 1 million
transactions before the main clog cache fills, which might sound like a
lot, but it's approximately 1 minute of heavy transactions at the
highest rate Robert has published.

I've specifically designed the pgbench changes required to simulate
conditions of clog contention to help in the evaluation of this patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-29 21:41:30
Message-ID: CAMkU=1whQ4yOsijWMPU2YzboZ+8jQV2NkX5pjsLhzSOQWGWazQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 29, 2012 at 12:18 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>
>>> Yes, it was. Sorry about that. New version attached, retesting while
>>> you read this.
>>
>> In my hands I could never get this patch to do anything.  The new
>> cache was never used.
>>
>> I think that that was because RecentXminPageno never budged from -1.
>>
>> I think that that, in turn, is because the comparison below can never
>> return true, because the comparison is casting both sides to uint, and
>> -1 cast to uint is very large
>>
>>        /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
>>        if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
>>                 ClogCtl->shared->RecentXminPageno =
>> TransactionIdToPage(RecentXmin);
>
> Thanks for looking at the patch.
>
> The patch works fine. RecentXminPageno does move forwards as it is
> supposed to and there are no uints anywhere in that calculation.

Maybe it is system dependent. Or, are you running this patch on top
of some other uncommitted patch (other than the pgbench one)?

RecentXmin is a TransactionId, which is a uint32.
I think the TransactionIdToPage macro preserves that.

If I cast to an int, then I see advancement:

if (ClogCtl->shared->RecentXminPageno < (int) TransactionIdToPage(RecentXmin))

...
> I've specifically designed the pgbench changes required to simulate
> conditions of clog contention to help in the evaluation of this patch.

Yep, I've used that one for the testing.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-29 22:11:35
Message-ID: CAMkU=1zExy_Y55pQETp4YGeN8=eQwDow3Ar9Nfr4nPt+d17cpQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 29, 2012 at 1:41 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Sun, Jan 29, 2012 at 12:18 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>> On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>>
>>>> Yes, it was. Sorry about that. New version attached, retesting while
>>>> you read this.
>>>
>>> In my hands I could never get this patch to do anything.  The new
>>> cache was never used.
>>>
>>> I think that that was because RecentXminPageno never budged from -1.
>>>
>>> I think that that, in turn, is because the comparison below can never
>>> return true, because the comparison is casting both sides to uint, and
>>> -1 cast to uint is very large
>>>
>>>        /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
>>>        if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
>>>                 ClogCtl->shared->RecentXminPageno =
>>> TransactionIdToPage(RecentXmin);
>>
>> Thanks for looking at the patch.
>>
>> The patch works fine. RecentXminPageno does move forwards as it is
>> supposed to and there are no uints anywhere in that calculation.
>
> Maybe it is system dependent.  Or, are you running this patch on top
> of some other uncommitted patch (other than the pgbench one)?
>
> RecentXmin is a TransactionID, which is a uint32.
> I think the TransactionIdToPage macro preserves that.
>
> If I cast to a int, then I see advancement:
>
> if (ClogCtl->shared->RecentXminPageno < (int) TransactionIdToPage(RecentXmin))

And to clarify, if I don't do the cast, I don't see advancement, using
this code:

    elog(LOG, "JJJ RecentXminPageno %d, %d",
         ClogCtl->shared->RecentXminPageno, TransactionIdToPage(RecentXmin));
    if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
        ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

Then using your pgbench -I -s 100 -c 8 -j8, I get tons of log entries like:

LOG: JJJ RecentXminPageno -1, 149
STATEMENT: INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES
(nextval('pgbench_accounts_load_seq'), 1 + (lastval()/(100000)), 0);

Cheers,

Jeff


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-29 23:04:47
Message-ID: CA+U5nMJOq0nv0xpKZ3ad5Wd+Tp7-M_tSBdo-g54BpxnZYGoKGg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 29, 2012 at 9:41 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> If I cast to a int, then I see advancement:

I'll initialise it as 0, rather than -1 and then we don't have a
problem in any circumstance.

>> I've specifically designed the pgbench changes required to simulate
>> conditions of clog contention to help in the evaluation of this patch.
>
> Yep, I've used that one for the testing.

Most of the current patch is just bookkeeping to keep track of the
point when we can look at the history in a read-only manner.

I've isolated the code better to allow you to explore various
implementation options. I don't see any performance difference between
any of them really, but you're welcome to look.

Please everybody note that the clog history doesn't even become active
until the first checkpoint, so this is dead code until we've hit the
first checkpoint cycle and completed a million transactions since
startup. So it's designed to tune for real-world situations, and is not
easy to benchmark. (Maybe we could start earlier, but having extra
code just for the first few minutes seems a waste of energy, especially
since we must hit a million xids as well).

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
clog_history.v3.patch text/x-diff 8.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-01-30 20:24:25
Message-ID: CA+Tgmob+xWFeuY4=kYL_sck1F2NfHcOO5cyJn2zaK_vyaqnGHw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 27, 2012 at 8:21 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Fri, Jan 27, 2012 at 3:16 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>> Also, I think the general approach is wrong.  The only reason to have
>>> these pages in shared memory is that we can control access to them to
>>> prevent write/write and read/write corruption.  Since these pages are
>>> never written, they don't need to be in shared memory.   Just read
>>> each page into backend-local memory as it is needed, either
>>> palloc/pfree each time or using a single reserved block for the
>>> lifetime of the session.  Let the kernel worry about caching them so
>>> that the above mentioned reads are cheap.
>>
>> right -- exactly.  but why stop at one page?
>
> If you have more than one, you need code to decide which one to evict
> (just free) every time you need a new one.  And every process needs to
> be running this code, while the kernel is still going to need make its
> own decisions for the entire system.  It seems simpler to just let the
> kernel do the job for everyone.  Are you worried that a read syscall
> is going to be slow even when the data is presumably cached in the OS?

I think that would be a very legitimate worry. You're talking about
copying 8kB of data because you need two bits. Even if the
user/kernel mode context switch is lightning-fast, that's a lot of
extra data copying.

In a previous commit, 33aaa139e6302e81b4fbf2570be20188bb974c4f, we
increased the number of CLOG buffers from 8 to 32 (except in very
low-memory configurations). The main reason that shows a win on Nate
Boley's 32-core test machine appears to be that it avoids the
scenario where there are, say, 12 people simultaneously wanting to
read 12 different CLOG pages, so that 4 of them have to wait for a
buffer to become available before they can even think about starting a
read. The really bad latency spikes were happening not because the
I/O took a long time, but because it couldn't be started immediately.
However, those spikes are now gone as a result of the above commit.
Probably you can get them back with enough cores, but you'll probably
hit a lot of other, more serious problems first.

I assume that if there's any purpose to further optimization here,
it's either because the overall miss rate of the cache is too large,
or because the remaining locking costs are too high. Unfortunately I
haven't yet had time to look at this patch and understand what it
does, or machine cycles available to benchmark it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-01 02:39:58
Message-ID: CAMkU=1xgSo6xf4yW+VU7ML_bP424h4-+414iu2v2ZubO6484SA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 30, 2012 at 12:24 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jan 27, 2012 at 8:21 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On Fri, Jan 27, 2012 at 3:16 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>>> Also, I think the general approach is wrong.  The only reason to have
>>>> these pages in shared memory is that we can control access to them to
>>>> prevent write/write and read/write corruption.  Since these pages are
>>>> never written, they don't need to be in shared memory.   Just read
>>>> each page into backend-local memory as it is needed, either
>>>> palloc/pfree each time or using a single reserved block for the
>>>> lifetime of the session.  Let the kernel worry about caching them so
>>>> that the above mentioned reads are cheap.
>>>
>>> right -- exactly.  but why stop at one page?
>>
>> If you have more than one, you need code to decide which one to evict
>> (just free) every time you need a new one.  And every process needs to
>> be running this code, while the kernel is still going to need make its
>> own decisions for the entire system.  It seems simpler to just let the
>> kernel do the job for everyone.  Are you worried that a read syscall
>> is going to be slow even when the data is presumably cached in the OS?
>
> I think that would be a very legitimate worry.  You're talking about
> copying 8kB of data because you need two bits.  Even if the
> user/kernel mode context switch is lightning-fast, that's a lot of
> extra data copying.

I guess the most radical step in the direction I am advocating would
be to simply read the one single byte with the data you want. Very
little copying, but then the odds of the next thing you want being on
the one <chunk of data> you already had in memory are much smaller.

>
> In a previous commit, 33aaa139e6302e81b4fbf2570be20188bb974c4f, we
> increased the number of CLOG buffers from 8 to 32 (except in very
> low-memory configurations).  The main reason that shows a win on Nate
> Boley's 32-core test machine appears to be because it avoids the
> scenario where there are, say, 12 people simultaneously wanting to
> read 12 different CLOG buffers, and so 4 of them have to wait for a
> buffer to become available before they can even think about starting a
> read.  The really bad latency spikes were happening not because the
> I/O took a long time, but because it can't be started immediately.

Ah, I hadn't followed that closely. I had thought the main problem
solved by that patch was that sometimes all of the CLOG buffers would
be dirty, and so no one could read anything in until something else
was written out, which could involve either blocking writes on a
system with checkpoint-sync related constipation, or (if
synchronous_commit=off) fsyncs. By reading the old-enough ones into
local memory, you avoid both any locking and any writes. Simon's
patch solves the writes, but there is still locking.

I don't have enough hardware to test any of these theories, so all I
can do is wave hands around. Maybe if I drop the number of buffers
from 32 back to 8 or even 4, that would create a model system that
could usefully test out the theories on hardware I have, but I'd doubt
how transferable the results would be. With Simon's patch if I drop
it to 8, it would really be 16 as there are now 2 sets of them, so I
suppose it should be compared to head with 16 buffers to put them on
an equal footing.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-08 23:26:52
Message-ID: CA+Tgmoa38nvEzVPbJJ4RFWQMyAWtzj_Oj6xBcAk8aQ4BguOJew@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 29, 2012 at 6:04 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sun, Jan 29, 2012 at 9:41 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> If I cast to a int, then I see advancement:
>
> I'll initialise it as 0, rather than -1 and then we don't have a
> problem in any circumstance.
>
>
>>> I've specifically designed the pgbench changes required to simulate
>>> conditions of clog contention to help in the evaluation of this patch.
>>
>> Yep, I've used that one for the testing.
>
> Most of the current patch is just bookkeeping to keep track of the
> point when we can look at history in read only manner.
>
> I've isolated the code better to allow you to explore various
> implementation options. I don't see any performance difference between
> any of them really, but you're welcome to look.
>
> Please everybody note that the clog history doesn't even become active
> until the first checkpoint, so this is dead code until we've hit the
> first checkpoint cycle and completed a million transactions since
> startup. So its designed to tune for real world situations, and is not
> easy to benchmark. (Maybe we could start earlier, but having extra
> code just for first few minutes seems waste of energy, especially
> since we must hit million xids also).

I find that this version does not compile:

clog.c: In function ‘TransactionIdGetStatus’:
clog.c:431: error: ‘clog’ undeclared (first use in this function)
clog.c:431: error: (Each undeclared identifier is reported only once
clog.c:431: error: for each function it appears in.)

Given that, I obviously cannot test this at this point, but let me go
ahead and theorize about how well it's likely to work. What Tom
suggested before (and after some reflection I think I believe it) is
that the frequency of access will be highest for the newest CLOG page
and then drop off for each page further back you go. Clearly, if that
drop-off is fast - e.g. each buffer further backward is half as likely
to be accessed as the next newer one - then the fraction of accesses
that will hit pages that are far enough back to benefit from this
optimization will be infinitesimal; 1023 out of every 1024 accesses
will hit the first ten pages, and on a high-velocity system those all
figure to have been populated since the last checkpoint. The best
case for this patch should be an access pattern that involves a very
long tail; actually, pgbench is a pretty good fit for that, assuming
the scale factor is large enough. For example, at scale factor 100,
we've got 10,000,000 tuples: choosing one at random, we're almost
exactly 90% likely to find one that hasn't been chosen in the last
1,048,576 tuples (i.e. 32 CLOG pages @ 32K txns/page). In terms of
reducing contention on the main CLOG SLRU, that sounds pretty
promising, but depends somewhat on the rate at which transactions are
processed relative to the frequency of checkpoints, since that will
affect how many pages back you have to go to use the history path.
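
Spelling out the arithmetic behind that 90% figure:

    P(not chosen in the last 1,048,576 transactions)
        = (1 - 1/10,000,000)^1,048,576
        ~= exp(-1,048,576 / 10,000,000)
        ~= exp(-0.105) ~= 0.90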

However, there is a potential fly in the ointment: in other cases in
which we've reduced contention at the LWLock layer, we've ended up
with very nasty contention at the spinlock layer that can sometimes
eat more CPU time than the LWLock contention did. In that light, it
strikes me that it would be nice to be able to partition the
contention N ways rather than just 2 ways. I think we could do that
as follows. Instead of having one control lock per SLRU, have N
locks, where N is probably a power of 2. Divide the buffer pool for
the SLRU N ways, and decree that each slice of the buffer pool is
controlled by one of the N locks. Route all requests for a page P to
slice P mod N. Unlike this approach, that wouldn't completely
eliminate contention at the LWLock level, but it would reduce it
proportional to the number of partitions, and it would reduce spinlock
contention according to the number of partitions as well. A down side
is that you'll need more buffers to get the same hit rate, but this
proposal has the same problem: it doubles the amount of memory
allocated for CLOG. Of course, this approach is all vaporware right
now, so it's anybody's guess whether it would be better than this if
we had code for it. I'm just throwing it out there.
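
To sketch the routing rule I have in mind (hypothetical names; the stub
type stands in for whatever lock structure we would actually use):

    /* Hypothetical N-way partitioned SLRU control locks (not working code) */
    #define NUM_SLRU_PARTITIONS 4              /* N, a power of 2 */

    typedef struct LWLockStub LWLockStub;      /* stand-in for the real lock type */
    static LWLockStub *SlruPartitionLocks[NUM_SLRU_PARTITIONS];

    /* route all requests for page P to slice P mod N */
    static inline LWLockStub *
    SlruPartitionLock(int pageno)
    {
        return SlruPartitionLocks[pageno & (NUM_SLRU_PARTITIONS - 1)];
    }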

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, simon(at)2ndquadrant(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-10 19:01:50
Message-ID: CA+CSw_ubRBKdC2Ue-0RzOwKpFObs4nW=BLv6GJS=eFdvKH6N+w@mail.gmail.com
Lists: pgsql-hackers

On Feb 9, 2012 1:27 AM, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:
> However, there is a potential fly in the ointment: in other cases in
> which we've reduced contention at the LWLock layer, we've ended up
> with very nasty contention at the spinlock layer that can sometimes
> eat more CPU time than the LWLock contention did. In that light, it
> strikes me that it would be nice to be able to partition the
> contention N ways rather than just 2 ways. I think we could do that
> as follows. Instead of having one control lock per SLRU, have N
> locks, where N is probably a power of 2. Divide the buffer pool for
> the SLRU N ways, and decree that each slice of the buffer pool is
> controlled by one of the N locks. Route all requests for a page P to
> slice P mod N. Unlike this approach, that wouldn't completely
> eliminate contention at the LWLock level, but it would reduce it
> proportional to the number of partitions, and it would reduce spinlock
> contention according to the number of partitions as well. A down side
> is that you'll need more buffers to get the same hit rate, but this
> proposal has the same problem: it doubles the amount of memory
> allocated for CLOG.

Splitting the SLRU into different parts is exactly the same approach as
associativity used in CPU caches. I found some numbers that analyze cache
hit rate with different associativities:

http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/

Now obviously CPU cache access patterns are different from CLOG patterns,
but I think that the numbers strongly suggest that the reduction in hit rate
might be less than what you fear. For example, the harmonic mean of data
cache misses over all benchmarks for 16, 32 and 64 cache lines:
| Size | Direct    | 2-way LRU | 4-way LRU | 8-way LRU | Full LRU  |
|------+-----------+-----------+-----------+-----------+-----------|
| 1KB  | 0.0863842 | 0.0697167 | 0.0634309 | 0.0563450 | 0.0533706 |
| 2KB  | 0.0571524 | 0.0423833 | 0.0360463 | 0.0330364 | 0.0305213 |
| 4KB  | 0.0370053 | 0.0260286 | 0.0222981 | 0.0202763 | 0.0190243 |

As you can see, the reduction in hit rate is rather small down to 4 way
associative caches.

There may be a performance problem when multiple CLOG pages that happen to
sit in a single way become hot at the same time. The most likely case that
I can come up with is multiple scans going over unhinted pages created at
different time periods. If that is something to worry about, then a tool
that's used for CPUs is to employ a fully associative victim cache behind
the main cache. If a CLOG page is evicted, it is transferred into the
victim cache, evicting a page from there. When a page isn't found in the
main cache, the victim cache is first checked for a possible hit. The
movement between the two caches doesn't need to involve any memory copying
- just swap pointers in metadata.
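
A tiny illustration of that pointer swap (made-up structures, not PostgreSQL
code):

    typedef struct
    {
        int   pageno;
        char *data;            /* points at an 8kB page image */
    } CacheSlot;

    /* promote a victim-cache hit into the main cache without copying the page */
    static void
    promote_victim(CacheSlot *main_slot, CacheSlot *victim_slot)
    {
        CacheSlot tmp = *main_slot;

        *main_slot   = *victim_slot;   /* the hit page moves into the main cache */
        *victim_slot = tmp;            /* the evicted main-cache page becomes the victim */
    }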

The victim cache will bring back concurrency issues when the hit rate of
the main cache is small - like the pgbench example you mentioned. In that
case, a simple associative cache will allow multiple reads of clog pages
simultaneously. On the other hand - in that case lock contention seems to
be the symptom, rather than the disease. I think that those cases would be
better handled by increasing the maximum CLOG SLRU size. The increase in
memory usage should be a drop in the bucket for systems that have enough
transaction processing velocity for that to be a problem.

--
Ants Aasma


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-10 19:14:11
Message-ID: CA+U5nMLNj8ptKb4HeNyuTyde_fR8pzNJXy3CqzK718=p=J7nyA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 10, 2012 at 7:01 PM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
>
> On Feb 9, 2012 1:27 AM, "Robert Haas" <robertmhaas(at)gmail(dot)com>
>
>> However, there is a potential fly in the ointment: in other cases in
>> which we've reduced contention at the LWLock layer, we've ended up
>> with very nasty contention at the spinlock layer that can sometimes
>> eat more CPU time than the LWLock contention did.   In that light, it
>> strikes me that it would be nice to be able to partition the
>> contention N ways rather than just 2 ways.  I think we could do that
>> as follows.  Instead of having one control lock per SLRU, have N
>> locks, where N is probably a power of 2.  Divide the buffer pool for
>> the SLRU N ways, and decree that each slice of the buffer pool is
>> controlled by one of the N locks.  Route all requests for a page P to
>> slice P mod N.  Unlike this approach, that wouldn't completely
>> eliminate contention at the LWLock level, but it would reduce it
>> proportional to the number of partitions, and it would reduce spinlock
>> contention according to the number of partitions as well.  A down side
>> is that you'll need more buffers to get the same hit rate, but this
>> proposal has the same problem: it doubles the amount of memory
>> allocated for CLOG.
>
> Splitting the SLRU into different parts is exactly the same approach as
> associativity used in CPU caches. I found some numbers that analyze cache
> hit rate with different associativities:

My suggested approach is essentially identical to the one we
already use for partitioning the buffer cache and lock manager. I
expect it to be equally effective at reducing contention.

There is little danger of everyone hitting the same partition at once, since
there are many xids and they are served out sequentially. In the lock
manager case we use the relid as the key, so there is some skewing.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-25 19:16:34
Message-ID: CA+U5nMLdT3ypF9orYAApBS-fW_mDA5zANniq8eDxVQZwQ-AOFA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 8, 2012 at 11:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Given that, I obviously cannot test this at this point,

Patch with minor corrections attached here for further review.

> but let me go
> ahead and theorize about how well it's likely to work.  What Tom
> suggested before (and after some reflection I think I believe it) is
> that the frequency of access will be highest for the newest CLOG page
> and then drop off for each page further back you go.  Clearly, if that
> drop-off is fast - e.g. each buffer further backward is half as likely
> to be accessed as the next newer one - then the fraction of accesses
> that will hit pages that are far enough back to benefit from this
> optimization will be infinitesmal; 1023 out of every 1024 accesses
> will hit the first ten pages, and on a high-velocity system those all
> figure to have been populated since the last checkpoint.

That's just making up numbers, so it's not much help. The "theory"
would apply to one workload but not another, so it may well be true for
some workloads, but I doubt whether all databases work that way. I do
accept the "long tail" distribution as being very common; we just
don't know how long that tail is "typically", or even whether there is a
single dominant use case.

> The best
> case for this patch should be an access pattern that involves a very
> long tail;

Agreed

> actually, pgbench is a pretty good fit for that

Completely disagree, as described in detail in the other patch about
creating a realistic test environment for this patch.

pgbench is *not* a real world test.

pgbench loads all the data in one go, then pretends the data got there
one transaction at a time. So pgbench with no mods is actually about the
most theoretically unrealistic workload imaginable. You have to run pgbench
for 1 million transactions before you even theoretically show any gain from
this patch, and it would need to be a long test indeed before the
averaged effect of the patch was large enough to outweigh the zero
contribution from the first million transactions.

The only realistic way to test this patch is to pre-create the database
at a scale factor of >100 with the modified pgbench, and then run a
test. That correctly simulates the real-world situation where all the
data arrived in individual transactions.

> assuming
> the scale factor is large enough.  For example, at scale factor 100,
> we've got 10,000,000 tuples: choosing one at random, we're almost
> exactly 90% likely to find one that hasn't been chosen in the last
> 1,024,576 tuples (i.e. 32 CLOG pages @ 32K txns/page).  In terms of
> reducing contention on the main CLOG SLRU, that sounds pretty
> promising, but depends somewhat on the rate at which transactions are
> processed relative to the frequency of checkpoints, since that will
> affect how many pages back you have go to use the history path.

> However, there is a potential fly in the ointment: in other cases in
> which we've reduced contention at the LWLock layer, we've ended up
> with very nasty contention at the spinlock layer that can sometimes
> eat more CPU time than the LWLock contention did.   In that light, it
> strikes me that it would be nice to be able to partition the
> contention N ways rather than just 2 ways.  I think we could do that
> as follows.  Instead of having one control lock per SLRU, have N
> locks, where N is probably a power of 2.  Divide the buffer pool for
> the SLRU N ways, and decree that each slice of the buffer pool is
> controlled by one of the N locks.  Route all requests for a page P to
> slice P mod N.  Unlike this approach, that wouldn't completely
> eliminate contention at the LWLock level, but it would reduce it
> proportional to the number of partitions, and it would reduce spinlock
> contention according to the number of partitions as well.  A down side
> is that you'll need more buffers to get the same hit rate, but this
> proposal has the same problem: it doubles the amount of memory
> allocated for CLOG.  Of course, this approach is all vaporware right
> now, so it's anybody's guess whether it would be better than this if
> we had code for it.  I'm just throwing it out there.

We've already discussed that, and my patch for that has already been
ruled out by us for this CF.

A much better take is to list what options for scaling we have:
* separate out the history
* partition access to the most active parts

For me, any loss of performance comes from two areas:
(1) concurrent access to pages
(2) clog LRU is dirty and delays reading in new pages

For the most active parts, (1) is significant. Partitioning at the page
level will be ineffective at reducing contention because almost all of
the contention is on the most recent 1-2 pages. If we do partitioning,
it should be done by *striping* the most recent pages across many
locks, as I already suggested; see the sketch below. Reducing the page
size would reduce page contention but increase the number of new-page
events, and so make (2) more important. Increasing the page size would
amplify (1).
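
To make "striping" concrete, here is a throwaway standalone sketch (not
the patch, and the constants are illustrative assumptions): with
page-level partitioning every xid on the hot page maps to the same
lock, whereas striping picks the lock from the xid's slot within the
page, so concurrent committers on the same page mostly take different
locks.

    #include <stdio.h>

    #define XACTS_PER_PAGE  32768   /* assumed CLOG xids per page */
    #define NUM_LOCKS       16      /* assumed number of partition/stripe locks */

    /* Page-level partitioning: the whole hot page hangs off one lock. */
    static int lock_by_page(unsigned int xid)
    {
        return (xid / XACTS_PER_PAGE) % NUM_LOCKS;
    }

    /* Striping: the slot within the page picks the lock, spreading the
     * hottest page across many locks. */
    static int lock_by_stripe(unsigned int xid)
    {
        return (xid % XACTS_PER_PAGE) % NUM_LOCKS;
    }

    int main(void)
    {
        /* Four consecutive committers: same page, different stripe locks. */
        for (unsigned int xid = 1000000; xid < 1000004; xid++)
            printf("xid %u: page lock %d, stripe lock %d\n",
                   xid, lock_by_page(xid), lock_by_stripe(xid));
        return 0;
    }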

(2) is less significant but much more easily removed, which is why it
is the one proposed for this release.
Access to the history need not conflict at all, so doing this is free.

I agree with you that we should further analyse CLOG contention in
following releases but that is not an argument against making this
change now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
clog_history.v4.patch text/x-diff 8.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-26 22:53:53
Message-ID: CA+TgmoacOFRSOqb5u=g_Zcs-_OwUY0Vrj4nCiTfkdStT8Ai2zw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Feb 25, 2012 at 2:16 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Feb 8, 2012 at 11:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Given that, I obviously cannot test this at this point,
>
> Patch with minor corrections attached here for further review.

All right, I will set up some benchmarks with this version, and also
review the code.

As a preliminary comment, Tom recently felt that it was useful to
reduce the minimum number of CLOG buffers from 8 to 4, to benefit very
small installations. So I'm guessing he'll object to an
across-the-board doubling of the amount of memory being used, since
that would effectively undo that change. It also makes it a bit hard
to compare apples to apples, since of course we expect that by using
more memory we can reduce the amount of CLOG contention. I think it's
really only meaningful to compare contention between implementations
that use approximately the same total amount of memory. It's true
that doubling the maximum number of buffers from 32 to 64 straight up
does degrade performance, but I believe that's because the buffer
lookup algorithm is just straight linear search, not because we can't
in general benefit from more buffers.
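
To illustrate the point about lookup cost (this is only a standalone
sketch of the idea, not the actual slru.c code), a lookup that scans
the slot array linearly slows down in direct proportion to the number
of buffers, which is one plausible reason a straight doubling hurts:

    #include <stdio.h>

    #define NUM_SLRU_BUFFERS 32   /* doubling this doubles the scan length */

    static int buffer_page[NUM_SLRU_BUFFERS];   /* page cached in each slot */

    /* Linear lookup: every miss (and every hit, about halfway through on
     * average) walks the array while holding the control lock. */
    static int slru_lookup(int pageno)
    {
        for (int slot = 0; slot < NUM_SLRU_BUFFERS; slot++)
            if (buffer_page[slot] == pageno)
                return slot;
        return -1;   /* not cached: caller would evict and read it in */
    }

    int main(void)
    {
        for (int slot = 0; slot < NUM_SLRU_BUFFERS; slot++)
            buffer_page[slot] = 100 + slot;

        printf("page 110 -> slot %d\n", slru_lookup(110));
        printf("page 999 -> slot %d\n", slru_lookup(999));
        return 0;
    }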

> pgbench loads all the data in one go, then pretends the data got there
> one transaction at a time. So unmodified pgbench is just about the most
> unrealistic test imaginable here. You have to run pgbench for a million
> transactions before you even theoretically show any gain from this
> patch, and it would need to be a long test indeed before the averaged
> effect of the patch was large enough to outweigh the zero contribution
> from the first million transactions.

Depends on the scale factor. At scale factor 100, the first million
transactions figure to have replaced a sizeable percentage of the rows
already. But I can use your other patch to set up the run. Maybe
scale factor 300 would be good?
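
As a back-of-the-envelope check (assuming, purely for illustration,
that each transaction updates one uniformly chosen accounts row), the
~90% figure quoted earlier and the "sizeable percentage" here are
consistent:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double rows   = 10000000.0;        /* accounts rows at scale factor 100 */
        double cached = 32.0 * 32768.0;    /* xids covered by 32 CLOG pages */
        double run    = 1000000.0;         /* the first million transactions */

        /* Chance a given row was untouched by the last 'cached' xids, and the
         * fraction of rows the first million transactions have touched. */
        printf("P(row older than the CLOG cache) ~= %.2f\n",
               pow(1.0 - 1.0 / rows, cached));
        printf("rows touched by first 1M txns   ~= %.1f%%\n",
               100.0 * (1.0 - pow(1.0 - 1.0 / rows, run)));
        return 0;
    }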

>> However, there is a potential fly in the ointment: in other cases in
>> which we've reduced contention at the LWLock layer, we've ended up
>> with very nasty contention at the spinlock layer that can sometimes
>> eat more CPU time than the LWLock contention did.   In that light, it
>> strikes me that it would be nice to be able to partition the
>> contention N ways rather than just 2 ways.  I think we could do that
>> as follows.  Instead of having one control lock per SLRU, have N
>> locks, where N is probably a power of 2.  Divide the buffer pool for
>> the SLRU N ways, and decree that each slice of the buffer pool is
>> controlled by one of the N locks.  Route all requests for a page P to
>> slice P mod N.  Unlike this approach, that wouldn't completely
>> eliminate contention at the LWLock level, but it would reduce it
>> proportional to the number of partitions, and it would reduce spinlock
>> contention according to the number of partitions as well.  A down side
>> is that you'll need more buffers to get the same hit rate, but this
>> proposal has the same problem: it doubles the amount of memory
>> allocated for CLOG.  Of course, this approach is all vaporware right
>> now, so it's anybody's guess whether it would be better than this if
>> we had code for it.  I'm just throwing it out there.
>
> We've already discussed that, and my patch for that has already been
> ruled out by us for this CF.

I'm not aware that anybody's coded up the approach I'm talking about.
You've proposed splitting this up a couple of ways, but AFAICT they
all boil down to splitting up CLOG into multiple SLRUs, whereas what
I'm talking about is to have just a single SLRU, but with multiple
control locks. I feel that approach is a bit more flexible, because
it could be applied to any SLRU, not just CLOG. But I haven't coded
it, let alone tested it, so I might be all wet.
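
Since this is explicitly untested, here is no more than a toy sketch of
the routing being described (all names are invented): one SLRU whose
buffer pool is divided into N slices, each guarded by its own control
lock, with requests for page P going to slice P mod N.

    #include <stdio.h>

    #define NUM_SLICES          4    /* N, probably a power of 2 */
    #define BUFFERS_PER_SLICE   8    /* the buffer pool divided N ways */

    typedef struct
    {
        int lock_id;                          /* stand-in for one LWLock */
        int page_number[BUFFERS_PER_SLICE];   /* pages cached in this slice */
    } SlruSlice;

    static SlruSlice slices[NUM_SLICES];

    /* Route a request for page P to slice P mod N; only that slice's lock
     * is taken, so requests landing on different slices never contend. */
    static SlruSlice *slice_for_page(int pageno)
    {
        return &slices[pageno % NUM_SLICES];
    }

    int main(void)
    {
        for (int p = 0; p < 8; p++)
            printf("page %d -> slice %d\n", p, (int) (slice_for_page(p) - slices));
        return 0;
    }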

> I agree with you that we should further analyse CLOG contention in
> following releases but that is not an argument against making this
> change now.

No, but the fact that this approach is completely untested, or at
least that no test results have been posted, is an argument against
it. Assuming this version compiles and works, I'll try to see what I
can do about bridging that gap.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-27 09:03:14
Message-ID: CA+U5nMJ0hNbQjZ=C+LfL3kp7eGdTYN7WcWh+=EkfY2GVQP1eUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 26, 2012 at 10:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Feb 25, 2012 at 2:16 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Wed, Feb 8, 2012 at 11:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> Given that, I obviously cannot test this at this point,
>>
>> Patch with minor corrections attached here for further review.
>
> All right, I will set up some benchmarks with this version, and also
> review the code.

Thanks.

> As a preliminary comment, Tom recently felt that it was useful to
> reduce the minimum number of CLOG buffers from 8 to 4, to benefit very
> small installations.  So I'm guessing he'll object to an
> across-the-board doubling of the amount of memory being used, since
> that would effectively undo that change.  It also makes it a bit hard
> to compare apples to apples, since of course we expect that by using
> more memory we can reduce the amount of CLOG contention.  I think it's
> really only meaningful to compare contention between implementations
> that use approximately the same total amount of memory.  It's true
> that doubling the maximum number of buffers from 32 to 64 straight up
> does degrade performance, but I believe that's because the buffer
> lookup algorithm is just straight linear search, not because we can't
> in general benefit from more buffers.

I'm happy if you want to benchmark this against simply increasing clog
buffers. We expect downsides to that, but it is worth testing
nonetheless.

>> pgbench loads all the data in one go, then pretends the data got there
>> one transaction at a time. So unmodified pgbench is just about the most
>> unrealistic test imaginable here. You have to run pgbench for a million
>> transactions before you even theoretically show any gain from this
>> patch, and it would need to be a long test indeed before the averaged
>> effect of the patch was large enough to outweigh the zero contribution
>> from the first million transactions.
>
> Depends on the scale factor.  At scale factor 100, the first million
> transactions figure to have replaced a sizeable percentage of the rows
> already.  But I can use your other patch to set up the run.  Maybe
> scale factor 300 would be good?

Clearly, if the test induces too much I/O, the results will be swamped
by it. The patch is aimed at people with bigger databases and lots of
RAM, which is many, many people, because RAM is cheap.

So please use a scale factor that the hardware can cope with.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-28 18:11:38
Message-ID: CA+TgmoYa7+aUF2HCohZPb86WShrG2SnqwOHkhaV75K_nUj0ZiQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 27, 2012 at 4:03 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> So please use a scale factor that the hardware can cope with.

OK. I tested this out on Nate Boley's 32-core AMD machine, using
scale factor 100 and scale factor 300. I initialized it with Simon's
patch, which should have the effect of rendering the entire table
unhinted and giving each row a different XID. I used my usual
configuration settings for that machine, which are: shared_buffers =
8GB, maintenance_work_mem = 1GB, synchronous_commit = off,
checkpoint_segments = 300, checkpoint_timeout = 15min,
checkpoint_completion_target = 0.9, wal_writer_delay = 20ms. I did
three runs on master, as of commit
9bf8603c7a9153cada7e32eb0cf7ac1feb1d3b56, and three runs with the
clog_history_v4 patch applied. The command to initialize the database
was:

~/install/clog-contention/bin/pgbench -i -I -s $scale

The command to run the test was:

~/install/clog-contention/bin/pgbench -l -T 1800 -c 32 -j 32 -n

Executive Summary: The patch makes things way slower at scale factor
300, and possibly slightly slower at scale factor 100.

Detailed Results:

resultslp.clog_history_v4.32.100.1800:tps = 14286.049637 (including connections establishing)
resultslp.clog_history_v4.32.100.1800:tps = 13532.814984 (including connections establishing)
resultslp.clog_history_v4.32.100.1800:tps = 13972.987301 (including connections establishing)
resultslp.clog_history_v4.32.300.1800:tps = 5061.650470 (including connections establishing)
resultslp.clog_history_v4.32.300.1800:tps = 4871.126457 (including connections establishing)
resultslp.clog_history_v4.32.300.1800:tps = 5861.124177 (including connections establishing)
resultslp.master.32.100.1800:tps = 13420.777222 (including connections establishing)
resultslp.master.32.100.1800:tps = 14912.336257 (including connections establishing)
resultslp.master.32.100.1800:tps = 14505.718977 (including connections establishing)
resultslp.master.32.300.1800:tps = 14766.984548 (including connections establishing)
resultslp.master.32.300.1800:tps = 14783.026190 (including connections establishing)
resultslp.master.32.300.1800:tps = 14567.504887 (including connections establishing)

I don't know whether this is just a bug or whether there's some more
fundamental problem with the approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CLOG contention, part 2
Date: 2012-02-28 19:21:41
Message-ID: CA+U5nM+A24Pt3iAMNqzubezd3r7--Pt2TK1RZDpgDjH2r0A6kw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 28, 2012 at 6:11 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Feb 27, 2012 at 4:03 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> So please use a scale factor that the hardware can cope with.
>
> OK.  I tested this out on Nate Boley's 32-core AMD machine, using
> scale factor 100 and scale factor 300. I initialized it with Simon's
> patch, which should have the effect of rendering the entire table
> unhinted and giving each row a different XID.

Thanks for making the test.

I think this tells me that the only real way to do this kind of testing
is not at arm's length from a test machine.

So time to get my hands on a machine, but not for this release.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services