Quick Links

Re: Page-at-a-time Locking Considerations

Lists:	pgsql-hackers

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Page-at-a-time Locking Considerations
Date:	2008-02-04 16:04:43
Message-ID:	1202141084.4252.480.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

In heapgetpage() we hold the buffer locked while we look for visible
tuples. That works well in most cases since the visibility check is fast
if we have status bits set. If we don't have visibility bits set we have
to do things like scan the snapshot and confirm things via clog lookups.
All of that takes time and can lead to long buffer lock times, possibly
across multiple I/Os in the very worst cases.

This doesn't just happen for old transactions. Accessing very recent
TransactionIds is prone to rare but long waits when we ExtendClog().

Such problems are numerically rare, but the buffers with long lock times
are also the ones that have concurrent or at least recent write
operations on them. So all SeqScans have the potential to induce long
wait times for write transactions, even if they are scans on 1 block
tables. Tables with heavy write activity on them from multiple backends
have their work spread across multiple blocks, so a SeqScan will hit
this issue repeatedly as it encounters each current insertion point in a
table and so greatly increases the chances of it occurring.

It seems possible to just memcpy() the whole block away and then drop
the lock quickly. That gives a consistent lock time in all cases and
allows us to do the visibility checks in our own time. It might seem
that we would end up copying irrelevant data, which is true. But the
greatest cost is memory access time. If hardware memory pre-fetch cuts
in we will find that the memory is retrieved en masse anyway; if it
doesn't we will have to wait for each cache line. So the best case is
actually an en masse retrieval of cache lines, in the common case where
blocks are fairly full (vague cutoff is determined by exact mechanism of
hardware/compiler induced memory prefetch).

The copied block would be used only for visibility checks. The main
buffer would retain its pin and we would pass references to the block
through the executor as normal. So this would be a change completely
isolated to heapgetpage().

Was the copy-aside method considered when we introduced page at a time
mode? Any reasons to think it would be dangerous or infeasible? If not,
I'll give it a bash and get some test results.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 18:27:42
Message-ID:	5973.1202149662@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> In heapgetpage() we hold the buffer locked while we look for visible
> tuples.

It's a share lock though. Do you have any direct proof that this
behavior is as nasty as you claim?

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 18:45:23
Message-ID:	1202150723.4252.529.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-02-04 at 13:27 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > In heapgetpage() we hold the buffer locked while we look for visible
> > tuples.
>
> It's a share lock though.

Which conflicts with write locks.

> Do you have any direct proof that this
> behavior is as nasty as you claim?

No, but I've been thinking about how to get some, for this and other
situations. This one is difficult to track down because it moves from
buffer to buffer reasonably quickly. Starting another thread on that.

We still have a higher than desirable variability in response times and
I'm looking at possible causes.

I'll try patching it, unless you can think of a reason why its a
complete non-starter? I'm not saying we'd want it yet, just that it
seems worth trying.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 20:03:05
Message-ID:	87odaw5w1y.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> We still have a higher than desirable variability in response times and
> I'm looking at possible causes.

I agree we have a problem with this. My feeling is that the problems have more
to do with higher level things like stats being toasted, or checkpoints or wal
file changes, or a myriad of other things. But clog lru thrashing while
holding other locks is a definite possibility too.

I wonder how hard it would be to shove the clog into regular shared memory
pages and let the clock sweep take care of adjusting the percentage of shared
mem allocated to the clog versus data pages.

> I'll try patching it, unless you can think of a reason why its a
> complete non-starter? I'm not saying we'd want it yet, just that it
> seems worth trying.

Sure, but a good experiment needs af theory to test. I think you have to find
a way to measure this first. Otherwise you're going to write a patch and then
have two trees and be searching around in the dark for a difference.

This strikes me as something dtrace might be able to help measure.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 20:44:35
Message-ID:	20080204204435.GJ16380@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:

> I wonder how hard it would be to shove the clog into regular shared memory
> pages and let the clock sweep take care of adjusting the percentage of shared
> mem allocated to the clog versus data pages.

Hmm, this is an interesting idea. I wonder what would happen if we let
other SLRU users go into shared buffers too -- for example it has been
reported several times that pg_subtrans thrashing can cause severe
problems in case of long running transactions. (I wonder whether
pg_subtrans would occupy a big portion of shared buffers if we let it go
unchecked).

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 20:54:30
Message-ID:	1202158470.4252.597.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-02-04 at 20:03 +0000, Gregory Stark wrote:

> I wonder how hard it would be to shove the clog into regular shared
> memory pages and let the clock sweep take care of adjusting the
> percentage of shared mem allocated to the clog versus data pages.

There is a reason that's not been done... try it and see.

Plus it doesn't fully resolve the main issue as described.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 21:05:18
Message-ID:	8203.1202159118@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Gregory Stark wrote:
>> I wonder how hard it would be to shove the clog into regular shared memory
>> pages and let the clock sweep take care of adjusting the percentage of shared
>> mem allocated to the clog versus data pages.

> Hmm, this is an interesting idea. I wonder what would happen if we let
> other SLRU users go into shared buffers too -- for example it has been
> reported several times that pg_subtrans thrashing can cause severe
> problems in case of long running transactions.

My recollection is that we didn't do that because the standard buffer
manager has some assumptions that are violated by clog/etc pages ---
notably the lack of LSNs on the pages. Not sure how hard that is to
fix. I also note that we'd not really be removing any contention,
rather just pushing it into the bufmgr. Maybe the bufmgr is now
scalable enough that it could take the extra load better than SLRU can,
but this is hardly a given.

It sounds worth experimenting with, but it's not a slam-dunk win.

regards, tom lane

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 21:08:29
Message-ID:	20080204210829.GK16380@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Mon, 2008-02-04 at 20:03 +0000, Gregory Stark wrote:
>
> > I wonder how hard it would be to shove the clog into regular shared
> > memory pages and let the clock sweep take care of adjusting the
> > percentage of shared mem allocated to the clog versus data pages.
>
> There is a reason that's not been done... try it and see.

What is it?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 21:10:13
Message-ID:	47A77F35.8030409@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alvaro Herrera wrote:
> Gregory Stark wrote:
>
>> I wonder how hard it would be to shove the clog into regular shared memory
>> pages and let the clock sweep take care of adjusting the percentage of shared
>> mem allocated to the clog versus data pages.
>
> Hmm, this is an interesting idea. I wonder what would happen if we let
> other SLRU users go into shared buffers too -- for example it has been
> reported several times that pg_subtrans thrashing can cause severe
> problems in case of long running transactions. (I wonder whether
> pg_subtrans would occupy a big portion of shared buffers if we let it go
> unchecked).

Presumably we would have a fair way of accounting cache hits, and
increase the usage_count accordingly. It should occupy just the right
amount, in proportion of how often it's used vs. other buffers.

That definitely seems worthwhile to me. Not only because of any possible
performance gains you might get, but perhaps even more importantly it
would eliminate an option (clog_buffers) that you may need to tune
manually otherwise.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-04 21:10:30
Message-ID:	20080204211030.GL16380@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:

> > Gregory Stark wrote:
> >> I wonder how hard it would be to shove the clog into regular shared memory
> >> pages and let the clock sweep take care of adjusting the percentage of shared
> >> mem allocated to the clog versus data pages.

> My recollection is that we didn't do that because the standard buffer
> manager has some assumptions that are violated by clog/etc pages ---
> notably the lack of LSNs on the pages. Not sure how hard that is to
> fix. I also note that we'd not really be removing any contention,
> rather just pushing it into the bufmgr. Maybe the bufmgr is now
> scalable enough that it could take the extra load better than SLRU can,
> but this is hardly a given.

Well, in the case of pg_subtrans, I don't think the problem is
contention -- rather, the fact that the number of buffers is fixed and
small.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-05 09:31:00
Message-ID:	1202203860.4252.642.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-02-04 at 18:08 -0300, Alvaro Herrera wrote:
> Simon Riggs wrote:
> > On Mon, 2008-02-04 at 20:03 +0000, Gregory Stark wrote:
> >
> > > I wonder how hard it would be to shove the clog into regular shared
> > > memory pages and let the clock sweep take care of adjusting the
> > > percentage of shared mem allocated to the clog versus data pages.
> >
> > There is a reason that's not been done... try it and see.
>
> What is it?

Time to locate a block differs in the two cases. clog requires a search
of data on 1 cache line, which isn't often changed. shared_buffers
requires a hash table search on a volatile data structure.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-06 16:57:25
Message-ID:	47A9E6F5.3040202@sun.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark napsal(a):
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>

I tried to use memory mapped files (mmap) for clog and I think it should be also
possible way. I got about 2% better performance, but it needs more testing.

Zdenek

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-06 17:28:51
Message-ID:	12333.1202318931@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> I tried to use memory mapped files (mmap) for clog and I think it should be also
> possible way. I got about 2% better performance, but it needs more testing.

If you only got 2% out of it, it's not even worth thinking about how to
fix the serious bugs that approach would create (primarily, lack of
control over when pages can get flushed to disk).

regards, tom lane

From:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-06 19:36:18
Message-ID:	47AA0C32.1030807@sun.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
>> I tried to use memory mapped files (mmap) for clog and I think it should be also
>> possible way. I got about 2% better performance, but it needs more testing.
>
> If you only got 2% out of it, it's not even worth thinking about how to
> fix the serious bugs that approach would create (primarily, lack of
> control over when pages can get flushed to disk).

You can flush a pages by msync() function which writes dirty pages on
disk. I don't see any other problem. Originally I tried to fix problem
with a lot of parallel issues reported by Jignesh. However, it needs
more testing if it really helps.

Zdenek

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-06 19:58:07
Message-ID:	16114.1202327887@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> Tom Lane wrote:
>> If you only got 2% out of it, it's not even worth thinking about how to
>> fix the serious bugs that approach would create (primarily, lack of
>> control over when pages can get flushed to disk).

> You can flush a pages by msync() function which writes dirty pages on
> disk. I don't see any other problem.

Then you need to learn more. The side of the problem that is hard to
fix is that sometimes we need to prevent pages from being flushed to
disk until some other data (typically WAL entries) has reached disk.
With mmap'd data we have no control over early writes.

regards, tom lane

From:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-06 21:07:00
Message-ID:	47AA2174.3060409@sun.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
>> Tom Lane wrote:
>>> If you only got 2% out of it, it's not even worth thinking about how to
>>> fix the serious bugs that approach would create (primarily, lack of
>>> control over when pages can get flushed to disk).
>
>> You can flush a pages by msync() function which writes dirty pages on
>> disk. I don't see any other problem.
>
> Then you need to learn more. The side of the problem that is hard to
> fix is that sometimes we need to prevent pages from being flushed to
> disk until some other data (typically WAL entries) has reached disk.
> With mmap'd data we have no control over early writes.

I see. Thanks for explanation.

Zdenek

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-07 00:01:15
Message-ID:	87y79xd48k.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:

> Tom Lane wrote:
>> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
>>> Tom Lane wrote:
>>>> If you only got 2% out of it, it's not even worth thinking about how to
>>>> fix the serious bugs that approach would create (primarily, lack of
>>>> control over when pages can get flushed to disk).
>>
>>> You can flush a pages by msync() function which writes dirty pages on disk.
>>> I don't see any other problem.
>>
>> Then you need to learn more. The side of the problem that is hard to
>> fix is that sometimes we need to prevent pages from being flushed to
>> disk until some other data (typically WAL entries) has reached disk.
>> With mmap'd data we have no control over early writes.
>
> I see. Thanks for explanation.

In theory mlock() ought to provide that facility. The kernel people know it's
used by crypto software to avoid having disk copies of sensitive keys, so
there's at least a fighting chance it actually works for this too. But I
wouldn't put too much money on it working this purpose on every platform that
has it.

It's entirely conceivably that some platforms have mlock avoid swapping out
pages but not avoid syncing them but leaving them in RAM. Or that some might
sync mlocked pages when the process which had the page locked dies, especially
if it crashes. Or that some versions of some OSes are simply buggy. It's not
like it's a case that would ever be tested or even noticed if it failed.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-07 04:36:14
Message-ID:	200802070436.m174aE808723@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Zdenek Kotala wrote:
> Tom Lane wrote:
> > Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> >> Tom Lane wrote:
> >>> If you only got 2% out of it, it's not even worth thinking about how to
> >>> fix the serious bugs that approach would create (primarily, lack of
> >>> control over when pages can get flushed to disk).
> >
> >> You can flush a pages by msync() function which writes dirty pages on
> >> disk. I don't see any other problem.
> >
> > Then you need to learn more. The side of the problem that is hard to
> > fix is that sometimes we need to prevent pages from being flushed to
> > disk until some other data (typically WAL entries) has reached disk.
> > With mmap'd data we have no control over early writes.
>
> I see. Thanks for explanation.

This is mentioned in the TODO list:

* Consider mmap()'ing files into a backend?

Doing I/O to large tables would consume a lot of address space or
require frequent mapping/unmapping. Extending the file also causes
mapping problems that might require mapping only individual pages,
leading to thousands of mappings. Another problem is that there is no
way to _prevent_ I/O to disk from the dirty shared buffers so changes
could hit disk before WAL is written.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-07 08:15:33
Message-ID:	87tzklchcq.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Bruce Momjian" <bruce(at)momjian(dot)us> writes:

>> >> You can flush a pages by msync() function which writes dirty pages on
>> >> disk. I don't see any other problem.
>> >
>> > Then you need to learn more. The side of the problem that is hard to
>> > fix is that sometimes we need to prevent pages from being flushed to
>> > disk until some other data (typically WAL entries) has reached disk.
>> > With mmap'd data we have no control over early writes.
>>
>> I see. Thanks for explanation.

Another possibility for the CLOG would be having two on-disk copies of it. One
temporary file which would serve purely as the filesystem swap space for the
in-memory pages and would be synced and/or flushed from memory based purely on
memory pressure. The second would be the persistent store which we would write
with copies of pages to when it was time to sync them. On boot we would throw
away the old filesystem back and copy the persistent store.

One downside of using mmap though would be that we would be sacrificing
address space. Regardless of how much of the clog is actually being used we
would be losing address space large enough to cover all the clog we might
need.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-07 08:53:32
Message-ID:	1202374412.29242.191.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-02-04 at 20:54 +0000, Simon Riggs wrote:
> On Mon, 2008-02-04 at 20:03 +0000, Gregory Stark wrote:
>
> > I wonder how hard it would be to shove the clog into regular shared
> > memory pages and let the clock sweep take care of adjusting the
> > percentage of shared mem allocated to the clog versus data pages.
>
> There is a reason that's not been done... try it and see.
>
> Plus it doesn't fully resolve the main issue as described.

On further thought, there may be a way to do as Greg suggests.

We keep clog pages in shared buffers, but maintain a vestigial slru
structure that provides fast lookup to the N most recently accessed
pages. So we don't keep a physical slru buffer space anymore, we just
keep pointers to shared buffers. Slru "I/O" then becomes a swapping of
entries on the slru fast lookup structure, but hopefully not I/O out of
shared_buffers.

When we move out of clog buffers we *may* need to write the page
immediately because of async LSNs, but that seems OK.

That solution sounds weird at first, but seems much less yuck than
mmap() style solutions.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-02-07 09:28:57
Message-ID:	47AACF59.70802@sun.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark napsal(a):
> "Bruce Momjian" <bruce(at)momjian(dot)us> writes:
>
>>>>> You can flush a pages by msync() function which writes dirty pages on
>>>>> disk. I don't see any other problem.
>>>> Then you need to learn more. The side of the problem that is hard to
>>>> fix is that sometimes we need to prevent pages from being flushed to
>>>> disk until some other data (typically WAL entries) has reached disk.
>>>> With mmap'd data we have no control over early writes.
>>> I see. Thanks for explanation.
>
> Another possibility for the CLOG would be having two on-disk copies of it. One
> temporary file which would serve purely as the filesystem swap space for the
> in-memory pages and would be synced and/or flushed from memory based purely on
> memory pressure. The second would be the persistent store which we would write
> with copies of pages to when it was time to sync them. On boot we would throw
> away the old filesystem back and copy the persistent store.

The idea to have two CLOG files copy is also good for better reliability. CLOG
is currently single point of failure. One bad block causes a big data lost.

Zdenek

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-03-23 00:37:06
Message-ID:	200803230037.m2N0b6c19764@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

With no concrete patch or performance numbers, this thread has been
removed from the patches queue.

---------------------------------------------------------------------------

Simon Riggs wrote:
>
> In heapgetpage() we hold the buffer locked while we look for visible
> tuples. That works well in most cases since the visibility check is fast
> if we have status bits set. If we don't have visibility bits set we have
> to do things like scan the snapshot and confirm things via clog lookups.
> All of that takes time and can lead to long buffer lock times, possibly
> across multiple I/Os in the very worst cases.
>
> This doesn't just happen for old transactions. Accessing very recent
> TransactionIds is prone to rare but long waits when we ExtendClog().
>
> Such problems are numerically rare, but the buffers with long lock times
> are also the ones that have concurrent or at least recent write
> operations on them. So all SeqScans have the potential to induce long
> wait times for write transactions, even if they are scans on 1 block
> tables. Tables with heavy write activity on them from multiple backends
> have their work spread across multiple blocks, so a SeqScan will hit
> this issue repeatedly as it encounters each current insertion point in a
> table and so greatly increases the chances of it occurring.
>
> It seems possible to just memcpy() the whole block away and then drop
> the lock quickly. That gives a consistent lock time in all cases and
> allows us to do the visibility checks in our own time. It might seem
> that we would end up copying irrelevant data, which is true. But the
> greatest cost is memory access time. If hardware memory pre-fetch cuts
> in we will find that the memory is retrieved en masse anyway; if it
> doesn't we will have to wait for each cache line. So the best case is
> actually an en masse retrieval of cache lines, in the common case where
> blocks are fairly full (vague cutoff is determined by exact mechanism of
> hardware/compiler induced memory prefetch).
>
> The copied block would be used only for visibility checks. The main
> buffer would retain its pin and we would pass references to the block
> through the executor as normal. So this would be a change completely
> isolated to heapgetpage().
>
> Was the copy-aside method considered when we introduced page at a time
> mode? Any reasons to think it would be dangerous or infeasible? If not,
> I'll give it a bash and get some test results.
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Page-at-a-time Locking Considerations
Date:	2008-03-23 11:04:32
Message-ID:	1206270272.4285.765.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2008-03-22 at 20:37 -0400, Bruce Momjian wrote:
> With no concrete patch or performance numbers, this thread has been
> removed from the patches queue.

I agree since there is no patch.

However, I think recent performance reports around the cost of
visibility checks such as "Very slow seq scan" by Craig Ringer on
Perform list on 10 Mar shows that this remains an area of concern. We
may have tuned some parts of the visibility checks, but not all.

So I think it should be a TODO to investigate further.

> Simon Riggs wrote:
> >
> > In heapgetpage() we hold the buffer locked while we look for visible
> > tuples. That works well in most cases since the visibility check is fast
> > if we have status bits set. If we don't have visibility bits set we have
> > to do things like scan the snapshot and confirm things via clog lookups.
> > All of that takes time and can lead to long buffer lock times, possibly
> > across multiple I/Os in the very worst cases.
> >
> > This doesn't just happen for old transactions. Accessing very recent
> > TransactionIds is prone to rare but long waits when we ExtendClog().
> >
> > Such problems are numerically rare, but the buffers with long lock times
> > are also the ones that have concurrent or at least recent write
> > operations on them. So all SeqScans have the potential to induce long
> > wait times for write transactions, even if they are scans on 1 block
> > tables. Tables with heavy write activity on them from multiple backends
> > have their work spread across multiple blocks, so a SeqScan will hit
> > this issue repeatedly as it encounters each current insertion point in a
> > table and so greatly increases the chances of it occurring.
> >
> > It seems possible to just memcpy() the whole block away and then drop
> > the lock quickly. That gives a consistent lock time in all cases and
> > allows us to do the visibility checks in our own time. It might seem
> > that we would end up copying irrelevant data, which is true. But the
> > greatest cost is memory access time. If hardware memory pre-fetch cuts
> > in we will find that the memory is retrieved en masse anyway; if it
> > doesn't we will have to wait for each cache line. So the best case is
> > actually an en masse retrieval of cache lines, in the common case where
> > blocks are fairly full (vague cutoff is determined by exact mechanism of
> > hardware/compiler induced memory prefetch).
> >
> > The copied block would be used only for visibility checks. The main
> > buffer would retain its pin and we would pass references to the block
> > through the executor as normal. So this would be a change completely
> > isolated to heapgetpage().
> >
> > Was the copy-aside method considered when we introduced page at a time
> > mode? Any reasons to think it would be dangerous or infeasible? If not,
> > I'll give it a bash and get some test results.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk