Re: MMAP Buffers

From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Joshua Berkus <josh(at)agliodbs(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MMAP Buffers
Date: 2011-04-16 11:50:10
Message-ID: 201104161350.10576.rsmogura@softperience.eu
Lists: pgsql-hackers

On Saturday 16 April 2011 13:00:19, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Sat, Apr 16, 2011 at 7:24 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > The OP says that this patch maintains the WAL-before-data rule without
> > any explanation of how it accomplishes that seemingly quite amazing
> > feat. I assume I'm going to have to read this patch at some point to
> > refute this assertion, and I think that sucks. I am pretty nearly 100%
> > confident that this approach is utterly doomed, and I don't want to
> > spend a lot of time on it unless someone can provide me with a
> > compelling explanation of why my confidence is misplaced.
>
> Fwiw he did explain how he did that. Or at least I think he did --
> it's possible I read what I expected because what he came up with is
> something I've recently been thinking about.
>
> What he did, I gather, is treat the mmapped buffers as a read-only
> copy of the data. To actually make any modifications he copies it into
> shared buffers and treats them like normal. When the buffers get
> flushed from memory they get written and then the pointers get
> repointed back at the mmapped copy. Effectively this means the shared
> buffers get extended to include all of the filesystem cache instead of
> having to evict buffers from shared buffers just because you want to
> read another one that's already in filesystem cache.
>
> It doesn't save the copying between filesystem cache and shared
> buffers for buffers that are actually being written to. But it does
> save some amount of other copies on read-only traffic and it can even
> save some i/o. It does require a function call before each buffer
> modification where the pattern is currently <lock buffer>, <mutate
> buffer>, <mark buffer dirty>. From what he describes he needs to add a
> <prepare buffer for mutation> between the lock and mutate.
>
> I think it's an interesting experiment and it's good to know how to
> solve some of the subproblems. Notably, how do you extend files or
> drop them atomically across processes? And how do you deal with
> getting the mappings to be the same across all the processes or deal
> with them being different? But I don't think it's a great long-term
> direction. It just seems clunky to have to copy things from mmapped
> buffers to local buffers and back. Perhaps the performance testing
> will show that clunkiness is well worth it but we'll need to see that
> for a wide variety of workloads to judge that.
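The lock/prepare/mutate/mark-dirty pattern described above could be sketched roughly as follows. All names and the struct layout here are illustrative stand-ins, not the patch's or PostgreSQL's actual API; the point is only the extra "prepare" step that copies the read-only mmapped page into a writable shared-buffers slot and repoints the data pointer:

```c
#include <assert.h>
#include <string.h>

#define BLCKSZ 8192

/* Hypothetical buffer state: backed either by the read-only mmapped
 * copy or by a writable slot in shared buffers. Illustrative only. */
typedef struct Buffer {
    char mmap_copy[BLCKSZ];   /* stands in for the mmapped, read-only page */
    char shmem_copy[BLCKSZ];  /* stands in for the shared-buffers slot */
    char *data;               /* pointer callers dereference */
    int in_shmem;             /* has the page been copied for writing? */
    int dirty;
} Buffer;

/* The extra step: before the first mutation, copy the page out of the
 * mmapped file cache into shared buffers and repoint the pointer. */
void buffer_prepare_for_mutation(Buffer *buf)
{
    if (!buf->in_shmem) {
        memcpy(buf->shmem_copy, buf->mmap_copy, BLCKSZ);
        buf->data = buf->shmem_copy;
        buf->in_shmem = 1;
    }
}

void buffer_mark_dirty(Buffer *buf) { buf->dirty = 1; }

/* After the dirty page is written back, the pointer is repointed at
 * the (now up-to-date) mmapped copy again. */
void buffer_flush(Buffer *buf)
{
    if (buf->dirty) {
        memcpy(buf->mmap_copy, buf->shmem_copy, BLCKSZ); /* models write-back */
        buf->dirty = 0;
        buf->data = buf->mmap_copy;
        buf->in_shmem = 0;
    }
}
```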

In short, I swap (exchange, a clash of terms) VM pages to preserve pointers, and
only when needed. I tried pointing directly at the new memory area, but I saw
that some parts of the code really depend on the memory behind the original
pointers. For example, vacuum uses hint bits set by a previous scan, and it
depends on whether the bit is set or not, so for vacuum it is not only a hint.
Given just this one case I can't assume there are no more such places, so the
VM page swap handles it for me.

Stand-alone tests show me that this process (with the copy from mmap) takes
2x-3x longer than before. But unless someone updates a whole table, there is a
benefit from the pre-update scan, from index scans, and from the greater
availability of memory (you don't consume filesystem cache just to keep a copy
of that cache in shared memory). Everything may be slower when the database
fits in shared memory, and similarly a 2nd-level buffer may increase
performance slightly.

I reserve memory for the whole segment even if the file is smaller. Extending
is done by writing one byte at the end of the block (here the Unified Buffer
Cache, if I remember the name correctly, may come into play). For current
processors and the current implementation, database size is limited to about
260TB (no dynamic segment reservation is performed).
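The reserve-then-extend idea might look like this sketch: map the segment's full eventual size up front (so growth never moves the mapping) and grow the file by writing a single byte at the new end, which materializes the intervening bytes as a zero-filled hole. Names and sizes are illustrative, not taken from the patch:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEGMENT_SIZE (1024 * 1024)  /* small stand-in for a full segment */

/* Reserve address space for the segment's maximum size even though the
 * file may currently be shorter; pages past EOF would fault if touched,
 * but the mapping itself never has to move as the file grows. */
void *reserve_segment(int fd)
{
    return mmap(NULL, SEGMENT_SIZE, PROT_READ, MAP_SHARED, fd, 0);
}

/* Extend the file by one block by writing a single byte at the new
 * end, the trick the mail describes. */
int extend_by_block(int fd, off_t cur_len, size_t blcksz)
{
    char zero = 0;
    if (pwrite(fd, &zero, 1, cur_len + (off_t) blcksz - 1) != 1)
        return -1;
    return 0;
}
```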

Truncation is not implemented.

Each buffer descriptor has a tagVersion for a simple check of whether the
buffer tag has changed. Descriptors are (partially) mirrored in local memory,
and the versions are compared. Currently each mismatch forces a re-read (it is
routed to smgr/md), but introducing a shared segment id, and assuming each
segment has a constant maximum number of blocks, will make it faster (this
will be something like the current buffer tag); even the version field will
then be unneeded.
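The tagVersion check could be sketched as below: the shared descriptor bumps a counter on every retag, and a backend compares its locally mirrored version before trusting its cached mapping. Struct and function names here are illustrative, modeled on the description above rather than the patch itself:

```c
#include <assert.h>

/* Hypothetical shared buffer descriptor with a version counter that is
 * bumped whenever the buffer tag (its relation/block identity) changes. */
typedef struct BufferDesc {
    int rel_id;
    int block_num;
    unsigned tag_version;   /* bumped on every tag change */
} BufferDesc;

/* Per-backend mirror of what this backend last saw. */
typedef struct LocalBufferDesc {
    unsigned seen_version;
} LocalBufferDesc;

/* Retag the shared descriptor, e.g. when the buffer is reused. */
void buffer_retag(BufferDesc *d, int rel_id, int block_num)
{
    d->rel_id = rel_id;
    d->block_num = block_num;
    d->tag_version++;
}

/* Cheap validity check: if the version moved, the backend's cached
 * mapping is stale and the block must be re-read via smgr/md. */
int local_mapping_is_valid(const BufferDesc *d, const LocalBufferDesc *l)
{
    return d->tag_version == l->seen_version;
}
```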

I saw problems with vacuum, as it reopens the relation and I got mappings of
the same file twice (a minor problem). The important part will be deletion,
when pointers must be invalidated in a "good" way.

Regards,
Radek.
