Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-13 22:44:53
Message-ID: 20140113224453.GE9762@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2014-01-13 14:19:56 -0800, James Bottomley wrote:
> > Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
>
> Is this because of problems in the mmap_sem?

It's been a while since I looked at it, but yes, mmap_sem was part of
it. I also seem to recall the amount of IPIs increasing far too much for
it to be practical, but I am not sure anymore.

> > Also, that will mark that page dirty, which isn't what we want in this
> > case.
>
> You mean madvise (page_addr)? It shouldn't ... the state of the dirty
> bit should only be updated by actual writes. Which MADV_ primitive is
> causing the dirty marking, because we might be able to fix it (unless
> there's some weird corner case I don't know about).

Not the madvise() itself, but transplanting the buffer from postgres'
buffers to the mmap() area of the underlying file would, right?

> We also do have a way of transplanting pages: it's called splice. How
> do the semantics of splice differ from what you need?

Hm. I don't really see how splice would allow us to seed the kernel's
pagecache with content *without* marking the page as dirty in the
kernel.
We don't need zero-copy IO here, the important thing is just to fill the
pagecache with content without a) rereading the page from disk b)
marking the page as dirty.

> > One major usecase is transplanting a page comming from postgres'
> > buffers into the kernel's buffercache because the latter has a much
> > better chance of properly allocating system resources across independent
> > applications running.
>
> If you want to share pages between the application and the page cache,
> the only known interface is mmap ... perhaps we can discuss how better
> to improve mmap for you?

I think purely using mmap() is pretty unlikely to work out - there's
just too many constraints about when a page is allowed to be written out
(e.g. it's interlocked with postgres' write ahead log). I also think
that for many practical purposes using mmap() would result in an absurd
number of mappings or mapping way too huge areas; e.g. large btree
indexes are usually accessed in a quite fragmented manner.

> > Oh, and the kernel's page-cache management while far from perfect,
> > actually scales much better than postgres'.
>
> Well, then, it sounds like the best way forward would be to get
> postgress to use the kernel page cache more efficiently.

No arguments there, although working on postgres scalability is a good
idea as well ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jan Kara 2014-01-13 22:47:40 Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Previous Message Jan Kara 2014-01-13 22:38:44 Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance